## Seminar note

During the live seminar we’ll implement the following for ordinary least squares. Completing the implementation for ridge regression may appear on the next problem set, and some simple mathematical analysis of it could be examinable, so this is good practice.

## Ridge regression

(You can delete this section if you get error messages about LaTeX)

Consider the loss function

$$L(\beta) = \| \mathbf y - \mathbf X \beta \|^2 + \lambda \| \beta \|^2$$

You might also want to consider writing it coordinate-wise:

$$L(\beta) = \sum_{i=1}^n (y_i - \mathbf x_i^T \beta)^2 + \lambda \sum_{j=1}^p \beta_j^2$$

Compute the gradient $\nabla L$ of this loss function, and use your answer to write an R function below that inputs $(X, y, \beta, \lambda)$ and outputs the value of the gradient. Also write a function that takes the same inputs and outputs the value of the ridge loss.
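For checking your answer later, here is one possible implementation, written for the ridge loss as defined above. The function names `ridge_gradient` and `ridge_loss` are our own choice, not prescribed by the exercise; the gradient used is $\nabla L(\beta) = -2 \mathbf X^T (\mathbf y - \mathbf X \beta) + 2 \lambda \beta$.

```r
# Gradient of the ridge loss L(beta) = ||y - X beta||^2 + lambda ||beta||^2
# (names are our own suggestion, not part of the exercise)
ridge_gradient <- function(x, y, beta, lambda) {
  -2 * t(x) %*% (y - x %*% beta) + 2 * lambda * beta
}

# Value of the ridge loss at beta
ridge_loss <- function(x, y, beta, lambda) {
  sum((y - x %*% beta)^2) + lambda * sum(beta^2)
}
```

With `lambda = 0` both functions reduce to their ordinary least squares counterparts.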

(Don’t delete below here)

# Gradient of the least squares loss
least_squares_gradient <- function(x, y, beta) {
  -2 * t(x) %*% (y - x %*% beta)
}

# Least squares loss function
least_squares_loss <- function(x, y, beta) {
  sum((y - x %*% beta)^2)
}
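A finite-difference check is a quick way to verify an analytic gradient like the one above. This helper is our own addition, not part of the exercise; it approximates the gradient of any scalar loss function of `beta` by central differences.

```r
# Central finite-difference approximation to the gradient of `loss`,
# a function taking a coefficient vector and returning a scalar.
numerical_gradient <- function(loss, beta, h = 1e-6) {
  sapply(seq_along(beta), function(j) {
    e <- rep(0, length(beta))
    e[j] <- h
    (loss(beta + e) - loss(beta - e)) / (2 * h)
  })
}
```

For example, `numerical_gradient(function(b) least_squares_loss(X, y, b), beta)` should agree (to several decimal places) with `least_squares_gradient(X, y, beta)`.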

## Simulate data

Write a function that inputs a coefficient vector beta and a design matrix, and uses these to generate a vector of outcomes from a linear model with normal noise.

set.seed(1)
n <- 100
p <- 10
X <- matrix(rnorm(n*p), nrow = n)
beta <- rpois(p, lambda = 3)
y <- X %*% beta + rnorm(n)
least_squares_gradient(X, y, beta)
##             [,1]
##  [1,]  -7.655822
##  [2,]  29.799193
##  [3,]   7.017735
##  [4,]  -2.586372
##  [5,] -10.374698
##  [6,]   1.528508
##  [7,] -17.181227
##  [8,] -41.579952
##  [9,]  12.879127
## [10,] -21.106348

Now generate a design matrix and coefficient vector, and use these with the previous function to generate an outcome vector.

# Choose n and p, consider making p large, maybe even larger than n
# Also think about the size of the coefficients, and sparsity
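One way to set this up, with `p` larger than `n` and a sparse coefficient vector (all of the specific sizes and values below are just illustrative choices):

```r
# Generate outcomes from a linear model with normal noise
simulate_outcome <- function(X, beta, sd = 1) {
  X %*% beta + rnorm(nrow(X), sd = sd)
}

set.seed(2)
n2 <- 50
p2 <- 100                               # p > n: high-dimensional setting
X2 <- matrix(rnorm(n2 * p2), nrow = n2)
beta2 <- c(rep(5, 5), rep(0, p2 - 5))   # sparse: only 5 nonzero coefficients
y2 <- simulate_outcome(X2, beta2)
```

When `p > n` the least squares solution is not unique, which is one motivation for the ridge penalty.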

• Implement gradient descent to optimize the ridge loss function
• Apply your implementation to the data you generated in the previous section
• Compare the estimated coefficients to the true coefficients
beta0 <- rep(0, p)
grad0 <- least_squares_gradient(X, y, beta0)
previous_loss <- least_squares_loss(X, y, beta0)
beta1 <- beta0 - 0.0001 * grad0
next_loss <- least_squares_loss(X, y, beta1)
previous_beta <- beta1
steps <- 1

while (abs(previous_loss - next_loss) > 0.0001) {
  if (steps %% 10 == 0) print(previous_loss)
  steps <- steps + 1
  next_beta <- previous_beta - 0.0001 * least_squares_gradient(X, y, previous_beta)
  previous_beta <- next_beta
  previous_loss <- next_loss
  next_loss <- least_squares_loss(X, y, next_beta)
}
## [1] 1536.472
## [1] 159.3278
## [1] 151.77
## [1] 145.2942
## [1] 139.7435
## [1] 135.0087
## [1] 130.9636
## [1] 125.5322
## [1] 110.2006
## [1] 108.4356
## [1] 106.9456
## [1] 105.6849
## [1] 104.6159
## [1] 103.7073
## [1] 102.9331
## [1] 102.2718
## [1] 101.7055
## [1] 101.0657
## [1] 99.00849
## [1] 98.3756
## [1] 98.16812
## [1] 97.99213
## [1] 97.8425
## [1] 97.71496
## [1] 97.60598
## [1] 97.51188
## [1] 97.00878
## [1] 96.96367
## [1] 96.92572
## [1] 96.89373
## [1] 96.8667
## [1] 96.84383
## [1] 96.82441
## [1] 96.8079
## [1] 96.79381
## [1] 96.78178
## [1] 96.77146
## [1] 96.71617
## [1] 96.71431
## [1] 96.70984
## [1] 96.70606
## [1] 96.70285
## [1] 96.70012
## [1] 96.69778
## [1] 96.69579
## [1] 96.69121
## [1] 96.68443
## [1] 96.68362
## [1] 96.68293
## [1] 96.68235
## [1] 96.68187
## [1] 96.68145
## [1] 96.6811
## [1] 96.6808
## [1] 96.68055
## [1] 96.68033
## [1] 96.68014
## [1] 96.67907
## [1] 96.67898
## [1] 96.67891
## [1] 96.67884
## [1] 96.67879
## [1] 96.67875
## [1] 96.67871
## [1] 96.67868
## [1] 96.67865
## [1] 96.67863
## [1] 96.67861
## [1] 96.67859
c(steps, previous_loss, next_loss)
## [1] 695.00000  96.67844  96.67854
beta - previous_beta
##               [,1]
##  [1,] -0.028475498
##  [2,]  0.177765248
##  [3,] -0.004290968
##  [4,] -0.017128255
##  [5,] -0.054355122
##  [6,] -0.050542705
##  [7,] -0.047081517
##  [8,] -0.169549647
##  [9,]  0.086593620
## [10,] -0.106931045
mean((beta - previous_beta)^2)
## [1] 0.008812863
coef(lm(y ~ X - 1)) - previous_beta
##                [,1]
##  [1,] -1.845182e-04
##  [2,]  6.105159e-04
##  [3,]  1.512364e-04
##  [4,]  1.368817e-04
##  [5,]  2.888292e-04
##  [6,] -5.384685e-05
##  [7,]  4.033052e-04
##  [8,]  3.873572e-04
##  [9,]  3.729351e-05
## [10,]  5.589654e-05
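Gradient descent output can also be checked against the closed-form ridge solution $\hat\beta = (\mathbf X^T \mathbf X + \lambda I)^{-1} \mathbf X^T \mathbf y$. The helper below is our own addition for checking purposes, not part of the exercise.

```r
# Closed-form ridge solution: solve the normal equations
# (X^T X + lambda I) beta = X^T y rather than inverting explicitly
ridge_closed_form <- function(x, y, lambda) {
  p <- ncol(x)
  solve(crossprod(x) + lambda * diag(p), crossprod(x, y))
}
```

With `lambda = 0` this reduces to ordinary least squares, so `ridge_closed_form(X, y, 0)` should agree with `coef(lm(y ~ X - 1))` and, up to the stopping tolerance, with the gradient descent estimate above.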
• Now change the value of lambda and repeat the previous steps
• Which value of lambda produced estimates closer to the truth?
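A sketch of how the comparison might be organized (the helper name, step size, and iteration count are our own illustrative choices, with a fixed number of steps rather than the stopping rule used above):

```r
# Gradient descent on the ridge loss with a fixed number of steps
ridge_gd <- function(x, y, lambda, step = 1e-4, iters = 2000) {
  b <- rep(0, ncol(x))
  for (i in seq_len(iters)) {
    grad <- -2 * t(x) %*% (y - x %*% b) + 2 * lambda * b
    b <- b - step * grad
  }
  b
}
```

Applied to the simulated data from above, something like `mses <- sapply(c(0, 0.1, 1, 10, 100), function(l) mean((beta - ridge_gd(X, y, l))^2))` gives one MSE per lambda, and `which.min(mses)` picks out the value whose estimate is closest to the truth.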

Optional:

• Does your answer to the previous question change depending on something about the data you generated, for example the size or sparsity of the true coefficients?