## Seminar note

During the live seminar we'll implement the following for ordinary least squares. Completing the implementation for ridge regression may appear on the next problem set, and some simple mathematical analysis of it could be examinable, so this is good practice.

## Ridge regression

(You can delete this section if you get error messages about LaTeX)

Consider the loss function

$L(\beta) = \| \mathbf y - \mathbf X \beta \|^2 + \lambda \| \beta \|^2$

Equivalently, writing it out in sums,

$L(\beta) = \sum_{i=1}^n (y_i - \mathbf x_i^T \beta)^2 + \lambda \sum_{j=1}^p \beta_j^2$

Compute the gradient $\nabla L$ of this loss function, and use your answer to write an R function below that inputs $(X, y, \beta, \lambda)$ and outputs the value of the gradient. Also write a function that takes the same inputs and outputs the value of the ridge loss.
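One way to check your answer (verify this yourself before relying on it): expanding the squared norm and differentiating term by term gives

$$\nabla L(\beta) = -2\, \mathbf X^T (\mathbf y - \mathbf X \beta) + 2 \lambda \beta$$

which at $\lambda = 0$ reduces to the least squares gradient implemented below.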

(Don’t delete below here)

# Gradient of the least squares loss (the lambda = 0 case of the ridge loss)
least_squares_gradient <- function(x, y, beta) {
  -2 * t(x) %*% (y - x %*% beta)
}

# Least squares loss
least_squares_loss <- function(x, y, beta) {
  sum((y - x %*% beta)^2)
}
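A useful habit when hand-coding gradients is to compare them against a finite-difference approximation of the loss. This is a minimal sketch (the helper and test data here are my own, not part of the seminar code):

```r
least_squares_gradient <- function(x, y, beta) {
  -2 * t(x) %*% (y - x %*% beta)
}
least_squares_loss <- function(x, y, beta) {
  sum((y - x %*% beta)^2)
}

# Small illustrative problem: 5 observations, 4 coefficients
set.seed(42)
x <- matrix(rnorm(20), nrow = 5)
y <- rnorm(5)
beta <- rnorm(4)

# Central finite differences: perturb one coordinate at a time
h <- 1e-6
numeric_grad <- sapply(seq_along(beta), function(j) {
  e <- replace(rep(0, length(beta)), j, h)
  (least_squares_loss(x, y, beta + e) -
     least_squares_loss(x, y, beta - e)) / (2 * h)
})

# The analytic and numeric gradients should agree to high precision
max(abs(numeric_grad - least_squares_gradient(x, y, beta)))
```

If the two disagree by more than numerical noise, the analytic gradient has a bug; the same check works for the ridge gradient once you add a `lambda` argument.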

## Simulate data

Write a function that inputs a coefficient vector beta and a design matrix, and uses these to generate a vector of outcomes from a linear model with normal noise.

set.seed(1)
n <- 100
p <- 10
X <- matrix(rnorm(n*p), nrow = n)
beta <- rpois(p, lambda = 3)
y <- X %*% beta + rnorm(n)
least_squares_gradient(X, y, beta)
##             [,1]
##  [1,]  -7.655822
##  [2,]  29.799193
##  [3,]   7.017735
##  [4,]  -2.586372
##  [5,] -10.374698
##  [6,]   1.528508
##  [7,] -17.181227
##  [8,] -41.579952
##  [9,]  12.879127
## [10,] -21.106348
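The simulation above is done inline; the function the prompt asks for could be sketched like this (the name `simulate_outcomes` and the `noise_sd` argument are my own choices, not from the seminar):

```r
# Hypothetical wrapper for the simulation step: given a design matrix X
# and coefficient vector beta, draw y = X beta + noise with normal noise.
simulate_outcomes <- function(X, beta, noise_sd = 1) {
  n <- nrow(X)
  X %*% beta + rnorm(n, sd = noise_sd)
}

set.seed(1)
X <- matrix(rnorm(100 * 10), nrow = 100)
beta <- rpois(10, lambda = 3)
y <- simulate_outcomes(X, beta)
```

Wrapping the draw in a function makes it easy to rerun the experiment with different designs and coefficient vectors later.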

Now generate a design matrix and coefficient vector, and use these with the previous function to generate an outcome.

# Choose n and p, consider making p large, maybe even larger than n
# Also think about the size of the coefficients, and sparsity
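One way to follow these suggestions (the sizes and sparsity pattern below are illustrative, not prescribed):

```r
# A high-dimensional, sparse example: p larger than n,
# with only a few large nonzero coefficients.
set.seed(2)
n2 <- 50
p2 <- 100                                      # p > n
beta_sparse <- c(rep(5, 5), rep(0, p2 - 5))    # 5 nonzero coefficients
X2 <- matrix(rnorm(n2 * p2), nrow = n2)
y2 <- X2 %*% beta_sparse + rnorm(n2)
```

Settings like this are where the ridge penalty matters most: with p > n, ordinary least squares is not even uniquely defined.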

• Implement gradient descent to optimize the ridge loss function
• Apply your implementation to the data you generated in the previous section
• Compare the estimated coefficients to the true coefficients
beta0 <- rep(0, p)
grad0 <- least_squares_gradient(X, y, beta0)
previous_loss <- least_squares_loss(X, y, beta0)
beta1 <- beta0 - 0.0001 * grad0
next_loss <- least_squares_loss(X, y, beta1)
previous_beta <- beta1
steps <- 1

while (abs(previous_loss - next_loss) > 0.0001) {
  if (steps %% 10 == 0) print(previous_loss)
  steps <- steps + 1
  next_beta <- previous_beta - 0.0001 * least_squares_gradient(X, y, previous_beta)
  previous_loss <- next_loss
  next_loss <- least_squares_loss(X, y, next_beta)
  previous_beta <- next_beta
}
##  1536.472
##  159.3278
##  151.77
##  145.2942
##  139.7435
##  135.0087
##  130.9636
##  125.5322
##  110.2006
##  108.4356
##  106.9456
##  105.6849
##  104.6159
##  103.7073
##  102.9331
##  102.2718
##  101.7055
##  101.0657
##  99.00849
##  98.3756
##  98.16812
##  97.99213
##  97.8425
##  97.71496
##  97.60598
##  97.51188
##  97.00878
##  96.96367
##  96.92572
##  96.89373
##  96.8667
##  96.84383
##  96.82441
##  96.8079
##  96.79381
##  96.78178
##  96.77146
##  96.71617
##  96.71431
##  96.70984
##  96.70606
##  96.70285
##  96.70012
##  96.69778
##  96.69579
##  96.69121
##  96.68443
##  96.68362
##  96.68293
##  96.68235
##  96.68187
##  96.68145
##  96.6811
##  96.6808
##  96.68055
##  96.68033
##  96.68014
##  96.67907
##  96.67898
##  96.67891
##  96.67884
##  96.67879
##  96.67875
##  96.67871
##  96.67868
##  96.67865
##  96.67863
##  96.67861
##  96.67859
c(steps, previous_loss, next_loss)
##  695.00000  96.67844  96.67854
beta - previous_beta
##               [,1]
##  [1,] -0.028475498
##  [2,]  0.177765248
##  [3,] -0.004290968
##  [4,] -0.017128255
##  [5,] -0.054355122
##  [6,] -0.050542705
##  [7,] -0.047081517
##  [8,] -0.169549647
##  [9,]  0.086593620
## [10,] -0.106931045
mean((beta - previous_beta)^2)
##  0.008812863
coef(lm(y ~ X - 1)) - previous_beta
##                [,1]
##  [1,] -1.845182e-04
##  [2,]  6.105159e-04
##  [3,]  1.512364e-04
##  [4,]  1.368817e-04
##  [5,]  2.888292e-04
##  [6,] -5.384685e-05
##  [7,]  4.033052e-04
##  [8,]  3.873572e-04
##  [9,]  3.729351e-05
## [10,]  5.589654e-05
• Now change the value of lambda and repeat the previous steps
• Which value of lambda produced estimates closer to the truth?
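To run the experiment above with a nonzero lambda, the loss and gradient need the penalty terms. One possible shape, matching the gradient you computed earlier (this is a sketch of the exercise's answer, so only peek if you are stuck):

```r
# Hedged sketch of the ridge versions: relative to the least squares
# code above, the lambda penalty terms are the only change.
ridge_gradient <- function(x, y, beta, lambda) {
  -2 * t(x) %*% (y - x %*% beta) + 2 * lambda * beta
}

ridge_loss <- function(x, y, beta, lambda) {
  sum((y - x %*% beta)^2) + lambda * sum(beta^2)
}
```

At `lambda = 0` both reduce to the least squares versions, which gives a quick consistency check before rerunning gradient descent.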

Optional:

• Does your answer to the previous question change depending on something about the data you generated, for example the size or sparsity of the true coefficients?