- Hypothesis tests for coefficients
- Confidence intervals for coefficients
- Model diagnostic plots

`head(candy_rankings)`

```
## # A tibble: 6 x 13
## competitorname chocolate fruity caramel peanutyalmondy nougat
## <chr> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 100 Grand T F T F F
## 2 3 Musketeers T F F F T
## 3 One dime F F F F F
## 4 One quarter F F F F F
## 5 Air Heads F T F F F
## 6 Almond Joy T F F T F
## # ... with 7 more variables: crispedricewafer <lgl>, hard <lgl>,
## # bar <lgl>, pluribus <lgl>, sugarpercent <dbl>, pricepercent <dbl>,
## # winpercent <dbl>
```

- We’ll use the `sugarpercent` variable to predict `winpercent`

To compute a linear model, first use the function `lm(outcome_var ~ predictor_var, data = name_of_data)`

- Change `outcome_var` to the name of the variable you are trying to predict
- Change `predictor_var` to the name of the variable you are using as a predictor
- Change `name_of_data` to the name of the data set that has your variables
- Store the output of this function in a new variable so we can reuse it multiple times:

`candy_model <- lm(winpercent ~ sugarpercent, data = candy_rankings)`

Look at the basic information about the fitted model, including the estimated coefficients:

`candy_model`

```
##
## Call:
## lm(formula = winpercent ~ sugarpercent, data = candy_rankings)
##
## Coefficients:
## (Intercept) sugarpercent
## 44.6 11.9
```
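The printed coefficients can be pulled out with `coef()` and used to make predictions. A minimal sketch (not from the original slides): simulated data stands in for `candy_rankings` here so the example is self-contained, but the same calls work on `candy_model`.

```r
# Simulated stand-in for candy_rankings (assumption, for a runnable example)
set.seed(1)
toy_data <- data.frame(sugarpercent = runif(85))
toy_data$winpercent <- 45 + 12 * toy_data$sugarpercent + rnorm(85, sd = 14)

toy_model <- lm(winpercent ~ sugarpercent, data = toy_data)

# coef() returns the estimated intercept and slope as a named vector
b <- coef(toy_model)

# A fitted value is just intercept + slope * x; predict() computes the same thing
x_new <- data.frame(sugarpercent = 0.5)
by_hand    <- b["(Intercept)"] + b["sugarpercent"] * 0.5
by_predict <- predict(toy_model, newdata = x_new)
all.equal(unname(by_hand), unname(by_predict))  # TRUE
```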

Use `summary()` on the output from `lm()` to get (1) p-values for the coefficients and (2) the R-squared measure of goodness of fit

`summary(candy_model)`

```
##
## Call:
## lm(formula = winpercent ~ sugarpercent, data = candy_rankings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.92 -11.07 -1.17 9.25 36.85
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.61 3.09 14.46 <0.0000000000000002 ***
## sugarpercent 11.92 5.56 2.14 0.035 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.4 on 83 degrees of freedom
## Multiple R-squared: 0.0525, Adjusted R-squared: 0.0411
## F-statistic: 4.6 on 1 and 83 DF, p-value: 0.0349
```

- \(p\)-values for hypothesis tests of null hypotheses saying a variable’s true coefficient is 0 \[ H_{0,j} : \beta_j = 0 \]
- Each coefficient is tested individually. The test statistic \[ T = \frac{\hat \beta_j}{\text{SE}(\hat \beta_j)} \] is compared to a \(t\)-distribution with \(n - p\) degrees of freedom, where \(p\) is the number of coefficients (e.g. \(p = 2\) for the simple linear model with one intercept and one slope)
- The tail areas of the \(t\)-distribution are used to calculate the probability of observing a value of \(T\) at least as extreme as the value in the data, assuming the null hypothesis \(\beta_j = 0\) is true
- The default is to give \(p\)-values for two-sided alternatives
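The test statistic and two-sided p-value from the bullets above can be reproduced by hand and checked against `summary()`. A sketch using simulated stand-in data (an assumption, so the example is self-contained):

```r
# Simulated stand-in for candy_rankings (assumption, for a runnable example)
set.seed(1)
n <- 85
toy_data <- data.frame(sugarpercent = runif(n))
toy_data$winpercent <- 45 + 12 * toy_data$sugarpercent + rnorm(n, sd = 14)
toy_model <- lm(winpercent ~ sugarpercent, data = toy_data)

ctab <- coef(summary(toy_model))  # Estimate, Std. Error, t value, Pr(>|t|)
beta_hat <- ctab["sugarpercent", "Estimate"]
se_hat   <- ctab["sugarpercent", "Std. Error"]

p <- length(coef(toy_model))      # number of coefficients (2 here)
t_stat <- beta_hat / se_hat       # T = beta_hat / SE(beta_hat)
p_val  <- 2 * pt(abs(t_stat), df = n - p, lower.tail = FALSE)  # two-sided

all.equal(t_stat, ctab["sugarpercent", "t value"])
all.equal(p_val,  ctab["sugarpercent", "Pr(>|t|)"])
```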

- The R-squared and Adjusted R-squared values measure how well the linear model explains the outcome variable
- \(R^2\) values should be between 0 and 1
- Values close to 1 indicate the model fits the data very well
- Values close to 0 indicate the model fits the data poorly
- What counts as a good value? This depends on the application. In social sciences, biology, and finance, it may be common for low values of \(R^2\) to still be considered impressive

- It’s possible, but rare, for Adjusted \(R^2\) to be negative
- Adjusted \(R^2\) cannot be larger than \(R^2\). It’s usually better to focus on the Adjusted one.
- Why is it called “adjusted”? It’s possible to increase \(R^2\) just by putting more and more predictors into the model, even if they’re not very good predictors. The adjusted \(R^2\) is penalized for the number of predictors. We’ll come back to this kind of idea soon when we talk about model selection
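The penalty is a simple rescaling: Adjusted \(R^2 = 1 - (1 - R^2)\frac{n-1}{n-p}\), where \(p\) is the number of coefficients. A sketch checking this formula against `summary()`, using simulated stand-in data (an assumption, so the example is self-contained):

```r
# Simulated stand-in for candy_rankings (assumption, for a runnable example)
set.seed(1)
n <- 85
toy_data <- data.frame(sugarpercent = runif(n))
toy_data$winpercent <- 45 + 12 * toy_data$sugarpercent + rnorm(n, sd = 14)
toy_model <- lm(winpercent ~ sugarpercent, data = toy_data)

s  <- summary(toy_model)
r2 <- s$r.squared
p  <- length(coef(toy_model))

# Adjusted R^2 penalizes for the number of coefficients p
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p)
all.equal(adj_r2, s$adj.r.squared)
```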

- Use `confint()` to get confidence intervals for the coefficients
- Optionally specify a different confidence level, for example with `confint(model, level = 0.99)`

`confint(candy_model)`

```
## 2.5 % 97.5 %
## (Intercept) 38.471 50.75
## sugarpercent 0.866 22.98
```

`confint(candy_model, level = .99)`

```
## 0.5 % 99.5 %
## (Intercept) 36.473 52.75
## sugarpercent -2.733 26.58
```
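These intervals are the familiar estimate \(\pm\) (critical value) \(\times\) (standard error), with the critical value taken from the \(t\)-distribution on the residual degrees of freedom. A sketch reproducing `confint()` by hand, using simulated stand-in data (an assumption, so the example is self-contained):

```r
# Simulated stand-in for candy_rankings (assumption, for a runnable example)
set.seed(1)
n <- 85
toy_data <- data.frame(sugarpercent = runif(n))
toy_data$winpercent <- 45 + 12 * toy_data$sugarpercent + rnorm(n, sd = 14)
toy_model <- lm(winpercent ~ sugarpercent, data = toy_data)

ctab <- coef(summary(toy_model))
est <- ctab["sugarpercent", "Estimate"]
se  <- ctab["sugarpercent", "Std. Error"]

# For a 95% interval, put 2.5% in each tail of the t-distribution
t_crit  <- qt(0.975, df = toy_model$df.residual)
by_hand <- c(est - t_crit * se, est + t_crit * se)
all.equal(unname(by_hand), unname(confint(toy_model)["sugarpercent", ]))
```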

- First, if you have not already installed it, install the `ggfortify` package with `install.packages("ggfortify")`, and remember to load the package with `library(ggfortify)`
- Use the `autoplot()` function from the `ggfortify` package to see multiple diagnostic plots for the fitted model:

`autoplot(candy_model)`

**Residuals vs Fitted** (top left): check for two things

- obvious patterns between the residuals and fitted values indicate possible non-linear relationships that the linear model has not captured
- the vertical spread of points should be roughly the same throughout the graph. If the points are more vertically spread in one part of the plot compared to another (like a triangle pointing left or right) this indicates *heteroscedastic errors*

**Normal Q-Q** (top right): points far away from the diagonal line indicate residuals that are not normally distributed. The best case is to see the points lined up perfectly on the dashed diagonal line

**Scale-Location** (bottom left): interpret similarly to top left

**Residuals vs Leverage** (bottom right): points far toward the right have high leverage, and if the values of their residuals are also large (far from 0 in either direction) then those points may be outliers

One common solution: look for non-linear relationships between predictors and the outcome. Create scatterplots and look for things like polynomial (quadratic, cubic) or exponential/logarithmic relationships. For example, if there appears to be a quadratic relationship between \(x\) and \(y\) then add \(x^2\) to your predictors. Is the outcome variable highly skewed, like something involving money? Then consider predicting the logarithm of that variable instead: transform \(y\) to \(\log(y)\).
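In `lm()` formulas, a squared term needs to be wrapped in `I()`, and the log transform can go directly on the left-hand side. A sketch on simulated data (the variable names here are illustrative, not from the slides):

```r
# Simulated data where y depends on x through an exponentiated quadratic,
# so y is skewed and curved (assumption, for a runnable example)
set.seed(1)
n <- 200
x <- runif(n, 0, 3)
y <- exp(1 + 0.5 * x - 0.3 * x^2 + rnorm(n, sd = 0.2))

linear_fit    <- lm(y ~ x)                 # ignores the curvature and skew
quadratic_fit <- lm(log(y) ~ x + I(x^2))   # I() protects x^2 inside a formula

# The transformed model matches the true relationship and explains more variance
summary(linear_fit)$r.squared
summary(quadratic_fit)$r.squared
```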

- These next solutions are advanced topics, and it is optional (not required) to read about them
- Other possibilities, especially if the first solution above has failed: consider “robust” methods like sandwich estimators or the wild bootstrap