- In an observational study we can’t control or randomize the exposure
- Confounder: associated with both the exposure and the outcome
- Can (try to) control for confounders when they are measured
- Unobserved confounders are always a major limitation of observational studies
- RCTs (randomized controlled trials) are a higher standard of evidence

We’ve got a bunch of data. What can we do with it? Look at a giant spreadsheet?

*How do we extract meaning from a collection of (many) observations?*

- Measures of central tendency
- Measures of dispersion

- Data as a list of numbers
- Number of observations \(n\) is the length of the list
- Write \(x_1\) for the first number, \(x_2\) for the second, …
- In general \(x_i\) denotes the \(i\)th number in the list.
- What’s the last number? \(x_n\)
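A minimal sketch of this notation in R, with a made-up vector `x`:

```
x <- c(3, 1, 4, 1, 5, 9, 2)  # a small made-up list of numbers
n <- length(x)               # number of observations
x[1]                         # the first number, x_1
x[n]                         # the last number, x_n
```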

- The **mean** or (arithmetic) average of the data is the sum of all the \(x_i\)’s divided by \(n\): \[\bar x := \frac{1}{n}\sum_{i=1}^n x_i\]

**Order statistics**:

- Write \(x_{(1)}\) for the smallest number, \(x_{(2)}\) for the second smallest, …
- What’s the largest? \(x_{(n)}\)
- \(x_{(1)}, x_{(2)}, \ldots, x_{(n)}\) is a sorted list
- The **median** is the middle number if \(n\) is odd, or the average of the middle two if \(n\) is even: \[m := x_{((n+1)/2)} \quad \text{or} \quad \tfrac{1}{2}\left(x_{(n/2)}+x_{(n/2+1)}\right)\]
- Half of the data points are below \(m\) and half above
- Counting how many times each unique number occurs gives a **table**
- A **mode** is a number which occurs the most times
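All of these are one-liners in R; a sketch using the same made-up `x` (base R’s `mode()` reports a storage type, not the statistical mode, so that one takes a small trick):

```
mean(x)                     # arithmetic average
sort(x)                     # the order statistics x_(1), ..., x_(n)
median(x)                   # middle number (or average of the middle two)
table(x)                    # how many times each unique number occurs
names(which.max(table(x)))  # a mode (the first one, if there are ties)
```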

*The mean is the most commonly used, but the others are more useful/meaningful in certain scenarios*

- **Range**: \(x_{(n)} - x_{(1)}\)
- The (sample) **standard deviation** (or SD): \[ s := \sqrt{\frac{1}{n-1}\sum_{i=1}^n(x_i - \bar x)^2} \]
- (Why \(n-1\)? Otherwise \(s\) would be too small. More on this another day)
- This is like an *average distance from each point to the overall average*
- **Quartiles**: 25% of the data points are below \(Q_1\) and 75% above; \(Q_2 = m\), and 25% of the data points are above \(Q_3\)
- The **interquartile range** \(IQR := Q_3 - Q_1\) contains the middle 50% of the data
- Another average distance from the center is the **median absolute deviation**: \[ MAD := \text{median}(|x_i - m|) \]
- This means: make a new list \(|x_1 - m|, |x_2 - m|, \ldots\) and then find its median
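In R, a sketch with the same made-up `x`; one caveat is that `mad()` rescales by 1.4826 by default, so `constant = 1` is needed to match the plain definition above:

```
max(x) - min(x)               # range
sd(x)                         # sample SD (n-1 denominator)
quantile(x, c(.25, .5, .75))  # Q1, Q2 (= median), Q3
IQR(x)                        # interquartile range Q3 - Q1
mad(x, constant = 1)          # median absolute deviation, unscaled
```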

*SD is the most commonly used*

- How far apart can \(\bar x\) and \(m\) be? Not too far: \(|\bar x - m| \leq s\)
- What % of the \(x_i\)’s are within distance \(s\) from \(\bar x\)? *Usually* at least 68%
- **68-95-99 rule**: (*usually*) at least 68%, 95%, and 99% of the \(x_i\)’s are within \(s\), \(2s\), \(3s\) of \(\bar x\), respectively
- e.g. *usually* at least 99% of the \(x_i\)’s will satisfy \[ \bar x - 3s \leq x_i \leq \bar x + 3s \]
- Does this justify using measures of central tendency and spread as summaries of the data?
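Both claims are easy to spot-check on the made-up `x` from before; the same checks on the real survey data follow below.

```
xbar <- mean(x)
m <- median(x)
s <- sd(x)
abs(xbar - m) <= s        # the mean-median bound
mean(abs(x - xbar) <= s)  # fraction within one SD of the mean
```

Now let’s compute these summaries on real data from the class survey.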

```
# read the survey responses and give the columns short names
data <- read.csv("class_survey.csv", header = TRUE, stringsAsFactors = FALSE)
names(data) <- c("time", "major", "job_title", "stat_relevant", "math_proficiency",
                 "stat_excited", "comp_proficiency", "skills", "takeaway", "topics",
                 "learning_value", "suggestion", "hobby", "hours_study", "hours_play",
                 "hours_work", "age", "gender", "distance")
names(data)
```

```
## [1] "time" "major" "job_title"
## [4] "stat_relevant" "math_proficiency" "stat_excited"
## [7] "comp_proficiency" "skills" "takeaway"
## [10] "topics" "learning_value" "suggestion"
## [13] "hobby" "hours_study" "hours_play"
## [16] "hours_work" "age" "gender"
## [19] "distance"
```

`mean(data$age, na.rm = TRUE)`

`## [1] 18.89831`

`median(data$age, na.rm = TRUE)`

`## [1] 19`

`table(data$age)`

```
##
## 18 19 20 21 32
## 29 20 8 1 1
```
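From this table, the mode is 18 (it occurs 29 times). One way to extract it programmatically, noting that `which.max` returns only the first mode if there are ties:

`names(which.max(table(data$age)))`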

You can ignore this part. Some survey respondents entered travel times in different formats, so I wrote some code to try to put them all in the same format.

`library(lubridate)`

```
##
## Attaching package: 'lubridate'
```

```
## The following object is masked from 'package:base':
##
## date
```

```
# keep only the responses that contain at least one digit
has_numbers_in_it <- data$distance[grep("[[:digit:]]", data$distance)]
times <- gsub("^([0-9]*)$", "\\1:00", has_numbers_in_it)  # bare numbers: "3" -> "3:00"
times <- gsub("hour", ":00", fixed = TRUE, times)         # "3 hours" -> "3 :00s"
times <- gsub(" ", "", times)                             # drop spaces
times <- gsub("[[:alpha:]]", "", times)                   # drop leftover letters
times <- hm(times)                                        # parse as hours:minutes
```

```
## Warning in .parse_hms(..., order = "HM", quiet = quiet): Some strings
## failed to parse, or all strings are NAs
```

```
times <- 60*hour(times) + minute(times)  # convert to total minutes
head(cbind(has_numbers_in_it, times))    # compare raw strings to parsed minutes
```

```
## has_numbers_in_it times
## [1,] "2:07" "127"
## [2,] "3 hours by plane" "180"
## [3,] "11:01" "661"
## [4,] "2:00" "120"
## [5,] "1:09" "69"
## [6,] "4:00" "240"
```

`summary(times)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 25.00 70.75 280.00 1607.02 930.00 58200.00 1
```

Removing NAs and extreme outliers:

```
times <- times[!is.na(times)]  # drop the one response that failed to parse
times <- times[times < 10000]  # drop the implausible 58200-minute response
```

Now let’s look at the summary statistics:

`mean(times)`

`## [1] 578.0545`

`median(times)`

`## [1] 260`

`sd(times)`

`## [1] 686.6592`

`summary(times)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25.0 70.5 260.0 578.1 910.0 3000.0
```
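The mean (578.1) sits well above the median (260), as is typical of right-skewed data, yet the gap still respects the earlier bound \(|\bar x - m| \leq s\), since \(|578.1 - 260| \approx 318 \leq 686.7\). A one-line check:

`abs(mean(times) - median(times)) <= sd(times)`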

Let’s try out the 68-95-99 rule:

```
x <- times
xbar <- mean(x)
s <- sd(x)
# fraction of observations within 1, 2, and 3 SDs of the mean
rbind(
  mean(abs(x - xbar) <= s),
  mean(abs(x - xbar) <= 2*s),
  mean(abs(x - xbar) <= 3*s))
```

```
## [,1]
## [1,] 0.8545455
## [2,] 0.9454545
## [3,] 0.9818182
```

Well that’s awkward… or is it?

```
x <- data$hours_study
xbar <- mean(x)
s <- sd(x)
# the same check, now for hours spent studying
rbind(
  mean(abs(x - xbar) <= s),
  mean(abs(x - xbar) <= 2*s),
  mean(abs(x - xbar) <= 3*s))
```

```
## [,1]
## [1,] 0.6833333
## [2,] 0.9666667
## [3,] 1.0000000
```