#### Review: Observational studies

• Can’t control and randomize
• Confounder: associated with both exposure and outcome
• Can (try to) control for confounders when they are measured
• Unobserved confounders are always a major limitation of observational studies
• RCTs are a higher standard of evidence

#### Summarizing data

We’ve got a bunch of data. What can we do with it? Look at a giant spreadsheet?

How do we extract meaning from a collection of (many) observations?

##### Today we will talk about
• Measures of central tendency
• Measures of dispersion
##### Notation for data
• Data as a list of numbers
• Number of observations $$n$$ is the length of the list
• Write $$x_1$$ for the first number, $$x_2$$ for the second, …
• In general $$x_i$$ denotes the $$i$$th number in the list.
• What’s the last number? $$x_n$$
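
As a small illustration of this notation, here is a minimal R sketch. The vector `x` is made up for the example and is not from any data set in these notes.

```r
# Hypothetical data: a list of n = 5 numbers
x <- c(4, 8, 15, 16, 23)

n <- length(x)  # number of observations
x[1]            # first number, x_1
x[2]            # second number, x_2
x[n]            # last number, x_n
```

R indexes vectors starting at 1, so `x[i]` lines up directly with $$x_i$$.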
##### Measures of central tendency
• The mean or (arithmetic) average of the data is the sum of all the $$x_i$$’s divided by $$n$$: $\bar x := \frac{1}{n}\sum_{i=1}^n x_i$
• Order statistics:
• Write $$x_{(1)}$$ for the smallest number, $$x_{(2)}$$ for the second smallest, …
• What’s the largest? $$x_{(n)}$$
• $$x_{(1)}, x_{(2)}, \ldots, x_{(n)}$$ is a sorted list
• The median is the middle one if $$n$$ is odd, or the average of the middle two if $$n$$ is even: $m := x_{((n+1)/2)} \quad (n \text{ odd}) \quad \text{or} \quad \frac{1}{2}(x_{(n/2)}+x_{(n/2+1)}) \quad (n \text{ even})$
• About half of the data points are at or below $$m$$ and about half are at or above
• Counting how many times each unique number occurs gives a table
• A mode is a number that occurs most often (there can be more than one)

The mean is the most commonly used, but the others are more useful/meaningful in certain scenarios
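
The definitions above can be sketched in R as follows. The vector `x` is hypothetical, and the by-hand calculations are only there to mirror the formulas; in practice `mean()` and `median()` do the same job.

```r
# Hypothetical data (not from the survey)
x <- c(3, 1, 4, 1, 5, 9, 2, 6)
n <- length(x)

# Mean: sum of the x_i divided by n
xbar <- sum(x) / n
all.equal(xbar, mean(x))

# Order statistics: x_sorted[k] plays the role of x_(k)
x_sorted <- sort(x)

# Median: middle value if n is odd, average of the middle two if n is even
m <- if (n %% 2 == 1) {
  x_sorted[(n + 1) / 2]
} else {
  (x_sorted[n / 2] + x_sorted[n / 2 + 1]) / 2
}
all.equal(m, median(x))

# Mode(s): value(s) with the largest count in the frequency table
counts <- table(x)
modes <- as.numeric(names(counts)[counts == max(counts)])
```

Base R has no built-in function for the statistical mode (`mode()` reports an object's storage type), which is why the sketch goes through `table()`.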

#### Measures of dispersion

• Range: $$x_{(n)} - x_{(1)}$$
• The (sample) standard deviation (or SD): $s := \sqrt{\frac{1}{n-1}\sum_{i=1}^n(x_i - \bar x)^2}$
• (Why $$n-1$$? Otherwise $$s$$ would be too small. More on this another day)
• This is like an average distance from each point to the overall average
• Quartiles: 25% of the data points are below $$Q_1$$ and 75% above.
• $$Q_2 = m$$, and 25% of the data points are above $$Q_3$$.
• The interquartile range $$IQR := Q_3 - Q_1$$ contains the middle 50% of the data
• Another measure of typical distance from the center is the median absolute deviation: $MAD := \text{median}(|x_i - m|)$
• This means: make a new list $$|x_1 - m|, |x_2 - m|, \ldots, |x_n - m|$$ and then find its median

SD is the most commonly used
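
Here is a minimal R sketch of these measures on a made-up vector `x`. One caveat worth flagging: R's `mad()` multiplies by 1.4826 by default, so `constant = 1` is needed to match the definition above, and `quantile()` interpolates between data points, so quartiles can differ slightly from other conventions.

```r
# Hypothetical data (not from the survey)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)
xbar <- mean(x)

# Range: largest minus smallest
rng <- max(x) - min(x)

# Sample SD with the n - 1 denominator; should agree with sd()
s <- sqrt(sum((x - xbar)^2) / (n - 1))
all.equal(s, sd(x))

# Quartiles and IQR, using quantile()'s default interpolation
q <- quantile(x, c(0.25, 0.5, 0.75))
iqr <- unname(q[3] - q[1])
all.equal(iqr, IQR(x))

# MAD as defined above; constant = 1 turns off R's default scaling
m <- median(x)
mad_by_hand <- median(abs(x - m))
all.equal(mad_by_hand, mad(x, constant = 1))
```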

#### A few interesting facts

• How far apart can $$\bar x$$ and $$m$$ be? Not too far: $$| \bar x - m| \leq s$$
• What % of the $$x_i$$’s are within distance $$s$$ of $$\bar x$$? Usually about 68% or more
• 68-95-99 rule: for roughly bell-shaped data, about 68%, 95%, and 99% (more precisely 99.7%) of the $$x_i$$’s are within $$s$$, $$2s$$, and $$3s$$ of $$\bar x$$, respectively
• e.g. usually about 99% of the $$x_i$$’s will satisfy $\bar x - 3s \leq x_i \leq \bar x + 3s$
• Does this justify using measures of central tendency and spread as summaries of the data?
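
One way to get a feel for these facts is to try them on simulated data. The sketch below is only an illustration: the helper name `coverage`, the seed, and the two simulated samples are assumptions made for the example, not part of the course data.

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible

# Fraction of points within k standard deviations of the mean, for k = 1, 2, 3
coverage <- function(x) {
  xbar <- mean(x)
  s <- sd(x)
  sapply(1:3, function(k) mean(abs(x - xbar) <= k * s))
}

bell <- rnorm(10000)   # simulated bell-shaped data
skewed <- rexp(10000)  # simulated right-skewed data

coverage(bell)    # expected to land close to 0.68, 0.95, 0.997
coverage(skewed)  # can differ noticeably from the rule

# The bound |mean - median| <= s holds for any data set
abs(mean(skewed) - median(skewed)) <= sd(skewed)
```

The rule is a statement about roughly bell-shaped data, so skewed data need not follow it closely, while the mean-median bound does not depend on the shape of the data.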

#### Some class survey responses

data <- read.csv("class_survey.csv", header = TRUE, stringsAsFactors = FALSE)
names(data) <- c("time", "major", "job_title", "stat_relevant", "math_proficiency",
                 "stat_excited", "comp_proficiency", "skills", "takeaway", "topics",
                 "learning_value", "suggestion", "hobby", "hours_study", "hours_play",
                 "hours_work", "age", "gender", "distance")
names(data)
##  [1] "time"             "major"            "job_title"
##  [4] "stat_relevant"    "math_proficiency" "stat_excited"
##  [7] "comp_proficiency" "skills"           "takeaway"
## [10] "topics"           "learning_value"   "suggestion"
## [13] "hobby"            "hours_study"      "hours_play"
## [16] "hours_work"       "age"              "gender"
## [19] "distance"
##### Summary of age
mean(data$age, na.rm = TRUE)
## [1] 18.89831
median(data$age, na.rm = TRUE)
## [1] 19
table(data$age)
##
## 18 19 20 21 32
## 29 20  8  1  1
##### Summary of distance

You can ignore this part. Some survey respondents entered travel times in different formats, so I wrote some code to try to put them all in the same format.

library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
##     date
has_numbers_in_it <- data$distance[grep("([[:digit:]])", data$distance)]
times <- gsub("^([0-9]*)$", "\\1:00", has_numbers_in_it)
times <- gsub("hour", ":00", times, fixed = TRUE)
times <- gsub(" ", "", times)
times <- gsub("[[:alpha:]]", "", times)
times <- hm(times)
## Warning in .parse_hms(..., order = "HM", quiet = quiet): Some strings
## failed to parse, or all strings are NAs
times <- 60*hour(times) + minute(times)
head(cbind(has_numbers_in_it, times))
##      has_numbers_in_it  times
## [1,] "2:07"             "127"
## [2,] "3 hours by plane" "180"
## [3,] "11:01"            "661"
## [4,] "2:00"             "120"
## [5,] "1:09"             "69"
## [6,] "4:00"             "240"
summary(times)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's
##    25.00    70.75   280.00  1607.02   930.00 58200.00        1

Removing NA’s and outliers:

times <- times[!is.na(times)]
times <- times[times < 10000]

Now let’s look at the summary statistics:

mean(times)
## [1] 578.0545
median(times)
## [1] 260
sd(times)
## [1] 686.6592
summary(times)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    25.0    70.5   260.0   578.1   910.0  3000.0

Let’s try out the 68-95-99 rule

x <- times
xbar <- mean(x)
s <- sd(x)
rbind(
mean(abs(x - xbar) <= s),
mean(abs(x - xbar) <= 2*s),
mean(abs(x - xbar) <= 3*s))
##           [,1]
## [1,] 0.8545455
## [2,] 0.9454545
## [3,] 0.9818182

Well that’s awkward… or is it?

x <- data$hours_study
xbar <- mean(x)
s <- sd(x)
rbind(
mean(abs(x - xbar) <= s),
mean(abs(x - xbar) <= 2*s),
mean(abs(x - xbar) <= 3*s))
##           [,1]
## [1,] 0.6833333
## [2,] 0.9666667
## [3,] 1.0000000