Review: Observational studies

Summarizing data

We’ve got a bunch of data. What can we do with it? Look at a giant spreadsheet?

How do we extract meaning from a collection of (many) observations?

Today we will talk about
  • Measures of central tendency
  • Measures of dispersion
Notation for data
  • Data as a list of numbers
  • Number of observations \(n\) is length of the list
  • Write \(x_1\) for the first number, \(x_2\) for the second, …
  • In general \(x_i\) denotes the \(i\)th number in the list.
  • What’s the last number? \(x_n\)
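
In R, this notation maps directly onto a numeric vector. A minimal sketch, using a made-up vector `x` rather than the survey data:

```r
# Hypothetical data: a list of n = 5 observations
x <- c(4, 8, 15, 16, 23)

n <- length(x)  # number of observations, n
x[1]            # x_1, the first number
x[2]            # x_2, the second number
x[n]            # x_n, the last number
```
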
Measures of central tendency
  • The mean or (arithmetic) average of the data is the sum of all the \(x_i\)’s divided by \(n\) \[\bar x := \frac{1}{n}\sum_{i=1}^n x_i\]
  • Order statistics:
  • Write \(x_{(1)}\) for the smallest number, \(x_{(2)}\) for the second smallest, …
  • What’s the largest? \(x_{(n)}\)
  • \(x_{(1)}, x_{(2)}, \ldots, x_{(n)}\) is a sorted list
  • The median is the middle one if \(n\) is odd, or the average of the middle two if \(n\) is even \[m := \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left(x_{(n/2)}+x_{(n/2+1)}\right) & \text{if } n \text{ is even} \end{cases}\]
  • At least half of the data points are at or below \(m\) and at least half are at or above
  • Counting how many times each unique number occurs gives a table
  • A mode is a number that occurs most often

The mean is the most commonly used, but the median and mode can be more meaningful in some scenarios, for example when the data are skewed or contain outliers.
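
The definitions above can be checked by hand in R. The sketch below uses a small made-up vector with an even number of observations; `x_sorted` and `counts` are illustrative names, not part of the survey code.

```r
# Hypothetical data with n = 6 observations (n is even)
x <- c(3, 1, 4, 1, 5, 9)
n <- length(x)

# Mean: sum of the x_i divided by n
xbar <- sum(x) / n
all.equal(xbar, mean(x))     # agrees with the built-in mean()

# Order statistics: x_(1) <= x_(2) <= ... <= x_(n)
x_sorted <- sort(x)          # 1 1 3 4 5 9

# Median for even n: average of the two middle order statistics
m <- (x_sorted[n / 2] + x_sorted[n / 2 + 1]) / 2
all.equal(m, median(x))      # agrees with the built-in median()

# Mode: tabulate the unique values and keep the most frequent one(s)
counts <- table(x)
names(counts)[counts == max(counts)]   # "1" (returned as a character label)
```

Base R has no built-in function for the statistical mode (`mode()` reports an object's storage type), which is why the table is used here.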

Measures of dispersion

SD is the most commonly used
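
For reference, R's `sd()` computes the sample standard deviation, which divides the sum of squared deviations from the mean by \(n - 1\) before taking the square root. The sketch below checks this by hand on made-up data.

```r
# Hypothetical data
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)
xbar <- mean(x)

# Sample standard deviation: squared deviations summed, divided by n - 1, then rooted
s <- sqrt(sum((x - xbar)^2) / (n - 1))

all.equal(s, sd(x))   # matches R's built-in sd()
```
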

A few interesting facts

Some class survey responses

data <- read.csv("class_survey.csv", header = TRUE, stringsAsFactors = FALSE)
names(data) <- c("time", "major", "job_title", "stat_relevant", "math_proficiency",
                 "stat_excited", "comp_proficiency", "skills", "takeaway", "topics",
                 "learning_value", "suggestion", "hobby", "hours_study", "hours_play",
                 "hours_work", "age", "gender", "distance")
names(data)
##  [1] "time"             "major"            "job_title"       
##  [4] "stat_relevant"    "math_proficiency" "stat_excited"    
##  [7] "comp_proficiency" "skills"           "takeaway"        
## [10] "topics"           "learning_value"   "suggestion"      
## [13] "hobby"            "hours_study"      "hours_play"      
## [16] "hours_work"       "age"              "gender"          
## [19] "distance"
Summary of age
mean(data$age, na.rm = TRUE)
## [1] 18.89831
median(data$age, na.rm = TRUE)
## [1] 19
table(data$age)
## 
## 18 19 20 21 32 
## 29 20  8  1  1
Summary of distance

You can ignore this part. Some survey respondents entered travel times in different formats, so I wrote some code to try to put them all in the same format.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
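# Keep only the responses that contain at least one digit
has_numbers_in_it <- data$distance[grep("([[:digit:]])", data$distance)]
# A bare number is read as hours: "3" becomes "3:00"
times <- gsub("^([0-9]*)$", "\\1:00", has_numbers_in_it)
# "3 hours" becomes "3 :00s"; the stray characters are stripped below
times <- gsub("hour", ":00", times, fixed = TRUE)
times <- gsub(" ", "", times)
times <- gsub("[[:alpha:]]", "", times)
# Parse as hours:minutes
times <- hm(times)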
## Warning in .parse_hms(..., order = "HM", quiet = quiet): Some strings
## failed to parse, or all strings are NAs
times <- 60*hour(times) + minute(times)
head(cbind(has_numbers_in_it, times))
##      has_numbers_in_it  times
## [1,] "2:07"             "127"
## [2,] "3 hours by plane" "180"
## [3,] "11:01"            "661"
## [4,] "2:00"             "120"
## [5,] "1:09"             "69" 
## [6,] "4:00"             "240"
summary(times)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##    25.00    70.75   280.00  1607.02   930.00 58200.00        1

Removing NAs and outliers (here, any time of 10000 minutes or more):

times <- times[!is.na(times)]
times <- times[times < 10000]

Now let’s look at the summary statistics:

mean(times)
## [1] 578.0545
median(times)
## [1] 260
sd(times)
## [1] 686.6592
summary(times)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    25.0    70.5   260.0   578.1   910.0  3000.0
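
The drop in the mean from about 1607 to 578, against a much smaller change in the median (280 to 260), reflects how sensitive the mean is to a few extreme values. A small illustration with made-up travel times in minutes (these are not the survey responses, and `t_all` and `t_trim` are illustrative names):

```r
# Hypothetical travel times in minutes, with one extreme value
t_all <- c(30, 60, 90, 120, 50000)

mean(t_all)     # 10060, dominated by the single extreme value
median(t_all)   # 90

# Drop the extreme value with the same kind of cutoff used for the survey data
t_trim <- t_all[t_all < 10000]

mean(t_trim)    # 75
median(t_trim)  # 75
```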

Let’s try out the 68-95-99.7 rule

x <- times
xbar <- mean(x)
s <- sd(x)
rbind(
mean(abs(x - xbar) <= s),
mean(abs(x - xbar) <= 2*s),
mean(abs(x - xbar) <= 3*s))
##           [,1]
## [1,] 0.8545455
## [2,] 0.9454545
## [3,] 0.9818182

Well that’s awkward… or is it?

x <- data$hours_study
xbar <- mean(x)
s <- sd(x)
rbind(
mean(abs(x - xbar) <= s),
mean(abs(x - xbar) <= 2*s),
mean(abs(x - xbar) <= 3*s))
##           [,1]
## [1,] 0.6833333
## [2,] 0.9666667
## [3,] 1.0000000
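
For comparison, the rule's reference values come from the normal distribution: the probability that a normal variable falls within \(k\) standard deviations of its mean is `pnorm(k) - pnorm(-k)`. The helper `within_k_sd` below is a hypothetical convenience function that wraps the calculation used above; it is a sketch, not part of the original analysis.

```r
# Proportions predicted by a normal distribution for k = 1, 2, 3
k <- 1:3
round(pnorm(k) - pnorm(-k), 4)
## [1] 0.6827 0.9545 0.9973

# Hypothetical helper: fraction of observations within k SDs of the mean
within_k_sd <- function(x, k = 1:3) {
  x <- x[!is.na(x)]                 # drop missing values first
  z <- abs(x - mean(x)) / sd(x)     # distance from the mean in SD units
  sapply(k, function(kk) mean(z <= kk))
}

# Example on simulated normal data (seed fixed for reproducibility)
set.seed(1)
within_k_sd(rnorm(10000))
```

Data that are roughly bell-shaped, such as the simulated sample here, should land close to the reference values. Strongly skewed data, such as the travel times, need not, which is one plausible reading of the mismatch seen earlier.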