R Markdown

The R language has many functions. Some are in “base” R, and many more are in thousands of packages developed by researchers, companies, hobbyists, and others (including me). Probably the most popular packages are those in the “tidyverse,” like ggplot2, dplyr, and so on. To use those functions, you must first install the packages on your computer with install.packages("packagename"). This may need to be repeated if you update to a new version of R. Second, each time you open R and want to use a function from a package, you must first load the package using library(packagename). It’s best to put this at the top of the file.

# This is a comment. Comments can be helpful for explaining how the code works.
# These are useful not only for other people to understand your code...
# but for you to understand your own code some time later, because you
# will almost certainly forget what you were doing!
library(ggplot2)
library(lubridate)
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##     date
# If there are confusing messages when you load packages, you can sometimes just ignore them
# (exception: if they say the package is not installed or not available).
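
As a rough sketch of the install-then-load routine, the snippet below checks whether the two packages used in this document are installed and prints the install command for any that are missing. The variable names (needed, installed, missing_pkgs) are made up for this illustration.

```r
# Packages this document relies on
needed <- c("ggplot2", "lubridate")

# requireNamespace() tries to load each package's namespace (without attaching it)
# and returns TRUE or FALSE depending on whether that worked
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  # Installing only has to be done once per R installation
  message("Run once: install.packages(c(",
          paste0('"', missing_pkgs, '"', collapse = ", "), "))")
} else {
  message("All packages are installed; load them with library().")
}
```

The check does not install anything by itself, so it is safe to leave at the top of a script.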

Next, we load the output from the class survey from the file “class_survey.csv” into R so we can start analyzing it. We also rename the variables for each column in the spreadsheet to something shorter and easier to read and type.

class_survey <- read.csv("class_survey.csv", header = TRUE, stringsAsFactors = FALSE)
names(class_survey) <- c("time", "major", "job_title", "stat_relevant", "math_proficiency",
                 "stat_excited", "comp_proficiency", "skills", "takeaway", "topics",
                 "learning_value", "suggestion", "hobby", "hours_study", "hours_play",
                 "hours_work", "age", "gender", "distance")
names(class_survey)
##  [1] "time"             "major"            "job_title"       
##  [4] "stat_relevant"    "math_proficiency" "stat_excited"    
##  [7] "comp_proficiency" "skills"           "takeaway"        
## [10] "topics"           "learning_value"   "suggestion"      
## [13] "hobby"            "hours_study"      "hours_play"      
## [16] "hours_work"       "age"              "gender"          
## [19] "distance"

Now we have to do some data cleaning and processing. Some of the travel time responses have the format “x hours and y minutes” and some have the format “x:y,” so we must first put them all in the same format, then create a new variable converting them all to total minutes.
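
To make the goal concrete, here is a minimal base R sketch of the conversion on a few made-up responses. It is not the method used for the survey data (that relies on lubridate); the responses and the helper names extract_number and to_minutes are hypothetical.

```r
# Hypothetical responses, not taken from the class survey
responses <- c("2 hours and 30 minutes", "1:15", "3 hours", "0:45")

# Pull out the number that comes right before a unit word, or 0 if there is none
extract_number <- function(x, unit) {
  pattern <- paste0("[0-9]+(?=\\s*", unit, ")")
  m <- regmatches(x, regexpr(pattern, x, perl = TRUE))
  if (length(m) == 0) 0 else as.numeric(m)
}

# Convert one response to total minutes
to_minutes <- function(x) {
  if (grepl(":", x, fixed = TRUE)) {
    # "x:y" format: hours before the colon, minutes after it
    parts <- as.numeric(strsplit(x, ":", fixed = TRUE)[[1]])
    return(60 * parts[1] + parts[2])
  }
  # "x hours and y minutes" format
  60 * extract_number(x, "hour") + extract_number(x, "minute")
}

total_minutes <- vapply(responses, to_minutes, numeric(1), USE.NAMES = FALSE)
total_minutes
```

Real survey answers are messier than these four strings, so a sketch like this would need more cases before it could replace careful cleaning.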

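df <- class_survey
# Keep only responses that contain at least one digit
df <- df[grep("[[:digit:]]", df$distance), ]
# A response that is only digits, "x", becomes "x:00"
df$minutes <- gsub("^([0-9]*)$", "\\1:00", df$distance)
# Replace the literal text "hour" with ":00"
df$minutes <- gsub("hour", ":00", df$minutes, fixed = TRUE)
# Remove spaces, then any remaining letters
df$minutes <- gsub(" ", "", df$minutes)
df$minutes <- gsub("[[:alpha:]]", "", df$minutes)
# Parse the cleaned strings as hours and minutes (lubridate)
df$minutes <- hm(df$minutes)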
## Warning in .parse_hms(..., order = "HM", quiet = quiet): Some strings
## failed to parse, or all strings are NAs
# Convert the parsed hours and minutes to total minutes
df$minutes <- 60*hour(df$minutes) + minute(df$minutes)
summary(df$minutes)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##    25.00    70.75   280.00  1607.02   930.00 58200.00        1

What can we tell from the summary? Consider the relationship between the median and the mean.
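
One way to read it: a mean far above the median (here about 1607 versus 280 minutes) often points to a few very large values. The sketch below uses made-up numbers, not the survey data, to show how a single extreme value moves the mean but leaves the median alone.

```r
# Hypothetical travel times in minutes, with one extreme value
travel <- c(30, 60, 90, 120, 6000)

mean(travel)    # pulled upward by the single large value
median(travel)  # the middle value, unaffected by how large the maximum is

# Replacing the extreme value with a typical one changes the mean but not the median
travel_typical <- replace(travel, 5, 150)
mean(travel_typical)
median(travel_typical)
```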

df <- df[!is.na(df$age),]
abs(mean(df$age) - median(df$age)) <= sd(df$age)
## [1] TRUE

To avoid typing the df$ part so many times, the with() function is useful:

with(df, abs(mean(age) - median(age)) <= sd(age))
## [1] TRUE

with(df, …) means to run the code … using the variables in the data.frame called “df”.
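
As a quick check that the two ways of writing it agree, here is a small sketch on a made-up data frame called toy (a stand-in for df, not the survey data).

```r
# Hypothetical data frame standing in for df
toy <- data.frame(age = c(18, 19, 19, 20, 32))

# The same comparison written with and without with()
long_form  <- abs(mean(toy$age) - median(toy$age)) <= sd(toy$age)
short_form <- with(toy, abs(mean(age) - median(age)) <= sd(age))

# TRUE when both forms give the same answer
identical(long_form, short_form)
```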


Summaries give us only a limited amount of information: they reduce a whole set of data to one or a few numbers. Pictures can tell us a lot more. The plot below is called a histogram. The horizontal axis has the different possible values of the variable we’re studying, and the height of the vertical bar at each value shows how many times that value occurred in the data.

p <- ggplot(df, aes(age))

p <- p + stat_count()

p + theme_minimal()

table(df$age)
## 18 19 20 21 32 
## 28 19  7  1  1

Now let’s look at travel times.

ggplot(df, aes(minutes)) + stat_count(width = 10) + theme_minimal()
## Warning: Removed 1 rows containing non-finite values (stat_count).
## Warning: position_stack requires non-overlapping x intervals

Unlike age, where there are a small number of values and most of them occur many times, there are now many different values and most of them occur only once or a small number of times. Computing a table of occurrences is not a very helpful summary. In this case, we create blocks or bins each containing a range of values and aggregate the number of occurrences within each bin. This is what is generally referred to as a histogram.
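
Binning can also be done by hand, which may help show what the plot is counting. This sketch uses base R's cut() and table() on made-up travel times; the values and the 250-minute bin width are assumptions chosen for illustration.

```r
# Hypothetical travel times in minutes (not the survey data)
travel <- c(25, 40, 45, 70, 90, 120, 280, 300, 930)

# Assign each value to a bin: [0,250), [250,500), [500,750), [750,1000)
bins <- cut(travel, breaks = seq(0, 1000, by = 250), right = FALSE)

# Count how many values fall in each bin; empty bins are kept with a count of 0
bin_counts <- table(bins)
bin_counts
```

In ggplot2, geom_histogram(binwidth = 250) asks for bins of the same width, though where its bins start and end by default may differ from the breaks chosen here.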

ggplot(df, aes(minutes)) + geom_histogram() + theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).

Oh no, something awful has happened. The data seems to have an outlier, a travel time over 900 hours. Let’s take it out (assuming nobody has to travel more than 100 hours) and see if the plot looks more sensible.

# Drop missing travel times and the outlier (6000 minutes = 100 hours)
df <- df[!is.na(df$minutes) & df$minutes < 6000, ]
p <- ggplot(df, aes(minutes))
p + geom_histogram() + theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
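
A note on why missing values need attention when filtering: comparing NA with a number gives NA, and selecting rows with an NA produces a row made entirely of NAs. The sketch below shows this on a made-up data frame (toy, naive, and clean are hypothetical names).

```r
# Hypothetical data with one missing value and one extreme value
toy <- data.frame(id = 1:4, minutes = c(45, NA, 58200, 120))

# The comparison alone gives TRUE, NA, FALSE, TRUE
toy$minutes < 6000

# Filtering on it keeps an extra row filled with NAs
naive <- toy[toy$minutes < 6000, ]
nrow(naive)

# Excluding NAs in the same condition keeps only real rows under the cutoff
clean <- toy[!is.na(toy$minutes) & toy$minutes < 6000, ]
clean
```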