Background

This lab is written in R Markdown, a format that promotes reproducibility and facilitates collaborative sharing of statistical analyses. You can run it either on RStudio Cloud or locally on your laptop by installing RStudio.

During this lab, we will get familiar with the following concepts:

  1. R and R markdown
  2. Plotting in R using package ggplot2
  3. Computer simulations in R using function replicate
  4. Constructing bootstrap confidence intervals using function quantile
  5. Fitting regularized regression models in R using package glmnet

R and R markdown

Before we start using R, we need to set up all the necessary packages. We first install them if they are not already on our system; then we load them into our current session.

Here is a code chunk that you can reuse for your own analyses. Just make sure to replace the package names on the first line. The rest of the code stays the same no matter your analysis.

pkgs_needed = c("tidyverse", "magrittr", "glmnet", "bootstrap", 
                "lars", "broom", "cowplot", "ggthemes")
letsinstall = setdiff(pkgs_needed, installed.packages()) 
if (length(letsinstall) > 0) {
  install.packages(letsinstall)
}

Once they are installed, you can load them into your current R session. This makes it possible to use functions from these packages.

library("tidyverse")
library("magrittr")
library("glmnet")
library("bootstrap")
library("lars")
library("broom")
library("cowplot")
library("ggthemes")
theme_set(theme_few())

To compile this notebook into an HTML report, use the Knit icon above the RStudio code editor. At this point the compilation will fail. To make it work, you have to fix the TODOs that are spread throughout the document. Additionally, there are a total of 12 questions and one bonus question.

Plotting

Try to calculate the number \(\pi\) yourself using computer simulations. It is usually very helpful to plot every step of your simulation to make sure that your code is correct. I recommend the R package ggplot2 for plotting. It implements the grammar of graphics, which allows you to build plots in a modular way. Here is an example that plots points, using the data set mpg from package ggplot2.

mpg
## # A tibble: 234 x 11
##    manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4      1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4      1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4      2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4      2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4      2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4      2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4      3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 q…   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 q…   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 q…   2    2008     4 manu… 4        20    28 p     comp…
## # … with 224 more rows

Now plot highway miles per gallon (hwy) as a function of engine displacement in liters (displ). To query the dataset, you can use ?mpg in R.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

For more background, see the online book R for Data Science. Plotting is very important in data science: it helps you debug your code and your data, and sometimes you may even discover something unexpected, such as batch effects, which might be hard to detect in other ways.

Question 1: Try to add a smooth function through the data points in the previous plot. Hint: You don’t have to compute the smoothing curve yourself; there is a geom_* function for that. Try to find it in R for Data Science.

ggplot(data = mpg, aes(x = displ, y = hwy)) + 
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Computer Simulations

Try to approximate \(\pi\) from the lecture using the function runif, which draws uniform random numbers. The code below will help you get started. The table tb_pi (called a tibble in R) contains the x and y coordinates of 1000 uniformly sampled points on the unit square.

set.seed(25)
n = 1000
sample_pi_table = function() {
  tibble(
    x = runif(n),
    y = runif(n)
  )  
}
tb_pi = sample_pi_table()
tb_pi
## # A tibble: 1,000 x 2
##         x      y
##     <dbl>  <dbl>
##  1 0.416  0.896 
##  2 0.695  0.167 
##  3 0.149  0.940 
##  4 0.897  0.0373
##  5 0.124  0.227 
##  6 0.985  0.797 
##  7 0.626  0.595 
##  8 0.338  0.318 
##  9 0.0668 0.320 
## 10 0.282  0.860 
## # … with 990 more rows

Question 2: Why do we need to call set.seed? Hint:

  1. Comment out the set.seed function like this #set.seed(25)
  2. Execute the code chunk twice and compare the results
  3. Then uncomment set.seed again
  4. Execute the code chunk twice and compare the results
  5. What changes?

Answer: set.seed defines the starting point of the random number generator. By setting the seed, you will get the same sequence of random numbers every time. This is useful for reproducibility.

You can also learn what set.seed does by reading the R documentation like this ?set.seed (type in console).
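As a quick check of this behavior (a toy sketch, not part of the lab’s exercises), resetting the seed reproduces exactly the same draws:

```r
set.seed(25)
first_draw = runif(3)   # three uniform random numbers

set.seed(25)            # reset to the same seed ...
second_draw = runif(3)  # ... and the draws repeat exactly

identical(first_draw, second_draw)
## [1] TRUE
```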

Question 3: How accurate is your approximation? Hint:

  1. Use the function replicate to generate many tb_pi tables (the solution below uses 1000)
  2. Calculate \(\pi\) for each table
  3. Quantify approximation uncertainty using standard deviations sd over all \(\pi\)
  4. Plot histogram using geom_histogram()

Here is a template that you can use as a starting point:

sample_pi = function() {
  sample_pi_table() %>% 
    mutate(inside = ifelse(sqrt(x^2 + y^2) < 1, "yes", "no")) %>%
    summarize(mean(inside == "yes") * 4) %>%
    as.numeric()
}
sim_res = replicate(1000, sample_pi())
mean(sim_res)
## [1] 3.139256
sd(sim_res)
## [1] 0.05208131
ggplot(tibble(pi = sim_res), aes(x = pi)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Note that I used the pipe operator %>% here. It is very common in R programming and is often a more convenient way of writing nested function calls: instead of reading a sequence of commands inside-out, a piped expression reads left to right.
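For example (a made-up toy computation, not code from this lab; %>% comes from magrittr, loaded above), the nested and the piped version compute the same thing:

```r
library("magrittr")

# nested calls read inside-out:
sqrt(sum(c(1, 4, 9, 16)))
## [1] 5.477226

# the pipe reads left to right:
c(1, 4, 9, 16) %>% sum() %>% sqrt()
## [1] 5.477226
```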

Bootstrap Confidence Intervals

During the lecture, we looked at the bootstrap distribution of the heights of 18-year-old Dutch men. We compared the observed average height to the bootstrap distribution. Sometimes it is helpful to summarize the bootstrap distribution further; one way is to compute confidence intervals. A 95% confidence interval means that if you repeated the survey (and the interval construction) many times, about 95% of the resulting intervals would contain the true mean height.

Question 4: First draw a random sample of size \(n = 1000\) from a normal distribution with mean \(\mu = 182\) and standard deviation \(\sigma = 7\). In R, this can be done with rnorm.

set.seed(41)
n = 1000
mu = 182
sigma = 7
tb_survey = tibble(
  height = rnorm(n = n, mean = mu, sd = sigma) 
  )

Question 5: Now plot a histogram of the survey data using ggplot2.

ggplot(tb_survey, aes(height)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Now, let’s repeat the survey many times and compute the mean height every time to obtain the bootstrap distribution. The key functions that you will need for this are replicate and sample.

Consider the function sample first:

sample(x, size, replace = FALSE)
set.seed(34)
try1 = sample(1:10, size = 10, replace = FALSE)
try1
##  [1]  5  9  8  2  7 10  6  3  1  4
table(try1)
## try1
##  1  2  3  4  5  6  7  8  9 10 
##  1  1  1  1  1  1  1  1  1  1
try2 = sample(1:10, size = 10, replace = TRUE)
try2
##  [1] 4 5 8 1 2 9 9 1 5 5
table(try2)
## try2
## 1 2 4 5 8 9 
## 2 1 1 3 1 2

Question 6: What is the difference?

Answer: With replace = FALSE, every element is picked at most once. With replace = TRUE, some elements can be picked multiple times.

Question 7: Now that we understand sample a bit better, we can continue and use it many times using replicate. Hint: To obtain a valid bootstrap sample, you will need to sample with replacement.

draw_bootstrap_sample = function() {
  n = length(tb_survey$height)
  bootstrap_sample = sample(x = tb_survey$height, size = n, replace = TRUE)
  mean(bootstrap_sample)
}
B = 5000
mean_star = replicate(B, draw_bootstrap_sample())
tb_height = tibble(height = mean_star)

Question 8: Calculate the 95% confidence interval using function quantile.

low = quantile(mean_star, probs = 0.025)
high = quantile(mean_star, probs = 0.975)

Question 9: Add them to your plot using geom_vline.

ggplot(tb_height, aes(height)) + 
  geom_histogram(bins = 40) +
  geom_vline(xintercept = low, color = "red") +
  geom_vline(xintercept = high, color = "red")

Question 10: There is a nicer way to color the histogram: add an additional column (using mutate) to the height table indicating whether the value is within the 95% confidence interval. Then you can use this column to color the histogram via the fill argument in the aesthetics aes.

tb_height %<>% mutate(
  CI = ifelse(height > high | height < low, "tail", "bulk")
  )
ggplot(tb_height, aes(height, fill = CI)) + 
  geom_histogram(bins = 40)

Bonus Question: Try to calculate and plot confidence intervals for the Law School example from the lecture.

data(law)
theta_hat = cor(law$LSAT, law$GPA)
theta_hat
## [1] 0.7763745
draw_bootstrap_sample = function() {
  n = dim(law)[1]
  ind = sample(n, replace = TRUE)
  return(cor(law[ind,]$LSAT, law[ind,]$GPA))
}
B = 40000
theta_star = replicate(B, draw_bootstrap_sample())
quantile(theta_star, probs = c(0.025, 0.975))
##      2.5%     97.5% 
## 0.4610565 0.9623387

Fitting Models

We will now try to fit a ridge and a sparse regression model to the diabetes dataset from the lecture.

data("diabetes")
X = diabetes$x
y = diabetes$y

First, we try to fit the ridge regression model using the function glmnet from package glmnet.

ridge_fit = glmnet(x = X, y = y, alpha = 0)

The argument alpha = 0 specifies the ridge regression model. To use the lasso model, change it to alpha = 1 instead. It is also possible to fit a compromise between the two by choosing a number between 0 and 1.
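For instance, a sketch of such a compromise fit (the mixing value 0.5 is an arbitrary choice for illustration; X and y are as defined above):

```r
# alpha between 0 and 1 mixes the ridge and lasso penalties;
# in glmnet this combination is known as the elastic net
enet_fit = glmnet(x = X, y = y, alpha = 0.5)
```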

Now let’s try to find the optimal regularization parameter lambda by cross-validation. Luckily, cross-validation is already built into the glmnet package. The syntax is as follows:

set.seed(23)
ridge_fit_cv = cv.glmnet(x = X, y = y, alpha = 0)
tidied = tidy(ridge_fit) %>% filter(term != "(Intercept)")
p_ridge = ggplot(tidied, aes(log(lambda), estimate, group = term, color = term)) +
  geom_hline(yintercept = 0, size = 1) +
  geom_line(size = 1) +
  ylab("beta hat")
p_ridge

To explore the results, we can use two functions tidy and glance from the broom package.

tidied_cv = tidy(ridge_fit_cv)
glance_cv = glance(ridge_fit_cv)

Then we can plot results using ggplot.

p_ridge_cv = ggplot(tidied_cv, aes(log(lambda), estimate)) +
  geom_line() +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = .25) +
  geom_vline(xintercept = log(glance_cv$lambda.min)) +
  geom_vline(xintercept = log(glance_cv$lambda.1se), lty = 2) +
  ylab("Mean-Squared Error")
p_ridge = p_ridge + 
  geom_vline(xintercept = log(glance_cv$lambda.min)) +
  geom_vline(xintercept = log(glance_cv$lambda.1se), lty = 2)
plot_grid(p_ridge_cv, p_ridge)

The same procedure works for Lasso regression by changing alpha = 0 to alpha = 1.

Question 11: Try to fit the Lasso model to the diabetes dataset.

set.seed(23)
lasso_fit = glmnet(x = X, y = y, alpha = 1)
tidied = tidy(lasso_fit) %>% filter(term != "(Intercept)")
lasso_fit_cv = cv.glmnet(x = X, y = y, alpha = 1)
tidied_cv = tidy(lasso_fit_cv)
glance_cv = glance(lasso_fit_cv)
p_lasso = ggplot(tidied, aes(log(lambda), estimate, group = term, color = term)) +
  geom_hline(yintercept = 0, size = 1) +
  geom_line(size = 1) +
  ylab("beta hat") + 
  geom_vline(xintercept = log(glance_cv$lambda.min)) +
  geom_vline(xintercept = log(glance_cv$lambda.1se), lty = 2)

Question 12: Compare ridge and Lasso fit by plotting them side-by-side using plot_grid.

plot_grid(p_ridge, p_lasso)

Session Info for Reproducibility

At the end of an R Markdown notebook, you should always include the session information. It records the operating system and the version numbers of R and the R packages you used during the analysis. Your collaborators, or you yourself in a few months, will be thankful for this information, as it allows the analysis to be rerun without much additional effort.

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS  10.14.5
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ggthemes_4.1.1   cowplot_0.9.4    broom_0.5.2      lars_1.2        
##  [5] bootstrap_2019.6 glmnet_2.0-18    foreach_1.4.4    Matrix_1.2-17   
##  [9] magrittr_1.5     forcats_0.4.0    stringr_1.4.0    dplyr_0.8.1     
## [13] purrr_0.3.2      readr_1.3.1      tidyr_0.8.3      tibble_2.1.3    
## [17] ggplot2_3.1.1    tidyverse_1.2.1 
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.5 xfun_0.6         haven_2.1.0      lattice_0.20-38 
##  [5] vctrs_0.1.0      colorspace_1.4-1 generics_0.0.2   htmltools_0.3.6 
##  [9] yaml_2.2.0       utf8_1.1.4       rlang_0.3.4      pillar_1.4.1    
## [13] glue_1.3.1       withr_2.1.2      modelr_0.1.4     readxl_1.3.1    
## [17] plyr_1.8.4       munsell_0.5.0    gtable_0.3.0     cellranger_1.1.0
## [21] rvest_0.3.3      codetools_0.2-16 evaluate_0.13    labeling_0.3    
## [25] knitr_1.22       fansi_0.4.0      Rcpp_1.0.1       scales_1.0.0    
## [29] backports_1.1.4  jsonlite_1.6     hms_0.4.2        digest_0.6.19   
## [33] stringi_1.4.3    grid_3.5.1       cli_1.1.0        tools_3.5.1     
## [37] lazyeval_0.2.2   zeallot_0.1.0    crayon_1.3.4     pkgconfig_2.0.2 
## [41] xml2_1.2.0       lubridate_1.7.4  assertthat_0.2.1 rmarkdown_1.12  
## [45] httr_1.4.0       rstudioapi_0.10  iterators_1.0.10 R6_2.4.0        
## [49] nlme_3.1-139     compiler_3.5.1