The Bootstrap (Part 2)

Stanford University, Spring 2016, STATS 205

Previous Lecture

Introduced bootstrap
Looked at couple of example applications:
- confidence intervals
- hypotheses testing
Two sources of errors:
- sampling variability
- bootstrap resampling variability
Complete enumerations for exhaustive bootstrap

Today

Example comparison between boostrap using
- Monte Carlo simulations (\(B\) is large)
- exhaustive bootstrap (\(B\) is infinity)
Smarter way to enumerate boostrap samples using Gray codes
Exploring the tails of a bootstrap distribution

Law Schools Example

library(bootstrap); data(law); t(law)

##        1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
## LSAT 576 635 558 578 666 580 555 661 651 605 653 575 545 572 594
## GPA  339 330 281 303 344 307 300 343 336 313 312 274 276 288 296

Law Schools Example

Sample correlation coefficient:

## [1] 0.7763745

How accurate is this estimate. Let's look at the bootstrap samples:

draw.bootstrap.sample = function() {
  n = dim(law)[1]
  ind = sample(n,replace = TRUE)
  return(cor(law[ind,]$LSAT,law[ind,]$GPA))
}
B = 40000
thetastar = replicate(B,draw.bootstrap.sample())

Law Schools Example

Evaluate the correlation coefficient using a Monte Carlo simulation:

Law Schools Example

Create matrix of all \(\binom{2n-1}{n-1}\) enumerations:

library(partitions)
n = 15
allCompositions = compositions(n,n)

allCompositions[,1:10]

##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]   15   14   13   12   11   10    9    8    7     6
##  [2,]    0    1    2    3    4    5    6    7    8     9
##  [3,]    0    0    0    0    0    0    0    0    0     0
##  [4,]    0    0    0    0    0    0    0    0    0     0
##  [5,]    0    0    0    0    0    0    0    0    0     0
##  [6,]    0    0    0    0    0    0    0    0    0     0
##  [7,]    0    0    0    0    0    0    0    0    0     0
##  [8,]    0    0    0    0    0    0    0    0    0     0
##  [9,]    0    0    0    0    0    0    0    0    0     0
## [10,]    0    0    0    0    0    0    0    0    0     0
## [11,]    0    0    0    0    0    0    0    0    0     0
## [12,]    0    0    0    0    0    0    0    0    0     0
## [13,]    0    0    0    0    0    0    0    0    0     0
## [14,]    0    0    0    0    0    0    0    0    0     0
## [15,]    0    0    0    0    0    0    0    0    0     0

Law Schools Example

Check number of compositions:

library(assertthat)
nCompositions = dim(allCompositions)[2]
nCompositions

## [1] 77558760

# Check if we get the same value as in theory
nCompositionsTheory = choose(2*n-1,n-1) 
if(!assert_that(nCompositions == nCompositionsTheory)) {
  print("there is a problem with the number of compositions")
} else {
  print("the number of compositions is correct")
}

## [1] "the number of compositions is correct"

Law Schools Example

Evaluate the correlation coefficient for all \(\binom{2n-1}{n-1}\) bootstrap samples.

library(parallel)
if(!file.exists("enumData.Rdata")) {
  ptm = proc.time()
  enumData = mclapply(1:nCompositions, function(i) {
    ind = allCompositions[,i]
    law.list = lapply(1:n,function(j) matrix(rep(law[j,],ind[j]),ncol = 2 ,byrow = TRUE))
    newLaw = do.call(rbind, law.list)
    c(cor(unlist(newLaw[,1]),unlist(newLaw[,2])),dmultinom(ind,prob = rep(1,n)))
    }, mc.cores = 4)
  proc.time() - ptm
  enumData = t(simplify2array(enumData))
  colnames(enumData) = c("cor","weight")
  save(enumData,file = "enumData.Rdata")  
} else {
  load("enumData.Rdata")
}

Law Schools Example

Same plot as before but now with all bootstrap samples:

Law Schools Example

Use the library parallel to speedup computations

library(parallel)
enumData = mclapply(1:nCompositions, function(i) { ... }, mc.cores = 4 )

On a laptop with 4 cores this takes about 5 hours to compute
We can speedup enumeration by changing only one coordinates at the time using Gray codes
For example, if our estimate is the sample mean for \(n = 3\)
- \(s(\boldsymbol{x}^{*1}) = ( x_1 + x_2 + x_3)/3\)
- \(s(\boldsymbol{x}^{*2}) = ( x_1 + x_2 + x_2)/3\)
To calculate \(s(\boldsymbol{x}^{*2})\) from \(s(\boldsymbol{x}^{*1})\)
- \(s(\boldsymbol{x}^{*1}) - x_3/3 + x_2/3\)

Speedup for Enumerations using Gray Codes

Gray codes are ordered lists of binary \(n\)-tuples
They are ordered so that success values only differ in a single space
For instance, for \(n = 3\), the list is:
000, 001, 011, 010, 110, 111, 101, 100
Notice, a computer scientist might intuitively want to write this:
000, 001, 010, 011, 100, 101, 110, 111
However, this sequence differes in many spaces
Better than trying to reorder the wrong elements, we can define recursive algorithm to generate a valid list

Speedup for Enumerations using Gray Codes

Start with a list with two entries 0,1
Get two list:
- put a zero before each entry in \(L_n\)
- put a one before each entry in \(L_n\)
To get \(L_{n+1}\) concatenate the two list by first followed by second in reversed order
So to move from \(L_1\) to \(L_2\)
00, 01
10, 11
and concatenate
00, 01, 11, 10
repeat…

Speedup for Enumerations using Gray Codes

We can also just add one successor at the time
As we saw before, a computer scientist would encode an integer as \(m = \sum \epsilon_i 2^i\) using the binary sequence \[ \dots \, \epsilon_3 \, \epsilon_2 \, \epsilon_1 \, \epsilon_0 \]
This can be mapped to gray codes \[ \dots \, e_3 \, e_2 \, e_1 \, e_0 \] using \(e_i = \epsilon_i + \epsilon_{i+1} \, (\bmod 2) \text{ for } (i = 0,1,2,\dots)\)

Speedup for Enumerations using Gray Codes

When \(n = 4\), the integer 7 in binary code is 0111 and using the mapping
\(e_0 = 1 + 1 = 0\)
\(e_1 = 1 + 1 = 0\)
\(e_2 = 1 + 0 = 1\)
\(e_3 = 0 + 0 = 0\)
we get the corresponding gray code 0100
Next, integer 8 in binary code is 1000 and using the mapping
\(e_0 = 0 + 0 = 0\)
\(e_1 = 0 + 0 = 0\)
\(e_2 = 0 + 1 = 1\)
\(e_3 = 1 + 0 = 1\)
we get the corresponding gray code 1100

Exploring the Tails of a Bootstrap Distribution

Using Markov chain Monte Carlo (MCMC) to inject a small amount of randomness (something between Monte Carlo and complete enumerations)
We will use MCMC to deriving large deviations estimate such as \(P( T \ge t)\)
To see how it works, we let's have a look at the Metropolis algorithm
Consider a the finite sample space
Think of it as a state space, where each outcome corresponds to a state
Metropolis can sample from an unormalized probability \(\pi(x)\) on finite state space \(\mathcal{X}\)
Define a Markov transition matrix \(J(x,y)\) that defines nonzero probabilities of moving from \(x\) to \(y\) and \(y\) to \(x\)
Metropolis changes \(J(x,y)\) to a new matrix \(K(x,y)\) that corresponds to a possibly unnormalized version of \(\pi(x)\)

Exploring the Tails of a Bootstrap Distribution

Recipe:
- Pick initial point in sample space \(x_0\)
- Pick potential next move from \(J(x,y)\) with \(J(x,y) > 0\) and \(J(y,x) > 0\)
- Evaluate \[A(x,y) = \frac{\pi(y)J(y,x)}{\pi(x)J(x,y)}\]
- If \(A(x,y) \ge 1\) move to \(y\)
- If \(A(x,y) < 1\) flip a coing with this success probability
  - and move to \(y\) if success
  - otherwise stay at \(x\)

Exploring the Tails of a Bootstrap Distribution

We can write this in matrix form \[ K(x,y) = \begin{cases} J(x, y) & \text{if } x \ne y, A(x, y) ≥ 1 \\ J(x, y)A(x, y) & \text{if } x \ne y, A(x, y) < 1 \\ J(x, y) + \sum_{z:A(x,z)<1} J(x, z)(1 − A(x, z)) & \text{if } x = y \end{cases} \]
Then we can use the Fundamental Theorem of Markov Chains to prove that \[ K^n(x, y) \to \pi(y) \text{ for each } x,y \in \mathcal{X} \]
or in other words, the matrix \(K^n = K^1 K^2 \cdots K^n\) converges to a matrix \[\pi K = \lambda \pi \leftrightarrow \pi K = \pi\] with one left eigenvector \(\pi\) and one eigenvalue \(\lambda = 1\)

Exploring the Tails of a Bootstrap Distribution

To sample from \(\pi\), apply \(J\) to the left of the current sample position \(x_t\)
This will give us the next sample \(x_{t+1}\)
A nice introduction to MCMC is here

Exploring the Tails of a Bootstrap Distribution

We construct a Markov chain to sample from \(P( T \ge t)\):

Pick \(I\) between \(1 \le I \le n\) uniformly, and
Replace \(x^*_I\) with new value from origin data \(\{ x_1, x_2, \dots, x_n \}\)
If new sample vector \(\tilde{\boldsymbol{x}}^*\) satisfies \(T(\tilde{\boldsymbol{x}}^*) \ge t\) then change is made
otherwise the chain stays at previous sample vector

Exploring the Tails of a Bootstrap Distribution

Then to estimate \(P( T \ge t)\):

Choose a grid \(t_0 < t_1 < \dots < t_l < t\) with \(t_0\) chosen in the middle of distribution of \(T\) and \(t_i\) chosen so that \(P(T \ge t_{i+1} | T \ge t_i)\) is not too small
Estimate \(P( T \ge t_0 )\) by ordinary Monte Carlo
Estimate \(P( T \ge t_1 | T \ge t_0)\) by running the Markov chain on \(\{ \tilde{\boldsymbol{x}}^*: T(\tilde{\boldsymbol{x}}^*) \ge t_0) \}\) and count what proportion of values satisfy the constrain \(T \ge t_1\)
Multiplying these estimates gives the conditional probability: \[ \widehat{P}(T \ge t) = \widehat{P}(T \ge t_0) \widehat{P}(T \ge t_1 | T \ge t_0) \cdots \widehat{P}(T \ge t | T \ge t_l) \] and joint probability \(\widehat{P}(T \ge t) = \widehat{P}(T \ge t_0,T \ge t_1, \dots, T \ge t_l, T \ge t)\)

Theoretical Underpinnings of the Bootstrap

For background on statistical functionals and asymptotics of the boostrap:

Lehmann 1999
Elements of Large-Sample Theory
Chapter 6
Stanford library link
Van der Vaart 1998
Asymptotic Statistics
Chapter 23
Stanford library link