- Introduced the bootstrap
- Looked at a couple of example applications:
  - confidence intervals
  - hypothesis testing

- Two sources of errors:
  - sampling variability
  - bootstrap resampling variability

- Complete enumerations for exhaustive bootstrap

Stanford University, Spring 2016, STATS 205

- Example comparing the bootstrap using
  - Monte Carlo simulations (\(B\) is large)
  - exhaustive bootstrap (\(B\) is infinity)

- Smarter way to enumerate bootstrap samples using Gray codes
- Exploring the tails of a bootstrap distribution

```r
library(bootstrap)
data(law)
t(law)
```

```
##         1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
## LSAT  576 635 558 578 666 580 555 661 651 605 653 575 545 572 594
## GPA   339 330 281 303 344 307 300 343 336 313 312 274 276 288 296
```

- Sample correlation coefficient:

```
## [1] 0.7763745
```
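For a self-contained cross-check of this value outside R, here is a Python sketch that recomputes the sample correlation directly from the LSAT/GPA values printed above:

```python
from math import sqrt

# The 15 (LSAT, GPA) pairs from the law dataset printed above
lsat = [576, 635, 558, 578, 666, 580, 555, 661, 651, 605, 653, 575, 545, 572, 594]
gpa = [339, 330, 281, 303, 344, 307, 300, 343, 336, 313, 312, 274, 276, 288, 296]

def pearson(x, y):
    # Sample correlation coefficient r = S_xy / sqrt(S_xx * S_yy)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

print(round(pearson(lsat, gpa), 7))  # 0.7763745, matching the R output
```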

- How accurate is this estimate? Let's look at the bootstrap samples:

```r
draw.bootstrap.sample = function() {
  n = dim(law)[1]
  ind = sample(n, replace = TRUE)
  return(cor(law[ind, ]$LSAT, law[ind, ]$GPA))
}

B = 40000
thetastar = replicate(B, draw.bootstrap.sample())
```

Evaluate the correlation coefficient using a Monte Carlo simulation:
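As a language-neutral cross-check of the Monte Carlo step, here is a self-contained Python sketch of the same resampling scheme (with a smaller \(B\) than the 40000 used in the R chunk, purely for speed):

```python
import random
from math import sqrt

random.seed(205)

lsat = [576, 635, 558, 578, 666, 580, 555, 661, 651, 605, 653, 575, 545, 572, 594]
gpa = [339, 330, 281, 303, 344, 307, 300, 343, 336, 313, 312, 274, 276, 288, 296]

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def draw_bootstrap_sample():
    # Resample the 15 cases with replacement and recompute the correlation
    n = len(lsat)
    ind = [random.randrange(n) for _ in range(n)]
    return pearson([lsat[i] for i in ind], [gpa[i] for i in ind])

B = 4000
thetastar = [draw_bootstrap_sample() for _ in range(B)]
boot_mean = sum(thetastar) / B
boot_sd = sqrt(sum((t - boot_mean) ** 2 for t in thetastar) / (B - 1))
print(boot_mean, boot_sd)  # bootstrap mean near 0.77, sd roughly 0.13
```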

Create matrix of all \(\binom{2n-1}{n-1}\) enumerations:

```r
library(partitions)
n = 15
allCompositions = compositions(n, n)
```

```r
allCompositions[, 1:10]
```

```
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]   15   14   13   12   11   10    9    8    7     6
##  [2,]    0    1    2    3    4    5    6    7    8     9
##  [3,]    0    0    0    0    0    0    0    0    0     0
##  [4,]    0    0    0    0    0    0    0    0    0     0
##  [5,]    0    0    0    0    0    0    0    0    0     0
##  [6,]    0    0    0    0    0    0    0    0    0     0
##  [7,]    0    0    0    0    0    0    0    0    0     0
##  [8,]    0    0    0    0    0    0    0    0    0     0
##  [9,]    0    0    0    0    0    0    0    0    0     0
## [10,]    0    0    0    0    0    0    0    0    0     0
## [11,]    0    0    0    0    0    0    0    0    0     0
## [12,]    0    0    0    0    0    0    0    0    0     0
## [13,]    0    0    0    0    0    0    0    0    0     0
## [14,]    0    0    0    0    0    0    0    0    0     0
## [15,]    0    0    0    0    0    0    0    0    0     0
```

Check number of compositions:

```r
library(assertthat)
nCompositions = dim(allCompositions)[2]
nCompositions
```

```
## [1] 77558760
```

```r
# Check if we get the same value as in theory
nCompositionsTheory = choose(2 * n - 1, n - 1)
# see_if returns FALSE instead of erroring, so both branches are reachable
if (!see_if(nCompositions == nCompositionsTheory)) {
  print("there is a problem with the number of compositions")
} else {
  print("the number of compositions is correct")
}
```

```
## [1] "the number of compositions is correct"
```
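The enumeration itself can be cross-checked with a stars-and-bars construction; a Python sketch for a smaller \(n\) (here \(n = 5\), an illustrative choice to keep it fast), verifying both the count \(\binom{2n-1}{n-1}\) and that the multinomial weights sum to one:

```python
from itertools import combinations
from math import comb, factorial

n = 5  # small n for illustration; n = 15 gives the 77,558,760 columns above

def compositions(total, parts):
    # Stars and bars: choose parts-1 bar positions among total+parts-1 slots;
    # the gaps between bars are the composition entries
    for bars in combinations(range(total + parts - 1), parts - 1):
        prev, comp = -1, []
        for b in bars:
            comp.append(b - prev - 1)
            prev = b
        comp.append(total + parts - 2 - prev)
        yield comp

count = 0
weight_sum = 0.0
for c in compositions(n, n):
    count += 1
    # Multinomial probability of this resampling pattern under uniform draws:
    # n! / (prod c_i!) * (1/n)^n
    w = factorial(n) / (n ** n)
    for ci in c:
        w /= factorial(ci)
    weight_sum += w

print(count, comb(2 * n - 1, n - 1))  # both 126
print(weight_sum)                     # approximately 1
```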

Evaluate the correlation coefficient for all \(\binom{2n-1}{n-1}\) bootstrap samples.

```r
library(parallel)
if (!file.exists("enumData.Rdata")) {
  ptm = proc.time()
  enumData = mclapply(1:nCompositions, function(i) {
    ind = allCompositions[, i]
    law.list = lapply(1:n, function(j)
      matrix(rep(law[j, ], ind[j]), ncol = 2, byrow = TRUE))
    newLaw = do.call(rbind, law.list)
    c(cor(unlist(newLaw[, 1]), unlist(newLaw[, 2])),
      dmultinom(ind, prob = rep(1, n)))
  }, mc.cores = 4)
  proc.time() - ptm
  enumData = t(simplify2array(enumData))
  colnames(enumData) = c("cor", "weight")
  save(enumData, file = "enumData.Rdata")
} else {
  load("enumData.Rdata")
}
```

Same plot as before but now with all bootstrap samples:

Use the library `parallel` to speed up computations:

```r
library(parallel)
enumData = mclapply(1:nCompositions, function(i) {
  ...
}, mc.cores = 4)
```

- On a laptop with 4 cores this takes about 5 hours to compute
- We can speed up the enumeration by changing only one coordinate at a time using Gray codes
- For example, if our estimate is the sample mean for \(n = 3\)
- \(s(\boldsymbol{x}^{*1}) = ( x_1 + x_2 + x_3)/3\)
- \(s(\boldsymbol{x}^{*2}) = ( x_1 + x_2 + x_2)/3\)

- To calculate \(s(\boldsymbol{x}^{*2})\) from \(s(\boldsymbol{x}^{*1})\), compute
  - \(s(\boldsymbol{x}^{*2}) = s(\boldsymbol{x}^{*1}) - x_3/3 + x_2/3\)

- Gray codes are ordered lists of binary \(n\)-tuples
- They are ordered so that successive values differ in only a single position
- For instance, for \(n = 3\), the list is:
`000, 001, 011, 010, 110, 111, 101, 100`

- Notice that a computer scientist might intuitively want to write this:
`000, 001, 010, 011, 100, 101, 110, 111`

- However, successive values in this sequence can differ in many positions
- Rather than trying to reorder this list, we can define a recursive algorithm that generates a valid list directly

- Start with a list \(L_1\) with two entries
`0,1`

- From \(L_n\), get two lists:
  - put a zero before each entry in \(L_n\)
  - put a one before each entry in \(L_n\)

- To get \(L_{n+1}\), concatenate the first list with the second in reversed order
- So to move from \(L_1\) to \(L_2\):
`00, 01`

`10, 11`

- and concatenate
`00, 01, 11, 10`

- repeat…
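The recursion above is short enough to state directly; a Python sketch for illustration:

```python
def gray_list(n):
    # L_1 = ["0", "1"]; L_{n+1} = "0" + L_n followed by "1" + reversed(L_n)
    if n == 1:
        return ["0", "1"]
    prev = gray_list(n - 1)
    return ["0" + c for c in prev] + ["1" + c for c in reversed(prev)]

print(gray_list(3))
# ['000', '001', '011', '010', '110', '111', '101', '100']
```

The \(n = 3\) output matches the list above, and successive entries differ in exactly one position.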

- We can also just add one successor at a time
- As we saw before, a computer scientist would encode an integer as \(m = \sum \epsilon_i 2^i\) using the binary sequence \[ \dots \, \epsilon_3 \, \epsilon_2 \, \epsilon_1 \, \epsilon_0 \]
- This can be mapped to Gray codes \[ \dots \, e_3 \, e_2 \, e_1 \, e_0 \] using \(e_i = \epsilon_i + \epsilon_{i+1} \pmod 2\) for \(i = 0,1,2,\dots\)

- When \(n = 4\), the integer 7 in binary code is
`0111`

and using the mapping

\(e_0 = 1 + 1 = 0\)

\(e_1 = 1 + 1 = 0\)

\(e_2 = 1 + 0 = 1\)

\(e_3 = 0 + 0 = 0\), and we get the corresponding Gray code

`0100`

- Next, integer 8 in binary code is
`1000`

and using the mapping

\(e_0 = 0 + 0 = 0\)

\(e_1 = 0 + 0 = 0\)

\(e_2 = 0 + 1 = 1\)

\(e_3 = 1 + 0 = 1\), and we get the corresponding Gray code

`1100`
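Bitwise, the mapping \(e_i = \epsilon_i + \epsilon_{i+1} \pmod 2\) amounts to XOR-ing the integer with itself shifted right by one bit; a Python sketch reproducing both worked examples:

```python
def binary_to_gray(m, n):
    # e_i = eps_i XOR eps_{i+1}  <=>  g = m XOR (m >> 1)
    return format(m ^ (m >> 1), f"0{n}b")

print(binary_to_gray(7, 4))  # 0100, as in the first example
print(binary_to_gray(8, 4))  # 1100, as in the second
```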

- Using Markov chain Monte Carlo (MCMC), we can inject a small amount of randomness (something between Monte Carlo and complete enumeration)
- We will use MCMC to derive large-deviation estimates such as \(P( T \ge t)\)
- To see how it works, let's have a look at the Metropolis algorithm
- Consider a finite sample space
- Think of it as a state space, where each outcome corresponds to a state
- Metropolis can sample from an unnormalized probability \(\pi(x)\) on a finite state space \(\mathcal{X}\)
- Define a Markov transition matrix \(J(x,y)\) with nonzero probabilities of moving from \(x\) to \(y\) and from \(y\) to \(x\)
- Metropolis changes \(J(x,y)\) into a new matrix \(K(x,y)\) whose stationary distribution is \(\pi(x)\), even when \(\pi\) is known only up to normalization

- Recipe:
- Pick initial point in sample space \(x_0\)
- Pick potential next move from \(J(x,y)\) with \(J(x,y) > 0\) and \(J(y,x) > 0\)
- Evaluate \[A(x,y) = \frac{\pi(y)J(y,x)}{\pi(x)J(x,y)}\]
- If \(A(x,y) \ge 1\) move to \(y\)
- If \(A(x,y) < 1\), flip a coin with success probability \(A(x,y)\)
- and move to \(y\) if success
- otherwise stay at \(x\)
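A minimal Python sketch of this recipe on a toy four-state space (the unnormalized target \(\pi\) and the uniform proposal \(J\) here are illustrative choices, not from the lecture):

```python
import random

random.seed(0)

pi = [1.0, 2.0, 3.0, 4.0]  # unnormalized target on X = {0, 1, 2, 3}
k = len(pi)

def metropolis_step(x):
    y = random.randrange(k)  # propose from J(x, y) = 1/k, which is symmetric
    a = pi[y] / pi[x]        # A(x, y) = pi(y)J(y,x) / (pi(x)J(x,y))
    if a >= 1 or random.random() < a:
        return y             # move to y
    return x                 # otherwise stay at x

x, counts = 0, [0] * k
for _ in range(100_000):
    x = metropolis_step(x)
    counts[x] += 1

freqs = [c / 100_000 for c in counts]
print(freqs)  # close to the normalized target [0.1, 0.2, 0.3, 0.4]
```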

- We can write this in matrix form \[ K(x,y) = \begin{cases} J(x, y) & \text{if } x \ne y, \, A(x, y) \ge 1 \\ J(x, y)A(x, y) & \text{if } x \ne y, \, A(x, y) < 1 \\ J(x, y) + \sum_{z:A(x,z)<1} J(x, z)(1 - A(x, z)) & \text{if } x = y \end{cases} \]
- Then we can use the Fundamental Theorem of Markov Chains to prove that \[ K^n(x, y) \to \pi(y) \text{ for each } x,y \in \mathcal{X} \]
- or in other words, the \(n\)-fold product \(K^n = K K \cdots K\) converges to a matrix whose rows are all equal to \(\pi\); equivalently, \(\pi\) is a left eigenvector of \(K\) with eigenvalue \(\lambda = 1\): \[\pi K = \lambda \pi = \pi\]
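The matrix form can be checked numerically; a pure-Python sketch that builds \(K\) for a toy target and verifies that rows sum to one and that \(\pi K = \pi\) (the four-state \(\pi\) and uniform \(J\) are illustrative assumptions):

```python
pi = [1.0, 2.0, 3.0, 4.0]              # unnormalized target
k = len(pi)
J = [[1.0 / k] * k for _ in range(k)]  # uniform (symmetric) proposal

def A(x, y):
    return (pi[y] * J[y][x]) / (pi[x] * J[x][y])

# Build K(x, y) exactly as in the case distinction above
K = [[0.0] * k for _ in range(k)]
for x in range(k):
    for y in range(k):
        if x != y:
            K[x][y] = J[x][y] * min(1.0, A(x, y))
    # Diagonal: J(x, x) plus the rejected mass from moves with A < 1
    K[x][x] = J[x][x] + sum(
        J[x][z] * (1.0 - A(x, z)) for z in range(k) if z != x and A(x, z) < 1
    )

row_sums = [sum(row) for row in K]
total = sum(pi)
p = [v / total for v in pi]            # normalized pi
pK = [sum(p[x] * K[x][y] for x in range(k)) for y in range(k)]
print([round(s, 10) for s in row_sums])              # each 1.0
print([round(v, 10) for v in pK] == [round(v, 10) for v in p])  # True
```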

- To sample from \(\pi\), apply the kernel \(K\) to the current sample position \(x_t\): propose a move from \(J\), then accept or reject it as above
- This will give us the next sample \(x_{t+1}\)
- A nice introduction to MCMC is here

We construct a Markov chain to sample from the bootstrap distribution conditioned on \(\{ T \ge t \}\):

- Pick \(I\) uniformly with \(1 \le I \le n\), and
- replace \(x^*_I\) with a new value drawn from the original data \(\{ x_1, x_2, \dots, x_n \}\)
- If the new sample vector \(\tilde{\boldsymbol{x}}^*\) satisfies \(T(\tilde{\boldsymbol{x}}^*) \ge t\), the change is made
- otherwise the chain stays at the previous sample vector

Then to estimate \(P( T \ge t)\):

- Choose a grid \(t_0 < t_1 < \dots < t_l < t\), with \(t_0\) in the middle of the distribution of \(T\) and the \(t_i\) chosen so that \(P(T \ge t_{i+1} \mid T \ge t_i)\) is not too small
- Estimate \(P( T \ge t_0 )\) by ordinary Monte Carlo
- Estimate \(P( T \ge t_1 \mid T \ge t_0)\) by running the Markov chain on \(\{ \tilde{\boldsymbol{x}}^*: T(\tilde{\boldsymbol{x}}^*) \ge t_0 \}\) and counting what proportion of values satisfy the constraint \(T \ge t_1\)
- Since the events \(\{T \ge t\} \subset \{T \ge t_l\} \subset \dots \subset \{T \ge t_0\}\) are nested, multiplying these estimates gives the tail probability: \[ \widehat{P}(T \ge t) = \widehat{P}(T \ge t_0) \, \widehat{P}(T \ge t_1 \mid T \ge t_0) \cdots \widehat{P}(T \ge t \mid T \ge t_l) \]
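The conditioned chain is easy to sketch for a toy statistic; here in Python with the sample mean as \(T\), made-up data, and illustrative thresholds (none of these are from the lecture):

```python
import random

random.seed(1)
data = list(range(1, 16))   # stand-in sample x_1, ..., x_15

def T(x):
    return sum(x) / len(x)  # statistic: the sample mean

def chain_step(x, t):
    # Pick a coordinate I uniformly, replace x*_I with a value drawn
    # from the original data; keep the change only if T stays >= t
    y = list(x)
    y[random.randrange(len(y))] = random.choice(data)
    return y if T(y) >= t else x

def conditional_prob(t_lo, t_hi, steps=50_000):
    # Run the chain on {x*: T(x*) >= t_lo}; report the fraction with T >= t_hi
    x = list(data)          # T(data) = 8, a valid start whenever t_lo <= 8
    hits = 0
    for _ in range(steps):
        x = chain_step(x, t_lo)
        hits += T(x) >= t_hi
    return hits / steps

# One factor of the product estimate: P(T >= t_1 | T >= t_0)
print(conditional_prob(8.0, 8.5))
```

Multiplying such factors over the grid, together with a plain Monte Carlo estimate of \(P(T \ge t_0)\), gives the tail estimate above.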

For background on statistical functionals and asymptotics of the bootstrap:

- Lehmann 1999, *Elements of Large-Sample Theory*, Chapter 6
- Van der Vaart 1998, *Asymptotic Statistics*, Chapter 23