Q1: 20
Q2: Complete enumeration implementation(30) + Analysis of results(10, including difference between Monte Carlo and Complete enumeration, which point(s) to remove to get similar results)+experimentally/theoreticallly comparision with grey code implemntation(10)
Q3: 30
We want to determine which is the most likely bootstrap sample and its associated probability in \(n = 3\), \(n = 5\), \(n = 10\), \(n = 12\), \(n = 15\), and \(20\).
We can model the probability of a bootstrap sample of size n as rolling a k-sided dice where k is the number of distinct values there are in the sample. We see that the probability of randomly drawing a specific observation is \(\frac{1}{n}\). This means that the more a certain value appears in our original sample, the more likely it is for us to pick that value in our bootstrap sample.
Using the dice comparison above we see that the probability of each bootstrap sample of size n from our original sample is given by a multinomial distribution. A multinomial distribution models the probability of counts for when there are n independent trials of k different outcomes. The PMF of a multinomial distribution gives the likelihood of any combination of “successes” for the k categories.
Let’s say that we have an original sample of \(\{ x_1 \cdots x_n \}\). We want to draw a bootstrap sample \(\{ s_1 \cdots s_n \}\).
The formula for the probability mass function of a multinomial distribution with n trials and k categories is \(\frac{n!}{x_1!\cdots x_k!} p_1^{x_1} \cdots p_k^{x_k}\). Again, n represents the number of trials or the size of the bootstrap sample. k represents the number of outcomes that the sample of 1 observation can take. In our case, we see that we have n different outcomes. Thus, we can say that there are n different values we can draw from. This transforms the multinomial PMF into \(\frac{n!}{s_1!\cdots s_n!} p_1^{s_1} \cdots p_k^{s_n} = \frac{n!}{s_1!\cdots s_n!} (\frac{1}{n})^{s_1 + \cdots + s_n}\). Note that the probability of drawing any particular sample is \(\frac{1}{n}\) and the actual number of each original obervation \(x_i\) drawn is \(s_i\).
If we look at the formula for the PMF which is \(\frac{n!}{s_1!\cdots s_n!} (\frac{1}{n})^{s_1 + \cdots + s_n}\), we see that we can maximize this probability by maximizing the numerator \(n!\) or by minimizing the denominator \(s_1!\cdots s_n!\). For the most part, the sample size n is fixed. Thus, we want to minimize the denominator. We see that the denominator \(s_1!\cdots s_n!\) is minimized by setting all \(s_i\) for \(i = \{1 \cdots n \}\) equal to 1. We have the constraint of \(\sum_{i = 1}^n s_i = 1\), so we cannot obtain a value of \(s_1!\cdots s_n!\) smaller than 1.
This means that \(\textbf{the most common bootstrap sample is actually the original sample itself}\).
Below, we calculate the probability of obtaining the original sample for the bootstrap sample using a multinomial distribution.
n <- c(3, 5, 10, 12, 12, 15, 20)
mult_probs <- array(0, length(n))
df <- cbind(n, mult_probs)
for (i in 1:length(n)) {
x <- df[i, 1]
df[i, 2] <- dmultinom(x = rep(1, x), size = x, prob = rep(1/x, x))
}
We also calculate the probabilities manually using the PMF to see if they match the ones obtained above.
n <- c(3, 5, 10, 12, 12, 15, 20)
mult_probs <- array(0, length(n))
df_manual <- cbind(n, mult_probs)
for (i in 1:length(n)) {
x <- df_manual[i, 1]
df_manual[i, 2] <- factorial(x)*(1/x)^x
}
It seems like the results obtained using both methods are the same since the probabilities are the same.
The probability associated with the most common bootstrap sample (the original sample) in \(n = 3\), \(n = 5\), \(n = 10\), \(n = 12\), \(n = 12\), \(n = 15\), and \(20\) is listed below. The probabilities range from \(0.22\) with \(n = 3\) to \(0.0000000232\) with \(n = 20\).
df
In this exercise, we want to note the differences behaviors of metrics between using an Monte Carlo simulation and a complete enumeration method. We will first load the data and our required packages.
library(ggplot2)
library(bootstrap)
library(dplyr)
library(parallel)
data(law)
dim(law)
We want to figure out which observation(s) do we need to remove from the law data in order to make the Monte Carlo simulation and complete enumerations simulation look more similar. Intuitively, we want to remove the outliers. In the law
data case, we see that Observations 1 and 11 don’t quite lie around the diagonal. We will pick one of these two points to rerun the complete enumerations simulation and then compare the new distribution to the one obtained using the bootstrap Monte Carlo simulation. This way, we can isolate the effect of the change in the distribution and attribute it to the removal of that particular outlier.
\(\textbf{We see that by removing the outler, the correlation of the data set increases overall. We hypothesize that this will pull the center of the bootstrap distribution up to the true sample correlation. This will remove the subtle bump in that distribution such that the Monte Carlo simulation will look more similar to the complete enumerations simulations.}\)
Note that if we remove both Observation 1 and Observation 11, the correlation of the data is higher than removing only one.
Below, we create a new data set such that we remove Observation 11. We check the correlation of the new sample with regards to the old law data set. We see that there is an increase in correlation rom \(0.7763\) to \(0.8929\).
law2 <- data.frame(Observation = 1:dim(law)[1], law)
# Given picture in the homework
law2 %>%
ggplot() +
geom_text(mapping = aes(x = LSAT, y = GPA, label = Observation),
hjust = 0, vjust = 0)
law2 <- law2 %>%
filter(Observation != 1) %>%
select(-Observation)
cor(law$LSAT, law$GPA)
cor(law2$LSAT, law2$GPA)
Here, we generate gray codes in order to get all possible bootstrap samples of the law
data. We take around 2-3 minutes to generate the gray codes.
# Generating gray codes
# ptm <- proc.time()
# n <- length(law2$LSAT)
# i <- 1
# total <- choose(2*n - 1, n - 1)
# gray_codes <- matrix(0, nrow = total, ncol = n)
#
# r <- array(0, n)
# r[1] <- n
# t <- n
# h <- 0
# gray_codes[i, ] <- r
# i <- i + 1
#
# while (r[n] != n) {
# if (t != 1) {
# h <- 0
# }
# h <- h + 1
# t <- r[h]
# r[h] <- 0
# r[1] <- t - 1
# r[h + 1] <- r[h + 1] + 1
# gray_codes[i, ] <- r
# i <- i + 1
# }
# proc.time() - ptm
Now, we recompute the complete enumerations bootstrap after removing Observation 11. We compute both the new correlation of the bootstrap sample as well as its multinomial probability of obtaining that particular bootstrap sample. We see that the distribution of using the complete enumeration is about the same as the one obtained in the lecture slides.
The distribution and the code is shown below.
# Complete Enumerations Simulation
# ptm <- proc.time()
# enumData <- mclapply(1:total, function(i) {
# index <- t(gray_codes)[,i]
# law_list <- lapply(1:n, function(j) matrix(rep(law2[j,], index[j]),
# ncol = 2 , byrow = TRUE))
# newLaw <- do.call(rbind, law_list)
# c(cor(unlist(newLaw[,1]), unlist(newLaw[,2])), dmultinom(index, prob = rep(1,n)))
# }, mc.cores = 4)
# proc.time() - ptm
#
# enumData <- t(simplify2array(enumData))
# colnames(enumData) = c("cor","weight")
#
# enumData %>%
# as.data.frame() %>%
# tbl_df() %>%
# ggplot() +
# geom_histogram(mapping = aes(x = cor), binwidth = 0.013, na.rm = T) +
# geom_vline(mapping = aes(xintercept = cor(law2$LSAT, law2$GPA), colour = "red")) +
# labs(x = "Correlation", y = "Count", title = "Complete Enumerations Distribution") +
# guides(colour = F)
Here we look at the bootstrap of the law data and compare both distributions.
We see a change in the histogram compared to the one we had in class.
\(\textbf{We see that the histogram of the bootstrap samples became more smooth and more similar to the smoother distribution of the complete enumerations}\). This is probably due to the shift in correlation. The removal of the outlier Observation 1 increased the correlation of the law
data. This, in turn, will push the overall correlation distribution to the right. This will then cover the awkward bump in the Monte Carlo simulation distribution as the mass of the data will be shifted right. The removal of the outlier will pull the correlation of the bootstrap samples to the right.
The distribution and the code is shown below.
# Bootstrap Monte Carlo Simulation
# ptm <- proc.time()
# j <- length(law2$LSAT)
# B <- 40000
# mc_cors <- array(0, B)
#
# for (i in seq_len(B)) {
# mc_samp <- sample(1:j, size = j, replace = T)
# law_samp <- law2[mc_samp,]
# mc_cors[i] <- cor(law_samp$LSAT, law_samp$GPA)
# }
# proc.time() - ptm
#
# boot_data <- cbind(mc_cors, 1)
#
# boot_data %>%
# as.data.frame() %>%
# tbl_df() %>%
# ggplot() +
# geom_histogram(mapping = aes(x = mc_cors), binwidth = 0.013) +
# geom_vline(mapping = aes(xintercept = cor(law2$LSAT, law2$GPA), colour = "red")) +
# labs(x = "Correlation", y = "Count", title = "Bootstrap Simulation Distribution") +
# guides(colour = F)
Theoretically, \(\textbf{we find that Gray Codes speed up the computation by about 4 times faster}\). We start with analyzing the number of computations that the complete enumeration methods use. We will first assume that generating all possible bootstrap samples takes the same time for the complete enumeration and the speed-up.
For the full enumeration, we see that for each possible bootstrap sample, the apply() function goes through each entry in the gray code and generates the corresponding bootstrap sample. Then we take the correlation of that sample and output it.
For the Gray Code speed-up, we see that we take the first bootstrap sample, we go through each entry and generate the bootstrap sample as well as the correlation. Note that we pre-compute what additional correlation will be added to the final product. Since from each gray code to the next will remove an observation and its effect, then add the new observation and its effect, we do significantly less operations.
Assume that we have a sample size of \(n = 5\). This gives us a total of \({2n - 1 \choose n - 1}\), which evaluates to \({9 \choose 4} = 126\) total bootstrap samples. For the full enumeration, we go through 5 entries for each bootstrap sample, pull out 5 total entries from the data set, and then calculate the correlation. This gives about 11 operations over 126 samples (1386 operations total).
For the Gray Codes speed-up, we also have \(126\) total bootstrap samples. We still go through the gray codes to determine the transition from the previous sample to this sample. We identify two points of difference: the entry we are removing and the entry we are adding. This gives us a total of 10 operations for the first sample and 3 operations for the subsequent samples. Now, we have \(10 + 3*125 = 385\) operations.
Using the amount of operations, we can say that complete enumerations takes about \(\frac{1386}{385} = 3.6\) times longer than the Gray Codes speed-up.
The code below demonstrates the speed-up.
# Using the Gray Code speed-up iterative method
# lsat_mean <- mean(law2$LSAT)
# gpa_mean <- mean(law2$GPA)
#
# correlation <- (law2$LSAT)*(law2$GPA)
# lsat_square <- (law2$LSAT - lsat_mean)^2
# gpa_square <- (law2$GPA - gpa_mean)^2
#
# calc_cor <- matrix(0, nrow = total, ncol = 2)
#
# ptm <- proc.time()
# for (i in seq_len(total)) {
# boot_samp <- gray_codes[i, ]
# calc_cor[i, 1] <- (boot_samp%*%correlation -
# (sum(boot_samp%*%law2$LSAT)*sum(boot_samp%*%law2$GPA))/n)
# calc_cor[i, 1] <- calc_cor[i, 1]/((boot_samp%*%lsat_square)*(boot_samp%*%gpa_square))
# calc_cor[i, 2] <- dmultinom(x = boot_samp, prob = rep(1/n, n))
# }
# proc.time() - ptm
#
# samps <- sample(calc_cor[,1], size = 40000, prob = calc_cor[,2], replace = T)
# gc_data <- cbind(samps, 1)
#
# gc_data %>%
# as.data.frame() %>%
# tbl_df() %>%
# ggplot() +
# geom_histogram(mapping = aes(x = samps), binwidth = 0.01,
# na.rm = T) +
# geom_vline(xintercept = cor(law2$LSAT, law2$GPA))
In our experimental runthrough, we see that complete enumerations took around 800 seconds while the gray_codes speed up took around 200 seconds. This verifies our theoretical speed-up (albeit with a little noise).
It’s interesting how removing an outlier will change the bootstrap distribution so much.
Suppose in a poll of \(500\) registered voters, \(269\) responded that they would vote for candidate \(P\). We want to obtain a \(90%\) percentile bootstrap confidence interval for the true proportion of registered voters who plan to vote for \(P\).
We first start off defining our variables: the number of samples n, the actual number of voters who voted for the candidate, the confidence level alpha, number of bootstrap samples B, and the actual proportion of voters who voted for the candidate.
We start by sampling with replacement from our original sample. Our sample is denoted as each observation taking a binary value (0 or 1). 1 represents that the voter voted for candidate \(P\). 0 means that the voter did not vote for candidate \(P\).
After each sample, we recalculate the proportion of voters for voted for candidate \(P\). From our bootstrapped proportion, we calculate the sample standard deviation.
n <- 500
p <- 269
alpha <- 0.10
B <- 10000
actual_prop <- p/n
voters <- c(rep(1, p), rep(0, n - p))
boot_prop <- array(0, B)
set.seed(8)
for (i in seq_len(B)) {
boot_samp <- sample(x = voters, size = 500, replace = T)
boot_prop[i] <- sum(boot_samp)/n
}
samp_sd <- sd(voters)/sqrt(n)
upper <- actual_prop + qnorm(1-alpha/2)*samp_sd
lower <- actual_prop - qnorm(1-alpha/2)*samp_sd
(boot_conf_90 <- c(lower, upper))
From the sample standard deviation, we can use a normal distribution to obtain the upper and lower bounds of the confidence interval.
This allows us to obtain a bootstrapped 90% confidence interval of \((0.5013, 0.5747)\).