Exercise 1

In this exercise, we will explore the concepts of independence and exchangeability. Suppose the observations \((X_1, \cdots, X_K)\) are distributed in accordance with the multivariate normal probability density of \(\frac{\sqrt{|D|}}{(2\pi)^{K/2}}\exp\left(-\frac{1}{2}\sum\sum d_{ij}(x_i-\mu_i)(x_j-\mu_j)\right)\).

We are also given that the matrix \(D = (d_{ij})\) is positive definite; \(|D|\) is \(D\)’s determinant; \(\operatorname{E}(X_j) = \mu_j\); \(\operatorname{E}((X_i−\mu_i)(X_j−\mu_j)) = \sigma_{ij}\); and \((\sigma_{ij})= D^{−1}\). That is, the covariance matrix \((\sigma_{ij})\) is the inverse of \(D\).

Assume that \(\sigma_{ij} = \sigma^2\) when \(i = j\) and \(\sigma_{ij} = \sigma_{12}\) when \(i \neq j\). Both of these values are constants and don’t change with respect to \(i\) or \(j\).

Are the observations independent?

By the definition of independence, the product of the observations’ marginal densities must equal the joint multivariate normal density. Each \(X_i\) is marginally normal, with density \(P(x_i) = \frac{1}{\sigma \sqrt{2 \pi} } e^{ -\frac{(x_i-\mu_i)^2}{2 \sigma^2}}\).

We will see that \(\textbf{unless the covariance between each pair of observations}\) \((X_i, X_j)\) \(\textbf{is 0, the observations are not independent}\).

To see this, consider just two variables. If they were independent, the product of their normal marginal densities would have to equal the bivariate normal density. Multiplying the two marginal densities gives the following product.

\begin{align} P(x_i) P(x_j) &= \frac{1}{\sigma \sqrt{2 \pi} } e^{ -\frac{(x_i-\mu_i)^2}{2 \sigma^2}} \cdot \frac{1}{\sigma \sqrt{2 \pi} } e^{ -\frac{(x_j-\mu_j)^2}{2 \sigma^2}} \\ &= \frac{1}{2 \pi\sigma^2} e^{ -\frac{(x_i-\mu_i)^2}{2 \sigma^2}} e^{ -\frac{(x_j-\mu_j)^2}{2 \sigma^2}} \\ &= \frac{1}{2 \pi\sigma^2} \exp \left(-\frac{1}{2 \sigma^2} \left((x_i-\mu_i)^2 + (x_j-\mu_j)^2\right) \right) \\ &= \frac{\sqrt{|D|}}{(2 \pi)^{2/2}} \exp\left(-\frac{1}{2}\sum_{i}\sum_{j} d_{ij}(x_i-\mu_i)(x_j-\mu_j)\right) \quad \text{only if } d_{ij} = 0 \text{ for } i \neq j \text{ and } d_{ii} = \tfrac{1}{\sigma^2} \end{align}

Compared with the bivariate normal density, the product is missing the cross-term involving the covariance of \(x_i\) and \(x_j\). The product equals the multivariate density only when the covariance matrix is diagonal: if \(\sigma_{ij} = 0\) for \(i \neq j\), then its inverse \(D\) is also diagonal, so the off-diagonal \(d_{ij}\) vanish as well.

The same argument extends to the product of all \(K\) marginal densities (again requiring the covariances to be 0), as written out below. Thus, unless the covariance matrix is diagonal, the observations of a multivariate normal are not independent. In our setting the common covariance is \(\sigma_{12}\), so the observations are independent only in the special case \(\sigma_{12} = 0\).
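For completeness, here is the \(K\)-fold product written out (standard algebra, no additional assumptions): with \(d_{ii} = 1/\sigma^2\) and \(d_{ij} = 0\) for \(i \neq j\), we have \(|D| = \sigma^{-2K}\), so

\begin{align} \prod_{i = 1}^{K} P(x_i) &= \frac{1}{(2\pi)^{K/2}\sigma^{K}} \exp\left(-\frac{1}{2\sigma^{2}}\sum_{i = 1}^{K}(x_i-\mu_i)^{2}\right) \\ &= \frac{\sqrt{|D|}}{(2\pi)^{K/2}} \exp\left(-\frac{1}{2}\sum_{i}\sum_{j} d_{ij}(x_i-\mu_i)(x_j-\mu_j)\right), \end{align}

which matches the given multivariate density exactly when \(D\) is diagonal.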

Are the observations exchangeable?

Next, we want to determine if the observations are exchangeable.

From class, we know that dependent normally distributed random variables \(X_i\) for which the variance of \(X_i\) is a constant independent of \(i\) and the covariance of \(X_i\) and \(X_j\) is a constant independent of \(i\) and \(j\) are exchangeable.

In our example, the variance of each \(X_i\) is \(\sigma_{ii} = \sigma^2\) and the covariance of each pair \(X_i\) and \(X_j\) (\(i \neq j\)) is \(\sigma_{ij} = \sigma_{12}\). Both are constants: \(\sigma^2\) does not depend on \(i\), and \(\sigma_{12}\) is the same for every pair of observations, so it does not depend on \(i\) or \(j\) either.

Thus, we can conclude that \(\textbf{if the means are all equal, i.e., } \mu_i = \mu_j \textbf{ for every pair } X_i, X_j\textbf{, then the observations are exchangeable}\). Otherwise, the multivariate density changes when we swap observations with different means.

Note that if we subtract its mean from each variable, the resulting “centered” variables are exchangeable, since the means drop out of the density.

Intuitively, assuming a common mean across all observations, the multivariate normal density assigns a probability to a set of observed values. Whether value A is “assigned” to \(X_i\) and value B to \(X_j\), or vice versa, the density of the pair as a set is the same regardless of which observation each value is attached to (again, assuming equal means). A quick numerical check of both claims is sketched below.
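As a sanity check (a minimal sketch, assuming the mvtnorm package is available; the particular values \(\sigma^2 = 2\), \(\sigma_{12} = 1\), and \(\mu_i = 0\) are arbitrary), we can verify numerically that an equicorrelated multivariate normal with equal means is exchangeable but not independent.

library(mvtnorm)

mu    <- rep(0, 3)
Sigma <- matrix(1, 3, 3) + diag(1, 3)   # variance 2 on the diagonal, covariance 1 off it

x <- c(0.3, -1.2, 2.0)

# Not independent: the joint density differs from the product of the marginals.
dmvnorm(x, mean = mu, sigma = Sigma)
prod(dnorm(x, mean = 0, sd = sqrt(2)))

# Exchangeable: permuting the coordinates leaves the joint density unchanged.
dmvnorm(x[c(2, 3, 1)], mean = mu, sigma = Sigma)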

Exercise 2

In this problem, we want to explore the three measures of association and how they estimate different parameters. We will consider bivariate data \((X_i,Y_i)\) generated as follows:

\(Y_i = X_i + e_i, i = 1, 2, \cdots, n\)

where \(X_i\) has a standard Laplace (double exponential) distribution, \(e_i\) has a standard \(N(0, 1)\) distribution, and \(X_i\) and \(e_i\) are independent.

Part A

We will write an R script which generates this bivariate data. First, we will use the supplied R function rlaplace(n) from the OOmisc package to generate \(n\) i.i.d. Laplace variates.

For \(n = 30\), we compute such a bivariate sample by calling the function rlaplace and storing it as a data frame in our variable sample. We also generate our random errors using a standard normal distribution \(N(0, 1)\). Lastly, we set the \(\beta\) parameter equal to 1 for our standard Laplace sample.

library(OOmisc)     # provides rlaplace()
library(tidyverse)  # provides data_frame() and ggplot()

n <- 30
beta <- 1

set.seed(2)
random_noise <- rnorm(n)

set.seed(7)
sample <- data_frame(X = rlaplace(n, beta), Y = X + random_noise)

Next, we obtain the scatter plot and the association analyses based on the Pearson’s, Spearman’s, and Kendall’s procedures. We use the function cor.test() to calculate the association analyses.

In our scatter plot, we plotted the original X’s on the x-axis and the Y values on the y-axis. We see that there is a noticeable linear trend between the two variables.

sample %>%
  ggplot() +
  geom_point(mapping = aes(x = X, y = Y), size = 2.5)

From lecture, we know that the Pearson correlation coefficient uses the raw data values (through the sample covariance and standard deviations) and is closely tied to simple linear regression. In our sample, the Pearson correlation gives the highest estimate of the three. The Pearson procedure is most appropriate when X and Y are bivariate normal, which is interestingly not the case here, since X follows a Laplace distribution.

cor.test(sample$X, sample$Y, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  sample$X and sample$Y
## t = 11.146, df = 28, p-value = 8.333e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8051221 0.9533574
## sample estimates:
##       cor 
## 0.9033616

The Kendall \(\tau_k\) is a non-parametric measure of the monotone dependence between X and Y: it measures how often pairs \((X, Y)\) appear in the same order versus the reverse order. Given the roughly linear pattern in our scatter plot, concordant pairs dominate, and the ordering of one variable largely preserves the ordering of the other. The tau estimate is lower than the Pearson correlation estimate.

cor.test(sample$X, sample$Y, method = "kendall")
## 
##  Kendall's rank correlation tau
## 
## data:  sample$X and sample$Y
## T = 370, p-value = 1.649e-09
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##       tau 
## 0.7011494

Lastly, the Spearman procedure measures the degree of monotone association between the two variables X and Y. It is also a non-parametric test, so it makes few assumptions about the data beyond the variables being at least ordinal. The Spearman \(\rho\) estimate is the second-highest of the three.

cor.test(sample$X, sample$Y, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  sample$X and sample$Y
## S = 554, p-value = 4.587e-07
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.8767519

The Kendall, Spearman, and Pearson procedures all give evidence of a strong positive association between X and Y.

Part B

To explore this idea further, we will write an R script which simulates the generation of these bivariate samples and collects the three estimates of association. We will run this script 10,000 times and obtain sample averages of these estimates, their corresponding standard errors, and approximate 95% confidence intervals.

B <- 10000
alpha <- 0.05
error <- qnorm(1 - (alpha/2))

pearson_est <- array(0, B)
spearman_est <- array(0, B)
kendall_est <- array(0, B)

set.seed(15)
for (i in seq_len(B)) {
  random_noise <- rnorm(n)
  sample <- data_frame(X = rlaplace(n, beta), Y = X + random_noise)
  pearson_est[i] <- cor(sample$X, sample$Y, method = "pearson")
  spearman_est[i] <- cor(sample$X, sample$Y, method = "spearman")
  kendall_est[i] <- cor(sample$X, sample$Y, method = "kendall")
}

Taking the mean over all 10,000 samples, the Pearson correlation remains the highest and the Spearman estimate drops to second-highest. This is not surprising: the three procedures estimate different population parameters, so their averages need not agree, and the single sample in Part A simply happened to give a Spearman estimate close to the Pearson one.

The Kendall estimate has the lowest sample average with \(0.565\). The Spearman estimate is \(0.734\) and the Pearson correlation is estimated at \(0.803\).

(pearson_avg <- mean(pearson_est))
## [1] 0.8030645
(spearman_avg <- mean(spearman_est))
## [1] 0.7345325
(kendall_avg <- mean(kendall_est))
## [1] 0.5653747

Looking at the sample standard deviations of the estimates, the Pearson estimate has the lowest variability, while the Spearman and Kendall estimates have higher variability. This translates into wider confidence intervals for those two estimates.

(pearson_sd <- sd(pearson_est))
## [1] 0.08154731
(spearman_sd <- sd(spearman_est))
## [1] 0.09998199
(kendall_sd <- sd(kendall_est))
## [1] 0.09416267

Looking at the confidence intervals, the Pearson 95% confidence interval is the narrowest, while the Spearman 95% confidence interval is the widest (as expected from the standard deviations above). The Pearson interval’s upper limit is very close to a correlation of 1 (\(0.962\)), and the Spearman interval’s upper limit is also high, at \(0.93\).

(pearson_conf <- c(pearson_avg - pearson_sd*error, 
                  pearson_avg + pearson_sd*error))
## [1] 0.6432348 0.9628943
(spearman_conf <- c(spearman_avg - spearman_sd*error, 
                  spearman_avg + spearman_sd*error))
## [1] 0.5385714 0.9304937
(kendall_conf <- c(kendall_avg - kendall_sd*error, 
                  kendall_avg + kendall_sd*error))
## [1] 0.3808193 0.7499302

All three confidence intervals indicate a strong positive association between the X variable and the Y response. The intervals for the three methods overlap substantially, which suggests the differences among the three estimates are not sharply distinguished at this sample size. Keep in mind, however, that the three procedures target different population parameters, as the rough approximation below illustrates.
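As a rough sketch of those different targets (reusing rlaplace and beta from above; the sample size of 10,000 is an arbitrary choice, and Kendall’s tau on a sample this large can be slow), one large sample approximates the population value of each measure. For reference, the population Pearson correlation here is \(2/\sqrt{6} \approx 0.816\), since a standard Laplace variate has variance 2 and the added noise has variance 1.

set.seed(42)
n_big <- 10000
x_big <- rlaplace(n_big, beta)
y_big <- x_big + rnorm(n_big)

# Approximate population values of the three association measures.
cor(x_big, y_big, method = "pearson")
cor(x_big, y_big, method = "spearman")
cor(x_big, y_big, method = "kendall")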

Exercise 3

In this problem, we will explore the regression toward the mean phenomenon. The electronic memory game Simon was first introduced in the late 1970’s. In the game there are four colored buttons which light up and produce a musical note. The device plays a sequence of light/note combinations and the goal is to play the sequence back by pressing the buttons. The game starts with one light/note and progressively adds one each time the player correctly recalls the sequence.

Suppose the game were played by a set of statistics students in two classes (two time slots). Each student played the game twice and recorded his or her longest sequence. The results are in the data set simon in package npsm.

Regression toward the mean is the phenomenon that if an observation is extreme on the first trial it will be closer to the average on the second trial. In other words, students that scored higher than average on the first trial would tend to score lower on the second trial and students who scored low on the first trial would tend to score higher on the second.
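As a quick illustration of the phenomenon itself (a simulated sketch, not the simon data; the sample size of 1,000 and the equal noise variances are arbitrary choices), two equally noisy measurements of the same underlying ability already show the effect:

set.seed(3)
ability <- rnorm(1000)               # latent skill
trial1  <- ability + rnorm(1000)     # first noisy measurement
trial2  <- ability + rnorm(1000)     # second noisy measurement

top <- trial1 > quantile(trial1, 0.9)  # extreme performers on the first trial
mean(trial1[top])                      # well above the overall average
mean(trial2[top])                      # pulled back toward the overall average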

Part A

We will first obtain a scatter plot of the data. At first glance, the points seem fairly scattered. There is a vague linear trend, which on its own might seem to contradict a regression toward the mean effect. We see that there is a student with scores lower than 5 for both games and students with sequences longer than 15 for both games.

Furthermore, there are a lot of points concentrated in the middle of the scatterplot, which signifies average performance for both games.

simon %>%
  dmap_at("class", as.factor) %>%
  rename("Class" = class) %>%
  ggplot() +
  geom_point(mapping = aes(x = game1, y = game2), size = 2.5) +
  labs(x = "Game 1 Longest Sequence", y = "Game 2 Longest Sequence") 

Looking at the scatter plot separated by class, Class 2 seems to show a more prominent linear trend than Class 3. Class 3’s Game 1 scores are concentrated in the middle, while its Game 2 scores cover a wider range, which may influence the pattern for the cohort as a whole.

simon %>%
  dmap_at("class", as.factor) %>%
  rename("Class" = class) %>%
  ggplot() +
  geom_point(mapping = aes(x = game1, y = game2, color = Class), size = 2.5) +
  facet_wrap(~ Class) +
  labs(x = "Game 1 Longest Sequence", y = "Game 2 Longest Sequence") 

Part B

Next, we overlay an R fit of the data (using package Rfit) with Wilcoxon scores (the default), together with the symmetry line \(Y = X\). The R fit is drawn in red and the symmetry line in black. Most points cluster around the middle of the graph (students who score near average on the first game also score near average on the second). We do not, however, see low Game 1 scorers jumping to high Game 2 scores or high Game 1 scorers dropping to low Game 2 scores; the pull is toward the average rather than a reversal.

fit <- rfit(game2 ~ game1, data = simon, scores = Rfit::wscores)

simon %>%
  dmap_at("class", as.factor) %>%
  rename("Class" = class) %>%
  ggplot() +
  geom_point(mapping = aes(x = game1, y = game2, color = Class), size = 2.5) +
  geom_abline(intercept = 0, slope = 1, size = 1) +
  geom_line(mapping = aes(x = game1, y = fitted(fit)), color = "red",
            size = 1) +
  labs(x = "Game 1 Longest Sequence", y = "Game 2 Longest Sequence") 

Part C

Then, we obtain an R estimate of the slope of the regression line (R Fit line) as well as an associated confidence interval (for the R Fit line).

We see that the slope of the R fit line is approximately \(0.5\) (\(0.4999\)), while the slope of the symmetry line is \(1\). The R fit slope being less than 1 may suggest a regression toward the mean effect. Looking at the graph, there is exactly one point through which both lines pass: a student from class 2 with a score of \(12\) on both games.

(slope_est <- coef(fit)[2] %>% unname())
## [1] 0.4999999

The 95% confidence interval for the R fit regression slope is \((0.19, 0.81)\). Since the interval lies entirely below 1, extreme Game 1 scores tend to correspond to Game 2 scores that are pulled back toward the average under the R fit model, consistent with regression toward the mean. See also the quick check after the interval computation below.

slope_std <- coef(summary(fit))[2,2] %>% unname()
alpha <- 0.05
error <- qnorm(1 - (alpha/2))

conf_upper <- slope_est + error*slope_std
conf_lower <- slope_est - error*slope_std
(conf_int <- c(conf_lower, conf_upper))
## [1] 0.1912956 0.8087041
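As a complementary check (a minimal sketch reusing slope_est and slope_std from above; the normal approximation for the slope estimate is an assumption), a Wald-type statistic for \(H_0\colon \text{slope} = 1\) can be computed directly:

# Wald-type z statistic for H0: slope = 1, using the estimate and standard
# error extracted above.
z_stat <- (slope_est - 1) / slope_std
z_stat
2 * pnorm(-abs(z_stat))   # approximate two-sided p-value

With the estimate and standard error reported above, this statistic is roughly \((0.5 - 1)/0.158 \approx -3.2\), again pointing to a slope well below 1.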

Part D

We see that these \(\textbf{data suggest a regression toward the mean effect}\). The confidence interval calculated in the previous step lies below 1 and does not contain 1; if the R fit slope were close to or exactly 1, there would be little evidence of a regression toward the mean effect.

Looking at more plots of Game 1 versus Game 2 scores, we see that for the top 5 Game 1 scores, the average Game 1 score is greater than its corresponding average Game 2 score.

simon %>%
  dmap_at("class", as.factor) %>%
  top_n(5, game1) %>%
  rename(`Game 1` = game1,
         `Game 2` = game2) %>%
  gather(Game, Score, `Game 1`:`Game 2`) %>%
  ggplot() +
  geom_boxplot(mapping = aes(x = Game, y = Score)) 

Moreover, when we look at the students with the bottom 5 Game 1 scores, the range of their Game 2 scores shifts up toward the mean.

simon %>%
  dmap_at("class", as.factor) %>%
  arrange(-game1) %>%
  rename(`Game 1` = game1,
         `Game 2` = game2) %>%
  tail(5) %>%
  gather(Game, Score, `Game 1`:`Game 2`) %>%
  ggplot() +
  geom_boxplot(mapping = aes(x = Game, y = Score)) 

Both of the previous plots give further evidence of a regression toward the mean effect in the simon data set.

Exercise 4

We will show that the leave-one-out cross-validation score \(\widehat{R}(h) = \frac{1}{n}\sum_{i = 1}^n (Y_i−\widehat{r}_{(−i)}(x_i))^2\) is equal to \(\widehat{R}(h) = \frac{1}{n}\sum_{i = 1}^n \left(\frac{Y_i−\widehat{r}_{n}(x_i)}{1 - L_{ii}}\right)^2\).

We let \(\widehat{r}_{n}(x)\) be the kernel regression estimator and \(\widehat{r}_{(-i)}(x)\) be the kernel regression estimator obtained by leaving out \((x_i, Y_i)\). We also denote \(L_{ij} = \frac{K(x_i, x_j)}{\sum_{k = 1}^n K(x_i, x_k)}\).

Part A

We will first show that \(\widehat{r}_{(-i)}(x_i) = \sum_{j \neq i} L_{ij} Y_j + L_{ii}\, \widehat{r}_{(-i)}(x_i)\).

Since \(\widehat{r}_{n}\) is a kernel regression estimator, at the point \(x_i\) it can be written as \(\widehat{r}_{n}(x_i) = \sum_{j = 1}^n L_{ij} Y_j\), where the weights \(L_{ij}\) are the normalized kernel values defined above.

Likewise, the kernel regression estimator obtained by leaving out \((x_i, Y_i)\) uses the same kernel values, with the weights renormalized over the remaining observations:

\begin{align} \widehat{r}_{(-i)}(x_i) &= \sum_{j \neq i} \frac{K(x_i, x_j)}{\sum_{k \neq i} K(x_i, x_k)}\, Y_j. \end{align}

Now we rearrange the statement we are trying to prove. The claim \(\widehat{r}_{(-i)}(x_i) = \sum_{j \neq i} L_{ij} Y_j + L_{ii}\, \widehat{r}_{(-i)}(x_i)\) can be rewritten using algebra as follows.

\begin{align} \widehat{r}_{(-i)}(x_i) &= \sum_{j \neq i} L_{ij} Y_j + L_{ii}\, \widehat{r}_{(-i)}(x_i) \\ \widehat{r}_{(-i)}(x_i) - L_{ii}\, \widehat{r}_{(-i)}(x_i) &= \sum_{j \neq i} L_{ij} Y_j \\ \widehat{r}_{(-i)}(x_i)\,(1 - L_{ii}) &= \sum_{j \neq i} L_{ij} Y_j \\ \end{align}

Because the two statements are equivalent, it is sufficient just to prove the last statement \(\widehat{r}_{(-i)}(x_i)(1 - L_{ii}) = \sum_{j \neq i}^n L_{ij} Y_j\).

Now we want to plug in the definitions for \(L_{ij}\) (which we defined in the beginning) and the leave-one-out kernel regression estimator.

\begin{align} \widehat{r}_{(-i)}(x_i)\,(1 - L_{ii}) &= \sum_{j \neq i} L_{ij} Y_j \\ \left(\sum_{j \neq i} \frac{K(x_i, x_j)}{\sum_{k \neq i} K(x_i, x_k)}\, Y_j\right) \left(1 - \frac{K(x_i, x_i)}{\sum_{k = 1}^n K(x_i, x_k)}\right) &= \sum_{j \neq i} \frac{K(x_i, x_j)}{\sum_{k = 1}^n K(x_i, x_k)}\, Y_j \\ \end{align}

Now we simplify the left-hand side until it matches the right-hand side.

\begin{align} \left(\sum_{j \neq i} \frac{K(x_i, x_j)}{\sum_{k \neq i} K(x_i, x_k)}\, Y_j\right) \left(1 - \frac{K(x_i, x_i)}{\sum_{k = 1}^n K(x_i, x_k)}\right) &= \left(\sum_{j \neq i} \frac{K(x_i, x_j)}{\sum_{k \neq i} K(x_i, x_k)}\, Y_j\right) \frac{\sum_{k = 1}^n K(x_i, x_k) - K(x_i, x_i)}{\sum_{k = 1}^n K(x_i, x_k)} \\ &= \left(\sum_{j \neq i} \frac{K(x_i, x_j)}{\sum_{k \neq i} K(x_i, x_k)}\, Y_j\right) \frac{\sum_{k \neq i} K(x_i, x_k)}{\sum_{k = 1}^n K(x_i, x_k)} \\ &= \sum_{j \neq i} \frac{K(x_i, x_j)}{\sum_{k = 1}^n K(x_i, x_k)}\, Y_j \\ &= \sum_{j \neq i} L_{ij} Y_j \end{align}

In the second-to-last step, the factor \(\sum_{k \neq i} K(x_i, x_k)\) cancels; since the remaining denominator \(\sum_{k = 1}^n K(x_i, x_k)\) does not depend on \(j\), it can be pulled inside the sum over \(j\), which gives exactly the weights \(L_{ij}\) defined at the beginning.

Thus, we see that the left-hand side evaluates to our right-hand side. We have shown that \(\widehat{r}_{(-i)}(x_i)(1 - L_{ii}) = \sum_{j \neq i}^n L_{ij} Y_j\) as needed.

Part B

Next, we want to show that \(\widehat{r}_{n}(x_i) - \widehat{r}_{(-i)}(x_i) = L_{ii} (Y_i - \widehat{r}_{(-i)}(x_i))\), and hence that \(Y_i - \widehat{r}_{(-i)}(x_i) = \frac{Y_i - \widehat{r}_n(x_i)}{1 - L_{ii}}\).

We first show that the two equalities are equivalent: starting from the second equation and working backwards (each algebraic step below is reversible), we arrive at the first.

\begin{align} Y_i - \widehat{r}_{(-i)}(x_i) &= \frac{Y_i - \widehat{r}_n(x_i)}{1 - L_{ii}} \\ (1 - L_{ii})\,(Y_i - \widehat{r}_{(-i)}(x_i)) &= Y_i - \widehat{r}_n(x_i) \\ (1 - L_{ii})\, Y_i - (1 - L_{ii})\, \widehat{r}_{(-i)}(x_i) &= Y_i - \widehat{r}_n(x_i) \\ Y_i - L_{ii} Y_i - \widehat{r}_{(-i)}(x_i) + L_{ii}\, \widehat{r}_{(-i)}(x_i) &= Y_i - \widehat{r}_n(x_i) \\ \widehat{r}_n(x_i) - \widehat{r}_{(-i)}(x_i) &= L_{ii} Y_i - L_{ii}\, \widehat{r}_{(-i)}(x_i) \\ \widehat{r}_n(x_i) - \widehat{r}_{(-i)}(x_i) &= L_{ii}\,(Y_i - \widehat{r}_{(-i)}(x_i)) \\ \end{align}

Now we show that the first equality, \(\widehat{r}_n(x_i) - \widehat{r}_{(-i)}(x_i) = L_{ii}(Y_i - \widehat{r}_{(-i)}(x_i))\), indeed holds, using the result from part (a).

\begin{align} \widehat{r}_{n}(x_i) - \widehat{r}_{(-i)}(x_i) &= \widehat{r}_{n}(x_i) - \left(\sum_{j \neq i} L_{ij} Y_j + L_{ii}\, \widehat{r}_{(-i)}(x_i)\right) \\ &= \sum_{j = 1}^n L_{ij} Y_j - \sum_{j \neq i} L_{ij} Y_j - L_{ii}\, \widehat{r}_{(-i)}(x_i) \\ &= L_{ii} Y_i - L_{ii}\, \widehat{r}_{(-i)}(x_i) \\ &= L_{ii}\,(Y_i - \widehat{r}_{(-i)}(x_i)) \\ \end{align}

Finally, we can conclude that the leave-one-out cross-validation identity holds. We have shown that for each observation \((x_i, Y_i)\), \(Y_i−\widehat{r}_{(−i)}(x_i) = \frac{Y_i−\widehat{r}_{n}(x_i)}{1 - L_{ii}}\). Squaring both sides and averaging over the \(n\) observations gives the equality stated at the beginning of the problem, repeated below.

\(\widehat{R}(h) = \frac{1}{n}\sum_{i = 1}^n (Y_i−\widehat{r}_{(−i)}(x_i))^2\) is equal to \(\widehat{R}(h) = \frac{1}{n}\sum_{i = 1}^n (\frac{Y_i−\widehat{r}_{n}(x_i)}{1 - L_{ii}})^2\)

We see that the right-hand side does not involve any leave-one-out estimator. This means we can compute the leave-one-out cross-validation score from a single fit on the full data set, instead of refitting the estimator \(n\) times with one observation removed each time. A numerical check of the identity is sketched below.
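As a sanity check (a minimal sketch, not part of the exercise: a Nadaraya-Watson estimator with a Gaussian kernel on simulated data, where the sample size, bandwidth \(h = 0.5\), and regression function are arbitrary choices), we can verify the identity numerically:

set.seed(1)
n_check <- 50
x <- runif(n_check)
y <- sin(2 * pi * x) + rnorm(n_check, sd = 0.3)
h <- 0.5

K <- dnorm(outer(x, x, "-") / h)        # K(x_i, x_j), Gaussian kernel
L <- K / rowSums(K)                     # L_ij = K(x_i, x_j) / sum_k K(x_i, x_k)
r_hat <- as.vector(L %*% y)             # hat r_n(x_i), fit on the full data

# Direct leave-one-out estimates: drop the i-th point and renormalize the weights.
r_loo <- sapply(seq_len(n_check), function(i) {
  w <- K[i, -i] / sum(K[i, -i])
  sum(w * y[-i])
})

# The two forms of the cross-validation score.
mean((y - r_loo)^2)
mean(((y - r_hat) / (1 - diag(L)))^2)

The two printed values agree (up to floating-point error), as the identity requires.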