Homework 4

Exercise 1

From Good 2005 (Exercise 4.8, #20, page 78, Stanford library link):

Suppose the observations \((X_1,\dots,X_K)\) are distributed in accordance with the multivariate normal probability density \[ \frac{\sqrt{|D|}}{(2\pi)^{K/2}} \exp \left( -\frac{1}{2} \sum_{i=1}^K \sum_{j=1}^K d_{ij} (x_i-\mu_i)(x_j-\mu_j) \right) \] where the matrix \(D = (d_{ij})\) is positive definite; \(|D|\) denotes its determinant; \(\operatorname{E}(X_j) = \mu_j\); \(\operatorname{E}((X_i-\mu_i)(X_j-\mu_j)) = \sigma_{ij}\); and \((\sigma_{ij})= D^{-1}\). If \(\sigma_{ij} = \sigma^2\) when \(i = j\) and \(\sigma_{ij} = \sigma_{12}\) when \(i \ne j\), are the observations independent? Exchangeable?
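To get a feel for the covariance structure in this exercise, it can help to build the matrix numerically. The following is a minimal sketch in base R; the values of K, sigma2 and sigma12 and the permutation perm are arbitrary illustrative choices, not part of the exercise. It only inspects \((\sigma_{ij})\) and \(D\); it says nothing about the means \(\mu_j\), which also enter the density, and it is not a substitute for a written argument.

```r
# Illustrative values only (assumptions, not given in the exercise).
K <- 4
sigma2 <- 2.0
sigma12 <- 0.5

# Covariance matrix with sigma^2 on the diagonal and sigma_12 elsewhere.
Sigma <- matrix(sigma12, nrow = K, ncol = K)
diag(Sigma) <- sigma2

# D is the inverse of the covariance matrix, as in the density above.
D <- solve(Sigma)

# Reorder the coordinates with an arbitrary permutation and compare.
perm <- c(3, 1, 4, 2)
Sigma_perm <- Sigma[perm, perm]
D_perm <- D[perm, perm]

same_sigma <- isTRUE(all.equal(Sigma, Sigma_perm))
same_D <- isTRUE(all.equal(D, D_perm))
print(c(same_sigma = same_sigma, same_D = same_D))

# Off-diagonal entries of D show whether the density factorises
# into a product of univariate terms for these particular values.
print(round(D, 4))
```

Trying sigma12 <- 0 and comparing the printed D is a quick way to see how the off-diagonal terms behave.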

Exercise 2

From our textbook (Ex. 4.9.19) (Stanford library link):

Recall that, in general, the three measures of association estimate different parameters. Consider bivariate data \((X_i,Y_i)\) generated as follows:

\[Y_i = X_i + e_i, \hspace{0.5cm} i=1,2,\dots,n\]

where \(X_i\) has a standard Laplace (double exponential) distribution, \(e_i\) has a standard \(N(0,1)\) distribution, and \(X_i\) and \(e_i\) are independent.

Write an R script which generates this bivariate data. The supplied R function rlaplace(n) generates \(n\) iid Laplace variates. For \(n = 30\), compute such a bivariate sample. Then obtain the scatterplot and the association analyses based on Pearson’s, Spearman’s, and Kendall’s procedures.
Next write an R script which simulates the generation of these bivariate samples and collects the three estimates of association. Run this script for 10,000 simulations and obtain the sample averages of these estimates, their corresponding standard errors, and approximate 95% confidence intervals. Comment on the results.
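As a starting point for the first part, here is a minimal sketch of generating a single sample and computing the three estimates with base R. The function rlaplace_sketch is a hypothetical stand-in for the supplied rlaplace(n); it assumes that "standard Laplace" means the density \(\tfrac{1}{2}e^{-|x|}\), which may differ in scale from the textbook's function. The seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Hypothetical stand-in for the supplied rlaplace(n): the difference of two
# iid Exp(1) variates has density exp(-|x|) / 2 (assumed scaling).
rlaplace_sketch <- function(n) rexp(n) - rexp(n)

n <- 30
x <- rlaplace_sketch(n)
e <- rnorm(n)
y <- x + e

plot(x, y, main = "Y = X + e, n = 30")

# Point estimates of association from the three procedures.
methods <- c("pearson", "spearman", "kendall")
est <- sapply(methods, function(m) cor(x, y, method = m))
print(round(est, 3))

# cor.test() additionally reports a test of zero association per method.
tests <- lapply(methods, function(m) cor.test(x, y, method = m))
names(tests) <- methods
```

For the simulation part, the same steps could be wrapped in a function and repeated, for example with replicate(), collecting est from each run.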

Exercise 3

From our textbook (Ex. 4.9.20) (Stanford library link):

The electronic memory game Simon was first introduced in the late 1970s. In the game there are four colored buttons which light up and produce a musical note. The device plays a sequence of light/note combinations and the goal is to play the sequence back by pressing the buttons. The game starts with one light/note and progressively adds one each time the player correctly recalls the sequence.

Suppose the game were played by a set of statistics students in two classes (time slots). Each student played the game twice and recorded his or her longest sequence. The results are in the dataset simon in package npsm.

Regression toward the mean is the phenomenon that if an observation is extreme on the first trial, it will tend to be closer to the average on the second trial. In other words, students who scored higher than average on the first trial would tend to score lower on the second trial, and students who scored low on the first trial would tend to score higher on the second.

Obtain a scatterplot of the data.
Overlay an R fit of the data (use package Rfit). Use Wilcoxon scores. Also overlay the line \(y = x\).
Obtain an R estimate of the slope of the regression line as well as an associated confidence interval.
Do these data suggest a regression toward the mean effect?
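The plotting steps above can be sketched as follows. This sketch uses simulated placeholder scores rather than the simon data, because the column names of that dataset are not stated here; trial1, trial2 and skill are hypothetical names. It assumes the Rfit function rfit() with its default (Wilcoxon) scores, and skips the fit if Rfit is not installed.

```r
set.seed(3)  # arbitrary seed

# Simulated placeholder scores (NOT the simon data): a shared skill level
# plus independent noise on each trial.
n <- 40
skill <- rnorm(n, mean = 10, sd = 2)
trial1 <- skill + rnorm(n, sd = 2)
trial2 <- skill + rnorm(n, sd = 2)

plot(trial1, trial2, xlab = "First trial", ylab = "Second trial")
abline(a = 0, b = 1, lty = 2)  # reference line y = x

# The rank-based fit needs the Rfit package; skip it if it is not installed.
if (requireNamespace("Rfit", quietly = TRUE)) {
  fit <- Rfit::rfit(trial2 ~ trial1)   # Wilcoxon scores are the default
  b <- coef(fit)
  abline(a = b[[1]], b = b[[2]])       # overlay the fitted line
  print(summary(fit))                  # estimates and standard errors
}
```

A confidence interval for the slope can then be formed from the slope estimate and its standard error in the printed summary. Comparing the fitted slope with the slope of 1 of the reference line is one way to frame the last question.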

Exercise 4

Let \(\widehat{r}_n(x)\) be the kernel regression estimator, let \(\widehat{r}_{(-i)}(x)\) be the kernel regression estimator obtained by leaving out \((x_i,Y_i)\), and let \(L_{ij} = \frac{K(x_i,x_j)}{\sum_{k=1}^n K(x_i,x_k)}.\) Show that the leave-one-out cross-validation score \[\widehat{R}(h) = \frac{1}{n} \sum_{i=1}^n (Y_i - \widehat{r}_{(-i)}(x_i))^2 \hspace{1cm}\text{is equal to}\hspace{1cm} \widehat{R}(h) = \frac{1}{n} \sum_{i=1}^n \left( \frac{Y_i - \widehat{r}_n(x_i)}{1-L_{ii}} \right)^2.\] Proceed in the following steps:

Show that \[\widehat{r}_{(-i)}(x_i) = \sum_{j=1}^n L_{ij} Z_j \hspace{1cm}\text{with}\hspace{1cm} Z_j = \begin{cases} Y_j & j \ne i \\ \widehat{r}_{(-i)}(x_i) & j = i \end{cases}.\]
Show that \(\widehat{r}_n(x_i) - \widehat{r}_{(-i)}(x_i) = L_{ii} (Y_i - \widehat{r}_{(-i)}(x_i))\). Hence, \(Y_i - \widehat{r}_{(-i)}(x_i) = \frac{Y_i - \widehat{r}_{n}(x_i)}{1-L_{ii}}.\) Finally, conclude that the leave-one-out cross-validation equality holds.
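Before writing the proof, it can be reassuring to check the identity numerically. The sketch below is one way to do so in base R, under assumptions that are not part of the exercise: a Gaussian kernel, a bandwidth h = 0.1, and simulated data from a sine curve. It compares the brute-force leave-one-out score with the shortcut formula; a numerical match is a sanity check, not a proof.

```r
set.seed(2)  # arbitrary seed

# Simulated data (illustrative assumption).
n <- 50
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
h <- 0.1  # assumed bandwidth

# Gaussian kernel weights K(x_i, x_j); dividing each row by its sum gives L_ij.
Kmat <- outer(x, x, function(a, b) dnorm((a - b) / h))
L <- Kmat / rowSums(Kmat)

# Full-sample fits r_n(x_i) = sum_j L_ij Y_j.
fit_full <- as.vector(L %*% y)

# Brute-force leave-one-out fits: drop (x_i, Y_i) and renormalise the weights.
fit_loo <- sapply(seq_len(n), function(i) {
  w <- Kmat[i, -i]
  sum(w * y[-i]) / sum(w)
})

cv_direct <- mean((y - fit_loo)^2)
cv_shortcut <- mean(((y - fit_full) / (1 - diag(L)))^2)
print(c(direct = cv_direct, shortcut = cv_shortcut))
```

The shortcut needs only one fit on the full sample, which is why it is the form usually used when scanning over many bandwidths.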

Deadline

Please email your solutions in the form of an Rmd file to Nan Bi and Lexi Guan.

This homework is due on Tuesday, May 10th at 1:30 pm.

You are encouraged to work through this homework with your peers, but write your own code and explain the results in your own words.