Stanford University, Spring 2016, STATS 205

Overview

  • Smoothing or estimating curves
    • Density Estimation
    • Non-linear regression
  • Rank-based linear regression

Curve Estimation

  • A curve of interest can be a probability density function \(f\) or a regression function \(r\)
  • In density estimation, we observe \(X_1,\dots,X_n\) from some unknown cdf \(F\) with density \(f\) \[ X_1,\dots,X_n \sim f \] and the goal is to estimate density \(f\)
  • In regression, we observe pairs \((x_1,Y_1),\dots,(x_n,Y_n)\) that are related as \[ Y_i = r(x_i) + e_i \] with \(E(e_i) = 0\), and the goal is to estimate the regression function \(r\)
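
A minimal R sketch of both tasks (data and settings are illustrative, not from the lecture): density() gives a kernel estimate of \(f\), and ksmooth() a kernel regression estimate of \(r\):

set.seed(0)
x = rnorm(200)                              # sample X_1,...,X_n from an unknown density f
fhat = density(x)                           # kernel density estimate of f
xr = runif(200, 0, 2*pi)
yr = sin(xr) + rnorm(200, sd = 0.3)         # Y_i = r(x_i) + e_i with r(x) = sin(x)
rhat = ksmooth(xr, yr, kernel = "normal", bandwidth = 0.5)  # kernel regression estimate of r
plot(fhat)                                  # estimated density
plot(rhat, type = "l")                      # estimated regression function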

Density Estimation

Non-Linear Regression

Rank-Based Linear Regression

  • But today, we focus on linear regression
  • We generalize rank-based methods from two-sample location problems to general linear models

Linear Regression in Two-Sample Problem

  • Recall that we framed the two-sample location problem as a regression problem
  • Combine the samples into one vector \(\boldsymbol{Z} = (X_1,\dots,X_{n_1},Y_1,\dots,Y_{n_2})^T\)
  • Let \(\boldsymbol{c}\) be an \(n \times 1\) vector (\(n = n_1 + n_2\)) with
    • zeros at position \(1\) to \(n_1\) and
    • ones at positions \(n_1+1\) to \(n\)
  • Then we can rewrite the location model as \[Z_i = \alpha + c_i \Delta + e_i\] where \(e_1,\dots,e_n\) are iid with pdf \(f(t)\)
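
A minimal R sketch of this construction (sample sizes and the shift are illustrative); the rank-based test of \(\Delta = 0\) in this model is the two-sample Wilcoxon test:

set.seed(1)
n1 = 8; n2 = 10
X = rnorm(n1)                           # first sample
Y = rnorm(n2, mean = 0.5)               # second sample, shifted by Delta = 0.5
Z = c(X, Y)                             # combined response vector
cvec = c(rep(0, n1), rep(1, n2))        # zeros at positions 1..n1, ones at n1+1..n
wilcox.test(Z[cvec == 1], Z[cvec == 0]) # rank-based test of H0: Delta = 0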

Rank-Based Linear Regression

Test for Slope

  • For the simple regression model (\(i = 1,\dots,n\)) \[ Y_i = \alpha + \beta x_i + e_i \]
  • Intercept \(\alpha\) and slope \(\beta\) are unknown
  • Errors \(e_1,\dots,e_n\) are a sample from a continuous population with median 0
  • Theil (1950) tested for: \(H_0: \beta = \beta_0\)
  • The slope means that for every unit increase in the value of the independent (predictor) variable \(x\), we expect a change of \(\beta\) (an increase or decrease, depending on its sign) in the dependent (response) variable \(Y\)

Test for Slope

  • Difference \[D_i = Y_i - \beta_0 x_i \hspace{1cm}\text{for}\hspace{1cm} i = 1,\dots,n\]
  • Let \[C = \sum_{i=1}^{n-1} \sum_{j=i+1}^n \operatorname{sign}(D_j - D_i)\]
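
A minimal R sketch of this statistic (the function name is ours), taking the pairs in order of increasing \(x\); applied to the cloud-seeding data below with \(\beta_0 = 0\) it reproduces \(C = -6\):

theilC = function(x, y, beta0 = 0) {
  D = (y - beta0 * x)[order(x)]         # D_i = Y_i - beta0 * x_i, ordered by x
  ij = combn(length(D), 2)              # all index pairs with i < j
  sum(sign(D[ij[2, ]] - D[ij[1, ]]))    # C = sum over i < j of sign(D_j - D_i)
}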

Test for Slope

  • Motivation for the Test: \[ D_j - D_i = Y_j - \beta_0 x_j - (Y_i - \beta_0 x_i) = Y_j - Y_i + \beta_0 (x_i - x_j)\]
  • Under the model, the median of \(Y_j - Y_i = \beta(x_j - x_i) + (e_j - e_i)\) is \(\beta (x_j - x_i)\)
  • Hence the median of \(D_j - D_i\) is \(\beta (x_j - x_i) - \beta_0 (x_j - x_i) = (\beta - \beta_0)(x_j - x_i)\)
  • So we tend to obtain positive differences \(D_j - D_i\) (for \(x_j > x_i\)) when \(\beta > \beta_0\), which leads to a large \(C\)
  • The statistic \(C\) is Kendall's correlation statistic and can be interpreted as a test for correlation between \(x\) and \(Y\)

Test for Slope (Example)

  • Smith (1967) described an experiment on cloud seeding in Australia
  • The goal was to investigate the impact of cloud seeding on rainfall

Test for Slope (Example)

  • Two areas on a mountain served as target and control
  • During any period, a random process was used to determine whether the clouds over the target area should be seeded
  • The effect of seeding was measured by double ratio \[\frac{T/Q \text{ (seeded)}}{T/Q \text{ (unseeded)}}\]
  • \(T\) is the total rainfall in the target areas
  • \(Q\) is the total rainfall in the control areas
  • The slope parameter \(\beta\) represents the rate of change in \(Y\) (double ratio) per unit change in \(x\) (year)
  • Test \(H_0: \beta = 0\) (seeding has no impact on rainfall) versus \(H_A: \beta \ne 0\)

Test for Slope (Example)

Data collected over 5 years:

##   year doubleRatio
## 1    1        1.26
## 2    2        1.27
## 3    3        1.12
## 4    4        1.16
## 5    5        1.03

Under the null with \(\beta_0 = 0\), we have \(D_i = Y_i\):

Test for Slope (Example)

##    i j     D signD
## 1  1 2  0.01     1
## 2  1 3 -0.14    -1
## 3  1 4 -0.10    -1
## 4  1 5 -0.23    -1
## 5  2 3 -0.15    -1
## 6  2 4 -0.11    -1
## 7  2 5 -0.24    -1
## 8  3 4  0.04     1
## 9  3 5 -0.09    -1
## 10 4 5 -0.13    -1
C = sum(sign(diff)); C   # diff holds the pairwise differences D_j - D_i (column D above)
## [1] -6

Test for Slope (Example)

ken = cor.test(year,doubleRatio,method="kendall",alternative = "two.sided")
ken$p.value
## [1] 0.2333333

No evidence that cloud seeding impacts rainfall

Correlation

  • In a simple linear regression setting as before, we have a response variable \(Y\) and a predictor variable \(x\), and the fit of the model is of main interest
  • Often, we want to predict random variable \(Y\) from \(x\) and we treat \(x\) as nonstochastic
  • In correlation analysis, we consider random pairs \((X,Y)\), and the strength of the relationship or association between \(X\) and \(Y\) is of main interest
  • No association means that \(X\) and \(Y\) are independent:
    \[ H_0: X \text{ and } Y \text{ are independent versus } H_A: X \text{ and } Y \text{ are dependent} \]
  • We assume that \((X,Y)\) is a continuous random vector with cdf \(F(x,y)\) and pdf \(f(x,y)\)
  • Recall that \(X\) and \(Y\) are independent if their joint cdf factors into the product of the marginal cdfs \[ F(x,y) = F_X(x) F_Y(y) \]

Correlation

  • Last week, we discussed the \(\chi^2\) goodness-of-fit test for discrete random variables
  • In the discrete case, independence means \(P(X=x,Y=y) = P(X=x) P(Y=y)\) for all \(x\) and \(y\)
  • In the continuous case, we have Pearson's measure of association and two popular nonparametric measures (Kendall's and Spearman's)
  • Random sample \((X_1,Y_1),\dots,(X_n,Y_n)\)

Pearson's Correlation Coefficient

  • The traditional correlation coefficient \[ \rho = \frac{ E\left( (X - \mu_X) (Y - \mu_Y) \right)}{\sigma_X \sigma_Y} \]
  • Measure of linear association
  • Note that if \(X\) and \(Y\) are independent then \(\rho = 0\)
  • And if \(\rho \ne 0\), then they are dependent

Pearson's Correlation Coefficient

  • The numerator is estimated by the sample covariance
  • The denominator is estimated by the product of the sample standard deviations, giving \[ r = \frac{\sum_{i=1}^n (X_i- \bar{X} )(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^n(X_i-\bar{X})^2 \cdot \sum_{i=1}^n (Y_i-\bar{Y})^2}} \]
  • The estimate of the correlation coefficient is directly related to simple least squares regression
  • Let \(\widehat{\sigma}_x\) and \(\widehat{\sigma}_y\) denote the respective sample standard deviations of \(X\) and \(Y\), then \(r = \frac{ \widehat{\sigma}_x }{ \widehat{\sigma}_y } \widehat{\beta}\)
  • \(\widehat{\beta}\) is the least squares estimate of the slope in the simple regression of \(Y_i\) on \(X_i\)
  • Under the null \(\sqrt{n} r\) is asymptotically normal
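
A quick R check of the relation \(r = (\widehat{\sigma}_x / \widehat{\sigma}_y)\, \widehat{\beta}\) (simulated data, names are ours):

set.seed(2)
x = rnorm(50); y = 1 + 2 * x + rnorm(50)
r = cor(x, y)                           # Pearson's sample correlation
betahat = coef(lm(y ~ x))[2]            # least squares slope of y on x
c(r, (sd(x) / sd(y)) * betahat)         # the two values agree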

Kendall's \(\tau_K\)

  • Kendall's \(\tau_K\) is a measure of monotonicity between \(X\) and \(Y\)
  • Let \((X_1,Y_1)\) and \((X_2,Y_2)\) be independent random vectors, each with the same distribution as \((X,Y)\)
  • The pairs \((X_1,Y_1)\) and \((X_2,Y_2)\) are concordant or discordant if \[ \operatorname{sign}\{ (X_1-X_2)(Y_1-Y_2) \} = 1 \hspace{1cm}\text{or}\hspace{1cm} \operatorname{sign}\{ (X_1-X_2)(Y_1-Y_2) \} = -1 \]
  • Concordant pairs are indicative of increasing monotonicity between \(X\) and \(Y\)
  • Discordant pairs indicate decreasing monotonicity \[ \tau_K = P\left( \text{concordant} \right) - P\left( \text{discordant} \right)\]
  • If \(X\) and \(Y\) are independent then \(\tau_K = 0\)
  • If \(\tau_K \ne 0\), then \(X\) and \(Y\) are dependent

Kendall's \(\tau_K\)

  • Using a random sample \((X_1,Y_1),(X_2,Y_2),\dots,(X_n,Y_n)\)
  • Estimator: count the number of concordant pairs and subtract from that the number of discordant pairs
  • In standardized form \[ \widehat{\tau}_K = {n \choose 2}^{-1} \sum_{i < j} \operatorname{sign} \{ (X_i-X_j)(Y_i-Y_j) \} \]
  • Tests of the hypotheses can be based on the exact finite sample distribution
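
A minimal R sketch of \(\widehat{\tau}_K\) from the pairwise concordance signs (simulated data, names are ours):

set.seed(3)
x = rnorm(20); y = x + rnorm(20)
ij = combn(length(x), 2)                  # all index pairs with i < j
tauhat = mean(sign((x[ij[1, ]] - x[ij[2, ]]) * (y[ij[1, ]] - y[ij[2, ]])))
c(tauhat, cor(x, y, method = "kendall"))  # matches cor()'s Kendall estimate
cor.test(x, y, method = "kendall")        # test of H0: X and Y are independent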

Spearman's \(\rho_S\)

  • Random sample \((X_1,Y_1),(X_2,Y_2),\dots,(X_n,Y_n)\)
  • Denote by \(\operatorname{R}(X_i)\) the rank of \(X_i\) among \(X_1,X_2,\dots,X_n\)
  • Denote by \(\operatorname{R}(Y_i)\) the rank of \(Y_i\) among \(Y_1,Y_2,\dots,Y_n\)
  • The estimate of \(\rho_S\) is the sample correlation coefficient computed on the ranks \[ r_S = \frac{\sum_{i=1}^n \left(\operatorname{R}(X_i)-\frac{n+1}{2}\right)\left(\operatorname{R}(Y_i)-\frac{n+1}{2}\right)}{n(n^2-1)/12} \]
  • We reject \(H_0\) in favor of \(H_A\) (\(X\) and \(Y\) are dependent) for large values of \(|r_S|\)
  • This test can be carried out using the exact distribution
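
A minimal R sketch (simulated data, names are ours): \(r_S\) is Pearson's correlation applied to the ranks:

set.seed(4)
x = rnorm(30); y = exp(x) + rnorm(30, sd = 0.1)   # monotone but non-linear relationship
rS = cor(rank(x), rank(y))                        # correlation of the ranks
c(rS, cor(x, y, method = "spearman"))             # same value
cor.test(x, y, method = "spearman")               # test of H0: X and Y are independent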

The Geometry of Linear Models

  • Set up the following linear model (for \(i = 1,\dots,n\)) \[ Y_i = \boldsymbol{x}_i^T \boldsymbol{\beta} + e_i^* \] where \(\boldsymbol{\beta}\) is a \(p \times 1\) vector of unknown parameters
  • \(\boldsymbol{\beta}\) is the parameter of interest
  • Center the errors, \(e_i = e_i^* - \alpha\), where \(\alpha = T(e_i^*)\) is a location functional (usually the median): \[ Y_i = \alpha + \boldsymbol{x}_i^T \boldsymbol{\beta} + e_i\]
  • Let \(f(t)\) be the pdf of the errors \(e_i\)
  • Assumption: \(f(t)\) can be either asymmetric or symmetric depending on whether signs or ranks are used
  • The intercept \(\alpha\) is independent of the slope \(\boldsymbol{\beta}\)

The Geometry of Linear Models

  • Let \(\boldsymbol{Y} = (Y_1,\dots,Y_n)^T\) denote the \(n \times 1\) vector of observations
  • Let \(\boldsymbol{X}\) denote the \(n \times p\) matrix with rows \(\boldsymbol{x}^T_i\)
  • Then we can write the linear model in matrix form: \[ \boldsymbol{Y} = \boldsymbol{1}\alpha + \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{e} \]
  • \(\boldsymbol{X}\) is centered (that's fine since we have \(\alpha\) in the model), and we assume \(\boldsymbol{X}\) has full column rank
  • Let \(\Omega_F\) be the column space spanned by the columns of \(\boldsymbol{X}\)
  • So we can rewrite the linear model as (coordinate-free because not restricted to any specific basis vectors) \[ \boldsymbol{Y} = \boldsymbol{1}\alpha + \boldsymbol{\eta} + \boldsymbol{e} \] with \(\boldsymbol{\eta} \in \Omega_F\)

The Geometry of Linear Models

  • Now we can estimate \(\boldsymbol{\beta}\)
  • And test hypothesis \[ H_0: \boldsymbol{M}\boldsymbol{\beta} = 0 \hspace{2cm} H_A: \boldsymbol{M}\boldsymbol{\beta} \ne 0 \]
  • \(\boldsymbol{M}\) is a \(q \times p\) matrix of full rank

The Geometry of Estimation

\[ \boldsymbol{Y} = \boldsymbol{1}\alpha + \boldsymbol{\eta} + \boldsymbol{e} \hspace{1cm}\text{with}\hspace{1cm} \boldsymbol{\eta} \in \Omega_F\]

  • The task is to minimize some distance between \(\boldsymbol{Y}\) and the subspace \(\Omega_F\)
  • Think of \(\Omega_F\) as a hyperplane and the task as projecting \(\boldsymbol{Y}\) onto it
  • For the projection, we need to define a distance
  • Instead of using the usual Euclidean distance, we use a distance based on signs and ranks \[ \| \boldsymbol{v} \|_{\varphi} = \sum_{i=1}^n a(R(v_i)) v_i \]
  • with scores \(a(1) \le a(2) \le \cdots \le a(n)\) and score function \(a(i) = \varphi(i/(n+1))\)
  • \(\varphi\) is nondecreasing, centered, standardized and defined on the interval \((0,1)\)

The Geometry of Estimation

  • \(\|\boldsymbol{v}\|_{\varphi}\) is a pseudo-norm:
    • triangle inequality, non-negative, \(\|\alpha \boldsymbol{v}\|_{\varphi} = |\alpha| \|\boldsymbol{v}\|_{\varphi}\), and
    • additionally \(\|\boldsymbol{v}\|_{\varphi} = 0\) if and only if \(v_1 = \dots = v_n\)
  • By setting \(\varphi_R(u) = \sqrt{12}(u - 1/2)\), we get the Wilcoxon pseudo-norm
  • By setting \(\varphi_S(u) = \operatorname{sgn}(u - 1/2)\), we get the sign pseudo-norm (equivalent to using the \(L_1\) norm)
  • In general \[ D(\boldsymbol{Y},\Omega_F) = \| \boldsymbol{Y} - \widehat{\boldsymbol{Y}}_{\varphi} \|_{\varphi} = \underset{\boldsymbol{\eta} \in \Omega_F}{\min} \| \boldsymbol{Y}-\boldsymbol{\eta} \|_{\varphi} \]
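
A minimal R sketch (the function name is ours) of the Wilcoxon pseudo-norm \(\|\boldsymbol{v}\|_{\varphi}\) with \(\varphi_R(u) = \sqrt{12}(u - 1/2)\):

wnorm = function(v) {
  n = length(v)
  a = sqrt(12) * (rank(v) / (n + 1) - 1/2)   # scores a(R(v_i)) = phi_R(R(v_i)/(n+1))
  sum(a * v)                                 # ||v||_phi = sum_i a(R(v_i)) v_i
}
wnorm(rnorm(10))     # non-negative
wnorm(rep(5, 10))    # 0 for a constant vector (pseudo-norm, not a norm)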

The Geometry of Estimation

\[\widehat{\boldsymbol{\eta}}_{\varphi} = \underset{\boldsymbol{\eta} \in \Omega_F}{\operatorname{argmin}} \| \boldsymbol{Y}-\boldsymbol{\eta} \|_{\varphi} \hspace{1cm}\text{and}\hspace{1cm} D(\boldsymbol{Y},\Omega_F) = \| \boldsymbol{Y} - \widehat{\boldsymbol{Y}}_{\varphi} \|_{\varphi} = \underset{\boldsymbol{\eta} \in \Omega_F}{\min} \| \boldsymbol{Y}-\boldsymbol{\eta} \|_{\varphi}\]

Source: Hettmansperger & McKean (2011)

  • Estimate \(\widehat{\boldsymbol{\eta}}_{\varphi}\)
  • Distance between \(Y\) and the space \(\Omega_F\) is \(d_F\)
  • Reduced model subspace \(\Omega_R \subset \Omega_F\)

The Geometry of Estimation

  • Written in coordinate form, we estimate \(\boldsymbol{\beta}\) by minimizing the dispersion function \[ D_{\varphi}(\boldsymbol{\beta}) = \| \boldsymbol{Y}-\boldsymbol{X}\boldsymbol{\beta} \|_{\varphi} \]
  • \(D_{\varphi}(\boldsymbol{\beta})\) is a piecewise linear, continuous, and convex function of \(\boldsymbol{\beta}\),
  • and the gradient with respect to \(\boldsymbol{\beta}\) is given by \[ \nabla D_{\varphi}(\boldsymbol{\beta}) = -\boldsymbol{X}^T \boldsymbol{a}(R(\boldsymbol{Y}-\boldsymbol{X}\boldsymbol{\beta})) \] with \(\boldsymbol{a}(R(\boldsymbol{Y}-\boldsymbol{X}\boldsymbol{\beta}))^T = \left( a(R(Y_1-\boldsymbol{x}_1^T\boldsymbol{\beta})),\dots,a(R(Y_n-\boldsymbol{x}_n^T\boldsymbol{\beta})) \right)\)
  • Our estimate \(\widehat{\boldsymbol{\beta}}_{\varphi}\) solves the estimating equation \(\nabla D_{\varphi}(\boldsymbol{\beta}) = 0\)
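
A minimal R sketch (all names are ours) of this estimation step: the slope is found by numerically minimizing the Wilcoxon dispersion \(D_{\varphi}(\beta)\); in practice one would use a dedicated implementation such as the Rfit package:

set.seed(5)
n = 50
x = rnorm(n)
y = 1 + 2 * x + rt(n, df = 2)                # heavy-tailed errors
Dphi = function(beta) {
  e = y - x * beta                           # residuals for a candidate slope
  a = sqrt(12) * (rank(e) / (n + 1) - 1/2)   # Wilcoxon scores a(R(e_i))
  sum(a * e)                                 # D_phi(beta) = ||Y - X beta||_phi
}
betaR = optimize(Dphi, interval = c(-10, 10))$minimum   # rank-based slope estimate
alphaR = median(y - x * betaR)               # intercept from the median of the residuals
c(alphaR, betaR)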

The Geometry of Testing

  • \(\Omega_R\) is the reduced model space of \(\Omega_F\) subject to \(H_0\): \(\Omega_R = \{ \boldsymbol{\eta} \in \Omega_F: \boldsymbol{\eta} = \boldsymbol{X}\boldsymbol{\beta} \text{ for some } \boldsymbol{\beta} \text{ with } \boldsymbol{M}\boldsymbol{\beta} = 0 \}\)
  • The reduction in dispersion is \(RD_{\varphi} = D_{\varphi}(\boldsymbol{Y},\Omega_R) - D_{\varphi}(\boldsymbol{Y},\Omega_F)\)
  • Large values of \(RD_{\varphi}\) indicate \(H_A\) while small values support \(H_0\)
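
A minimal self-contained R sketch (names are ours) of the reduction in dispersion for \(H_0: \beta = 0\) in a simple regression with Wilcoxon scores:

set.seed(6)
n = 50
x = rnorm(n)
y = 1 + 0.5 * x + rnorm(n)
disp = function(e) {                         # Wilcoxon dispersion of a residual vector
  a = sqrt(12) * (rank(e) / (length(e) + 1) - 1/2)
  sum(a * e)
}
Dfull = optimize(function(b) disp(y - b * x), c(-10, 10))$objective   # full model fit
Dred  = disp(y)                              # reduced model: beta = 0
RD = Dred - Dfull                            # large RD favors H_A
RD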

References

  • Hettmansperger, T. P. and McKean, J. W. (2011). Robust Nonparametric Statistical Methods
  • Wasserman, L. (2006). All of Nonparametric Statistics
  • Hollander, M., Wolfe, D. A., and Chicken, E. (2014). Nonparametric Statistical Methods