Stanford University, Spring 2016, STATS 205

## Overview

• Smoothing or estimating curves
• Density Estimation
• Non-linear regression
• Rank-based linear regression

## Curve Estimation

• A curve of interest can be a probability density function $$f$$ or a regression function $$r$$
• In density estimation, we observe $$X_1,\dots,X_n$$ from some unknown cdf $$F$$ with density $$f$$ $X_1,\dots,X_n \sim f$ and the goal is to estimate density $$f$$
• In regression, we observe pairs $$(x_1,Y_1),\dots,(x_n,Y_n)$$ that are related as $Y_i = r(x_i) + e_i$ with $$E(e_i) = 0$$, and the goal is to estimate the regression function $$r$$

## Rank-Based Linear Regression

• But today, we focus on linear regression
• We generalize rank-based methods from two-sample location problems to general linear models

## Linear Regression in Two-Sample Problem

• Recall that we framed the two-sample location problem as a regression problem
• Combine sample in one vector $$\boldsymbol{Z} = (X_1,\dots,X_{n_1},Y_1,\dots,Y_{n_2}^T)$$
• Let $$\boldsymbol{c}$$ be a $$n \times 1$$ vector with
• zeros at position $$1$$ to $$n_1$$ and
• ones at positions $$n_1+1$$ to $$n$$
• Then we can rewrite the location model as $Z_i = \alpha + c_i \Delta + e_i$ where $$e_1,\dots,e_n$$ are iid with pdf $$f(t)$$

## Test for Slope

• For a single regression model ($$i = 1,\dots,n$$) $Y_i = \alpha + \beta x_i + e_i$
• Intercept $$\alpha$$ and slope $$\beta$$ are unkown
• Error $$e_1,\dots,e_n$$ are sample from continous population with median 0
• Theil (1950) tested for: $$H_0: \beta = \beta_0$$
• Which means for every unit increase in the value of the independent (predictor) variable $$x$$, we expect an increase (or decrease, depending on the sign) of the dependent (response) varibale $$Y$$

## Test for Slope

• Difference $D_i = Y_i - \beta_0 x_i \hspace{1cm}\text{for}\hspace{1cm} n = 1,\dots,n$
• Let $C = \sum_{i=1}^{n-1} \sum_{j=i+1}^n \operatorname{sign}(D_j - D_i)$

## Test for Slope

• Motivation for the Test: $D_j - D_i = Y_j - \beta_0 x_j - (Y_i - \beta_0 x_i) = Y_j - Y_i + \beta_0 (x_i - x_j)$
• under null the median of $$Y_j - Y_i = \beta(x_j - x_i) + (e_j - e_i)$$ is $$\beta (x_j - x_i)$$
• under null the median of $$D_j - D_i$$ is $$\beta (x_j - x_i) + \beta_0 (x_j - x_i) = (\beta - \beta_0)(x_j - x_i)$$
• hence, we tend to obtain positive $$D_j - D_i$$ difference when $$\beta > \beta_0$$ which leads to large $$C$$
• The statistics $$C$$ is the Kendall's correlation statistics, and can be interpreted as a test for correlation between $$x$$ and $$Y$$

## Test for Slope (Example)

• Smith (1967) described experiment in Australia on cloud seeding
• To investigate impact of cloud seeding on rainfall

## Test for Slope (Example)

• Two area on montain served as target and control
• During any period a random process was used to determine whether to clouds over the target area should be seeded
• The effect of seeding was measured by double ratio $\frac{T/Q \text{ (seeded)}}{T/Q \text{ (unseeded)}}$
• $$T$$ total rainfalls in the target areas
• $$G$$ total rainfalls in the control areas
• Slope parameter $$\beta$$ represenents the rate of change in $$Y$$ (double ratio) per unit change in $$x$$ (year)
• Test $$H_0: \beta = 0$$ versus $$H_A: \beta \ne 0$$ (seeding no impact on rainfall)

## Test for Slope (Example)

Data collected over 5 years:

##   year doubleRatio
## 1    1        1.26
## 2    2        1.27
## 3    3        1.12
## 4    4        1.16
## 5    5        1.03

Under the null $$\beta_0 = 0$$ we have $$D_i = Y_i$$:

## Test for Slope (Example)

##    i j     D signD
## 1  1 2  0.01     1
## 2  1 3 -0.14    -1
## 3  1 4 -0.10    -1
## 4  1 5 -0.23    -1
## 5  2 3 -0.15    -1
## 6  2 4 -0.11    -1
## 7  2 5 -0.24    -1
## 8  3 4  0.04     1
## 9  3 5 -0.09    -1
## 10 4 5 -0.13    -1
C = sum(sign(diff)); C
## [1] -6

## Test for Slope (Example)

ken = cor.test(year,doubleRatio,method="kendall",alternative = "two.sided")
ken\$p.value
## [1] 0.2333333

No evidence for cloud seeding impacting rainfall

## Correlation

• In a simple linear regression setting as before, we have reponse variable $$Y$$ predictor variable $$X$$, and the fit of the model is of main interest
• Often, we want to predict random variable $$Y$$ from $$x$$ and we treat $$x$$ as nonstochastic
• In correltaion analysis, we consider random pairs $$(X,Y)$$, and the strength of of a relationship or association between $$X$$ and $$Y$$ is of main interest
• No association means that $$X$$ and $$Y$$ are independent:
$H_0: X \text{ and } Y \text{ are independent versus } H_A: X \text{ and } Y \text{ are dependent}$
• We assume that $$(X,Y)$$ is a continuous random vector with cdf $$F(x,y)$$ and pdf $$f(x,y)$$
• Recall that $$X$$ and $$Y$$ are independent if their joint cdf factors into the product of marginals cdfs $F(x,y) = F_X(x) F_Y(y)$

## Correlation

• Last week, we discussed $$\chi^2$$ goodness-of-fit test for discrete random variables
• In the discrete case, independence was $$P(X=x,Y=y) = P(X=x) P(Y=y)$$ for all $$x$$ and $$y$$
• In the continuous case, we have the Pearson's measure of association and two popular nonparametric measures (Kendall and Spearman)
• Random sample $$(X_1,Y_1),\dots,(X_n,Y_n)$$

## Pearson's Correlation Coefficient

• The traditional correlation coefficient $\rho = \frac{ E\left( (X - \mu_X) (Y - \mu_Y) \right)}{\sigma_X \sigma_Y}$
• Measure of linear association
• Note that if $$X$$ and $$Y$$ are independent then $$\rho = 0$$
• And if $$\rho \ne 0$$ then are dependent

## Pearson's Correlation Coefficient

• The numenator is estimated by the sample covariance
• The denominator is estimated by the product of sample standard deviations $r = \frac{n^{-1} \sum_{i=1}^n (X_i- \bar{X} )(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^n(X_i-\bar{X})^2 \cdot \sum_{i=1}^n (Y_i-\bar{Y})^2}}$
• The estimate of the correlation coefficient is directly related to simple least squares regression
• Let $$\widehat{\sigma}_x$$ and $$\widehat{\sigma}_y$$ denote the respective sample standard deviations of $$X$$ and $$Y$$, then $$r = \frac{ \widehat{\sigma}_x }{ \widehat{\sigma}_y } \widehat{\beta}$$
• $$\widehat{\beta}$$ is least squares estimate of slope in simple regression of $$Y_i$$ on $$X_i$$
• Under the null $$\sqrt{n} r$$ is asymptotically normal

## Kendall's $$\tau_K$$

• Kendall's $$\tau_K$$ is a measure of monotonicity between X and Y
• Let two pairs of random variables $$(X_1,Y_1)$$ and $$(X_2,Y_2)$$ be independent random vectors with the same distribution as $$(X,Y)$$
• The pairs $$(X_1,Y_1)$$ and $$(X_2,Y_2)$$ are concordant or discordant if $\operatorname{sign}\{ (X_1-X_2)(Y_1-Y_2) \} = 1 \hspace{1cm}\text{or}\hspace{1cm} \operatorname{sign}\{ (X_1-X_2)(Y_1-Y_2) \} = -1$
• Concordant pairs are indicative of increasing monotonicity between $$X$$ and $$Y$$
• Discordant pairs indicate decreasing monotonicity $\tau_K = P\left( \text{concordant} \right) - P\left( \text{discordant} \right)$
• If $$X$$ and $$Y$$ are independent then $$\tau_K = 0$$
• If $$\tau_K = 0$$ then $$X$$ and $$Y$$ are dependent

## Kendall's $$\tau_K$$

• Using a random sample $$(X_1,Y_1),(X_2,Y_2),\dots,(X_n,Y_n)$$
• Estimator: count the number of concordant pairs and subtract from that the number of discordant pairs
• In standardized form $\widehat{\tau}_K = {n \choose 2}^{-1} \sum_{i < j} \operatorname{sign} \{ (X_i-X_j)(Y_i-Y_j) \}$
• Tests of the hypotheses can be based on the exact finite sample distribution

## Spearman $$\rho_S$$

• Random sample $$(X_1,Y_1),(X_2,Y_2),\dots,(X_n,Y_n)$$
• Denote by $$\operatorname{R}(X_i)$$ the rank of $$X_i$$ among $$X_1,X_2,\dots,X_n$$
• Denote by $$\operatorname{R}(Y_i)$$ as the rank of $$Y_i$$ among $$Y_1,Y_2,\dots,Y_n$$
• Estimate of $$\rho_S$$ is the sample correlation coefficient $r_S = \frac{\sum_{i=1}^n \left(\operatorname{R}(X_i)-((n+1)/2))(\operatorname{R}(Y_i)-((n+1)/2)\right)}{n(n^2-1)/12}$
• We accept $$HA:$$ $$X$$ and $$Y$$ are dependent for large values of $$|r_S|$$
• This test can be carried out using the exact distribution

## The Geometry of Linear Models

• Setup the following linear model (for $$i = 1,\dots,n$$) $Y_i = x_i^T \boldsymbol{\beta} + e_i^*$ where $$\boldsymbol{\beta}$$ is a $$1 \times p$$ vector of unkown parameters
• $$\boldsymbol{\beta}$$ are the parameter of interst
• Center (usually using the median $$T(e_i^*) = \alpha$$) the errors $$e_i = e_i^* - \alpha$$ $Y_i = \alpha + \boldsymbol{x}_i \boldsymbol{\beta} + e_i$
• Let $$f(t)$$ be the pdf of the erros $$e_i$$
• Assumption: $$f(t)$$ can be either asymmetric or symmetric depending on whether signs or ranks are used
• The intercept $$\alpha$$ is independent of the slope $$\boldsymbol{\beta}$$

## The Geometry of Linear Models

• Let $$\boldsymbol{Y} = (Y_1,\dots,Y_n)^T$$ denote the $$n \times 1$$ vector of observations
• Let $$\boldsymbol{X}$$ denote the $$n × p$$ matrix with rows $$x^T_i$$
• Then we can write the linear model in matrix form: $\boldsymbol{Y} = \boldsymbol{1}\alpha + \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{e}$
• $$\boldsymbol{X}$$ is centered (that's fine since we have $$\alpha$$ in the model), and assume $$\boldsymbol{X}$$ is full column rank
• Let $$\Omega_F$$ be the column space spanned by coluns of $$\boldsymbol{X}$$
• So we can rewrite the linear model as (coordinate-free because not restricited to any specific basis vectors) $\boldsymbol{Y} = \boldsymbol{1}\boldsymbol{\beta} + \boldsymbol{\eta} + \boldsymbol{e}$ with $$\boldsymbol{\eta} = \Omega_F$$

## The Geometry of Linear Models

• Now we can estimate $$\boldsymbol{\beta}$$
• And test hypothesis $H_0: \boldsymbol{M}\boldsymbol{\beta} = 0 \hspace{2cm} H_A: \boldsymbol{M}\boldsymbol{\beta} \ne 0$
• $$\boldsymbol{M}$$ is a $$q \times p$$ matrix of full rank

## The Geometry of Estimation

$\boldsymbol{Y} = \boldsymbol{1}\boldsymbol{\beta} + \boldsymbol{\eta} + \boldsymbol{e} \hspace{1cm}\text{with}\hspace{1cm} \boldsymbol{\eta} = \Omega_F$

• Task is to minimize some distance between $$\boldsymbol{Y}$$ and subspace $$\Omega_F$$
• Think of $$\boldsymbol{\eta}$$ as a hyperplane and the task as projecting $$\boldsymbol{Y}$$ onto it
• For the projection we need to define a distance
• Instead of using the usual Euclidean distance, we use a distance based on signs and ranks $\| v_i \|_{\varphi} = \sum_{i=1}^n a(R(v_i)) v_i$
• with scores $$a(1) \le a(2) \le \cdots \le a(n)$$ and score function $$a(i) = \varphi(i/(n+1))$$
• $$\varphi$$ is nondecreasing, centered, standardized and defined on the interval $$(0,1)$$

## The Geometry of Estimation

• $$\|\boldsymbol{v}\|_{\varphi}$$ is a pseudo-norm:
• triangle inequality, non-negative, $$\|\alpha \boldsymbol{v}\|_{\varphi} = |\alpha \|\boldsymbol{v}\|_{\varphi}$$, and
• additionally $$\|\boldsymbol{v}\|_{\varphi} = 0$$ if and only if $$v_1 = \dots = v_n$$
• By setting $$\varphi_R(u) = \sqrt{12}(u − 1/2)$$, we get the Wilcoxon pseudo-norm
• By setting $$\varphi_S(u) = \operatorname{sgn}(u − 1/2)$$, we get the sign pseudo-norm (equivalent to using the $$L_1$$ norm)
• In general $D(\boldsymbol{Y},\Omega_F) = \| \boldsymbol{Y} - \widehat{\boldsymbol{Y}}_{\varphi} \|_{\varphi} = \underset{\boldsymbol{\eta \in \Omega_F}}{\min} \| \boldsymbol{Y}-\boldsymbol{\eta} \|_{\varphi}$

## The Geometry of Estimation

$\widehat{\boldsymbol{\eta}} = D(\boldsymbol{Y},\Omega_F) = \| \boldsymbol{Y} - \widehat{\boldsymbol{Y}}_{\varphi} \|_{\varphi} = \underset{\boldsymbol{\eta \in \Omega_F}}{\min} \| \boldsymbol{Y}-\boldsymbol{\eta} \|_{\varphi}$