- Smoothing or estimating curves
- Density Estimation
- Non-linear regression

- Rank-based linear regression

Stanford University, Spring 2016, STATS 205

- A curve of interest can be a probability density function \(f\) or a regression function \(r\)
- In density estimation, we observe \(X_1,\dots,X_n\) from some unknown cdf \(F\) with density \(f\) \[ X_1,\dots,X_n \sim f \] and the goal is to estimate density \(f\)
- In regression, we observe pairs \((x_1,Y_1),\dots,(x_n,Y_n)\) that are related as \[ Y_i = r(x_i) + e_i \] with \(E(e_i) = 0\), and the goal is to estimate the regression function \(r\)

- But today, we focus on linear regression
- We generalize rank-based methods from two-sample location problems to general linear models

- Recall that we framed the two-sample location problem as a regression problem
- Combine samples in one vector \(\boldsymbol{Z} = (X_1,\dots,X_{n_1},Y_1,\dots,Y_{n_2})^T\)
- Let \(\boldsymbol{c}\) be an \(n \times 1\) vector with
- zeros at position \(1\) to \(n_1\) and
- ones at positions \(n_1+1\) to \(n\)

- Then we can rewrite the location model as \[Z_i = \alpha + c_i \Delta + e_i\] where \(e_1,\dots,e_n\) are iid with pdf \(f(t)\)
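This regression form can be sketched in R (with made-up data); the least squares coefficient on the indicator vector \(\boldsymbol{c}\) estimates the shift \(\Delta\):

```r
# Two-sample location problem written as a regression (made-up data)
x <- c(5.1, 4.8, 5.6)            # first sample, n1 = 3
y <- c(6.2, 5.9, 6.8, 6.1)       # second sample, n2 = 4
z <- c(x, y)                     # combined vector Z
cvec <- c(rep(0, length(x)), rep(1, length(y)))  # indicator vector c
# the coefficient on cvec estimates the shift Delta
coef(lm(z ~ cvec))["cvec"]
```

For a 0/1 indicator, this coefficient is exactly the difference of the two sample means.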

- For the simple linear regression model (\(i = 1,\dots,n\)) \[ Y_i = \alpha + \beta x_i + e_i \]
- Intercept \(\alpha\) and slope \(\beta\) are unknown
- Errors \(e_1,\dots,e_n\) are a sample from a continuous population with median 0
- Theil (1950) proposed a test for \(H_0: \beta = \beta_0\)
- Under the null, every unit increase in the independent (predictor) variable \(x\) leads to an expected increase (or decrease, depending on the sign) of \(\beta_0\) in the dependent (response) variable \(Y\)

- Differences \[D_i = Y_i - \beta_0 x_i \hspace{1cm}\text{for}\hspace{1cm} i = 1,\dots,n\]
- Let \[C = \sum_{i=1}^{n-1} \sum_{j=i+1}^n \operatorname{sign}(D_j - D_i)\]

- Motivation for the Test: \[ D_j - D_i = Y_j - \beta_0 x_j - (Y_i - \beta_0 x_i) = Y_j - Y_i + \beta_0 (x_i - x_j)\]
- the median of \(Y_j - Y_i = \beta(x_j - x_i) + (e_j - e_i)\) is \(\beta (x_j - x_i)\)
- hence the median of \(D_j - D_i\) is \(\beta (x_j - x_i) - \beta_0 (x_j - x_i) = (\beta - \beta_0)(x_j - x_i)\), which is zero under the null
- hence we tend to obtain positive differences \(D_j - D_i\) when \(\beta > \beta_0\), which leads to a large \(C\)
- The statistic \(C\) is Kendall's correlation statistic and can be interpreted as a test for correlation between \(x\) and \(Y\)
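A minimal R sketch of the statistic (the helper name `theil_C` is ours, not standard):

```r
# Theil's statistic: C = sum over i < j of sign(D_j - D_i), with D_i = Y_i - beta0 * x_i
theil_C <- function(x, y, beta0) {
  D <- y - beta0 * x
  ij <- combn(length(D), 2)            # all pairs with i < j
  sum(sign(D[ij[2, ]] - D[ij[1, ]]))
}
theil_C(1:5, c(1.26, 1.27, 1.12, 1.16, 1.03), beta0 = 0)  # -6
```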

- Smith (1967) described a cloud-seeding experiment in Australia
- The goal was to investigate the impact of cloud seeding on rainfall

- Two areas on a mountain served as target and control
- During any given period, a random process determined whether the clouds over the target area should be seeded
- The effect of seeding was measured by the double ratio \[\frac{T/Q \text{ (seeded)}}{T/Q \text{ (unseeded)}}\]
- \(T\) is the total rainfall in the target area
- \(Q\) is the total rainfall in the control area
- Slope parameter \(\beta\) represents the rate of change in \(Y\) (double ratio) per unit change in \(x\) (year)
- Test \(H_0: \beta = 0\) (seeding has no impact on rainfall) versus \(H_A: \beta \ne 0\)

Data collected over 5 years:

```
##   year doubleRatio
## 1    1        1.26
## 2    2        1.27
## 3    3        1.12
## 4    4        1.16
## 5    5        1.03
```

Under the null \(\beta_0 = 0\) we have \(D_i = Y_i\):

```
##    i j     D signD
## 1  1 2  0.01     1
## 2  1 3 -0.14    -1
## 3  1 4 -0.10    -1
## 4  1 5 -0.23    -1
## 5  2 3 -0.15    -1
## 6  2 4 -0.11    -1
## 7  2 5 -0.24    -1
## 8  3 4  0.04     1
## 9  3 5 -0.09    -1
## 10 4 5 -0.13    -1
```

```
C = sum(sign(diff)); C
## [1] -6
```

```
ken = cor.test(year, doubleRatio, method = "kendall", alternative = "two.sided")
ken$p.value
## [1] 0.2333333
```

No evidence that cloud seeding impacts rainfall

- In the simple linear regression setting above, we have a response variable \(Y\) and a predictor variable \(x\), and the fit of the model is of main interest
- Often we want to predict the random variable \(Y\) from \(x\), treating \(x\) as nonstochastic
- In correlation analysis, we consider random pairs \((X,Y)\), and the strength of the relationship or association between \(X\) and \(Y\) is of main interest
- No association means that \(X\) and \(Y\) are independent:

\[ H_0: X \text{ and } Y \text{ are independent versus } H_A: X \text{ and } Y \text{ are dependent} \]

- We assume that \((X,Y)\) is a continuous random vector with cdf \(F(x,y)\) and pdf \(f(x,y)\)
- Recall that \(X\) and \(Y\) are independent if their joint cdf factors into the product of the marginal cdfs \[ F(x,y) = F_X(x) F_Y(y) \]

- Last week, we discussed the \(\chi^2\) goodness-of-fit test for discrete random variables
- In the discrete case, independence means \(P(X=x,Y=y) = P(X=x) P(Y=y)\) for all \(x\) and \(y\)
- In the continuous case, we have Pearson's measure of association and two popular nonparametric measures (Kendall's and Spearman's)
- Random sample \((X_1,Y_1),\dots,(X_n,Y_n)\)

- The traditional correlation coefficient \[ \rho = \frac{ E\left( (X - \mu_X) (Y - \mu_Y) \right)}{\sigma_X \sigma_Y} \]
- Measure of linear association
- Note that if \(X\) and \(Y\) are independent then \(\rho = 0\)
- And if \(\rho \ne 0\), then \(X\) and \(Y\) are dependent

- The numerator is estimated by the sample covariance
- The denominator is estimated by the product of the sample standard deviations, giving \[ r = \frac{\sum_{i=1}^n (X_i- \bar{X} )(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^n(X_i-\bar{X})^2 \cdot \sum_{i=1}^n (Y_i-\bar{Y})^2}} \]
- The estimate of the correlation coefficient is directly related to simple least squares regression
- Let \(\widehat{\sigma}_x\) and \(\widehat{\sigma}_y\) denote the respective sample standard deviations of \(X\) and \(Y\), then \(r = \frac{ \widehat{\sigma}_x }{ \widehat{\sigma}_y } \widehat{\beta}\)
- \(\widehat{\beta}\) is least squares estimate of slope in simple regression of \(Y_i\) on \(X_i\)
- Under the null, \(\sqrt{n}\, r\) is asymptotically standard normal
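The identity \(r = (\widehat{\sigma}_x/\widehat{\sigma}_y)\,\widehat{\beta}\) can be checked numerically on simulated data:

```r
# Numerical check of r = (s_x / s_y) * beta_hat (simulated data)
set.seed(1)
x <- rnorm(20)
y <- 2 * x + rnorm(20)
r        <- cor(x, y)                    # Pearson correlation
beta_hat <- coef(lm(y ~ x))["x"]         # least squares slope
all.equal(unname(sd(x) / sd(y) * beta_hat), r)  # TRUE
```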

- Kendall's \(\tau_K\) is a measure of monotonicity between \(X\) and \(Y\)
- Let two pairs of random variables \((X_1,Y_1)\) and \((X_2,Y_2)\) be independent random vectors with the same distribution as \((X,Y)\)
- The pairs \((X_1,Y_1)\) and \((X_2,Y_2)\) are *concordant* or *discordant* if \[ \operatorname{sign}\{ (X_1-X_2)(Y_1-Y_2) \} = 1 \hspace{1cm}\text{or}\hspace{1cm} \operatorname{sign}\{ (X_1-X_2)(Y_1-Y_2) \} = -1 \]
- Concordant pairs are indicative of increasing monotonicity between \(X\) and \(Y\)
- Discordant pairs indicate decreasing monotonicity \[ \tau_K = P\left( \text{concordant} \right) - P\left( \text{discordant} \right)\]
- If \(X\) and \(Y\) are independent then \(\tau_K = 0\)
- If \(\tau_K \ne 0\), then \(X\) and \(Y\) are dependent

- Using a random sample \((X_1,Y_1),(X_2,Y_2),\dots,(X_n,Y_n)\)
- Estimator: count the number of concordant pairs and subtract from that the number of discordant pairs
- In standardized form \[ \widehat{\tau}_K = {n \choose 2}^{-1} \sum_{i < j} \operatorname{sign} \{ (X_i-X_j)(Y_i-Y_j) \} \]
- Tests of the hypotheses can be based on the exact finite sample distribution
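The estimator can be checked against `cor(..., method = "kendall")` on simulated, tie-free data:

```r
# Hand-computed tau_hat versus R's built-in Kendall correlation (simulated data)
set.seed(2)
n <- 10
x <- rnorm(n)
y <- rnorm(n)
ij <- combn(n, 2)                        # all pairs with i < j
tau_hat <- sum(sign((x[ij[1, ]] - x[ij[2, ]]) *
                    (y[ij[1, ]] - y[ij[2, ]]))) / choose(n, 2)
all.equal(tau_hat, cor(x, y, method = "kendall"))  # TRUE
```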

- Random sample \((X_1,Y_1),(X_2,Y_2),\dots,(X_n,Y_n)\)
- Denote by \(\operatorname{R}(X_i)\) the rank of \(X_i\) among \(X_1,X_2,\dots,X_n\)
- Denote by \(\operatorname{R}(Y_i)\) as the rank of \(Y_i\) among \(Y_1,Y_2,\dots,Y_n\)
- The estimate of Spearman's \(\rho_S\) is the sample correlation coefficient of the ranks \[ r_S = \frac{\sum_{i=1}^n \left(\operatorname{R}(X_i)-\frac{n+1}{2}\right)\left(\operatorname{R}(Y_i)-\frac{n+1}{2}\right)}{n(n^2-1)/12} \]
- We reject \(H_0\) in favor of \(H_A\): \(X\) and \(Y\) are dependent for large values of \(|r_S|\)
- This test can be carried out using the exact distribution
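With no ties, \(r_S\) is exactly the Pearson correlation of the ranks, which can be verified on simulated data:

```r
# Hand-computed r_S versus R's built-in Spearman correlation (simulated data)
set.seed(3)
n <- 10
x <- rnorm(n)
y <- rnorm(n)
rs <- sum((rank(x) - (n + 1) / 2) * (rank(y) - (n + 1) / 2)) /
      (n * (n^2 - 1) / 12)
all.equal(rs, cor(x, y, method = "spearman"))  # TRUE
```

The denominator works because the ranks are a permutation of \(1,\dots,n\), so \(\sum_{i=1}^n (i - (n+1)/2)^2 = n(n^2-1)/12\).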

- Set up the following linear model (for \(i = 1,\dots,n\)) \[ Y_i = \boldsymbol{x}_i^T \boldsymbol{\beta} + e_i^* \] where \(\boldsymbol{\beta}\) is a \(p \times 1\) vector of unknown parameters
- \(\boldsymbol{\beta}\) are the parameters of interest
- Center the errors (usually at the median, \(T(e_i^*) = \alpha\)), \(e_i = e_i^* - \alpha\): \[ Y_i = \alpha + \boldsymbol{x}_i^T \boldsymbol{\beta} + e_i\]
- Let \(f(t)\) be the pdf of the errors \(e_i\)
- Assumption: \(f(t)\) can be either asymmetric or symmetric, depending on whether signs or ranks are used
- The intercept \(\alpha\) is treated separately from the slope \(\boldsymbol{\beta}\)

- Let \(\boldsymbol{Y} = (Y_1,\dots,Y_n)^T\) denote the \(n \times 1\) vector of observations
- Let \(\boldsymbol{X}\) denote the \(n \times p\) matrix with rows \(\boldsymbol{x}^T_i\)
- Then we can write the linear model in matrix form: \[ \boldsymbol{Y} = \boldsymbol{1}\alpha + \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{e} \]
- \(\boldsymbol{X}\) is centered (that's fine since we have \(\alpha\) in the model), and assume \(\boldsymbol{X}\) is full column rank
- Let \(\Omega_F\) be the column space spanned by the columns of \(\boldsymbol{X}\)
- So we can rewrite the linear model as (coordinate-free, because it is not restricted to any specific basis vectors) \[ \boldsymbol{Y} = \boldsymbol{1}\alpha + \boldsymbol{\eta} + \boldsymbol{e} \] with \(\boldsymbol{\eta} \in \Omega_F\)

- Now we can estimate \(\boldsymbol{\beta}\)
- And test hypothesis \[ H_0: \boldsymbol{M}\boldsymbol{\beta} = 0 \hspace{2cm} H_A: \boldsymbol{M}\boldsymbol{\beta} \ne 0 \]
- \(\boldsymbol{M}\) is a \(q \times p\) matrix of full rank

\[ \boldsymbol{Y} = \boldsymbol{1}\alpha + \boldsymbol{\eta} + \boldsymbol{e} \hspace{1cm}\text{with}\hspace{1cm} \boldsymbol{\eta} \in \Omega_F\]

- Task is to minimize some distance between \(\boldsymbol{Y}\) and subspace \(\Omega_F\)
- Think of \(\boldsymbol{\eta}\) as a hyperplane and the task as projecting \(\boldsymbol{Y}\) onto it
- For the projection we need to define a distance
- Instead of using the usual Euclidean distance, we use a distance based on signs and ranks \[ \| \boldsymbol{v} \|_{\varphi} = \sum_{i=1}^n a(R(v_i)) v_i \]
- with scores \(a(1) \le a(2) \le \cdots \le a(n)\) and score function \(a(i) = \varphi(i/(n+1))\)
- \(\varphi\) is nondecreasing, centered, standardized and defined on the interval \((0,1)\)

- \(\|\boldsymbol{v}\|_{\varphi}\) is a pseudo-norm:
- triangle inequality, non-negativity, \(\|\alpha \boldsymbol{v}\|_{\varphi} = |\alpha| \|\boldsymbol{v}\|_{\varphi}\), and
- additionally \(\|\boldsymbol{v}\|_{\varphi} = 0\) if and only if \(v_1 = \dots = v_n\)

- By setting \(\varphi_R(u) = \sqrt{12}(u - 1/2)\), we get the Wilcoxon pseudo-norm
- By setting \(\varphi_S(u) = \operatorname{sgn}(u - 1/2)\), we get the sign pseudo-norm (equivalent to using the \(L_1\) norm)
- In general \[ D(\boldsymbol{Y},\Omega_F) = \| \boldsymbol{Y} - \widehat{\boldsymbol{Y}}_{\varphi} \|_{\varphi} = \underset{\boldsymbol{\eta} \in \Omega_F}{\min} \| \boldsymbol{Y}-\boldsymbol{\eta} \|_{\varphi} \]
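A minimal sketch of the Wilcoxon pseudo-norm with scores \(a(i) = \varphi_R(i/(n+1))\) (the helper name `wilcoxon_norm` is ours):

```r
# Wilcoxon pseudo-norm: a(i) = sqrt(12) * (i / (n + 1) - 1/2)
wilcoxon_norm <- function(v) {
  n <- length(v)
  a <- sqrt(12) * (rank(v) / (n + 1) - 1 / 2)  # scores a(R(v_i))
  sum(a * v)
}
wilcoxon_norm(c(3, -1, 2))   # positive for a non-constant vector
wilcoxon_norm(rep(5, 4))     # 0: the pseudo-norm vanishes on constant vectors
```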

\[\widehat{\boldsymbol{\eta}} = \underset{\boldsymbol{\eta} \in \Omega_F}{\operatorname{argmin}} \| \boldsymbol{Y}-\boldsymbol{\eta} \|_{\varphi} \hspace{1cm}\text{and}\hspace{1cm} D(\boldsymbol{Y},\Omega_F) = \| \boldsymbol{Y} - \widehat{\boldsymbol{\eta}} \|_{\varphi}\]
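For a single slope, this minimization can be sketched with `optimize` applied to the Wilcoxon pseudo-norm of the residuals (simulated data; the dispersion is convex in \(\beta\), so a one-dimensional search is well behaved). In practice, rank-based fits along these lines are provided by the R package Rfit.

```r
# Rank-based (Wilcoxon) slope estimate by minimizing the residual dispersion
set.seed(4)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)                 # true slope beta = 2
dispersion <- function(beta) {             # Wilcoxon pseudo-norm of residuals
  e <- y - beta * x
  n <- length(e)
  a <- sqrt(12) * (rank(e) / (n + 1) - 1 / 2)
  sum(a * e)
}
beta_w <- optimize(dispersion, interval = c(-10, 10))$minimum
beta_w                                      # close to the true slope 2
```

The intercept drops out of the dispersion (the scores sum to zero), which is why only \(\beta\) appears in the search; \(\alpha\) is estimated afterwards, e.g. by the median of the residuals.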