Stanford University, Spring 2016, STATS 205

- Random sample \(X_1,\dots,X_{n_1}\) with cdf \(F(t)\) and pdf \(f(t)\)
- Random sample \(Y_1,\dots,Y_{n_2}\) with cdf \(F(t-\Delta)\) and pdf \(f(t-\Delta)\)
- Hypothesis test: \[H_0: \Delta = 0 \hspace{1cm} \text{versus} \hspace{1cm} H_A: \Delta \ne 0\]
- Additionally, we can compute an estimate \(\widehat{\Delta}\) of \(\Delta\) and form confidence intervals
- Wilcoxon test statistic: \[ T = \sum_{j=1}^{n_2} \operatorname{R}(Y_j)\] where \(\operatorname{R}(Y_j)\) is the rank of \(Y_j\) in the combined sample \(X_1,\dots,X_{n_1},Y_1,\dots,Y_{n_2}\)
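As an illustration, here is a small Python sketch of the rank-sum statistic (a hypothetical helper, assuming no ties; the course's own examples use R):

```python
def rank_sum_T(x, y):
    """Wilcoxon rank-sum statistic: sum of the ranks of the Y's
    in the combined sample (assumes no ties)."""
    combined = sorted(x + y)
    rank = {v: i + 1 for i, v in enumerate(combined)}  # rank of each value
    return sum(rank[v] for v in y)
```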

- The estimate of \(\Delta\) is the Hodges–Lehmann estimator \[ \widehat{\Delta} = \operatorname{median}_{i,j} \{ Y_j - X_i \}\]
- There are \(n_1 n_2\) differences
- Order the differences \(D_{(1)} < D_{(2)} < \dots < D_{(n_1 n_2)}\)
- Fix confidence level at \(1 - \alpha\)
- Find \(c\) such that \[ \alpha/2 = \operatorname{P}_{H_0} ( T \le c ) \]
- Then the interval \((D_{(c+1)},D_{(n_1 n_2-c)})\) is a \((1-\alpha) 100\%\) confidence interval for \(\Delta\)
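The construction above can be sketched in Python (a hypothetical illustration, not the course's R code): compute all \(n_1 n_2\) pairwise differences, take their median, and read the interval off the order statistics for a given critical value \(c\):

```python
import itertools
import statistics

def hodges_lehmann(x, y):
    """Hodges-Lehmann estimate: median of all n1*n2 pairwise differences Y_j - X_i."""
    diffs = sorted(yj - xi for xi, yj in itertools.product(x, y))
    return statistics.median(diffs), diffs

def shift_ci(diffs, c):
    """Distribution-free interval (D_(c+1), D_(n1*n2 - c)) for a given critical value c."""
    return diffs[c], diffs[len(diffs) - c - 1]
```

For real data, \(c\) comes from the null distribution of the rank-sum statistic (e.g. `qwilcox` in R).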

As an example, both samples are drawn from a \(t\)-distribution with 5 degrees of freedom, and the true shift parameter \(\Delta\) is set to \(8\):

```r
n1 = 11; n2 = 9; delta = 8
x = round(rt(n1, 5) * 10 + 42, 1)
y = round(rt(n2, 5) * 10 + 42 + delta, 1)
sort(x)
```

```
## [1] 20.0 27.5 29.7 36.5 41.7 42.1 45.5 46.6 47.9 49.0 50.6
```

```r
sort(y)
```

```
## [1] 25.7 32.4 37.6 38.0 39.4 52.6 55.0 59.7 80.4
```

Estimate of shift parameter \(\Delta\) and confidence intervals:

```r
wilcox.test(y, x, conf.int = TRUE)
```

```
## 
##  Wilcoxon rank sum test
## 
## data:  y and x
## W = 60, p-value = 0.4561
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -9 18
## sample estimates:
## difference in location 
##                      6
```

- Frame the two-sample location problem as a regression problem
- Combine the samples into one vector \(\boldsymbol{Z} = (X_1,\dots,X_{n_1},Y_1,\dots,Y_{n_2})^T\) of length \(n = n_1 + n_2\)
- Let \(\boldsymbol{c}\) be an \(n \times 1\) vector with
- zeros at position \(1\) to \(n_1\) and
- ones at positions \(n_1+1\) to \(n\)

- Then we can rewrite the location model as \[Z_i = \alpha + c_i \Delta + e_i\] where \(e_1,\dots,e_n\) are iid with pdf \(f(t)\)
- We can use the method of Least Squares (LS) to obtain the estimate \(\widehat{\Delta}\)
- Or the Hodges–Lehmann estimator
- The intercept \(\alpha\) is estimated in a second step on the residuals
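To see what least squares gives here, a Python sketch (with hypothetical example data): the LS slope in the dummy-variable regression reduces to the difference of sample means \(\bar{Y} - \bar{X}\):

```python
def ls_slope(z, c):
    """Ordinary least-squares slope of z on the 0/1 indicator c."""
    cbar = sum(c) / len(c)
    zbar = sum(z) / len(z)
    num = sum((ci - cbar) * (zi - zbar) for ci, zi in zip(c, z))
    den = sum((ci - cbar) ** 2 for ci in c)
    return num / den

x = [1.0, 2.0, 3.0]          # "X" sample: c_i = 0
y = [4.0, 6.0]               # "Y" sample: c_i = 1
delta_ls = ls_slope(x + y, [0] * len(x) + [1] * len(y))
```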

- Suppose, two estimators \(\widehat{\Delta}_1,\widehat{\Delta}_2\) converge in distribution \[ \sqrt{n} \left( \widehat{\Delta}_i - \Delta \right) \overset{d}{\to} N(0,\sigma^2_i) \text{ for } i = 1,2\]
- Asymptotic Relative Efficiency (ARE) between two estimators \(\widehat{\Delta}_1\) and \(\widehat{\Delta}_2\) is defined as: \[ \operatorname{ARE}\left(\widehat{\Delta}_1,\widehat{\Delta}_2\right) = \frac{\sigma_2^2}{\sigma_1^2}\]
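As a sanity check on this definition, a Python simulation sketch (hypothetical; sample size and seed are assumptions) comparing the sample median and the sample mean as location estimators under normality, where theory gives \(\operatorname{ARE}(\text{median},\text{mean}) = 2/\pi \approx 0.637\):

```python
import random
import statistics

def are_median_vs_mean(n=25, n_sim=20000, seed=1):
    """Estimate ARE(median, mean) under N(0,1) as the ratio of
    empirical variances of the two estimators over many replicates."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(n_sim):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    # ARE(median, mean) = var(mean) / var(median); should be near 2/pi
    return statistics.variance(means) / statistics.variance(medians)
```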

Contaminated normal \((0 < \epsilon < 0.5, \sigma_c > 1)\): \[F(x) = (1 - \epsilon) \Phi(x) + \epsilon \Phi(x/\sigma_c)\]

```r
n = 10000
sigmaC = 3
epsilon = 0.25
sample = c(rnorm((1 - epsilon) * n, 0, 1), rnorm(epsilon * n, 0, sigmaC))
```

| \(\epsilon\) | 0.000 | 0.010 | 0.020 | 0.030 | 0.050 | 0.100 | 0.150 | 0.250 |
|---|---|---|---|---|---|---|---|---|
| ARE(Hodges–Lehmann, LS) | 0.955 | 1.009 | 1.060 | 1.108 | 1.196 | 1.373 | 1.497 | 1.616 |

- As before, we assume that there are two populations with cdfs \(F\) and \(G\)
- The null hypothesis is that \(X\) and \(Y\) have the same but unspecified distribution \[ H_0: F(t) = G(t) \]
- Consider the alternative that \(X\) and \(Y\) have different variability with same median \[ \frac{X-\theta}{\sigma_1} \overset{d}{=} \frac{Y-\theta}{\sigma_2} \]
- Then, what is left to analyze is the dispersion \[ \eta^2 = \frac{\sigma_1^2}{\sigma_2^2} = \frac{\operatorname{Var}(X)}{\operatorname{Var}(Y)}\]

- So that our null hypothesis reduces to \[ H_0: \eta^2 = 1 \]
- We will use the Ansari-Bradley two-sample scale statistic \(C\)
- Rank combined sample from smallest to largest
- Assign score 1 to smallest and largest
- Assign score 2 to second smallest and second largest
- etc.
- If \(n\) is even: \(1,2,3,\dots,n/2,n/2,\dots,3,2,1\)
- If \(n\) is odd: \(1,2,3,\dots,(n-1)/2,(n+1)/2,(n-1)/2,\dots,3,2,1\)
- Then the statistic is the sum of the scores assigned to the \(Y\)-observations, \(C = \sum_{j=1}^{n_2} R(Y_j)\), where \(R(Y_j)\) here denotes the Ansari-Bradley score of \(Y_j\)
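A Python sketch of the score construction (hypothetical helpers, assuming no ties):

```python
def ansari_bradley_scores(n):
    """Symmetric scores 1,2,...,2,1 assigned to the ordered combined sample."""
    half = list(range(1, (n + 1) // 2 + 1))
    return half + half[::-1] if n % 2 == 0 else half + half[-2::-1]

def ansari_bradley_C(x, y):
    """C = sum of the scores attached to the Y observations (assumes no ties)."""
    combined = sorted((v, src) for src, sample in enumerate((x, y)) for v in sample)
    scores = ansari_bradley_scores(len(combined))
    return sum(s for s, (v, src) in zip(scores, combined) if src == 1)
```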

- Suppose that we have two populations which differ by location and scale
- We are interested in testing that the locations are the same
- Random sample \(X_1,\dots,X_{n_1}\) with cdf \(F(t)\)
- Random sample \(Y_1,\dots,Y_{n_2}\) with cdf \(G(t)\)
- Let \(\theta_X\) and \(\theta_Y\) denote the medians of the distributions \(F(t)\) and \(G(t)\)
- Hypothesis test \[ H_0: \theta_X = \theta_Y \hspace{1cm} \text{versus} \hspace{1cm} H_A: \theta_X < \theta_Y\]
- This is called the Behrens–Fisher problem, and the traditional test in this situation is the two-sample \(t\)-test, which uses a \(t\)-statistic with the Satterthwaite degrees-of-freedom correction
- There is the two-sample Fligner-Policello test which serves as a robust alternative to this approximate \(t\)-test

- Assume that the cdfs \(F(t)\) and \(G(t)\) are symmetric about \(\theta_X\) and \(\theta_Y\)
- Let \(P_1,\dots,P_{n_1}\) denote the placements of the \(X_i\)'s in terms of the \(Y\)-sample \[P_i = \#_j \{ Y_j < X_i \}, \quad i = 1,\dots,n_1\]
- Let \(Q_1,\dots,Q_{n_2}\) denote the placements of the \(Y_j\)'s in terms of the \(X\)-sample \[Q_j = \#_i \{X_i < Y_j \}, \quad j=1,\dots,n_2\]

- Define \[\bar{P} = \frac{1}{n_1} \sum_{i=1}^{n_1} P_i \hspace{1cm}\text{and}\hspace{1cm} \bar{Q} = \frac{1}{n_2} \sum_{j=1}^{n_2} Q_j\] \[V_1 = \sum_{i=1}^{n_1} (P_i-\bar{P})^2 \hspace{1cm}\text{and}\hspace{1cm} V_2 = \sum_{j=1}^{n_2} (Q_j-\bar{Q})^2\]
- Then the standardized test statistic is \[ U = \frac{\sum_{j=1}^{n_2} Q_j - \sum_{i=1}^{n_1} P_i}{2 \left(V_1 + V_2 + \bar{P}\bar{Q}\right)^{1/2}} \]
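The placements and the statistic \(U\) translate directly into Python (a hypothetical sketch of the formulas above, assuming no ties):

```python
from math import sqrt

def fligner_policello_U(x, y):
    """Standardized Fligner-Policello statistic built from placements."""
    P = [sum(yj < xi for yj in y) for xi in x]  # placements of the X's in the Y-sample
    Q = [sum(xi < yj for xi in x) for yj in y]  # placements of the Y's in the X-sample
    Pbar, Qbar = sum(P) / len(P), sum(Q) / len(Q)
    V1 = sum((p - Pbar) ** 2 for p in P)
    V2 = sum((q - Qbar) ** 2 for q in Q)
    return (sum(Q) - sum(P)) / (2 * sqrt(V1 + V2 + Pbar * Qbar))
```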

Monte Carlo simulation of the distribution of \(U\) under the null:

```r
n1 = length(hg); n2 = length(lg); n = n1 + n2; nSim = 10000
Shuffle = replicate(nSim, sample(n, n, replace = FALSE))
Xi = Shuffle[1:n1, ]
Yj = Shuffle[(n1+1):n, ]
U = function(Xi, Yj) {
  Pi = function(i) { sum(Yj < Xi[i]) }
  Qj = function(j) { sum(Xi < Yj[j]) }
  P = sapply(1:n1, Pi); Q = sapply(1:n2, Qj)
  Phat = mean(P); Qhat = mean(Q)
  V1 = sum((P - Phat)^2); V2 = sum((Q - Qhat)^2)
  (sum(Q) - sum(P)) / (2 * sqrt(V1 + V2 + Phat * Qhat))
}
UNull = rep(0, nSim)
for (trial in 1:nSim) {
  UNull[trial] = U(Xi[, trial], Yj[, trial])
}
```

- Hollander and Wolfe (1999) describe a study of healthy and lead-poisoned geese
- The study involved 7 healthy geese and 8 lead-poisoned geese
- The response of interest was the amount of plasma glucose in the geese in mg/100 ml of plasma
- The hypotheses of interest are: \[H_0: \theta_H = \theta_L \hspace{1cm}\text{versus}\hspace{1cm} H_A: \theta_H < \theta_L\]
- \(\theta_L\) denotes the true median plasma glucose value of lead-poisoned geese
- \(\theta_H\) denotes the true median plasma glucose value of healthy geese

Test statistic for our sample:

```r
ranks = rank(c(hg, lg))
XiObsv = ranks[1:n1]
YjObsv = ranks[(n1+1):n]
UObsv = U(XiObsv, YjObsv)
UObsv
```

```
## [1] 1.467599
```

```r
pvalue = mean(UNull >= UObsv)
pvalue
```

```
## [1] 0.0814
```

From the Monte Carlo simulation, the approximate \(p\)-value is \(0.0814\), so we fail to reject \(H_0\) at the \(5\%\) level.

- The most general test: Any difference between \(X\) and \(Y\) \[ H_0: F(t) = G(t) \hspace{1cm} \text{versus} \hspace{1cm} H_A: F(t) \ne G(t) \hspace{1cm} \text{for at least one } t\]
- The Kolmogorov-Smirnov test is such a test
- Obtain empirical distributions \(F_{n_1}\) and \(G_{n_2}\)
- \(F_{n_1}(t) = \# \{X_i \le t \} / n_1\)
- \(G_{n_2}(t) = \# \{Y_j \le t \} / n_2\)
- Assuming \(n_1 = n_2\), we get the statistic (with ordered combined observations \(Z_{(1)} \le \dots \le Z_{(n)}\)) \[J = \max_{i=1,\dots,n} \{ | F_{n_1}(Z_{(i)}) - G_{n_2}(Z_{(i)}) | \}\]
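A direct Python transcription of \(J\) (a hypothetical sketch; the empirical cdfs are evaluated at every combined observation):

```python
def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_n1(t) - G_n2(t)|
    over the combined observations."""
    def ecdf(sample, t):
        return sum(v <= t for v in sample) / len(sample)
    return max(abs(ecdf(x, t) - ecdf(y, t)) for t in x + y)
```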

- Idea: Put a metric \(d(\pi,\sigma)\) on permutations \(\pi\) and \(\sigma\)
- Possible metrics:
- Metric 1: \(K(\pi,\sigma) =\) minimum number of pairwise adjacent transpositions to go from \(\pi\) to \(\sigma\)
- Metric 2: \(R(\pi,\sigma) = \sqrt{ \sum_{i=1}^n (\pi(i)-\sigma(i))^2 }\)

- Example for \(K(\pi,\sigma)\):
- \(\pi = \{ 3,2,1 \} \text{ and } \sigma = \{ 1,2,3 \}\)
- Move 1: { 2,3,1 }
- Move 2: { 2,1,3 }
- Move 3: { 1,2,3 }
- So \(K(\pi,\sigma) = 3\)
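Both metrics are straightforward to compute; a Python sketch (hypothetical helpers) that reproduces the example above, using the fact that \(K\) equals the number of pairs ordered differently by the two permutations:

```python
from math import sqrt

def kendall_K(pi, sigma):
    """Minimum number of pairwise adjacent transpositions from pi to sigma
    (= number of discordant pairs, i.e. Kendall's tau distance)."""
    pos = {v: i for i, v in enumerate(sigma)}
    seq = [pos[v] for v in pi]                  # pi expressed in sigma's coordinates
    n = len(seq)
    return sum(seq[i] > seq[j] for i in range(n) for j in range(i + 1, n))

def spearman_R(pi, sigma):
    """Euclidean metric R(pi, sigma) on permutations."""
    return sqrt(sum((p - s) ** 2 for p, s in zip(pi, sigma)))
```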

- A four-step program
- Step 1:
- Let \(\pi\) be the observed ranks
- The metric is right-invariant \(d(\pi,\sigma) = d(\pi\tau,\sigma\tau)\)
- In other words, this takes care of all possible relabelings of observations into populations

- Step 2:
- Permutation \(\sigma\) is "equivalent" to observed \(\pi\) if and only if it assigns the same set of ranks to population 1 and thus to population 2
- This builds an equivalence class \([\pi]\) of size \(n_1!n_2!\), which is a subset of all possible permutations

- Step 3:
- Construct extremal set \(E\) containing all permutations which are most in agreement with \(H_A\) and least with \(H_0\)

- Step 4:
- The test statistic is now defined by choosing a distance \(d([\pi],E)\) to measure how far we are from \(H_A\)
- With the metric \(K(\pi,\sigma)\), this yields the Mann-Whitney test statistic
- With \(R^2(\pi,\sigma)\), this is equivalent to the Wilcoxon test statistic

- Reference: Critchlow (1986), *A Unified Approach to Constructing Nonparametric Rank Tests*