Two-Sample Problems (Part 2)

Stanford University, Spring 2016, STATS 205

Analyses for a Shift in Location

Random sample \(X_1,\dots,X_{n_1}\) with cdf \(F(t)\) and pdf \(f(t)\)
Random sample \(Y_1,\dots,Y_{n_2}\) with cdf \(F(t-\Delta)\) and pdf \(f(t-\Delta)\)
Hypothesis test: \[H_0: \Delta = 0 \hspace{1cm} \text{versus} \hspace{1cm} H_A: \Delta \ne 0\]
Additionally, we can estimate \(\Delta\) and form confidence intervals
Wilcoxon test statistic: \[ T = \sum_{j=1}^{n_2} \operatorname{R}(Y_j)\] where \(\operatorname{R}(Y_j)\) is the rank of \(Y_j\) among the combined samples \(X_1,\dots,X_{n_1},Y_1,\dots,Y_{n_2}\)
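As a minimal sketch, the rank sum can be computed by hand and compared with what `wilcox.test` reports. The data are the sample values printed in the example later in this lecture; the names `T_ranksum` and `W` are illustrative. Note that R reports the Mann-Whitney form \(W = T - n_2(n_2+1)/2\) rather than the rank sum itself.

```r
# Sample values as printed in the lecture's example
x <- c(20.0, 27.5, 29.7, 36.5, 41.7, 42.1, 45.5, 46.6, 47.9, 49.0, 50.6)
y <- c(25.7, 32.4, 37.6, 38.0, 39.4, 52.6, 55.0, 59.7, 80.4)
n1 <- length(x); n2 <- length(y)

# Ranks within the combined sample; the Y's occupy positions n1+1, ..., n1+n2
r <- rank(c(x, y))
T_ranksum <- sum(r[(n1 + 1):(n1 + n2)])

# Mann-Whitney form: number of pairs with Y_j > X_i
W <- T_ranksum - n2 * (n2 + 1) / 2

T_ranksum                             # 105
W                                     # 60
unname(wilcox.test(y, x)$statistic)   # 60
```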

Analyses for a Shift in Location

The estimate of \(\Delta\) is the Hodges–Lehmann estimator \[ \widehat{\Delta} = \operatorname{median}_{i,j} \{ Y_j - X_i \}\]
There are \(n_1 n_2\) differences
Order the differences \(D_{(1)} < D_{(2)} < \dots < D_{(n_1 n_2)}\)
Fix confidence level at \(1 - \alpha\)
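Find \(c\) such that \[ \alpha/2 = \operatorname{P}_{H_0} ( T^+ \le c ) \] where \(T^+ = \#_{i,j}\{Y_j - X_i > 0\} = T - n_2(n_2+1)/2\) is the Mann–Whitney form of the Wilcoxon statistic
Then the interval \((D_{(c+1)},D_{(n_1 n_2-c)})\) is a \((1-\alpha) 100\%\) confidence interval for \(\Delta\)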
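A minimal sketch of the estimator and the interval, using the sample values printed in the example that follows. It assumes the critical value is taken from `qwilcox` on the Mann-Whitney scale; because the null distribution is discrete, the sketch uses the largest \(c\) whose lower tail probability stays below \(\alpha/2\) rather than exact equality. The names `D`, `c_crit` and `ci` are illustrative.

```r
# Sample values as printed in the lecture's example
x <- c(20.0, 27.5, 29.7, 36.5, 41.7, 42.1, 45.5, 46.6, 47.9, 49.0, 50.6)
y <- c(25.7, 32.4, 37.6, 38.0, 39.4, 52.6, 55.0, 59.7, 80.4)
n1 <- length(x); n2 <- length(y)

# All n1*n2 pairwise differences Y_j - X_i, in increasing order
D <- sort(as.vector(outer(y, x, "-")))

# Hodges-Lehmann estimate: median of the pairwise differences
delta_hat <- median(D)

# Lower alpha/2 critical point of the Mann-Whitney count under H0
alpha <- 0.05
c_crit <- qwilcox(alpha / 2, n2, n1) - 1

# Confidence interval (D_(c+1), D_(n1*n2 - c))
ci <- c(D[c_crit + 1], D[n1 * n2 - c_crit])

delta_hat
ci
```

For these data the sketch gives an estimate of 6 and the interval (-9, 18), which agrees with the `wilcox.test` output shown in the example that follows.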

Analyses for a Shift in Location Example

Simulated data: scaled and shifted draws from a \(t\)-distribution with 5 degrees of freedom,
with the true shift parameter \(\Delta\) set to \(8\)

n1 = 11
n2 = 9
delta = 8
x = round(rt(n1,5)*10+42,1)
y = round(rt(n2,5)*10+42+delta,1)
sort(x)

##  [1] 20.0 27.5 29.7 36.5 41.7 42.1 45.5 46.6 47.9 49.0 50.6

sort(y)

## [1] 25.7 32.4 37.6 38.0 39.4 52.6 55.0 59.7 80.4

Analyses for a Shift in Location Example

Estimate of shift parameter \(\Delta\) and confidence intervals:

wilcox.test(y,x,conf.int=TRUE)

## 
##  Wilcoxon rank sum test
## 
## data:  y and x
## W = 60, p-value = 0.4561
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -9 18
## sample estimates:
## difference in location 
##                      6

Linear Regression Model

Frame the two-sample location problem as a regression problem
Combine the samples into one vector \(\boldsymbol{Z} = (X_1,\dots,X_{n_1},Y_1,\dots,Y_{n_2})^T\) of length \(n = n_1 + n_2\)
Let \(\boldsymbol{c}\) be an \(n \times 1\) vector with
- zeros at positions \(1\) to \(n_1\) and
- ones at positions \(n_1+1\) to \(n\)
Then we can rewrite the location model as \[Z_i = \alpha + c_i \Delta + e_i\] where \(e_1,\dots,e_n\) are iid with pdf \(f(t)\)
We can use the method of Least Squares (LS) to estimate \(\Delta\)
Or the Hodges–Lehmann estimator
The intercept \(\alpha\) is estimated in a second step from the residuals
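The regression formulation can be sketched as follows, using the sample values printed in the earlier example. The indicator is named `grp` here (rather than `c`, which is a base R function), and taking the median of the residuals as the second-step intercept is an assumption of this sketch, not something the slide specifies.

```r
# Sample values as printed in the lecture's example
x <- c(20.0, 27.5, 29.7, 36.5, 41.7, 42.1, 45.5, 46.6, 47.9, 49.0, 50.6)
y <- c(25.7, 32.4, 37.6, 38.0, 39.4, 52.6, 55.0, 59.7, 80.4)
n1 <- length(x); n2 <- length(y)

# Combined response and 0/1 group indicator (the vector c in the text)
z <- c(x, y)
grp <- c(rep(0, n1), rep(1, n2))

# Least squares: the slope on the indicator is the difference in sample means
fit_ls <- lm(z ~ grp)
delta_ls <- unname(coef(fit_ls)["grp"])

# Hodges-Lehmann: median of all pairwise differences Y_j - X_i
delta_hl <- median(outer(y, x, "-"))

# Second step (assumed here): intercept as the median of the residuals
alpha_hl <- median(z - grp * delta_hl)

c(LS = delta_ls, HL = delta_hl, intercept = alpha_hl)
```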

Efficiency of Estimator

Suppose two estimators \(\widehat{\Delta}_1,\widehat{\Delta}_2\) are each asymptotically normal: \[ \sqrt{n} \left( \widehat{\Delta}_i - \Delta \right) \overset{d}{\to} N(0,\sigma^2_i) \text{ for } i = 1,2\]
Asymptotic Relative Efficiency (ARE) between two estimators \(\widehat{\Delta}_1\) and \(\widehat{\Delta}_2\) is defined as: \[ \operatorname{ARE}\left(\widehat{\Delta}_1,\widehat{\Delta}_2\right) = \frac{\sigma_2^2}{\sigma_1^2}\]

Efficiency of Estimator

Contaminated normal \((0 < \epsilon < 0.5, \sigma_c > 1)\): \[F(x) = (1 - \epsilon) \Phi(x) + \epsilon \Phi(x/\sigma_c)\]

n = 10000
sigmaC = 3
epsilon = 0.25
sample = c(rnorm((1-epsilon)*n,0,1),rnorm(epsilon*n,0,sigmaC))

Efficiency of Estimator

##                         [,1]  [,2] [,3]  [,4]  [,5]  [,6]  [,7]  [,8]
## epsilon                0.000 0.010 0.02 0.030 0.050 0.100 0.150 0.250
## ARE(Hodges–Lehmann,LS) 0.955 1.009 1.06 1.108 1.196 1.373 1.497 1.616
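The table above can be reproduced, as a sketch, from the standard expression for the efficiency of the Hodges-Lehmann estimator relative to least squares, \(12\,\sigma^2 \left(\int f(t)^2\,dt\right)^2\), where \(f\) is the error density and \(\sigma^2\) its variance. The function name `are_hl_ls` is illustrative, and the integral is evaluated numerically with `integrate` rather than in closed form.

```r
# ARE(Hodges-Lehmann, LS) for the contaminated normal with contamination
# proportion epsilon and contaminating standard deviation sigma_c
are_hl_ls <- function(epsilon, sigma_c = 3) {
  # Density of (1 - epsilon) * N(0, 1) + epsilon * N(0, sigma_c^2)
  f <- function(t) (1 - epsilon) * dnorm(t) + epsilon * dnorm(t, sd = sigma_c)
  # Variance of the mixture (both components have mean zero)
  var_f <- (1 - epsilon) + epsilon * sigma_c^2
  # Numerical value of the integral of f^2
  int_f2 <- integrate(function(t) f(t)^2, -Inf, Inf)$value
  12 * var_f * int_f2^2
}

eps <- c(0, 0.01, 0.02, 0.03, 0.05, 0.10, 0.15, 0.25)
round(sapply(eps, are_hl_ls), 3)
```

At \(\epsilon = 0\) the expression reduces to \(3/\pi \approx 0.955\), the familiar efficiency under normal errors.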

Test for Dispersion

As before, we assume that there are two populations with cdfs \(F\) and \(G\)
The null hypothesis is that \(X\) and \(Y\) have the same but unspecified distribution \[ H_0: F(t) = G(t) \]
Consider the alternative that \(X\) and \(Y\) have different variability with same median \[ \frac{X-\theta}{\sigma_1} \overset{d}{=} \frac{Y-\theta}{\sigma_2} \]
Then what remains to be analyzed is the ratio of dispersions \[ \eta^2 = \frac{\sigma_1^2}{\sigma_2^2} = \frac{\operatorname{Var}(X)}{\operatorname{Var}(Y)}\]

Test for Dispersion

The null hypothesis then reduces to \[ H_0: \eta^2 = 1 \]
We will use the Ansari-Bradley two-sample scale statistic \(C\)
Rank combined sample from smallest to largest
Assign score 1 to smallest and largest
Assign score 2 to second smallest and second largest
etc.
If \(n\) is even: \(1,2,3,\dots,n/2,n/2,\dots,3,2,1\)
If \(n\) is odd: \(1,2,3,\dots,(n-1)/2,(n+1)/2,(n-1)/2,\dots,3,2,1\)
The statistic is the sum of the scores assigned to the \(Y\)-sample, \(C = \sum_{j=1}^{n_2} R(Y_j)\), where \(R(Y_j)\) here denotes the score (not the ordinary rank) of \(Y_j\)
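The scoring scheme above can be sketched in a few lines. The two samples are hypothetical values chosen only for illustration, and `C_stat` is an illustrative name. The comparison with `ansari.test` relies on that function reporting the score sum of its first argument, which is why the \(Y\)-sample is passed first.

```r
# Hypothetical samples, for illustration only
x <- c(1.2, 3.4, 5.0, 7.1)
y <- c(-6.0, 0.5, 4.4, 9.9, 12.3)
n1 <- length(x); n2 <- length(y); n <- n1 + n2

# Ordinary ranks in the combined sample
r <- rank(c(x, y))

# Ansari-Bradley scores: 1 for the smallest and largest, 2 for the next, etc.
scores <- pmin(r, n - r + 1)

# Sum of the scores belonging to the Y-sample
C_stat <- sum(scores[(n1 + 1):n])

C_stat                                # 11
unname(ansari.test(y, x)$statistic)   # 11
```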

Behrens–Fisher Problem

Suppose that we have two populations which differ by location and scale
We are interested in testing that the locations are the same
Random sample \(X_1,\dots,X_{n_1}\) with cdf \(F(t)\)
Random sample \(Y_1,\dots,Y_{n_2}\) with cdf \(G(t)\)
Let \(\theta_X\) and \(\theta_Y\) denote the medians of the distributions \(F(t)\) and \(G(t)\)
Hypothesis test \[ H_0: \theta_X = \theta_Y \hspace{1cm} \text{versus} \hspace{1cm} H_A: \theta_X < \theta_Y\]
This is called the Behrens–Fisher problem
The traditional test in this situation is the two-sample \(t\)-test with the Satterthwaite degrees of freedom correction
The two-sample Fligner–Policello test serves as a robust alternative to this approximate \(t\)-test

Behrens–Fisher Problem

Assume that the cdfs \(F(t)\) and \(G(t)\) are symmetric about \(\theta_X\) and \(\theta_Y\)
Let \(P_1,\dots,P_{n_1}\) denote the placements of the \(X_i\)'s in terms of the \(Y\)-sample \[P_i = \#_j \{ Y_j < X_i \}, i = 1,\dots,n_1\]
Let \(Q_1,\dots,Q_{n_2}\) denote the placements of the \(Y_j\)'s in terms of the \(X\)-sample \[Q_j = \#_i \{X_i < Y_j \}, j=1,\dots,n_2\]

Behrens–Fisher Problem

Define \[\bar{P} = \frac{1}{n_1} \sum_{i=1}^{n_1} P_i \hspace{1cm}\text{and}\hspace{1cm} \bar{Q} = \frac{1}{n_2} \sum_{j=1}^{n_2} Q_j\] \[V_1 = \sum_{i=1}^{n_1} (P_i-\bar{P})^2 \hspace{1cm}\text{and}\hspace{1cm} V_2 = \sum_{j=1}^{n_2} (Q_j-\bar{Q})^2\]
Then the standardized test statistic is \[ U = \frac{\sum_{j=1}^{n_2} Q_j - \sum_{i=1}^{n_1} P_i}{2 \left(V_1 + V_2 + \bar{P}\bar{Q}\right)^{1/2}} \]
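A small worked sketch of the placements and the resulting statistic. The two samples are hypothetical values chosen only for illustration; without ties, the placements always satisfy \(\sum_i P_i + \sum_j Q_j = n_1 n_2\), which gives a quick sanity check.

```r
# Hypothetical samples, for illustration only
x <- c(1.2, 3.4, 5.0, 7.1)
y <- c(2.0, 6.3, 8.8, 9.9, 12.3)

# Placements: how many observations of the other sample lie below each value
P <- sapply(x, function(xi) sum(y < xi))   # 0 1 1 2
Q <- sapply(y, function(yj) sum(x < yj))   # 1 3 4 4 4

# Means and sums of squared deviations of the placements
Pbar <- mean(P); Qbar <- mean(Q)
V1 <- sum((P - Pbar)^2)
V2 <- sum((Q - Qbar)^2)

# Standardized Fligner-Policello statistic
U_stat <- (sum(Q) - sum(P)) / (2 * sqrt(V1 + V2 + Pbar * Qbar))
U_stat
```

Here the numerator is \(16 - 4 = 12\) and the denominator is \(2\sqrt{2 + 6.8 + 3.2} = 2\sqrt{12}\), so the statistic equals \(\sqrt{3}\).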

Behrens–Fisher Problem

Monte Carlo simulation of the distribution of \(U\) under the null

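# hg and lg are assumed to hold the observed responses of the two groups
# (healthy and lead-poisoned geese in the example that follows)
n1 = length(hg); n2 = length(lg); n = n1+n2; nSim = 10000
# Each column is a random permutation of the ranks 1..n
Shuffle = replicate(nSim,sample(n,n,replace = FALSE))
Xi = Shuffle[1:n1,]       # ranks assigned to the X-sample
Yj = Shuffle[(n1+1):n,]   # ranks assigned to the Y-sample

# Fligner-Policello statistic computed from the placements
U = function(Xi,Yj) {
  P = sapply(Xi,function(x) sum(Yj < x))   # placements of the X's among the Y's
  Q = sapply(Yj,function(y) sum(Xi < y))   # placements of the Y's among the X's
  Pbar = mean(P); Qbar = mean(Q)
  V1 = sum((P-Pbar)^2); V2 = sum((Q-Qbar)^2)
  (sum(Q)-sum(P))/(2*sqrt(V1+V2+Pbar*Qbar)) }

# Simulated null distribution of U, one value per shuffled assignment
UNull = sapply(1:nSim,function(trial) U(Xi[,trial],Yj[,trial]))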

Behrens–Fisher Problem Example

Hollander and Wolfe (1999) describe a study of healthy and lead-poisoned geese
The study involved 7 healthy geese and 8 lead-poisoned geese
The response of interest was the amount of plasma glucose in the geese in mg/100 ml of plasma
The hypotheses of interest are: \[H_0: \theta_H = \theta_L \hspace{1cm}\text{versus}\hspace{1cm} H_A: \theta_H < \theta_L\]
\(\theta_L\) denotes the true median plasma glucose value of lead-poisoned geese
\(\theta_H\) denotes the true median plasma glucose value of healthy geese

Behrens–Fisher Problem Example

Test statistic of our sample

ranks = rank(c(hg,lg))
XiObsv = ranks[1:n1]
YjObsv = ranks[(n1+1):n]
UObsv = U(XiObsv,YjObsv)
UObsv

## [1] 1.467599

pvalue = mean(UNull >= UObsv)
pvalue

## [1] 0.0814

Behrens–Fisher Problem Example

From Monte Carlo simulation:

General Difference in Two Populations

The most general test: Any difference between \(X\) and \(Y\) \[ H_0: F(t) = G(t) \hspace{1cm} \text{versus} \hspace{1cm} H_A: F(t) \ne G(t) \hspace{1cm} \text{for at least one } t\]
The Kolmogorov-Smirnov test is such a test
Obtain empirical distributions \(F_{n_1}\) and \(G_{n_2}\)
\(F_{n_1}(t) = \# \{X_i \le t \} / n_1\)
\(G_{n_2}(t) = \# \{Y_j \le t \} / n_2\)
Assuming \(n_1 = n_2\), we get the statistic (with ordered combined observations \(Z_{(1)} \le \dots \le Z_{(n)}\)) \[J = \max_{i=1,\dots,n} \{ | F_{n_1}(Z_{(i)}) - G_{n_2}(Z_{(i)}) | \}\]
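A minimal sketch of this statistic, computed from the two empirical cdfs and compared with the statistic returned by `ks.test`. The two samples are hypothetical values of equal size chosen only for illustration; `Fn`, `Gn` and `J` are illustrative names.

```r
# Hypothetical samples of equal size, for illustration only
x <- c(1.2, 3.4, 5.0, 7.1, 8.0)
y <- c(-6.0, 0.5, 4.4, 9.9, 12.3)

# Empirical cdfs of the two samples
Fn <- ecdf(x)
Gn <- ecdf(y)

# Evaluate both at the ordered combined observations and take the largest gap
z <- sort(c(x, y))
J <- max(abs(Fn(z) - Gn(z)))

J                                  # 0.4
unname(ks.test(x, y)$statistic)    # 0.4
```

Both empirical cdfs are step functions that change only at observed values, so checking the combined observations is enough to find the maximum gap.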

Do it Yourself Test Statistics

Idea: Put a metric \(d(\pi,\sigma)\) on permutations \(\pi\) and \(\sigma\)
Possible metrics:
- Metric 1: \(K(\pi,\sigma) =\) minimum number of pairwise adjacent transpositions needed to go from \(\pi\) to \(\sigma\)
- Metric 2: \(R(\pi,\sigma) = \sqrt{ \sum_{i=1}^n (\pi(i)-\sigma(i))^2 }\)
Example for \(K(\pi,\sigma)\):
- \(\pi = \{ 3,2,1 \} \text{ and } \sigma = \{ 1,2,3 \}\)
- Move 1: { 2,3,1 }
- Move 2: { 2,1,3 }
- Move 3: { 1,2,3 }
- So \(K(\pi,\sigma) = 3\)
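Both metrics are easy to compute directly. The sketch below obtains \(K\) by counting discordant pairs, a standard way to get the minimum number of adjacent transpositions, and \(R\) from its definition. The function names `kendall_dist` and `spearman_dist` are illustrative.

```r
# K: number of pairs (i, j) that the two permutations order differently
kendall_dist <- function(p, s) {
  pairs <- combn(length(p), 2)
  i <- pairs[1, ]; j <- pairs[2, ]
  sum((p[i] - p[j]) * (s[i] - s[j]) < 0)
}

# R: Euclidean distance between the two permutations viewed as vectors
spearman_dist <- function(p, s) sqrt(sum((p - s)^2))

p <- c(3, 2, 1)
s <- c(1, 2, 3)

kendall_dist(p, s)    # 3, matching the three moves in the example
spearman_dist(p, s)   # sqrt(8)
```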

Do it Yourself Test Statistics

Four-step program
Step 1:
- Let \(\pi\) be the observed ranks
- The metric is right-invariant: \(d(\pi,\sigma) = d(\pi\tau,\sigma\tau)\)
- In other words, this takes care of all possible relabelings of observations into populations
Step 2:
- Permutation \(\sigma\) is "equivalent" to observed \(\pi\) if and only if it assigns the same set of ranks to population 1 and thus to population 2
- This builds an equivalence class \([\pi]\) of size \(n_1!n_2!\), which is a subset of all possible permutations

Do it Yourself Test Statistics

Step 3:
- Construct extremal set \(E\) containing all permutations which are most in agreement with \(H_A\) and least with \(H_0\)
Step 4:
- The test statistic is now defined by choosing a distance \(d([\pi],E)\) to measure how far we are from \(H_A\)
- With \(K(\pi,\sigma)\) this gives the Mann–Whitney test statistic
- With \(R^2(\pi,\sigma)\) this gives a statistic equivalent to the Wilcoxon test statistic
Reference: Critchlow (1986), A Unified Approach to Constructing Nonparametric Rank Tests