Stanford University, Spring 2016, STATS 205

## Analyses for a Shift in Location

• Random sample $$X_1,\dots,X_{n_1}$$ with cdf $$F(t)$$ and pdf $$f(t)$$
• Random sample $$Y_1,\dots,Y_{n_2}$$ with cdf $$F(t-\Delta)$$ and pdf $$f(t-\Delta)$$
• Hypothesis test: $H_0: \Delta = 0 \hspace{1cm} \text{versus} \hspace{1cm} H_A: \Delta \ne 0$
• Additionally, we can estimate $$\Delta$$ by $$\widehat{\Delta}$$ and form confidence intervals
• Wilcoxon test statistic: $T = \sum_{j=1}^{n_2} \operatorname{R}(Y_j)$ where $$\operatorname{R}(Y_j)$$ is the rank of $$Y_j$$ among the combined samples $$X_1,\dots,X_{n_1},Y_1,\dots,Y_{n_2}$$

## Analyses for a Shift in Location

• The estimate of $$\Delta$$ is the Hodges–Lehmann estimator $\widehat{\Delta} = \operatorname{median}_{i,j} \{ Y_j - X_i \}$
• There are $$n_1 n_2$$ differences
• Order the differences $$D_{(1)} < D_{(2)} < \dots < D_{(n_1 n_2)}$$
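• Fix the confidence level at $$1 - \alpha$$
• Find $$c$$ such that $\alpha/2 = \operatorname{P}_{H_0} ( T^+ \le c )$ where $$T^+ = T - n_2(n_2+1)/2$$ is the Mann–Whitney form of the statistic (the number of positive differences $$Y_j - X_i$$)
• Then the interval $$(D_{(c+1)},D_{(n_1 n_2-c)})$$ is a $$(1-\alpha) 100\%$$ confidence interval for $$\Delta$$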

## Analyses for a Shift in Location Example

Simulated data: the errors follow a $$t$$-distribution with 5 degrees of freedom and
the true shift parameter $$\Delta$$ is set to $$8$$

n1 = 11
n2 = 9
delta = 8
x = round(rt(n1,5)*10+42,1)
y = round(rt(n2,5)*10+42+delta,1)
sort(x)
##  [1] 20.0 27.5 29.7 36.5 41.7 42.1 45.5 46.6 47.9 49.0 50.6
sort(y)
## [1] 25.7 32.4 37.6 38.0 39.4 52.6 55.0 59.7 80.4

## Analyses for a Shift in Location Example

Estimate of shift parameter $$\Delta$$ and confidence intervals:

wilcox.test(y,x,conf.int=TRUE)
##
##  Wilcoxon rank sum test
##
## data:  y and x
## W = 60, p-value = 0.4561
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -9 18
## sample estimates:
## difference in location
##                      6
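The numbers reported by `wilcox.test` can be reproduced by hand, which makes the definitions concrete. The following is a minimal sketch, assuming the sorted values printed above are the full samples (no ties) and using `qwilcox` to find $$c$$ on the Mann–Whitney scale. The helper name `hl_shift` is made up for illustration.

```r
# Samples as printed by sort(x) and sort(y) above
x <- c(20.0, 27.5, 29.7, 36.5, 41.7, 42.1, 45.5, 46.6, 47.9, 49.0, 50.6)
y <- c(25.7, 32.4, 37.6, 38.0, 39.4, 52.6, 55.0, 59.7, 80.4)
n1 <- length(x); n2 <- length(y)

# Wilcoxon rank-sum statistic: ranks of the Y's in the combined sample
T_stat <- sum(rank(c(x, y))[(n1 + 1):(n1 + n2)])
# Mann-Whitney form, the W reported by wilcox.test(y, x)
W <- T_stat - n2 * (n2 + 1) / 2

# hl_shift is a hypothetical helper, not part of base R
hl_shift <- function(x, y, conf_level = 0.95) {
  d <- sort(as.vector(outer(y, x, "-")))   # all n1*n2 differences Y_j - X_i
  alpha <- 1 - conf_level
  # largest c with P(W <= c) < alpha/2 under H0 (assumes no ties)
  c_val <- max(qwilcox(alpha / 2, length(y), length(x)) - 1, 0)
  list(estimate = median(d),
       conf_int = c(d[c_val + 1], d[length(d) - c_val]))
}

res <- hl_shift(x, y)
print(c(T = T_stat, W = W))
print(res)
```

With these samples the sketch gives $$T = 105$$, $$W = 60$$, an estimate of 6 and the interval $$(-9, 18)$$, matching the output above up to floating-point rounding.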

## Linear Regression Model

• Frame the two-sample location problem as a regression problem
• Combine the samples into one vector $$\boldsymbol{Z} = (X_1,\dots,X_{n_1},Y_1,\dots,Y_{n_2})^T$$
• Let $$\boldsymbol{c}$$ be an $$n \times 1$$ vector, $$n = n_1 + n_2$$, with
• zeros at positions $$1$$ to $$n_1$$ and
• ones at positions $$n_1+1$$ to $$n$$
• Then we can rewrite the location model as $Z_i = \alpha + c_i \Delta + e_i$ where $$e_1,\dots,e_n$$ are iid with pdf $$f(t)$$
• We can use the method of Least Squares (LS) to estimate $$\Delta$$
• Or the Hodges–Lehmann estimator
• The intercept $$\alpha$$ is estimated in a second step on the residuals
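The regression formulation can be checked numerically. The sketch below reuses the samples printed in the shift example (an assumption made only to have concrete numbers), fits the model by least squares with `lm`, and computes the Hodges–Lehmann slope from the pairwise differences. Taking the intercept as the median of the residuals $$Z_i - c_i\widehat{\Delta}$$ is one common choice for the second step; it is assumed here rather than taken from the slides.

```r
x <- c(20.0, 27.5, 29.7, 36.5, 41.7, 42.1, 45.5, 46.6, 47.9, 49.0, 50.6)
y <- c(25.7, 32.4, 37.6, 38.0, 39.4, 52.6, 55.0, 59.7, 80.4)

z  <- c(x, y)                                         # combined sample Z
ci <- rep(c(0, 1), times = c(length(x), length(y)))   # indicator vector c

# Least squares: the slope equals the difference in sample means
fit <- lm(z ~ ci)
delta_ls <- unname(coef(fit)["ci"])

# Hodges-Lehmann: median of all pairwise differences Y_j - X_i
delta_hl <- median(outer(y, x, "-"))

# Second step (assumed choice): intercept as the median of the residuals
alpha_hl <- median(z - ci * delta_hl)

print(c(LS = delta_ls, HL = delta_hl, intercept = alpha_hl))
```

The least squares slope is pulled upward by the single large value 80.4 in the second sample, while the Hodges–Lehmann slope stays at 6.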

## Efficiency of Estimator

• Suppose two estimators $$\widehat{\Delta}_1,\widehat{\Delta}_2$$ are asymptotically normal: $\sqrt{n} \left( \widehat{\Delta}_i - \Delta \right) \overset{d}{\to} N(0,\sigma^2_i) \text{ for } i = 1,2$
• Asymptotic Relative Efficiency (ARE) between two estimators $$\widehat{\Delta}_1$$ and $$\widehat{\Delta}_2$$ is defined as: $\operatorname{ARE}\left(\widehat{\Delta}_1,\widehat{\Delta}_2\right) = \frac{\sigma_2^2}{\sigma_1^2}$

## Efficiency of Estimator

Contaminated normal $$(0 < \epsilon < 0.5, \sigma_c > 1)$$: $F(x) = (1 - \epsilon) \Phi(x) + \epsilon \Phi(x/\sigma_c)$

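n = 10000
sigmaC = 3
epsilon = 0.25
# (1-epsilon)*n draws from N(0,1) and epsilon*n draws from N(0,sigmaC^2);
# named 'contaminated' to avoid masking base::sample
contaminated = c(rnorm((1-epsilon)*n,0,1),rnorm(epsilon*n,0,sigmaC))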

## Efficiency of Estimator

##                         [,1]  [,2] [,3]  [,4]  [,5]  [,6]  [,7]  [,8]
## epsilon                0.000 0.010 0.02 0.030 0.050 0.100 0.150 0.250
## ARE(Hodges–Lehmann,LS) 0.955 1.009 1.06 1.108 1.196 1.373 1.497 1.616
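For the Hodges–Lehmann estimator versus least squares, a standard result gives $$\operatorname{ARE} = 12\,\sigma_f^2 \left(\int f^2(t)\,dt\right)^2$$, where $$\sigma_f^2$$ is the variance of the error density $$f$$. The sketch below evaluates this for the contaminated normal by numerical integration. The function name `are_hl_ls` is made up for illustration, and $$\sigma_c = 3$$ is assumed as in the simulation code.

```r
# are_hl_ls is a hypothetical helper: ARE(Hodges-Lehmann, LS) for the
# contaminated normal (1 - epsilon) N(0, 1) + epsilon N(0, sigma_c^2)
are_hl_ls <- function(epsilon, sigma_c = 3) {
  f <- function(t) (1 - epsilon) * dnorm(t) + epsilon * dnorm(t, sd = sigma_c)
  var_f  <- (1 - epsilon) + epsilon * sigma_c^2          # variance of f
  int_f2 <- integrate(function(t) f(t)^2, -Inf, Inf)$value
  12 * var_f * int_f2^2
}

eps <- c(0, 0.01, 0.02, 0.03, 0.05, 0.10, 0.15, 0.25)
are <- sapply(eps, are_hl_ls)
print(round(rbind(epsilon = eps, ARE = are), 3))
```

At $$\epsilon = 0$$ the value is $$3/\pi \approx 0.955$$ and at $$\epsilon = 0.25$$ it is about 1.616, the two endpoints of the table.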

## Test for Dispersion

• As before, we assume that there are two populations with cdfs $$F$$ and $$G$$
• The null hypothesis is that $$X$$ and $$Y$$ have the same but unspecified distribution $H_0: F(t) = G(t)$
• Consider the alternative that $$X$$ and $$Y$$ have different variability with the same median $$\theta$$: $\frac{X-\theta}{\sigma_1} \overset{d}{=} \frac{Y-\theta}{\sigma_2}$
• Then what is left to analyze is the dispersion ratio $\eta^2 = \frac{\sigma_1^2}{\sigma_2^2} = \frac{\operatorname{Var}(X)}{\operatorname{Var}(Y)}$

## Test for Dispersion

• The null hypothesis then reduces to $H_0: \eta^2 = 1$
• We will use the Ansari-Bradley two-sample scale statistic $$C$$
• Rank combined sample from smallest to largest
• Assign score 1 to smallest and largest
• Assign score 2 to second smallest and second largest
• etc.
• If $$n$$ is even: $$1,2,3,\dots,n/2,n/2,\dots,3,2,1$$
• If $$n$$ is odd: $$1,2,3,\dots,(n-1)/2,(n+1)/2,(n-1)/2,\dots,3,2,1$$
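• Then the statistic is the sum of the scores assigned to the $$Y$$-sample, $$C = \sum_{j=1}^{n_2} R(Y_j)$$, where $$R(Y_j)$$ now denotes the score (not the ordinary rank) of $$Y_j$$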
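The scoring scheme is easy to write down in code: with ordinary ranks $$r$$, the score is the smaller of $$r$$ and $$n - r + 1$$. The following sketch uses made-up toy data in which the $$Y$$'s are more spread out than the $$X$$'s.

```r
# Toy data (made up): the Y's sit at both extremes
x <- c(4, 5, 6)
y <- c(1, 2, 9, 10)
n <- length(x) + length(y)

r <- rank(c(x, y))                      # ordinary ranks in the combined sample
scores <- pmin(r, n - r + 1)            # 1, 2, 3, ... counted from both ends
C <- sum(scores[(length(x) + 1):n])     # sum of the scores of the Y-sample

print(scores)
print(C)
```

Here $$C = 6$$, the smallest value four observations out of seven can attain, which points to larger dispersion in the $$Y$$-sample. Base R's `ansari.test(y, x)` reports the same statistic, since it sums the scores of its first argument.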

## Behrens–Fisher Problem

• Suppose that we have two populations which differ by location and scale
• We are interested in testing that the locations are the same
• Random sample $$X_1,\dots,X_{n_1}$$ with cdf $$F(t)$$
• Random sample $$Y_1,\dots,Y_{n_2}$$ with cdf $$G(t)$$
• Let $$\theta_X$$ and $$\theta_Y$$ denote the medians of the distributions $$F(t)$$ and $$G(t)$$
• Hypothesis test $H_0: \theta_X = \theta_Y \hspace{1cm} \text{versus} \hspace{1cm} H_A: \theta_X < \theta_Y$
• This is called the Behrens–Fisher problem and the traditional test in this situation is the two-sample $$t$$-test which uses a $$t$$-statistic with the Satterthwaite degrees of freedom correction
• There is the two-sample Fligner-Policello test which serves as a robust alternative to this approximate $$t$$-test
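The Satterthwaite correction replaces the pooled degrees of freedom by an estimate built from the two sample variances. A minimal sketch, reusing the samples from the earlier shift example as stand-in data (an assumption; they were not generated with unequal scales), compares the hand-computed values with `t.test`, whose default `var.equal = FALSE` gives this approximate $$t$$-test:

```r
x <- c(20.0, 27.5, 29.7, 36.5, 41.7, 42.1, 45.5, 46.6, 47.9, 49.0, 50.6)
y <- c(25.7, 32.4, 37.6, 38.0, 39.4, 52.6, 55.0, 59.7, 80.4)

v1 <- var(x) / length(x)     # squared standard error of mean(x)
v2 <- var(y) / length(y)     # squared standard error of mean(y)

# t-statistic with unpooled variances
t_stat <- (mean(y) - mean(x)) / sqrt(v1 + v2)
# Satterthwaite approximation to the degrees of freedom
df <- (v1 + v2)^2 / (v1^2 / (length(x) - 1) + v2^2 / (length(y) - 1))

# One-sided alternative: the Y location is larger
welch <- t.test(y, x, alternative = "greater")
print(c(t = t_stat, df = df))
print(welch$p.value)
```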

## Behrens–Fisher Problem

• Assume that the cdfs $$F(t)$$ and $$G(t)$$ are symmetric about $$\theta_X$$ and $$\theta_Y$$
• Let $$P_1,\dots,P_{n_1}$$ denote the placements of the $$X_i$$'s in terms of the $$Y$$-sample $P_i = \#_j \{ Y_j < X_i \}, i = 1,\dots,n_1$
• Let $$Q_1,\dots,Q_{n_2}$$ denote the placements of the $$Y_j$$'s in terms of the $$X$$-sample $Q_j = \#_i \{X_i < Y_j \}, j=1,\dots,n_2$

## Behrens–Fisher Problem

• Define $\bar{P} = \frac{1}{n_1} \sum_{i=1}^{n_1} P_i \hspace{1cm}\text{and}\hspace{1cm} \bar{Q} = \frac{1}{n_2} \sum_{j=1}^{n_2} Q_j$ $V_1 = \sum_{i=1}^{n_1} (P_i-\bar{P})^2 \hspace{1cm}\text{and}\hspace{1cm} V_2 = \sum_{j=1}^{n_2} (Q_j-\bar{Q})^2$
• Then the standardized test statistic is $U = \frac{\sum_{j=1}^{n_2} Q_j - \sum_{i=1}^{n_1} P_i}{2 \left(V_1 + V_2 + \bar{P}\bar{Q}\right)^{1/2}}$

## Behrens–Fisher Problem

Monte Carlo simulation of the distribution of $$U$$ under the null

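# hg, lg: plasma glucose samples (healthy and lead-poisoned geese), defined elsewhere
n1 = length(hg); n2 = length(lg); n = n1+n2; nSim = 10000
# each column is a random permutation of the ranks 1..n
Shuffle = replicate(nSim,sample(n,n,replace = FALSE))
Xi = Shuffle[1:n1,]
Yj = Shuffle[(n1+1):n,]

# statistic U computed from the placements (uses the global n1 and n2)
U = function(Xi,Yj) {
  Pi = function(i) { sum(Yj < Xi[i]) }; Qj = function(j) { sum(Xi < Yj[j]) }
  P = sapply(1:n1,Pi); Q = sapply(1:n2,Qj)
  Phat = mean(P); Qhat = mean(Q)
  V1 = sum((P-Phat)^2); V2 = sum((Q-Qhat)^2)
  (sum(Q)-sum(P))/(2*sqrt(V1+V2+Phat*Qhat)) }

# evaluate U on every shuffled split to get its null distribution
UNull = rep(0,nSim)
for(trial in 1:nSim) { UNull[trial] = U(Xi[,trial],Yj[,trial]) }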

## Behrens–Fisher Problem Example

• Hollander and Wolfe (1999) describe a study of healthy and lead-poisoned geese
• The study involved 7 healthy geese and 8 lead-poisoned geese
• The response of interest was the amount of plasma glucose in the geese in mg/100 ml of plasma
• The hypotheses of interest are: $H_0: \theta_H = \theta_L \hspace{1cm}\text{versus}\hspace{1cm} H_A: \theta_H < \theta_L$
• $$\theta_L$$ denotes the true median plasma glucose value of lead-poisoned geese
• $$\theta_H$$ denotes the true median plasma glucose value of healthy geese

## Behrens–Fisher Problem Example

Test statistic of our sample

ranks = rank(c(hg,lg))
XiObsv = ranks[1:n1]
YjObsv = ranks[(n1+1):n]
UObsv = U(XiObsv,YjObsv)
UObsv
## [1] 1.467599
pvalue = mean(UNull >= UObsv)
pvalue
## [1] 0.0814

## Behrens–Fisher Problem Example

[Figure: result from the Monte Carlo simulation]

## General Difference in Two Populations

• The most general test: Any difference between $$X$$ and $$Y$$ $H_0: F(t) = G(t) \hspace{1cm} \text{versus} \hspace{1cm} H_A: F(t) \ne G(t) \hspace{1cm} \text{for at least one } t$
• The Kolmogorov-Smirnov test is such a test
• Obtain empirical distributions $$F_{n_1}$$ and $$G_{n_2}$$
• $$F_{n_1}(t) = \# \{X_i \le t \} / n_1$$
• $$G_{n_2}(t) = \# \{Y_j \le t \} / n_2$$
• Assuming $$n_1 = n_2$$, we get the statistic (with ordered combined observations $$Z_{(1)} \le \dots \le Z_{(n)}$$) $J = \max_{i=1,\dots,n} \{ | F_{n_1}(Z_{(i)}) - G_{n_2}(Z_{(i)}) | \}$
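The statistic $$J$$ can be computed directly from the two empirical distribution functions. A minimal sketch with made-up toy data of equal sample sizes:

```r
# Toy data (made up), n1 = n2 = 3
x <- c(1, 3, 5)
y <- c(2, 4, 6)

z  <- sort(c(x, y))          # ordered combined observations Z_(1), ..., Z_(n)
Fn <- ecdf(x)                # empirical cdf of the X-sample
Gn <- ecdf(y)                # empirical cdf of the Y-sample

# largest absolute gap between the two empirical cdfs at the data points
J <- max(abs(Fn(z) - Gn(z)))
print(J)
```

For these interleaved samples the two step functions are never more than one jump apart, so $$J = 1/3$$. Base R's `ks.test(x, y)` reports the same value as its statistic.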

## Do it Yourself Test Statistics

• Idea: Put a metric $$d(\pi,\sigma)$$ on permutations $$\pi$$ and $$\sigma$$
• Possible metrics:
• Metric 1: $$K(\pi,\sigma) =$$ minimum number of pairwise adjacent transpositions needed to go from $$\pi$$ to $$\sigma$$
• Metric 2: $$R(\pi,\sigma) = \sqrt{ \sum_{i=1}^n (\pi(i)-\sigma(i))^2 }$$
• Example for $$K(\pi,\sigma)$$:
• $$\pi = \{ 3,2,1 \} \text{ and } \sigma = \{ 1,2,3 \}$$
• Move 1: { 2,3,1 }
• Move 2: { 2,1,3 }
• Move 3: { 1,2,3 }
• So $$K(\pi,\sigma) = 3$$
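Both metrics are short to implement. The sketch below uses the fact that the minimum number of adjacent transpositions equals the number of pairs that the two permutations order differently; the function names `kendall_metric` and `r_metric` are made up for illustration.

```r
# Metric 1: count the pairs (i, j) that p and s order differently
kendall_metric <- function(p, s) {
  pairs <- combn(length(p), 2)    # all index pairs i < j, one per column
  sum((p[pairs[1, ]] - p[pairs[2, ]]) * (s[pairs[1, ]] - s[pairs[2, ]]) < 0)
}

# Metric 2: Euclidean distance between the two rank vectors
r_metric <- function(p, s) sqrt(sum((p - s)^2))

p <- c(3, 2, 1)
s <- c(1, 2, 3)
print(kendall_metric(p, s))   # 3, as in the worked example
print(r_metric(p, s))         # sqrt(8)
```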

## Do it Yourself Test Statistics

• Four-step program
• Step 1:
• Let $$\pi$$ be the observed ranks
• Choose a right-invariant metric: $$d(\pi,\sigma) = d(\pi\tau,\sigma\tau)$$ for every permutation $$\tau$$
• In other words, this takes care of all possible relabelings of observations into populations
• Step 2:
• Permutation $$\sigma$$ is "equivalent" to observed $$\pi$$ if and only if it assigns the same set of ranks to population 1 and thus to population 2
• This builds an equivalence class $$[\pi]$$ of size $$n_1!n_2!$$, which is a subset of all possible permutations

## Do it Yourself Test Statistics

• Step 3:
• Construct extremal set $$E$$ containing all permutations which are most in agreement with $$H_A$$ and least with $$H_0$$
• Step 4:
• The test statistic is now defined by choosing a distance $$d([\pi],E)$$ to measure how far we are from $$H_A$$
• With $$K(\pi,\sigma)$$ this gives the Mann–Whitney test statistic
• With $$R^2(\pi,\sigma)$$ it is equivalent to the Wilcoxon test statistic
• Reference: Critchlow (1986), A Unified Approach to Constructing Nonparametric Rank Tests
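The four steps can be checked by brute force on a tiny example. The sketch below is an illustration under stated assumptions, not Critchlow's construction verbatim: it uses Kendall's metric, assumes a one-sided alternative in which population 1 tends to be smaller (so $$E$$ holds every permutation that gives ranks $$1,\dots,n_1$$ to population 1), and works with made-up observed ranks. The helper names `perms` and `kendall_metric` are hypothetical.

```r
# All permutations of a vector (fine for very small n only)
perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in perms(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

# Kendall's metric: number of pairs ordered differently by p and s
kendall_metric <- function(p, s) {
  pairs <- combn(length(p), 2)
  sum((p[pairs[1, ]] - p[pairs[2, ]]) * (s[pairs[1, ]] - s[pairs[2, ]]) < 0)
}

n1 <- 2; n2 <- 3; n <- n1 + n2
obs <- c(4, 2, 1, 3, 5)   # made-up ranks; the first n1 belong to population 1
all_p <- perms(1:n)

# Step 2: equivalence class [pi], same rank set for population 1
cls <- Filter(function(p) setequal(p[1:n1], obs[1:n1]), all_p)
# Step 3 (assumed alternative): population 1 holds the n1 smallest ranks
E <- Filter(function(p) setequal(p[1:n1], 1:n1), all_p)
# Step 4: distance between the class and the extremal set
d <- min(sapply(cls, function(a) min(sapply(E, function(b) kendall_metric(a, b)))))

# Mann-Whitney type count: pairs where population 1 outranks population 2
mw <- sum(outer(obs[1:n1], obs[(n1 + 1):n], ">"))
print(c(class_size = length(cls), distance = d, mann_whitney_count = mw))
```

In this example the class has $$n_1!n_2! = 12$$ members, and the minimum distance equals the number of pairs in which a population 1 observation outranks a population 2 observation, which is a Mann–Whitney count.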