Stanford University, Spring 2016, STATS 205

- Random sample \(X_1,\dots,X_{n_1}\) with cdf \(F(t)\) and pdf \(f(t)\)
- Random sample \(Y_1,\dots,Y_{n_2}\) with cdf \(F(t-\Delta)\) and pdf \(f(t-\Delta)\)
- Hypothesis test: \[H_0: \Delta = 0 \hspace{1cm} \text{versus} \hspace{1cm} H_A: \Delta \ne 0\]
- Additionally, we can compute an estimate \(\widehat{\Delta}\) of \(\Delta\) and form confidence intervals
- Wilcoxon test statistic: \[ T = \sum_{j=1}^{n_2} \operatorname{R}(Y_j)\] where \(\operatorname{R}(Y_j)\) is the rank of \(Y_j\) in the combined sample \(X_1,\dots,X_{n_1},Y_1,\dots,Y_{n_2}\)
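As an illustration, here is a small Python sketch of the rank-sum statistic (a hypothetical helper, assuming no ties; the course's own examples use R):

```python
def rank_sum_T(x, y):
    """Wilcoxon rank-sum statistic: sum of the ranks of the Y's
    in the combined sample (assumes no ties)."""
    combined = sorted(x + y)
    rank = {v: i + 1 for i, v in enumerate(combined)}  # rank of each value
    return sum(rank[v] for v in y)
```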

- The estimate of \(\Delta\) is the Hodges–Lehmann estimator \[ \widehat{\Delta} = \operatorname{median}_{i,j} \{ Y_j - X_i \}\]
- There are \(n_1 n_2\) differences
- Order the differences \(D_{(1)} < D_{(2)} < \dots < D_{(n_1 n_2)}\)
- Fix confidence level at \(1 - \alpha\)
- Find \(c\) such that \[ \alpha/2 = \operatorname{P}_{H_0} ( T \le c ) \]
- Then the interval \((D_{(c+1)},D_{(n_1 n_2-c)})\) is a \((1-\alpha) 100\%\) confidence interval for \(\Delta\)
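The construction above can be sketched in Python (a hypothetical illustration, not the course's R code): compute all \(n_1 n_2\) pairwise differences, take their median, and read the interval off the order statistics for a given critical value \(c\):

```python
import itertools
import statistics

def hodges_lehmann(x, y):
    """Hodges-Lehmann estimate: median of all n1*n2 pairwise differences Y_j - X_i."""
    diffs = sorted(yj - xi for xi, yj in itertools.product(x, y))
    return statistics.median(diffs), diffs

def shift_ci(diffs, c):
    """Distribution-free interval (D_(c+1), D_(n1*n2 - c)) for a given critical value c."""
    return diffs[c], diffs[len(diffs) - c - 1]
```

For real data, \(c\) comes from the null distribution of the rank-sum statistic (e.g. `qwilcox` in R).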

As an example, both samples are drawn from a \(t\)-distribution with 5 degrees of freedom, and the true shift parameter \(\Delta\) is set to \(8\):

```r
n1 = 11; n2 = 9; delta = 8
x = round(rt(n1, 5) * 10 + 42, 1)
y = round(rt(n2, 5) * 10 + 42 + delta, 1)
sort(x)
```

```
## [1] 20.0 27.5 29.7 36.5 41.7 42.1 45.5 46.6 47.9 49.0 50.6
```

```r
sort(y)
```

```
## [1] 25.7 32.4 37.6 38.0 39.4 52.6 55.0 59.7 80.4
```

Estimate of shift parameter \(\Delta\) and confidence intervals:

```r
wilcox.test(y, x, conf.int = TRUE)
```

```
## 
##  Wilcoxon rank sum test
## 
## data:  y and x
## W = 60, p-value = 0.4561
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -9 18
## sample estimates:
## difference in location 
##                      6
```

- Frame the two-sample location problem as a regression problem
- Combine the samples into one vector \(\boldsymbol{Z} = (X_1,\dots,X_{n_1},Y_1,\dots,Y_{n_2})^T\) of length \(n = n_1 + n_2\)
- Let \(\boldsymbol{c}\) be an \(n \times 1\) vector with
- zeros at position \(1\) to \(n_1\) and
- ones at positions \(n_1+1\) to \(n\)

- Then we can rewrite the location model as \[Z_i = \alpha + c_i \Delta + e_i\] where \(e_1,\dots,e_n\) are iid with pdf \(f(t)\)
- We can use the method of Least Squares (LS) to obtain the estimate \(\widehat{\Delta}\)
- Or the Hodges–Lehmann estimator
- The intercept \(\alpha\) is estimated in a second step on the residuals
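To see what least squares gives here, a Python sketch (with hypothetical example data): the LS slope in the dummy-variable regression reduces to the difference of sample means \(\bar{Y} - \bar{X}\):

```python
def ls_slope(z, c):
    """Ordinary least-squares slope of z on the 0/1 indicator c."""
    cbar = sum(c) / len(c)
    zbar = sum(z) / len(z)
    num = sum((ci - cbar) * (zi - zbar) for ci, zi in zip(c, z))
    den = sum((ci - cbar) ** 2 for ci in c)
    return num / den

x = [1.0, 2.0, 3.0]          # "X" sample: c_i = 0
y = [4.0, 6.0]               # "Y" sample: c_i = 1
delta_ls = ls_slope(x + y, [0] * len(x) + [1] * len(y))
```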

- Suppose, two estimators \(\widehat{\Delta}_1,\widehat{\Delta}_2\) converge in distribution \[ \sqrt{n} \left( \widehat{\Delta}_i - \Delta \right) \overset{d}{\to} N(0,\sigma^2_i) \text{ for } i = 1,2\]
- Asymptotic Relative Efficiency (ARE) between two estimators \(\widehat{\Delta}_1\) and \(\widehat{\Delta}_2\) is defined as: \[ \operatorname{ARE}\left(\widehat{\Delta}_1,\widehat{\Delta}_2\right) = \frac{\sigma_2^2}{\sigma_1^2}\]
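As a sanity check on this definition, a Python simulation sketch (hypothetical; sample size and seed are assumptions) comparing the sample median and the sample mean as location estimators under normality, where theory gives \(\operatorname{ARE}(\text{median},\text{mean}) = 2/\pi \approx 0.637\):

```python
import random
import statistics

def are_median_vs_mean(n=25, n_sim=20000, seed=1):
    """Estimate ARE(median, mean) under N(0,1) as the ratio of
    empirical variances of the two estimators over many replicates."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(n_sim):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    # ARE(median, mean) = var(mean) / var(median); should be near 2/pi
    return statistics.variance(means) / statistics.variance(medians)
```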

Contaminated normal \((0 < \epsilon < 0.5, \sigma_c > 1)\): \[F(x) = (1 - \epsilon) \Phi(x) + \epsilon \Phi(x/\sigma_c)\]

```r
n = 10000
sigmaC = 3
epsilon = 0.25
sample = c(rnorm((1 - epsilon) * n, 0, 1), rnorm(epsilon * n, 0, sigmaC))
```

| \(\epsilon\) | 0.000 | 0.010 | 0.020 | 0.030 | 0.050 | 0.100 | 0.150 | 0.250 |
|---|---|---|---|---|---|---|---|---|
| ARE(Hodges–Lehmann, LS) | 0.955 | 1.009 | 1.060 | 1.108 | 1.196 | 1.373 | 1.497 | 1.616 |

- As before, we assume that there are two populations with cdfs \(F\) and \(G\)
- The null hypothesis is that \(X\) and \(Y\) have the same but unspecified distribution \[ H_0: F(t) = G(t) \]
- Consider the alternative that \(X\) and \(Y\) have different variability with same median \[ \frac{X-\theta}{\sigma_1} \overset{d}{=} \frac{Y-\theta}{\sigma_2} \]
- Then, what is left to analyze is the dispersion \[ \eta^2 = \frac{\sigma_1^2}{\sigma_2^2} = \frac{\operatorname{Var}(X)}{\operatorname{Var}(Y)}\]

- So that our null hypothesis reduces to \[ H_0: \eta^2 = 1 \]
- We will use the Ansari-Bradley two-sample scale statistic \(C\)
- Rank combined sample from smallest to largest
- Assign score 1 to smallest and largest
- Assign score 2 to second smallest and second largest
- etc.
- If \(n\) is even: \(1,2,3,\dots,n/2,n/2,\dots,3,2,1\)
- If \(n\) is odd: \(1,2,3,\dots,(n-1)/2,(n+1)/2,(n-1)/2,\dots,3,2,1\)
- Then the statistic is the sum of the scores assigned to the \(Y\)-observations, \(C = \sum_{j=1}^{n_2} R(Y_j)\), where \(R(Y_j)\) here denotes the Ansari-Bradley score of \(Y_j\)
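A Python sketch of the score construction (hypothetical helpers, assuming no ties):

```python
def ansari_bradley_scores(n):
    """Symmetric scores 1,2,...,2,1 assigned to the ordered combined sample."""
    half = list(range(1, (n + 1) // 2 + 1))
    return half + half[::-1] if n % 2 == 0 else half + half[-2::-1]

def ansari_bradley_C(x, y):
    """C = sum of the scores attached to the Y observations (assumes no ties)."""
    combined = sorted((v, src) for src, sample in enumerate((x, y)) for v in sample)
    scores = ansari_bradley_scores(len(combined))
    return sum(s for s, (v, src) in zip(scores, combined) if src == 1)
```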

- Suppose that we have two populations which differ by location and scale
- We are interested in testing that the locations are the same
- Random sample \(X_1,\dots,X_{n_1}\) with cdf \(F(t)\)
- Random sample \(Y_1,\dots,Y_{n_2}\) with cdf \(G(t)\)
- Let \(\theta_X\) and \(\theta_Y\) denote the medians of the distributions \(F(t)\) and \(G(t)\)
- Hypothesis test \[ H_0: \theta_X = \theta_Y \hspace{1cm} \text{versus} \hspace{1cm} H_A: \theta_X < \theta_Y\]
- This is called the Behrens–Fisher problem, and the traditional test in this situation is the two-sample \(t\)-test, which uses a \(t\)-statistic with the Satterthwaite degrees-of-freedom correction
- There is the two-sample Fligner-Policello test which serves as a robust alternative to this approximate \(t\)-test

- Assume that the cdfs \(F(t)\) and \(G(t)\) are symmetric about \(\theta_X\) and \(\theta_Y\)
- Let \(P_1,\dots,P_{n_1}\) denote the placements of the \(X_i\)'s in terms of the \(Y\)-sample \[P_i = \#_j \{ Y_j < X_i \}, \quad i = 1,\dots,n_1\]
- Let \(Q_1,\dots,Q_{n_2}\) denote the placements of the \(Y_j\)'s in terms of the \(X\)-sample \[Q_j = \#_i \{X_i < Y_j \}, \quad j=1,\dots,n_2\]

- Define \[\bar{P} = \frac{1}{n_1} \sum_{i=1}^{n_1} P_i \hspace{1cm}\text{and}\hspace{1cm} \bar{Q} = \frac{1}{n_2} \sum_{j=1}^{n_2} Q_j\] \[V_1 = \sum_{i=1}^{n_1} (P_i-\bar{P})^2 \hspace{1cm}\text{and}\hspace{1cm} V_2 = \sum_{j=1}^{n_2} (Q_j-\bar{Q})^2\]
- Then the standardized test statistic is \[ U = \frac{\sum_{j=1}^{n_2} Q_j - \sum_{i=1}^{n_1} P_i}{2 \left(V_1 + V_2 + \bar{P}\bar{Q}\right)^{1/2}} \]
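The placements and the statistic \(U\) translate directly into Python (a hypothetical sketch of the formulas above, assuming no ties):

```python
from math import sqrt

def fligner_policello_U(x, y):
    """Standardized Fligner-Policello statistic built from placements."""
    P = [sum(yj < xi for yj in y) for xi in x]  # placements of the X's in the Y-sample
    Q = [sum(xi < yj for xi in x) for yj in y]  # placements of the Y's in the X-sample
    Pbar, Qbar = sum(P) / len(P), sum(Q) / len(Q)
    V1 = sum((p - Pbar) ** 2 for p in P)
    V2 = sum((q - Qbar) ** 2 for q in Q)
    return (sum(Q) - sum(P)) / (2 * sqrt(V1 + V2 + Pbar * Qbar))
```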

Monte Carlo simulation of the distribution of \(U\) under the null:

```r
n1 = length(hg); n2 = length(lg); n = n1 + n2; nSim = 10000
Shuffle = replicate(nSim, sample(n, n, replace = FALSE))
Xi = Shuffle[1:n1, ]
Yj = Shuffle[(n1+1):n, ]
U = function(Xi, Yj) {
  Pi = function(i) { sum(Yj < Xi[i]) }
  Qj = function(j) { sum(Xi < Yj[j]) }
  P = sapply(1:n1, Pi); Q = sapply(1:n2, Qj)
  Phat = mean(P); Qhat = mean(Q)
  V1 = sum((P - Phat)^2); V2 = sum((Q - Qhat)^2)
  (sum(Q) - sum(P)) / (2 * sqrt(V1 + V2 + Phat * Qhat))
}
UNull = rep(0, nSim)
for (trial in 1:nSim) {
  UNull[trial] = U(Xi[, trial], Yj[, trial])
}
```

- Hollander and Wolfe (1999) describe a study of healthy and lead-poisoned geese
- The study involved 7 healthy geese and 8 lead-poisoned geese
- The response of interest was the amount of plasma glucose in the geese in mg/100 ml of plasma
- The hypotheses of interest are: \[H_0: \theta_H = \theta_L \hspace{1cm}\text{versus}\hspace{1cm} H_A: \theta_H < \theta_L\]
- \(\theta_L\) denotes the true median plasma glucose value of lead-poisoned geese
- \(\theta_H\) denotes the true median plasma glucose value of healthy geese

Test statistic for our sample:

```r
ranks = rank(c(hg, lg))
XiObsv = ranks[1:n1]
YjObsv = ranks[(n1+1):n]
UObsv = U(XiObsv, YjObsv)
UObsv
```

```
## [1] 1.467599
```

```r
pvalue = mean(UNull >= UObsv)
pvalue
```

```
## [1] 0.0814
```

From the Monte Carlo simulation, the approximate \(p\)-value is \(0.0814\), so we fail to reject \(H_0\) at the \(5\%\) level.

- The most general test: Any difference between \(X\) and \(Y\) \[ H_0: F(t) = G(t) \hspace{1cm} \text{versus} \hspace{1cm} H_A: F(t) \ne G(t) \hspace{1cm} \text{for at least one } t\]
- The Kolmogorov-Smirnov test is such a test
- Obtain empirical distributions \(F_{n_1}\) and \(G_{n_2}\)
- \(F_{n_1}(t) = \# \{X_i \le t \} / n_1\)
- \(G_{n_2}(t) = \# \{Y_j \le t \} / n_2\)
- Assuming \(n_1 = n_2\), we get the statistic (with ordered combined observations \(Z_{(1)} \le \dots \le Z_{(n)}\)) \[J = \max_{i=1,\dots,n} \{ | F_{n_1}(Z_{(i)}) - G_{n_2}(Z_{(i)}) | \}\]
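A direct Python transcription of \(J\) (a hypothetical sketch; the empirical cdfs are evaluated at every combined observation):

```python
def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_n1(t) - G_n2(t)|
    over the combined observations."""
    def ecdf(sample, t):
        return sum(v <= t for v in sample) / len(sample)
    return max(abs(ecdf(x, t) - ecdf(y, t)) for t in x + y)
```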

- Idea: Put a metric \(d(\pi,\sigma)\) on permutations \(\pi\) and \(\sigma\)
- Possible metrics:
- Metric 1: \(K(\pi,\sigma) =\) minimum number of pairwise adjacent transpositions to go from \(\pi\) to \(\sigma\)
- Metric 2: \(R(\pi,\sigma) = \sqrt{ \sum_{i=1}^n (\pi(i)-\sigma(i))^2 }\)

- Example for \(K(\pi,\sigma)\):
- \(\pi = \{ 3,2,1 \} \text{ and } \sigma = \{ 1,2,3 \}\)
- Move 1: { 2,3,1 }
- Move 2: { 2,1,3 }
- Move 3: { 1,2,3 }
- So \(K(\pi,\sigma) = 3\)
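Both metrics are straightforward to compute; a Python sketch (hypothetical helpers) that reproduces the example above, using the fact that \(K\) equals the number of pairs ordered differently by the two permutations:

```python
from math import sqrt

def kendall_K(pi, sigma):
    """Minimum number of pairwise adjacent transpositions from pi to sigma
    (= number of discordant pairs, i.e. Kendall's tau distance)."""
    pos = {v: i for i, v in enumerate(sigma)}
    seq = [pos[v] for v in pi]                  # pi expressed in sigma's coordinates
    n = len(seq)
    return sum(seq[i] > seq[j] for i in range(n) for j in range(i + 1, n))

def spearman_R(pi, sigma):
    """Euclidean metric R(pi, sigma) on permutations."""
    return sqrt(sum((p - s) ** 2 for p, s in zip(pi, sigma)))
```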

- A four-step program
- Step 1:
- Let \(\pi\) be the observed ranks
- The metric is right-invariant \(d(\pi,\sigma) = d(\pi\tau,\sigma\tau)\)
- In other words, this takes care of all possible relabelings of observations into populations

- Step 2:
- Permutation \(\sigma\) is "equivalent" to observed \(\pi\) if and only if it assigns the same set of ranks to population 1 and thus to population 2
- This builds an equivalence class \([\pi]\) of size \(n_1!n_2!\), which is a subset of all possible permutations

- Step 3:
- Construct extremal set \(E\) containing all permutations which are most in agreement with \(H_A\) and least with \(H_0\)

- Step 4:
- The test statistic is now defined by choosing a distance \(d([\pi],E)\) to measure how far we are from \(H_A\)
- With the metric \(K(\pi,\sigma)\), this yields the Mann-Whitney test statistic
- With \(R^2(\pi,\sigma)\), this is equivalent to the Wilcoxon test statistic

- Reference: Critchlow (1986), *A Unified Approach to Constructing Nonparametric Rank Tests*