Stanford University, Spring 2016, STATS 205

Two-Sample Problems

  • Two populations
  • From each population, we have one sample
  • Infer whether or not there is a difference in location between populations
  • Measure (with standard error) the difference (size of the effect) between populations
  • Test: Wilcoxon two-sample rank test
  • Estimation procedure: Hodges–Lehmann estimate

Example

library(HSAUR2); head(USmelanoma)
##             mortality latitude longitude ocean
## Alabama           219     33.0      87.0   yes
## Arizona           160     34.5     112.0    no
## Arkansas          170     35.0      92.5    no
## California        182     37.5     119.5   yes
## Colorado          149     39.0     105.5    no
## Connecticut       159     41.8      72.8   yes

Data: Fisher and van Belle (1993) report mortality rates due to malignant melanoma of the skin for white males during the period 1950-1969, for each state on the US mainland

Mortality: Number of white males who died of malignant melanoma, 1950-1969, per one million inhabitants

Example

Question: Is there a difference in mortality for ocean states and non-ocean states?
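
One way to address this question in R is sketched below, assuming the HSAUR2 package is installed. It applies the Wilcoxon two-sample rank test through the formula interface of wilcox.test and, with conf.int = TRUE, also requests the Hodges–Lehmann estimate of the shift together with a confidence interval. The object name res is only illustrative.

```r
library(HSAUR2)  # provides the USmelanoma data frame

# Two-sided Wilcoxon rank-sum test comparing mortality between
# non-ocean ("no") and ocean ("yes") states.
# conf.int = TRUE additionally returns the Hodges-Lehmann estimate of the
# location shift and a confidence interval for it.
# Tied mortality values would make R fall back to the normal approximation.
res <- wilcox.test(mortality ~ ocean, data = USmelanoma, conf.int = TRUE)

res$p.value    # p-value of the test
res$estimate   # estimated difference in location (no minus yes)
res$conf.int   # confidence interval for the shift
```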

Wilcoxon for Stochastic Ordering of Alternatives

  • Random sample \(X_1,\dots,X_{n_1}\) with cdf \(F\)
  • Random sample \(Y_1,\dots,Y_{n_2}\) with cdf \(G\)
  • Hypothesis test: \[H_0: F(t) = G(t) \hspace{1cm} \text{versus} \hspace{1cm} H_A: F(t) \le G(t)\]
  • Under the alternative, \(X\) is stochastically larger than \(Y\): \[\operatorname{P}(X > t) \ge \operatorname{P}(Y > t)\] for all \(t\), with strict inequality for at least one \(t\)
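
To make stochastic ordering concrete, the following sketch compares two cdfs on a grid. The normal distributions and the names t_grid, F_t and G_t are assumptions chosen for illustration; they are not part of the notes.

```r
# Assumed example: X ~ N(1, 1) and Y ~ N(0, 1), so X tends to be larger
t_grid <- seq(-4, 5, by = 0.1)
F_t <- pnorm(t_grid, mean = 1)   # cdf of X
G_t <- pnorm(t_grid, mean = 0)   # cdf of Y

all(F_t <= G_t)                  # F lies on or below G on the whole grid
all((1 - F_t) >= (1 - G_t))      # equivalently P(X > t) >= P(Y > t)
```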

Wilcoxon for Stochastic Ordering of Alternatives

  • Both samples are combined in one sample
  • Combined sample ranked from \(1\) to \(n = n_1+n_2\) (low to high)
  • Let \(\operatorname{R}(Y_j)\) denote the rank of \(Y_j\) in the combined sample
  • The Wilcoxon test statistic is \[ T = \sum_{j=1}^{n_2} \operatorname{R}(Y_j)\]
  • \(H_0\) is rejected for small values of \(T\), since under \(H_A\) the \(Y\)'s tend to receive low ranks
  • Under \(H_0\) the two samples are from the same population:
    Any subset of ranks is as likely as any other of the same size
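
The ranking steps above can be sketched in a few lines of R. The two toy samples are hypothetical and chosen only so that the ranks are easy to check by hand.

```r
# Hypothetical toy samples (not from the slides)
x <- c(3.1, 4.5, 5.0)
y <- c(1.2, 2.8, 3.9)
n1 <- length(x); n2 <- length(y)

r <- rank(c(x, y))                 # ranks 1..n in the combined sample
R_y <- r[(n1 + 1):(n1 + n2)]       # ranks belonging to the Y's
T_stat <- sum(R_y)                 # Wilcoxon rank-sum statistic T
T_stat                             # ranks 1, 2 and 4 give T = 7
```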

Wilcoxon for Stochastic Ordering of Alternatives

  • For example, the probability that any particular subset of \(n_2\) ranks is assigned to the \(Y\)'s is \[\frac{1}{n \choose n_2}\]
  • The null distribution of \(T\) does not depend on the population distribution, so this test is distribution-free
  • So the \(p\)-value can be calculated exactly from the null distribution of \(T\), or approximated asymptotically when \(n_1+n_2\) is large
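
For small samples the exact null distribution can be enumerated directly, as in this sketch. The toy samples and all object names are hypothetical. The last line cross-checks the result against wilcox.test, which reports a statistic for x that is equivalent to T.

```r
# Hypothetical toy samples (not from the slides)
x <- c(3.1, 4.5, 5.0)
y <- c(1.2, 2.8, 3.9)
n1 <- length(x); n2 <- length(y); n <- n1 + n2

T_obs <- sum(rank(c(x, y))[(n1 + 1):n])   # observed rank sum of the Y's

# Under H0 every set of n2 ranks out of 1..n is equally likely
rank_sets <- combn(n, n2)                 # one candidate rank set per column
T_null <- colSums(rank_sets)              # T for each of the choose(n, n2) sets

# Small T supports "X stochastically larger", so use the lower tail
p_exact <- mean(T_null <= T_obs)
p_exact

# Cross-check against R's exact one-sided test
wilcox.test(x, y, alternative = "greater")$p.value
```

Here only the rank sets {1, 2, 3} and {1, 2, 4} give a rank sum of at most 7, so the exact p-value is 2 out of 20.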

Wilcoxon for Stochastic Ordering of Alternatives Example

Data: Case-control study of esophageal cancer in Ile-et-Vilaine, France (Breslow et al. 1980)

Null hypothesis: Alcohol consumption is the same in the two groups (cases and controls)

library(datasets); data(esoph); head(esoph)
##   agegp     alcgp    tobgp ncases ncontrols
## 1 25-34 0-39g/day 0-9g/day      0        40
## 2 25-34 0-39g/day    10-19      0        10
## 3 25-34 0-39g/day    20-29      0         6
## 4 25-34 0-39g/day      30+      0         5
## 5 25-34     40-79 0-9g/day      0        27
## 6 25-34     40-79    10-19      0         7

Wilcoxon for Stochastic Ordering of Alternatives Example

wilcox.test(x,y,alternative = "greater")
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  x and y
## W = 135610, p-value < 2.2e-16
## alternative hypothesis: true location shift is greater than 0

Conclusion: Reject the null hypothesis in favor of the alternative that alcohol consumption is higher in participants who suffer from esophageal cancer

Analyses for a Shift in Location

  • Now consider the two-sample location problem
  • The parameter \(\Delta\), with \(-\infty < \Delta < \infty\), is the shift in location: \[ G(t) = F(t - \Delta) \hspace{1cm} \text{and} \hspace{1cm} g(t) = f(t - \Delta)\]
  • For example, \(\Delta\) is the difference in medians, and in means when they exist, between the two populations
  • The location model assumes that the scale parameters of \(X\) and \(Y\) are the same (e.g. the variance in a normal distribution)

Analyses for a Shift in Location

  • Random sample \(X_1,\dots,X_{n_1}\) with cdf \(F(t)\) and pdf \(f(t)\)
  • Random sample \(Y_1,\dots,Y_{n_2}\) with cdf \(F(t-\Delta)\) and pdf \(f(t-\Delta)\)
  • Hypothesis test: \[H_0: \Delta = 0 \hspace{1cm} \text{versus} \hspace{1cm} H_A: \Delta \ne 0\]
  • Additionally, we can estimate \(\Delta\) by \(\widehat{\Delta}\) and form confidence intervals
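
A minimal sketch of the estimation step, using the Hodges–Lehmann estimate named at the start of these notes: the median of all pairwise differences \(Y_j - X_i\). The samples are the ones printed in the simulated example further on in these notes; the names diffs, delta_hat and fit are illustrative.

```r
# Samples printed in the simulated example of these notes
x <- c(20.0, 27.5, 29.7, 36.5, 41.7, 42.1, 45.5, 46.6, 47.9, 49.0, 50.6)
y <- c(25.7, 32.4, 37.6, 38.0, 39.4, 52.6, 55.0, 59.7, 80.4)

# Hodges-Lehmann estimate: median of all n1 * n2 differences Y_j - X_i
diffs <- outer(y, x, "-")
delta_hat <- median(diffs)
delta_hat

# wilcox.test reports the same estimate plus a confidence interval for Delta
fit <- wilcox.test(y, x, conf.int = TRUE, conf.level = 0.95)
fit$estimate
fit$conf.int
```

Passing y first makes the estimate refer to the shift of \(Y\) relative to \(X\), matching the model in which \(Y\) has cdf \(F(t-\Delta)\).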

Analyses for a Shift in Location

  • Wilcoxon test statistic (same as before): \[ T = \sum_{j=1}^{n_2} \operatorname{R}(Y_j)\] among the combined samples \(X_1,\dots,X_{n_1},Y_1,\dots,Y_{n_2}\)
  • Mann–Whitney test statistic (equivalent): \[ T^+ = \#_{i,j}\{ X_i < Y_j \} \] among \(n_1 n_2\) possible pairs
  • Relating the two: \[ T = T^+ + \frac{n_2(n_2+1)}{2}\]

Equivalence between \(T\) and \(T^+\)

  • Show that \(T = T^+ + \frac{n_2(n_2+1)}{2}\)
  • Rank of \(Y_j\) is equal to the number of \(X\)'s less than \(Y_j\)
    plus the number of \(Y\)'s less than \(Y_j\) plus one (assuming no ties) \[ \operatorname{R}(Y_j) = \#_i\{ X_i < Y_j \} + \#_{j'}\{ Y_{j'} < Y_j \} + 1 \]
  • Summing over \(j\) in \(T = \sum_{j=1}^{n_2} \operatorname{R}(Y_j)\) gives \[ T = \#_{i,j}\{ X_i < Y_j \} + \#_{j,j'}\{ Y_{j'} < Y_j \} + n_2 \]
  • First term is \(T^+\)
  • Second and third terms together are \(\{0 + 1 + 2 + \dots + (n_2 - 1)\} + n_2 = \frac{n_2(n_2+1)}{2}\)
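
A quick numerical check of how the rank sum \(T\) and the pair count \(T^+\) are related, using the samples printed in the simulated example of these notes. The code confirms that the rank sum exceeds the pair count by exactly \(n_2(n_2+1)/2\). The names T_rank and T_plus are illustrative.

```r
# Samples printed in the simulated example of these notes
x <- c(20.0, 27.5, 29.7, 36.5, 41.7, 42.1, 45.5, 46.6, 47.9, 49.0, 50.6)
y <- c(25.7, 32.4, 37.6, 38.0, 39.4, 52.6, 55.0, 59.7, 80.4)
n1 <- length(x); n2 <- length(y)

T_rank <- sum(rank(c(x, y))[(n1 + 1):(n1 + n2)])  # T: rank sum of the Y's
T_plus <- sum(outer(x, y, "<"))                   # T^+: pairs with X_i < Y_j

c(T = T_rank, T_plus = T_plus, offset = n2 * (n2 + 1) / 2)   # 105, 60, 45
T_rank - T_plus == n2 * (n2 + 1) / 2                         # TRUE
```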

Analyses for a Shift in Location Example

Simulated data: observations follow a scaled \(t\)-distribution with 5 degrees of freedom,
and the true shift parameter \(\Delta\) is set to \(8\)

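# Location of each sample; the true shift is Delta = 50 - 42 = 8
loc_x = 42
loc_y = 50
sigma = 10
# Scaled and shifted t(5) draws: 11 observations for x, 9 for y
x = round(rt(11,5)*sigma+loc_x,1)
y = round(rt(9,5)*sigma+loc_y,1)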
sort(x)
##  [1] 20.0 27.5 29.7 36.5 41.7 42.1 45.5 46.6 47.9 49.0 50.6
sort(y)
## [1] 25.7 32.4 37.6 38.0 39.4 52.6 55.0 59.7 80.4

Analyses for a Shift in Location Example

Use the exact null distribution of \(T^+\):

wilcox.test(x,y,exact=TRUE,correct=TRUE)
## 
##  Wilcoxon rank sum test
## 
## data:  x and y
## W = 39, p-value = 0.4561
## alternative hypothesis: true location shift is not equal to 0
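
The exact p-value can be reproduced by hand from the null distribution, as sketched here with R's pwilcox. R's W counts the pairs with \(x_i > y_j\). The names p_lower, p_upper and p_exact are illustrative.

```r
# Samples printed in the example above
x <- c(20.0, 27.5, 29.7, 36.5, 41.7, 42.1, 45.5, 46.6, 47.9, 49.0, 50.6)
y <- c(25.7, 32.4, 37.6, 38.0, 39.4, 52.6, 55.0, 59.7, 80.4)
n1 <- length(x); n2 <- length(y)

W <- sum(outer(x, y, ">"))          # pairs with x_i > y_j, here 39

# Two-sided exact p-value: double the smaller tail of the null distribution
p_lower <- pwilcox(W, n1, n2)                          # P(W_null <= W)
p_upper <- pwilcox(W - 1, n1, n2, lower.tail = FALSE)  # P(W_null >= W)
p_exact <- min(1, 2 * min(p_lower, p_upper))
p_exact
```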

Analyses for a Shift in Location Example

Use asymptotics:

wilcox.test(x,y,exact=FALSE,correct=FALSE)
## 
##  Wilcoxon rank sum test
## 
## data:  x and y
## W = 39, p-value = 0.425
## alternative hypothesis: true location shift is not equal to 0
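
The asymptotic p-value comes from a normal approximation to the null distribution of W, which has mean \(n_1 n_2 / 2\) and variance \(n_1 n_2 (n_1 + n_2 + 1)/12\) when there are no ties. A sketch without continuity correction, with illustrative names mu, sigma, z and p_asym:

```r
# Samples printed in the example above
x <- c(20.0, 27.5, 29.7, 36.5, 41.7, 42.1, 45.5, 46.6, 47.9, 49.0, 50.6)
y <- c(25.7, 32.4, 37.6, 38.0, 39.4, 52.6, 55.0, 59.7, 80.4)
n1 <- length(x); n2 <- length(y)

W <- sum(outer(x, y, ">"))                    # pairs with x_i > y_j
mu <- n1 * n2 / 2                             # null mean of W
sigma <- sqrt(n1 * n2 * (n1 + n2 + 1) / 12)   # null sd of W (no ties)

z <- (W - mu) / sigma                         # standardized statistic
p_asym <- 2 * pnorm(-abs(z))                  # two-sided normal p-value
p_asym
```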

Permutations and Mann-Whitney

  • Rank all \(n_1+n_2\) observations
  • Color the first sample red and the second sample blue
  • Count how many moves it takes to unscramble the two samples
  • Moves: pairwise adjacent transpositions
  • Unscramble: bring all the reds to the left
  • If few moves are needed, the samples were already well separated, which gives grounds for believing they were drawn from different populations
  • If observations were drawn from the same population, they should be well intermingled and require many moves to unscramble
  • This test is equivalent to the popular Mann–Whitney test
  • Reference: Critchlow (1986), A Unified Approach to Constructing Nonparametric Rank Tests
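
The move count can be computed without simulating the swaps: each blue item standing to the left of a red item costs exactly one adjacent transposition. The sketch below uses the samples from the simulated example in these notes, with x colored red and y colored blue; the names color_seq, blues_to_left and moves are illustrative.

```r
# Samples printed in the simulated example of these notes
x <- c(20.0, 27.5, 29.7, 36.5, 41.7, 42.1, 45.5, 46.6, 47.9, 49.0, 50.6)
y <- c(25.7, 32.4, 37.6, 38.0, 39.4, 52.6, 55.0, 59.7, 80.4)

# Sort the combined sample and record each observation's color
color_seq <- c(rep("red", length(x)), rep("blue", length(y)))[order(c(x, y))]

# Bringing all reds to the left needs one adjacent swap for every
# (blue, red) pair in which the blue item currently stands to the left
blues_to_left <- cumsum(color_seq == "blue")
moves <- sum(blues_to_left[color_seq == "red"])
moves

# Same number as the Mann-Whitney count of pairs with y_j < x_i
sum(outer(x, y, ">"))
```

For these data both counts equal 39, the W reported by wilcox.test in the example.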