Stanford University, Spring 2016, STATS 205

Two-Sample Problems

• Two populations
• From each population, we have one sample
• Infer whether or not there is a difference in location between populations
• Measure (with standard error) the difference (size of the effect) between populations
• Test: Wilcoxon two-sample rank test
• Estimation procedure: Hodges–Lehmann estimate

Example

library(HSAUR2); head(USmelanoma)
##             mortality latitude longitude ocean
## Alabama           219     33.0      87.0   yes
## Arizona           160     34.5     112.0    no
## Arkansas          170     35.0      92.5    no
## California        182     37.5     119.5   yes
## Colorado          149     39.0     105.5    no
## Connecticut       159     41.8      72.8   yes

Data: Fisher and van Belle (1993) report mortality rates due to malignant melanoma of the skin for white males during the period 1950-1969, for each state on the US mainland

Mortality: Number of white males who died of malignant melanoma, 1950-1969, per one million inhabitants

Example

Question: Is there a difference in mortality for ocean states and non-ocean states?

Wilcoxon for Stochastic Ordering of Alternatives

• Random sample $$X_1,\dots,X_{n_1}$$ with cdf $$F$$
• Random sample $$Y_1,\dots,Y_{n_2}$$ with cdf $$G$$
• Hypothesis test: $H_0: F(t) = G(t) \hspace{1cm} \text{versus} \hspace{1cm} H_A: F(t) \le G(t)$ for all $$t$$, with strict inequality for some $$t$$
• Under the alternative, $$X$$ is stochastically larger than $$Y$$: $\operatorname{P}(X > t) \ge \operatorname{P}(Y > t)$

Wilcoxon for Stochastic Ordering of Alternatives

• Both samples are combined into one sample
• The combined sample is ranked from $$1$$ to $$n = n_1+n_2$$ (low to high)
• Let $$\operatorname{R}(Y_j)$$ denote the rank of $$Y_j$$ in the combined sample
• The Wilcoxon test statistic is $T = \sum_{j=1}^{n_2} \operatorname{R}(Y_j)$
• $$H_0$$ is rejected for small values of $$T$$, since under $$H_A$$ the $$Y$$'s tend to receive the low ranks
• Under $$H_0$$ the two samples are from the same population:
any subset of ranks is as likely as any other of the same size
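
The ranking steps above can be sketched in a few lines of R. This is only an illustration: the values are the simulated samples printed in the shift-in-location example of these slides, and the names `combined`, `ranks`, `is_y` and `T_stat` are assumptions introduced here (the name `T` is avoided because it aliases `TRUE` in R).

```r
# Simulated samples as printed in the shift-in-location example (no ties)
x <- c(20.0, 27.5, 29.7, 36.5, 41.7, 42.1, 45.5, 46.6, 47.9, 49.0, 50.6)
y <- c(25.7, 32.4, 37.6, 38.0, 39.4, 52.6, 55.0, 59.7, 80.4)

combined <- c(x, y)                        # pool both samples
ranks <- rank(combined)                    # ranks 1..n, low to high
is_y <- seq_along(combined) > length(x)    # TRUE for the Y observations

T_stat <- sum(ranks[is_y])                 # Wilcoxon rank sum of the Y's
T_stat                                     # 105
```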

Wilcoxon for Stochastic Ordering of Alternatives

• For example, the probability that a particular subset of $$n_2$$ ranks is assigned to the $$Y$$'s is $\frac{1}{n \choose n_2}$
• The null distribution of $$T$$ does not depend on the population distribution, so this test is distribution-free
• So the $$p$$-value can be calculated exactly from the distribution of $$T$$, or approximated asymptotically when $$n_1+n_2$$ is large
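
To make the counting argument concrete, here is a small sketch that enumerates every subset of ranks for a toy case. The sample sizes $$n_1 = 3$$, $$n_2 = 2$$ and the observed value `t_obs` are hypothetical choices made for illustration, not values from the lecture.

```r
n1 <- 3; n2 <- 2; n <- n1 + n2

# Each column of `subsets` is one possible set of ranks for the Y's
subsets <- combn(n, n2)
T_values <- colSums(subsets)          # value of T for every subset

# Under H0 every subset has probability 1 / choose(n, n2)
null_dist <- table(T_values) / choose(n, n2)
null_dist

# Exact one-sided p-value P(T <= t_obs), rejecting for small T
t_obs <- 4                            # hypothetical observed rank sum
p_exact <- mean(T_values <= t_obs)
p_exact                               # 0.2

# Same probability from pwilcox(), which is parameterized by T - n2*(n2+1)/2
pwilcox(t_obs - n2 * (n2 + 1) / 2, n2, n1)
```

Full enumeration is only feasible for small samples, since the number of subsets ${n \choose n_2}$ grows quickly; that is where the asymptotic approximation becomes useful.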

Wilcoxon for Stochastic Ordering of Alternatives Example

Data: Case-control study of esophageal cancer in Ile-et-Vilaine, France (Breslow et al. 1980)

Null hypothesis: Alcohol consumption is the same in the two groups (cases and controls)

library(datasets); data(esoph); head(esoph)
##   agegp     alcgp    tobgp ncases ncontrols
## 1 25-34 0-39g/day 0-9g/day      0        40
## 2 25-34 0-39g/day    10-19      0        10
## 3 25-34 0-39g/day    20-29      0         6
## 4 25-34 0-39g/day      30+      0         5
## 5 25-34     40-79 0-9g/day      0        27
## 6 25-34     40-79    10-19      0         7

Wilcoxon for Stochastic Ordering of Alternatives Example

wilcox.test(x,y,alternative = "greater")
##
##  Wilcoxon rank sum test with continuity correction
##
## data:  x and y
## W = 135610, p-value < 2.2e-16
## alternative hypothesis: true location shift is greater than 0

Conclusion: Reject the null hypothesis in favor of the alternative that alcohol consumption is higher in participants who suffer from esophageal cancer

Analyses for a Shift in Location

• Now consider the two-sample location problem
• The parameter $$\Delta$$, with $$-\infty < \Delta < \infty$$, is the shift in location: $G(t) = F(t - \Delta) \hspace{1cm} \text{and} \hspace{1cm} g(t) = f(t - \Delta)$
• For example, $$\Delta$$ is the difference in means or in medians
• The location model assumes that the scale parameters of $$X$$ and $$Y$$ are the same (e.g. the variance in a normal distribution)

Analyses for a Shift in Location

• Random sample $$X_1,\dots,X_{n_1}$$ with cdf $$F(t)$$ and pdf $$f(t)$$
• Random sample $$Y_1,\dots,Y_{n_2}$$ with cdf $$F(t-\Delta)$$ and pdf $$f(t-\Delta)$$
• Hypothesis test: $H_0: \Delta = 0 \hspace{1cm} \text{versus} \hspace{1cm} H_A: \Delta \ne 0$
• Additionally, we can estimate $$\Delta$$ by $$\widehat{\Delta}$$ and form confidence intervals
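
As a hedged sketch of the estimation step, the Hodges–Lehmann estimate named at the start of the lecture is the median of all $$n_1 n_2$$ pairwise differences $$Y_j - X_i$$. The data are the simulated samples printed in the example of this section; `delta_hat` and `fit` are illustrative names. Note that `wilcox.test()` estimates the shift of its first argument relative to its second, so `y` is passed first here.

```r
# Simulated samples as printed in the shift-in-location example (no ties)
x <- c(20.0, 27.5, 29.7, 36.5, 41.7, 42.1, 45.5, 46.6, 47.9, 49.0, 50.6)
y <- c(25.7, 32.4, 37.6, 38.0, 39.4, 52.6, 55.0, 59.7, 80.4)

# Hodges-Lehmann estimate of Delta: median of all differences Y_j - X_i
delta_hat <- median(outer(y, x, "-"))
delta_hat

# Estimate and distribution-free confidence interval from wilcox.test()
fit <- wilcox.test(y, x, conf.int = TRUE, conf.level = 0.95)
fit$estimate    # agrees with delta_hat for these small tie-free samples
fit$conf.int    # confidence interval for Delta
```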

Analyses for a Shift in Location

• Wilcoxon test statistic (same as before): $T = \sum_{j=1}^{n_2} \operatorname{R}(Y_j)$ among the combined samples $$X_1,\dots,X_{n_1},Y_1,\dots,Y_{n_2}$$
• Mann–Whitney test statistic (equivalent): $T^+ = \#_{i,j}\{ X_i < Y_j \}$ among $$n_1 n_2$$ possible pairs
• Relating the two: $T = T^+ + \frac{n_2(n_2+1)}{2}$

Equivalence between $$T$$ and $$T^+$$

• Show that $$T = T^+ + \frac{n_2(n_2+1)}{2}$$
• Rank of $$Y_j$$ is equal to the number of $$X$$'s less than $$Y_j$$ plus the number of $$Y$$'s less than $$Y_j$$ plus one (assuming no ties) $\operatorname{R}(Y_j) = \#_i\{ X_i < Y_j \} + \#_{j'}\{ Y_{j'} < Y_j \} + 1$
• Summing over $$j$$ in $$T = \sum_{j=1}^{n_2} \operatorname{R}(Y_j)$$ gives $T = \#_{j,i}\{ X_i < Y_j \} + \#_{j,j'}\{ Y_{j'} < Y_j \} + n_2$
• First term is $$T^+$$
• Second and third terms together are $$\{0 + 1 + 2 + \dots + (n_2 - 1)\} + n_2 = \frac{n_2(n_2+1)}{2}$$
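
A quick numerical check of the relation between the rank sum and the pair count, $$T = T^+ + \frac{n_2(n_2+1)}{2}$$, using the simulated samples printed in the example that follows. The names `T_stat`, `T_plus` and `W` are illustrative; the last lines relate both quantities to the `W` that `wilcox.test(x, y)` prints, which for tie-free data counts the pairs with $$X_i > Y_j$$.

```r
# Simulated samples as printed in the shift-in-location example (no ties)
x <- c(20.0, 27.5, 29.7, 36.5, 41.7, 42.1, 45.5, 46.6, 47.9, 49.0, 50.6)
y <- c(25.7, 32.4, 37.6, 38.0, 39.4, 52.6, 55.0, 59.7, 80.4)
n1 <- length(x); n2 <- length(y)

T_stat <- sum(rank(c(x, y))[-seq_along(x)])   # rank sum of the Y's
T_plus <- sum(outer(x, y, "<"))               # number of pairs with X_i < Y_j

c(T = T_stat, T_plus = T_plus)                # 105 and 60
T_stat == T_plus + n2 * (n2 + 1) / 2          # TRUE

# Statistic reported by wilcox.test(x, y): pairs with X_i > Y_j
W <- sum(outer(x, y, ">"))
W                                             # 39, i.e. n1 * n2 - T_plus
```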

Analyses for a Shift in Location Example

Simulated data: both samples are drawn from a scaled $$t$$-distribution with 5 degrees of freedom,
and the true shift parameter $$\Delta$$ is set to $$8$$

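loc_x = 42         # location of the X sample
loc_y = 50         # location of the Y sample (true shift = loc_y - loc_x = 8)
scale_factor = 10  # common scale of both samples
x = round(rt(11,5)*scale_factor+loc_x,1)  # 11 draws from t with 5 df
y = round(rt(9,5)*scale_factor+loc_y,1)   # 9 draws from t with 5 df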
sort(x)
##  [1] 20.0 27.5 29.7 36.5 41.7 42.1 45.5 46.6 47.9 49.0 50.6
sort(y)
## [1] 25.7 32.4 37.6 38.0 39.4 52.6 55.0 59.7 80.4

Analyses for a Shift in Location Example

Use the exact null distribution of $$T^+$$:

wilcox.test(x,y,exact=TRUE,correct=TRUE)
##
##  Wilcoxon rank sum test
##
## data:  x and y
## W = 39, p-value = 0.4561
## alternative hypothesis: true location shift is not equal to 0
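
As a sketch of where the exact $$p$$-value comes from, it can be rebuilt from the null distribution with base R's `pwilcox()`. The data are the sorted values printed earlier; `lower`, `upper` and `p_exact` are illustrative names, and the two-sided rule used here (doubling the smaller tail) is stated as an assumption about the calculation. The result should agree with the value printed above.

```r
# Simulated samples as printed earlier in this example (no ties)
x <- c(20.0, 27.5, 29.7, 36.5, 41.7, 42.1, 45.5, 46.6, 47.9, 49.0, 50.6)
y <- c(25.7, 32.4, 37.6, 38.0, 39.4, 52.6, 55.0, 59.7, 80.4)
n1 <- length(x); n2 <- length(y)

W <- sum(outer(x, y, ">"))            # statistic printed by wilcox.test(x, y)

# Two-sided exact p-value: double the smaller tail probability
lower <- pwilcox(W, n1, n2)                          # P(W' <= W) under H0
upper <- pwilcox(W - 1, n1, n2, lower.tail = FALSE)  # P(W' >= W) under H0
p_exact <- min(1, 2 * min(lower, upper))
p_exact
```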

Analyses for a Shift in Location Example

Use asymptotics:

wilcox.test(x,y,exact=FALSE,correct=FALSE)
##
##  Wilcoxon rank sum test
##
## data:  x and y
## W = 39, p-value = 0.425
## alternative hypothesis: true location shift is not equal to 0
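
The asymptotic $$p$$-value can be sketched by hand with a normal approximation. Under $$H_0$$ and without ties, the pair-count statistic has mean $$n_1 n_2 / 2$$ and variance $$n_1 n_2 (n_1 + n_2 + 1) / 12$$. The names `mu`, `sigma`, `z` and `p_asym` are illustrative, and no continuity correction is applied, matching `correct=FALSE`. The result should agree with the value printed above.

```r
# Simulated samples as printed earlier in this example (no ties)
x <- c(20.0, 27.5, 29.7, 36.5, 41.7, 42.1, 45.5, 46.6, 47.9, 49.0, 50.6)
y <- c(25.7, 32.4, 37.6, 38.0, 39.4, 52.6, 55.0, 59.7, 80.4)
n1 <- length(x); n2 <- length(y)

W <- sum(outer(x, y, ">"))                    # statistic printed by wilcox.test(x, y)
mu <- n1 * n2 / 2                             # null mean of W
sigma <- sqrt(n1 * n2 * (n1 + n2 + 1) / 12)   # null standard deviation of W
z <- (W - mu) / sigma                         # standardized statistic

p_asym <- 2 * pnorm(-abs(z))                  # two-sided normal p-value
p_asym
```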

Permutations and Mann-Whitney

• Rank all $$n_1+n_2$$ observations
• Color the first sample red and the second sample blue
• Count how many moves it takes to unscramble the two populations
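
One way to read the unscrambling picture, offered as an assumption rather than the lecture's own definition: a "move" swaps two neighbouring observations of different colors, and the goal is to place every red observation before every blue one. The sketch below counts such swaps for the simulated samples from the earlier example; `labels` and `moves` are illustrative names.

```r
# Simulated samples from the shift-in-location example (no ties)
x <- c(20.0, 27.5, 29.7, 36.5, 41.7, 42.1, 45.5, 46.6, 47.9, 49.0, 50.6)
y <- c(25.7, 32.4, 37.6, 38.0, 39.4, 52.6, 55.0, 59.7, 80.4)

# Colors of the pooled observations, ordered from lowest to highest rank
labels <- c(rep("red", length(x)), rep("blue", length(y)))[order(c(x, y))]

# Swap an adjacent (blue, red) pair until all reds precede all blues
moves <- 0
repeat {
  out_of_order <- which(head(labels, -1) == "blue" & tail(labels, -1) == "red")
  if (length(out_of_order) == 0) break
  k <- out_of_order[1]
  labels[c(k, k + 1)] <- labels[c(k + 1, k)]   # one move
  moves <- moves + 1
}

moves                       # 39
sum(outer(x, y, ">"))       # 39: number of pairs with X_i > Y_j
```

Each swap removes exactly one out-of-order (blue before red) pair, so the number of moves equals the Mann–Whitney count of pairs with $$X_i > Y_j$$; sorting the blues to the front instead would give the complementary count.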