Permutation Tests (Part 1)

Stanford University, Spring 2016, STATS 205

The Lady and the Tea

Fisher starts his second chapter in his famous book on "The Design of Experiments" first published in 1935:
Quote: A lady declares that by tasting a cup of tea made with milk she can discriminate whether the milk or the tea infusion was first added to the cup.
Fisher proposed the the following experiment to test this
Prepare eight cups of tea
In four, add tea first
In the other four, add milk first
Hold constant other features of the cups of tea (size, temperature, etc.)
Present cups in random order
Lady tasts them and judges
She knows that there are four of each type

The Lady and the Tea

Our theory is that the lady can taste the difference
Which translates into our null hypothesis:
\(H_0:\) lady cannot taste difference
The test statistics is the number of correct judgements
Under the null all judgments are equally likely
What is the null distribution of correct judgments?
Once we have that compare it to the experiment and compute the probability of having observed the experimantal outcome or a more extreme case

The Lady and the Tea

Because of randomization, all \(8! = 40320\) permutations of the cups are equally likely, and each one has its own number of correct judgements
There are \({8 \choose 4} = 70\) ways of choosing 4 cups from 8 cups
Think of it as, choosing 4 cups in succession by first choosing from 8 cups, then 7, 6, 5, which gives \(8 \times 7 \times 6 \times 5 = 1680\)
But by doing so, we have chosen not only every possible set of 4, but every possible set in every possible order
Since 4 cups can be rearranged in order in \(4 \times 3 \times 2 \times 1 = 24\), we correct by \(1680/24 = 70\)

The Lady and the Tea

##           [,1]  [,2]   [,3]   [,4]   [,5]  [,6]  [,7]   [,8]  
## truth     "tea" "milk" "milk" "milk" "tea" "tea" "tea"  "milk"
## judgement "tea" "milk" "milk" "milk" "tea" "tea" "milk" "tea"

Under the null, the reasons for the lady's judgements are unkown, except that they have nothing to do with the truth
The judgements are what they are; they are fixed
All \(70\) ways to line up cups are equally likely
But only one line up will match the lady's judgements

The Lady and the Tea

Of the four cups where the tea was poured first, select \(t\) of them to say "tea" correctly, and \(4−t\) to say "milk" incorrectly

t = 0:4
probability = (choose(4,t)*choose(4,4-t))/choose(8,4)
data.frame(t=t,probability=round(probability,digits=3))

##   t probability
## 1 0       0.014
## 2 1       0.229
## 3 2       0.514
## 4 3       0.229
## 5 4       0.014

The chances of a match is \(\frac{1}{70} = 0.014 < 0.05\)
We need perfect match!

The Lady and the Tea

Monte Carlo simulations

judgement = c("tea","milk","milk","milk","tea","tea","milk","tea")
nSim = 10000
permutations = replicate(nSim,sample(8,replace = FALSE))
matches = apply(permutations,2,function(i) sum(judgement[i] == judgement))
data.frame(t=t,
           probability=round(probability,digits=3),
           monte=round(table(matches)/nSim,digits=3))

##   t probability monte.matches monte.Freq
## 1 0       0.014             0      0.013
## 2 1       0.229             2      0.224
## 3 2       0.514             4      0.518
## 4 3       0.229             6      0.230
## 5 4       0.014             8      0.015

The Lady and the Tea

t = 0:7
probability = (choose(7,t)*choose(7,7-t))/choose(14,7)
data.frame(t=t,probability=round(probability,digits=4))

##   t probability
## 1 0      0.0003
## 2 1      0.0143
## 3 2      0.1285
## 4 3      0.3569
## 5 4      0.3569
## 6 5      0.1285
## 7 6      0.0143
## 8 7      0.0003

If she tasted 14 cups, it would be possible to reject \(H_0\) without requiring perfect judgement
We need either perfect match or one miss to get a \(p\)-value below \(0.05\)

Darwin and Fisher

In chapter 3 of Fisher's book, he then introduced the first nonparametric test
He illustrated his approach on data from Darwin
Darwin raised cross- and self-fertilized corn (Zea mays) plants
He planted equal numbers of each in four different pots, but not same number in each pot
Darwin measured heights of plants when they were between 12 and 24 inches in height

Darwin and Fisher

Original data (plant height in inches) from Darwin listed in Fisher's book

##    pair pot  cross   self   diff
## 1     1   1 23.500 17.375  6.125
## 2     2   1 12.000 20.375 -8.375
## 3     3   1 21.000 20.000  1.000
## 4     4   2 22.000 20.000  2.000
## 5     5   2 19.125 18.375  0.750
## 6     6   2 21.500 18.625  2.875
## 7     7   3 22.125 18.625  3.500
## 8     8   3 20.375 15.250  5.125
## 9     9   3 18.250 16.500  1.750
## 10   10   3 21.625 18.000  3.625
## 11   11   3 23.250 16.250  7.000
## 12   12   4 21.000 18.000  3.000
## 13   13   4 22.125 12.750  9.375
## 14   14   4 23.000 15.500  7.500
## 15   15   4 12.000 18.000 -6.000

Darwin and Fisher

Fisher proceeds by testing the hypothesis
\(H_0:\) No difference between crossed and self-fertilized plants

He peforms the paired sample \(t\)-test

## 
##  Paired t-test
## 
## data:  ZeaMays$cross and ZeaMays$self
## t = 2.148, df = 14, p-value = 0.0497
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.003899165 5.229434169
## sample estimates:
## mean of the differences 
##                2.616667

Darwin and Fisher

ZeaMays$diff

##  [1]  6.125 -8.375  1.000  2.000  0.750  2.875  3.500  5.125  1.750  3.625
## [11]  7.000  3.000  9.375  7.500 -6.000

Then he invents the first permutation test
Under \(H_0\) that self-fertilized versus cross-fertilized does not matter, only chance determined whether \(A-B\) or \(B-A\)
So the absolute value of the difference is what it is, but the plus or minus sign is by chance alone
Test statistic is sum of the differences
There are \(2^{15} = 32768\) ways to swap the plus and minus signs, all equaly likely under \(H_0\)
Calculate statistic for each one to get null distribution

Darwin and Fisher

Random samples \(X_1,\dots,X_n\) from \(X\)
Test \(H_0: E(X) = 0\)
Write the usuall statistic as \[T = \sum_{i=1}^n X_i = \sum_{i=1}^n |X_i| \delta_i\]
and split samples into two parts \[(|X_1|,\dots,|X_n|) \hspace{1cm}\text{and}\hspace{1cm} (\delta_1,\dots,\delta_n)\]
Under \(H_0\) the two parts are independent and \(\delta_i\) is a fair coin flip \(\{-1,1\}\)
The distribution of \(T\) is not well defined under \(H_0\)

Darwin and Fisher

But if we fix \(|X_i|\) at their observed values and regard \(\delta_i\) as random
Then the condition null of \(T\) is well defined
The only thing left to do is to compute \[P\left(T \ge t \, \big| \, H_0, |X_1|, \dots, |X_n| \right)\] by listing all possible samples of type \[\left(\pm \, |X_1|, \dots, \pm \, |X_n|\right)\]
This is usually called the Fisher's randomization test
Permutation tests describe a special case of randomization
In permutation tests, we permute group labels on the observations

Darwin and Fisher

Using Monte Carlo simulations

nSim = 10000
n = length(ZeaMays$diff)
absDiff = abs(ZeaMays$diff)
signs = matrix(replicate(nSim*n,sample(c(-1,1),size=1)),nrow = nSim)
TNull = apply(signs,1,function(row) sum(row * absDiff))
T0 = sum(ZeaMays$diff)
pvalue = 2*mean(TNull >= T0); pvalue

## [1] 0.054

Darwin and Fisher

We already looked at similar tests
The sign test:
- Test statistic: \(S = \#_i \{ X_i \}\)
- Wilcoxon signed-rank test: \(W = \sum_{i=1}^n \operatorname{R}(|X_i|) \delta_i\)
In contrast to the randomization test, the Wilcoxon signed-rank test is not conditional on the ranks

Permutation Test – General Recipe

Decide on a test statistic \(T\)
List the possible values of \(T\)
Under \(H_0\), all ways of re-arranging the data are equally likely
\(P(T = t)\) is proportional to the number of ways of getting the value \(t\)
The permutation \(p\)-value is the probability of getting a value of \(T\) as extreme or more extreme as the value we observed, "extreme" meaning in a direction inconsistent with \(H_0\)

Permutation Test – One-Sample Problem

Location model: \(X_i = \theta + e_i\)
Assume error distribution has symmetric pdf \(f(t)\) around \(\theta\)
Hypothesis test: \(H_0: \theta = 0\) versus \(H_A: \theta > 0\)
Test statistic: Sum of deviations of \(\theta\) about \(0\)
Under null, positive and negative deviations cancel each other out, so sum should be close to zero
We reject the null in favor of alternative if sum is "large"
To find what large means we comaper to null distribution using permutation
What to permute?
We use principle of sufficiency

Permutation Test – One-Sample Problem

Assume that we had lost the signs information
Under the null, we can recover the orginal distribution by randomly assigning signs to each deviation
In this case, we say that the deviations are sufficient
Under the alternative of a location parameter larger than zero, randomizing the signs of the deviations should reduce the sum from what it was originally
List all possible reassignments of plus and minus signs \(2^n\)
Compare observed sum with all possible sums

Permutation Test – Two-Sample Problem

Test statistic: Sum of observations from second population
Assume that we had lost the group assignments
Under the null, we can recover the original distribution by randomly assigning groups to each observation
In this case, we say that the observations (without group labels) are sufficient
Under the alternative one group will have a larger sum
List all possible group reassignments and compute sum of second population
Compare observed sum to all possible sums

Assumption: Exchangeable Observations

A sufficient condition for permutation test is exchangeable of observations
Consider a collection of random variables \[ X_1,\dots,X_n \]
If their joint distribution are equal under permuations \(\pi\) \[P_{X_1,\dots,X_n}(A) = P_{X_{\pi(1)},\dots,X_{\pi(n)}}(A)\]
then \(X_1,\dots,X_n\) are called exchangable
This is a weaker assumption than indepdendence of observations in the combined sample

Assumption: Exchangeable Observations

Some examples:
Independent and identically distributed observations are exchangeable
Samples without replacement from a finite population are exchangeable:
- An urn contains \(b\) black balls, \(r\) red balls, \(y\) yellow balls, and so forth
- A series of balls are extracted from the urn
- After the \(i\)th extraction, the color of the ball \(X_i\) is noted and \(k\) balls of the same color are added to the urn,
- where \(k\) can be any integer, positive, negative, or zero
- The set of random events \(\{X_i\}\) form an exchangeable sequence
Dependent normally distributed random variables \(\{X_i\}\) for which the variance of \(X_i\) is a constant independent of \(i\) and the covariance of \(X_i\) and \(X_j\) is a constant independent of \(i\) and \(j\)

Assumption: Exchangeable Observations

Sometimes a simple transformation will ensure that observations are exchangeable
For example, if we know that \(X\) comes from a population with mean \(\mu\) and distribution \(F(x−\mu)\) and
an independent observation \(Y\) comes from a population with mean \(v\) and distribution \(F(x−v)\)
then the independent variables \(X'=X−\mu\) and \(Y'=Y−v\) are exchangeable

How to Compute It?

Small sample size:
Enumerations of all permutations using Gray codes
(takes expotional time in \(n\))
Medium sample size:
Fast Frourier Transform
(takes polynomial time in \(n\))
Large sample size:
Monte Carlo simulations
(time depends on accuracy requirements)

References

Good (2005). Permutations, Parametric, and Boostrap Test of Hypothesis
Basu (1980). Randomization Analysis of Experimential Data: The Fisher Randomization Test
Pagano and Tritchler (1983). On Obtaining Permutation Distributions in Polynomial Time
Diaconis And Holmes (1994). Gray Codes for Randomization Procedures