Stanford University, Spring 2016, STATS 205

## Test of Independence

• Observations are two measurements on the same subject (e.g. eye color and hair color)
• Random samples $$(X_1,Y_1),\dots,(X_n,Y_n)$$
• $$X$$ and $$Y$$ take values in $$\{1,2,\dots,I\}$$ and $$\{1,2,\dots,J\}$$, respectively
• Consider the hypothesis test: \begin{align} H_0: P(X=i,Y=j) & = P(X=i)P(Y=j) \text{ for all } i \text{ and } j \\ H_A: P(X=i,Y=j) & \ne P(X=i)P(Y=j) \text{ for some } i \text{ and } j \end{align}
• Construct the contingency table of the $$n$$ observations, where $O_{ij} = \#\{ l : 1 \le l \le n,\ (X_l,Y_l) = (i,j) \}$ counts the observations that fall in cell $$(i,j)$$
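
The counting step can be sketched in R. This is a minimal illustration, assuming the `HairEye` table used later in these slides is R's built-in `HairEyeColor` dataset summed over `Sex`; the names `cells`, `obs`, and `O` are introduced here only for illustration.

```r
# Assumption: HairEye is the built-in HairEyeColor table summed over Sex.
HairEye <- margin.table(HairEyeColor, c(1, 2))

# Expand the table into one row per subject, i.e. n pairs (X_l, Y_l).
cells <- as.data.frame(HairEye)   # columns: Hair, Eye, Freq
obs <- cells[rep(seq_len(nrow(cells)), cells$Freq), c("Hair", "Eye")]
n <- nrow(obs)

# O_ij = number of subjects l with (X_l, Y_l) = (i, j).
O <- table(Hair = obs$Hair, Eye = obs$Eye)
O
```

Cross-tabulating the individual pairs with `table()` recovers the same counts as the aggregated table.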

## Test of Independence

Some definitions:

\begin{align} \text{Row sum:} & O_{i+} \\ \text{Column sum:} & O_{+ j} \\ \text{Total sum:} & O_{++} \\ \text{Row vector i:} & O_i \\ \text{Column vector j:} & O_j \end{align}
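
These quantities map directly onto base R functions. The sketch below assumes `HairEye` is the built-in `HairEyeColor` table summed over `Sex`; the variable names are illustrative.

```r
# Assumption: HairEye is the built-in HairEyeColor table summed over Sex.
HairEye <- margin.table(HairEyeColor, c(1, 2))

row_sums  <- rowSums(HairEye)      # O_{i+}, one value per hair color
col_sums  <- colSums(HairEye)      # O_{+j}, one value per eye color
total     <- sum(HairEye)          # O_{++}
row_black <- HairEye["Black", ]    # row vector O_i for i = Black
col_blue  <- HairEye[, "Blue"]     # column vector O_j for j = Blue
```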

## Test of Independence

Expected frequencies under the independence assumption are the total count times the product of the estimated marginal probabilities $\widehat{E}_{ij} = O_{++} \frac{O_{i+}}{O_{++}} \frac{O_{+ j}}{O_{++}} = \frac{O_{i+} O_{+ j}}{O_{++}}$
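
A minimal sketch of this formula in R, assuming `HairEye` is the built-in `HairEyeColor` table summed over `Sex`. The outer product of the row and column sums, divided by the total, gives all expected frequencies at once; `E_hat` is an illustrative name.

```r
# Assumption: HairEye is the built-in HairEyeColor table summed over Sex.
HairEye <- margin.table(HairEyeColor, c(1, 2))
n <- sum(HairEye)                                    # O_{++}

# E_hat[i, j] = O_{i+} * O_{+j} / O_{++}
E_hat <- outer(rowSums(HairEye), colSums(HairEye)) / n
round(E_hat, 1)
```

By construction, the expected table has the same row and column sums as the observed table.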

##        Eye
## Hair    Brown Blue Hazel Green
##   Black    68   20    15     5
##   Brown   119   84    54    29
##   Red      26   17    14    14
##   Blond     7   94    10    16

Test statistic $$\chi^2 = \sum_i \sum_j \frac{(O_{ij}-\widehat{E}_{ij})^2}{\widehat{E}_{ij}}$$

Under $$H_0$$, the statistic is asymptotically $$\chi^2$$ distributed with $$(I-1)(J-1)$$ degrees of freedom
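
The statistic, its degrees of freedom, and the asymptotic p-value can be computed by hand. This is a sketch assuming `HairEye` is the built-in `HairEyeColor` table summed over `Sex`; `E_hat`, `chi2`, `df`, and `p_value` are illustrative names.

```r
# Assumption: HairEye is the built-in HairEyeColor table summed over Sex.
HairEye <- margin.table(HairEyeColor, c(1, 2))
E_hat <- outer(rowSums(HairEye), colSums(HairEye)) / sum(HairEye)

# Sum of (O - E)^2 / E over all cells.
chi2 <- sum((HairEye - E_hat)^2 / E_hat)

# (I - 1)(J - 1) degrees of freedom.
df <- (nrow(HairEye) - 1) * (ncol(HairEye) - 1)

# Upper tail probability of the chi-squared reference distribution.
p_value <- pchisq(chi2, df = df, lower.tail = FALSE)
c(chi2 = chi2, df = df, p_value = p_value)
```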

## Test of Independence

• This is a goodness-of-fit test
• Quantifies how well the independence model fits observations
• Measures discrepancy between observed values and expected values under the model
• Rather than only rejecting this global null hypothesis, can we find what is driving the statistic?
• Can we visualize this?
• Two ways:
• Visualization of contingency table with association and mosaic plots
• Visualization of deviation from model with correspondence analysis

## Hair and Eye Color of Statistics Students

Survey of students at the University of Delaware reported by Snee (1974)

HairEye
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    68   20    15     5
##   Brown   119   84    54    29
##   Red      26   17    14    14
##   Blond     7   94    10    16

Rows: hair color of 592 statistics students
Columns: eye color of the same 592 statistics students
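
The slides do not show how `HairEye` was created. One plausible construction, stated here as an assumption, collapses R's built-in three-way `HairEyeColor` table (hair, eye, sex) over `Sex`:

```r
# Assumption: the slides' HairEye equals HairEyeColor summed over Sex.
dim(HairEyeColor)                                  # Hair x Eye x Sex
HairEye <- margin.table(HairEyeColor, c(1, 2))     # keep Hair and Eye
HairEye
sum(HairEye)                                       # number of students
```

This reproduces the counts printed above, which is why the same construction is assumed in the other sketches.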

## Hair and Eye Color of Statistics Students

chisq.test(HairEye)
##
##  Pearson's Chi-squared test
##
## data:  HairEye
## X-squared = 138.29, df = 9, p-value < 2.2e-16
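
Beyond the printed summary, the object returned by `chisq.test` stores the pieces needed for a follow-up analysis. A short sketch, assuming `HairEye` is the built-in `HairEyeColor` table summed over `Sex`:

```r
# Assumption: HairEye is the built-in HairEyeColor table summed over Sex.
HairEye <- margin.table(HairEyeColor, c(1, 2))
fit <- chisq.test(HairEye)

fit$statistic             # X-squared
fit$parameter             # degrees of freedom
fit$p.value               # asymptotic p-value
round(fit$expected, 1)    # expected counts under independence
round(fit$residuals, 2)   # Pearson residuals (O - E) / sqrt(E)
round(fit$stdres, 2)      # standardized residuals
```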

## Association Plots

• This is a follow-up to a $$\chi^2$$ test whose null hypothesis has been rejected
• Pearson residuals $r_{ij} = \frac{O_{ij}-E_{ij}}{\sqrt{E_{ij}}}$
• The $$\chi^2$$ statistic is the sum of the squared Pearson residuals over all cells $\chi^2 = \sum_i \sum_j r_{ij}^2$
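
The decomposition of the statistic into cell contributions can be checked numerically. This sketch assumes `HairEye` is the built-in `HairEyeColor` table summed over `Sex`; `E` and `r` are illustrative names.

```r
# Assumption: HairEye is the built-in HairEyeColor table summed over Sex.
HairEye <- margin.table(HairEyeColor, c(1, 2))
E <- outer(rowSums(HairEye), colSums(HairEye)) / sum(HairEye)

# Pearson residuals r_ij = (O_ij - E_ij) / sqrt(E_ij).
r <- (HairEye - E) / sqrt(E)
round(r, 2)

# Squaring and summing the residuals gives the chi-squared statistic.
sum(r^2)

# Share of the statistic contributed by each cell.
round(r^2 / sum(r^2), 3)
```

Cells with large absolute residuals are the ones driving the rejection.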

## Association Plots

• Each cell of contingency table is represented by a rectangle encoding information in width, height, location, and color of the rectangle
• Height: Proportional to Pearson residual $$\frac{O_{ij}-E_{ij}}{\sqrt{E_{ij}}}$$
• Width: Proportional to $$\sqrt{E_{ij}}$$
• Area: Proportional to $$O_{ij}-E_{ij}$$
• Baseline:
• If $$O_{ij} > E_{ij}$$, then the rectangle lies above the baseline
• otherwise the rectangle lies below the baseline
• Color: Standardized Pearson residuals that are asymptotically standard normal
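
The rectangle geometry described above can be reproduced numerically and drawn with base R's `assocplot`. This is a sketch assuming `HairEye` is the built-in `HairEyeColor` table summed over `Sex`. Note that base `assocplot` colors rectangles only by the sign of the residual; residual-based shading as described in the last bullet would need another tool (for example `assoc(..., shade = TRUE)` from the `vcd` package, which is not used here).

```r
# Assumption: HairEye is the built-in HairEyeColor table summed over Sex.
HairEye <- margin.table(HairEyeColor, c(1, 2))
E <- outer(rowSums(HairEye), colSums(HairEye)) / sum(HairEye)

height <- (HairEye - E) / sqrt(E)   # Pearson residual
width  <- sqrt(E)
area   <- height * width            # equals O - E
above  <- height > 0                # TRUE: rectangle above the baseline

# Base R association plot; colors here mark only the sign of the residual.
assocplot(HairEye, main = "Hair and eye color: association plot")
```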

## Mosaic Plots

• Cell frequency $$O_{ij}$$, cell probability $$p_{ij} = O_{ij} / O_{++}$$
• Take unit square
• Divide unit square vertically into rectangles
• Height proportional to observed marginal frequencies $$O_{i+}$$
• which is proportional to marginal probabilities $$p_i = O_{i+} / O_{++}$$
• Subdivide resulting rectangles horizontally
• Width proportional to $$p_{j|i} = O_{ij}/O_{i+}$$
• which are the conditional probabilities of the second variable given the first
• Area is proportional to the observed cell frequency and probability $$p_{ij} = p_{i} \times p_{j|i} = ( O_{i+}/O_{++} ) \times ( O_{ij} / O_{i+} )$$
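
The two splits and the resulting tile areas can be checked numerically. This sketch assumes `HairEye` is the built-in `HairEyeColor` table summed over `Sex`; `p_i`, `p_j_given_i`, and `tile_area` are illustrative names.

```r
# Assumption: HairEye is the built-in HairEyeColor table summed over Sex.
HairEye <- margin.table(HairEyeColor, c(1, 2))
n <- sum(HairEye)

p_i <- rowSums(HairEye) / n             # first split: marginal p_i
p_j_given_i <- prop.table(HairEye, 1)   # second split: conditional p_{j|i}

# Tile area = p_i * p_{j|i}, which is the cell probability p_ij.
tile_area <- sweep(p_j_given_i, 1, p_i, "*")
round(tile_area, 3)
```

The tile areas sum to one, the area of the unit square.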

## Mosaic Plots

• The order of conditioning matters
• Color: Standardized Pearson residuals that are asymptotically standard normal
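
A sketch of both conditioning orders using base R's `mosaicplot`, assuming `HairEye` is the built-in `HairEyeColor` table summed over `Sex`. Transposing the table swaps which variable is split first, and `shade = TRUE` colors tiles by residuals from the independence model.

```r
# Assumption: HairEye is the built-in HairEyeColor table summed over Sex.
HairEye <- margin.table(HairEyeColor, c(1, 2))

# The two orders use different conditional distributions.
eye_given_hair <- prop.table(HairEye, 1)      # P(Eye = j | Hair = i)
hair_given_eye <- prop.table(t(HairEye), 1)   # P(Hair = i | Eye = j)

# Split by hair first, then by eye within each hair color.
mosaicplot(HairEye, shade = TRUE, main = "Hair, then eye")

# Split by eye first, then by hair within each eye color.
mosaicplot(t(HairEye), shade = TRUE, main = "Eye, then hair")
```

The tile areas are the same in both plots, but the widths and heights differ because the conditional probabilities differ.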

## Mosaic Plots

Marginal probabilities of the first variable

frequency = rowSums(HairEye)
proportions = frequency/sum(frequency)
data.frame(frequency=round(frequency,0),proportions=round(proportions,2))
##       frequency proportions
## Black       108        0.18
## Brown       286        0.48
## Red          71        0.12
## Blond       127        0.21

## Mosaic Plots

Conditional probabilities of the second variable given the first

HairEye/rowSums(HairEye)
##        Eye
## Hair         Brown       Blue      Hazel      Green
##   Black 0.62962963 0.18518519 0.13888889 0.04629630
##   Brown 0.41608392 0.29370629 0.18881119 0.10139860
##   Red   0.36619718 0.23943662 0.19718310 0.19718310
##   Blond 0.05511811 0.74015748 0.07874016 0.12598425

## Mosaic Plots

If hair color and eye color were independent, $$p_{ij} = p_i \times p_j$$, then the tiles in each row would all align
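
The alignment follows because independence implies $$p_{j|i} = p_j$$ for every row $$i$$. The sketch below checks this on the expected table and shows how far the observed conditionals deviate, assuming `HairEye` is the built-in `HairEyeColor` table summed over `Sex`; all variable names are illustrative.

```r
# Assumption: HairEye is the built-in HairEyeColor table summed over Sex.
HairEye <- margin.table(HairEyeColor, c(1, 2))
n <- sum(HairEye)
p_j <- colSums(HairEye) / n                       # marginal eye color

# Under independence every row has the same conditional distribution p_j.
E <- outer(rowSums(HairEye), colSums(HairEye)) / n
cond_indep <- prop.table(E, 1)
round(cond_indep, 3)

# Observed conditionals minus p_j: nonzero entries are misaligned tiles.
deviation <- sweep(prop.table(HairEye, 1), 2, p_j, "-")
round(deviation, 3)
```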

## Correspondence Analysis

• Exploratory data analysis for non-negative data matrices
• Converts matrix into a plot where rows and columns are depicted as points
• Its algebraic form first appeared in 1935
• Actively developed in 1965 in France
• We focus on the application of correspondence analysis for contingency tables
• Analogous to principal components analysis, but appropriate to categorical rather than to continuous variables

## Correspondence Analysis

• Define a distance between rows
• Use the $$\chi^2$$ distance $$d_{\chi^2}(\boldsymbol{x},\boldsymbol{y})$$ between two vectors $$\boldsymbol{x}$$ and $$\boldsymbol{y}$$, defined as $d_{\chi^2}(\boldsymbol{x},\boldsymbol{y}) = (\boldsymbol{x}-\boldsymbol{y})^T D_c^{-1} (\boldsymbol{x}-\boldsymbol{y})$
• where $$D_c$$ is a diagonal matrix with $\operatorname{diag}(D_c) = \boldsymbol{c} = \sum_i p_i \boldsymbol{a_i}$
• with $$\boldsymbol{a_i} = (p_{j|i})$$ being the probability vector conditioned on row $$i$$
• and the centroid $$\boldsymbol{c}$$ being the weighted average of conditional row probabilities
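
The distance between row profiles can be sketched as follows, assuming `HairEye` is the built-in `HairEyeColor` table summed over `Sex`. The helper `chisq_dist` and the other names are hypothetical and introduced only for illustration. Because $$D_c$$ is diagonal, the quadratic form reduces to a weighted sum of squared differences.

```r
# Assumption: HairEye is the built-in HairEyeColor table summed over Sex.
HairEye <- margin.table(HairEyeColor, c(1, 2))
n <- sum(HairEye)

p_i <- rowSums(HairEye) / n         # row weights p_i
A <- prop.table(HairEye, 1)         # row i holds the profile a_i = (p_{j|i})
centroid <- colSums(p_i * A)        # c = sum_i p_i a_i

# (x - y)^T D_c^{-1} (x - y) with D_c = diag(centroid).
chisq_dist <- function(x, y, centroid) sum((x - y)^2 / centroid)

d_black_blond <- chisq_dist(A["Black", ], A["Blond", ], centroid)
d_black_brown <- chisq_dist(A["Black", ], A["Brown", ], centroid)

# Weighted average distance of the profiles to the centroid equals chi^2 / n.
inertia <- sum(p_i * apply(A, 1, chisq_dist, y = centroid, centroid = centroid))
```

Black and blond hair have very different eye color profiles, so their distance is much larger than the distance between black and brown hair.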