Stanford University, Spring 2016, STATS 205

Test of Independence

  • Observations are two measurements on the same subject (e.g. eye color and hair color)
  • Random samples \((X_1,Y_1),\dots,(X_n,Y_n)\)
  • \(X\) and \(Y\) are categorical, with possibly different ranges \(\{1,2,\dots,I\}\) and \(\{1,2,\dots,J\}\)
  • Consider the hypothesis test: \[ \begin{align} H_0: P(X=i,Y=j) & = P(X=i)P(Y=j) \text{ for all } i \text{ and } j \\ H_A: P(X=i,Y=j) & \ne P(X=i)P(Y=j) \text{ for some } i \text{ and } j \end{align} \]
  • Construct contingency table of \(n\) observations \[O_{ij} = \#_{1 \le l \le n} \{ (X_l,Y_l) = (i,j) \}\]
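
As a minimal sketch of the counting step, `table()` cross-tabulates paired observations into \(O_{ij}\). The vectors `x` and `y` below are hypothetical toy data invented for illustration; they are not part of the lecture.

```r
# Hypothetical paired observations, invented purely for illustration
x <- factor(c("a", "b", "a", "c", "b", "a"), levels = c("a", "b", "c"))
y <- factor(c("u", "v", "u", "u", "v", "v"), levels = c("u", "v"))

# table() cross-tabulates the pairs:
# O[i, j] counts the observations l with (X_l, Y_l) = (i, j)
O <- table(X = x, Y = y)
print(O)
```

Declaring the levels explicitly keeps empty cells in the table as zeros instead of dropping them.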

Test of Independence

Some definitions:

\[ \begin{align} \text{Row sum:} & O_{i+} \\ \text{Column sum:} & O_{+ j} \\ \text{Total sum:} & O_{++} \\ \text{Row vector i:} & O_i \\ \text{Column vector j:} & O_j \end{align} \]
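
These quantities map directly onto base R functions. The sketch below assumes the hair and eye color table used in these slides is R's built-in `HairEyeColor` data summed over sex; the variable names are illustrative.

```r
# Assumption: the slides' table is the built-in HairEyeColor summed over Sex
HairEye <- margin.table(HairEyeColor, c(1, 2))

row_sums <- rowSums(HairEye)   # O_{i+}, one value per hair color
col_sums <- colSums(HairEye)   # O_{+j}, one value per eye color
total    <- sum(HairEye)       # O_{++}

blond_row   <- HairEye["Blond", ]  # row vector for i = Blond
blue_column <- HairEye[, "Blue"]   # column vector for j = Blue
```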

Test of Independence

Under the independence assumption, the expected frequencies are the total count times the product of the estimated marginal probabilities \[\widehat{E}_{ij} = O_{++} \frac{O_{i+}}{O_{++}} \frac{O_{+ j}}{O_{++}} = \frac{O_{i+} O_{+ j}}{O_{++}}\]

##        Eye
## Hair    Brown Blue Hazel Green
##   Black    68   20    15     5
##   Brown   119   84    54    29
##   Red      26   17    14    14
##   Blond     7   94    10    16

Test statistic \(\chi^2 = \sum_i \sum_j \frac{(O_{ij}-\widehat{E}_{ij})^2}{\widehat{E}_{ij}}\)

Under \(H_0\), the statistic is asymptotically \(\chi^2\) distributed with \((I-1)(J-1)\) degrees of freedom
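
The statistic can also be computed by hand from the formulas above. This sketch assumes the table printed on this slide is the built-in `HairEyeColor` data summed over sex; `O`, `E`, `X2`, and `p_value` are illustrative names.

```r
# Assumption: the slides' table is the built-in HairEyeColor summed over Sex
HairEye <- margin.table(HairEyeColor, c(1, 2))
O <- unclass(HairEye)                    # observed counts as a plain matrix
n <- sum(O)                              # O_{++}

# Expected counts under independence: O_{i+} * O_{+j} / O_{++}
E <- outer(rowSums(O), colSums(O)) / n

# Pearson chi-squared statistic and its degrees of freedom
X2 <- sum((O - E)^2 / E)
df <- (nrow(O) - 1) * (ncol(O) - 1)

# Upper tail probability of the asymptotic chi-squared distribution
p_value <- pchisq(X2, df = df, lower.tail = FALSE)
```

`outer()` forms all products of row and column sums at once, so no explicit loop over cells is needed.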

Test of Independence

  • This is a goodness-of-fit test
  • Quantifies how well the independence model fits observations
  • Measures discrepancy between observed values and expected values under the model
  • Rather than only rejecting this global null hypothesis, can we find what is driving the statistic?
  • Can we visualize this?
  • Two ways:
    • Visualization of contingency table with association and mosaic plots
    • Visualization of deviation from model with correspondence analysis

Hair and Eye Color of Statistics Students

Survey of students at the University of Delaware reported by Snee (1974)

HairEye
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    68   20    15     5
##   Brown   119   84    54    29
##   Red      26   17    14    14
##   Blond     7   94    10    16

Rows: hair color of 592 statistics students
Columns: eye color of the same 592 statistics students
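
The slides do not show how `HairEye` was created. One way that reproduces the printed counts, assuming the object comes from R's built-in `HairEyeColor` dataset (a Hair by Eye by Sex table), is to sum over sex:

```r
# Assumption: HairEye is the built-in HairEyeColor table collapsed over Sex.
# margin.table() keeps dimensions 1 (Hair) and 2 (Eye) and sums over the rest.
HairEye <- margin.table(HairEyeColor, margin = c(1, 2))
print(HairEye)
```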

Hair and Eye Color of Statistics Students

chisq.test(HairEye)
## 
##  Pearson's Chi-squared test
## 
## data:  HairEye
## X-squared = 138.29, df = 9, p-value < 2.2e-16

Association Plots

  • This is a follow-up to a \(\chi^2\) test when the null hypothesis is rejected
  • Pearson residuals \[r_{ij} = \frac{O_{ij}-E_{ij}}{\sqrt{E_{ij}}}\]
  • The \(\chi^2\) statistic is the sum of the squared residuals over all cells \[\chi^2 = \sum_j \sum_i r_{ij}^2\]
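
A short sketch of the residual decomposition, assuming `HairEye` is the built-in `HairEyeColor` data summed over sex. The object returned by `chisq.test()` stores observed and expected counts, so the residuals can be formed by hand and compared with the ones it reports.

```r
# Assumption: the slides' table is the built-in HairEyeColor summed over Sex
HairEye <- margin.table(HairEyeColor, c(1, 2))
fit <- chisq.test(HairEye)

# Pearson residuals r_ij = (O_ij - E_ij) / sqrt(E_ij), computed by hand
r <- (fit$observed - fit$expected) / sqrt(fit$expected)

# Squaring and summing the residuals recovers the test statistic
X2_from_residuals <- sum(r^2)

# Cell with the largest absolute residual, i.e. the largest single contribution
largest <- which(abs(r) == max(abs(r)), arr.ind = TRUE)
```

Cells with large \(|r_{ij}|\) are the ones driving the rejection.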

Association Plots

  • Each cell of contingency table is represented by a rectangle encoding information in width, height, location, and color of the rectangle
  • Height: Proportional to Pearson residual \(\frac{O_{ij}-E_{ij}}{\sqrt{E_{ij}}}\)
  • Width: Proportional to \(\sqrt{E_{ij}}\)
  • Area: Proportional to \(O_{ij}-E_{ij}\)
  • Baseline:
    • If \(O_{ij} > E_{ij}\), the rectangle is drawn above the baseline
    • Otherwise the rectangle is drawn below the baseline
  • Color: Standardized Pearson residuals that are asymptotically standard normal
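
The rectangle geometry described above can be checked numerically. This is a sketch, assuming `HairEye` is the built-in `HairEyeColor` data summed over sex. It draws the plot with base R's `assocplot()`, which colors rectangles only by sign; residual-based shading as described on the slide is offered by the `vcd` package (`assoc()` with `shade = TRUE`), and the slides do not say which function produced their figure.

```r
# Assumption: the slides' table is the built-in HairEyeColor summed over Sex
HairEye <- margin.table(HairEyeColor, c(1, 2))
fit <- chisq.test(HairEye)
O <- fit$observed
E <- fit$expected

height <- (O - E) / sqrt(E)   # Pearson residual
width  <- sqrt(E)
area   <- height * width      # signed area, equals O - E
above  <- O > E               # TRUE: rectangle sits above the baseline

# Standardized residuals, the quantity the slide uses for coloring
stdres <- fit$stdres

# Base R association plot (colors indicate sign only)
assocplot(HairEye, main = "Association plot of hair and eye color")
```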

Mosaic Plots

  • Cell frequency \(O_{ij}\), cell probability \(p_{ij} = O_{ij} / O_{++}\)
  • Take unit square
  • Divide unit square vertically into rectangles
    • Height proportional to observed marginal frequencies \(O_{i+}\)
    • which is proportional to marginal probabilities \(p_i = O_{i+} / O_{++}\)
  • Subdivide resulting rectangles horizontally
    • Width proportional to \(p_{j|i} = O_{ij}/O_{i+}\)
    • which are the conditional probabilities of the second variable given the first
  • Area is proportional to the observed cell frequency and probability \(p_{ij} = p_{i} \times p_{j|i} = ( O_{i+}/O_{++} ) \times ( O_{ij} / O_{i+} )\)
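
The area identity can be verified directly. A minimal sketch, assuming `HairEye` is the built-in `HairEyeColor` data summed over sex; `p_row`, `p_cond`, and `tile_area` are illustrative names.

```r
# Assumption: the slides' table is the built-in HairEyeColor summed over Sex
HairEye <- margin.table(HairEyeColor, c(1, 2))
n <- sum(HairEye)

p_row  <- rowSums(HairEye) / n          # p_i: first split of the unit square
p_cond <- HairEye / rowSums(HairEye)    # p_{j|i}: second split within row i

# Tile area p_i * p_{j|i}; the vector p_row is recycled down the columns,
# so row i of p_cond is multiplied by p_row[i]
tile_area <- p_row * p_cond
```

The tile areas equal the cell probabilities \(O_{ij}/O_{++}\) and sum to one, the area of the unit square.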

Mosaic Plots

  • The order of conditioning matters
  • Color: Standardized Pearson residuals that are asymptotically standard normal
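
To see why the order of conditioning matters, compare the two sets of conditional probabilities and the two resulting plots. This is a sketch using base R's `mosaicplot()`, assuming `HairEye` is the built-in `HairEyeColor` data summed over sex. The layout of the base R plot may differ from the figures in the original slides, which are not reproduced here.

```r
# Assumption: the slides' table is the built-in HairEyeColor summed over Sex
HairEye <- margin.table(HairEyeColor, c(1, 2))

eye_given_hair <- prop.table(HairEye, margin = 1)  # rows sum to 1
hair_given_eye <- prop.table(HairEye, margin = 2)  # columns sum to 1

eye_given_hair["Blond", "Blue"]   # 94 / 127: share of blue eyes among blonds
hair_given_eye["Blond", "Blue"]   # 94 / 215: share of blonds among blue eyes

# Same data, two conditioning orders; shade = TRUE colors tiles by residual
op <- par(mfrow = c(1, 2))
mosaicplot(HairEye, shade = TRUE, main = "Hair, then eye")
mosaicplot(t(HairEye), shade = TRUE, main = "Eye, then hair")
par(op)
```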

Mosaic Plots

Marginal probabilities of first variable

frequency = rowSums(HairEye)
proportions = frequency/sum(frequency)
data.frame(frequency=round(frequency,0),proportions=round(proportions,2))
##       frequency proportions
## Black       108        0.18
## Brown       286        0.48
## Red          71        0.12
## Blond       127        0.21

Mosaic Plots

Conditional probabilities of second variable given first variable

HairEye/rowSums(HairEye)
##        Eye
## Hair         Brown       Blue      Hazel      Green
##   Black 0.62962963 0.18518519 0.13888889 0.04629630
##   Brown 0.41608392 0.29370629 0.18881119 0.10139860
##   Red   0.36619718 0.23943662 0.19718310 0.19718310
##   Blond 0.05511811 0.74015748 0.07874016 0.12598425

Mosaic Plots

If hair color and eye color were independent, \(p_{ij} = p_i \times p_j\), then the conditional probabilities \(p_{j|i}\) would be the same in every row and the tiles in each row would all align
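
A numerical version of this alignment check: under independence every row of conditional probabilities would equal the column marginals. The sketch assumes `HairEye` is the built-in `HairEyeColor` data summed over sex; `deviation` is an illustrative name.

```r
# Assumption: the slides' table is the built-in HairEyeColor summed over Sex
HairEye <- margin.table(HairEyeColor, c(1, 2))

p_col  <- colSums(HairEye) / sum(HairEye)   # p_j: marginal eye color probabilities
p_cond <- prop.table(HairEye, margin = 1)   # p_{j|i}: eye color given hair color

# Subtract p_j from every row (sweep's default operation is "-");
# all entries would be near zero if the tiles aligned
deviation <- sweep(p_cond, 2, p_col)
print(round(deviation, 2))
```

Large positive or negative entries mark tiles that are wider or narrower than independence would predict.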

Correspondence Analysis

  • Exploratory data analysis for non-negative data matrices
  • Converts matrix into a plot where rows and columns are depicted as points
  • Its algebraic form first appeared in 1935
  • Actively developed in 1965 in France
  • We focus on the application of correspondence analysis for contingency tables
  • Analogous to principal components analysis, but appropriate to categorical rather than to continuous variables

Correspondence Analysis

  • Define a distance between rows
  • Use the \(\chi^2\) distance \(d_{\chi^2}(\boldsymbol{x},\boldsymbol{y})\) between two vectors \(\boldsymbol{x}\) and \(\boldsymbol{y}\), defined as \[d_{\chi^2}(\boldsymbol{x},\boldsymbol{y}) = (\boldsymbol{x}-\boldsymbol{y})^T D_c^{-1} (\boldsymbol{x}-\boldsymbol{y})\]
  • where \(D_c\) is a diagonal matrix with \[\operatorname{diag}(D_c) = \boldsymbol{c} = \sum_i p_i \boldsymbol{a_i}\]
  • with \(\boldsymbol{a_i} = (p_{j|i})\) being the probability vector conditioned on row \(i\)
  • and the centroid \(\boldsymbol{c}\) being the weighted average of conditional row probabilities
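
The definitions above translate into a few lines of R. This is a sketch, assuming `HairEye` is the built-in `HairEyeColor` data summed over sex. `chisq_dist` is a hypothetical helper written for this example, and it implements the quadratic form exactly as written on the slide (no square root).

```r
# Assumption: the slides' table is the built-in HairEyeColor summed over Sex
HairEye <- margin.table(HairEyeColor, c(1, 2))
n <- sum(HairEye)

p_row <- rowSums(HairEye) / n              # p_i
A <- unclass(prop.table(HairEye, 1))       # row i holds the profile a_i = (p_{j|i})

# Centroid c = sum_i p_i a_i; p_row is recycled so row i is scaled by p_i
centroid <- colSums(p_row * A)

# Hypothetical helper: (x - y)^T D_c^{-1} (x - y) with diag(D_c) = w
chisq_dist <- function(x, y, w) sum((x - y)^2 / w)

# Distance from each row profile to the centroid
d_to_centroid <- apply(A, 1, chisq_dist, y = centroid, w = centroid)

# Distance between two specific row profiles
d_black_blond <- chisq_dist(A["Black", ], A["Blond", ], centroid)

# Weighted average of the distances to the centroid
inertia <- sum(p_row * d_to_centroid)
```

Two checks tie this back to the test of independence: the centroid equals the column marginals \(O_{+j}/O_{++}\), and the weighted average distance to the centroid equals \(\chi^2 / O_{++}\).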