Stanford University, Spring 2016, STATS 205

Test of Independence

  • Observations are two measurements on the same subject (e.g. eye color and hair color)
  • Random samples \((X_1,Y_1),\dots,(X_n,Y_n)\)
  • \(X\) and \(Y\) with different ranges \(\{1,2,\dots,I\}\) and \(\{1,2,\dots,J\}\)
  • Consider the hypothesis test: \[ \begin{align} H_0: P(X=i,Y=j) & = P(X=i)P(Y=j) \text{ for all } i \text{ and } j \\ H_A: P(X=i,Y=j) & \ne P(X=i)P(Y=j) \text{ for some } i \text{ and } j \end{align} \]
  • Construct contingency table of \(n\) observations \[O_{ij} = \#_{1 \le l \le n} \{ (X_l,Y_l) = (i,j) \}\]
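
As a sketch of how such a table is built in R (X, Y, and the sizes are made up for illustration):

X = sample(1:3, 100, replace = TRUE)  # hypothetical sample with I = 3 categories
Y = sample(1:4, 100, replace = TRUE)  # hypothetical sample with J = 4 categories
O = table(X, Y)                       # O_ij counts pairs with (X_l, Y_l) = (i, j)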

Test of Independence

Some definitions:

\[ \begin{align} \text{Row sum: } & O_{i+} = \sum_j O_{ij} \\ \text{Column sum: } & O_{+j} = \sum_i O_{ij} \\ \text{Total sum: } & O_{++} = \sum_i \sum_j O_{ij} \\ \text{Row vector } i \text{: } & \boldsymbol{O}_i \\ \text{Column vector } j \text{: } & \boldsymbol{O}_j \end{align} \]

Test of Independence

Expected frequencies under the independence assumption are the product of the estimated marginals \[\widehat{E}_{ij} = O_{++} \frac{O_{i+}}{O_{++}} \frac{O_{+j}}{O_{++}} = \frac{O_{i+} O_{+j}}{O_{++}}\]

##        Eye
## Hair    Brown Blue Hazel Green
##   Black    68   20    15     5
##   Brown   119   84    54    29
##   Red      26   17    14    14
##   Blond     7   94    10    16

Test statistic \(\chi^2 = \sum_i \sum_j \frac{(O_{ij}-\widehat{E}_{ij})^2}{\widehat{E}_{ij}}\)

Under \(H_0\), it is asymptotically \(\chi^2\)-distributed with \((I-1)(J-1)\) degrees of freedom
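
A minimal numeric check in R (using the HairEye table introduced on a later slide):

E = outer(rowSums(HairEye), colSums(HairEye)) / sum(HairEye)  # E_ij = O_i+ O_+j / O_++
sum((HairEye - E)^2 / E)                                      # Pearson chi-square statistic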

Test of Independence

  • This is a goodness-of-fit test
  • Quantifies how well the independence model fits observations
  • Measures discrepancy between observed values and expected values under the model
  • Rather than just rejecting this global null hypothesis, can we find what is driving the statistic?
  • Can we visualize this?
  • Two ways:
    • Visualization of contingency table with association and mosaic plots
    • Visualization of deviation from model with correspondence analysis

Hair and Eye Color of Statistics Students

Survey of students at the University of Delaware reported by Snee (1974)

HairEye
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    68   20    15     5
##   Brown   119   84    54    29
##   Red      26   17    14    14
##   Blond     7   94    10    16

Rows: hair color of 592 statistics students
Columns: eye color of the same 592 students
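
The table appears to be R's built-in HairEyeColor dataset (the Snee data) summed over sex; one way to construct it:

HairEye = margin.table(HairEyeColor, c(1, 2))  # collapse the 4 x 4 x 2 array over Sex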

Hair and Eye Color of Statistics Students

chisq.test(HairEye)
## 
##  Pearson's Chi-squared test
## 
## data:  HairEye
## X-squared = 138.29, df = 9, p-value < 2.2e-16

Association Plots

  • This is a follow-up to \(\chi^2\) tests when the null hypothesis is rejected
  • Pearson residuals \[r_{ij} = \frac{O_{ij}-E_{ij}}{\sqrt{E_{ij}}}\]
  • The \(\chi^2\) statistic is obtained by squaring and summing the residuals over all cells \[\chi^2 = \sum_i \sum_j r_{ij}^2\]

Association Plots

  • Each cell of contingency table is represented by a rectangle encoding information in width, height, location, and color of the rectangle
  • Height: Proportional to Pearson residual \(\frac{O_{ij}-E_{ij}}{\sqrt{E_{ij}}}\)
  • Width: Proportional to \(\sqrt{E_{ij}}\)
  • Area: Proportional to \(O_{ij}-E_{ij}\)
  • Baseline:
    • If \(O_{ij} > E_{ij}\), the rectangle sits above the baseline
    • otherwise it sits below
  • Color: Standardized Pearson residuals that are asymptotically standard normal

Association Plots
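
A minimal way to draw this plot in R, assuming the vcd package is installed (base R's assocplot(HairEye) gives an unshaded variant):

library(vcd)
assoc(HairEye, shade = TRUE)  # heights/widths encode residuals, color encodes standardized residuals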

Mosaic Plots

  • Cell frequency \(O_{ij}\), cell probability \(p_{ij} = O_{ij} / O_{++}\)
  • Take unit square
  • Divide unit square vertically into rectangles
    • Height proportional to observed marginal frequencies \(O_{i+}\)
    • which is proportional to marginal probabilities \(p_i = O_{i+} / O_{++}\)
  • Subdivide resulting rectangles horizontally
    • Width proportional to \(p_{j|i} = O_{ij}/O_{i+}\)
    • which is the conditional probabilities of the second variable given the first
  • Area is proportional to the observed cell frequency and probability \(p_{ij} = p_{i} \times p_{j|i} = ( O_{i+}/O_{++} ) \times ( O_{ij} / O_{i+} )\)

Mosaic Plots

  • The order of conditioning matters
  • Color: Standardized Pearson residuals that are asymptotically standard normal
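
In base R, mosaicplot draws this construction; a minimal sketch (shade = TRUE colors tiles by standardized Pearson residuals):

mosaicplot(HairEye, shade = TRUE)     # conditions on hair color first
mosaicplot(t(HairEye), shade = TRUE)  # reversed conditioning gives a different plot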

Mosaic Plots

Marginal probabilities of the first variable

frequency = rowSums(HairEye)
proportions = frequency/sum(frequency)
data.frame(frequency=round(frequency,0),proportions=round(proportions,2))
##       frequency proportions
## Black       108        0.18
## Brown       286        0.48
## Red          71        0.12
## Blond       127        0.21

Mosaic Plots

Conditional probabilities of the second variable given the first

HairEye/rowSums(HairEye)
##        Eye
## Hair         Brown       Blue      Hazel      Green
##   Black 0.62962963 0.18518519 0.13888889 0.04629630
##   Brown 0.41608392 0.29370629 0.18881119 0.10139860
##   Red   0.36619718 0.23943662 0.19718310 0.19718310
##   Blond 0.05511811 0.74015748 0.07874016 0.12598425

Mosaic Plots

Mosaic Plots

If hair color and eye color were independent, \(p_{ij} = p_i \times p_j\), then the tiles in each row would all align
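
As a quick illustration of this alignment, one can draw the mosaic of the table expected under independence (a hypothetical check, not from the slides):

E = outer(rowSums(HairEye), colSums(HairEye)) / sum(HairEye)
mosaicplot(E, main = "Expected under independence")  # column splits identical in every row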

Correspondence Analysis

  • Exploratory data analysis for non-negative data matrices
  • Converts matrix into a plot where rows and columns are depicted as points
  • Its algebraic form first appeared in 1935
  • Actively developed in France beginning in 1965
  • We focus on the application of correspondence analysis for contingency tables
  • Analogous to principal components analysis, but appropriate to categorical rather than to continuous variables

Correspondence Analysis

  • Define a distance between rows
  • Use the distance \(d_{\chi^2}(\boldsymbol{x},\boldsymbol{y})\) between two vectors \(\boldsymbol{x}\) and \(\boldsymbol{y}\), defined as \[d_{\chi^2}(\boldsymbol{x},\boldsymbol{y}) = (\boldsymbol{x}-\boldsymbol{y})^T D_c^{-1} (\boldsymbol{x}-\boldsymbol{y})\]
  • where \(D_c\) is a diagonal matrix with \[\operatorname{diag}(D_c) = \boldsymbol{c} = \sum_i p_i \boldsymbol{a_i}\]
  • with \(\boldsymbol{a_i} = (p_{j|i})\) being the probability vector conditioned on row \(i\)
  • and the centroid \(\boldsymbol{c}\) being the weighted average of conditional row probabilities
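
These quantities are straightforward to form in R; a sketch for the HairEye table (variable names are mine):

P    = HairEye / sum(HairEye)              # cell probabilities p_ij
mass = rowSums(P)                          # row weights p_i
prof = P / rowSums(P)                      # row profiles a_i = (p_{j|i})
cent = colSums(P)                          # centroid c = sum_i p_i a_i
d2 = function(x, y) sum((x - y)^2 / cent)  # chi-square distance
d2(prof["Black", ], prof["Blond", ])       # distance between two row profiles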

Greenacre and Hastie (1987)
The Geometric Interpretation of Correspondence Analysis

Correspondence Analysis

  • Compare \(d_{\chi^2}(\boldsymbol{x},\boldsymbol{y})\) metric \[d_{\chi^2}(\boldsymbol{x},\boldsymbol{y}) = (\boldsymbol{x}-\boldsymbol{y})^T D_c^{-1} (\boldsymbol{x}-\boldsymbol{y})\]
  • to \(\chi^2\) statistic \[\chi^2 = O_{++} \sum_i p_i (\boldsymbol{a_i}-\boldsymbol{c})^T D_c^{-1} (\boldsymbol{a_i}-\boldsymbol{c})\]
  • \(\chi^2/O_{++}\) is the weighted average of the \(d_{\chi^2}(\boldsymbol{a_i},\boldsymbol{c})\) distances of the conditional row probabilities \(\boldsymbol{a_i}\) to their centroid \(\boldsymbol{c}\)
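
Continuing the sketch above, the identity can be checked numerically:

sum(mass * apply(prof, 1, function(a) sum((a - cent)^2 / cent)))  # weighted average of distances
unname(chisq.test(HairEye)$statistic) / sum(HairEye)              # chi-square / O_++, same value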

Correspondence Analysis

  • First "principle component" is best line fit trough cloud of conditional row probabilities
  • Modify standard Principal Component Analysis (PCA) to incorporate point weights and weighted metric
  • Use \(p_i\) as weights, which provides decomposition of \(\chi^2\) statistics into components
  • After translating points to the origin \(\boldsymbol{c}\), the best fit line is the principle eigenvector of matrix \[\sum_i p_i (\boldsymbol{a_i} - \boldsymbol{c}) (\boldsymbol{a_i} - \boldsymbol{c})^T D_c^{-1}\]
  • The trace of this matrix is equal to \(\chi^2/O_{++}\)

Greenacre and Hastie (1987)
The Geometric Interpretation of Correspondence Analysis

Correspondence Analysis

  • Repeat same for column marginal proportions
  • Can be done in one step by a Generalized Singular Value Decomposition (GSVD) \[D_r^{-1/2} \left( \left( p_{ij} \right) - \boldsymbol{r}\boldsymbol{c}^T \right) D_c^{-1/2} = X D_{\alpha} Y^T\] with constraints \(X^TX = Y^TY = \operatorname{Id}\)
  • Singular values are the square roots of the principal inertias \(D_{\alpha} = D_{\lambda}^{1/2}\)
  • Principal axes of the rows: \(D_c^{1/2}Y\)
  • Principal axes of the columns: \(D_r^{1/2}X\)
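
Continuing the sketch, the GSVD reduces to an ordinary SVD of the standardized residual matrix:

S  = diag(1/sqrt(mass)) %*% (as.matrix(P) - mass %o% cent) %*% diag(1/sqrt(cent))
sv = svd(S)
sv$d         # singular values, the square roots of the principal inertias
sum(sv$d^2)  # total inertia, equal to chi-square / O_++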

Correspondence Analysis

  • Row and column projections are usually displayed in the same plot
  • Row-to-column distances are meaningless
  • In standard PCA, principal components explain variance
  • In CA, principal components explain deviation from independence

Correspondence Analysis

Color code: Rows: Hair, Columns: Eye
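
A sketch to reproduce such a display with MASS::corresp (MASS ships with R):

library(MASS)
biplot(corresp(as.matrix(HairEye), nf = 2))  # rows (hair) and columns (eye) as points in one plot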

Mosaic Plots

Mosaic Plots

Test of Independence and Test of Homogeneity

  • The null hypothesis of row-column independence: \(p_{ij} = p_i \times p_j\)
  • is equivalent to the hypothesis of homogeneity of the rows: \[p_{j|i=1} = p_{j|i=2} = \dots = p_{j|i=I}\]
  • Each row of \(O\) follows a multinomial distribution
  • Under the homogeneity assumption, these distributions share a common probability vector
  • The maximum likelihood estimate of this common probability vector is the centroid \(\boldsymbol{c}\)
  • Thus a significant \(\chi^2\) indicates a significant deviation of the rows from their centroid

Test of Independence and Test of Homogeneity

  • Several discrete random variables with same categories \(1,\dots,I\)
    • Test of homogeneity
    • Are the observations generated from one multinomial distribution with common probability vector?
    • Possibly different sample size among random variables
    • E.g. the crime example: types of crimes (theft, fraud, etc.) committed by alcoholic and non-alcoholic criminals
  • Two discrete random variables with different categories \(1,\dots,I\) and \(1,\dots,J\)
    • Test of independence
    • Are the pairwise observations independent?
    • Paired random variables, so same sample size
    • E.g. hair color, eye color example
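
Both cases use the same statistic in R; a sketch for the crime setting with purely made-up counts (the numbers are illustrative, not real data):

crimes = rbind(alcoholic    = c(theft = 50, fraud = 95, violence = 63),
               nonalcoholic = c(theft = 43, fraud = 14, violence = 265))
chisq.test(crimes)  # mechanics identical to the test of independence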