Proportion Problems and Chi-Squared Tests

Stanford University, Spring 2016, STATS 205

Previous Lectures

One-sample tests:
- Sign test
- Signed-Rank Wilcoxon
Estimators, confidence intervals, and robustness to outliers
Bootstrap
- Error of estimators and significance for hypothesis testing
- Complete enumerations
- Tail probability
Observations from continuous distributions

Today

Observations from discrete distributions
Proportion problems
\(\chi^2\) Tests

Proportion Problems

Discrete variables
The random variable \(X\) takes values in a set of categories
For now, we focus on two categories: failure (0) and success (1)
\(X\) is a random variable distributed according to the Bernoulli distribution
- with success probability \(p\) and
- failure probability \(1-p\)
We know that \(\operatorname{E}(X) = p\) and \(\operatorname{Var}(X) = p(1-p)\)

Proportion Problems

Statistical problems can be
- estimating \(p\)
- forming confidence intervals around estimate \(\widehat{p}\)
- and testing hypothesis \[H_0: p = p_0 \text{ versus } H_A: p \ne p_0\]
Let \(X_1,\dots,X_n\) be iid Bernoulli with success probability \(p\), and let \(S\) be the total number of successes
Then \(S\) follows a binomial distribution with \(n\) trials and success probability \(p\)
If \(p\) is unknown, we estimate \(p\) with \(\widehat{p} = \frac{S}{n}\)

Proportion Problems

n = 10; p = 1/2; nsim = 10000; obsv = rbinom(nsim, size = n, prob = p)

Proportion Problems

n = 100; p = 1/2; nsim = 10000; obsv = rbinom(nsim, size = n, prob = p)

Proportion Problems

As \(n \to \infty\) while \(p\) is fixed:
- The de Moivre–Laplace theorem (a special case of the central limit theorem) says that the distribution of \(S\), suitably standardized, approaches a normal distribution
- Easy confidence intervals (just use quantiles of the normal)
- Approximation of the binomial distribution with \(n\) trials and success probability \(p\) by \(N(np,np(1-p))\)
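
To see how close the approximation is, the sketch below compares the exact binomial cdf with the normal cdf for \(n = 100\) and \(p = 1/2\). The half-unit continuity correction is an addition that the slide does not mention, and the variable names are illustrative.

```r
n = 100; p = 1/2
k = 0:n
exact = pbinom(k, size = n, prob = p)     # exact P(S <= k)
# Normal approximation N(np, np(1-p)); the +0.5 is a continuity correction
approx = pnorm(k + 0.5, mean = n*p, sd = sqrt(n*p*(1-p)))
max_err = max(abs(exact - approx)); max_err
```

The largest discrepancy over all \(k\) gives a single number summarizing the quality of the approximation; rerunning with smaller \(n\) or \(p\) far from 1/2 shows where it degrades.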

Example: Squeaky Hip Replacements

143 subjects with ceramic hip replacements

Ten report that their hip replacements squeaked

phat = 10/143
phat

## [1] 0.06993007

zcv = qnorm(0.975)
phat+c(-1,1)*zcv*sqrt(phat*(1-phat)/143)

## [1] 0.02813069 0.11172945

We estimate that roughly 3% to 11% of patients who receive ceramic hip replacements will report squeaking
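
The same computation can be wrapped in a small function so that it can be reused for other counts. This is only a sketch: `wald_ci` is a hypothetical helper name, not a function from the lecture or from base R.

```r
# Hypothetical helper: normal-approximation interval for a proportion
# from s successes in n trials
wald_ci = function(s, n, level = 0.95) {
  phat = s/n
  zcv = qnorm(1 - (1 - level)/2)     # two-sided normal critical value
  phat + c(-1, 1)*zcv*sqrt(phat*(1 - phat)/n)
}
ci = wald_ci(10, 143); ci
```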

Hypothesis Testing

Reject the null hypothesis if \(|z|\) is large \[z = \frac{\widehat{p}-p_0}{\sqrt{p_0(1-p_0)/n}} \sim N(0,1)\]
Under \(H_0\), \(z\) is asymptotically standard normal
An equivalent test statistic to \(|z|\) is \(z^2\)
The square of a standard normal follows a \(\chi^2\) distribution with 1 degree of freedom
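
The equivalence can be checked numerically: the two-sided normal \(p\)-value based on \(|z|\) and the \(\chi^2\) \(p\)-value based on \(z^2\) should agree. A minimal sketch, using a few illustrative \(z\) values that are not taken from the lecture:

```r
z = c(0.5, 1.96, 2.5)                  # illustrative values only
p_normal = 2*(1 - pnorm(abs(z)))       # two-sided p-value from N(0,1)
p_chisq  = 1 - pchisq(z^2, df = 1)     # p-value from chi-squared, 1 df
cbind(z, p_normal, p_chisq)
```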

Example: Left-Handed Professional Ball Players

Theory:
- The proportion of left-handed players among professional baseball players differs from the proportion of left-handed people in the general population
- From a previous study, we know that the proportion in the general public is \(p_0 = 0.15\)
Hypothesis testing:
- \(H_0: p = 0.15 \text{ versus } H_A: p \ne 0.15\)

Example: Left-Handed Professional Ball Players

library(Rfit)
head(baseball)

##   height weight bat throw field average
## 1     74    218   R     L     0   3.330
## 2     75    185   R     R     1   0.286
## 3     77    219   L     L     0   3.040
## 4     73    185   R     R     1   0.271
## 5     69    160   S     R     1   0.242
## 6     73    222   R     R     0   3.920

Example: Left-Handed Professional Ball Players

ind = with(baseball,throw=='L')
n = length(ind)
phat = sum(ind)/n
phat

## [1] 0.2542373

p0 = 0.15
z = (phat-p0)/(sqrt(p0*(1-p0)/n))
pvalue = 1-pchisq(z^2,df=1)
pvalue

## [1] 0.02494189

What is Nonparametric Statistics Again?

In all three nonparametric tests (sign, signed-rank, \(\chi^2\)), no assumption is made on the variances of the observations
In contrast, in the \(t\)-test the variances are estimated

Why Not Use Finite Sample Binomial Test?

Since we know that \(S\) follows a binomial distribution, why don't we use it?
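
The output that follows is consistent with a call to base R's `binom.test`. A minimal sketch, plugging in the counts that the output itself reports (15 successes in 59 trials) so that it runs without the data set; the name `bt` is illustrative:

```r
# Exact two-sided binomial test of H0: p = 0.15
bt = binom.test(15, 59, p = 0.15)
bt$p.value
```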

## 
##  Exact binomial test
## 
## data:  sum(ind) and n
## number of successes = 15, number of trials = 59, p-value = 0.04192
## alternative hypothesis: true probability of success is not equal to 0.15
## 95 percent confidence interval:
##  0.1498208 0.3844241
## sample estimates:
## probability of success 
##              0.2542373

Finite-sample (exact) tests have a true significance level bounded above by the nominal \(\alpha\)
For tests based on asymptotic \(p\)-values, the true level may be above \(\alpha\)

Why Not Use Finite Sample Binomial Test?

Problem is due to discreteness

Example: \(n = 5\) and test \(H_0: p = 0.5\) versus \(H_A: p \ne 0.5\)

Null distribution of \(S\) is binomial with \(n = 5\) and \(p = 0.5\)

Suppose outcome is \(S = 5\) (most extreme observation)

n = 5; S = 5; p = 0.5
phat = S/n
pvalue = 2*p^S; pvalue

## [1] 0.0625

Problem is that the null hypothesis can never be rejected at \(\alpha = 0.05\): even the most extreme outcome gives a \(p\)-value of 0.0625

So in this case, \(\alpha\) has no meaning
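
To make the discreteness concrete, one can list the exact two-sided \(p\)-value for every possible outcome \(S = 0,\dots,5\). The sketch below uses base R's `binom.test`; the name `pvals` is illustrative.

```r
n = 5; p = 0.5
# Exact two-sided p-value for each possible outcome S = 0,...,n
pvals = sapply(0:n, function(s) binom.test(s, n, p = p)$p.value)
rbind(S = 0:n, pvalue = pvals)
min(pvals)    # smallest attainable p-value
```

Only a handful of distinct \(p\)-values are attainable, and none of them falls below 0.05.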

Discrete Random Variable (RV)

Extension from two categories to multiple categories
Consider a discrete RV \(X\) with categories \(1,2,\dots,c\)
Let \(p(j) = P(X = j)\) define the probability mass function
We wish to test: \[H_0: p(j) = p_0(j), j = 1,\dots,c\] \[H_A: p(j) \ne p_0(j), \text{ for some } j\]

Discrete RV

\(X_1,\dots,X_n\) is a random sample on \(X\)
Let \(O_j = \#\{ X_i = j \}\)
Observed frequencies are constrained \(\sum_{j=1}^c O_j = n\)
The expected frequency for category \(j\) is \(E_j = \operatorname{E}_{H_0}(O_j)\)
Two cases for \(H_0\)

Discrete RV

Case 1:
- All \(p_0(j)\) are specified
- So we get \(E_j = np_0(j)\)
- Test statistic is \[\chi^2 = \sum_{j=1}^c \frac{(O_j-E_j)^2}{E_j}\]
Hypothesis \(H_0\) is rejected in favor of \(H_A\) for large values of \(\chi^2\)
The observed frequencies \((O_1,\dots,O_c)^T\) have a multinomial distribution, so the exact distribution of the test statistic can be obtained
Asymptotically, the test statistic has a \(\chi^2\) distribution with \(c-1\) degrees of freedom
If we know \(c-1\) frequencies, we can calculate the \(c\)th from total \(n\)

Discrete RV Example

Roll a die \(n = 370\) times

Observe frequencies

O = c(58,55,62,68,66,61)
n = sum(O); n

## [1] 370

Test whether the die is fair, \(p(j) \equiv 1/6\)

p0 = 1/6
E = rep(n*p0,6)
Chi2_0 = sum((O-E)^2/E); Chi2_0

## [1] 1.902703

Discrete RV Example

The test statistic is asymptotically \(\chi^2\) distributed with \(c-1\) degrees of freedom

pvalue = 1-pchisq(Chi2_0,df=6-1); pvalue

## [1] 0.8624375

Thus there is no evidence that the die is unfair
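
As a cross-check, base R's `chisq.test` performs the same goodness-of-fit test when given the observed counts and the null probabilities. A minimal sketch; the name `fit` is illustrative:

```r
O = c(58,55,62,68,66,61)
# Goodness-of-fit test against equal cell probabilities 1/6
fit = chisq.test(O, p = rep(1/6, 6))
fit$statistic; fit$parameter; fit$p.value
```

The statistic, degrees of freedom, and \(p\)-value should match the hand computation.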

Discrete RV

Case 2:
- Only the form of the pmf is known
- Have to estimate the parameter \(p\)
- Same test statistic, but now with \(E_j\) computed from the estimate \(\widehat{p}\) \[\chi^2 = \sum_{j=1}^c \frac{(O_j-E_j)^2}{E_j}\]
Hypothesis \(H_0\) is rejected in favor of \(H_A\) for large values of \(\chi^2\)

Discrete RV Example

Number of males among the first seven children of \(n = 1334\) Swedish ministers of religion

males = 0:7
ministers = c(6,57,206,362,365,256,69,13)
n = sum(ministers); n

## [1] 1334

df = data.frame(ministers=ministers,males=males); t(df)

##           [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## ministers    6   57  206  362  365  256   69   13
## males        0    1    2    3    4    5    6    7

For example, 206 ministers had 2 sons and 5 daughters in their first 7 children

Discrete RV Example

The maximum likelihood estimator of \(p\) is

nChildren = n*7
nMale = sum(df$ministers*df$males)
phat = nMale/nChildren; phat

## [1] 0.5140287

p0 = dbinom(males,7,phat)
E = n*p0

##           [,1] [,2]  [,3]  [,4]  [,5]  [,6] [,7] [,8]
## E          8.5 63.2 200.6 353.7 374.1 237.4 83.7 12.6
## ministers  6.0 57.0 206.0 362.0 365.0 256.0 69.0 13.0
## males      0.0  1.0   2.0   3.0   4.0   5.0  6.0  7.0

Discrete RV Example

Chi2_0 = sum((df$ministers-E)^2/E)
pvalue = 1-pchisq(Chi2_0,df=8-1-1); pvalue

## [1] 0.4257546

No evidence to refute a binomial probability model for the number of sons in the first seven children of a Swedish minister
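
The example removes one extra degree of freedom (`df=8-1-1`) because \(p\) was estimated from the data. Base R's `chisq.test` does not know that the cell probabilities were fitted, so when given `p = p0` it reports \(c-1 = 7\) degrees of freedom. The sketch below shows the difference; the names `p_naive` and `p_adj` are illustrative.

```r
males = 0:7
ministers = c(6,57,206,362,365,256,69,13)
n = sum(ministers)
phat = sum(ministers*males)/(7*n)      # estimate of p, as in the example
p0 = dbinom(males, 7, phat)            # fitted cell probabilities
fit = chisq.test(ministers, p = p0)    # reports df = 8 - 1 = 7
p_naive = fit$p.value
# Same statistic, one more df removed for the estimated parameter
p_adj = 1 - pchisq(unname(fit$statistic), df = 8 - 1 - 1)
c(naive = p_naive, adjusted = p_adj)
```

The statistic is identical in both cases; only the reference distribution changes.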

Discrete RV

As an alternative to testing for deviation from a model,
we can get a confidence interval for a pairwise difference in proportions \(p_j - p_k\), estimated by \(\widehat{p}_j - \widehat{p}_k\)
The confidence intervals are again easy because we assume asymptotic normality
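
The interval relies on the variance of a difference of two multinomial proportions. A short derivation, using the standard multinomial facts \(\operatorname{Var}(\widehat{p}_j) = p_j(1-p_j)/n\) and \(\operatorname{Cov}(\widehat{p}_j,\widehat{p}_k) = -p_jp_k/n\), which are stated here as known results rather than taken from the slides:

```latex
\begin{align*}
\operatorname{Var}(\widehat{p}_j - \widehat{p}_k)
  &= \operatorname{Var}(\widehat{p}_j) + \operatorname{Var}(\widehat{p}_k)
     - 2\operatorname{Cov}(\widehat{p}_j,\widehat{p}_k) \\
  &= \frac{p_j(1-p_j) + p_k(1-p_k) + 2p_jp_k}{n} \\
  &= \frac{p_j + p_k - (p_j - p_k)^2}{n}
\end{align*}
```

Replacing \(p_j\) and \(p_k\) by their estimates and taking the square root gives the standard error, and the interval is \(\widehat{p}_j - \widehat{p}_k \pm z_{\alpha/2}\,\widehat{se}\).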

Discrete RV Example

Difference between the probabilities of all daughters and all sons

##           [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## ministers    6   57  206  362  365  256   69   13
## males        0    1    2    3    4    5    6    7

6 ministers had no sons, and 13 ministers had all sons

n = 1334; p0 = 6/n; p7 = 13/n
se = sqrt((p0+p7-(p0-p7)^2)/n)
zcv = qnorm(0.975)
lb = p0-p7 - zcv*se; ub = p0-p7 + zcv*se; res = c(p0-p7,lb,ub); res

## [1] -0.005247376 -0.011645444  0.001150692

Confidence interval covers 0, thus no significant difference in the proportions

Several Discrete RVs

Goal is to compare several discrete RVs, which have the same range \(\{ 1,2,\dots,c \}\)
Consider hypothesis test:
- \(H_0:\) \(X_1,\dots,X_r\) have the same distribution
- \(H_A:\) Distributions of \(X_i\) and \(X_j\) differ for some \(i \ne j\)
Total number of samples \(n = \sum_{i=1}^r n_i\)
Observed frequencies: \[O_{ij} = \#\{ \text{items in the sample drawn on } X_i \text{ that fall in category } j\},\]
for \(i = 1,\dots,r\) and \(j = 1,\dots,c\)
\((O_{ij})\) is an \(r \times c\) matrix of observed frequencies
Such a matrix is called a contingency table

Several Discrete RVs

Compare observed frequencies to the expected frequencies under \(H_0\)
Estimate the common distribution \((p_1,\dots,p_c)^T\), where \(p_j\) is the probability that category \(j\) occurs
Estimate probability of category \(j\) overall \[ \widehat{p}_j = \frac{\sum_{i=1}^r O_{ij}}{n}, j = 1,\dots,c\]
Estimate expected frequencies \(\widehat{E}_{ij} = n_i \widehat{p}_j\)
Notice that the sample sizes \(n_i\) can vary between variables

Several Discrete RVs

Test statistic \[\chi^2 = \sum_{i=1}^r \sum_{j=1}^c \frac{(O_{ij}-\widehat{E}_{ij})^2}{\widehat{E}_{ij}}\]
Asymptotically \(\chi^2\) with \((r-1)(c-1)\) degrees of freedom
This is called the test for homogeneity
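
The pooled estimates, expected frequencies, and test statistic can be computed directly from a matrix of counts. The sketch below is one way to do it: `homogeneity_chisq` is a hypothetical helper name, and the small table is made up purely for illustration.

```r
# Hypothetical helper: Pearson statistic for an r x c table of counts
homogeneity_chisq = function(O) {
  n = sum(O)
  phat = colSums(O)/n                  # pooled category estimates p_j
  E = outer(rowSums(O), phat)          # E_ij = n_i * p_j
  stat = sum((O - E)^2/E)
  dof = (nrow(O) - 1)*(ncol(O) - 1)
  c(statistic = stat, df = dof, pvalue = 1 - pchisq(stat, df = dof))
}
# Made-up counts for illustration only (2 samples, 3 categories)
O = matrix(c(20, 30, 50,
             35, 25, 40), nrow = 2, byrow = TRUE)
res = homogeneity_chisq(O); res
```

For tables larger than \(2 \times 2\), base R's `chisq.test(O)` should return the same statistic and degrees of freedom.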

Several Discrete RVs Example

Is the distribution of alcoholic status the same for different types of crime?

Contingency table with frequencies of criminals, classified by
the type of crime committed (6 RVs) and
their alcoholic status (categories: alcoholic and non-alcoholic)

##          Alcoholic Non-Alcoholic
## Arson           50            43
## Rape            88            62
## Violence       155           110
## Theft          379           300
## Coining         18            14
## Fraud           63           144
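
The table can be entered as a matrix and passed to base R's `chisq.test`. The object names `ct` and `chifit` are taken from the output and code on the next slides; how they were originally built is an assumption, so treat this as a sketch.

```r
ct = matrix(c(50, 43,
              88, 62,
              155, 110,
              379, 300,
              18, 14,
              63, 144), ncol = 2, byrow = TRUE,
            dimnames = list(c("Arson","Rape","Violence","Theft","Coining","Fraud"),
                            c("Alcoholic","Non-Alcoholic")))
chifit = chisq.test(ct)   # Pearson test of homogeneity across the six crimes
chifit
```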

Several Discrete RVs Example

## 
##  Pearson's Chi-squared test
## 
## data:  ct
## X-squared = 49.731, df = 5, p-value = 1.573e-09

Chi2_0 = (chifit$observed-chifit$expected)^2/chifit$expected; Chi2_0

##            Alcoholic Non-Alcoholic
## Arson     0.01617684    0.01809979
## Rape      0.97600214    1.09202023
## Violence  1.62222220    1.81505693
## Theft     1.16680759    1.30550686
## Coining   0.07191850    0.08046750
## Fraud    19.61720859   21.94912045

Several Discrete RVs Example

Most of the contribution to the test statistic comes from the crime of fraud

Eliminate fraud and retest

## 
##  Pearson's Chi-squared test
## 
## data:  ct[-6, ]
## X-squared = 1.1219, df = 4, p-value = 0.8908

Conclusion:
Conditional on the criminal not committing fraud,
we cannot reject that alcoholic status has the same distribution for all crimes