Homework 3

Exercise 1

From our textbook (Ex. 2.8.17) (Stanford library link):
In a large city, four candidates (Smith, Jones, Martinelli, and Wagner) are running for Mayor. A poll was conducted by random dialing with the following results:

candidate = c("Smith","Jones","Martinelli","Wagner","Others")
votes = c(442,208,460,180,205)
t(data.frame(candidate=candidate,votes=votes))

##           [,1]    [,2]    [,3]         [,4]     [,5]    
## candidate "Smith" "Jones" "Martinelli" "Wagner" "Others"
## votes     "442"   "208"   "460"        "180"    "205"

Using a 95% confidence interval, determine if there is a significant difference between the two front runners.

Exercise 2

From our textbook (Ex. 2.8.18) (Stanford library link):
In Example 2.7.1 we tested whether or not a dataset was drawn from a binomial distribution. For this exercise, generate a sample of size \(n = 500\) from a truncated Poisson distribution as illustrated with the following R code:

x <- rpois(500,3)
x[x >= 8] = 7

Obtain a plot of the histogram of the sample.
Obtain an estimate of the sample proportion (phat<-mean(x/7)).
Test to see if the sample has a binomial distribution with \(n = 7\), (i.e., use the same test as in Example 2.7.1).

Exercise 3

From our textbook (Ex. 2.8.21) (Stanford library link):
Even though the \(\chi^2\)-tests of homogeneity and independence are the same, they are based on different sampling schemes. The scheme for the test of independence is one-sample of bivariate data, while the scheme for the test of homogeneity consists of one-sample from each population. Let \(C\) be a contingency table with \(r\) rows and \(c\) columns. Assume for the test of homogeneity that the rows contain the samples from the \(r\) populations. Determine the (large sample) confidence intervals for each of the following parameters under both schemes, where \(p_{ij}\) is the probability of cell \((i,j)\) occurring. Write R code to obtain these confidence intervals assuming the input is a contingency table.

\(p_{11}\)
\(p_{11}−p_{12}\)

Exercise 4

Correspondence analysis is useful for text analysis. For instance, to compare texts from different authors in terms of common word frequency. The idea: Make a contingency table where rows are books from different authors and columns are words.

Perform this type of analysis on the following three books form the languageR package. Use association plots, mosaic plots, and correspondence analysis, to determine if there are any words that contribute heavily to inhomogeneity among books. Hint: Use package vcd and ca. Limit your analysis to words that occur frequently in both books.

library(languageR)
data(alice,moby,oz)
alice[1:20]

##  [1] "ALICE"       "S"           "ADVENTURES"  "IN"          "WONDERLAND" 
##  [6] "Lewis"       "Carroll"     "THE"         "MILLENNIUM"  "FULCRUM"    
## [11] "EDITION"     "3"           "0"           "CHAPTER"     "I"          
## [16] "Down"        "the"         "Rabbit-Hole" "Alice"       "was"

moby[1:20]

##  [1] "MOBY"        "DICK"        "OR"          "THE"         "WHALE"      
##  [6] "by"          "Herman"      "Melville"    "ETYMOLOGY"   "Supplied"   
## [11] "by"          "a"           "Late"        "Consumptive" "Usher"      
## [16] "to"          "a"           "Grammar"     "School"      "The"

oz[1:20]

##  [1] "THE"       "WONDERFUL" "WIZARD"    "OF"        "OZ"       
##  [6] "1"         "The"       "Cyclone"   "Dorothy"   "lived"    
## [11] "in"        "the"       "midst"     "of"        "the"      
## [16] "great"     "Kansas"    "prairies"  "with"      "Uncle"

Deadline

Please email your solutions in the form of a Rmd file to Nan Bi and Lexi Guan.

This homework is due on Monday, April 25th at 1:30 pm.

You are encouraged to work through this homework with your peers. But write your own code and explain the results in your own words.