Logistics and Introduction

Stanford University, Spring 2016, STATS 205

Stats 205

You can find everything on our course website: http://christofseiler.github.io/stats205/
Homework every 8 days (send solutions to the TAs)
The midterm and the final will each be a project
Instructor's office hours: Wednesdays 10:00 to 11:30 in Sequoia Hall 105
TAs' office hours: 2 x 1.5 hours (time and place will be announced)

Why?

Few assumptions about the underlying populations from which the data are obtained,
e.g. populations don't need to follow a normal distribution
Often easier to understand and apply than parametric tests
Slightly less efficient than parametric tests when the parametric assumptions hold, but
can be much more efficient when those assumptions fail
Can be used in many practical situations where theory is intractable
Bayesian methods are available so prior information can be incorporated
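
The efficiency claim can be checked with a small simulation. The sketch below is only an illustration, not a result from the course: the helper `power_sim`, the sample size, the shifts, and the choice of the Cauchy distribution as the "assumptions fail" case are all assumptions made here. It estimates how often the t test and the Wilcoxon rank sum test reject at the 5% level when a true location shift is present.

```r
set.seed(205)

# Hypothetical helper: estimate the rejection rate (power) of two tests.
# rdist draws the noise, shift is the true location difference.
power_sim = function(rdist, shift, n = 30, reps = 500, alpha = 0.05) {
  rejections = replicate(reps, {
    x = rdist(n)
    y = rdist(n) + shift
    c(t = t.test(x, y)$p.value < alpha,
      wilcoxon = wilcox.test(x, y)$p.value < alpha)
  })
  rowMeans(rejections)  # fraction of rejections for each test
}

power_normal = power_sim(rnorm, shift = 0.5)   # normal assumption holds
power_cauchy = power_sim(rcauchy, shift = 1)   # heavy tails, assumption fails
rbind(normal = power_normal, cauchy = power_cauchy)
```

Under normal noise the two rejection rates should be close; under Cauchy noise the rank-based test should reject far more often than the t test.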

Goals

Overview

Get an overview of classical, Bayesian, and modern methods
Learn how to implement methods yourself and use existing R packages
Be aware of and understand the underlying assumptions
Apply to modern data analysis problems that you care about

Goals

Specific learning goals

The students will learn to apply methods and explain the statistical assumptions of

Monte Carlo simulations for analytically intractable problems
rank-based methods for parameter estimation, confidence intervals, and hypothesis testing
permutation tests for hypothesis testing
the bootstrap for confidence intervals
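
To give a flavor of these methods, here is a minimal sketch of a two-sample permutation test for a difference in means. The two groups of scores are made up for illustration, and the number of shuffles (5000) is an arbitrary choice.

```r
set.seed(205)

# Hypothetical data, invented for this illustration only
treatment = c(85, 90, 75, 100, 70, 95)
control   = c(60, 72, 68, 80, 65, 74)

observed = mean(treatment) - mean(control)
pooled   = c(treatment, control)
n_treat  = length(treatment)

# Under the null hypothesis the group labels are exchangeable,
# so shuffle the labels and recompute the statistic each time.
perm_stats = replicate(5000, {
  idx = sample(length(pooled), n_treat)
  mean(pooled[idx]) - mean(pooled[-idx])
})

# Two-sided Monte Carlo p-value; the +1 counts the observed labelling itself
p_value = (sum(abs(perm_stats) >= abs(observed)) + 1) / (length(perm_stats) + 1)
p_value
```

The only assumption used is exchangeability of the labels under the null hypothesis; no normality is required.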

Textbooks

Our main textbook with lots of practical computations in R (Stanford library link):
Kloke and McKean (2015). Nonparametric Statistical Methods Using R
In-depth coverage of the bootstrap with lots of examples:
Efron and Tibshirani (1994). An Introduction to the Bootstrap

More Textbooks

A Bayesian view (Stanford library link):
Müller, Quintana, Jara, and Hanson (2015). Bayesian Nonparametric Data Analysis
Very comprehensive covering most of the material of the previous books (Stanford library link):
Hollander, Wolfe, and Chicken (2013). Nonparametric Statistical Methods
Classical textbook for rank-based methods with lots of mathematical details:
Lehmann (2006). Nonparametrics: Statistical Methods Based on Ranks

Grading: Homework and Projects

Weekly homework assignments (40%), mostly R exercises
Class participation (10%)
Project (50%)
Goal of project: Write a paper on
- applying nonparametric statistics to your field of interest or
- studying one particular theoretical aspect that you care about

Grading: Details on Projects

The project will be split in two parts:

Midterm project (3 pages with references) (10%):
- Project proposal and outline of planned tasks
Final project (12 pages plus references) (40%):
- A theoretical part: Explanation of the method studied and its properties
- A computational part: Preferably in R
- A data-analysis part: Plots and interpretations

Modern Statistics

To quote Andrew Gelman (source):

"If you wanted to do foundational research in statistics in the mid-twentieth century, you had to be a bit of a mathematician, ... if you want to do statistical research at the turn of the twenty-first century, you have to be a computer programmer."

History of Nonparametric Statistics

Check on Google Ngram Viewer.

History 1930's to 1970's

Beginning of nonparametric statistics (Hotelling and Pabst 1936)
Wilcoxon (1945) introduced the two-sample rank sum test for equal sample sizes, and Mann and Whitney (1947) generalized it
Pitman (1948), Hodges and Lehmann (1956), and Chernoff and Savage (1958) showed desirable efficiency properties
Jackknife, introduced by Quenouille (1949) as a bias-reduction technique and extended by Tukey (1958, 1962) to provide approximate significance tests and confidence intervals
Hodges and Lehmann (1963) showed how to derive estimators from rank tests and established that these estimators have desirable properties
Cox (1972) model and methods for survival analysis

History 1970’s to Now

Efron’s (1979) bootstrap makes use of increasing computational resources to provide standard errors and confidence intervals where it is difficult, if not impossible, to use a parametric approach

Some examples from special issue of Statistical Science (Randles, Hettmansperger, and Casella, 2004):

Robust analysis of linear models (McKean, 2004)
Density estimation (Sheather, 2004)
Data modeling via quantile methods (Parzen, 2004)
Kernel smoothers (Schucany, 2004)
Permutation-based inference (Ernst, 2004)
Multivariate signed rank tests in time series problems (Hallin and Paindaveine, 2004)
Generalizations for nonlinear manifolds (Patrangenaru and Ellingson 2015)

Bayesian History

Ferguson (1973) introduced nonparametric Bayesian methods
Susarla and van Ryzin (1976) derived nonparametric Bayesian estimators of survival curves
Dykstra and Laud (1981) developed a Bayesian nonparametric approach to reliability
Hjort (1990b) proposed nonparametric Bayesian estimators to model the cumulative hazard
In the late 1980s and the 1990s, there was a surge of activity in Bayesian methods due to the Markov chain Monte Carlo (MCMC) methods, e.g. Gelfand and Smith (1990), Gamerman (1991), West (1992), Smith and Roberts (1993), and Arjas and Gasbarra (1994)
Key algorithms for developing and implementing modern Bayesian methods include the Metropolis-Hastings-Green algorithm (see Metropolis et al. (1953), Hastings (1970), and Green (1995)) and the Tanner-Wong (1987) data augmentation algorithm

R and R Markdown Basics

For This Course

During this course we will write a lot of R code and run computer simulations. All my examples and homework will be written in R. Choosing R for your homework solutions is highly recommended.

What is R:

R is an interpreted programming language, which means you will not have to compile your code.
R is very interactive, which means that you can play around with vectors and matrices and plot results.
To keep track of your progress and to be able to reconstruct your analysis steps, we will use R Markdown.
R Markdown is a format for making web reports that you can share with your collaborators.

Vectors

Make a vector:

x = c(11,218,123,36,1001)
y = rep(1,5)
z = seq(1,5,by=1)
x + y

## [1]   12  219  124   37 1002

z + 10

## [1] 11 12 13 14 15
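
A few ways to pick elements out of a vector, as a minimal sketch. The variable names on the left are chosen here for illustration.

```r
x = c(11, 218, 123, 36, 1001)

second        = x[2]        # R indexes from 1, so this is 218
large         = x[x > 100]  # logical indexing keeps 218, 123, 1001
without_first = x[-1]       # a negative index drops that element
doubled       = x * 2       # arithmetic is applied element by element
```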

Operations on Vectors

Some operations:

sum(y)

## [1] 5

c(mean(z),sd(z))

## [1] 3.000000 1.581139

length(z)

## [1] 5
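
As a check on what `sd` computes, the sample standard deviation can be written out by hand with the same vector operations. This is a sketch of the usual formula with the n - 1 denominator; the names `n`, `z_bar`, and `s` are chosen here.

```r
z = seq(1, 5, by = 1)

n     = length(z)
z_bar = sum(z) / n                          # sample mean
s     = sqrt(sum((z - z_bar)^2) / (n - 1))  # sample standard deviation

c(z_bar, s)  # should agree with c(mean(z), sd(z))
```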

Coin Tossing

And most importantly there are a lot of functions for statistics.

Like tossing a coin 100 times:

set.seed(1)
coin = c('H','T')
samples = sample(coin,100,replace = TRUE)

Heads should come up about half of the time:

sum(samples == 'H')

## [1] 52
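
Two other ways to look at the same experiment, as a small sketch. Note that the default algorithm behind `sample` changed in R 3.6.0, so with a recent R version the count for `set.seed(1)` may differ from the one printed above.

```r
set.seed(1)
coin = c('H', 'T')
samples = sample(coin, 100, replace = TRUE)

counts     = table(samples)        # number of heads and tails
prop_heads = mean(samples == 'H')  # proportion of heads

# The number of heads in 100 fair tosses is a Binomial(100, 0.5) draw,
# so the whole experiment can also be simulated in one call.
heads_binom = rbinom(1, size = 100, prob = 0.5)
```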

Matrices

We can combine vectors of the same type into matrices.

X = cbind(x,y,z)
X

##         x y z
## [1,]   11 1 1
## [2,]  218 1 2
## [3,]  123 1 3
## [4,]   36 1 4
## [5,] 1001 1 5
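
A short sketch of how to inspect and index a matrix, and of what happens when the vectors are not of the same type. The character vector `letters[1:5]` is only used here to force the coercion.

```r
x = c(11, 218, 123, 36, 1001)
y = rep(1, 5)
z = seq(1, 5, by = 1)
X = cbind(x, y, z)

dims       = dim(X)       # number of rows, number of columns
second_row = X[2, ]       # one row, as a named vector
x_column   = X[, "x"]     # one column, selected by name
gram       = t(X) %*% X   # matrix product, here a 3 x 3 matrix

# A matrix holds a single type: mixing numbers and strings
# silently converts everything to character.
M = cbind(x, letters[1:5])
is.character(M)
```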

Data Frames

Data frames are used to combine variables of different types in one object.

subjects = c('Jim','Jack','Joe','Mary','Jean')
sex = c('M','M','M','F','F')
score = c(85,90,75,100,70)
D = data.frame(subjects,sex,score)
D

##   subjects sex score
## 1      Jim   M    85
## 2     Jack   M    90
## 3      Joe   M    75
## 4     Mary   F   100
## 5     Jean   F    70
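
Some common operations on a data frame, sketched on the same data. The new column `passed` and the threshold of 80 are made up for illustration.

```r
subjects = c('Jim', 'Jack', 'Joe', 'Mary', 'Jean')
sex = c('M', 'M', 'M', 'F', 'F')
score = c(85, 90, 75, 100, 70)
D = data.frame(subjects, sex, score)

females = D[D$sex == 'F', ]                       # keep rows by condition
ranked  = D[order(D$score, decreasing = TRUE), ]  # sort rows by score
D$passed = D$score >= 80                          # add a new (logical) column

str(D)  # compact overview of column types
```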

Generating Random Data

R provides a wide variety of distributions that we can sample from, including all the ones we know from intro stats courses. For instance, the normal distribution:

z = rnorm(1000)
mean(z)

## [1] -0.01782957

sd(z)

## [1] 1.040367
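
Other distributions follow the same naming pattern (`r` followed by the distribution name). The sketch below draws from a few of them; the parameter values are arbitrary. It also shows that fixing the seed makes the draws reproducible.

```r
set.seed(205)
u  = runif(1000)                        # Uniform(0, 1)
e  = rexp(1000, rate = 2)               # Exponential with mean 1/2
t3 = rt(1000, df = 3)                   # Student t with 3 degrees of freedom
b  = rbinom(1000, size = 10, prob = 0.3)  # Binomial(10, 0.3)

c(mean(u), mean(e), mean(b))  # roughly 0.5, 0.5, and 3

# The same seed gives the same random numbers
set.seed(1); a1 = rnorm(5)
set.seed(1); a2 = rnorm(5)
identical(a1, a2)
```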

Basic Plotting

Plot the sample from the last slide:

hist(z,breaks=30)

Fancy Plotting

The ggplot2 package is very popular for making more sophisticated plots.

library(ggplot2)

You will have to learn the grammar of ggplot. There are many tutorials online. Here is one example link.

Let's see what it looks like in action on the sleep study data.

library(Lock5Data)
data(SleepStudy)

Fancy Plotting

ggplot(SleepStudy, aes(x=Drinks,y=GPA)) + 
  geom_point(position=position_jitter(w=0.1,h=0)) +
  geom_smooth() + xlab('number of alcoholic drinks per week')

Repeated Tasks

In addition to the usual for loops, R provides the apply and tapply functions. Recall the matrix X built earlier:

##         x y z
## [1,]   11 1 1
## [2,]  218 1 2
## [3,]  123 1 3
## [4,]   36 1 4
## [5,] 1001 1 5

apply(X,1,mean)

## [1]   4.333333  73.666667  42.333333  13.666667 335.666667

apply(X,2,mean)

##     x     y     z 
## 277.8   1.0   3.0
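
For comparison, here is the same row-wise computation written as an explicit for loop, as a minimal sketch. The built-in shortcuts `rowMeans` and `colMeans` give the same numbers.

```r
x = c(11, 218, 123, 36, 1001)
y = rep(1, 5)
z = seq(1, 5, by = 1)
X = cbind(x, y, z)

# Explicit loop over the rows
row_means_loop = numeric(nrow(X))
for (i in seq_len(nrow(X))) {
  row_means_loop[i] = mean(X[i, ])
}

row_means_apply = apply(X, 1, mean)  # second argument 1 means "over rows"
col_means_apply = apply(X, 2, mean)  # second argument 2 means "over columns"
```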

Repeated Tasks

Recall the data frame D built earlier:

##   subjects sex score
## 1      Jim   M    85
## 2     Jack   M    90
## 3      Joe   M    75
## 4     Mary   F   100
## 5     Jean   F    70

tapply(D$score,D$sex,mean)

##        F        M 
## 85.00000 83.33333
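
The same group means can be obtained in other ways. This sketch shows two base R alternatives, `split` followed by `sapply`, and `aggregate` with a formula; which one to prefer is mostly a matter of taste.

```r
subjects = c('Jim', 'Jack', 'Joe', 'Mary', 'Jean')
sex = c('M', 'M', 'M', 'F', 'F')
score = c(85, 90, 75, 100, 70)
D = data.frame(subjects, sex, score)

# Split the scores into one vector per group, then average each vector
by_sex = sapply(split(D$score, D$sex), mean)

# Formula interface; the result is a data frame with one row per group
agg = aggregate(score ~ sex, data = D, FUN = mean)
```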

User Defined Functions

To define your own functions:

mSummary = function(x) {
  q1 = quantile(x,.25)
  q3 = quantile(x,.75) 
  list(med=median(x),iqr=q3-q1)
}
xsamp = 1:13
mSummary(xsamp)

## $med
## [1] 7
## 
## $iqr
## 75% 
##   6
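
A possible variation, sketched here as an illustration: give the function an argument with a default value, and drop the `75%` label that `quantile` attaches to its result. The name `mSummary2` and the argument `probs` are choices made for this example.

```r
mSummary2 = function(x, probs = c(.25, .75)) {
  q = quantile(x, probs, names = FALSE)  # names = FALSE drops the "75%" label
  list(med = median(x), iqr = q[2] - q[1])
}

res = mSummary2(1:13)
res$iqr == IQR(1:13)  # the built-in IQR function gives the same value
```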

Monte Carlo Simulations

Generate a matrix with 100 rows and 10 columns whose elements are drawn from a standard normal distribution:

X = matrix(rnorm(10*100),ncol=10)

Each row is a distinct sample of size 10. The sample mean of each sample is

xbar = apply(X,1,mean)

and the variance of the sample means is

var(xbar)

## [1] 0.1217215

Compare this to the theoretical result \(\frac{\sigma^2}{n} = \frac{1}{10} = 0.1\).
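
With only 100 simulated samples the estimate is still noisy. The sketch below repeats the simulation with more and more samples to show the estimate settling near 0.1. The helper `var_of_means` and the grid of repetition counts are choices made for this illustration.

```r
set.seed(205)
n = 10  # size of each sample

# Hypothetical helper: variance of the sample means over `reps` samples
var_of_means = function(reps) {
  X = matrix(rnorm(n * reps), ncol = n)  # one sample per row
  var(rowMeans(X))
}

reps_grid = c(100, 1000, 10000, 100000)
estimates = sapply(reps_grid, var_of_means)
cbind(reps = reps_grid, estimate = estimates, theory = 1 / n)
```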

R Packages

In addition to the base functionality, there are thousands of packages available. This is the command for installing the package bootstrap:

install.packages("bootstrap")

Once installed, you can use it in your code:

library(bootstrap)
data(law)
head(law)

##   LSAT GPA
## 1  576 339
## 2  635 330
## 3  558 281
## 4  578 303
## 5  666 344
## 6  580 307
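
As a first taste of the bootstrap, here is a minimal sketch of a bootstrap standard error for the correlation between LSAT and GPA. So that it runs without the package, it uses only the six rows printed above (the name `law6` is made up here), which is far too few for a serious analysis. The number of resamples (2000) is arbitrary.

```r
set.seed(205)

# Only the six rows shown by head(law), typed in by hand
law6 = data.frame(LSAT = c(576, 635, 558, 578, 666, 580),
                  GPA  = c(339, 330, 281, 303, 344, 307))

observed_cor = cor(law6$LSAT, law6$GPA)

# Resample rows with replacement and recompute the correlation.
# With so few rows a resample can consist of one repeated row, in which
# case cor() warns and returns NA; suppress that warning here.
boot_cor = suppressWarnings(replicate(2000, {
  idx = sample(nrow(law6), replace = TRUE)
  cor(law6$LSAT[idx], law6$GPA[idx])
}))

# Drop the rare undefined resamples when computing the standard error
se_boot = sd(boot_cor, na.rm = TRUE)
c(correlation = observed_cor, bootstrap_se = se_boot)
```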

Homework with R Markdown