We will first load in the libraries for this homework assignment. There are quite a few packages I used for small conversions this week.

# Libraries

Exercise 1

In this exercise, we want to explore the relationship between the Dirichlet Process (namely its parameter \(\alpha\)) to a fixed base distribution.

  1. We will first write R code to draw 100 sample curves from the Dirichlet Process (DP) prior for a range of concentration parameters \(\alpha\) and one fixed base distribution \(F_0\) (for example, the normal distribution).

We will assume that our fixed base distribution \(F_0\) is a standard normal distribution with a mean of 0 and a variance of 1.

Our function dir_curves first draws a sample of size n (set to a default of \(n = 100\)) from a standard normal distribution. Then we draw a sample size of n from a beta distribution with the parameters of 1 and \(\alpha\).

Next, we calculate the weights for each point in the normal distribution by taking the Dirichlet Process product as defined in lecture \(\prod_{i = 1}^n 1 - \beta_i\). Note that \(\beta_i\) are the values we sampled from the beta distribution earlier.

Finally, we sample from the normal sample we obtained in the first step of the function weighted by the weights obtained using the Dirichlet Process.

dp_curves <- function(alpha, n = 100) {
  # Draw from standard normal
  s <- rnorm(n = n, mean = 0, sd = 1)
  # Draw from beta distribution
  v <- rbeta(n = n, shape1 = 1, shape2 = alpha)
  w <- array(0, n)
  w[1] <- v[1]
  # Calculate weights
  w[2:n] <- sapply(2:n, function(i) v[i] * prod(1 - v[1:(i-1)]))
  theta <- sample(s, prob = w, replace = TRUE) 
  1. What happens when we change \(\alpha\)?

\(\textbf{As we increase the alpha value, the Dirichlet curves will follow the base distribution more closely due to the evenness in the spread of the weights}\). We will show this fact using our simulation below.

We will use our function dp_curves to explore this question. We use a range of different \(\alpha\)-values ranging from 1 to 1000. Then we generate DP curves from our function in the previous section for the different types of alpha values. Lastly, we plot cumulative step functions obtained in order to plot the Dirichlet curves.

n <- 100
alpha_val <- c(1, 5, 10, 25, 50, 100, 200, 500, 1000)

vals_array <- matrix(0, nrow = length(alpha_val), ncol = n + 1)
vals_array[,1] <- alpha_val

for (i in seq_len(length(alpha_val))) {
  vals_array[i,2:ncol(vals_array)] <- dp_curves(alpha_val[i], n)

The graph generated using the code below depicts the DP curves generated with the alpha value noted at the top of the section. One observation is that the curves generated with lower alpha values are more chunky with larger jumps. We overlay the cumulative distribution function in red for each of the graphs. In the graph with \(\alpha = 1\), we see that the Dirichlet curve doesn’t follow the normal distribution ecdf very closely. However, as we increase the alpha, we see that the curve ecdfs get less chunky with less jumps. By \(\alpha = 1000\), we see that the DP curve follows the normal ecdf pretty closely.

data.frame(vals_array) %>% 
  gather(group, value, X2:X101) %>% 
  select(-group) %>%
  rename(group = X1) %>%
  ggplot() + 
  stat_ecdf(mapping = aes(x = rnorm(length(alpha_val)*n)), color = "red") +
  stat_ecdf(mapping = aes(x = value)) +
  facet_wrap( ~ group) + 
  labs(x = "Value", y = "Cumulative Probability")

Notice that for smaller values of alpha, the generated weights seem to be concentrated in a couple of areas (in the \(\alpha = 1\), we see that the weights are concentrated near the two extremes). As the alphas increase, we see that the weights become more evenly spread. Thus, \(\textbf{as we increase the alpha value, we increase the spread of the points to follow the base distribution more closely}\).

  1. What does this corresponds to in a more standard distribution, say the normal distribution?

\(\textbf{The change in alpha values correspond to the spread or variance in a standard distribution}\). Particularly in the normal distribution example we used, our cdf functions had more jumps and more frequently as our alphas got to be larger. This means that the spread of the weights on our points are more even across the entire range instead of concentrating in a few clusters.

Exercise 2

In this problem, we will explore hourly wages across states and occupations.

  1. First, we will load in occupational employment statistics from the Bureau of Labor Statistics (the occupational employement statistics page).

We set the null values to be NA so it is more interpretable for R.

employment <- read_excel("employment_stats.xlsx", na = "NA")
  1. Next, we construct a two-way table of median hourly wages with states as rows and job titles as columns.

We first do some filtering our the data set by selecting out the features of interest (namely, state, job title, and hourly median wage). Then, we convert hourly_median_wage to a numerical value for easier analysis later down the road. Lastly, we spread the data in order to obtain a two-way table where the rows are states and the columns are job titles. Each cell contains the median hourly wages for that particular state and job.

occ_stats <- employment %>%
  rename(state = STATE,
         job_title = OCC_TITLE,
         hourly_median_wage = H_MEDIAN) %>%
  dmap_at("hourly_median_wage", as.numeric) %>%
  select(state, job_title, hourly_median_wage) %>%
  spread(job_title, hourly_median_wage)

We add in row names as the states for the data set.

row.names(occ_stats) <- occ_stats$state

Lastly, we remove the states column and convert the data frame to a matrix.

occ_stats <- occ_stats %>%
  select(-state) %>%

Taking a look at a section of our two-way table, we see that it has quite a few missing values. In the next step, we will impute the missing values in order to perform Friedman’s test.

occ_stats[1:5, 1:3]
##            Accountants and Auditors Actors Actuaries
## Alabama                       29.38     NA     43.27
## Alaska                        35.77     NA        NA
## Arizona                       28.18  14.73     41.68
## Arkansas                      27.55     NA        NA
## California                    35.12     NA     48.30
  1. Next, we perform the Friedman’s test on the constructed table. What is our null and alternative hypothesis? What are our conclusions?

We will first get rid of columns that have more than 15% missing values. This gets rid of occupations for which we don’t have much data across different states. We will use a threshold of 15% for a stricter bound on the density of our two-way table. Having too many null values might influence the median statistic and the results of the Friedman test.

na <- NULL
for (i in seq_len(ncol(occ_stats))) {
  if (mean(is.na(occ_stats[,i])) > 0.15) {
    na <- c(na,i)
occ_stats <- occ_stats[,-na] %>% as.matrix()

Next, we will inpute the missing values within each column. We will replace missing values with the median. Since Friedman’s test is a nonparametric method where the test statistic depends on the rank of the observations, we see that by imputing the median, we aren’t changing the ranking of the observations. We do obtain an increase in the median rank, but it’s something smaller that we can overlook.

We end up with 54 states (like before) and 490 jobs (almost half of what we had previously - 844 jobs).

for (i in nrow(occ_stats)) {
  for (j in ncol(occ_stats)) {
    if (is.na(occ_stats[i,j])) {
      occ_stats[i,j] <- median(occ_stats[,j])

We perform the Friedman test on our two-way table of median hourly wages across states. The Friedman test is used for one-way repeated measures analysis of variance by ranks.

In our data set, our states are our “blocks” and our job titles are the “treatments”. By performing the Friedman test, we want to answer the question of whether any job consistently has higher or lower median wages than the other jobs.

We use the function friedman.test to perform Friedman’s test.

##  Friedman rank sum test
## data:  occ_stats
## Friedman chi-squared = 3239.6, df = 489, p-value < 2.2e-16

We obtain a statistic of 3239.6, degrees of freedom of 489 (obtained from the number of treatments 490 minus 1), and a p-value of less than 0.01.

  1. Lastly, we perform median polish on this table. What are our conclusions?

We now follow the lecture notes from class (ANOVA Lecture) and perform median polish on our two-way table. From median polish, we are trying to analyze the following model \(Y_{ij} = \mu + \alpha_i + \beta_j + \epsilon{ij}\).

# Compute the median
mu <- median(occ_stats, na.rm = T)

# Compute median residuals
med_polish <- as.matrix(occ_stats) - mu

# Row median (adding and residuals)
row_median <- apply(med_polish, 1, median, na.rm = T)
med_polish <- med_polish - row_median
med_polish <- cbind(roweff = c(row_median), med_polish)

# Column median (adding and residuals)
med_polish <- rbind(coleff = rep(0, ncol(med_polish)), med_polish)
med_polish[1,1] <- mu
col_median <- apply(med_polish[2:nrow(med_polish),], 2, median, 
                    na.rm = T)
med_polish[1,] <- med_polish[1,] + col_median

# Create new residual table from column medians
med_polish[2:nrow(med_polish),] <- sweep(med_polish[2:nrow(med_polish),],
                                         2, col_median) 

# Second iteration adding row median
second_row_median <- apply(med_polish[,2:ncol(med_polish)], 1, median, na.rm = T)
med_polish[,1] <- med_polish[,1] + second_row_median
med_polish[,2:ncol(med_polish)] <- sweep(med_polish[,2:ncol(med_polish)], 1, second_row_median) 

# Second iteration adding column median
second_col_median <- apply(med_polish[2:nrow(med_polish),], 2, median, na.rm = T)
med_polish[1,] <- med_polish[1,] + second_col_median 
med_polish[2:nrow(med_polish),] <- sweep(med_polish[2:nrow(med_polish),],
                                        2, col_median)

We end up with the following median polish residual table. At a first glance, we see that Alaska, Washington DC, Massachusetts, New York, and Washington have the highest median hourly wages. This is interesting as while the latter four states are well known urban hubs, Alaska is a sparse state with not too much industry.

We definitely see that the the column effects of the job title is a stronger factor in distinguishing among the median hourly wages than the geographic region (states). We have effect values in the 10s to 30s for job title, but effects for state lingers around less than 10 consistently.

We see that the residuals are the portion of the median hourly wages that can’t be explained by either state or job title. There are several jobs in which the residuals are also fairly large. Across rows, these residuals seem to even out. However, for some occupations such as Administrative Law Judges, we see that the residuals are in the -20 and +20 range. This suggests that some occupations are not well explained by state and job title.

Here’s a section of our final median polish table.

med_polish[1:5, 1:3]
##            roweff Accountants and Auditors
## coleff   19.04000                 10.62000
## Alabama  -0.56250                -10.28250
## Alaska    5.09500                 -9.55000
## Arizona   0.30000                -12.34500
## Arkansas -1.86875                -10.80625
##          Administrative Law Judges, Adjudicators, and Hearing Officers
## coleff                                                        23.75750
## Alabama                                                        5.34750
## Alaska                                                       -21.70000
## Arizona                                                      -24.33500
## Arkansas                                                     -55.58625

Do model fit evaluations using a Tukey additivity plot. What are your conclusions?

We will first make a Tukey additivity plot. We will use the function medpolish to repeat the process we performed manually above.

tukey <- medpolish(occ_stats, na.rm = T)
## 1: 59528.3
## 2: 57951.14
## Final: 57936.25