We will first load in the libraries for this homework assignment. There are quite a few packages I used for small conversions this week.

# Libraries
library(datasets)
library(dplyr)
library(tidyr)
library(ggplot2)
library(readr)
library(purrr)
library(waveslim)
library(reshape2)

Exercise 1

In this problem, we want to understand wavelets and function estimation using the sunspots data. The sunspots data stores monthly mean relative sunspot numbers from 1749 to 1983. The data were collected in Zurich, Switzerland until 1960 and then in Tokyo, Japan from then onward.

For some data background, sunspots are temporary phenomena on the photosphere of the sun that appear visibly as dark spots compared to surrounding regions. The measurements in the data correspond to concentrations of magnetic field flux that inhibit convection and result in reduced surface temperature compared to the surrounding regions.

We first use the dwt command to obtain the wavelet coefficients of the first 512 components of the sunspots data. We will use \(n.levels = 9\). The dwt command determines the wavelet coefficients at each resolution level (in this case, we have 9 resolution levels).

sun_comps <- sunspots[1:512]
# Wavelet Decomposition
sun_dwt <- dwt(sun_comps, n.levels = 9)

Part A

In this part of the problem, we explore the sparsity of the data representation. We use the unlist command on the output of the dwt command to create a single vector of coefficients from all resolution levels (from the highest to the lowest).

sun_coefs <- sun_dwt %>% unlist()

Next, we make two histograms: one of the coefficient vector from the unlist function and one of the untransformed sunspots data (first 512 components).

How do these histogram shapes illustrate the sparsity of the wavelet representation of the data?

We first create a data frame that binds the two sets of values that we want and create the two histograms. We see that the left panel is a histogram of the coefficients obtained from dwt and the right panel is a histogram of the original sunspots data.

We limited the values to see the distribution a little more clearly and for easier comparison between the two distributions. At a first glance, we see that the untransformed sunspots data is right-skewed and all positive whereas the distribution of the coefficients seem to be concentrated around 0.

sun_hist <- data_frame(Untransformed = sun_comps, Coefficients = sun_coefs)

# Plot histograms side-by-side
sun_hist %>%
  gather(set, value, Untransformed:Coefficients) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = value), binwidth = 15) +
  facet_wrap( ~ set, scales = "free_y") +
  coord_cartesian(x = c(-110, 180)) + 
  labs(x = "Values", y = "Count")

Taking a closer look at the coefficient histogram by itself without limiting the axes, we see that there are a lot of coefficients centered on or around 0. There are a few select coefficients that are significantly negative and a few coefficients that are overwhelmingly positive.

data_frame(sun_coefs = sun_coefs) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = sun_coefs), binwidth = 15) +
  labs(x = "Wavelet Coefficients", y = "Count")

Looking at the untransformed sunspots data, we see that the data is heavily right-skewed (as shown before). There are few “extreme” values and a lot of the data seem to be around the 0 to 100 range.

data_frame(sun_comps = sun_comps) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = sun_comps)) +
  labs(x = "Untransformed Sunspot Data", y = "Count")

Since most of our coefficients are near 0 or 0, we see that these coefficients weight our data such that a lot of them are regularized towards 0. This means that a lot of our data becomes small and trivial. Some of the data is more heavily weighted since we have a few really large coefficients.

Part B

In this problem, we want to create a histogram using only the highest level of detail coefficients. Do these coefficients appear to be symmetric about 0? Normal?

We see that the highest level of detail coefficients are indeed centered around 0. However, we don’t see the distribution as being completely normal because there seems to be a slight skew to the right.

data_frame(high_coef = sun_dwt$d1) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = high_coef), binwidth = 7) +
  labs(x = "Highest Level of Detail Coefficients", y = "Count")

Part C

In this part of the problem, we want to compare the two thresholding measures SureShrink and VisuShrink (hard and soft thresholds). We will use the package waveslim and then go through the differences in the reconstructions.

\(\textbf{SureShrink}\)

From the referenced book, we will use the hybrid.thresh function with the default settings of max.level = 4. This thresholds the wavelet coefficients in the four detail resolution levels. We then use the inverse function to obtain the original points for the function estimation.

sure_thresh <- hybrid.thresh(sun_dwt) 
sure_idwt <- idwt(sure_thresh)

We overlay the SureShrink fit onto the original data. The original data is the black line while the SureShrink fit is noted in red. The SureShrink fit gives a pretty smooth fit of the original data. Note that SureShrink only uses soft thresholding.

ggplot() + 
  geom_line(mapping = aes(x = 1:length(sun_comps), y = sun_comps)) + 
  geom_line(mapping = aes(x = 1:length(sun_comps), y = sure_idwt), 
            color = "red",
            size = 1) + 
  labs(x = "", y = "Monthly Mean Relative Sunspot Numbers", 
       title = "SureShrink Estimation")