As always, we will start by creating a new R script called LastName_Lab11. Make sure that when you open R Studio, you open it by double clicking on your GEOG316 R project. The file extension is .Rproj. By opening the project and then opening (or creating) a script from inside the project you ensure that your working directory is correctly specified. Don’t forget to load the tidyverse!

1 Time series

Time series data—also known as temporal data—are just what they sound like, a series of measurements made at regular time intervals. Often the interest in a time series is to forecast the future: the stock market, climate change, electricity consumption. The primary difference between time series data and other data is that they are ordered. It is important to consider this order when analyzing the data. Time series data may be characterized by seasonality—regular patterns of variation, stationarity—whether or not the mean is constant over time, and auto-correlation—similarity between observations as a function of the time lag between them. This article has nice descriptions with visuals for these terms.

Let’s look at a data set that records the widths of tree rings from a Douglas fir that was felled in Idaho in 1889. Each year as trees grow, they produce couplets of light and dark colored wood, which form a tree ring. Tree rings are an important paleoclimate proxy because their thickness is related to moisture availability. Wider rings usually indicate a wetter year and thin rings may indicate drought years.

tree_rings <- read_csv("Datasets/treerings.csv")

range(tree_rings$year)
## [1] 1276 1889

You can see that the rings go all the way back to 1276. That’s an old tree! The ring corresponding to the year 1276 is numbered one and is the ring at the center of the trunk. Consecutive rings count outward to the last ring from 1889. The ring widths are measured in millimeters.

Let’s plot the data to take a look.

ggplot(tree_rings) +
  geom_line(aes(x = year, y = ringwidth))

You can see that there is a lot of fluctuation with many years of less growth between around 1300 and 1600 (this roughly corresponds to the Little Ice Age!) and years of more growth between 1600 and 1800.

Often with time series data, the value of a variable is less important than how that variable deviates from the average. This is known as an anomaly—calculating anomalies let’s you see during which years was the tree ring width significantly different than the mean. These anomalies will have a mean of 0 with positive numbers representing years with more growth than average and negative numbers representing years with less growth than average. To do this we will calculate the mean tree ring width and then subtract it from the individual observations.

mean.width <- mean(tree_rings$ringwidth)

tree_rings <- tree_rings %>% 
  mutate(anomaly = ringwidth - mean.width)

ggplot(tree_rings, aes(x = year, y = anomaly)) +
  geom_line() +
  geom_hline(yintercept = 0, color = "red")

The pattern of the data hasn’t changed at all since we subtracted the same number from each observation but now it is easier to see which years had higher than average growth. It is still difficult to see the overall trend, however, because there is so much year to year variability. With ggplot it is easy to add a smoothed trend line called a loess regression line. It even adds confidence intervals! We will talk about regression in the next lab but for now just now that it is a method for fitting a best fit line to the data. With loess regression you fit a polynomial surface based on a window of observations. This creates a curved line. The size of this window, or span, determines whether the fit is more local—based on fewer observations—or more global—affected by more of the observations. Let’s take a look.

ggplot(tree_rings, aes(x = year, y = anomaly)) +
  geom_line() +
  geom_hline(yintercept = 0, color = "red") +
  geom_smooth(method = "loess")

The loess function is in blue and the confidence intervals are in grey around it. You can’t really see them because the background is also grey. Let’s change that.

ggplot(tree_rings, aes(x = year, y = anomaly)) +
  geom_line() +
  geom_hline(yintercept = 0, color = "red") +
  geom_smooth(method = "loess", fill = "green") +
  theme_bw()

If we change the span, the line will become wigglier (is that a technical term?). The default span is 0.75, meaning that for each y value, 75% of the values surrounding the year that the window is centered on are considered when creating the line equation.

ggplot(tree_rings, aes(x = year, y = anomaly)) +
  geom_line() +
  geom_hline(yintercept = 0, color = "red") +
  geom_smooth(method = "loess", fill = "green", span = 0.5) +
  theme_bw()

ggplot(tree_rings, aes(x = year, y = anomaly)) +
  geom_line() +
  geom_hline(yintercept = 0, color = "red") +
  geom_smooth(method = "loess", fill = "green", span = 0.3) +
  theme_bw()

ggplot(tree_rings, aes(x = year, y = anomaly)) +
  geom_line() +
  geom_hline(yintercept = 0, color = "red") +
  geom_smooth(method = "loess", fill = "green", span = 0.1) +
  theme_bw()

ggplot(tree_rings, aes(x = year, y = anomaly)) +
  geom_line() +
  geom_hline(yintercept = 0, color = "red") +
  geom_smooth(method = "loess", fill = "green", span = 0.01) +
  theme_bw()

You can see that as the span decreases the smoothed line gets closer and closer to being the same as the actual line. However, as the fit gets better, the confidence interval gets wider and wider. This is because the line is overfit. An overfit line would fail to accurately predict future values because it is too dependent on the observed data.

Let’s look at the default span of 0.75 again. Let’s add more x-axis ticks too.

ggplot(tree_rings, aes(x = year, y = anomaly)) +
  geom_line() +
  geom_hline(yintercept = 0, color = "red") +
  geom_smooth(method = "loess", fill = "green") +
  scale_x_continuous(breaks = c(1300, 1400, 1500, 1600, 1700, 1800)) +
  xlab("Year") +
  ylab("Width anomaly") +
  ggtitle("Tree ring widths (mm) of Douglis Fir felled in 1889") +
  theme_bw()

You can very clearly see the trends in the data—between around 1300 and 1550 there was less growth than average, more growth than average between 1550 and 1825, and then less again after 1825 until the tree was felled.

2 Runs test of randomness

The runs test measures whether or not observations in a time series are generated by a random process. The null hypothesis \(H_0\) is that they are. The alternative hypothesis \(H_1\) is that the observations are generated by a non-random process. This indicates that there is a trend in the data and that consecutive observations are not independent of each other. The runs test works by transforming the observed quantitative variable into a variable—whether the observation is above the mean or below the median. The number of runs (a sequence of identical values) is compared to the expected number of runs and then the test statistic is compared to the theoretical distribution to get the p-value.

In R you can use the runs.test() function from the randtests package. Remember that you need to uncomment the line to install the package.

# install.packages("randtests")
library(randtests)

Let’s see if the tree ring data are random. It doesn’t matter if you test the widths or the width anomalies because they are distributed the same.

runs.test(tree_rings$anomaly, alternative = "two.sided",  plot = T)

## 
##  Runs Test
## 
## data:  tree_rings$anomaly
## statistic = -18.385, runs = 58, n1 = 240, n2 = 304, n = 544, p-value <
## 0.00000000000000022
## alternative hypothesis: nonrandomness

Totally nonrandom! The p-value of less than 0.001 indicates that we should reject the null hypothesis and accept the alternative hypothesis of nonrandomness. This indicates that there is some kind of nonrandom process that is affecting the tree ring widths. And of course we know that they are highly dependent on moisture availability! The next step in an analysis would be to get precipitation data (probably hard for the year 1276!) and see if the distribution of the tree ring widths and the precipitation are the same. Essentially you want to see if high precipitation years are also large tree ring years.

The plot shows the below average observations in red and the dashed vertical lines are the boundaries of the runs.

3 Exercises

  1. Download the KC weather data from BB and put it in your datasets folder. It contains meteorological data for Kansas City for the years 2000-2012.
  • AWND = average wind speed
  • PRCP = precipitation level
  • TAVG = average temperature in degrees Celsius
  • TMIN = minimum temperature in degrees Celsius
  • TMAX = maximum temperature in degrees Celsius

Select the year 2009 and plot the average wind speed by day. Add a horizontal line showing the average wind speed for that year and a descriptive title. You will need to include na.rm = T in the mean() function because there is an NA value. Save your plot and upload it to BB along with your script. Because the date format in R is a type of string, you can use:

filter(str_detect(DATE, "2012"))
  1. Calculate the wind speed anomaly and plot it by day. Add a horizontal line at zero and a descriptive title. Add a loess line to your plot using a span of 0.5. Change the color of the confidence intervals and make the background white. Save your plot and upload it to BB. Does it seem like some months are windier than others? Describe how the wind speed changes through the year.

  2. Write out a null and alternative hypothesis for the runs test for the 2009 wind speed. Conduct the runs test. What is your p-value? Should you reject the null hypothesis? Use \(\alpha=0.05\). What does this tell you about the wind speed in 2009?