Lab 12: Correlation & Regression

Measures of correlation

univariate—involving one variable

bivariate—involving two variables

multivariate—involving more than two variables

So far, the descriptive statistics that we have been discussing have been univariate. However, much of the time we are interested in the co-distribution of two variables. If you know the value of \(x\), what does it tell you about the value of \(y\)? Do the two variables vary in the same direction or in opposite directions? Do they vary together at all? How strongly is the change in one variable related to change in the other? The correlation coefficient answers these questions by describing the strength and direction of a bivariate relationship. It is denoted with the lowercase letter r, whose value ranges from -1 to 1 (Fig. 1). The stronger the correlation, the closer r is to ± 1. If the two variables are perfectly correlated (r = ± 1), the points lie on a perfectly straight line.

Figure 1: Correlation between two variables indicates the strength and direction of the relationship.

Figure 2: The correlation between the two variables is r = 0.008, although it is clear that there is a strong non-linear relationship.

The most common correlation coefficient is Pearson’s r (also called Pearson’s correlation coefficient). It is calculated by dividing the covariance of two variables by the product of their standard deviations. There are two very important things to remember about correlation:

covariance—an extension of variance that measures how two variables vary together.

linear—able to be represented as a straight line on a graph.

confounding variable—a factor other than the one that you are studying that is associated with both the dependent variable and the independent variable.

  1. Correlation coefficients only capture linear relationships. Some relationships are very strong but nonlinear, and a correlation coefficient might indicate that the relationship is weak (Fig. 2).

  2. Correlation is not causation! Just because there is a relationship between two variables does not mean that one causes the other. A frequently given example is that both ice cream sales and murder rates go up in the summer. Does that mean that eating ice cream causes people to go into a murderous rage? Of course not! Part of what is actually occurring is that both ice cream sales and murder increase when it is hot outside. Although the two variables are highly correlated, there is no causal relationship at all. In this case, increased temperatures are what is known as a confounding variable.

Let’s try it out. Download climate.csv from BB and save it in your Datasets folder. As always, we will start by creating a new R script called LastName_Lab12. Make sure that when you open RStudio, you open it by double-clicking on your GEOG316 R project (the file extension is .Rproj). By opening the project and then opening (or creating) a script from inside the project, you ensure that your working directory is correctly specified. Don’t forget to load the tidyverse!

First, load the climate csv file. This data shows the global mean \(CO_2\) level in parts per million (ppm) and global mean temperature in degrees Celsius going back to 1958.

dat <- read_csv("Datasets/climate.csv")

Let’s plot our data to see what it looks like. Scatterplots are the most common way of representing bivariate data.

ggplot(dat) +
  geom_point(aes(x = CO2, y = Temperature))

That is definitely a linear relationship! Looks like a pretty strong one too. Let’s calculate the correlation coefficient so that we can quantify the strength of the relationship. It is very simple to do this in R. The cor() function takes two vectors as arguments and calculates the correlation coefficient. It does not matter which order the variables are in, because the correlation coefficient only tells you the strength and direction of the relationship between two variables and is not concerned with whether one variable is dependent on the other. You can also tell it how to handle missing values (use = “complete.obs”) or which test to use (method = “pearson”, “kendall”, or “spearman”). The Pearson correlation coefficient is the default method. You can always find out more about a function by typing it into the search bar on the Help tab in the lower-right panel of RStudio.

cor(dat$CO2, dat$Temperature)
## [1] 0.9429157
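
Since Pearson’s r is just the covariance divided by the product of the standard deviations, you can check cor()’s answer by computing it from those pieces with the base R functions cov() and sd():

# Pearson's r from its definition: covariance divided by the
# product of the standard deviations. This should reproduce
# the value returned by cor() above.
cov(dat$CO2, dat$Temperature) / (sd(dat$CO2) * sd(dat$Temperature))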

Pretty simple, huh? A perfect correlation is ± 1, and our correlation coefficient is pretty close to that. But this is just a kind of descriptive statistic. It doesn’t allow us to say anything about the population from this sample. We only have annual mean measurements, and although the stations whose measurements are averaged are spread all over the world, they aren’t everywhere. How do we know that this trend is truly representative of global levels? That’s where regression comes in.

Regression

linear regression—Linear regression also returns the test statistic \(t\), which tests whether the slope of the regression line is significantly different from zero.

Linear regression is one of the most used, if not THE most used, inferential statistical techniques. It is simple yet powerful, and broadly applicable. Linear regression creates a line of best fit between an independent variable and a dependent variable (Fig. 3). The equation of this line is the one that minimizes the squared distance between each observation and the line. Remember that squaring the differences gets rid of negative numbers. This is known as the method of least squares. Let’s look at the climate data again.

Figure 3: The regression line is a line of best fit that minimizes the residual error. Residuals are indicated by the dashed lines.

residual—the distance, or error, between the observed value and the predicted value.

y-intercept—the point where a line crosses the y-axis; it is the value of \(y\) when \(x\) is 0.

slope—a number that describes the direction and steepness of a line. It is the ratio of the vertical change (rise) to the horizontal change (run) between any two points. A linear equation has a constant slope.

The regression line shows the predicted \(y\) value for any given \(x\) value. The dashed lines are known as residuals. The regression equation minimizes these residuals so that their squared sum is the least that it can possibly be. \[y=\beta_0+\beta_1x+\epsilon \quad \textrm{or} \quad y=a+bx+e \] In this equation, x is the independent variable, y is the dependent variable, \(\beta_0\) is the y-intercept, \(\beta_1\) is the slope, and \(\epsilon\) or \(e\) is a term representing the random error. Nothing is perfectly predictable! The significance of the linear regression equation is quantified using a t-test. Is the slope significantly different from zero? If it is not, you cannot say that there is a relationship between \(x\) and \(y\) (Fig. 4). A zero slope would mean that no matter what the value of \(x\), the value of \(y\) would remain the same. No relationship means that knowing the value of \(x\) does not provide you with any information about the value of \(y\). If the slope is significant, it represents the average change in \(y\) per unit change of \(x\).
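
Written out, the fitted intercept and slope are the values that make the sum of squared residuals as small as possible: \[ \min_{\beta_0,\,\beta_1} \sum_{i=1}^{n} \left(y_i - (\beta_0 + \beta_1 x_i)\right)^2 \]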

Figure 4: The p-value of the slope is 0.48. As you can see, there is no significant relationship between the variables.

coefficient of determination—\(R^2\), the proportion of the variance in the dependent variable accounted for by the independent variable.

homoscedasticity—equal variance of the residuals.

autocorrelation—also known as serial correlation, a relationship between observations due to a temporal (time) or spatial lag between them.

multicollinearity—high correlation between independent variables.

If the slope of the line is significant, it is appropriate to consider the correlation coefficient. If you square this value, you get the \(R^2\) (pronounced r squared) value, otherwise known as the coefficient of determination. It represents the proportion of the variation in the dependent variable that is accounted for by the variation of the independent variable. In the case of \(CO_2\) and temperature, the slope of the linear fit is significant at the \(p<0.001\) level and \(R^2=\) 0.8890899. This means that almost 90% of the variation in temperature can be accounted for by the variation in \(CO_2\).
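
If you want to reproduce those numbers yourself, here is a minimal sketch, using the lm() function that is introduced in detail below and assuming the climate data is still loaded as dat:

# Fit temperature as a function of CO2, then pull the R-squared
# out of the model summary. It should match the value quoted above.
climate.model <- lm(Temperature ~ CO2, data = dat)
summary(climate.model)$r.squared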

In addition to assuming that the data are i.i.d., linear regression also assumes that the relationship between the independent and dependent variables is linear and that there is no pattern to the residuals. This is known as homoscedasticity (Fig. 5). The opposite case is known as heteroscedasticity, and this can indicate either outliers or the omission of important variables (Fig. 6). The data is also assumed to have no autocorrelation. Autocorrelation means that the data is correlated with itself, like in a time series. Although the temperature tomorrow will be affected by many things, one of the most influential factors will be the temperature today.

Figure 5: There is no apparent pattern to the residuals, meaning they are homoscedastic.

Figure 6: The residuals are heteroscedastic because the variance increases as x increases.
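
A quick way to check this assumption for your own model is to plot the residuals against the fitted values and look for a pattern; a minimal sketch, assuming the climate.model object from the sketch above:

# Residuals vs. fitted values: a shapeless cloud centered on zero
# suggests homoscedasticity; a funnel shape suggests heteroscedasticity.
ggplot(data.frame(fitted = fitted(climate.model),
                  residual = resid(climate.model))) +
  geom_point(aes(x = fitted, y = residual)) +
  geom_hline(yintercept = 0, linetype = "dashed")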

Linear regression can be simple—having only one independent variable—or multiple—having more than one independent variable. In a multiple linear regression, the dependent variable is a linear combination of the independent variables. For instance, salmon weight could be modeled as a combination of body length, body width, and species. In a multiple linear regression, there is the additional assumption that there is not an excess of multicollinearity. This just means that none of the independent variables are highly correlated with each other.

When there are multiple independent variables the equation takes the form \[y=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3+...+\beta_nx_n+\epsilon \] where \(x_1...x_n\) are the independent variables (\(n\) just indicates that there are however many variables you specify) and \(\beta_1...\beta_n\) are the corresponding slope coefficients. Each coefficient represents the average change in y for a one-unit increase in the corresponding x, holding all other variables constant. So in our salmon weight example, the slope coefficient for body length would be the average increase in salmon weight per one-unit increase in body length when body width and species are held constant. The coefficient for body width would be the increase in weight per increase in body width holding length and species constant, and so forth.
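
In R, additional independent variables are simply added to the right-hand side of the model formula with +. Here is a sketch of what the salmon example might look like (the salmon data frame and its column names are hypothetical):

# Hypothetical multiple regression: salmon weight modeled as a linear
# combination of body length, body width, and species
salmon.model <- lm(weight ~ length + width + species, data = salmon)
summary(salmon.model)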

There are many kinds of regression techniques, including those for data that is not normally distributed. These include generalized linear models such as logistic regression (also known as logit regression) for a binary dependent variable and Poisson regression for an integer (count) dependent variable, as well as polynomial regression to model a non-linear relationship.
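
For reference, generalized linear models are fit in R with the glm() function by specifying a family argument, while polynomial terms can be added to an ordinary lm() formula. A sketch with hypothetical data frame and variable names:

# Logistic regression for a binary (0/1) dependent variable
logit.model <- glm(survived ~ age, data = mydata, family = binomial)

# Poisson regression for a count dependent variable
count.model <- glm(nests ~ habitat, data = mydata, family = poisson)

# Polynomial regression: still fit with lm(), here with a quadratic term
poly.model <- lm(y ~ poly(x, 2), data = mydata)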

Creating a linear regression model in R is done with the lm() (linear model) function. Load KansasCountyWheatAg.csv from your Datasets folder. These are measurements of planting and production from wheat farms in Kansas, aggregated by county. Let’s examine the relationship between acres planted and bushels produced.

wheat <- read_csv("Datasets/KansasCountyWheatAg.csv")

independent variable—a variable whose variation does not depend on another measured variable. Research often focuses on whether or not an independent variable has a direct effect on a dependent variable.

dependent variable—a variable whose variation depends on the independent variable. This is the outcome of interest in the research.

Like cor(), lm() needs two variables, but this time the order matters. This time we are saying that one of the variables is dependent on the other, so the relationship between the two is not symmetrical. Bushels produced is the dependent variable because how much wheat is produced depends on how much was planted; it is not true that the amount of wheat planted depends on how much was produced. To specify the formula for your linear model you use the tilde (~). The variable on the left of the ~ is dependent on the variable on the right.

First we will plot the data. The independent variable goes on the x-axis and the dependent variable goes on the y-axis.

ggplot(wheat) +
  geom_point(aes(x = acresplanted, y = bushelsproduced))

As I’m sure you imagined, it looks like there is a strong linear relationship between the acres planted and the number of bushels produced. Let’s see what the linear model has to say.

wheat.model <- lm(wheat$bushelsproduced ~ wheat$acresplanted)
wheat.model <- lm(bushelsproduced ~ acresplanted, data = wheat)

Both of these lines of code fit the same model. I tend to specify models like the first line because it makes use of the autocomplete feature in RStudio, so I am less likely to make a typo. The second line is easier to read, however.

Whichever way you choose, if you save the linear model as a variable, it makes it easy to inspect the specifics of the model using summary().
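
summary(wheat.model)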

## 
## Call:
## lm(formula = bushelsproduced ~ acresplanted, data = wheat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2768200  -265131   -34953   357714  2289857 
## 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  374921.57  140499.54   2.668              0.00918 ** 
## acresplanted     21.97       1.18  18.617 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 778700 on 82 degrees of freedom
## Multiple R-squared:  0.8087, Adjusted R-squared:  0.8063 
## F-statistic: 346.6 on 1 and 82 DF,  p-value: < 0.00000000000000022

Figure 7: Explanation of the key features of the lm() output.

You can see that both the intercept and the slope parameters are significant at better than the 0.01 level, so we can interpret the adjusted \(R^2\): approximately 81% of the variation in the number of bushels produced can be accounted for by the variation in acres planted. The equation for the regression line is: \[ \text{bushels produced} = 374921.57 + 21.97 \times \text{acres planted} \]
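
You can use this equation to make predictions: a county that planted 200,000 acres, for example, would be expected to produce about 374,922 + 21.97 × 200,000 ≈ 4.77 million bushels. The predict() function will do the arithmetic for you; a quick sketch (this assumes the model was fit with the data = wheat form so that R knows the column names):

# Predicted bushels produced for a county planting 200,000 acres
predict(wheat.model, newdata = data.frame(acresplanted = 200000))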

It is also easy to add the regression line to a ggplot. Just like we added the loess line to the time series plots, we can use geom_smooth() to add a regression line.

ggplot(wheat, aes(x = acresplanted, 
                  y = bushelsproduced)) +
  geom_point() +
  geom_smooth(method = "lm")

You can see that as the number of acres planted increases, the confidence interval around the regression line widens. This is because there is more variability in the data there, so we are less confident in our estimation of the slope parameter.

If we save the summary of our model output as a variable, we can extract values from it like we did with t-tests. Let’s spruce up our plot. We will change the axis labels, add an informative title, and add the \(R^2\) value to the plot. Notice in the title that we say that the dependent variable is regressed on the independent variable. This is the correct way to talk about the variables in a regression.

# Save the summary as a variable
model.sum <- summary(wheat.model)

# Extract the adjusted r-squared 
# value from the summary
r.sq <- model.sum$adj.r.squared

# Round to three decimal places 
# because it looks better
r.sq <- round(r.sq, 3)

# Create a text label to add to the plot.
# The paste() function puts things together
# as a string. Here we have some text and
# add the r-squared value at the end. Notice
# how that is not a string but the variable
# that holds the value.
r.sq.label <- paste("Adj. R-squared = ", r.sq)

# The annotate function adds text to 
# the plot. Specifying the geom as "text" 
# says to just add text to the plot. 
# The x and y are the plot coordinates 
# where you want your text to be placed. 
# You need to put it somewhere that it 
# won't overlap with your plot. The text 
# will be centered around these values. 
# With label you specify what text
# you want to add. Here we specify the 
# string that we saved as a variable.
ggplot(wheat, aes(x = acresplanted, 
                  y = bushelsproduced)) +
  geom_point() +
  geom_smooth(method = "lm") +
  annotate(geom = "text", 
           label = r.sq.label, 
           x = 300000, y = 2500000) +
  xlab("Acres planted") +
  ylab("Bushels produced") +
  ggtitle("Bushels produced regressed on acres planted") 

# You could also put the label here without 
# interfering with the plot.
ggplot(wheat, aes(x = acresplanted, 
                  y = bushelsproduced)) +
  geom_point() +
  geom_smooth(method = "lm") +
  annotate(geom = "text", 
           label = r.sq.label, 
           x = 100000, y = 7500000) +
  xlab("Acres planted") +
  ylab("Bushels produced") +
  ggtitle("Bushels produced regressed on acres planted") 

Exercises

  1. Let’s revisit the mtcars dataset that is built into R. The data is from the Motor Trend US magazine, and contains variables relating to the design and performance of 32 cars from 1974. Load the data with data(mtcars). We will look to see if there is a relationship between weight (wt) and miles per gallon (mpg). It seems likely that heavier cars would not be as fuel efficient as lighter cars. Create a scatterplot with weight on the x-axis and mpg on the y-axis. Describe the relationship between the variables. Is it linear? Does it appear to be a strong relationship? What direction is it?

  2. What is the Pearson’s correlation coefficient for the variables? What does this tell you about the relationship between the two variables?

  3. Create a linear regression model. Remember that mpg is the dependent variable. Describe the model in words from looking at the summary. Is it significant? What is the equation for the regression line (write it out)? What is the adjusted \(R^2\) and what does it tell you about the relationship between the two variables?

  4. Make a scatterplot with weight on the x-axis and mpg on the y-axis. Change the axis labels, add an informative title, and add the adjusted \(R^2\) to the plot just like we did in the lab. Make sure that the text does not overlap the plotted data. Save your plot and submit it to BB along with your script.