Quantitative Methods Handbook

Elizabeth Jane Wesley

Introduction

The best approach to studying the relationships between society and environmental change is to combine information, techniques, and perspectives from multiple disciplines. This is known as interdisciplinary research and forms the basis for environmental studies. Usually this means an integration of quantitative and qualitative research methods. Very generally speaking, quantitative research is concerned with testing theories, while qualitative research is concerned with generating theories.

The purpose of this handbook is to provide an introduction to the principles and techniques of quantitative research. The material is purposefully broad and brief, with the goal of providing the reader with a scientific vocabulary that will enable them to confidently pursue further information in the manner that is best suited to their needs.

Quantitative methods

Quantitative methods employ deductive reasoning. Simply put, this involves starting with a general theory and seeing if evidence from a specific situation supports that theory. In contrast, qualitative methods are largely inductive—they work in the other direction, starting with specific evidence and using it to generate a theory. Additionally, quantitative research focuses on increasing our understanding of the world in an objective way rather than privileging the social construction of knowledge. Although both types of research have their pros and cons, and in reality are often used in combination, this guide will focus exclusively on quantitative methods.

quantify—to express in numbers

Generally speaking, quantitative research involves quantifying relationships between variables by using statistics. The next section will proceed step by step through this process in more detail.

Research design

theory—a broad explanation of how the world works
hypothesis—a testable explanation of the way the world works in a specific instance

The basic process for quantitative research is as follows:

  1. Begin with a scientific theory.
  2. Generate a hypothesis.
  3. Collect/gather data.
  4. Conduct statistical analysis.
  5. Make conclusions.

Theory

Theories are tested repeatedly over time until enough supporting evidence accumulates that the theory becomes widely accepted. For instance, the theory of gravity has been verified so many times, and is so readily apparent in our daily lives, that it is considered a natural law.

Hypothesis

hypotheses—the plural form of the word hypothesis
falsifiability—the capacity to be proven false

Hypotheses are generated from theory in order to test whether or not that theory holds. You will sometimes hear hypotheses referred to as “educated guesses.” This means that a hypothesis is a statement of what you think will be true in a specific situation based on what you understand about the relevant theory. The most common form of hypothesis testing requires that the theory be falsifiable. A hypothesis should also be specific and testable. Returning to the theory of gravity, our hypothesis might be “If I release an orange from my hand it will fall to the ground.” This hypothesis is specific (an orange released from my hand is a single instance to test the theory), testable (when I release the orange it will do something that either supports my hypothesis or refutes it), and falsifiable (if the orange floated upwards, the hypothesis would be proven wrong).

null hypothesis—the observations are the result of a random process.
alternative hypothesis—the observations are the result of a non-random process.

A hypothesis consists of two parts: the null hypothesis and the alternative hypothesis. The alternative hypothesis is your research hypothesis—the relationship that you are trying to investigate. The null and the alternative hypotheses must be mutually exclusive, meaning that if one is true the other is false and vice versa. They are denoted as: \[H_0:\textrm{There is no relationship}\] \[H_1:\textrm{There is a relationship}\] You test whether or not to reject the null hypothesis by using statistics. If there is sufficient evidence supporting the alternative hypothesis you can reject the null.

Data & sampling

primary data—data that you collect yourself, like collecting water samples from a stream
secondary data—data that someone else has collected, like census data

In order to have confidence in your analysis, the data that you collect to test your hypothesis must meet certain criteria. Confidence in your conclusions requires strong evidence, which in turn requires appropriate, well-collected data. Sometimes you will use primary data and sometimes you will use secondary data depending on what is appropriate.

GDP—the total value of the goods and services produced within a country during a set amount of time
metric—a standard of measurement
validity—the degree to which something measures what it claims to represent
reliability—consistency in measurement

The data that you use in your analysis should represent the variables involved in your hypothesis. For the most part, the data you use will be approximations of the variables in your hypothesis because often these variables are not directly measurable. For instance, the level of economic development of a country is often measured by its gross domestic product (GDP). The true level of economic development is a theoretical concept and cannot be directly measured but GDP does well as a representative metric. This characteristic of representativeness is known as validity. Do your variables reflect the actual meaning of the concept you are trying to measure?

Your data also need to be reliable. If the data is collected in the same manner twice, will the observations be the same? This does not mean that stream height measurements from two consecutive days should be equal, rather that the technique provides a reliable means of measuring stream height. If your methods are unreliable it is meaningless to make comparisons between your data.

population—all members of the group of interest. This could be all people living in the United States, all the pine trees in Hiawatha National Forest, all the Sun-like stars in the galaxy, etc.
sample—a subset of members drawn from the population. Since it would be impossible to study each member of the above groups, data is collected from a sample of the population. The characteristics of the sample need to represent the characteristics of interest in the population.
random sample—a sample where all members of the population or population subset have an equal likelihood of being selected

Often we cannot collect data from the entire population that we wish to study. Instead, we must rely on a sample. To ensure that conclusions drawn from the sample can accurately tell us something about the entire population, the sample must be representative. This requires a random sample. If the sample size is large enough, random sampling preserves the patterns of interest from the population in the sample (Fig. 1).

Figure 1: Although the sample contains only 10 percent of the population, the general relationship between the independent and dependent variables is maintained.

There are four primary strategies for sampling populations:

  1. Simple random sampling

Each member of the population has an equal chance of being selected.

  2. Systematic random sampling

Members of the population are chosen at defined intervals, for example, every fourth member is selected.

  3. Stratified random sampling

The population is first split into groups of interest and then members are chosen randomly from each group. This ensures that all groups of interest are represented in the sample and that they are represented proportionally. For example, if the population of fish you are studying is 3/4 salmon and 1/4 trout, your sample should contain three times as many salmon as trout.

  4. Cluster sampling

The population is first split into groups of interest and then groups are chosen randomly. For example, students are surveyed in a random selection of classrooms.
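The handbook doesn't include code for these strategies, but here is a minimal R sketch of all four, using made-up fish and student data frames (the data and proportions are purely illustrative):

```r
set.seed(42)  # for reproducibility

# Hypothetical population: 3/4 salmon, 1/4 trout
fish <- data.frame(
  id      = 1:200,
  species = rep(c("salmon", "trout"), times = c(150, 50))
)

# 1. Simple random sampling: every fish has an equal chance of selection
simple <- fish[sample(nrow(fish), 20), ]

# 2. Systematic random sampling: every 10th fish after a random start
start      <- sample(10, 1)
systematic <- fish[seq(start, nrow(fish), by = 10), ]

# 3. Stratified random sampling: 10% drawn separately from each species
stratified <- do.call(rbind, lapply(split(fish, fish$species), function(g) {
  g[sample(nrow(g), ceiling(0.10 * nrow(g))), ]
}))

# 4. Cluster sampling: pick whole classrooms at random, survey everyone in them
students <- data.frame(
  id        = 1:300,
  classroom = rep(1:10, each = 30)
)
picked_rooms <- sample(unique(students$classroom), 3)
cluster <- students[students$classroom %in% picked_rooms, ]
```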

Analysis

Using statistical analysis allows you to ask certain kinds of questions of your data. Generally speaking, you can answer questions like these: Is there a difference between groups? Is there a relationship between two variables? How much of the change in one variable can be explained by another?

The primary goal of the rest of this handbook will be to introduce you to some basic statistical methods that can help you answer these questions.

Conclusions

Based on your analysis you will either reject or fail to reject your null hypothesis. Most often you will adjust your hypothesis based on your findings and iterate through the process again.

Statistics

Statistics involve describing and analyzing data, especially for the purpose of making inferences about characteristics of a population. Statistics are a means of telling a story with numerical data. They provide a common language that can facilitate communication between scientists and stakeholders, including community leaders and government officials. Using statistical evidence may be necessary to defend the conclusions of your research, especially when trying to influence public policy.

Statistics can answer such questions as: Is the pattern I see in my sample likely to hold in the wider population? How strong is the relationship between two variables? Is the difference between two groups real or due to chance?

Data types

The type of data that you have will dictate which types of statistical methods are appropriate for your analysis. These data types are also known as measurement scales.

  1. Categorical

Categorical data represent characteristics, thereby placing observations into categories. Although sometimes represented by numbers in a coding environment, categorical data are not mathematically meaningful—you cannot perform mathematical operations on them like addition or subtraction. They are better understood as word descriptions rather than as numbers.

  • Nominal: Unordered, discrete variables. Female, male, gender non-binary. There is no inherent order to the categories and the categories are mutually exclusive.
  • Ordinal: Ordered, discrete variables. First, second, third. Although the categories are ordered, the differences between the categories are unknown. For instance, although first comes before second, and second before third, the distance between first and second may be greater or less than the distance between second and third.
  2. Numerical

Mathematical operations can be performed on these data types in a meaningful way. Numerical data are required for most statistical techniques.

Interval: Data measured along a scale with an equal distance between adjacent values and no true zero. Degrees Celsius. The difference between each degree is the same and zero degrees does not mean the absence of temperature. These values can be added and subtracted but not meaningfully multiplied or divided.

Ratio: Data measured along a scale with an equal distance between adjacent values and a true zero. Height in centimeters. Zero centimeters means something has no vertical dimension. Two centimeters means something is twice as tall as something that is one centimeter. All mathematical operations can be meaningfully applied to this data type.
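In R, the software recommended at the end of this handbook, these measurement scales map onto different data types. A brief, illustrative sketch (the example values are made up):

```r
# Nominal: unordered categories
species <- factor(c("salmon", "trout", "salmon"))

# Ordinal: ordered categories with unknown spacing between them
place <- factor(c("first", "second", "third"),
                levels = c("first", "second", "third"),
                ordered = TRUE)

# Interval: degrees Celsius (differences are meaningful, zero is arbitrary)
temp_c <- c(-2.5, 0, 14.8)

# Ratio: height in centimeters (true zero, ratios are meaningful)
height_cm <- c(152, 170, 181)

mean(height_cm)   # fine: ratio data supports arithmetic
# mean(species)   # returns NA with a warning: no arithmetic on nominal data
```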

Descriptive statistics

Descriptive statistics are ways to summarize your data that can reveal meaningful patterns, and they are the first step in a data analysis. Descriptive statistics are often best represented by data visualizations, most notably boxplots, histograms, and density plots (Fig. 2).

Descriptive statistics only apply to your sample; they cannot be used to make inferences about the broader population of interest. The following measures are all summary values that describe how your data is distributed. The data distribution tells us how often each value occurs or how many values occur in intervals of the data. Unless otherwise indicated, the terms data or dataset refer to observations of a single variable, like the height in centimeters of 30 students.

Figure 2: From top to bottom: boxplot, histogram, density plot. These are all visualizations of the distribution of a single dataset.

Measures of position

Measures of position provide information about the position of a data point relative to the other values in the dataset—where that data point lies in the data distribution. Measures of position are based on rank-ordering a dataset. This means that all the values have been put in order from lowest to highest.

range—the difference between the minimum value and the maximum value of a variable.

If you divide the rank-ordered data into 100 groups containing equal numbers of values, the cut points that define these groups are called percentiles. Have you ever heard someone say that their score on a standardized test was in the 85th percentile? Or that they are in the 85th percentile for height? Being in the 85th percentile means that their score or their height was greater than 85% of the other people measured in that sample. Quantiles are basically the same as percentiles but they are expressed as proportions instead of percents—the 0.85 quantile means the same thing as the 85th percentile. An important measure of position is the median. The median of a variable is the 50th percentile, meaning that 50% of the data values are lower than the median value and 50% are higher.

Figure 3: The vertical lines of the box correspond to the three quartiles that divide the data values into four equal intervals.

Quartiles are a special case of quantiles. They divide the rank-ordered dataset into four equal parts. The difference between the third quartile (Q3) and the first (Q1) is known as the interquartile range (IQR), which contains the middle 50% of the values. Sometimes in datasets there are values that are so far from the rest of the data points that they are considered outliers. Sometimes outliers can be caused by measurement or input errors, but a lot of the time they are just part of the data distribution. One example of this is household income—in 2019 in the United States, the median household income was around $69,000 yet the top 1% of households earned over half a million dollars per household. One common definition of an outlier is a point that lies more than 1.5 times the IQR above Q3 or more than 1.5 times the IQR below Q1. Those values are represented as the points beyond the whiskers (horizontal lines) of the boxplot (Fig. 3).
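As a sketch of these measures in R, using made-up income-like values (in thousands; the numbers are illustrative only):

```r
x <- c(28, 31, 35, 40, 42, 47, 55, 63, 69, 74, 81, 95, 310)

median(x)                      # 50th percentile
quantile(x, probs = 0.85)      # 0.85 quantile / 85th percentile

q   <- quantile(x, probs = c(0.25, 0.75))
iqr <- q[2] - q[1]             # interquartile range (same as IQR(x))

# One common outlier rule: beyond 1.5 * IQR outside the quartiles
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
x[x < lower | x > upper]       # the 310 stands out as an outlier
```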

Measures of central tendency

Measures of central tendency summarize the center—or middle—of your data. What values do the data tend to have? What are the most typical or average values?

There are three common measures of central tendency. The mean is the arithmetic average of the data values. You simply add up all of the values for a variable and divide by the number of observations. The median is the middle value of the data: 50 percent of the values are above it and 50 percent are below. The mode is the value that occurs most frequently in the data. This measure is most appropriate when the data represent counts of something. When the data is continuous (having decimal places), it is unlikely that two values will be exactly the same, so often no value occurs more than once and the mode is not very informative.
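A quick R sketch of the three measures, using a small made-up count dataset:

```r
counts <- c(2, 3, 3, 3, 4, 5, 7)   # e.g., eggs counted in seven nests

mean(counts)      # arithmetic average
median(counts)    # middle value

# Base R has no built-in function for this kind of mode, so tabulate the
# values and pick the most frequent one
tab <- table(counts)
as.numeric(names(tab)[which.max(tab)])   # mode: 3
```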

One way to visualize the distribution of the data while also getting an idea of the most common values is by using a histogram (Fig. 4). Histograms take all the values in a dataset and put them into equally sized intervals called bins. The tallest bin contains the most values. This roughly corresponds to the mode, although in this case the mode is the interval in which most values occur rather than a single value. Histograms are a good generalization of the data—they show you the overall distribution without giving you too much detail.

Figure 4: When a distribution is normal, the mean, median, and mode are all approximately the same number and occur in the center of the data.

When the data is normally distributed, the mean, median, and mode are approximately the same value and the data are relatively symmetrical on both sides of the distribution. The normal distribution, otherwise known as a Gaussian distribution or bell curve, is an important probability distribution. Many natural phenomena—such as human height, tree trunk diameters, and IQ scores—follow the normal distribution. The probability of occurrence of a value is greater for the values close to the middle and decreases as you move towards the tails (away from the center) of the distribution.

Measures of dispersion

Measures of dispersion summarize how clustered or dispersed the values are around the center of the data. Just how variable are your variables? Are most of the values close to the mean or are they spread out?

Another common way to visualize distributions is the density plot (Fig. 5). Density plots are a smoothed version of a histogram—instead of representing the data as discrete counts within intervals, the data is shown as a continuous probability distribution. In a density plot the area under the curve sums to one, meaning that it represents the entire probability set of all values. Like in a histogram, the probability of occurrence is greater for the values under the tallest point on the curve.

Figure 5: In a normal distribution, roughly 68 percent of the values are within one standard deviation of the mean.

The standard deviation is essentially the average distance of the data points from the mean. The larger the standard deviation, the more spread out the data. For normally distributed data, approximately 68% of the observations will fall within one standard deviation of the mean (Fig. 5).

The variance is also a measure of how spread out the data is. Square the difference between each observation and the mean (the squaring gets rid of negative numbers), then take the average of those squared differences. The standard deviation is the square root of the variance. The variance is a bit harder to interpret than the standard deviation but you have to calculate it first!
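A minimal R sketch relating the variance and the standard deviation; note that R's var() and sd() use the sample formulas, dividing by n - 1 rather than n:

```r
x <- c(4, 8, 6, 5, 3, 7, 9, 5)

v <- var(x)             # sample variance (divides by n - 1)
s <- sd(x)              # standard deviation
all.equal(s, sqrt(v))   # TRUE: sd is the square root of the variance

# The same sample variance computed by hand
sum((x - mean(x))^2) / (length(x) - 1)
```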

Figure 6: In a negatively skewed distribution the mean is to the left of the median. In a positively skewed distribution, the mean is to the right of the median. In a normal distribution they are approximately the same.

Skewness is a measure of how unbalanced or asymmetrical the values are around the mean. If the data are positively skewed, the right-hand tail of the data distribution is stretched to the right, while negatively skewed data have a left-hand tail that stretches out to the left (Fig. 6). Sometimes these tails are caused by outliers—data values that are much higher or lower than the rest. If your data is skewed, the median can be a more representative measure of central tendency than the mean. The reason that the median income is often reported instead of the mean income is that the top one percent of households pull the mean of household income to the right, positively skewing the data. The mean income is not a representative measure of central tendency because it is much higher than what most households earn.

Kurtosis is a measure of how heavy the tails of the distribution are. A heavy-tailed distribution (leptokurtic) has a greater number of outliers than a light-tailed distribution (platykurtic) (Fig. 7). These terms seldom come up but if they do, at least you’ll have seen them before!

Figure 7: Heavy tails (leptokurtic) mean that there are more extreme values.

Measures of correlation

univariate—involving one variable
bivariate—involving two variables
multivariate—involving more than one variable

So far, the descriptive statistics that we have been discussing have been univariate. However, much of the time we are interested in the co-distribution of two variables. If you know the value of \(x\), what does it tell you about the value of \(y\)? The correlation coefficient provides information about the strength and direction of a bivariate relationship and is denoted with the lowercase letter \(r\) whose value ranges from -1 to 1 (Fig. 8). The stronger the correlation, the closer \(r\) is to ± 1. If the two variables are perfectly correlated (\(r=\) ± 1), they lie on a perfectly straight line. Do two variables vary in the same direction (positive \(r\)) or opposite directions (negative \(r\))? Do they vary together at all? How strongly is the change in one variable related to change in the other variable?

Figure 8: Correlation between two variables indicates the strength and direction of the relationship.

Figure 9: The correlation between the two variables is \(r=0.008\) although it is clear that there is a strong non-linear relationship.

The most common correlation coefficient is Pearson’s r (also called Pearson’s correlation coefficient). It is calculated by dividing the covariance of two variables by the product of their standard deviations. There are two very important things to know about correlation:

covariance—an extension of variance that measures how two variables change together.
linear—able to be represented as a straight line on a graph.
confounding variable—a factor other than the one that you are studying that is associated with both the dependent variable and the independent variable.

  1. Correlation coefficients only capture linear relationships. The correlation coefficient might indicate that the relationship between two variables is weak when the relationship is actually strong but nonlinear (Fig. 9).

  2. Correlation is not causation! Just because there is a relationship between two variables does not mean that one causes the other. A frequently given example is that both ice cream sales and murder rates go up in the summer. Does that mean that eating ice cream causes people to go into a murderous rage? Of course not! Part of what is actually occurring is that both ice cream sales and murder increase when it is hot outside. Although the two variables are highly correlated, there is no causal relationship between them. In this case, increased temperatures are what is known as a confounding variable.
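To make the definition above concrete, here is a brief R sketch (with simulated data) showing that Pearson's r equals the covariance divided by the product of the standard deviations:

```r
set.seed(1)
x <- rnorm(100)
y <- 0.7 * x + rnorm(100, sd = 0.5)   # y varies with x, plus noise

cor(x, y)                      # Pearson's r (the default method)
cov(x, y) / (sd(x) * sd(y))    # identical value, computed from the definition

cor(x, y, method = "spearman") # rank-based alternative for ordinal data
```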

Inferential statistics

statistic—a numerical summary of a sample.
parameter—a numerical summary of a population.
independent variable—a variable whose observed values do not vary based on another measured variable. Research often focuses on whether or not an independent variable has a direct effect on a dependent variable.
dependent variable—a variable whose observed values vary based on the independent variable. This is the outcome of interest in the research.

While descriptive statistics simply describe the data, inferential statistics are used to extend conclusions about a sample to the target population. In other words, they let us make inferences from our data. We use statistics to estimate parameters which provide us with information about the relationships between the variables of interest. Most often inferential statistics deal with independent and dependent variables. In a bivariate experiment there is one independent variable and one dependent variable. In a multivariate experiment there can be multiple independent variables. Inferential statistics are based on probability and answer the question: “What is the probability that the pattern I saw in my sample is a trustworthy representation of the pattern in my population of interest?”

Probability

Since statistics rely on probability to draw conclusions from data, it's worth a quick overview of the fundamentals. The study of probability is a branch of mathematics unto itself, but the basics are these: a probability is a number between 0 and 1, where 0 means an outcome is impossible and 1 means it is certain; the probabilities of all possible outcomes of an event sum to 1; and the probability that two independent events both occur is the product of their individual probabilities.

Hypothesis testing

p-value—the probability of getting a test statistic as extreme or more extreme than the observed results.

test statistic—the calculated output of a statistical test that is used to reject or accept the null hypothesis.
\(\alpha\)-level—the probability of rejecting the null hypothesis when it is actually true.

Statistical tests are used to evaluate hypotheses. The output of a statistical test is usually a single number, known as the test statistic. Different tests will give you different test statistics, but generally speaking the test statistic is compared to a theoretical probability distribution to see how likely it is to get that particular value. The probability of getting this value if the null hypothesis is true is called a p-value. If the p-value is low it means that it is very unlikely to get the value of the test statistic if the null hypothesis is true. This is an indication to reject the null hypothesis. Conversely, a high p-value indicates that the probability of obtaining the test statistic if the null hypothesis is true is high. In this case we would fail to reject the null.

So how low is low? The p-value is compared to what is known as the alpha level or the significance level, denoted as \(\alpha\). This value is the threshold at which you can be reasonably certain that there is a real relationship in your data. Most researchers choose \(\alpha=0.05\), meaning that they are willing to accept a 5% chance of rejecting the null hypothesis when it is actually true; a p-value below this threshold is taken as evidence of a real relationship.

It is worth noting that \(0.05\) is just a convention, not some kind of natural law, and is just an arbitrary cut off point. A p-value of 0.06 does not necessarily indicate that there is no relationship, nor does a p-value of 0.04 indicate that there definitely is. The p-value is simply a measure of the strength of your evidence and should not be the only factor your conclusions are based on. Remember the ice cream?

Types of tests

This section serves as a broad introduction to the types of statistical tests available—what kinds of questions they answer and when they are appropriate to use. Details on calculations are not provided—modern statistical software will do the calculations for you (which is certainly more accurate and more efficient than doing it by hand!) and there are many good resources available should you want to learn more about any given test. The primary objective as stated in the introduction is to give you a vocabulary that will enable you to confidently pursue the course of learning best suited to your needs.

As we saw earlier, data comes in different types and not all statistical methods can be used on all types of data. In fact, the type of data that you have is one of the most important considerations when deciding whether or not a statistical test is appropriate. Another consideration is how the data is distributed. There is much more flexibility in statistical testing for numerical data and more information can be inferred.

Chi-square

Chi-square—the Chi-square test returns the Chi-square test statistic, denoted by \(\chi^2\).
frequency—the number of times a value occurs.
non-parametric—making no assumptions about the distribution of the data.

The Chi-square test is the most widely used test for nominal data and is based on the frequencies of the observed outcomes. It tests whether or not outcomes are independent by comparing the observed frequencies with the frequencies that would be expected if there were no relationship between the variables. The Chi-square test detects an association between the variables but does not measure the strength of that relationship. You can also do a one-sample Chi-square test to see if there is a difference in frequencies between categories. Chi-square is a non-parametric statistic, meaning that there are no requirements as to the distribution of the data, although it does assume that each observation is independent of the others. The null hypothesis in a Chi-square test is that the variables are independent and there is no relationship between them.

For example, is there a relationship between iris species and whether sepal length is big or small? In other words, are iris species and sepal length independent? The Chi-square test compares the frequencies observed in the data (Fig. 10) to the frequencies that would be expected if the null hypothesis were true (Fig. 11). You can see that the expected frequencies are quite different from those observed.

Figure 10: Observed frequencies

Figure 11: Expected frequencies

Running the Chi-square test on the dataset, you get a test statistic of \(\chi^2 = 86.03\), which corresponds to a p-value of less than 0.001. This p-value is very low, indicating that it would be highly unlikely to get a value as or more extreme than the calculated test statistic if the null hypothesis were true. You would reject the null hypothesis in favor of the alternative: sepal length depends on species.
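The handbook doesn't show the code behind this example, but a sketch of the same idea in R might look like the following. It assumes that "big" versus "small" means a split of sepal lengths at the overall median; that split is my assumption, so the resulting test statistic won't necessarily match the value quoted above:

```r
data(iris)  # built-in dataset: 150 flowers from three species

# Assumed recoding: sepal length above the overall median is "big"
sepal_size <- ifelse(iris$Sepal.Length > median(iris$Sepal.Length),
                     "big", "small")

observed <- table(iris$Species, sepal_size)   # observed frequencies
observed

test <- chisq.test(observed)
test            # chi-square statistic, degrees of freedom, p-value
test$expected   # frequencies expected if species and size were independent
```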

Other similar tests for nominal variables include Cramér’s V and Fisher’s exact test. Use either the Spearman rank coefficient or Kendall’s Tau for ordinal variables.

Student’s t-test

t-test—the t-test returns the test statistic \(t\).

binary—taking only two values like yes or no, 0 or 1, cat or dog.
parametric—making assumptions about the distributions of the data.
i.i.d.—independent and identically distributed. This means that all observations are independent of each other and that they all come from the same probability distribution.
sample size—the number of observations in a dataset. A general rule of thumb is that you need at least 30 observations in a sample to obtain reliable statistics.

Most often referred to simply as a t-test, this method tests whether there is a significant difference in the measured numeric values between two groups (Fig. 12). Group membership is a nominal, binary variable. The two samples are compared to determine if they were drawn from different populations. The t-test works by evaluating how likely the observed difference in group means would be if the true difference were zero. The null hypothesis states that there is no difference between the means of the groups and that they are drawn from the same population. Unlike the Chi-square test, a t-test is parametric, meaning that assumptions are made about the distribution of the data. T-tests are only appropriate if the dependent variable is normally distributed and the observations are independent (unless it is a paired t-test in which case the pair differences are assumed to be independent). This is a common statistical assumption and often abbreviated i.i.d. It should be noted, however, that t-tests are considered robust, meaning that even if the dependent variable is not distributed exactly normally, the test outcome is not greatly influenced. This is especially true as the sample size increases.

Figure 12: The t-test for treatment 1 has a p-value of 0.0087 indicating that the true difference in means is not significantly different from zero. The t-test for treatment 2 has a p-value of 0 indicating that the true difference in means is not zero.

There are three basic types of t-tests:

  1. One sample t-test

This test is used to compare sample data to the mean of a known population. For instance, you selected a sample of orange cats in a shelter. The shelter keeps information on the weight of all the cats who live there. Do the orange cats have a different mean weight than the full population of cats living in the shelter?

  2. Independent samples t-test

This test is used to compare two independent samples. Do the salmon that live in the stream that runs by the power plant weigh less than the salmon who live in a stream that does not?

  3. Dependent samples (paired samples) t-test

This test is used to compare two groups where the participants are the same in each group. Usually this involves collecting data on a group through time. Is the pollution load in a stream significantly different after a flood?
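A minimal R sketch of the three t-test variants above, using simulated data (all numbers are made up purely for illustration):

```r
set.seed(7)

# 1. One sample: do orange cats differ from a known shelter-wide mean of 4 kg?
orange <- rnorm(25, mean = 4.3, sd = 0.6)
t.test(orange, mu = 4)

# 2. Independent samples: salmon weights from two different streams
stream_a <- rnorm(30, mean = 2.8, sd = 0.4)
stream_b <- rnorm(30, mean = 3.1, sd = 0.4)
t.test(stream_a, stream_b)

# 3. Paired samples: pollution load at the same sites before and after a flood
before <- rnorm(20, mean = 12, sd = 2)
after  <- before + rnorm(20, mean = 1.5, sd = 1)
t.test(after, before, paired = TRUE)
```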

Analysis of variance

Analysis of variance—ANOVA returns the test statistic \(F\).
post hoc—a commonly used Latin phrase meaning “after the event”. In this case this means follow-up tests that are done after the initial test.

Analysis of variance (ANOVA) is also concerned with the difference in means between groups but can compare three or more samples. Although calculated differently, an ANOVA for two samples would give the same result as a t-test. ANOVA works by comparing the within-group variation—how much individual values vary from their group mean—to the between-group variation—how much the group means differ from the overall mean (sometimes called the grand mean). The data is again assumed to be i.i.d., with the observations within each category being normally distributed. The test also assumes that the variances within the categories are equal. Like the t-test, however, the test is considered robust and violations of these assumptions (with the exception of independence) are not too serious. There are many types of ANOVA tests, but we will only address two basic types here.

  1. One-way ANOVA tests for a difference in means based on one category, for instance, acorn weight differences between five species of oak trees. There is only one independent variable.

  2. Two-way ANOVA deals with two categories, for instance, acorn weight differences between five species of oak trees in drought years versus wet years. This means that there are two independent variables. The two-way ANOVA not only tests for differences in mean weight between the five species, it also tests for differences in mean weight between the two moisture regimes. On top of that, it tests to see if the species of oak and the moisture regimes interact in a manner that influences weight.

Although the \(F\) statistic reports on whether or not there is a difference in means, it only indicates that there is at least one significant difference between groups. It does not report between which groups. There are a number of post hoc tests that complement ANOVA, including the Tukey test (also known as Tukey’s Honest Significant Difference test) that can compare all possible pairs of groups to find where the differences lie. The null hypothesis is that all of the groups have the same mean, indicating that there is no difference between groups.
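A sketch of a one-way ANOVA with a follow-up Tukey test in R, using made-up acorn weights for three (rather than five) oak species to keep it short:

```r
set.seed(3)
acorns <- data.frame(
  species = factor(rep(c("white", "red", "bur"), each = 20)),
  weight  = c(rnorm(20, 5.0, 0.8), rnorm(20, 5.4, 0.8), rnorm(20, 6.5, 0.8))
)

fit <- aov(weight ~ species, data = acorns)
summary(fit)     # F statistic and p-value: is at least one group mean different?
TukeyHSD(fit)    # post hoc: which pairs of species actually differ?
```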

Regression

linear regression—linear regression also returns the test statistic \(t\), which tests whether the slope of the regression line is significantly different from zero.

Linear regression is one of the most used statistical tests, if not THE most used. It is simple yet powerful, and broadly applicable. Linear regression creates a line of best fit between an independent variable and a dependent variable (Fig. 13). The equation of this line is chosen to minimize the sum of the squared distances between each observation and the line. Remember that squaring the differences gets rid of negative numbers. This is known as the method of least squares. Let’s look at an example. This data shows the global mean \(CO_2\) level and global mean temperature going back to 1958.

Figure 13: The regression line is a line of best fit that minimizes the residual error. Residuals are indicated by the dashed lines.

residual—the distance, or error, between the observed value and the predicted value.

y-intercept—the point where a line crosses the y-axis and the value of \(x\) is 0.

slope—a number that describes the direction and steepness of a line. It is the ratio of the vertical change (rise) to the horizontal change (run) between any two points. A linear equation has a constant slope.

The regression line shows the predicted \(y\) value for any given \(x\) value. The dashed lines are known as residuals. The regression equation minimizes these residuals so that their squared sum is the least that it can possibly be and is given by: \[y=\beta_0+\beta_1x+\epsilon \quad \textrm{or} \quad y=a+bx+e \] In this equation, \(x\) is the independent variable, \(y\) is the dependent variable, \(\beta_0\) is the y-intercept, \(\beta_1\) is the slope, and \(\epsilon\) or \(e\) is a term representing random error.

The significance of the linear regression equation is quantified using a t-test. Is the slope significantly different from zero? If it is not, you cannot say that there is a significant linear relationship between \(x\) and \(y\) and knowing the value of \(x\) does not provide you with any information about the value of \(y\) (Fig. 14). If the slope is significant, it represents the average change in \(y\) per unit change of \(x\).

Figure 14: The p-value of the slope is 0.48. As you can see there is no significant relationship between the variables.

coefficient of determination—\(R^2\), the proportion of the variance in the dependent variable accounted for by the independent variable.
homoscedasticity—equal variance of the residuals.
autocorrelation—also known as serial correlation, a relationship between observations due to a temporal (time) or spatial lag between them.
multicollinearity—high correlation between independent variables.

If the slope of the line is significant, it is appropriate to consider the correlation coefficient. If you square this value, you get the \(R^2\) (pronounced r squared) value, otherwise known as the coefficient of determination. The \(R^2\) represents the proportion of the variation in the dependent variable that can be accounted for by the variation of the independent variable. In the case of \(CO_2\) and temperature, the slope of the linear fit is significant at \(p<0.001\) and \(R^2=0.89\). This means that almost 90% of the variation in temperature can be accounted for by the variation in \(CO_2\).
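The \(CO_2\) and temperature data used in this example aren't included here, so as a stand-in this sketch fits a simple linear regression to R's built-in cars dataset (stopping distance as a function of speed):

```r
data(cars)   # built-in: speed (mph) and stopping distance (ft) for 50 cars

fit <- lm(dist ~ speed, data = cars)
summary(fit)     # slope, its t statistic and p-value, and R-squared

coef(fit)        # intercept and slope of the fitted line
residuals(fit)   # observed minus predicted values
```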

In addition to assuming that the data is i.i.d., linear regression also assumes that the relationship between the independent and dependent variables is linear and that there is no pattern to the residuals. The quality of the residuals having no pattern is called homoscedasticity (Fig. 15). The opposite case is known as heteroscedasticity and this can indicate either the existence of outliers in the data or the omission of important variables (Fig. 16). The data is also assumed to have no autocorrelation. Autocorrelation means that the data is correlated with itself, like in a time series. Although the temperature tomorrow will be affected by many things, one of the most influential factors will be the temperature today.

Figure 15: There is no apparent pattern to the residuals, meaning they are homoscedastic.

Figure 16: The residuals are heteroscedastic because the variance increases as x increases.

Linear regression can be simple—having only one independent variable—or multiple—having several independent variables. In a multiple linear regression, the dependent variable is a linear combination of the independent variables. For instance, salmon weight could be modeled as a combination of body length, body width, and species. In a multiple linear regression, there is the additional assumption that there is not an excess of multicollinearity. This just means that none of the independent variables are highly correlated with each other.

When there are multiple independent variables the equation takes the form \[y=\beta_0+\beta_1x_1+\beta_2x_2+\dots+\beta_nx_n+\epsilon \] where \(x_1, \dots, x_n\) are the independent variables (\(n\) just indicates that there are however many variables you specify) and \(\beta_1, \dots, \beta_n\) are the corresponding slope coefficients. The way that you interpret these coefficients is that they represent the average change in y based on a one unit increase in x, holding all other variables constant. So in our salmon weight example, the slope coefficient for body length would be the average increase in salmon weight per one unit increase in body length when the body width and species are held constant. The coefficient for body width would be the increase in weight per increase in body width holding length and species constant, and so forth.
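The salmon data are hypothetical, so this sketch uses R's built-in trees dataset to show the multiple regression syntax, modeling timber volume from trunk girth and height:

```r
data(trees)   # built-in: Girth, Height, and Volume for 31 black cherry trees

fit <- lm(Volume ~ Girth + Height, data = trees)
summary(fit)
# Each coefficient is the average change in Volume per one-unit increase in
# that predictor, holding the other predictor constant.
```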

There are many kinds of regression techniques, including those for data that is not normally distributed. These are known as generalized linear models and include logistic regression (also known as logit regression) for a binary dependent variable and Poisson regression for an integer (count) dependent variable. Polynomial regression, in turn, can be used to model a non-linear relationship.

Advanced methods

There are many, many, many statistical tests and methods available to you and it is well beyond the scope of this handbook to detail them. What follows is a brief and very general introduction to some of the more advanced statistical methods.

Time series analysis

seasonality—variation that occurs over regular periods.

Time series analysis is concerned with predicting future values, whether it be stock prices, temperatures, or the height of ocean tides. Methods appropriate for the analysis of time series data must account for autocorrelation and the seasonality of observations. Techniques include: moving averages and smoothing, decomposition into high and low frequency fluctuations, and differencing—analyzing the observational differences between times instead of the observations at the times themselves.
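As a brief illustration in R, using the built-in co2 time series of monthly atmospheric CO2 concentrations:

```r
data(co2)                  # built-in monthly CO2 series, so seasonality is present

parts <- decompose(co2)    # split into trend, seasonal, and random components
plot(parts)

d <- diff(co2)             # differencing: month-to-month changes instead of levels

# 12-month trailing moving average (smoothing)
ma <- stats::filter(co2, rep(1 / 12, 12), sides = 1)
```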

Clustering

Clustering techniques divide observations into groups that share common characteristics. There are several types of clustering techniques but they all work on the same basic premise of minimizing the within group differences while maximizing the between group differences. With k-means clustering you specify the number of groups ahead of time and with hierarchical clustering the ideal number of clusters is determined from the data. Clustering techniques are descriptive rather than inferential and can reveal patterns in the data. They can be either spatial or non-spatial and are often implemented through machine learning algorithms.
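A minimal R sketch of k-means and hierarchical clustering on the four numeric columns of the built-in iris data; choosing three groups is an assumption informed by knowing there are three species:

```r
measurements <- scale(iris[, 1:4])   # standardize the four numeric columns

# k-means: the number of groups (3) is specified ahead of time
km <- kmeans(measurements, centers = 3, nstart = 25)
table(km$cluster, iris$Species)      # how the clusters line up with species

# Hierarchical clustering: build a tree, then cut it into groups
hc <- hclust(dist(measurements))
groups <- cutree(hc, k = 3)
```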

Principal components analysis (PCA)

Principal components analysis is a dimensionality-reducing technique that works by transforming a large set of variables into a smaller one while preserving the maximum amount of information. The principal components are constructed to account for as much of the variance of the original dataset as possible while being uncorrelated with each other. Each principal component is a weighted combination of the original variables.
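A short R sketch of PCA, again using the four numeric iris measurements:

```r
pca <- prcomp(iris[, 1:4], scale. = TRUE)   # standardize, then rotate

summary(pca)    # proportion of total variance captured by each component
pca$rotation    # how much each original variable contributes to each component

# The first two components capture most of the variance here, so the four
# measurements can be summarized by two new uncorrelated axes.
scores <- pca$x[, 1:2]
```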

This website has a compelling visual explanation of hierarchical models.

Hierarchical models

Hierarchical models (also known as mixed effects models) are nested models where the observations fall into hierarchical groups. The technique is similar to regression but the coefficients are allowed to vary between groups. A classic example of hierarchical modeling is educational outcomes. For instance, a student’s score on a test will depend on their individual characteristics (time spent studying), the characteristics of their classroom (how many students there are and how many years of experience the teacher has), and the characteristics of their school (access to educational resources and funding). In this example students are nested in classrooms which are nested in schools.

Spatial statistics

The term spatial statistics refers to the application of statistical techniques to data where the geographical location is considered an important part of the analysis. Spatial data have characteristics, including spatial autocorrelation and scale, that require special treatment. Tobler’s first law of geography states that “everything is related to everything else, but near things are more related than distant things,” so observations cannot be considered independent in space. This quality of spatial data violates the assumptions of many statistical tests. Additionally, the area of analysis chosen has a significant effect on the outcome of the analysis—this is known as the modifiable areal unit problem or MAUP. Methods range from considering observations within simple grids to deriving characteristic scales of interactions through wavelet or Fourier analysis. Ultimately, spatial statistics are concerned with characterizing patterns in space and include geographically-weighted descriptive statistics, clustering techniques like hot spot analysis, and interpolation techniques like Kriging.

Implementation

As referenced earlier, data analysis is done with modern statistical software. There are many packages to choose from (Stata, SPSS, MATLAB, R) and they all have their merits and user communities. R (and the RStudio IDE) is a high quality option because it is free and open source, you can easily use it to produce publication-worthy graphics, and there is a dedicated user community. You can also create just about any kind of document in R including interactive web applications, blogs, and slideshows. In fact, this entire handbook was created using R!

R and RStudio

R for Data Science

Python for Data Analysis

Resources

Khan Academy

Online Statistics Education: An Interactive Multimedia Course of Study

Open Stax Introductory Statistics

Interactive Statistical Test Flowchart