As always, we will start by creating a new R script called LastName_Lab09. Make sure that when you open RStudio, you open it by double-clicking on your GEOG316 R project (the file extension is .Rproj). By opening the project and then opening (or creating) a script from inside the project, you ensure that your working directory is correctly specified. Don’t forget to load the tidyverse!
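
If you are starting from a fresh R session, loading it looks like this:

library(tidyverse)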

1 R as a GIS

One of the really cool things about R is that, because it is open source, people are constantly creating packages to enhance its functionality. And the really cool thing about the tidyverse in particular is that all of its packages work together and use a common grammar. That is not always the case in R, and it can get confusing. While there are many packages for working with spatial data in R, the sf package was designed specifically to make it easier to use R as a GIS by integrating with the other tidyverse functions. What this means is that you can work with spatial data in much the same way we have been working with data all semester: things like filter(), mutate(), and ggplot() work with sf objects too. A lot of basic spatial data, like country or county boundaries, already exists in R packages, so you often don’t have to spend time hunting for it. Let’s make a basic map of playa lakes in Kansas.

First you will need to install and load the sf package. Remember that you install a package only once (at least until you update your version of R, in which case you will have to reinstall most of your packages), but you need to load it in every new R session where you want to use it. The install commands are commented out in this document, so you will need to remove the hashtag in order for them to run.

#install.packages("sf")
library(sf)

Now you need to load the playa csv file. Remember that this code will only work if your current working directory is your lab folder! In the Files pane (lower right), if you click More and then Go To Working Directory, you should see your Datasets folder.

playa <- read_csv("Datasets/playa.csv")
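
It is also worth taking a quick peek at the data, since the code below uses the Longitude, Latitude, and County columns:

glimpse(playa)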

Now we will load spatial data for the counties in Kansas. We will use the maps package, which contains lots of premade geometries. First, install and load the package.

#install.packages("maps")
library(maps)

This code does two things: it loads the Kansas counties from the maps package, AND it changes the counties into an sf object so that we can easily use it with ggplot and the rest of the functions in the sf package.

kansas <- st_as_sf(map(database = "county", 
                       regions = "kansas", 
                       plot = FALSE, 
                       fill = TRUE))
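
One gotcha worth knowing about: both purrr (part of the tidyverse) and maps have a function called map(). Because we loaded maps after the tidyverse, map() above refers to maps::map(). If you ever get an error because the wrong map() is being picked up, you can spell it out explicitly, like this:

kansas <- st_as_sf(maps::map(database = "county", 
                             regions = "kansas", 
                             plot = FALSE, 
                             fill = TRUE))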

Let’s plot both the counties and the lakes on our map.

ggplot() +
  geom_point(data = playa, 
             mapping = aes(Longitude, Latitude)) +
  geom_sf(data = kansas) 

Whoops! Just like in regular GIS software, the order of the layers matters. The playa lakes don’t show up because they are drawn underneath the counties. Let’s change that.

ggplot() +
  geom_sf(data = kansas)  +
  geom_point(data = playa, 
             mapping = aes(Longitude, Latitude)) 

There are so many playa lakes that they all overlap each other. Let’s change the transparency of the points so that we can get an idea of their density. This is done by changing the alpha level. Anything less than one will make the color more transparent. You just kind of have to play around with it until it looks how you want it to.

ggplot() +
  geom_sf(data = kansas)  +
  geom_point(data = playa, 
             mapping = aes(Longitude, Latitude), 
             alpha = 0.05, 
             color = "blue") 

Let’s install another package, ggspatial, that makes it easier to make good maps by adding scale bars and north arrows. It also works well with ggplot(). For the location you can specify top left ("tl"), top right ("tr"), bottom left ("bl"), or bottom right ("br"), and then use pad_x and pad_y to adjust the position further. Again, you just have to play around with it until you like the way the map looks.

#install.packages("ggspatial")
library(ggspatial)

ggplot() +
  geom_sf(data = kansas)  +
  geom_point(data = playa, 
             mapping = aes(Longitude, Latitude), 
             alpha = 0.05, 
             color = "blue") +
  
  # Add a title
  ggtitle("Playa lakes in Kansas counties") +
  
  # Add a scale bar and north arrow
  annotation_scale(location = "tr", pad_y =  unit(0, "cm")) +
  annotation_north_arrow(location = "tr", 
                         pad_y =  unit(0.5, "cm"), 
                         height = unit(1, "cm"), 
                         width = unit(0.5, "cm"))

2 Quadrat analysis

Next, let’s examine the distribution of playas in a single county (Haskell) and test the null hypothesis that the distribution is random. To do this we will use the spatstat package. While it is not specifically compatible with sf objects, that is pretty easy to work around, and it is a well-respected and well-maintained package for analyzing spatial point patterns. See the spatstat package documentation if you want more information.

#install.packages("spatstat")
library(spatstat)

First we need to get our playa lakes into a format that spatstat can use. This is called a ppp object, for planar point pattern. The ppp() function is given the x and y coordinates of the data and their ranges, along with the window in which to analyze them. First we need to create an observation window from Haskell county and subset the data to just the playa lakes within Haskell county. We will also plot the ppp object to make sure that it looks right.

# Select Haskell county from the counties layer
# str_detect tells the filter to keep only the 
# rows where the ID column contains the string haskell. 
# Note that case matters here and all of the
# data in the ID column are lower case.

haskell <- kansas %>% 
  filter(str_detect(ID, "haskell"))

# Get the bounding box (xmin, ymin, xmax, ymax) coordinates 
# and use those to make an observation window

haskell.window <- as.owin(st_bbox(haskell))

# Select just the playas in Haskell county

haskell.playa <- playa %>% 
  filter(County == "Haskell") 

# Create the ppp using our observation window

haskell.ppp <- ppp(x = haskell.playa$Longitude, 
                   y = haskell.playa$Latitude,
                   xrange = range(haskell.playa$Longitude),
                   yrange = range(haskell.playa$Latitude),
                   window = haskell.window)

plot(haskell.ppp)
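
A quick numerical check is also useful: summary() on a ppp object reports the number of points and the average intensity (points per square unit of the window), which is the same \(\lambda\) we will calculate by hand in the nearest neighbor section.

summary(haskell.ppp)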

The quadrat.test() function counts the number of observations within each quadrat (the grid is specified with nx and ny) and then runs a chi-squared goodness-of-fit test to determine whether or not the count frequencies per quadrat are what would be expected from a random process, otherwise known as complete spatial randomness (CSR).

test <- quadrat.test(haskell.ppp, nx = 10, ny = 10)
test
## 
##  Chi-squared test of CSR using quadrat counts
## 
## data:  haskell.ppp
## X2 = 1002.8, df = 99, p-value < 0.00000000000000022
## alternative hypothesis: two.sided
## 
## Quadrats: 10 by 10 grid of tiles

As you can see, the p-value is very, very low. A p-value of less than 0.001 indicates that we can reject the null hypothesis of complete spatial randomness. You can also specify the alternative hypothesis. The test we just ran only tells us that the data are non-random; it doesn’t tell us whether they are clustered or dispersed. To test for clustering we add alternative = "clustered". To test for regularity (dispersion) we add alternative = "regular".
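
For reference, the dispersion version of the test would be specified like this (we will run the clustered version next):

quadrat.test(haskell.ppp, nx = 10, ny = 10, alternative = "regular")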

test.clust <- quadrat.test(haskell.ppp, nx = 10, ny = 10, alternative = "clustered")
test.clust
## 
##  Chi-squared test of CSR using quadrat counts
## 
## data:  haskell.ppp
## X2 = 1002.8, df = 99, p-value < 0.00000000000000022
## alternative hypothesis: clustered
## 
## Quadrats: 10 by 10 grid of tiles

Clustered indeed! We can also plot the counts per quadrat. CSR would look, well, completely random with no order. You can see that the data appear to be clustered.

count <- quadratcount(haskell.ppp, 10, 10)

# This formats the data from the quadrat count so that 
# we can use it in ggplot. You can compare the count variable to 
# q.count to see how the data structure changes if you are 
# interested.

q.count <- as_tibble(count) %>% 
  separate(y, c("y.From", "y.To"), sep = ",") %>% 
  separate(x, c("x.From", "x.To"), sep = ",") %>% 
  mutate(y.From = as.numeric(str_sub(y.From, 2, -1)),
         y.To = as.numeric(str_sub(y.To, 1, -2)),
         x.From = as.numeric(str_sub(x.From, 2, -1)),
         x.To = as.numeric(str_sub(x.To, 1, -2)))

# With geom_rect() you specify the corners of each 
# rectangle (xmin, xmax, ymin, ymax).

ggplot(q.count) +
  geom_rect(aes(xmin = x.From,
                xmax = x.To,
                ymin = y.From,
                ymax = y.To,
                fill = n)) +
  scale_fill_viridis_c(option = "inferno")
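
If you just want a quick look without reshaping the data, spatstat can also draw the quadrat counts directly on top of the point pattern using base plotting (a minimal sketch):

plot(haskell.ppp, pch = ".")
plot(count, add = TRUE)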

3 Nearest neighbor analysis

One of the problems with quadrat analysis and with spatial analysis in general is that it can be highly dependent on your choice of analysis scale. What is clustered at one scale may be dispersed at a smaller scale (Fig. 3.1). See this super awesome paper if you want to know more.

Figure 3.1: Pattern changes with analysis scale.

The result of a quadrat analysis is partially determined by the number of quadrats that you select. Another spatial analysis technique that is less dependent on scale is nearest neighbor analysis, which calculates, for every point, the distance to its nearest neighboring point. We can then compare the mean nearest neighbor distance (\(\bar{d}\)) to the mean that would be expected if the points were completely spatially random (\(\delta^*\)).

# Calculate the nearest neighbor distances & the mean nnd
haskell.nnd <- nndist(haskell.ppp)
mean.d <- mean(haskell.nnd)

Remember that \[\delta^* = \frac{1}{2\sqrt{\lambda}}\] where \[\lambda = \frac{n}{A} \] with \(n\) being the number of observations within an area \(A\).

# Number of playas in Haskell county
n <-  haskell.ppp$n

# Area of Haskell county
A <-  area.owin(haskell.window)

# Calculate lambda
lambda <-  n / A

# Calculate delta-star
delta.star <- 1 / (2 * sqrt(lambda))

We know that the distances and the area are in the same units because they are all calculated from latitude and longitude so they are in decimal degrees.

The nearest neighbor index \(R\) is: \[ R = \frac{\bar{d}}{\delta^*}\]

# Calculate the nearest neighbor index
R <- mean.d / delta.star

The calculated \(R\) is well below 1, meaning that the points are more clustered than what would be expected from complete spatial randomness. Values less than 1 indicate clustering and values greater than 1 indicate dispersion.

The test statistic is given by: \[z = \frac{\bar{d} - \delta^*}{\sigma_{\bar{d}}} \] where \[\sigma_{\bar{d}} = \frac{0.261}{\sqrt{n\lambda}} \]

sigma.d <- 0.261 / sqrt(n * lambda)

z <- (mean.d - delta.star) / sigma.d

To get the p-value we compare our calculated z to the standard normal distribution to get the probability of getting a value this extreme if the null hypothesis (complete spatial randomness) were true. We use the pnorm() function to do this. For a left-tailed test add lower.tail = T, for a right-tailed test use lower.tail = F, and for a two-tailed test double the probability in whichever tail your z score falls in (for example, 2 * pnorm(-abs(z))). We will use a left-tailed test because our hypothesis is that the data are clustered. You would use a right-tailed test to see if the data are dispersed. A two-tailed test would just tell you whether or not the data are CSR; use this if you don’t have a specific reason to think that the data are either clustered or dispersed.

pnorm(z, lower.tail = T)
## [1] 0

The area under the sampling distribution to the left of our z score is so close to zero that it doesn’t even print in scientific notation. And if you look at our calculated z score, its magnitude is quite large! This means that the probability of getting a result this extreme if the null hypothesis were true is so small that it can be considered negligible, so we reject the null hypothesis of complete spatial randomness.
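
For reference, the other two versions described above could be computed like this (sketches only; the left-tailed test is the one we need for our clustered hypothesis):

# Right-tailed test (alternative: dispersed)
pnorm(z, lower.tail = F)

# Two-tailed test (alternative: simply not CSR)
2 * pnorm(-abs(z))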

4 Exercises

  1. Using the playa dataset, filter to just the playas in Thomas county. Filter the counties to just Thomas county as well. Make a map of the playas in Thomas county that includes a north arrow, a scale bar, and a title. Use the alpha argument to change the transparency of the points. Save your map as a png file and upload it to BB with your script. How would you describe the distribution of the playa lakes in Thomas county?

  2. Using a quadrat analysis with 15 × 15 quadrats in the county, test to see if the distribution of the playas follows a random pattern (assume an \(\alpha\)-level of 0.05). Remember that you will need to create a new observation window from Thomas county. Use a two-sided test. What does the test say? Write a sentence specifying what the test result says about the null hypothesis. Refer back to the quadrat section to see how this sentence should read.

  3. Recalculate the test using both alternative hypotheses. Are the data clustered or dispersed? Write a sentence for each alternative hypothesis explaining what it says about accepting or rejecting the null.

  4. Calculate the nearest neighbor index, R, for this dataset. What does this index indicate about the process that produced the pattern of playas in this county?

  5. Calculate the z test statistic. What is the probability that you could get this value if the null hypothesis were true? Does this mean you should accept or reject the null? Use the alternative hypothesis that you accepted in the quadrat analysis.