Regression

Earlier in the course we saw how visualising relationships between variables can be used as part of an exploratory data analysis. In this tutorial we will look at how to measure the strength of these relationships (correlation) and at regression, an extremely versatile family of methods for formally modelling them.

Prerequisites

By the end of this tutorial, you should be able to:

Calculate correlation coefficients in R
Fit a linear model to data
Assess the quality of fit of a linear model visually and using residual analysis

Objectives

Chapter 8 (“Relationships between Two Numeric Variables: Correlation and Regression”), in Stephen Shennan (1998), Quantifying Archaeology, 2nd edition. https://doi.org/10.1515/9781474472555-009
Chapter 9 (“When the Regression Doesn’t Fit”), in Stephen Shennan (1998), Quantifying Archaeology, 2nd edition. https://doi.org/10.1515/9781474472555-010

Recap: visualisating relationships

We have already learned how to use a scatter plot (or scattergram) to visualise the relationship between two numeric variables. For example, we could look at the relationship between the area of sites on Islay and the total number lithics found there:

library(ggplot2)
library(islay)

ggplot(islay_lithics, aes(area, total)) +
    geom_point()

We have also used geom_smooth() to add a “smoothed conditional mean” (in ggplot2’s words) to plots like this:

ggplot(islay_lithics, aes(area, total)) +
    geom_point() +
    geom_smooth()

The message generated by ggplot gives us some clues as to what it is doing under the hood:

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

It has automatically fitted a model to the data using LOESS (method = 'loess'), a method of nonlinear regression. geom_smooth() always picks a nonlinear method by default: either LOESS or GAM (generalised additive models), depending on the number of observations. Both methods combine multiple regression models fitted to subsets of the data – producing the characteristically ‘wiggly’ output. This is useful for exploratory data analysis because they can fit relationships with many different shapes without further tuning.

You can also specify the method used by geom_smooth manually. Here, for example, it looks like a linear model might be more useful. Use method = "glm" to fit a generalised linear (regression) model (GLM) to the data:

ggplot(islay_lithics, aes(area, total)) +
    geom_point() +
    geom_smooth(method = "glm")

Later we will learn how we can measure the strength of correlation and goodness fit quantitatively, and further manipulate the linear model. For now, we can inspect it visually. The blue line represents the estimated mean trend or ‘line of best fit’. If our model is successful, should roughly match the central trend of the surrounding points.

Equally important is the grey area around the line, representing the confidence interval around the mean. By default, it is the 95% confidence interval. You can control this with the level argument:

ggplot(islay_lithics, aes(area, total)) +
    geom_point() +
    geom_smooth(method = "glm", level = 0.68) # one sigma

ggplot(islay_lithics, aes(area, total)) +
    geom_point() +
    geom_smooth(method = "glm", level = 0.997) # three sigma

As always with confidence intervals, it is important to remember that the level chosen is arbitrary.

Another useful argument to geom_smooth() is fullrange. We can set this to TRUE to extend the model beyond the actual range of the data, which can be useful for prediction:

ggplot(islay_lithics, aes(area, total)) +
    geom_point() +
    geom_smooth(method = "glm", fullrange = TRUE) +
    scale_x_continuous(limits = c(0, 300000))

Note the use of scale_x_continuous() to extend the range of the x axis. Prediction needs to be approached with common sense and a knowledge of the data. For example, we can readily ask the model to ‘predict’ the number of lithics found at sites with an area of less than zero:

ggplot(islay_lithics, aes(area, total)) +
    geom_point() +
    geom_smooth(method = "glm", fullrange = TRUE) +
    scale_x_continuous(limits = c(-100000, 200000))

One final way we can try to improve the model fit, without actually changing the model itself, is to consider the scale of the variables. In this case, both the area of sites and the total number of lithics appear to follow a log-normal distribution:

ggplot(islay_lithics, aes(area)) + geom_density()

ggplot(islay_lithics, aes(total)) + geom_density()

As we’ve already seen, these kind of distributions are easy to normalise using a log transform. We could create new log variables, as we did last week, but since at this point we’re only concerned with visualisation, we can instead do the transformation directly on the plot itself with scale_*_log10():

ggplot(islay_lithics, aes(area, total)) +
    geom_point() +
    geom_smooth(method = "glm") +
    scale_x_log10() +
    scale_y_log10()

This is called a log–log scale. If there is a linear correlation on a log-log scale, as there is here, it indicates a power law correlation on a normal scale.

Exercises

Why does it make sense to log transform the variables total and area specifically?
Choose two lithic types (flakes, etc.) and fit an appropriate model to them using geom_smooth().
Is there a correlation? Why or why not?

Measuring correlation

A correlation coefficient measures the strength of correlation between two variables. We can calculate them with cor.test() function in R.

The most common, and default for cor.test(), is the product–moment correlation coefficient (method = "pearson"), which works well for linear relationships with the standard assumptions (cf. Shennan):

cor.test(islay_lithics$area, islay_lithics$total)

The output should be familiar from other hypothesis tests. The most useful parts are the p-value—which we can interpret in the usual way—and cor, which is the correlation coefficient r. The latter measures the strength of the correlation, and is a value between 0 (no correlation) and 1 (perfect correlation). Another useful statistic is R², which measures how much of the variance in the response variable can be predicted by the independent variable. To calculate this, we need to extract the actual value of r from the object returned by cor.test():

correlation <- cor.test(islay_lithics$area, islay_lithics$total)
correlation$estimate^2

This isn’t very promising: the p-value is relatively low, but so is the correlation coefficient and R². However, we’ve already seen above that the correlation looks stronger on a log-log scale, i.e. it is probably not linear. Rather than the product-moment coefficient, we should therefore use a nonparametric coefficient like Spearman’s ρ or Kendall’s τ. To this we simply need to change the method argument of cor.test():

cor.test(islay_lithics$area, islay_lithics$total, method = "spearman")

As expected, the correlation appears to be stronger using this method.

Exercises

Calculate Kendall’s τ for total vs. area (this is usually preferred to Spearman’s).
Describe the correlation (or lack thereof) you see, based on Kendall’s τ.
Choose two lithic types (flakes, etc.) and calculate an appropriate correlation coefficient (hint: you will first need to determine whether the correlation appears to be linear or not).

Linear regression

Linear models might seem underwhelming compared to LOESS or GAM, but unlike these ‘black boxes’ we can easily build our own linear models, and in doing so learn a lot about the underlying properties of the data. We should also remember that regression fitting algorithms will always fit the model you give them; with non-linear models, this can easily produce over-fitted leading to spurious interpretations. By restraining ourselves to a linear relationships encourages to also keep our interpretations simple, and be more likely to spot when the model simply doesn’t fit.

To build our own linear models, we will use the function lm() directly. The following replicates the linear model we fitted with geom_smooth above:

lm(total~area, data = islay_lithics)

Like many modelling functions in R, it uses a special formula syntax, where the response variable(s) (left) are separated from the independent variable(s) (right) with ~. By giving our data frame as the data argument, we can use column names directly in the formula.

The result is a model object which includes the fitted values (used e.g. to reproduce a graph) as well as information on the model and a variety of measures of its performance – see ?lm for a full list. The format of this object is a little inconvenient. The broom package allows us to extract information from it as a data frame, which is a bit easier to work with. tidy() gives us the parameters of the model itself:

library(broom)
model <- lm(total~area, data = islay_lithics)
tidy(model)

glance() summarises measures of its performance:

glance(model)

And augment() gives us the original data alongside the fitted model predictions and residuals:

augment(model, data = islay_lithics)

We can use augment to reproduce a geom_smooth()-style plot from our model created with lm():

fitted <- augment(model, data = islay_lithics, interval = "confidence")
ggplot(fitted, aes(area, total)) +
    geom_point() +
    geom_line(aes(y = .fitted)) + 
    geom_ribbon(aes(ymin = .lower, ymax = .upper), col = NA, alpha = 0.3)

But crucially, we can go beyond what geom_smooth() can do and analyse the fit of the model in more detail. For example, the result of augment() includes the residuals, which can be very useful for further exploratory data analysis of the regression model (cf. Shennan, ch. 9).

Let’s see if we can make a better model. Our correlation coefficients and log-log plots have indicated that there probably is a correlation in this data, but that it is nonlinear – following a power law. We can model this by including log() in the formula specification of our model:

lm(total~log(area), data = islay_lithics) |>
    augment(data = islay_lithics, interval = "confidence") |>
    ggplot(aes(area, total)) +
    geom_point() +
    geom_line(aes(y = .fitted)) +
    geom_ribbon(aes(ymin = .lower, ymax = .upper), col = NA, alpha = 0.3)

Exercises

Choose two lithic types (flakes, etc.) and fit an appropriate linear model to them.
Produce a density plot of the residuals from the area–total regression. What can we learn from this?