Hypothesis testing

This tutorial introduces the most common methods in the null-hypothesis significance testing (NHST) framework, a highly influential frequentist approach to statistics developed in the 20th century. The basic idea of this approach is to use probability theory to test whether an observed pattern can be considered statistically significant. More formally, it tries to quantify how confident we can be about rejecting the null hypothesis - that any pattern or relationship we think we see is just down to chance. In practice, this amounts to choosing a test statistic that is appropriate to the sample you have available and comparing it to a reference distribution to calculate the probability of observing that statistic under the null hypothesis (the ‘p-value’).

Prerequisites

Chapter 5 (“An Introduction to Statistical Inference”), in Stephen Shennan (1998), Quantifying Archaeology, 2nd edition. https://doi.org/10.1515/9781474472555-005

Objectives

This tutorial should enable you to:

Understand the key concepts behind the NHST framework
Recognise a normal distribution and know how to deal with non-normal data
Select an appropriate statistic to test for significant differences in a) numeric and b) categorical distributions and compute it in R

Is my data normal?

Most NHST tests are parametric, meaning that they make assumptions about the way the data was collected. These assumptions essentially describe data collected in randomised experiments (which, not surprisingly, is where NHST originated): that we have a sample of the population we’re interested in; that this sample was drawn randomly, each observation independent of the others; and that the resulting variable has a normal distribution. The first two assumptions are out of our control as data analysts but the third—the normality assumption—is something we can investigate and, possibly, correct.

How can we tell if data is normally distributed? The most obvious way, which is frequently all you need, is to inspect it graphically using the techniques we learned for visualising distributions. Plotting the distribution of bill length of Adelie penguins from the palmerpenguins dataset, for example:

library(ggplot2)
library(dplyr, warn.conflicts = FALSE)
library(palmerpenguins)

adelie <- filter(penguins, species == "Adelie")
ggplot(adelie, aes(bill_length_mm)) + geom_density()

We can see that the distribution definitely looks normal. If we wanted to be really sure (or convince someone else!) that it was normal enough to work with in NHST, we can use the Shapiro-Wilk test of normality:

shapiro.test(adelie$bill_length_mm)

This reports the test statistic (W) and an associated p-value. The null hypothesis of the Shapiro-Wilk test is that the variable is normally distributed; here we see a high p-value, meaning that we cannot be confident about rejecting this null hypothesis. In other words, it indicates that the distribution is probably drawn from a normal distribution.

Unfortunately, as we have already found out, it is quite common to see markedly not-normal distributions in archaeological data. Take the number of lithics found at sites on Islay, for example:

library(islay)

ggplot(islay_sites, aes(total_chipped_stone)) + geom_density()

In case we weren’t already certain, Shapiro-Wilk tells us we can very confidently reject the null hypothesis of normality:

shapiro.test(islay_sites$total_chipped_stone)

Given a non-normal distribution like this, we have two options. The first is to use one of a number of nonparametric tests—NHST techniques that do not make assumptions about the distribution of the data—which we will look at shortly.

Or, we can try to make the data normal. This option is particularly attractive here because the distribution of chipped stone counts resembles a log-normal distribution, which can easily be transformed into a normal distribution by taking its logarithm:

islay_sites <- mutate(islay_sites, log_chipped_stone = log(total_chipped_stone))

ggplot(islay_sites, aes(log_chipped_stone)) + geom_density()

shapiro.test(islay_sites$log_chipped_stone)

Thus we could precede with parametric NHST using log_chipped_stone instead of total_chipped_stone.

A question of power

You might wonder, if parametric tests are so fussy about what kind of data they’ll accept, but nonparametric tests will take anything, why don’t we just always use nonparametric ones? One reason for this is historical: parametric tests make the assumptions to do because they make it much easier to calculation the test statistic and its p-value. In the pre-computer era, ease of calculation was extremely important, and this parametric testing became the norm. Another reason—now more relevant—is that these assumptions tend to give parametric more statistical power: they’re better at detecting small but significant effects, while avoiding false positives.

Exercises

Is the body mass of Adelie penguins normally distributed?
islay_land contains the area of Islay itself and the islets around it. Describe its distribution.
Transform area such that it could be used in a nonparametric test.

Testing the difference between numeric distributions

Having established whether our data is normal (or not), we can move on to testing hypotheses about it. A very common class of question is whether there is a statistically significant difference between two samples. Or, put another way, whether two sample were drawn from the same population. You can reduce a lot of types of archaeological questions to this type of hypothesis:

Are two groups of artefact different enough to be called “types”?
Do sites in different periods or contexts differ in a certain characteristic?
Reversing the question, is an assemblage from one site likely to be from the same ‘population’ as an assemblage in another?

We’ll look at whether the number of chipped stone artefacts differs between Mesolithic and Later Prehistoric sites. We can start, as always, by inspecting the data graphically:

filter(islay_sites, period %in% c("Mesolithic", "Later Prehistoric")) |>
    ggplot(aes(total_chipped_stone)) + 
        facet_wrap(~period, ncol = 1, scales = "free_y") + 
        geom_density()

They look a little different, but it’s hard to know if this is just because of unequal sampling. This is where NHST can be handy.

Student’s t test is a parametric test for the equality of the means of two distributions. If we inspect the documentation for its R function, ?t.test, we can see that we need to tell it whether the variance of the distributions is equal or not. To decide this, we can turn to another test, the F test for equal variance:

mesolithic <- filter(islay_sites, period == "Mesolithic")
later <- filter(islay_sites, period == "Later Prehistoric")
var.test(mesolithic$log_chipped_stone, later$log_chipped_stone)

Remember, we’ve found that the total chipped stone is not normally distributed, so to do these parametric tests we use log_chipped_stone instead. This tells us that the variances are probably equal, so we can precede with the t test using var.equal = FALSE:

t.test(mesolithic$log_chipped_stone, later$log_chipped_stone)

Though the p value is quite small, it is above 0.05, so cannot safely reject the null hypothesis. * Nonparametric: Mann-Whitney U * Nonparametric: Kolmogorov-Smirnov

Exercises

Is there a significant difference between the number of chipped stones at Mesolithic and Later Prehistoric sites on Islay?
Give a null hypothesis and alternative hypothesis that could help answer the following archaeological questions:
- A zooarchaeologist has measured the length of deer long bones at a hunter-gatherer site and an early farming site in the same region. She theorises that the two sites were hunting from different deer populations. Is this likely?
- You have a newly-discovered site from which an assemblage of bronze arrowheads have been excavated. You also have a set of measurements of similar arrowheads from a Bronze Age culture in an adjacent region. Based on the arrowheads, does your new site belong to this culture?
- Were Neolithic people shorter than Mesolithic people?
Two non-parametric versions of the t-test are the Kolmogorov-Smirnov test and the Mann-Whitney U test. Perform each in R using the (untransformed) number of chipped stones from sites on Islay. Do the results differ from the t test above?