Visualising relationships

In this tutorial we will continue to learn about visualisation as a tool for exploratory data analysis. We will look at ways of visualising the relationship between two or more variables using bar and column plots, scatterplots, additional aesthetics and facets.

Objectives

By the end of this tutorial you should:

  • Be able to generate questions about the relationship between two or more variables
  • Know how to produce bar plots, multiple density plots, stacked bar plots, and scatter plots in R using ggplot2
  • Be able to refine plots for communication and export them from R

Prerequisites

  • Edward Tufte, The Visual Display Of Quantitative Information (2nd edition), pp. 91–138:
    • Chapter 4, “Data–Ink and Graphical Redesign”
    • Chapter 5, “Chartjunk: Vibrations, Grids, and Ducks”
    • Chapter 6, “Data–Ink Maximization and Graphical Design”

Generating questions about relationships

Last week we looked at using visualisation to answer questions about the variation of a single variable (its distribution). Such questions are essential for describing and understanding the nature of your dataset, but on their own they have limited explanatory value.

This week we will start looking at the covariation between two (or more) variables – in plain terms, the relationship between them. With this we can start to gain insights into causality. In statistics, we say that there is a correlation between two variables if one can measurably predict the other. This is not a statement about causality, merely practicality: if you knew two variables were correlated and knew the value of one, you could make a good guess about the value of the other.

This leads to the well-known adage, “correlation is not causation”. But equally, we should be aware that correlation can be a good hint about causation!
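
To make "one can measurably predict the other" concrete, here is a minimal sketch using base R's cor() function. The flake and blade counts are copied from the first six rows of the Islay lithics dataset used later in this tutorial; the variable names flakes, blades and r are only illustrative.

```r
# Counts copied from the first six rows of the islay_lithics preview
flakes <- c(159, 125, 12, 128, 56, 29)
blades <- c(15, 6, 0, 18, 4, 1)

# Pearson's correlation coefficient ranges from -1 to 1:
# values near 1 or -1 mean one variable predicts the other well,
# values near 0 mean it tells us little
r <- cor(flakes, blades)
r
```

Six sites are far too few to draw conclusions from, and a coefficient on its own says nothing about why the two counts might rise and fall together.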

Exercises

Given a dataset on a burial ground, with the following variables:

  • sex of the individual
  • age of the individual
  • age of the burial (i.e. a radiocarbon date)
  • number of grave goods
  • number of metal objects amongst the grave goods

  1. What questions of covariation could we ask of the dataset?
  2. If there was a correlation between the age of the individual and the number of grave goods, could that imply causation?
  3. What about a correlation between the number of grave goods and the number of metal objects?

Visualising relationships

Work through sections 2.5 and 2.6 of “R for Data Science” (2nd ed.).
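
As a quick reference alongside those sections, the main plot types named in this tutorial's objectives can be sketched as follows. This is only an outline, not the book's own code: it assumes ggplot2 is installed and swaps in ggplot2's built-in mpg dataset, and the object names (p_box, p_point, out_file and so on) are illustrative.

```r
library(ggplot2)

# Numerical vs categorical: one boxplot of highway mileage per vehicle class
p_box <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()

# Numerical vs categorical, alternative: one density curve per drive type
p_density <- ggplot(mpg, aes(x = hwy, colour = drv)) +
  geom_density()

# Two categorical variables: stacked bars rescaled to proportions
p_bar <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "fill")

# Two numerical variables: scatterplot, with a third variable mapped to colour
p_point <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point()

# Facets: split the same scatterplot into one panel per vehicle class
p_facet <- p_point + facet_wrap(~class)

# Export a plot to a file (written to a temporary directory here)
out_file <- file.path(tempdir(), "displ-hwy.png")
ggsave(out_file, plot = p_point, width = 6, height = 4)
```

Printing any of the plot objects (for example, typing p_box at the console) draws it. ggsave() picks the file format from the extension, so changing ".png" to ".pdf" should produce a PDF instead.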

You will then apply these techniques to an archaeological dataset.

Lithic assemblages from Islay

Load the islay_lithics dataset from the islay package:

library(islay)
data(islay_lithics)

We can use the head() function to get a quick preview of the data frame:

head(islay_lithics)
  site_code          region                         period   area flakes blades
1      LGM1 Loch Gorm South Mesolithic & Later Prehistoric 102450    159     15
2      LGM2 Loch Gorm South Mesolithic & Later Prehistoric  62497    125      6
3      LMG4 Loch Gorm South                           <NA>  37480     12      0
4      LGM5 Loch Gorm South                     Mesolithic  52473    128     18
5      LGM6 Loch Gorm South              Later Prehistoric  54971     56      4
6      LGM8 Loch Gorm South                           <NA>  49974     29      1
  chunks cores pebbles retouched total
1     16    24       0        15   229
2     11    20       4        16   182
3      1     1       6         3    23
4     17    27       7         5   202
5      8    18      12        10   108
6     20     3       0         5    58

Because this dataset is built into the package, you can also enter ?islay_lithics to open its help page, which contains more information on what it describes.

As with the last dataset, it will be useful to turn the period column into a factor now, so that it will automatically be ordered in our subsequent plots:

periods <- c("Mesolithic", "Mesolithic & Later Prehistoric", "Later Prehistoric")
islay_lithics$period <- factor(islay_lithics$period, periods)
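
To see what this conversion does, here is a small self-contained sketch that does not need the islay package. The period values are copied from the six rows in the preview above; period_raw and period are illustrative names.

```r
periods <- c("Mesolithic", "Mesolithic & Later Prehistoric", "Later Prehistoric")

# Period values copied from the first six rows of the preview
period_raw <- c("Mesolithic & Later Prehistoric", "Mesolithic & Later Prehistoric",
                NA, "Mesolithic", "Later Prehistoric", NA)

# Plain character values sort alphabetically, not chronologically
sort(unique(period_raw))

# A factor keeps the level order we supply, and plots follow that order
period <- factor(period_raw, levels = periods)
levels(period)

# Sites with no recorded period stay NA rather than becoming a level
table(period, useNA = "ifany")
```

Left as plain text, the periods would be ordered alphabetically, putting "Later Prehistoric" first; supplying the levels explicitly keeps them in chronological order.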

Exercises

  1. Generate a plot showing the relationship between period and the number of retouched pieces. Is there a correlation? What could explain this?
  2. Try with two other types of lithics. Does it change your answer?
  3. Generate a plot showing the relationship between the counts of two types of lithics.
  4. Add an aesthetic showing a categorical variable.
  5. Export the plot.