Multivariate analysis

In this tutorial we will look at multivariate analysis—methods that can accommodate multiple variables relating to the same observation—including cluster analysis, factor analysis, and multiple regression.




By the end of this tutorial, you should be able to:

  • Calculate distance and dissimilarity indexes
  • Perform hierarchical and non-hierarchical clustering
  • Perform principal components analysis (PCA) and correspondence analysis
  • Perform multiple regression

Distance and dissimilarity

A common concept across many forms of multivariate analysis is the distance between observations. With numerical variables, this is analogous to measuring distances in physical space, using our data as the “coordinates”. In physical space, we might measure distances in two dimensions (e.g. on a map) or three (e.g. with a tape measure). In data space, we can use as many dimensions as we want – that is, the dimensionality of the problem is defined by the number of variables we want to use.

The simplest measure of distance for numerical variables is the Euclidean distance, which is simply the length of a straight line between a pair of observations. leather.csv contains measurements of leather objects from a medieval site. We can use dist() (which defaults to the Euclidean distance) to measure the distance in length, width and thickness between each pair of objects:


library(readr)

leather <- read_csv("../data/leather.csv")
dist(leather[,c("length", "width", "thickness")])

The result is a large (lower-triangular) matrix giving the distance between each pair of objects. The larger the number, the more distant the two corresponding rows of the original data frame are.
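To see the shape of this output on a manageable scale, here is a minimal sketch with made-up coordinates (not from leather.csv):

```r
# Three hypothetical observations in two dimensions
pts <- rbind(a = c(0, 0), b = c(3, 4), c = c(6, 8))
d <- dist(pts)  # Euclidean by default
d
# a–b is 5, a–c is 10, b–c is 5; only the lower triangle is printed
as.matrix(d)    # the full symmetric matrix, if that is easier to index
```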

There are other distance measures for numerical data, notably the Chi-Square distance, Manhattan distance and Minkowski distance. We can control this using the method argument of dist():

dist(leather[,c("length", "width", "thickness")], method = "manhattan")

The strengths and weaknesses of these are for the most part only relevant when building more complex ‘machine learning’ models. Therefore, if you are interested in the distances in themselves, it makes sense to stick with the conceptually simple Euclidean distance. Otherwise, the choice of index generally depends on the type of model subsequently employed (e.g. correspondence analysis uses chi-square distances).

If we have categorical variables, it no longer makes sense to calculate distances. Instead, we use the analogous concept of dissimilarity (or similarity). Again, there are a wide range of different methods to choose from. The function vegdist() from the vegan package implements many of them. The most versatile is probably the Jaccard index (method = "jaccard").
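As a sketch of vegdist(), here is the Jaccard dissimilarity between two hypothetical presence/absence assemblages (the data are made up, not from the tutorial files):

```r
library(vegan)

# 1 = type present, 0 = absent; each row is a site, each column a type
sites <- rbind(site1 = c(1, 1, 0, 0),
               site2 = c(1, 0, 1, 0))
vegdist(sites, method = "jaccard")
# The two sites share 1 of the 3 types present overall,
# so the Jaccard dissimilarity is 1 - 1/3 = 2/3
```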

One exception is binary datasets (e.g. yes/no, true/false), which are common in archaeology where we want to record the presence or absence of a feature or type. We can calculate the distance between observations in binary space using the Hamming distance, which simply counts the number of differences in each pair of observations.
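Because the Hamming distance is just a count of mismatches, it can be computed directly; on 0/1 data it also coincides with the Manhattan distance. A sketch with made-up presence/absence records:

```r
# Two hypothetical binary records (not from the tutorial files)
x <- c(1, 0, 1, 1, 0)
y <- c(1, 1, 1, 0, 0)

sum(x != y)                              # Hamming distance: 2 mismatches
dist(rbind(x, y), method = "manhattan")  # same value on binary data
```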


  1. Why doesn’t it make sense to calculate the ‘distance’ between observations in a categorical space?
  2. Calculate the Euclidean distance between the first two objects in leather.csv, without using the dist() function
  3. kurgans.csv contains information on Bronze Age burial mounds. The characteristics of the burial are encoded as categorical variables in the columns INTERMENT through BODY_ORNAMENT. Calculate an appropriate distance/dissimilarity index for this data.

Cluster analysis

Clustering is the process of sorting observations into groups of minimal distance.

If we don’t know how many groups we want, we can use hierarchical clustering, which attempts to find the ‘natural’ breaks in the dataset. It generally tends towards “lumping” into larger groups. Hierarchical clustering is implemented in R as hclust(). For example, we can use it to cluster the leather objects by the Euclidean distances we calculated earlier:

leather_distance <- dist(leather[,c("length", "width", "thickness")])
leather_clusters <- hclust(leather_distance)

The easiest way to inspect the result is generally by plotting a dendrogram:

plot(leather_clusters)


Often we can get better results if we specify the number of groups we want in advance – this is called non-hierarchical clustering. In the best case scenario, we know from the context of the data that only a certain number of groups are possible – for example, if we are trying to separate a male and a female population based on measurements of animal bones. Otherwise, we can try to empirically determine an ‘optimal’ number from a prior hierarchical clustering. For example, we can inspect the heights at which the algorithm decided to split the leather objects into groups:

plot(rev(leather_clusters$height), type = "l")

We can see a distinct ‘elbow’ around 6, suggesting that this could be a good number of clusters to try.
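Once we have settled on a number of groups, cutree() extracts the corresponding group memberships from the hierarchical clustering. A sketch, continuing from leather_clusters above:

```r
# Cut the dendrogram into 6 groups; returns one group number per object
leather_groups <- cutree(leather_clusters, k = 6)
table(leather_groups)  # how many objects fall into each group
```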

k-means clustering is the most common non-hierarchical clustering method, implemented in the R function kmeans():

leather_kclusters <- kmeans(leather[,c("length", "width", "thickness")], 6)

There is no method for plotting k-means clustering in base R, but the fpc package provides one.

library(fpc)

plotcluster(leather[,c("length", "width")], leather_kclusters$cluster)


  1. Perform a hierarchical clustering of the kurgans.csv data.
  2. Perform a k-means clustering of the kurgans.csv data.

Factor analysis

Factor analysis tries to reduce the dimensionality of a dataset by finding a smaller number of factors (also known as hidden or latent variables) that account for as much of the variation between observations as possible. The two major families are principal components analysis (for numerical data) and (multiple) correspondence analysis (for categorical data).

There are many packages that implement correspondence analysis and/or principal components analysis in R. We will use dimensio because it has a nice interface and good documentation :)
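As a point of reference, base R can already perform a PCA with prcomp(), without any extra packages; a sketch on the leather measurements loaded earlier (dimensio offers the same analysis with richer plotting):

```r
# PCA of the three leather measurements; scale. = TRUE puts the
# variables on a common footing before extracting components
leather_pca <- prcomp(leather[, c("length", "width", "thickness")],
                      scale. = TRUE)
summary(leather_pca)  # proportion of variance explained per component
biplot(leather_pca)   # observations and variable loadings together
```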


  1. Perform a correspondence analysis of the kurgans dataset.
  2. Generate a biplot of the results, indicating the regional groups from the original dataset.
  3. Generate plots showing which variables contributed most to the first and second dimensions. What does this tell us?

Multiple regression

We can transform the regression model we worked on previously into a multiple regression very easily – simply add additional predictor variables to the formula given to lm()!
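For example, using the leather measurements (an illustrative model, not one suggested by the data), each extra predictor simply joins the right-hand side of the formula with +:

```r
# Model thickness as a function of length and width together
fit <- lm(thickness ~ length + width, data = leather)
summary(fit)  # one slope coefficient per predictor, plus the intercept
```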


  1. Perform a multiple regression using the islay_lithics data.