Welcome back to part two of this training module on data visualization using ggplot2. If this is your first time seeing this post, please go back to the previous post, where I covered the basic steps of using ggplot2.

In this part I will cover two-dimensional geometries and statistical transformations.

Two-dimensional geometries

Before we move on, let’s load the tidyverse library with the code below. If you have not installed tidyverse yet, remove the “#” symbol and install it now.

# install.packages("tidyverse")
library("tidyverse")
## ── Attaching packages ────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.6
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0

## ── Conflicts ───────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

We will still be using our dataset on United States colleges, which consists of the school name, city, state, region, highest degree offered by each school, average SAT scores, tuition and so on. Once again, I would like to reiterate that this dataset is not mine. I found it online and it can be accessed through this link.

Let’s import our data and store it in R under the variable name visualData.

visualData <- read_csv('https://olawaleayilara.github.io/visualData.csv')
## Parsed with column specification:
## cols(
##   id = col_integer(),
##   name = col_character(),
##   city = col_character(),
##   state = col_character(),
##   region = col_character(),
##   highest_degree = col_character(),
##   control = col_character(),
##   gender = col_character(),
##   admission_rate = col_double(),
##   sat_avg = col_integer(),
##   undergrads = col_integer(),
##   tuition = col_integer(),
##   faculty_salary_avg = col_integer(),
##   loan_default_rate = col_character(),
##   median_debt = col_double(),
##   lon = col_double(),
##   lat = col_double()
## )
head(visualData)
## # A tibble: 6 x 17
##       id name  city  state region highest_degree control gender
##    <int> <chr> <chr> <chr> <chr>  <chr>          <chr>   <chr> 
## 1 102669 Alas… Anch… AK    West   Graduate       Private CoEd  
## 2 101648 Mari… Mari… AL    South  Associate      Public  CoEd  
## 3 100830 Aubu… Mont… AL    South  Graduate       Public  CoEd  
## 4 101879 Univ… Flor… AL    South  Graduate       Public  CoEd  
## 5 100858 Aubu… Aubu… AL    South  Graduate       Public  CoEd  
## 6 100663 Univ… Birm… AL    South  Graduate       Public  CoEd  
## # ... with 9 more variables: admission_rate <dbl>, sat_avg <int>,
## #   undergrads <int>, tuition <int>, faculty_salary_avg <int>,
## #   loan_default_rate <chr>, median_debt <dbl>, lon <dbl>, lat <dbl>

Take time to play with the data to ensure it is clean enough before you proceed to visualization. In this case, our data is reasonably clean. Therefore, we can proceed to the next stage.
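As a minimal sketch of what “playing with the data” might look like, the snippet below counts the missing values in each column and converts a character column to numeric. The toy data frame and its values (including the “NULL” placeholder string) are made up for illustration; only the column names are borrowed from visualData.

```r
# Hypothetical stand-in for a few columns of visualData (values are invented)
toy <- data.frame(
  tuition = c(10000, NA, 25000, 31000),
  faculty_salary_avg = c(5000, 6200, NA, 8100),
  loan_default_rate = c("0.05", "NULL", "0.12", "0.08"),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# A rate stored as text cannot be plotted on a continuous axis;
# entries that are not numbers become NA after conversion
toy$loan_default_rate_num <- suppressWarnings(as.numeric(toy$loan_default_rate))
```

The same two checks can be run on visualData itself, where loan_default_rate was parsed as a character column.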

The first geom that we will consider is the point geom, which is used for creating scatterplots. A scatterplot is a very simple way to display the relationship between two continuous variables. It can also be used to compare categorical variables, but there are other variations that are more appropriate for this type of data.

ggplot(data=visualData) +
  geom_point(mapping=aes(x=tuition, y=faculty_salary_avg))

unnamed-chunk-3-1.png

We can also transform a variable inside the aesthetic mapping. For example, let’s plot the log of the average faculty salary:

ggplot(data=visualData) +
  geom_point(mapping=aes(x=tuition, y=log(faculty_salary_avg)))

unnamed-chunk-4-1.png
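Wrapping a variable in log() inside aes() transforms the data, whereas a scale function such as scale_y_log10() transforms the axis and keeps the tick labels in the original units. The sketch below uses simulated data (the toy data frame is my own assumption, not part of the college dataset) and suggests that both approaches place the points at the same positions.

```r
library(ggplot2)

# Simulated data that grows exponentially in x
set.seed(1)
toy <- data.frame(x = runif(50, 1, 100))
toy$y <- exp(toy$x / 20)

# Option 1: transform the variable inside aes()
p_transformed <- ggplot(data = toy) +
  geom_point(mapping = aes(x = x, y = log10(y)))

# Option 2: keep the variable as it is and transform the scale
p_scaled <- ggplot(data = toy) +
  geom_point(mapping = aes(x = x, y = y)) +
  scale_y_log10()

# layer_data() returns the positions ggplot2 actually draws
y_transformed <- layer_data(p_transformed)$y
y_scaled <- layer_data(p_scaled)$y
```

The only visible difference between the two plots should be the axis labels: log values in the first, original units in the second.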

A heatmap of 2d bin counts is a useful alternative to the point geom in the presence of overplotting. This plot divides the plane into rectangles, counts the number of cases in each rectangle, and then maps the number of cases to the rectangle’s fill.

ggplot(data=visualData) +
  geom_bin2d(mapping=aes(x=tuition, y= faculty_salary_avg))

unnamed-chunk-5-1.png
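To make the counting step concrete, here is a small sketch on simulated data (the toy data frame and the choice of bins = 10 are assumptions for illustration). It pulls the computed counts out of the plot with layer_data(); every observation should fall in exactly one rectangle.

```r
library(ggplot2)

# 500 simulated points
set.seed(123)
toy <- data.frame(x = rnorm(500), y = rnorm(500))

# bins sets the number of bins in each direction
p_bins <- ggplot(data = toy) +
  geom_bin2d(mapping = aes(x = x, y = y), bins = 10)

# One row per non-empty rectangle; count is the number of cases in it
bin_counts <- layer_data(p_bins)$count
total_cases <- sum(bin_counts)
```

Fewer bins give a coarser picture with larger counts per rectangle, while more bins show finer detail.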

A hexagonal heatmap of 2d bin counts avoids the visual artefacts sometimes generated by the very regular alignment of geom_bin2d(). This plot divides the plane into regular hexagons, counts the number of cases in each hexagon, and then maps the number of cases to the hexagon’s fill.

ggplot(data=visualData) +
  geom_hex(mapping=aes(x=tuition, y= faculty_salary_avg))

unnamed-chunk-6-1.png

A contour plot of a 2d density estimate is another useful plot for dealing with overplotting. It performs a 2D kernel density estimation using MASS::kde2d() and displays the results with contours.

ggplot(data=visualData) +
  geom_density2d(mapping=aes(x=tuition, y= faculty_salary_avg))

unnamed-chunk-7-1.png

In my next post, I will show how we can modify axes to fix the cut observed in the plot above.
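For readers curious about the density estimate itself, the sketch below calls MASS::kde2d() directly on simulated data (the toy vectors and the grid size n = 25 are my own choices for illustration). It returns the grid coordinates and a matrix of estimated densities, and the contours are drawn from a surface of this kind.

```r
# MASS ships with standard R installations
set.seed(7)
x <- rnorm(300)
y <- 0.5 * x + rnorm(300, sd = 0.5)

# Estimate the joint density of x and y on a 25 x 25 grid
dens <- MASS::kde2d(x, y, n = 25)

# dens$x and dens$y hold the grid coordinates, dens$z the density values
grid_dim <- dim(dens$z)
```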

A smoothed line helps the eye see patterns in the presence of overplotting.

ggplot(data=visualData) +
  geom_smooth(mapping=aes(x=tuition, y= faculty_salary_avg))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

unnamed-chunk-8-1.png

We can also add a smoothed line to a scatterplot when it is difficult to see the dominant pattern.

ggplot(data=visualData,mapping=aes(x=tuition, y= faculty_salary_avg)) +
  geom_point() +
  geom_smooth() 
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

unnamed-chunk-9-1.png

If you’re not interested in the confidence band around the plot, you can turn it off with geom_smooth(se = FALSE).

ggplot(data=visualData,mapping=aes(x=tuition, y= faculty_salary_avg)) +
  geom_point() +
  geom_smooth(se = FALSE) 
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

unnamed-chunk-10-1.png

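The wiggliness of the line is controlled by the span parameter, which ranges from 0 (exceedingly wiggly) to 1 (not so wiggly). Note that span is only used by the loess smoother; it is ignored by the gam method that geom_smooth() selected automatically in the plots above, so here we request method = "loess" explicitly and use a span of 0.2.

ggplot(data=visualData,mapping=aes(x=tuition, y= faculty_salary_avg)) +
  geom_point() +
  geom_smooth(method = "loess", span = 0.2) 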

unnamed-chunk-11-1.png
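The effect of span is easiest to see by fitting the same data twice. The sketch below uses simulated data and a hypothetical helper, roughness(), that I define only for this comparison; it assumes the loess smoother, since that is the method that uses span.

```r
library(ggplot2)

# Simulated noisy sine wave
set.seed(42)
toy <- data.frame(x = seq(0, 10, length.out = 200))
toy$y <- sin(toy$x) + rnorm(200, sd = 0.5)

# Small span: each local fit uses few neighbours, so the curve is wiggly
p_wiggly <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_smooth(method = "loess", formula = y ~ x, span = 0.1, se = FALSE)

# Large span: each local fit uses all the data, so the curve is smooth
p_smooth <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_smooth(method = "loess", formula = y ~ x, span = 1, se = FALSE)

# Hypothetical helper: total absolute second difference of the fitted curve
roughness <- function(p) sum(abs(diff(layer_data(p)$y, differences = 2)))

rough_small_span <- roughness(p_wiggly)
rough_large_span <- roughness(p_smooth)
```

A larger roughness value means the fitted curve changes direction more often.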

An important argument to geom_smooth() is method, which selects the type of model used to fit the smooth curve. method = “loess”, the default for small n, uses a smooth local regression. Loess does not work well for large datasets, so an alternative smoothing algorithm is used when n is greater than 1,000. method = “gam” fits a generalised additive model provided by the mgcv package. We can also fit a linear model by specifying method = “lm”.

ggplot(data=visualData,mapping=aes(x=tuition, y= faculty_salary_avg)) +
  geom_point() +
  geom_smooth(method = "lm")

unnamed-chunk-12-1.png
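As a quick sanity check, the line drawn by method = "lm" should be the same as the one you would get from fitting lm() yourself. The sketch below illustrates this on simulated data (the toy data frame is an assumption for illustration).

```r
library(ggplot2)

# Simulated linear relationship with noise
set.seed(2)
toy <- data.frame(x = runif(100, 0, 10))
toy$y <- 3 + 2 * toy$x + rnorm(100)

p_lm <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# The smoother is the second layer; it holds the fitted line on a grid of x values
smooth_layer <- layer_data(p_lm, 2)

# Fit the same model by hand and predict at the same x values
fit <- lm(y ~ x, data = toy)
manual_fit <- predict(fit, newdata = data.frame(x = smooth_layer$x))
```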

As I mentioned earlier, we can also use geom_point() to compare categorical variables. I will quickly show such a plot and highlight the limitation of using geom_point() for categorical variables.

ggplot(data=visualData) +
  geom_point(mapping=aes(x=highest_degree, y=faculty_salary_avg))

unnamed-chunk-13-1.png

We observe that it is difficult to see the distribution because many points are plotted in the same location. There are a number of ways we can fix this problem. The first approach I will talk about is jittering.

Jittering uses geom_jitter() to add a little random noise to the data, which can help avoid overplotting.

ggplot(data=visualData) +
  geom_jitter(mapping=aes(x=highest_degree, y=faculty_salary_avg))

unnamed-chunk-14-1.png
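The amount of noise can be controlled with the width and height arguments of geom_jitter(). In the sketch below I use the mpg dataset that ships with ggplot2 (my choice for a self-contained example) and jitter only horizontally, so that the plotted y values remain the observed ones.

```r
library(ggplot2)

set.seed(10)  # jittering is random, so fix the seed for reproducibility

# width = 0.2 moves points at most 0.2 units left or right;
# height = 0 leaves the y values untouched
p_jitter <- ggplot(data = mpg) +
  geom_jitter(mapping = aes(x = drv, y = hwy), width = 0.2, height = 0)

jittered <- layer_data(p_jitter)

# Horizontal distance of each point from its category position (1, 2, 3)
x_shift <- abs(as.numeric(jittered$x) - round(as.numeric(jittered$x)))
```

Jittering only along the categorical axis is a common choice, because it avoids distorting the values being compared.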

The second approach is the boxplot, which uses geom_boxplot() and summarises the shape of the distribution with a handful of summary statistics.

ggplot(data=visualData) +
  geom_boxplot(mapping=aes(x=highest_degree, y=faculty_salary_avg))

unnamed-chunk-15-1.png

The last one I will show is the violin plot, which shows a compact representation of the “density” of the distribution, highlighting the locations where more points are found.

ggplot(data=visualData) +
  geom_violin(mapping=aes(x=highest_degree, y=faculty_salary_avg))

unnamed-chunk-16-1.png

Each method has its strengths and weaknesses. Boxplots summarise the distribution with a five-number summary, while jittered plots show every point but only work with relatively small datasets. Violin plots are very informative, but they rely on a density estimate, which can be tricky to interpret. So, the choice of technique is up to the researcher.
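To see what a boxplot actually summarises, the sketch below compares the box drawn by geom_boxplot() with quartiles computed by hand. It uses the mpg dataset that ships with ggplot2 as a stand-in (my choice for a self-contained example, not the college data).

```r
library(ggplot2)

p_box <- ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = drv, y = hwy))

# One row per box: ymin, lower, middle, upper and ymax are the
# whisker ends, the two hinges and the median
box_stats <- layer_data(p_box)

# The same statistics computed by hand for each group
manual_median <- tapply(mpg$hwy, mpg$drv, median)
manual_lower <- tapply(mpg$hwy, mpg$drv, quantile, probs = 0.25)
manual_upper <- tapply(mpg$hwy, mpg$drv, quantile, probs = 0.75)
```

The whiskers extend to the most extreme observations within 1.5 times the interquartile range of the hinges, and anything beyond that is drawn as an individual point.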

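Sometimes the best visual to show a trend is a line plot. This is often the case with time series data. To draw this graph, you can use geom_line() or geom_path(): geom_line() connects observations in order of the x variable, whereas geom_path() connects them in the order in which they appear in the data. We will use the economics dataset that ships with ggplot2.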

ggplot(data=economics) +
  geom_line(mapping=aes(x=date, y=uempmed))

unnamed-chunk-17-1.png

ggplot(data=economics,mapping=aes(x=date, y=uempmed)) +
  geom_path() 

unnamed-chunk-18-1.png
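The two plots look identical because the economics data is already sorted by date. The difference between the two geoms only shows up when the rows are not in x order, as in the small made-up example below (the toy data frame is hypothetical).

```r
library(ggplot2)

# Rows deliberately not sorted by x
toy <- data.frame(x = c(1, 3, 2, 5, 4), y = c(1, 2, 3, 4, 5))

# geom_line() connects the points from left to right
p_line <- ggplot(data = toy, mapping = aes(x = x, y = y)) + geom_line()

# geom_path() connects the points in row order
p_path <- ggplot(data = toy, mapping = aes(x = x, y = y)) + geom_path()

line_order <- layer_data(p_line)$x
path_order <- layer_data(p_path)$x
```

geom_path() is therefore the one to reach for when the order of the rows carries meaning, for example when tracing how two variables move together over time.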

At this point, I have covered almost all the two-dimensional geometries, but before we move forward I would like to briefly show you how to use group in our aesthetic. In some cases, it might be reasonable to compare data among different groups and still present them in the same visual. This is common in longitudinal studies with many subjects, where the plots are often descriptively called spaghetti plots. For this example, let us use a simple longitudinal dataset, Oxboys, from the nlme package. The data records the heights (height) and centered ages (age) of 26 boys (Subject), measured on nine occasions (Occasion).

data(Oxboys, package = "nlme")
head(Oxboys)
## Grouped Data: height ~ age | Subject
##   Subject     age height Occasion
## 1       1 -1.0000  140.5        1
## 2       1 -0.7479  143.4        2
## 3       1 -0.4630  144.8        3
## 4       1 -0.1643  147.1        4
## 5       1 -0.0027  147.7        5
## 6       1  0.2466  150.2        6
ggplot(data = Oxboys, mapping = aes(age, height, group = Subject)) +
  geom_point() +
  geom_line()

unnamed-chunk-19-1.png

The plot above shows the growth trajectory for each boy (i.e. each subject). Suppose we want to add a single smooth line showing the overall trend for all boys.

data(Oxboys, package = "nlme")
ggplot(data = Oxboys, mapping = aes(age, height, group = Subject)) +
  geom_line() +
  geom_smooth(method = "lm", se = FALSE)

unnamed-chunk-20-1.png

Oops, this is not what we wanted: we have just added a smoothed line for each boy. To reach our goal, we should not specify the grouping aesthetic in ggplot(), where it is applied to all layers. Instead, we specify it in geom_line() so that it applies only to the lines.

data(Oxboys, package = "nlme")
ggplot(data = Oxboys, mapping = aes(age, height)) +
  geom_line(mapping = aes(group = Subject)) +
  geom_smooth(method = "lm", se = FALSE)

unnamed-chunk-21-1.png
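One way to confirm where the grouping ends up is to count the groups that the smoothing layer sees in each version of the plot. The sketch below does this with layer_data(); the conversion to a plain data frame and the explicit formula = y ~ x are my own additions to keep the example quiet and self-contained.

```r
library(ggplot2)

data(Oxboys, package = "nlme")
oxboys <- as.data.frame(Oxboys)  # work with a plain data frame

# Grouping set in ggplot(): inherited by every layer
p_each <- ggplot(data = oxboys, mapping = aes(age, height, group = Subject)) +
  geom_line() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Grouping set in geom_line() only: the smoother sees all the data as one group
p_all <- ggplot(data = oxboys, mapping = aes(age, height)) +
  geom_line(mapping = aes(group = Subject)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# The smoother is the second layer in both plots
n_smooth_each <- length(unique(layer_data(p_each, 2)$group))
n_smooth_all <- length(unique(layer_data(p_all, 2)$group))
```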

In some cases, a plot may have a discrete x scale, but you still want to draw lines connecting across groups. For example, let’s draw boxplots of height at each measurement occasion.

data(Oxboys, package = "nlme")
ggplot(data = Oxboys, mapping = aes(Occasion, height)) +
  geom_boxplot() 

unnamed-chunk-22-1.png

To overlay lines that connect each individual, just adding geom_line() will not work. This is because the lines are drawn within each occasion, not across each subject.

data(Oxboys, package = "nlme")
ggplot(data = Oxboys, mapping = aes(Occasion, height)) +
  geom_boxplot() +
  geom_line(colour = "red", alpha = 0.5)

unnamed-chunk-23-1.png

To achieve what we want, we need to override the grouping to say that we want one line per boy.

data(Oxboys, package = "nlme")
ggplot(data = Oxboys, mapping = aes(Occasion, height)) +
  geom_boxplot() +
  geom_line(mapping = aes(group = Subject), colour = "red", alpha = 0.5)

unnamed-chunk-24-1.png

Statistical transformations

A statistical transformation, or stat, transforms the data, usually by summarising it in some manner. A useful example of a stat is the smoother, which calculates the smoothed mean of the response variable (y), conditional on the explanatory variable (x). Technically, we’ve already used many of ggplot2’s stats because they’re used behind the scenes to generate many important geoms, such as:

  • stat_bin(): geom_freqpoly(), geom_histogram()
  • stat_count(): geom_bar()
  • stat_bin2d(): geom_bin2d()
  • stat_bindot(): geom_dotplot()
  • stat_binhex(): geom_hex()
  • stat_boxplot(): geom_boxplot()
  • stat_contour(): geom_contour()
  • stat_quantile(): geom_quantile()
  • stat_smooth(): geom_smooth()
  • stat_sum(): geom_count()

There are some other statistical transformations that can’t be created with a geom function. For example,

  • stat_ecdf(): compute an empirical cumulative distribution plot.
  • stat_function(): compute y values from a function of x values.
  • stat_summary(): summarise y values at distinct x values.
  • stat_summary2d(), stat_summary_hex(): summarise binned values.
  • stat_qq(): perform calculations for a quantile-quantile plot.
  • stat_spoke(): convert angle and radius to position.
  • stat_unique(): remove duplicated rows.
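As a small illustration of one of these stats, the sketch below uses stat_function() to draw the standard normal density. The x range of -3 to 3, the number of points and the choice of dnorm are assumptions made only for this example.

```r
library(ggplot2)

# The data frame only supplies the range of x over which to draw the curve
curve_range <- data.frame(x = c(-3, 3))

# stat_function() evaluates fun on a grid of n x values and draws the result
p_fun <- ggplot(data = curve_range, mapping = aes(x = x)) +
  stat_function(fun = dnorm, n = 101)

# The computed curve: y should be dnorm() evaluated at each x
curve_data <- layer_data(p_fun)
```

Any function that takes a numeric vector and returns a vector of the same length can be passed as fun.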

Two ways to use the stat function:

  1. Add a stat_() function and override the default geom:
ggplot(data=visualData,mapping=aes(x=highest_degree, y=faculty_salary_avg)) +
  geom_point() +
  stat_summary(geom = "point", fun.y = "mean", colour = "red", size = 4)

unnamed-chunk-25-1.png

  2. Add a geom_() function and override the default stat:
ggplot(data=visualData,mapping=aes(x=highest_degree, y=faculty_salary_avg)) +
  geom_point() +
  geom_point(stat = "summary", fun.y = "mean", colour = "red", size = 4)

unnamed-chunk-26-1.png

We can also look at the empirical cumulative distribution of the average faculty salary in our data using stat_ecdf():

ggplot(data=visualData) +
  stat_ecdf(mapping = aes(faculty_salary_avg)) 

unnamed-chunk-27-1.png
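The curve drawn by stat_ecdf() should agree with base R’s ecdf() function. The sketch below checks this on simulated values (the toy data frame and its column name value are made up for illustration).

```r
library(ggplot2)

# Simulated right-skewed values
set.seed(5)
toy <- data.frame(value = rexp(200, rate = 1 / 5000))

p_ecdf <- ggplot(data = toy) +
  stat_ecdf(mapping = aes(x = value))

# Computed step function; by default it is padded with -Inf and Inf
ecdf_data <- layer_data(p_ecdf)
ecdf_data <- ecdf_data[is.finite(ecdf_data$x), ]

# Base R equivalent evaluated at the same x values
manual_ecdf <- ecdf(toy$value)(ecdf_data$x)
```

Each y value is the proportion of observations less than or equal to the corresponding x.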

Let’s explore the stat_summary function more to produce some interesting visuals.

ggplot(data=visualData,mapping=aes(x=highest_degree, y=faculty_salary_avg)) +
  stat_summary()
## No summary function supplied, defaulting to `mean_se()

unnamed-chunk-28-1.png

mean_se() is the default function for stat_summary(), and it returns the mean together with the mean minus and plus one standard error. We can change the geom to produce some other nice plots.

ggplot(data=visualData,mapping=aes(x=highest_degree, y=faculty_salary_avg)) +
  stat_summary(geom = "crossbar") 
## No summary function supplied, defaulting to `mean_se()

unnamed-chunk-29-1.png

ggplot(data=visualData,mapping=aes(x=highest_degree, y=faculty_salary_avg)) +
  stat_summary(geom = "errorbar") 
## No summary function supplied, defaulting to `mean_se()

unnamed-chunk-29-2.png

ggplot(data=visualData,mapping=aes(x=highest_degree, y=faculty_salary_avg)) +
  stat_summary(geom = "linerange") 
## No summary function supplied, defaulting to `mean_se()

unnamed-chunk-29-3.png

ggplot(data=visualData,mapping=aes(x=highest_degree, y=faculty_salary_avg)) +
  stat_summary(geom = "pointrange") 
## No summary function supplied, defaulting to `mean_se()

unnamed-chunk-29-4.png

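I will wrap up this session by showing how you can specify fun.y, fun.ymin and fun.ymax with any function, for example to plot the mean ± SD. Note: you can also write the functions outside ggplot() and pass the stored objects to fun.y, fun.ymin and fun.ymax. From ggplot2 3.3.0 onwards, these arguments are named fun, fun.min and fun.max, and the old names are deprecated.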

ggplot(data=visualData,mapping=aes(x=highest_degree, y=faculty_salary_avg)) +
  stat_summary(fun.y = mean,
               fun.ymax = function(x) mean(x) + sd(x), 
               fun.ymin = function(x) mean(x) - sd(x),
               geom = "pointrange") 

unnamed-chunk-30-1.png
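An alternative that I find tidy is to bundle the three summaries into one function and pass it through the fun.data argument, which expects a function returning a data frame with the columns y, ymin and ymax. The sketch below does this with a helper I call mean_sd (a name made up for this example) on the mpg dataset that ships with ggplot2.

```r
library(ggplot2)

# Hypothetical helper, written outside ggplot(): mean with mean - SD and mean + SD
mean_sd <- function(x) {
  data.frame(
    y = mean(x),
    ymin = mean(x) - sd(x),
    ymax = mean(x) + sd(x)
  )
}

p_mean_sd <- ggplot(data = mpg, mapping = aes(x = drv, y = hwy)) +
  stat_summary(fun.data = mean_sd, geom = "pointrange")

# One row per category with the computed y, ymin and ymax
summary_data <- layer_data(p_mean_sd)

# The same quantities computed by hand for comparison
manual_mean <- tapply(mpg$hwy, mpg$drv, mean)
manual_sd <- tapply(mpg$hwy, mpg$drv, sd)
```

Because the helper is an ordinary R function, it can be tested on its own before it is used in a plot.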

Understanding the grammar of ggplot2, and how its components fit together, allows you to create a wider range of visualizations, combine multiple sources of data, and customise to your heart’s content. I will leave you with the plot below, where I simply combined several of the layers covered in this post.

ggplot(data=visualData,mapping=aes(x=highest_degree, y=faculty_salary_avg)) +
  stat_summary(fun.y = mean,
               fun.ymax = function(x) max(x), 
               fun.ymin = function(x) min(x),
               geom = "pointrange") +
  geom_jitter() +
  geom_violin() +
  stat_summary(geom = "point", fun.y = "mean", colour = "red", size = 4) 

unnamed-chunk-31-1.png

In this post, I have covered the basics of two-dimensional geometries. In my next post, I will cover scales, axes, legends, positioning and themes. Also, I will briefly introduce how to program with ggplot2. Until then, you can go ahead and pick different variables to produce the plots we have covered in this session.

Reference

  • Wickham H. (2016). ggplot2: Elegant Graphics for Data Analysis (Use R!), 2nd Edition. Springer, New York

For any questions or contributions, please feel free to email ayilarof@myumanitoba.ca