Introduction

In this training module, I will introduce readers to the grammar of graphics in R, with a focus on ggplot2. This package was written by Hadley Wickham and Winston Chang. The repository can be accessed through this link. ggplot2 is part of the tidyverse and is useful for both simple and complex data visualization. The tidyverse makes it easier to load different types of datasets in R and allows quick data manipulation before visualization.

If you don’t have tidyverse, you can install it with the code below:

# install.packages("tidyverse")
library("tidyverse")
## ── Attaching packages ────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.6
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0

## ── Conflicts ───────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

We will be using the United States college data, which consists of the school name, city, state, region, highest degree offered by each school, average SAT scores, tuition and so on. Before we move too far, I would like to let you know that this dataset is not mine. I found it online and it can be accessed through this link.

Let’s import our data and store it in R under the variable name visualData:

visualData <- read_csv('https://olawaleayilara.github.io/visualData.csv')
## Parsed with column specification:
## cols(
##   id = col_integer(),
##   name = col_character(),
##   city = col_character(),
##   state = col_character(),
##   region = col_character(),
##   highest_degree = col_character(),
##   control = col_character(),
##   gender = col_character(),
##   admission_rate = col_double(),
##   sat_avg = col_integer(),
##   undergrads = col_integer(),
##   tuition = col_integer(),
##   faculty_salary_avg = col_integer(),
##   loan_default_rate = col_character(),
##   median_debt = col_double(),
##   lon = col_double(),
##   lat = col_double()
## )
head(visualData)
## # A tibble: 6 x 17
##       id name  city  state region highest_degree control gender
##    <int> <chr> <chr> <chr> <chr>  <chr>          <chr>   <chr> 
## 1 102669 Alas… Anch… AK    West   Graduate       Private CoEd  
## 2 101648 Mari… Mari… AL    South  Associate      Public  CoEd  
## 3 100830 Aubu… Mont… AL    South  Graduate       Public  CoEd  
## 4 101879 Univ… Flor… AL    South  Graduate       Public  CoEd  
## 5 100858 Aubu… Aubu… AL    South  Graduate       Public  CoEd  
## 6 100663 Univ… Birm… AL    South  Graduate       Public  CoEd  
## # ... with 9 more variables: admission_rate <dbl>, sat_avg <int>,
## #   undergrads <int>, tuition <int>, faculty_salary_avg <int>,
## #   loan_default_rate <chr>, median_debt <dbl>, lon <dbl>, lat <dbl>

Take time to play with the data to ensure it is clean enough to proceed to visualization. In this case, our data is reasonably clean. Therefore, we can proceed to the next stage.
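
As a rough sketch of what "playing with the data" can look like, the code below checks column types and missing values. It uses a small made-up data frame called colleges as a stand-in for visualData (the names and values are hypothetical), and it assumes that a numeric column may be stored as text, which is one possible reason for loan_default_rate being parsed as a character column above.

```r
# Hypothetical stand-in for a few columns of visualData (values are made up)
colleges <- data.frame(
  name = c("School A", "School B", "School C", "School D"),
  tuition = c(12000L, 34000L, NA, 8000L),
  loan_default_rate = c("0.05", "0.11", "NULL", "0.08"),
  stringsAsFactors = FALSE
)

# Column types and a quick summary of one numeric column
str(colleges)
summary(colleges$tuition)

# Number of missing values in each column
missing_per_column <- colSums(is.na(colleges))
print(missing_per_column)

# A numeric column stored as text has to be converted before it can be
# plotted as a number; entries that are not numbers become NA
# (the conversion warning is suppressed here)
colleges$loan_default_rate <- suppressWarnings(as.numeric(colleges$loan_default_rate))
str(colleges$loan_default_rate)
```

The same calls can be run on visualData itself once it has been imported.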

Basic Components of ggplot2

The four key components that make up every visual in ggplot2 are: i. data, ii. geometries: geom functions (shapes or layers describing how to visualize the data), iii. aesthetics: visual properties of geometries, such as size and color, iv. scales: how data values are mapped to aesthetic values.

Our focus in this module will be on these four basic components. However, it is important to know that there are other optional components that control the visualization of the plots. The basic syntax for ggplot is:

ggplot(data = dataset, mapping = aes()) + 
   geom_function()

OR

ggplot(data = dataset) + 
   geom_function(mapping = aes())

We will see the difference in the two syntaxes as we move on. Now, let’s create a simple scatter plot using this grammar.

ggplot(data=visualData) +
  geom_point(mapping=aes(x=tuition, y=sat_avg))


In ggplot2, the + sign allows us to add different layers to our plot. We can add other layers to accommodate other data, statistical summaries, or metadata to make more complex plots. For example, we can add a line geometry:

ggplot(data=visualData) +
  geom_point(mapping=aes(x=tuition, y=sat_avg)) +
  geom_line(mapping=aes(x=tuition, y=sat_avg))


Since the mapping is the same for the two geometries, we can specify it directly in the ggplot function to keep our code simple and we will still obtain the same plot.

ggplot(data = visualData, mapping=aes(x=tuition, y=sat_avg)) +
  geom_line() +
  geom_point()
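
The difference between the two syntaxes shows up when the layers need different mappings. The sketch below is one way to see it: a mapping given in ggplot() is inherited by every layer, while a mapping given inside a geom applies to that layer only. The colleges data frame is simulated and hypothetical, standing in for visualData, and the color aesthetic is used here only to make the difference visible.

```r
library(ggplot2)

# Simulated, hypothetical stand-in for visualData
set.seed(2)
colleges <- data.frame(
  tuition = round(runif(60, 5000, 50000)),
  control = rep(c("Public", "Private"), times = 30)
)
colleges$sat_avg <- round(900 + 0.008 * colleges$tuition + rnorm(60, sd = 60))

p_mapping <- ggplot(data = colleges, mapping = aes(x = tuition, y = sat_avg)) +
  geom_line() +                               # inherits x and y only
  geom_point(mapping = aes(color = control))  # inherits x and y, adds color
p_mapping
```

Here the line is drawn in a single color, while the points are colored by control, because the color mapping belongs to geom_point() alone.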


An interesting feature of ggplot2 is that we can save our data and aesthetic mappings to an object, and call this object with different geoms or other layers. This can be very useful if you want to explore how different geoms work to visualize your data, or adjust different aspects of the plot.

SAT <- ggplot(data = visualData, mapping=aes(x=tuition, y=sat_avg))
SAT + geom_point()
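
To illustrate the reuse, here is a minimal sketch in which one saved object is combined with two different geoms. The colleges data frame and the object names are hypothetical, simulated stand-ins for visualData and SAT.

```r
library(ggplot2)

# Simulated, hypothetical stand-in for visualData
set.seed(3)
colleges <- data.frame(tuition = round(runif(50, 5000, 50000)))
colleges$sat_avg <- round(900 + 0.008 * colleges$tuition + rnorm(50, sd = 60))

# Data and aesthetic mappings are stored once ...
base_plot <- ggplot(data = colleges, mapping = aes(x = tuition, y = sat_avg))

# ... and reused with different geoms
as_points <- base_plot + geom_point()
as_lines <- base_plot + geom_line()

as_points
as_lines
```

Adding a layer returns a new plot object, so base_plot itself stays free of layers and can be reused as often as needed.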


One dimensional geoms, colors and shapes

Geoms are essential components of ggplot2: they allow us to specify what type of plot we wish to draw. For example, in our previous code, the function geom_point draws a scatterplot, which shows the relationship between x and y values with points. Geoms can be categorized by the number of dimensions they map, from 1D to 2D, or even 3D. We can also distinguish individual geoms, which map each observation to one element of the plot, from collective geoms, which group observations to summarize aspects of the data (e.g. a boxplot).

Let us look at some of the geoms suitable for visualizing single continuous variables. A histogram is a very simple way to examine the distribution of any single continuous variable.

ggplot(data = visualData, mapping=aes(x = tuition)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


We can change the default number of bins with the bins argument, or the width of the bins with the binwidth argument, in a single line of code.

ggplot(data = visualData) +
  geom_histogram(mapping=aes(x=tuition), bins = 5)


ggplot(data = visualData) +
  geom_histogram(mapping=aes(x=tuition), bins = 100)
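
The examples above set the number of bins. As a sketch of the other option, the code below sets the width of the bins with binwidth and then inspects the bins that ggplot2 computed. The colleges data frame is simulated and hypothetical, and the width of 5000 is an arbitrary choice for illustration.

```r
library(ggplot2)

# Simulated, hypothetical stand-in for visualData
set.seed(4)
colleges <- data.frame(tuition = round(runif(300, 5000, 50000)))

# Each bar now covers a range of 5000 on the tuition axis
p_binwidth <- ggplot(data = colleges) +
  geom_histogram(mapping = aes(x = tuition), binwidth = 5000)
p_binwidth

# layer_data() returns the values computed for a layer,
# here the limits and the count of each bin
hist_data <- layer_data(p_binwidth)
hist_data[, c("xmin", "xmax", "count")]
```

Setting binwidth is often easier to interpret than bins, because the width is expressed in the units of the variable.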


Density plots, which are similar to histograms, allow us to smooth out the distribution of the data instead of binning it. We can increase or decrease the amount of smoothing using the adjust argument. Values less than 1 result in less smoothing, and values greater than 1 result in more smoothing.

ggplot(data = visualData) +
  geom_density(mapping=aes(x=tuition))


ggplot(data = visualData) +
  geom_density(mapping=aes(x=tuition), adjust = 0.05)


ggplot(data = visualData) +
  geom_density(mapping=aes(x=tuition), adjust = 2)
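
One way to see what adjust does is to look at the bandwidth of the kernel density estimate. The sketch below uses base R's density() function on simulated, hypothetical tuition values, on the assumption that geom_density() treats adjust the same way, as a multiplier on the default bandwidth.

```r
# Simulated, hypothetical tuition values
set.seed(5)
tuition <- round(runif(300, 5000, 50000))

# Bandwidth chosen by default, and with the two adjust values used above
default_bw <- density(tuition)$bw
rough_bw <- density(tuition, adjust = 0.05)$bw
smooth_bw <- density(tuition, adjust = 2)$bw

# adjust = 0.05 shrinks the bandwidth, adjust = 2 doubles it
c(default = default_bw, rough = rough_bw, smooth = smooth_bw)
```

A smaller bandwidth follows individual observations more closely, which is why the curve with adjust = 0.05 looks so jagged.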


Frequency polygons display counts with lines. They are more suitable when you want to compare the distribution across the levels of a categorical variable.

ggplot(data = visualData) +
  geom_freqpoly(mapping=aes(x=tuition))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


Also, we can lay our frequency polygon over the histogram:

ggplot(data = visualData) +
  geom_histogram(mapping=aes(x=tuition), bins = 100) +
  geom_freqpoly(mapping=aes(x=tuition))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
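
In the code above the histogram uses 100 bins while the frequency polygon falls back to the default of 30, so the two layers summarize the data differently. A possible variation, sketched here on simulated and hypothetical data, gives both layers the same number of bins so that the polygon traces the tops of the bars. The grey fill is only there to keep the line visible.

```r
library(ggplot2)

# Simulated, hypothetical stand-in for visualData
set.seed(6)
colleges <- data.frame(tuition = round(runif(300, 5000, 50000)))

# Both layers share the mapping and use the same number of bins
p_overlay <- ggplot(data = colleges, mapping = aes(x = tuition)) +
  geom_histogram(bins = 30, fill = "grey80") +
  geom_freqpoly(bins = 30)
p_overlay

# Both layers count every observation exactly once
hist_total <- sum(layer_data(p_overlay, 1)$count)
poly_total <- sum(layer_data(p_overlay, 2)$count)
c(histogram = hist_total, freqpoly = poly_total)
```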


Quantile-quantile plots can be used to assess how well the distribution of a continuous variable fits a theoretical distribution (the normal distribution by default). The geom_qq() function requires a different aesthetic mapping than what we have seen before - instead of x we need to specify a sample mapping.

ggplot(data = visualData) + 
   geom_qq(mapping = aes(sample=tuition))


We can add a line that passes through the points at specified quantiles of the theoretical and sample distributions (by default, the first and third quartiles).

ggplot(data = visualData, mapping = aes(sample=tuition)) + 
   geom_qq() +
   geom_qq_line()
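
The comparison does not have to be against the normal distribution. As a sketch, the code below draws the same kind of plot for simulated, hypothetical right-skewed values, first against the default normal distribution and then against an exponential distribution through the distribution argument of geom_qq() and geom_qq_line().

```r
library(ggplot2)

# Simulated, hypothetical right-skewed values
set.seed(7)
skewed <- data.frame(value = rexp(200, rate = 1 / 20000))

# Against the normal distribution (the default): points bend away from the line
p_qq_normal <- ggplot(data = skewed, mapping = aes(sample = value)) +
  geom_qq() +
  geom_qq_line()
p_qq_normal

# Against an exponential distribution: points should follow the line more closely
p_qq_exp <- ggplot(data = skewed, mapping = aes(sample = value)) +
  geom_qq(distribution = stats::qexp) +
  geom_qq_line(distribution = stats::qexp)
p_qq_exp

# The y values of the points are the sorted sample values
qq_data <- layer_data(p_qq_normal, 1)
head(qq_data[, c("x", "y")])
```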


We can visualize the distribution of single discrete variables as well using bar plots.

ggplot(data = visualData) + 
   geom_bar(mapping = aes(x=highest_degree))
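
By default, geom_bar() counts the number of rows in each category, so the bar heights should match a simple frequency table. The sketch below checks this on a small made-up data frame (the degrees data frame and its counts are hypothetical).

```r
library(ggplot2)

# Hypothetical data: 3 Associate, 5 Bachelor and 8 Graduate schools
degrees <- data.frame(
  highest_degree = rep(c("Associate", "Bachelor", "Graduate"), times = c(3, 5, 8))
)

p_bar <- ggplot(data = degrees) +
  geom_bar(mapping = aes(x = highest_degree))
p_bar

# Heights computed by geom_bar(), next to a base R frequency table
bar_counts <- layer_data(p_bar)$count
bar_counts
table(degrees$highest_degree)
```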


Let’s make our plots more interesting to read by adding colors and fills.

ggplot(data = visualData) +
  geom_histogram(mapping=aes(x=tuition, fill = control, color = control))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


ggplot(data = visualData) +
  geom_density(mapping=aes(x=tuition, fill = control, color = control))
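
Filled density curves can hide one another where they overlap. A possible remedy, sketched here on simulated and hypothetical data, is to make the fill partly transparent with alpha. Note that alpha is written outside aes(): aesthetics inside aes() are mapped to a variable, while those outside are set to one constant value for the whole layer. The value 0.4 is an arbitrary choice.

```r
library(ggplot2)

# Simulated, hypothetical stand-in for visualData
set.seed(8)
colleges <- data.frame(
  tuition = c(runif(100, 5000, 20000), runif(100, 15000, 50000)),
  control = rep(c("Public", "Private"), each = 100)
)

p_alpha <- ggplot(data = colleges) +
  geom_density(
    mapping = aes(x = tuition, fill = control, color = control),  # mapped to a variable
    alpha = 0.4                                                   # set to a constant
  )
p_alpha

# Two fill colors (one per group), but a single alpha value
density_data <- layer_data(p_alpha)
unique(density_data$fill)
unique(density_data$alpha)
```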


ggplot(data = visualData) +
  geom_freqpoly(mapping=aes(x=tuition, color = control))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


ggplot(data = visualData) + 
   geom_bar(mapping = aes(x=highest_degree, fill = control))
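
All of the plots so far rely on ggplot2's default scales, the fourth component listed earlier. As a minimal sketch of how a scale can be changed, the code below picks the fill colors by hand and renames the y axis. The colleges data frame, the color names, and the axis title are hypothetical choices for illustration.

```r
library(ggplot2)

# Hypothetical data: 20 schools with a degree level and a control type
colleges <- data.frame(
  highest_degree = rep(c("Associate", "Bachelor", "Graduate"), times = c(4, 6, 10)),
  control = rep(c("Public", "Private"), times = 10)
)

p_scaled <- ggplot(data = colleges) +
  geom_bar(mapping = aes(x = highest_degree, fill = control)) +
  # Scale for the fill aesthetic: which color each value of control receives
  scale_fill_manual(values = c(Public = "steelblue", Private = "tomato")) +
  # Scale for the y position: here only the axis title is changed
  scale_y_continuous(name = "Number of schools")
p_scaled

# The fill colors actually used in the plot
fills <- unique(layer_data(p_scaled)$fill)
fills
```

Each aesthetic has its own family of scale functions, named after the aesthetic they control (scale_fill_*, scale_color_*, scale_x_*, and so on).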


In this post, I have covered the basics of one-dimensional geometries. In my next post, I will cover two-dimensional geometries, and briefly introduce how we can format the appearance of our plots. Before then, you can go ahead and pick different variables to produce the plots we have covered in this session.

Please feel free to ask questions via email ayilarof@myumanitoba.ca. Also, let me know if you are interested in posting on my page.