Hadley Wickham on Data Visualization and Data Science

Data Visualization
ggplot2
tidyverse
Prepare for Class 05
Author

Eli Evans

Published

September 25, 2022

EMBL Keynote Lecture on Data Visualization and Data Science

A recording of Hadley Wickham's keynote lecture is available on YouTube.

Key technologies and techniques

Wickham mentions the lattice package, which they used on biology data and found many issues with. After reading The Grammar of Graphics, they created ggplot as an open-source implementation of the principles it outlines. For reshaping data, Wickham created the reshape, reshape2, and tidyr packages. They also mention the plyr package, whose role is now filled by dplyr and purrr. The tidyverse organizes ggplot2, dplyr, purrr, and tidyr around shared conventions so that each new package is easier to learn. The gganimate package animates ggplot2 plots over time, and the ggseg package visualizes regions of the brain using brain atlases.

Notes

  • Wickham defines “tidy” data as data that is not necessarily clean but is organized for analysis, with variables as columns and observations as rows.

  • A central idea of data visualization is the mapping from variables in the data to properties you can perceive, such as position, color, and size.

  • Code is powerful because it is text, which makes it easy to share, collaborate on, and rerun. Because it is a series of steps, debugging is easier: you can see every step from the last run, edit it, and iterate. The same steps can be reused when the data updates, since rerunning the existing code regenerates the plots.
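
To make the notes above concrete, here is a minimal sketch in R of tidying a small table and then mapping its variables to visual properties. The data frame, its column names (gene, control, treated), and the values are all made up for illustration and do not come from the talk. The sketch assumes the tidyr and ggplot2 packages are installed.

```r
library(tidyr)
library(ggplot2)

# Hypothetical "wide" data: one row per gene, one column per condition.
# The values are invented purely for illustration.
wide <- data.frame(
  gene    = c("A", "B", "C"),
  control = c(1.2, 3.4, 2.1),
  treated = c(2.3, 3.1, 4.0)
)

# Tidy form: each variable is a column, each observation is a row.
tidy <- pivot_longer(
  wide,
  cols      = c(control, treated),
  names_to  = "condition",
  values_to = "expression"
)

# Map variables to things you can perceive:
# gene -> x position, expression -> bar height, condition -> fill color.
p <- ggplot(tidy, aes(x = gene, y = expression, fill = condition)) +
  geom_col(position = "dodge")

p
```

Because the whole analysis is a script, rerunning it after the values in wide change regenerates the plot with no manual steps, which is the reuse benefit described in the last note.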

Summary

Wickham’s keynote is directed at people who currently visualize data through more traditional means, such as pen and paper, and introduces them to the possibility of using code to create visualizations. They emphasize the goal of the tidyverse: to make visualization code pronounceable and useful results easy to achieve. Wickham also discusses the process of creating visualizations, in which early plots require little code and help those already familiar with the data, while later plots for the general public require more code. Wickham additionally stresses the reproducibility and long-term benefits of using code to create visualizations, and showcases emerging packages that support interactivity and specific domains.