Today we’ll be looking at creating some data visualizations in both Plotly and the Javascript-based visualization library CanvasXpress. Both are robust libraries with lots of similar features, so it’ll be fun to see what it’s like to try and recreate certain plots in both.
First, let’s load the appropriate libraries. We’ll need the Tidyverse for some data wrangling.
library(tidyverse)
library(plotly)
library(canvasXpress)
Our dataset today was obtained from the USDA’s 2022 release of the annual Honey Bee Colonies report (https://usda.library.cornell.edu/concern/publications/rn301137d?locale=en). We can grab some of the cleaned data from the TidyTuesday Github Repository.
# get data from tidytuesday's github
colony_data <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-11/colony.csv')
stressor_data <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-11/stressor.csv')
Let’s start off with a basic visualization of which states have the most bee colonies and how many they’ve lost in total since 2015. We’ll be using scatterplots to visualize this data to get a sense of how the different libraries handle basic visualization.
# prepare data
colony_data_grouped <- colony_data %>%
mutate(state = as.factor(state)) %>%
select(c("state", "colony_max", "colony_lost", "colony_n")) %>%
group_by(state) %>%
na.omit() %>%
summarize_all(sum)
# take a look at the data
head(colony_data_grouped)
## # A tibble: 6 × 4
## state colony_max colony_lost colony_n
## <fct> <dbl> <dbl> <dbl>
## 1 Alabama 214000 32710 197500
## 2 Arizona 734500 132300 706500
## 3 Arkansas 561000 84300 539500
## 4 California 29830000 3456000 23510000
## 5 Colorado 632500 94080 489500
## 6 Connecticut 89400 6450 87900
All Plotly plots first begin with initializing a Plotly object using
plot_ly()
. After this is done you then build upon the
Plotly object with your plot elements. We have added in an additional
fit line so we can better visualize the trend. In Plotly we do this by
creating a linear model and then plot it using add_lines()
.
We then add our aesthetic configurations at the end with
layout()
.
# build initial Plotly plot
fig <- plot_ly(colony_data_grouped,
x = ~colony_lost,
y = ~colony_n,
type = 'scatter',
mode = 'markers')
# add linear regression model
fit <- lm(colony_n ~ colony_lost, data = colony_data_grouped)
fig <- fig %>% add_lines(x = ~colony_lost, y = fitted(fit))
# add layout arguments and aesthetics
fig %>% layout(yaxis = list(title = "Number of Bee Colonies"),
xaxis = list(title = "Number of Colonies Lost"),
title = list(text = "Total Number of Bee Colonies by State Since 2015"),
showlegend = FALSE)
Now, let’s try recreating something similar in CanvasXpress.
CanvasXpress was built initially with biological data in mind, so it tends to work better with wide data formats. An example of this is having samples as columns, and variables as rows. CanvasXpress uses rownames to match up the data, so reshaping the data to emphasize row and column names are important. If you would like to learn more about wide datasets, you can check out this article.
# group data
colony_data_grouped <- colony_data %>%
mutate(state = as.factor(state)) %>%
select(c("state", "colony_max", "colony_lost", "colony_n")) %>%
group_by(state) %>%
na.omit() %>%
summarize_all(sum)
# reshaping the data so that states are rownames
cx_colony_data_grouped <- colony_data_grouped %>%
select(c(colony_n, colony_lost, state)) %>%
column_to_rownames("state")
# take a look at the data
head(cx_colony_data_grouped)
## colony_n colony_lost
## Alabama 197500 32710
## Arizona 706500 132300
## Arkansas 539500 84300
## California 23510000 3456000
## Colorado 489500 94080
## Connecticut 87900 6450
Unlike Plotly, CanvasXpress plots can all be configured using the
canvasXpress()
function to build our plot object.
CanvasXpress can take yAxis
and xAxis
arguments, but I personally found it easier to model the data and then
feed it in. CanvasXpress has a very useful function to add a regression
line to the plot without the need for a pre-existing model. This is
because of the extensive data modelling we did earlier, which makes it
easier for CanvasXpress to make assumptions about our data.
canvasXpress(data = cx_colony_data_grouped,
graphtype = "Scatter2D",
afterRender = list(list("addRegressionLine")),
title = "Total Number of Bee Colonies by State Since 2015",
yAxisTitle = list("Number of Bee Colonies"),
xAxisTitle = list("Number of Colonies Lost"))
We can see that although CanvasXpress requires a bit more mindfulness with regards to data wrangling, we don’t have to go through the layering approach that Plotly typically requires.
Let’s try to create a stacked bar chart. For this one, let’s take a look at the stressors involved in bee colony losses and their percent contributions. For the Plotly visualization we needed to do some wrangling of the data to used summarised values.
# prepare the data
stressor_data_filtered <- stressor_data %>%
mutate(stressor = as.factor(stressor_data$stressor)) %>%
group_by(stressor, state, year) %>%
summarise(pct_avg = mean(stress_pct, na.rm = TRUE)) %>% # quarterly average
group_by(stressor, year) %>%
summarise(pct_avg = mean(pct_avg, na.rm = TRUE)) %>% # average across states
pivot_wider(names_from = "stressor", values_from = "pct_avg") %>%
rename('Diseases' = 'Disesases')
# take a look at the data
head(stressor_data_filtered)
## # A tibble: 6 × 7
## year Diseases Other `Other pests/parasites` Pesticides Unknown `Varroa mites`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015 3.67 6.32 11.2 7.76 4.18 27.2
## 2 2016 4.15 5.78 10.4 8.06 4.37 29.3
## 3 2017 5.87 6.61 11.2 8.06 4.37 33.0
## 4 2018 4.06 7.80 14.4 8.36 4.66 34.9
## 5 2019 4.32 6.45 12.3 8.32 4.05 31.7
## 6 2020 3.52 5.28 9.83 5.73 4.47 28.9
Next, in we will build our plotting object and add the traces for the
individual factors manually. The layering here can be useful for
controlling the order of the stacks. I personally found it a little
cumbersome to repeat these traces, but I’m sure you could use an
lapply()
or for loop to make this easier on yourself.
fig <- plot_ly(stressor_data_filtered,
x = ~year,
y = ~`Varroa mites`,
type = "bar",
name = "Varroa mites")
# add stacked bars through traces
fig <- fig %>% add_trace(y = ~`Other pests/parasites`,
name = "Other pests/parasites")
fig <- fig %>% add_trace(y = ~Pesticides,
name = "Pesticides")
fig <- fig %>% add_trace(y = ~`Diseases`,
name = "Diseases")
fig <- fig %>% add_trace(y = ~Other,
name = "Other")
fig <- fig %>% add_trace(y = ~Unknown,
name = "Unknown")
fig %>% layout(yaxis = list(title = "Average Stress PCT %"),
barmode = 'stack')
Now, for the CanvasXpress version! In addition to the data wrangling that was included for the Plotly visualization, you can see that we have also had to pivot the data to be in wide format.
# prepare the data
stressor_data_filtered <- stressor_data %>%
mutate(stressor = as.factor(stressor_data$stressor)) %>%
group_by(stressor, state, year) %>%
summarise(pct_avg = mean(stress_pct, na.rm = TRUE)) %>% # quarterly average
group_by(stressor, year) %>%
summarise(pct_avg = mean(pct_avg, na.rm = TRUE)) %>% # average across states
pivot_wider(names_from = "stressor", values_from = "pct_avg") %>%
rename('Diseases' = 'Disesases')
# setting rownames to stressors and columns to year
cx_stressor_data_filtered <- stressor_data_filtered %>%
pivot_longer(Diseases:`Varroa mites`) %>%
pivot_wider(names_from = year) %>%
column_to_rownames('name')
# take a look at the data
head(cx_stressor_data_filtered)
## 2015 2016 2017 2018 2019
## Diseases 3.667572 4.152482 5.868794 4.064007 4.323404
## Other 6.317199 5.781206 6.614184 7.795745 6.450709
## Other pests/parasites 11.219149 10.362766 11.248582 14.437589 12.277305
## Pesticides 7.760993 8.062943 8.056522 8.364539 8.323333
## Unknown 4.183156 4.372518 4.373404 4.660638 4.045390
## Varroa mites 27.172872 29.330142 33.002128 34.948759 31.731206
## 2020 2021
## Diseases 3.519858 3.058333
## Other 5.275709 5.774468
## Other pests/parasites 9.834574 10.315116
## Pesticides 5.725189 5.440000
## Unknown 4.474291 3.936905
## Varroa mites 28.882447 27.784043
Looking at the data, we can see the format looks much cleaner and simplified compared to what we were able to give Plotly. As a result our actual plotting code below is fairly straightforward.
canvasXpress(data = cx_stressor_data_filtered,
graphType = "Stacked",
graphOrientation = "vertical",
title = "Stressor Contributors to Colony Loss",
xAxisTitle = "Average Stress PCT %")
Our stacked bar charts here are pretty comparable and look solid. The approach towards the same type of chart here was very different depending on which package you were using.
Overall, both libraries did a fantastic job at creating the visualizations we wanted. One thing that did surprise me was how different the Plotly code looked depending on what type of chart I was making. In contrast, although CanvasXpress was more particular with it’s data formatting, the actual plotting functions themselves looked very consistent.
Plotly R
|
CanvasXpress
|
Stick around for part two of this article, where we do a deeper dive into more complex plots!
Interested in the code used in this article? Check out the raw versions here on Github.