A Deep Dive into Bee Colony Loss with CanvasXpress and Plotly in R

Today we’ll be looking at creating some data visualizations in both Plotly and the Javascript-based visualization library CanvasXpress. Both are robust libraries with lots of similar features, so it’ll be fun to see what it’s like to try and recreate certain plots in both.

First, let’s load the appropriate libraries. We’ll need the Tidyverse for some data wrangling.

library(tidyverse)
library(plotly)
library(canvasXpress)

Our dataset today was obtained from the USDA’s 2022 release of the annual Honey Bee Colonies report (https://usda.library.cornell.edu/concern/publications/rn301137d?locale=en). We can grab some of the cleaned data from the TidyTuesday Github Repository.

# get data from tidytuesday's github 
colony_data <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-11/colony.csv')

stressor_data <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-11/stressor.csv')


Scatterplots

Let’s start off with a basic visualization of which states have the most bee colonies and how many they’ve lost in total since 2015. We’ll be using scatterplots to visualize this data to get a sense of how the different libraries handle basic visualization.

# prepare data 
colony_data_grouped <- colony_data %>%
    mutate(state = as.factor(state)) %>%
    select(c("state", "colony_max", "colony_lost", "colony_n")) %>%
    group_by(state) %>%
    na.omit() %>%
    summarize_all(sum)

# take a look at the data
head(colony_data_grouped)
## # A tibble: 6 × 4
##   state       colony_max colony_lost colony_n
##   <fct>            <dbl>       <dbl>    <dbl>
## 1 Alabama         214000       32710   197500
## 2 Arizona         734500      132300   706500
## 3 Arkansas        561000       84300   539500
## 4 California    29830000     3456000 23510000
## 5 Colorado        632500       94080   489500
## 6 Connecticut      89400        6450    87900

All Plotly plots first begin with initializing a Plotly object using plot_ly(). After this is done you then build upon the Plotly object with your plot elements. We have added in an additional fit line so we can better visualize the trend. In Plotly we do this by creating a linear model and then plot it using add_lines(). We then add our aesthetic configurations at the end with layout().

# build initial Plotly plot
fig <- plot_ly(colony_data_grouped,
               x = ~colony_lost,
               y = ~colony_n,
               type = 'scatter',
               mode = 'markers')

# add linear regression model  
fit <- lm(colony_n ~ colony_lost, data = colony_data_grouped)

fig <- fig %>% add_lines(x = ~colony_lost, y = fitted(fit))

# add layout arguments and aesthetics
fig %>% layout(yaxis = list(title = "Number of Bee Colonies"),
               xaxis = list(title = "Number of Colonies Lost"),
               title = list(text = "Total Number of Bee Colonies by State Since 2015"),
               showlegend = FALSE)

Now, let’s try recreating something similar in CanvasXpress.

CanvasXpress was built initially with biological data in mind, so it tends to work better with wide data formats. An example of this is having samples as columns, and variables as rows. CanvasXpress uses rownames to match up the data, so reshaping the data to emphasize row and column names are important. If you would like to learn more about wide datasets, you can check out this article.

# group data 
colony_data_grouped <- colony_data %>%
    mutate(state = as.factor(state)) %>%
    select(c("state", "colony_max", "colony_lost", "colony_n")) %>%
    group_by(state) %>%
    na.omit() %>%
    summarize_all(sum)

# reshaping the data so that states are rownames
cx_colony_data_grouped <- colony_data_grouped %>%
    select(c(colony_n, colony_lost, state)) %>%
    column_to_rownames("state")

# take a look at the data
head(cx_colony_data_grouped)
##             colony_n colony_lost
## Alabama       197500       32710
## Arizona       706500      132300
## Arkansas      539500       84300
## California  23510000     3456000
## Colorado      489500       94080
## Connecticut    87900        6450

Unlike Plotly, CanvasXpress plots can all be configured using the canvasXpress() function to build our plot object. CanvasXpress can take yAxis and xAxis arguments, but I personally found it easier to model the data and then feed it in. CanvasXpress has a very useful function to add a regression line to the plot without the need for a pre-existing model. This is because of the extensive data modelling we did earlier, which makes it easier for CanvasXpress to make assumptions about our data.

canvasXpress(data = cx_colony_data_grouped,
             graphtype = "Scatter2D",
             afterRender = list(list("addRegressionLine")),
             title =  "Total Number of Bee Colonies by State Since 2015",
             yAxisTitle = list("Number of Bee Colonies"),
             xAxisTitle = list("Number of Colonies Lost"))

We can see that although CanvasXpress requires a bit more mindfulness with regards to data wrangling, we don’t have to go through the layering approach that Plotly typically requires.

Stacked Bar Chart

Let’s try to create a stacked bar chart. For this one, let’s take a look at the stressors involved in bee colony losses and their percent contributions. For the Plotly visualization we needed to do some wrangling of the data to used summarised values.

# prepare the data 
stressor_data_filtered <- stressor_data %>%
    mutate(stressor = as.factor(stressor_data$stressor)) %>%
    group_by(stressor, state, year) %>%
    summarise(pct_avg = mean(stress_pct, na.rm = TRUE)) %>% # quarterly average
    group_by(stressor, year) %>%
    summarise(pct_avg = mean(pct_avg, na.rm = TRUE)) %>% # average across states
    pivot_wider(names_from = "stressor", values_from = "pct_avg") %>%
    rename('Diseases' = 'Disesases')

# take a look at the data
head(stressor_data_filtered)
## # A tibble: 6 × 7
##    year Diseases Other `Other pests/parasites` Pesticides Unknown `Varroa mites`
##   <dbl>    <dbl> <dbl>                   <dbl>      <dbl>   <dbl>          <dbl>
## 1  2015     3.67  6.32                   11.2        7.76    4.18           27.2
## 2  2016     4.15  5.78                   10.4        8.06    4.37           29.3
## 3  2017     5.87  6.61                   11.2        8.06    4.37           33.0
## 4  2018     4.06  7.80                   14.4        8.36    4.66           34.9
## 5  2019     4.32  6.45                   12.3        8.32    4.05           31.7
## 6  2020     3.52  5.28                    9.83       5.73    4.47           28.9

Next, in we will build our plotting object and add the traces for the individual factors manually. The layering here can be useful for controlling the order of the stacks. I personally found it a little cumbersome to repeat these traces, but I’m sure you could use an lapply() or for loop to make this easier on yourself.

fig <- plot_ly(stressor_data_filtered,
               x = ~year,
               y = ~`Varroa mites`,
               type = "bar",
               name = "Varroa mites")

# add stacked bars through traces
fig <- fig %>% add_trace(y = ~`Other pests/parasites`,
                         name = "Other pests/parasites")
fig <- fig %>% add_trace(y = ~Pesticides,
                         name = "Pesticides")
fig <- fig %>% add_trace(y = ~`Diseases`,
                         name = "Diseases")
fig <- fig %>% add_trace(y = ~Other,
                         name = "Other")
fig <- fig %>% add_trace(y = ~Unknown,
                         name = "Unknown")
fig %>% layout(yaxis = list(title = "Average Stress PCT %"),
               barmode = 'stack')

Now, for the CanvasXpress version! In addition to the data wrangling that was included for the Plotly visualization, you can see that we have also had to pivot the data to be in wide format.

# prepare the data 
stressor_data_filtered <- stressor_data %>%
    mutate(stressor = as.factor(stressor_data$stressor)) %>%
    group_by(stressor, state, year) %>%
    summarise(pct_avg = mean(stress_pct, na.rm = TRUE)) %>% # quarterly average
    group_by(stressor, year) %>%
    summarise(pct_avg = mean(pct_avg, na.rm = TRUE)) %>% # average across states
    pivot_wider(names_from = "stressor", values_from = "pct_avg") %>%
    rename('Diseases' = 'Disesases')

# setting rownames to stressors and columns to year
cx_stressor_data_filtered <- stressor_data_filtered %>%
    pivot_longer(Diseases:`Varroa mites`) %>% 
    pivot_wider(names_from = year) %>% 
    column_to_rownames('name')

# take a look at the data
head(cx_stressor_data_filtered)
##                            2015      2016      2017      2018      2019
## Diseases               3.667572  4.152482  5.868794  4.064007  4.323404
## Other                  6.317199  5.781206  6.614184  7.795745  6.450709
## Other pests/parasites 11.219149 10.362766 11.248582 14.437589 12.277305
## Pesticides             7.760993  8.062943  8.056522  8.364539  8.323333
## Unknown                4.183156  4.372518  4.373404  4.660638  4.045390
## Varroa mites          27.172872 29.330142 33.002128 34.948759 31.731206
##                            2020      2021
## Diseases               3.519858  3.058333
## Other                  5.275709  5.774468
## Other pests/parasites  9.834574 10.315116
## Pesticides             5.725189  5.440000
## Unknown                4.474291  3.936905
## Varroa mites          28.882447 27.784043

Looking at the data, we can see the format looks much cleaner and simplified compared to what we were able to give Plotly. As a result our actual plotting code below is fairly straightforward.

canvasXpress(data = cx_stressor_data_filtered, 
             graphType = "Stacked",
             graphOrientation = "vertical",
             title = "Stressor Contributors to Colony Loss",
             xAxisTitle = "Average Stress PCT %")

Our stacked bar charts here are pretty comparable and look solid. The approach towards the same type of chart here was very different depending on which package you were using.


Let’s take a look at the breakdown between both visualization libraries.

Overall, both libraries did a fantastic job at creating the visualizations we wanted. One thing that did surprise me was how different the Plotly code looked depending on what type of chart I was making. In contrast, although CanvasXpress was more particular with it’s data formatting, the actual plotting functions themselves looked very consistent.


Plotly R

  • Requires some additional data preparation depending on the chart type

  • More manual control over data arguments, such as manually setting x and y

  • Uses a lot of relative column and variable names

  • Has a layered, modular approach that can be flexible but also bloated. Documentation is robust since it has a large community surrounding it

  • Has various interactive features and download functionalities

CanvasXpress

  • Requires data pivoting to a wider format and setting of rownames

  • Easier to work with dataframes than tibbles and nested data formats (prefers flat data)

  • Assumes x and y values based on data argument, and doesn’t handle ambiguity well

  • All plotting arguments are set in one function, but understanding which arguments are available for each chart type can be hard to find

  • Has various interactive features and download functionalities

  • Comes with a plot editor (accessed through right-click) that can be very handy for users

Stick around for part two of this article, where we do a deeper dive into more complex plots!

Interested in the code used in this article? Check out the raw versions here on Github.