Data Analysis with the portalr Package

by Hao

So you’ve read several posts about the Portal site and have even gone to the official GitHub repo for the data, but it still seems pretty intimidating to handle and do analyses on…

Never fear! The Weecology lab hears your concerns and we are actively working on a software package to smooth out the process. You can check out the project on GitHub.

FAQ

Q: What is it exactly?

A: The portalr project is a software package for the R programming language (http://cran.r-project.org). R is one of the most popular languages for ecology, statistics, and data science; it also has a large open-source community that creates free add-on packages to extend the base functionality.

Q: How do I get the portalr package?

A: There are some basic instructions on the repo page, but in short, the package is still in development and therefore not yet on CRAN (the Comprehensive R Archive Network). If you are unfamiliar with installing a package from GitHub, the easiest approach is to first install the devtools package, and then use one of its functions to install portalr from GitHub:

install.packages("devtools")
devtools::install_github("weecology/portalr")
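
If you rerun a script often, it can help to skip installation when the packages are already present. The following is a minimal sketch, not part of portalr itself: `needs_install` and `install_portalr_if_missing` are hypothetical helper names, and the only calls assumed are base R's `requireNamespace()` and the two install commands shown above.

```r
# Hypothetical helper: TRUE when a package cannot be found in the library paths.
needs_install <- function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

# Hypothetical wrapper: nothing is downloaded until this function is called.
install_portalr_if_missing <- function() {
  if (needs_install("devtools")) {
    install.packages("devtools")
  }
  if (needs_install("portalr")) {
    devtools::install_github("weecology/portalr")
  }
  invisible(NULL)
}
```

Calling `install_portalr_if_missing()` once at the top of a script would then install only what is missing.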

Q: What can I do with the package?

A: Several different things! Mainly, it is designed to be a general-purpose interface to the Portal data for R users. It allows you to download the latest iteration of data from the data repo, summarize the data in different ways (e.g. by time, by space, by treatment), and integrate different data sources (e.g. rodents, plants, ants, weather).

Q: That sounds great! How do I get started with that?

A: Well, the package is still under development, but check out the demo below, and feel free to send us comments and suggestions (preferably as an issue on the GitHub repo). 👇

Demo

Initial setup

Load in the packages we’re going to use for data manipulation and plotting:

library(tidyverse)
library(cowplot)
library(portalr)

Obtaining the data

To make sure we don’t unnecessarily download the data, we first check whether it might already exist, and if it does, whether the data matches the latest version on the GitHub repo:

# use current folder to store downloaded data
my_path <- "."
rodent_file <- file.path("PortalData", "Rodents", "Portal_rodent.csv")
path_to_rodent_file <- FullPath(rodent_file, my_path)

# check if we already have the latest data
if(!file.exists(path_to_rodent_file) ||
   observations_are_new(base_folder = my_path))
{
  download_observations(base_folder = my_path)
}
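
The condition relies on `||` short-circuiting: when the file is missing, the version check is never run. Here is a sketch of that logic in base R, with a hypothetical `is_newer_available` function standing in for `observations_are_new()` (whose internals are not shown here); `needs_download` is likewise an invented name.

```r
# Hypothetical decision helper; `is_newer_available` is a stand-in function
# supplied by the caller, not part of portalr.
needs_download <- function(path, is_newer_available) {
  # `||` evaluates the right-hand side only if the file already exists
  !file.exists(path) || is_newer_available()
}

missing_file <- file.path(tempdir(), "no_such_rodent_file.csv")
existing_file <- tempfile(fileext = ".csv")
writeLines("period,species", existing_file)

# This stand-in stops with an error if it is ever called, which shows that a
# missing file triggers a download without consulting the remote version.
never_called <- function() stop("version check should not run")

needs_download(missing_file, never_called)       # TRUE
needs_download(existing_file, function() FALSE)  # FALSE: already up to date
needs_download(existing_file, function() TRUE)   # TRUE: newer data exists
```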

Next, we read in the various data tables:

rodent_data_all <- loadData(path = my_path)
print(summary(rodent_data_all))

##                Length Class      Mode
## rodent_data    29     data.frame list
## species_table   8     data.frame list
## trapping_table  6     data.frame list
## newmoons_table  4     data.frame list
## plots_table     4     data.frame list
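
`summary()` on a list reports only each element's length, which for a data frame is its number of columns. To see row counts and column names as well, something like the following can help. It is a sketch on a small made-up list (`toy_tables` and its contents are invented for illustration), but the same calls should apply to `rodent_data_all`.

```r
# Made-up stand-in for the list returned by loadData(); values are invented.
toy_tables <- list(
  rodent_data   = data.frame(period = c(1, 1, 2),
                             species = c("PB", "PP", "PB")),
  species_table = data.frame(species = c("PB", "PP"),
                             scientificname = c("Chaetodipus baileyi",
                                                "Chaetodipus penicillatus"))
)

# Rows and columns of each table, one column of output per table
table_dims <- sapply(toy_tables, dim)
rownames(table_dims) <- c("rows", "cols")
print(table_dims)

# Column names of each table
print(lapply(toy_tables, names))
```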

Rodent Abundances

The first table that we loaded (rodent_data_all$rodent_data) is a record of the observed macrofauna, including rodents, but also other taxa. We first filter out missing, unidentified, incomplete, or otherwise erroneous records:

rodent_data_all$rodent_data %>%
  remove_suspect_entries() %>%  
  process_unknownsp(rodent_data_all$species_table, TRUE) %>%
  remove_incomplete_censuses(rodent_data_all$trapping_table, FALSE) %>%
  {.} -> rodent_data
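
The trailing `{.} -> rodent_data` uses R's right-hand assignment so that the pipeline reads top to bottom and ends with the name of the result; `{.}` simply passes the piped value through unchanged. A small sketch with made-up data (the `toy` data frame is invented) suggests that it is equivalent to the more common leftward assignment:

```r
library(dplyr)

# Invented example data
toy <- data.frame(species = c("PB", NA, "PP"), wgt = c(30, 12, 17),
                  stringsAsFactors = FALSE)

# Style used in this post: pipeline first, result name last
toy %>%
  filter(!is.na(species)) %>%
  {.} -> kept_right

# Conventional style: result name first
kept_left <- toy %>%
  filter(!is.na(species))

identical(kept_right, kept_left)  # TRUE
```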

Next, we write a function to summarize the abundances for each species within each sampling trip:

summarize_abundance <- function(rodent_data)
{
  return(rodent_data %>%
         mutate(species = factor(species)) %>%
         group_by(period) %>%
         do(data.frame(x = table(.$species))) %>% 
         ungroup() %>%
         select(period, species = x.Var1, abundance = x.Freq)
  )
}
rodent_abundance <- summarize_abundance(rodent_data)
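
Converting `species` to a factor before calling `table()` is what makes species that were absent from a census show up with an abundance of 0 rather than being dropped. A base R sketch on invented data illustrates the effect (the `toy` data frame is made up, and it cross-tabulates in one step rather than per group as the function above does):

```r
# Invented captures: species "PP" is never caught in period 2
toy <- data.frame(
  period  = c(1, 1, 1, 2),
  species = factor(c("PB", "PP", "PP", "PB"))
)

# Cross-tabulate period by species; every factor level gets a cell
counts <- as.data.frame(table(period = toy$period, species = toy$species),
                        responseName = "abundance")
print(counts)
```

The row for period 2 and `PP` has an abundance of 0, so every species contributes one row per census.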

Finally, we want to add the dates of each sampling trip (currently recorded as an index in the period column), as well as the scientific names for each species (currently recorded as a two-letter species code in the species column):

join_census_date <- function(rodent_abundance, newmoons_table)
{
  return(rodent_abundance %>%
           left_join(select(newmoons_table, "period", "censusdate"),
                     by = "period") %>%
           mutate(census_date = as.Date(censusdate))
  )
}
join_scientific_name <- function(rodent_abundance, species_table)
{
  return(rodent_abundance %>%
           left_join(select(species_table, "species", "scientificname"), 
                     by = "species") %>%
           rename(scientific_name = scientificname)
  )
}

rodent_abundance %>%
  join_census_date(rodent_data_all$newmoons_table) %>%
  join_scientific_name(rodent_data_all$species_table) %>%
  select(census_date, scientific_name, abundance) %>%
  {.} -> rodent_abundance

print(summary(rodent_abundance))

## census_date                        scientific_name   abundance      
## Min.   :1977-07-16   Baiomys taylori         : 438   Min.   :  0.000  
## 1st Qu.:1987-05-28   Chaetodipus baileyi     : 438   1st Qu.:  0.000  
## Median :1996-06-02   Chaetodipus hispidus    : 438   Median :  0.000  
## Mean   :1997-03-06   Chaetodipus intermedius : 438   Mean   :  6.273  
## 3rd Qu.:2007-06-16   Chaetodipus penicillatus: 438   3rd Qu.:  5.000  
## Max.   :2017-11-18   (Other)                 :7008   Max.   :285.000  
##                      NA's                    : 438
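
A `left_join()` keeps every row of the abundance table even when its key has no match in the lookup table, filling the joined columns with `NA`. That is one plausible source of the `NA's` row in the summary above, though the exact cause in the real data is not verified here. A sketch with invented tables (`"XX"` is a made-up code):

```r
library(dplyr)

# Invented abundance table; "XX" is a made-up code with no lookup entry
toy_abundance <- data.frame(
  period    = c(1, 1, 1),
  species   = c("PB", "PP", "XX"),
  abundance = c(4, 7, 1),
  stringsAsFactors = FALSE
)

# Invented lookup table mirroring the columns used from species_table
toy_species <- data.frame(
  species        = c("PB", "PP"),
  scientificname = c("Chaetodipus baileyi", "Chaetodipus penicillatus"),
  stringsAsFactors = FALSE
)

joined <- toy_abundance %>%
  left_join(toy_species, by = "species") %>%
  rename(scientific_name = scientificname)

print(joined)

# Codes that failed to match can be listed with an anti-join
unmatched <- anti_join(toy_abundance, toy_species, by = "species")
print(unmatched$species)
```

Running an `anti_join()` like this against the real tables would be one way to find out which species codes lack a scientific name.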

Plot

With the data in shape, let’s plot species abundances over time:

my_plot <- ggplot(rodent_abundance, 
                  aes(x = census_date, y = abundance)) + 
  geom_line() + 
  facet_wrap(~scientific_name, scales = "free_y", ncol = 3) + 
  xlab("Date") + 
  ylab("Abundance") + 
  scale_x_date(breaks = seq(as.Date("1977-01-01"),
                            to = as.Date("2018-01-01"), by = "5 years"),
               date_labels = "%Y",
               limits = as.Date(c("1977-01-01", "2018-01-01"))) +
  theme_cowplot() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5), 
        legend.position = "bottom", legend.justification = "center", 
        strip.text.x = element_text(size = 10))

print(my_plot)
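
The `breaks` argument is just a vector of dates, so it can be built and inspected on its own before handing it to `scale_x_date()`. A base R sketch of the five-year sequence used above (`year_breaks` is an illustrative name):

```r
# Five-year steps from the start of 1977 up to (at most) the start of 2018
year_breaks <- seq(as.Date("1977-01-01"), to = as.Date("2018-01-01"),
                   by = "5 years")

# Labels as they would appear with date_labels = "%Y"
format(year_breaks, "%Y")
# "1977" "1982" "1987" "1992" "1997" "2002" "2007" "2012" "2017"
```

Because the last break falls in 2017, the axis limit of 2018-01-01 leaves a little room to the right of the final tick.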

Conclusion

That was just one way of viewing the rodent abundance data, and we plan to include many more such examples with the `portalr` package. Our goal is to help everyone get up to speed quickly with the package and use all of the different data sources effectively. Again, if you have questions or suggestions, please feel free to drop us a line at the GitHub issues page.

This entry was posted on December 21, 2017 at 10:24 am and is filed under 40th Anniversary.
