Archive for December, 2017

Data Analysis with the portalr Package

December 21, 2017

So you’ve read several posts about the Portal site and have even gone to the official GitHub repo for the data, but it still seems pretty intimidating to handle and do analyses on…

Never fear! The Weecology lab hears your concerns and we are actively working on a software package to smooth out the process. You can check out the project on GitHub.

FAQ

Q: What is it exactly?

A: The portalr project is a software package for the R programming language (http://cran.r-project.org). R is one of the most popular languages for ecology, statistics, and data science; it also has a large open-source community that creates free add-on packages to extend the base functionality.

Q: How do I get the portalr package?

A: There are some basic instructions on the repo page, but in short, the package is still in development and therefore not yet uploaded to CRAN (the comprehensive R archive network). If you are unfamiliar with installing a package from GitHub, the easiest approach is to first install the devtools package, and then use one of its functions to install portalr from GitHub:

install.packages("devtools")
devtools::install_github("weecology/portalr")

Q: What can I do with the package?

A: Several different things! Mainly, it is designed to be a general-purpose interface to the Portal data for R users. It allows you to download the latest iteration of data from the data repo, summarize the data in different ways (e.g. by time, by space, by treatment), and integrate different data sources (e.g. rodents, plants, ants, weather).

Q: That sounds great! How do I get started with that?

A: Well, the package is still under development, but check out the demo below, and feel free to send us comments and suggestions (preferably as an issue here). 👇

Demo

Initial setup

Load in the packages we’re going to use for data manipulation and plotting:

library(tidyverse)
library(cowplot)
library(portalr)

Obtaining the data

To make sure we don’t unnecessarily download the data, we first check whether it might already exist, and if it does, whether the data matches the latest version on the GitHub repo:

# use current folder to store downloaded data
my_path <- "."
rodent_file <- file.path("PortalData", "Rodents", "Portal_rodent.csv")
path_to_rodent_file <- FullPath(rodent_file, my_path)

# check if we already have the latest data
if(!file.exists(path_to_rodent_file) ||
   observations_are_new(base_folder = my_path))
{
  download_observations(base_folder = my_path)
}

Next, we read in the various data tables:

rodent_data_all <- loadData(path = my_path)
print(summary(rodent_data_all))
##                Length Class      Mode
## rodent_data    29     data.frame list
## species_table   8     data.frame list
## trapping_table  6     data.frame list
## newmoons_table  4     data.frame list
## plots_table     4     data.frame list

Rodent Abundances

The first table that we loaded (rodent_data_all$rodent_data) is a record of the observed macrofauna, including rodents, but also other taxa. We first filter the data for missing, unindentified, incomplete, or otherwise erroneous data:

rodent_data_all$rodent_data %>%
  remove_suspect_entries() %>%  
  process_unknownsp(rodent_data_all$species_table, TRUE) %>%
  remove_incomplete_censuses(rodent_data_all$trapping_table, FALSE) %>%
  {.} -> rodent_data

Next, we write a function to summarize the abundances for each species within each sampling trip:

summarize_abundance <- function(rodent_data)
{
return(rodent_data %>%
         mutate(species = factor(species)) %>%
         group_by(period) %>%
         do(data.frame(x = table(.$species))) %>% 
         ungroup() %>%
         select(period, species = x.Var1, abundance = x.Freq)
  )
}
rodent_abundance <- summarize_abundance(rodent_data)

Finally, we want to add the dates of each sampling trip (currently recorded as an index in the period column), as well as the scientific names for each species (currently recorded as a two-letter species code in the species column):

join_census_date <- function(rodent_abundance, newmoons_table)
{
  return(rodent_abundance %>%
           left_join(select(newmoons_table, "period", "censusdate"),
                     by = "period") %>%
           mutate(census_date = as.Date(censusdate))
  )
}
join_scientific_name <- function(rodent_abundance, species_table)
{
  return(rodent_abundance %>%
           left_join(select(species_table, "species", "scientificname"), 
                     by = "species") %>%
           rename(scientific_name = scientificname)
  )
}

rodent_abundance %>%
  join_census_date(rodent_data_all$newmoons_table) %>%
  join_scientific_name(rodent_data_all$species_table) %>%
  select(census_date, scientific_name, abundance) %>%
  {.} -> rodent_abundance

print(summary(rodent_abundance))
## census_date                        scientific_name   abundance      
## Min.   :1977-07-16   Baiomys taylori         : 438   Min.   :  0.000  
## 1st Qu.:1987-05-28   Chaetodipus baileyi     : 438   1st Qu.:  0.000  
## Median :1996-06-02   Chaetodipus hispidus    : 438   Median :  0.000  
## Mean   :1997-03-06   Chaetodipus intermedius : 438   Mean   :  6.273  
## 3rd Qu.:2007-06-16   Chaetodipus penicillatus: 438   3rd Qu.:  5.000  
## Max.   :2017-11-18   (Other)                 :7008   Max.   :285.000  
##                      NA's                    : 438

Plot

Finally, let’s create our plot of species abundances over time:

my_plot <- ggplot(rodent_abundance, 
                  aes(x = census_date, y = abundance)) + 
  geom_line() + 
  facet_wrap(~scientific_name, scales = "free_y", ncol = 3) + 
  xlab("Date") + 
  ylab("Abundance") + 
  scale_x_date(breaks = seq(as.Date("1977-01-01"), 
               to = as.Date("2018-01-01"), "+5 years"), 
  date_labels = "%Y", 
  limits = as.Date(c("1977-01-01", "2018-01-01"))) + 
  theme_cowplot() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5), 
        legend.position = "bottom", legend.justification = "center", 
        strip.text.x = element_text(size = 10))

print(my_plot)

Conclusion

So that was just one way of viewing the rodent abundance data, but we plan to include many such examples with the `portalr` package. Our goal is to help everyone get up to speed quickly with doing various analyses with the package, and to be able to use all of the different data sources effectively. Again if you have questions or suggestions, please feel free to drop us a line at the GitHub issues page.

Advertisements

From field to repo – Portal data

December 4, 2017

Whenever we get new Portal data, we want to update our database as quickly as we can, without sacrificing data quality by adding data with errors or messing up the existing database when we try to update it. And, we want to make sure the process is reproducible and open, so that anyone can see what we do to maintain the data. This helps us keep the process consistent as new weecologists take over managing the data, and it lets anyone who wants to use the data understand how it’s been handled. We achieve these lofty goals through a combination of good old-fashioned record keeping and high-tech version control and open access through our GitHub repository.

All of our plant and rodent data begin on a paper datasheet that we fill in in the field. We keep those datasheets in binders, forever. We also scan them and archive digital copies. As a brand-new ecologist, I love these binders – going through them, I get to see forty years of fellow rodent enthusiasts, being enthusiastic about rodents! Beyond their sentimental value, these hard copies let us go back and quadruple-check for errors and notes whenever we find an anomaly in the data.

IMG_1806

The Portal rodent data cache

When we come back from a Portal trip, two different weecologists enter a copy of the data into an Excel workbook. Double-entry lets us filter out the inevitable typo or several. We enter the data into a preformatted Excel template, which uses data validation functions to catch especially strange typos. If we tried to record an invalid entry – a kangaroo rat ten times too big, for example – the datasheet would prompt us to double-check our work.

From there, we proof the data – check for errors – using a series of R scripts. The scripts are all continuously saved to our GitHub repository, which helps us maintain consistency and openness throughout the process. We use the scripts to check for typos and pick up on common errors in the new data, like forgetting to note that a rodent was a brand-new capture. Once the new data is cleaned up, we also check for discrepancies between the new data and our old records. For example, sometimes a rodent was identified as Dipodomys ordii in one census and Dipodomy merriami in another. Sometimes we can look at old data and resolve these discrepancies. If we can’t, we make a note to look very carefully at that rodent the next time we see it. Whenever we find a contradiction or make a change to the data, we keep a note of it in a notebook. If we change old records, the change is also recorded on GitHub.

Once the new records are clean and agree with old records, we go to add the new census to the master database. This is a potentially dangerous step: it’s pretty easy to mess up a dataset by introducing something tiny, like a comma in the wrong place. We use GitHub to protect the master database from catastrophic errors. Whoever has cleaned the new data submits a “pull request” on GitHub, which is essentially a request to make a set of changes to the master version of the database. GitHub lets us compare the old and new databases, and highlights the potential changes. We can make sure that the only changes being made are the ones we want to make before we approve the pull request. And even if something were to go wrong here, GitHub also allows us to revert to earlier versions of the repository.

At this stage, we enlist a nifty bot called Travis to run a final quality check and streamline updates to the whole database. When somebody opens a pull request to add rodent data, Travis automatically runs a set of scripts to make sure that there are no bugs in any changes we made to our data cleaning code. If those tests check out, it proceeds to automatically update the rest of the data tables in the database: records of the dates we trapped, which new moon we’re on, which plots were trapped (in case weather, or some other circumstance, kept us from trapping some plots) and which experimental treatments applied to which plots at the time. We also maintain weather records, which Travis automatically pipes in from the Portal weather station. Automatically updating all of these tables removes the possibility of introducing human error – and it’s much faster than having a human do it!

That’s it! New rodent reports come in about every four weeks. So if, like us, you have a burning desire to know what those rodents are up to – and if there are any more spectabs – it’s never a very long wait.