From field to repo – Portal data

December 4, 2017

Whenever we get new Portal data, we want to update our database as quickly as we can without sacrificing data quality, whether by adding data with errors or by messing up the existing database during the update. We also want the process to be reproducible and open, so that anyone can see what we do to maintain the data. This helps us keep the process consistent as new weecologists take over managing the data, and it lets anyone who wants to use the data understand how it’s been handled. We achieve these lofty goals through a combination of good old-fashioned record keeping and, through our GitHub repository, high-tech version control and open access.

All of our plant and rodent data begin on paper datasheets that we fill out in the field. We keep those datasheets in binders, forever. We also scan them and archive digital copies. As a brand-new ecologist, I love these binders: going through them, I get to see forty years of fellow rodent enthusiasts being enthusiastic about rodents! Beyond their sentimental value, these hard copies let us go back and quadruple-check for errors and notes whenever we find an anomaly in the data.

The Portal rodent data cache

When we come back from a Portal trip, two different weecologists each enter a copy of the data into an Excel workbook. Double entry lets us filter out the inevitable typo or several. We enter the data into a preformatted Excel template, which uses data validation functions to catch especially strange typos. If we try to record an invalid entry (a kangaroo rat ten times too big, for example), the template prompts us to double-check our work.
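To make the double-entry idea concrete, here is a minimal sketch in Python. It is not our actual template or code (the validation itself lives in Excel), and the field names `species` and `wgt`, the function names, and the weight limit are all assumptions made up for illustration.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]

# Hypothetical upper limit in grams, chosen only for illustration.
MAX_PLAUSIBLE_WEIGHT_G = 300.0


def compare_double_entry(
    entry_a: List[Record], entry_b: List[Record]
) -> List[Tuple[int, str, str, str]]:
    """Return (row_number, field, value_a, value_b) for every cell that differs."""
    if len(entry_a) != len(entry_b):
        raise ValueError("the two entries have different numbers of rows")
    mismatches = []
    for row_number, (a, b) in enumerate(zip(entry_a, entry_b), start=1):
        # Look at every field that appears in either copy of the row.
        for field in sorted(set(a) | set(b)):
            value_a = a.get(field, "").strip()
            value_b = b.get(field, "").strip()
            if value_a != value_b:
                mismatches.append((row_number, field, value_a, value_b))
    return mismatches


def implausible_weights(entry: List[Record]) -> List[int]:
    """Return row numbers whose weight is non-numeric or outside the plausible range.

    Blank weights are skipped rather than flagged.
    """
    flagged = []
    for row_number, record in enumerate(entry, start=1):
        raw = record.get("wgt", "").strip()
        if not raw:
            continue
        try:
            weight = float(raw)
        except ValueError:
            flagged.append(row_number)
            continue
        if not 0 < weight <= MAX_PLAUSIBLE_WEIGHT_G:
            flagged.append(row_number)
    return flagged
```

Any row that `compare_double_entry` reports is one where the two typists disagree, so somebody goes back to the paper datasheet to see which copy is right.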

From there, we proof the data (check it for errors) using a series of R scripts. The scripts are all continuously saved to our GitHub repository, which helps us maintain consistency and openness throughout the process. We use the scripts to check for typos and pick up on common errors in the new data, like forgetting to note that a rodent was a brand-new capture. Once the new data is cleaned up, we also check for discrepancies between the new data and our old records. For example, sometimes a rodent was identified as Dipodomys ordii in one census and Dipodomys merriami in another. Sometimes we can look at old data and resolve these discrepancies. If we can’t, we make a note to look very carefully at that rodent the next time we see it. Whenever we find a contradiction or make a change to the data, we keep a note of it in a notebook. If we change old records, the change is also recorded on GitHub.
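The species check can be sketched in a few lines. Our real proofing scripts are in R; the Python below is only an illustration of the idea, and it assumes each record carries a `tag` identifying the individual rodent and a `species` field. Both names, and the function name, are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def species_conflicts(records: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Map each tag to its recorded species, keeping only tags with more than one."""
    species_by_tag = defaultdict(set)
    for record in records:
        tag = record.get("tag", "").strip()
        if not tag:
            # Untagged animals cannot be matched across censuses.
            continue
        species_by_tag[tag].add(record["species"])
    return {
        tag: sorted(species)
        for tag, species in species_by_tag.items()
        if len(species) > 1
    }
```

Running this over the old records together with the new census returns only the individuals that need a second look, which keeps the list of rodents to re-examine short.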

Once the new records are clean and agree with old records, we add the new census to the master database. This is a potentially dangerous step: it’s pretty easy to mess up a dataset by introducing something tiny, like a comma in the wrong place. We use GitHub to protect the master database from catastrophic errors. Whoever has cleaned the new data submits a “pull request” on GitHub, which is essentially a request to make a set of changes to the master version of the database. GitHub lets us compare the old and new databases, and highlights the potential changes. We can make sure that the only changes being made are the ones we want to make before we approve the pull request. And even if something were to go wrong here, GitHub also allows us to revert to earlier versions of the repository.
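The comparison GitHub shows us is a line-by-line diff. As a rough sketch of the same idea (not how GitHub itself computes it), Python's standard `difflib` module can list which lines of a table were removed and which were added. The function name and the sample rows in the note below are made up for illustration.

```python
import difflib
from typing import List, Tuple


def summarize_changes(
    old_lines: List[str], new_lines: List[str]
) -> Tuple[List[str], List[str]]:
    """Return (removed, added) lines between two versions of a table."""
    removed: List[str] = []
    added: List[str] = []
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" means old lines were swapped for new ones, so it counts as both.
        if tag in ("replace", "delete"):
            removed.extend(old_lines[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new_lines[j1:j2])
    return removed, added
```

For a routine census update we would expect `removed` to be empty and `added` to hold only the new rows. Anything in `removed` means an existing record was touched, which is exactly the kind of change we want to look at before approving.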

At this stage, we enlist a nifty bot called Travis to run a final quality check and streamline updates to the whole database. When somebody opens a pull request to add rodent data, Travis automatically runs a set of scripts to make sure that there are no bugs in any changes we made to our data cleaning code. If those tests check out, it proceeds to automatically update the rest of the data tables in the database: records of the dates we trapped, which new moon we’re on, which plots were trapped (in case weather, or some other circumstance, kept us from trapping some plots) and which experimental treatments applied to which plots at the time. We also maintain weather records, which Travis automatically pipes in from the Portal weather station. Automatically updating all of these tables removes the possibility of introducing human error – and it’s much faster than having a human do it!
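The important part of this step is the ordering: tests first, table updates only if every test passes. Here is a minimal Python sketch of that gate. It is not Travis configuration and not our actual scripts; the function name and the idea of passing checks and updates in as named callables are assumptions for illustration.

```python
from typing import Callable, List, Tuple


def run_gated_updates(
    checks: List[Tuple[str, Callable[[], bool]]],
    updates: List[Tuple[str, Callable[[], None]]],
) -> Tuple[List[str], List[str]]:
    """Run every check; apply the updates only if none of them failed.

    Returns (names of failed checks, names of updates that were applied).
    """
    failed = [name for name, check in checks if not check()]
    if failed:
        # Leave every table untouched when any check fails.
        return failed, []
    applied = []
    for name, update in updates:
        update()
        applied.append(name)
    return failed, applied
```

Because a failing check stops everything before the first update runs, a bug in the cleaning code cannot leave some tables updated and others stale.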

That’s it! New rodent reports come in about every four weeks. So if, like us, you have a burning desire to know what those rodents are up to – and if there are any more spectabs – it’s never a very long wait.
