Currently in the Organic Surface Chemistry group, there are large variations between group members in the tools used for data analysis. Some people feel most comfortable in spreadsheet programs such as Origin or Excel, while others rely on a mix of Matlab, R, Python or other tools. We use many different experimental techniques in our research, and therefore generate data in many different formats. Workflows of varying efficiency often make joint projects more difficult than necessary.
Recently, a lot of our research activities have been oriented towards the SPOMAN Open Science collaboration, and the ongoing projects are available on the Open Science Framework (osf.io). In most cases, the methods and results are thoroughly described for each project. However, since much of the data analysis is currently performed in spreadsheet software, it is not very transparent or reproducible. Lowndes et al. (2017) discuss exactly this issue and describe how the use of R and GitHub has allowed them to do better science in less time.
I have been considering how to apply this thinking to our research. I am confident that our analysis could be made more efficient and better documented if R were adopted as a general tool for data analysis.
Clear and well-documented analysis in R Markdown
If you receive a spreadsheet from someone containing some data analysis, it can be very hard to decipher the thinking of the original author (and to be honest, analysis done by yourself six months ago might as well have been done by a complete stranger). Some of the challenges are:
- There is no connection to the original data. How did the data come from your scientific equipment and into the spreadsheet? Was it in any way altered previously?
- What calculations are performed? Which cells contain raw data and which contain formulas? You have to click around manually to find out.
- There is no easy way to reproduce an analysis without entering new formulas, making new assumptions, etc.
R provides a very nice feature which can alleviate these challenges and make an analysis easier to understand and reproduce: R Markdown documents. This post is written in R Markdown to demonstrate some of its capabilities (you can go here to see the code)¹. R Markdown makes it easy to alternate between descriptive paragraphs, which explain the analysis and intermediate conclusions, and R code chunks, where scripts load, modify and plot data.
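At its core, an R Markdown document is just a plain-text file with a metadata header and interleaved code chunks. A minimal sketch of the structure (the title, output format and chunk contents here are illustrative):

````markdown
---
title: "Example analysis"
output: html_document
---

Plain paragraphs like this one describe the analysis and conclusions.

```{r summarise-data}
# Code chunks are executed when the document is knitted,
# and their output is embedded in the final report
summary(cars)
```
````

Knitting the file runs every chunk from top to bottom, so the rendered report is guaranteed to match the code that produced it.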
We could for example provide the script for loading a dataset and directly show how a few of the variables are modified. This is demonstrated here for R's built-in mtcars dataset:
```r
library(tidyverse)

dataset <- mtcars %>%
  mutate(
    wt = wt * 0.45359,    # Convert lbs to tons
    mpg = mpg * 0.42514,  # Convert mpg to km/L
    Transmission = forcats::fct_recode(as.factor(am), Automatic = "0", Manual = "1")
  )

knitr::kable(
  head(dataset, 4),
  caption = "The mtcars dataset contains data on 32 different cars. Here 4 rows are shown."
)
```
|               |      mpg| cyl|  hp|       wt| am|Transmission |
|:--------------|--------:|---:|---:|--------:|--:|:------------|
|Mazda RX4      | 8.927940|   6| 110| 1.188406|  1|Manual       |
|Mazda RX4 Wag  | 8.927940|   6| 110| 1.304071|  1|Manual       |
|Datsun 710     | 9.693192|   4|  93| 1.052329|  1|Manual       |
|Hornet 4 Drive | 9.097996|   6| 110| 1.458292|  0|Automatic    |

Table: The mtcars dataset contains data on 32 different cars. Here 4 rows are shown.
We can then plot our dataset in a completely reproducible manner with a few lines of code. Anyone with the data file and this script would be able to produce this exact plot. Notice that the package ggplot2 is used, producing publication-ready figures right away.
```r
dataset %>%
  ggplot(aes(x = wt, y = mpg, color = Transmission)) +
  geom_point() +
  labs(x = "Weight (t)", y = "Mileage (km/L)", title = "Fuel economy")
```
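Because the figure is produced by code, exporting it is also just one more line: ggplot2's ggsave() writes a plot to disk at a specified size and resolution. A minimal sketch (the file name, dimensions and dpi here are illustrative; adjust them to the target journal's requirements):

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Write the plot to a PNG file at a fixed physical size and resolution,
# so re-running the script always regenerates an identical figure
ggsave("fuel-economy.png", plot = p, width = 12, height = 8, units = "cm", dpi = 300)
```

This way the exported figure is regenerated from scratch every time the analysis is re-run, instead of being copy-pasted out of a plotting window.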
Reuse of work through an R package
Take any given research project in our group, and you will commonly find that we use X-ray Photoelectron Spectroscopy (XPS), Raman spectroscopy, Infrared spectroscopy, Ellipsometry and various electrochemical techniques all the time. In essence, a handful of techniques produce the large majority of our data. Most of it is initially saved in proprietary formats and treated in the corresponding software, but it can all be exported as plain-text files (.csv, .txt). Wouldn't it be nice to have a common framework to analyse and plot all data, for every project, regardless of the technique used? This is also possible in R by creating a package!
Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data. R packages, Hadley Wickham
Initially, an R package for our group could contain functions to load the data from the above-mentioned techniques and clean it up, so that it is ready for analysis. We could then build on this by providing often-used analyses as functions, shortcuts for making commonly used plots, etc. In this way code could easily be reused among members of the group, making our science more efficient. Having such a toolbox, provided along with some basic examples, will also make it easier for new group members to get started with R.
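As a sketch of what such a loading function could look like: a hypothetical read_xps() that reads an exported plain-text spectrum into an analysis-ready data frame. The function name and the two-column file format (binding energy, counts) are assumptions for illustration; real instrument exports will differ.

```r
# Hypothetical loader for an XPS spectrum exported as a headerless
# two-column CSV: binding energy (eV), counts. Format is an assumption.
read_xps <- function(path) {
  raw <- read.csv(path, header = FALSE,
                  col.names = c("binding_energy", "counts"))
  # Clean-up step: drop incomplete rows so the data is analysis-ready
  raw[complete.cases(raw), ]
}

# Create a small example file to demonstrate the function
tmp <- tempfile(fileext = ".csv")
writeLines(c("285.0,1200", "284.8,1500", "284.6,1350"), tmp)

spectrum <- read_xps(tmp)
head(spectrum)
```

Once collected in a package, a function like this replaces the copy-pasted import steps that otherwise have to be repeated at the top of every individual analysis.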
Collaborative TODO-lists and discussions on GitHub
Once you have made your R package, how is it then distributed among group members? To be truly efficient we need a tool that:
- Easily allows users to download the newest version of the package when using R.
- Allows for cooperation from many users, such that maintaining and debugging the code is independent of any single person.
- Contains version control, such that users can append their own code to the project or modify other people's code without fear of breaking anything.
All this is possible if the package is made publicly available as a Git repository. As of writing this, I have created an empty R package that is available on GitHub, and the hope is that in due time it will be populated with lots of useful stuff for our group members. This solution makes the code publicly available and very easy to use in R.
```r
# Install the package from GitHub using the devtools package
install.packages('devtools')
devtools::install_github('SPOMAN/osc')

# Load the 'osc' package
library(osc)
```
GitHub further allows users and developers to track issues. This can be used as a TODO-list for the project, to report bugs in the code or to request missing features. This means that even R users who mainly use the scripts, and might be unable to add new functionality themselves, can help improve the package.
Key steps for implementation
It is clear that there is an initial barrier to overcome if the implementation of R as the primary analysis tool in our group is to succeed. Currently many students in our group have very limited experience with programming, but I believe that by implementing core tools in an accessible environment this can be overcome. Teaching R to new users was discussed extensively at the useR!2017 conference I attended last week. Some materials are available in the excellent talk "Teaching data science to new useRs" by Mine Cetinkaya-Rundel.
In the long run I am confident that the extra work that must be put into making a well-crafted R package will pay off, once the scientific analyses can be performed more efficiently and reproducibly by every member of our research group.
Lowndes, Julia S. Stewart, Benjamin D. Best, Courtney Scarborough, Jamie C. Afflerbach, Melanie R. Frazier, Casey C. O’Hara, Ning Jiang, and Benjamin S. Halpern. 2017. “Our path to better science in less time using open data science tools.” Nature Ecology & Evolution 1 (6): 0160. doi:10.1038/s41559-017-0160.
Fun fact: R Markdown can also be used to produce PDFs, Word documents or even slideshows!↩