Overview & Objectives

Welcome to the Data Curation/R Markdown module of the STRI-McGill 2022 course on Neotropical Biology, Environments, and Conservation (BIOL 640/553) & Foundations of Environmental Policy (ENVR 610). This module is meant as a complement to the rest of the curriculum, and our main goals here are twofold. The first is to create interactive, data-driven web products that capture analytical workflows, raw data, and data outputs. The second is to give you the tools, motivation, and inspiration to make your science more accessible, transparent, and reproducible.

If you have not already done so, please read the short post on Using this Site before continuing.

We will use R Markdown and the Distill framework to build our web products. This site itself is not only the guidebook for your web products but is also created with the same tools you will use for your projects. More on that below.

In this module of the course, we will learn how to:

Set up a working environment by installing R, RStudio, and various R packages (e.g., knitr, distill, reactable).
Render interactive data tables, figures, graphics, maps, diagrams, and more.
Troubleshoot problems using effective web searches, GitHub repositories, forums like Stack Overflow, and of course, our wit.
Reference other works and create citeable articles.
(if there is interest) Use git and GitHub Pages for hosting our work (this is not a requirement of the course).
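
As a starting point for the first objective, here is a minimal sketch of the package setup. It assumes R is already installed and that the three packages named above come from CRAN. The helper name `missing_packages` is made up for this illustration and is not part of any package.

```r
# Packages named in the objectives above
pkgs <- c("knitr", "distill", "reactable")

# Hypothetical helper: return the packages that are not yet installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(pkgs)

# Uncomment the next line to install whatever is missing from CRAN
# if (length(to_install) > 0) install.packages(to_install)
```

Checking first and installing only what is missing avoids reinstalling packages you already have, which can matter on a slow field-station connection.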

What is a web product?

Ok, first things first. What do we mean by a web product? Well, for the purposes of this module, we consider a web product to be, primarily, a collection of HTML documents that can be opened with a browser like Chrome, Firefox, etc. That doesn’t necessarily mean that it needs to be hosted on the web, just that anyone can access the information without installing special software or understanding specific tools.

It is NOT a requirement of this course to publish your web product online.

Furthermore, we will use R Markdown and the Distill framework to create what are called static (as opposed to dynamic) web products. The difference is straightforward. Dynamic¹ means that anytime a page loads, the content and look are tailored to the user, like Facebook or Twitter. Static means that everyone who visits the site sees the same thing.

Just because your web product is static does not mean it will be dull. Unlike PDF documents, your web products will have dynamic, interactive components like tables, figures, maps, and so on. Consider this example of Supplementary Material that we submitted for a recent paper. Here is the PDF version (written in R Markdown & LaTeX) and here is the HTML version. While the content of the two documents is identical, their functionality is not. The first thing to notice is that most of the data tables in the PDF document (called Supplementary Data) were too big to fit in the document and were instead uploaded as separate, stand-alone text files to the journal’s website. In the HTML version, on the other hand, every data table is embedded in the document.

Go ahead and check out the Content Description section and click on any of the links that say Supplementary Data. Supplementary Data 2, for example, has 13 columns and 911 rows, far too large to fit (comfortably) in a PDF document. Have a look at Supplementary Data 3. Notice that you can copy or download the contents of the table, sort, search (try searching for invertebrate), and expand the full table (under Show entries, select All). This table even contains hyperlinks that lead to other important information. Go ahead and click on one of the links. Moving on.
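
To give a rough idea of how little code an interactive table needs, here is a minimal sketch using the reactable package. This is not necessarily the tool behind the Supplementary Data tables described above, and the built-in `iris` data set is only a stand-in for a real table.

```r
# Assumes the reactable package is installed; iris is a placeholder data set.
if (requireNamespace("reactable", quietly = TRUE)) {
  tbl <- reactable::reactable(
    iris,
    sortable = TRUE,             # click a column header to sort
    searchable = TRUE,           # adds a global search box
    defaultPageSize = 10,        # rows shown per page
    showPageSizeOptions = TRUE,  # let readers change the page size
    pageSizeOptions = c(10, 25, 50, nrow(iris))
  )
} else {
  tbl <- NULL  # reactable is not installed, so no table is created
}

# In an R Markdown code chunk, a line containing just `tbl` embeds the table.
```

The last page size option is set to the number of rows, which gives readers a way to expand the full table.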

Motivation

Next, I want to take a moment to explore some of the motivation and rationale behind this module of the course. In subsequent lessons we get into more specific details on many of these topics, but for now let’s just stick with the high-level stuff.

Why a web product?

Good question. For me the answer is Open Science—a movement to make scientific research (including publications, data, physical samples, and software) and its dissemination accessible to all levels of society, amateur or professional (Woelfle, Olliaro, and Todd 2011). Creating a web product can help make your science more accessible, transparent, and reproducible.

Consider that many journals and journal articles sit behind pay walls that most people can’t access without paying an unreasonable price. That sucks, and while people are pushing back against this paradigm, for now this is the reality. We do our best to publish open access articles but sometimes this is not possible. A webpage allows you to share your science with a wider audience. By and large, journal articles are highly technical documents. There is nothing wrong with this; it is just the nature of the medium. But this can make the information inaccessible to non-experts. By also presenting your work as a web product you have the opportunity to tell the story behind the science, which can have important outreach implications (Forrester 2017).

A typical journal article is just a few pages, but we all know that a lot more goes into a study than what we usually see in print. Even with extensive Supplementary Material, authors are limited by what they can include in their publication. With a web product you are liberated from these limitations. A webpage gives you a venue to discuss all of the stuff that didn’t make it into your publication and to tell a more complete story of your science. Do you have a gallery of photos from your fieldwork? What about a bunch of statistical tests you tried that didn’t work? Or some personal thoughts on the system you study? In most cases this information would be inappropriate for a journal article but it is still useful and interesting information to share.

I think an important obligation of all scientists is to make their studies transparent and reproducible. If you publish a study I should, with minimal effort, be able to find your data, carry out the same analyses, and reach the same conclusions. There should be no mystery. Sadly, this is not always the case. In my own field of microbial ecology, it can be a daunting task to find raw data from other studies and even harder to figure out exactly how the data was analyzed. Without proper documentation, you may even forget how you did something. Here is a sobering quote from a Nature News Feature about the results of a survey given to roughly 1,500 scientists on the state of reproducible science.

More than 70% of researchers have tried and failed to reproduce another scientist’s experiments, and more than half have failed to reproduce their own experiments. Those are some of the telling figures that emerged from Nature’s survey of 1,576 researchers who took a brief online questionnaire on reproducibility in research. (Baker 2016)

If, however, you build a web product around your project, where you document everything you do no matter how trivial, you can avoid these pitfalls and produce truly reproducible and transparent science. Everything I do now ends up on a project webpage. I can easily share the information and I no longer have directories filled with random bits of information on my computer. My websites have a much better memory than I do. So what is reproducible and transparent science? Turns out there really isn’t a standard, well-defined definition.

According to this article in Science Translational Medicine, reproducibility …

… [is a] set of procedures that permit the reader of a paper to see the entire processing trail from the raw data and code to figures and tables (Goodman, Fanelli, and Ioannidis 2016).

Or the U.S. National Science Foundation (NSF) defines it this way …

… refers to the ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator (Cacioppo et al. 2015).

Of course, all of this is easier said than done. Researchers interested in reaching a wider audience and presenting transparent and reproducible science face many challenges; they are under a lot of pressure to produce papers and are often not given the chance to pursue these activities (Forrester 2017). So what can we do about it? I think that a good first step is to create web products that embrace the concepts of accessibility, transparency, and reproducibility. Institutional outreach and media departments do an incredible job of presenting the highlights of science, but they do not have the time or resources to go much deeper. I believe that with a little organization, training, and support we can all create web products that add value to the science, whether as educational and reference tools, or as outreach components.

Consider again the paper we recently published in Nature Communications on an acute hypoxic event in Bocas del Toro. All Nature journals now require authors to adhere to Reporting standards and availability of data, materials, code and protocols.

… authors are required to make materials, data, code, and associated protocols promptly available to readers without undue qualifications.

Authors must include Data availability and Code availability statements in their papers. Here is the Code availability statement from our paper.

And here is @coraloha’s reaction on Twitter. Some people get really jazzed about Open Science.³

Whoaaaaaa now this is open science to the next level! And the name, come on…Metal AF 🔥 🎸 🤘 https://t.co/8sznPR5QYz
— Dr. Chris Wall (@coraloha) July 29, 2021

Assignment

So what do you think? Is Open Science important? Do you think web products can help communicate your science in a more accessible, transparent, and reproducible way? Your assignment is to read up on Open Science and reproducible research. There is a lot of debate on this subject and what I presented above is my view on the topic. It is important that you form your own opinion. Here are a few articles to get you started, but I do encourage you to dig deeper.

Moving along.

Why R, RStudio, & R Markdown?

There are many ways to make a website: Squarespace, WordPress, Wix, and so on. You can even code your own website in HTML, CSS, and JavaScript. However, in my experience, none of these platforms allow you to run and document code while constructing web products quite like the combination of R, RStudio, & R Markdown. As with everything there are limitations, but I feel the benefits outweigh the drawbacks.

R

R is both a programming language and a software environment for statistical computing and graphics. At some point, I imagine you will all need to embrace a programming environment to analyze your data and summarize your findings using figures, tables, etc. R is certainly not the only way to do this; however, I believe this environment offers a valuable suite of tools for your scientific needs. The benefits of R include: a) it is free and open source, b) its capabilities are extended through user-created packages⁴, c) it has a huge community of users (which means it is well supported), and d) it is powerful and flexible.

RStudio

RStudio is an integrated development environment (IDE) for the R language⁵. Take a moment to familiarize yourself with the idea of an IDE. In a nutshell, RStudio provides a holistic working environment to process (R) code, generate figures/tables, and create websites.

Imagine a car. Think of R as the engine and the RStudio IDE as the dashboard.

R Markdown

R Markdown is really the bread and butter of what we will be doing. R Markdown is a file format (.Rmd) for making dynamic documents with R. R Markdown combines the syntax of Markdown with the language (and environment) of R. R Markdown documents are written in Markdown, a lightweight markup language (like HTML) that uses a relatively simple syntax to facilitate the transformation of human-readable text files into .html or .pdf documents. What this means is that rather than writing HTML and CSS code to make a website, you write your content in Markdown, which is then translated (by RStudio in this case) to web content. (R) code in your document is embedded within code chunks. During the building of a page or site, RStudio identifies code chunks, runs the code, translates the results to Markdown, and then renders the output to an HTML file.

You do not need to include R code to produce an R Markdown web product. In fact, I learned R by first writing R Markdown documents.

The rendering process—creating web pages from R Markdown documents (R code plus Markdown). This figure contains clickable links if you are interested in learning more about these tools.

When a page or site is rendered, the R code in your R Markdown document (.Rmd) is first processed by a program called knitr. Knitr executes all the R code, knits the results together with the Markdown text, and creates a new Markdown document. The new Markdown document is then processed by Pandoc, which converts the Markdown syntax into HTML and CSS code. Pandoc is like a Swiss Army knife for Markdown; it can convert many types of Markdown documents into a variety of other formats. Don’t worry, most of these steps happen behind the scenes. As long as you have a properly formatted R Markdown document, these tools will take care of the rest.⁶
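
If you are curious what those behind-the-scenes steps look like, here is a rough sketch that runs them by hand on a tiny R Markdown document. It assumes the knitr and rmarkdown packages are installed and that Pandoc is available (RStudio ships with a copy). The file names and document contents are placeholders.

```r
fence <- strrep("`", 3)  # three backticks, the code chunk delimiter

# A tiny R Markdown document: a header, some Markdown text, one code chunk
rmd_lines <- c(
  "---",
  "title: \"A minimal example\"",
  "---",
  "",
  "Some *Markdown* text.",
  "",
  paste0(fence, "{r}"),
  "mean(c(1, 2, 3))",
  fence
)

work_dir <- tempfile("rmd-demo-")
dir.create(work_dir)
rmd_file <- file.path(work_dir, "example.Rmd")
writeLines(rmd_lines, rmd_file)

if (requireNamespace("knitr", quietly = TRUE)) {
  # Step 1: knitr runs the code chunk and writes a plain Markdown file
  md_file <- knitr::knit(rmd_file,
                         output = file.path(work_dir, "example.md"),
                         quiet = TRUE)

  # Step 2: Pandoc converts the Markdown file to HTML
  if (requireNamespace("rmarkdown", quietly = TRUE) &&
      rmarkdown::pandoc_available()) {
    rmarkdown::pandoc_convert(md_file, to = "html",
                              output = file.path(work_dir, "example.html"))
  }
}
```

In everyday use you would not call these steps separately. `rmarkdown::render(rmd_file)` or the Knit button in RStudio handles both.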

Why Distill?

In this module we will use the Distill framework to create our web products. Distill for R Markdown is a web publishing format optimized for scientific and technical communication. We are using Distill because it a) is a very stable format developed and actively maintained by RStudio; b) has a large and active user community; c) provides a good balance of functionality and ease of use; and d) is a minimal, lightly themed, template focused on writing and code.

I think the main downside of Distill is that you are limited in how much customization you can do. Distill sites can be customized to some degree (with a bit of effort⁷) but not nearly to the degree of other R Markdown frameworks like blogdown and hugodown, both of which use Hugo for building sites. For a better idea of what I am talking about, have a look at a site I built using blogdown and Hugo.

Some features of the Distill framework include:

Reader-friendly typography that adapts well to mobile devices.
Features essential to technical writing like LaTeX math, citations, and footnotes.
Flexible figure layout options (e.g. displaying figures at a larger width than the article text).
Attractively rendered tables with optional support for pagination.
Support for a wide variety of diagramming tools for illustrating concepts.
The ability to incorporate JavaScript and D3-based interactive visualizations.

Expectations

Let’s take a moment to go through what we expect of you and what you should expect of me as your instructor.

Student role

Throughout this field course, you will all be designing and running experiments, making observations and measurements, collecting data, etc., often as part of a group with fellow students. Your job for this module of the course is to take all the parts of your projects (hypotheses, background information, methods, results, and conclusions) and capture these elements in an R Markdown web product using the Distill framework.

Regardless of whether you are part of a group or running solo, you will each create a web product that contains the details for at least two projects you work on in the course.

You have two choices on how to present your projects.

Website: You can present your projects in website style using the Distill website template. Here is an example of a Distill website we created for a recent publication on an acute hypoxic event in Bocas del Toro called Hypocolypse.
Blog: Or you can present your projects in blog style using the Distill blog template. Want an example of a blog created using Distill? Well, you’re looking at one right now :)

We will get into the nitty-gritty of the differences between the two formats in subsequent lessons. For now, let’s summarize the key differences.

Structural difference: A website is simply a collection of pages that can be accessed via the navigation bar at the top of a page. A blog contains 1) a collection of posts plus 2) a dedicated page to list all posts, called a listing page. The listing page is usually, but not always, the home page. A Distill blog is basically a Distill website with added blog posts.

Layout difference: Websites require you to manually set up links to pages on your site. Within a blog, Distill creates the listing page, which collects links to posts for you, displaying key metadata (like date published, author, categories, title, etc.) and a thumbnail image.

Workflow difference: All pages on a Distill website (and root pages of blogs) are re-rendered (i.e., re-built) each time the site is built. However, individual blog posts must be rendered on their own, with intent. Why? Well, it is because R packages are upgraded all the time and upgrades tend to break older code. So continuously re-rendering really old posts is nearly impossible to do without errors.

So which is better, a Distill blog or website? Well, that depends on how you want to portray and display your projects.
OK then, which is easier? Based on my experience using Distill (5 websites and now 1 blog), I would have to say that building and managing a website is slightly more intuitive than a blog. This mainly has to do with how files are organized in your working directory, basically the structural differences I described above. The directory of a website has a flat structure, meaning all of your R Markdown documents (.Rmd) are in the main directory. With a blog, the .Rmd files are collected in the _posts directory. This structure can sometimes make it tricky to figure out how to create links that connect different parts of the site, so-called internal links. But really, both have roughly the same learning curve: a tiny bit steep at the beginning, then a plateau of smooth sailing. So my advice is to go with what works best for you.
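
To see the structural difference for yourself, here is a minimal sketch that creates one site of each kind in a temporary folder and lists the R Markdown files in each. It assumes the distill package and Pandoc are installed. The directory names and titles are placeholders.

```r
base <- tempfile("distill-demo-")
dir.create(base)
site_dir <- file.path(base, "my-website")  # placeholder name
blog_dir <- file.path(base, "my-blog")     # placeholder name

if (requireNamespace("distill", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  # Each call creates the directory, adds starter files, and builds the site
  distill::create_website(dir = site_dir, title = "My Website", edit = FALSE)
  distill::create_blog(dir = blog_dir, title = "My Blog", edit = FALSE)

  # Website: the .Rmd files sit in the main directory
  print(list.files(site_dir, pattern = "\\.Rmd$", recursive = TRUE))

  # Blog: posts live in their own folders under _posts
  print(list.files(blog_dir, pattern = "\\.Rmd$", recursive = TRUE))
}

# Later, rmarkdown::render_site(site_dir) rebuilds the pages in the main
# directory, while each blog post is rendered on its own (for example with
# rmarkdown::render() on the post's .Rmd file).
```

Comparing the two file listings side by side is a quick way to get a feel for why internal links take a little more thought in a blog.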

Assignment

Your first assignment is to think about the format you want to use, website or blog. You do not have to decide now but do check out the examples provided above to get a better sense of the format differences between the two. If you’re up to it, go ahead and read through the Distill website template and the Distill blog template pages. If you have little or no experience with this type of coding, most of the content on these pages will be gibberish. But fear not! Soon enough, this will all make sense.

Lastly, I wanted to add that some of you may have a lot of experience with these tools already. That’s great; you can help your classmates with their projects. If you think something I say doesn’t make sense, please say something. I would rather do something the right way than be right about the way I am doing something.

Instructor role

As an instructor, my role is as guide and facilitator. My philosophy is that I teach the way I learn; to create a venue where you can be curious, get your hands dirty, make mistakes, and explore. I’m here to help you see what’s possible and help you create something you are proud of. Towards this end, I will write tutorials and lessons and help you work through the material to achieve your goals for the module. If you want to incorporate some particular functionality in your projects that I do not cover in the lessons, I will help you figure out how to do it. You can expect me to work with each of you individually on various aspects of your projects.

You should also expect me to be patient. I know first hand how difficult some of this material can be to digest and a big part of my role is helping you avoid many of the potholes and pitfalls I stumbled over and fell in. I can promise you that, at times, this process will be a little frustrating. But I can also promise that with some hard work (and maybe a few tears), you will have the tools to create a range of useful and beautiful documents.

Finally, I am happy to write additional tutorials and lessons covering material not included on the site already. So please, pretty please, let me know if there is something extra you want to learn and I will write a post so other students can follow along.

Housekeeping

A little housekeeping to finish up. Unfortunately this is not a formal course on R coding or statistics. Given the time we have, these topics are beyond the scope of this module. We want you in the field working on experiments as much as possible, not sitting in front of your computer the whole time :) This doesn’t mean you won’t be performing statistical analyses; it just means that we do not have any formal lectures planned for these topics. Of course, if there is a specific analysis you wish to perform or test you want to run, the instructors and/or your fellow students will help you write the code.

To get the full benefit of using R Markdown, an understanding of R is helpful but not required. That said, if you persist with R Markdown you will learn R, specifically when running analyses and rendering the results as figures or tables. In fact, when I first started creating R Markdown documents a couple of years ago my knowledge of R itself was downright awful. Working with R Markdown dramatically increased my understanding of writing R code for statistical analyses.

Next, we need to talk about operating systems (OS). I use Mac OSX and Linux but unfortunately I have little experience implementing these tools in Windows. If you are on a Windows machine, don’t worry; we will make it all work, it may just take a little more effort. Tons of people use Windows with R, RStudio, and R Markdown; sadly for you I do not happen to be one of them :(

This website is a living document, meaning that it will be updated continuously. I will make announcements when new material is added. The structure of the site is simple. All lessons and homework are listed on the landing page in order. Start at the bottom and work your way up; that way the newest material is always first. The navigation bar at the top has quick links to tools, resources, etc.

Assignment

One more thing I would like you to do before we wrap up. Go back through this page from the top. You do not have to read the contents again (yawn); I just want you to make note of the different elements: text formatting (e.g., italics and bold), headers, table of contents, footnotes and asides, hyperlinks, appendices, citations, images, etc. You will be incorporating many of these elements in your web products. Don’t worry about the details right now; we will cover them later.

That’s all for now.

Source Code

The source code for this page can be accessed on GitHub by clicking this link.

References

Baker, Monya. 2016. “1,500 Scientists Lift the Lid on Reproducibility.” Nature 533 (7604). https://doi.org/10.1038/533452a.

Cacioppo, John T, Robert M Kaplan, Jon A Krosnick, James L Olds, and Heather Dean. 2015. “Social, Behavioral, and Economic Sciences Perspectives on Robust and Reliable Science.” Report of the Subcommittee on Replicability in Science Advisory Committee to the National Science Foundation Directorate for Social, Behavioral, and Economic Sciences 1.

Forrester, N. 2017. “The Next Generation of Science Outreach.” Nature Jobs. http://blogs.nature.com/naturejobs/2017/04/14/the-next-generation-of-science-outreach.

Goodman, Steven N, Daniele Fanelli, and John PA Ioannidis. 2016. “What Does Research Reproducibility Mean?” Science Translational Medicine 8 (341): 341ps12–12. https://doi.org/10.1126/scitranslmed.aaf5027.

Munafò, Marcus R, Brian A Nosek, Dorothy VM Bishop, Katherine S Button, Christopher D Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J Ware, and John Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behaviour 1 (1): 1–9. https://doi.org/10.1038/s41562-016-0021.

Woelfle, Michael, Piero Olliaro, and Matthew H Todd. 2011. “Open Science Is a Research Accelerator.” Nature Chemistry 3 (10): 745–48. https://doi.org/10.1038/nchem.1149.

¹ Henceforth, I use the term dynamic throughout these lessons to refer to a component of a document that does something or is interactive, like a table that scrolls or a map you can zoom in on. Do not confuse this with a dynamic website.
² To get the number of citations for this article I used the function cr_citation_count from the rcrossref package. The function takes a Digital Object Identifier (DOI) as the input and searches CrossRef OpenURL to return the number of citations.
³ If you want to embed a tweet on a page, click on the ellipsis icon at the top of the tweet, select Embed Tweet, copy the HTML code provided, & plop it right into your Rmd file. The only other thing you need to do is wrap the entire code in the HTML <center> & </center> tags.
⁴ The Comprehensive R Archive Network (CRAN), R’s central software repository, currently contains 18909 packages. Hundreds more can be found in places like GitHub and Bioconductor.
⁵ Recent versions of RStudio now have the functionality to run Python code in your R Markdown document.
⁶ The figure above is called an image map, which is an HTML technique that allows you to create clickable areas on an image. I coded this so that each logo contains a different hyperlink. Try it out. Click here if you are interested in the source code.
⁷ We will cover some of this in future lessons on adding custom HTML and CSS code.
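
For the citation count footnote above, the call looks roughly like the sketch below. It assumes the rcrossref package is installed and that you have an internet connection. The wrapper name `citation_count` is made up for this illustration, and the DOI in the example is simply the Baker (2016) entry from the reference list, not necessarily the article the footnote refers to.

```r
# Hypothetical wrapper around rcrossref::cr_citation_count()
citation_count <- function(doi) {
  if (!requireNamespace("rcrossref", quietly = TRUE)) {
    stop("Install rcrossref first: install.packages(\"rcrossref\")")
  }
  # Queries CrossRef for the number of times the DOI has been cited
  rcrossref::cr_citation_count(doi = doi)
}

# Example call (needs a network connection, so it is left commented out):
# citation_count("10.1038/533452a")
```

Because the count is fetched each time the page is rendered, the number in the text stays current without manual edits.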

Lesson 1: Module 🦉 Overview