Dry Lab Notebooks

R Projects

R and RStudio
R Markdown

GitHub
GitLab

Data Management

General Steps
Storage of Scripts
Storage of Large Data

Cheaha
Lab Google Drive
Box
External Hard Drives

Other Resources

Dry Lab Notebooks

We generally follow, and encourage others to follow, Ten Simple Rules for a Computational Biologist’s Laboratory Notebook.

Lab notebooks provide a complete record of procedures, reagents, data, and thoughts, as well as an explanation of why/how you are doing an experiment and its results.

We recommend taking note of the following in your daily and/or dry lab experiment notebook:

The goal/purpose of the experiment and/or operations
Systems where the operations were done
- (e.g. personal laptop or Cheaha)
Directory of operations
Names and locations of scripts and when they were edited or used
Information on and location of raw data
A description of steps like a paragraph for a paper
Any important screenshots, figures, or presentations related to a project
Dates of operations
Versions of programming languages, tools, and/or packages

R Projects

One can use R Projects that can be linked to version control programs such as Git (which is covered later in the guide) to create and manage reproducible data.

R and RStudio

See this R Master Guide for a full introduction to R and RStudio.

R Markdown

R Markdown, a tool in RStudio, creates reproducible documents that can combine narrative text and code from languages such as R, Python, and SQL. See below for some helpful resources:

It might be helpful to add a sessionInfo() function at the bottom of Rmarkdown files in order to help people know what version of R and its packages you are using. You can also use this package, titled “packrat”, for R package version control, which allows versions of packages to be shared with others.

Software Carpentry Tutorials

These lessons, titled “R for Reproducible Scientific Analysis,” were covered in the aforementioned R Master Guide but are particularly helpful in learning how to code for data analysis and setting a good workflow. Take a look at the lesson plan to see what specific lessons fit your needs.

Git

Git is a free, open source distributed version control system. It tracks changes, maintains history, and allows you to revert changes. Clones will mirror the full history of the original repository. For more information on this as well as an overview of different version control systems, check out this chapter of the Git documentation book.

For an overview of the different version control systems and a brief explanation of what makes git a distributed

GitHub

Github uses Git to allow software development version control to manage changes to scripts through a push/pull system. It is a very useful tool for both individual and team version control, with settings to allow others to view your code for collaboration and/or reproducibility. Below are some GitHub tutorials you can follow:

Coding Club’s Setting up a GitHub Repository For Your Lab - Version Control & Code Management with GitHub
Software Carpentry’s Version Control with Git
Jennifer Bryan’s Happy Git and GitHub for the useR

GitLab

GitLab also uses Git and is a service that allows access to private pipelines for free. It has seamless migration to/from GitHub, and it has an RStudio integration. Below are some resources explaining GitLab, as well as some clarifying the differences between it and GitHub:

UAB Research Computing’s GitLab, which provides free unlimited private repos
Dr. Lara Ianov’s presentation on GitLab
About GitLab - GitHub vs GitLab

Data Management

This section provides steps you can follow for general data management and storage.

General Steps

Back up your data. A general rule to follow is to have three different storage locations.
- (e.g. on your computer, on Google Drive or a cloud-based location, and on an external hard drive or lab server)
Make sure you have at least one cloud and one off-site storage space for your data and scripts.
Keep data and scripts organized within a directory or workspace. There should be different directories for raw data, scripts, results, and/or documents, but this can vary by project.
Include frequent comments in your code. This is important for keeping it clear for both yourself in the future and for others reviewing it.
When creating a new directory, start with the date (in order to make them sortable by date), then a project title or description.
- (e.g. 190811_projectname for a project started on 8/11/2019)
When naming files, start with the date, then a title or short description of the script.
- (e.g. 190811_rna_seq_analysis)
In your scripts, it is important to include:
- The person who wrote the script
- Date(s) of operations of the script
- A description of the purpose of the script
- Comments about operations done within the script
Create an overview text document in your workspace describing the project, organization of the directory, and locations of data.
- (e.g. a GitHub README.md)
Create and check md5sum files to check if your data files have changed after computational operations (more on that here).
Use a method for version control or note what versions of tools were used for computational operations.

Storage of Scripts

RStudio projects, scripts, and some small files should be stored in a GitHub project/repository as well as a backup location.

Storage of Large Data

Cheaha

Cheaha is UAB’s cluster computing environment. You can learn more about it in this youtube video.

This would be used for large data that needs a supercomputer.
Cheaha is not a backed up resource.

Tools for Data Transfer

The tool “rclone” can be used on Cheaha for copying/moving files to and from cloud storage in Google Drive, Box, and other cloud services. Here are some useful resources:
The tool “Globus” can be used for secure data transfer and can be done between remote locations (e.g.: a collaborator’s computer to Cheaha and vice-versa). For more information on utilizing Globus with your UAB BlazerID, visit this link

Data Storage Locations and Functions

/data/project/lab_name/ (50TB)
- storage of raw data, genome information files, and storage of results
- a lab must send a ticket to Research Computing requesting that project directory be created
/scratch/username (shared storage - meant for short-term storage during analysis)
- workspace: copy raw files here to work with, then copy results to the lab project folder or /data/user/username once you are done
/data/user/username (5-10TB)
- files for an individual, such as R libraries or individual projects
/home/username (20-50GB)
- limited space, but some important files for programs might need to be stored here (e.g. “.Renviron” for R and RStudio instructions in the cluster)
More information about storage locations on the cluster can be found here.

Getting Started with Cheaha

Follow this link for an in-depth guide on getting started on Cheaha.
To request an account, send an email to support@listserv.uab.edu with background on your intended use and your affiliations.
Acknowledge the confirmation email to submit the account request.
Follow this Hello World example.
Make RLibs and RStudioLibs folders for R packages in the /data/user/username folder to prevent packages from being stored in the home directory.

Pipeline Tools and Virtual Environments

These tools should be used for version control for tool and workflow management on Cheaha to ensure reproducibility and scalability.

Anaconda: an integrated tool used to build the environment the pipeline runs on. It installs programs into the working directory and allows you to view and change which versions programs are utilizing.

Learn about using Anaconda for version control here.
To learn more about creating an Anaconda environment on Cheaha, watch this video.

Snakemake: built on a make system that ensures that files are produced for each appropriate step, ensuring every expected output is produced. You can rerun a step by deleting the output from that step and the pipeline will rerun from that point.

Learn more about the tool Snakemake here

Lab Google Drive

Our lab uses a Google Drive folder, where data is organized into subfolders by project.

Box

Box is a cloud storage service that is free to UAB staff. Though UAB previously announced discontinuing its use of box, it has been revealed that negotiations have allowed for UAB to continue its use as before.

External Hard Drives

Our lab uses G-Drives, but one needs to be aware of how a given drive is formatted, as this will affect how well it works with different computing platforms. We label these drives with our initials as well as additional labels as needed and a description of what is in the hard drive. Other options are Seagate and other reputable brands with extensive reviews. As external drives are a physical storage device for data, it is important to choose a trustworthy brand and model that is compatible with your needs.

Reproducible Research Data Science Guide

Version 1.0

Jen Fisher and Avery Williams, Lasseigne Lab, CDIB, UAB

Developed with assistance from the Lasseigne Lab and Dr. Lara Ianov, UAB

4/24/2020