-
Notifications
You must be signed in to change notification settings - Fork 10
Introduction to R: June 2, 2021
When: Wednesday June 2, 2021 10am-noon PDT/ 1-3pm EDT
Co-Instructors: Saranya Canchi, Marisa Lim
Moderator: Abhijna Parigi
Helpers: Jose Sanchez, Jeremy Walter
Description:
This free two hour workshop will provide an overview of the basics of R programming language along with RStudio which is a user-friendly environment for working with R. You will be introduced to the syntax, variables, functions, packages along with the various data structures in R. You will also learn basics of data wrangling from reading data from files, subsetting, merging and exporting data out of the R environment.
While we wait to get started --
- ✔️ Have you checked out the pre-workshop resources page?
- 📝 Please fill out our pre-workshop survey if you have not already done so! Click the survey link here
- 💻 Open a web browser - e.g., Firefox, Chrome, or Safari. Open the binder by clicking this button:
We are part of the training and engagement team for the NIH Common Fund Data Ecosystem, a project supported by the NIH to increase data reuse and cloud computing for biomedical research.
Have you heard of the NIH Common Fund Data Ecosystem?
Put up a ✔️ for yes and a ❎ for no!
You can contact us at training@cfde.atlassian.net
If you have questions at any point,
- Drop them in the chat, or
- Direct messages to the moderator or helpers are welcome, or
- Unmute yourself and ask during the workshop
We're going to use the "raise hand" reaction in zoom to make sure people are on board during the hands-on activities.
What is R?
R is a statistical computing and data visualization programming language.
What is RStudio?
R is often used via integrated development environments (IDE), and RStudio is probably the most popular option. R and Rstudio work on Mac, Linux, and Windows operating systems. The Rstudio layout displays lots of useful information and allows us to run R code from a script and view and save outputs all from one interface.
- RStudio panels
- Commonly used options in the panels
- View file system, help docs
- View/load/manage R package list
- View/save plots
- Customizing layout, etc. --> let's increase the font size
For this tutorial, we are using sequencing data from a European Nucleotide Archive dataset (PRJEB5348). This dataset is for an RNA-Seq experiment comparing two yeast strains, SNF2 and WT. They sequenced 96 samples in 7 different lanes, with 45 wt and 45 mut strains. Thus, there are 45 biological replicates and 7 technical replicates present, with a total of 672 samples! The datasets were compared to test the underlying statistical distribution of read counts in RNA-Seq data, and to test which differential expression programs work best.
We will examine the information about the RNA quality and quantity that the authors measured for all their samples.
We are going to explore those datasets using RStudio, combining both base R and a package called dplyr
.
Some Useful Info 🏝️
- Nucleic Acid Conc. = RNA concentration in the sample in units of nanogram per microliter (ng/ul)
- 260/280 = ratio measured with a spectrophotometer (i.e., Nanodrop) to assess RNA sample purity. For RNA, the ratio should be ~2.0
- Total RNA = total RNA molecular weight in the sample in units of microgram (ugm)
- General information about RNA quality for RNA-Seq
📓 Please note!
The Rstudio binder comes pre-loaded with R and the R packages we're using in today's workshop. On your local computer, you need to install R, Rstudio, and the R packages before using them:
# install
install.packages('ggplot2')
install.packages('readr')
install.packages('dplyr')
Alternatively, all of these packages belong to the tidyverse and can be installed along with several other useful packages with one installation command. However, we are not doing this today because it will be very slow.
# install.packages('tidyverse')
Load in R packages with the library()
function:
library(ggplot2)
library(readr)
library(dplyr)
The notes about masked objects mean certain R packages have used the same function name (i.e. both the
stats
anddplyr
have their own functions calledfilter
- check with?stats::filter
and?dplyr::filter
)
#
are used for notes in scripts or "commented out" lines of code that we don't want to run
The data file we're using today is stored in an online repository (https://osf.io/pzs7g/) and it is a tab-delimited .tsv
file. To read in the dataset in R, we are using the read_tsv()
function from the readr
R package.
A few notes about R syntax -- The <-
is called an assignment operator and it is an R convention for assigning values to variable names (i.e., in Python it is =
), here we assign the input data table values to the variable experiment_info
.
Functions like read_tsv
specify required and optional inputs within ( )
.
experiment_info <- read_tsv(file = 'https://osf.io/pzs7g/download/')
If you run the
read_tsv()
function before loading thereadr
R package, R will give you an error message that the function can't be found.
You will see a warning message, let's talk about what that means:
Warning: 3 parsing failures.
row col expected actual file
2 -- 12 columns 9 columns 'https://osf.io/pzs7g/download/'
3 -- 12 columns 9 columns 'https://osf.io/pzs7g/download/'
4 -- 12 columns 9 columns 'https://osf.io/pzs7g/download/'
Warning message:
Missing column names filled in: 'X9' [9], 'X10' [10], 'X11' [11], 'X12' [12]
Now let's look at the data!
# how big is the data set ?
dim(experiment_info)
# examine the first 6 rows across all columns of the data
head(experiment_info)
Tibbles - "a modern reimagining of the data.frame"
Tibbles are another flavor of data frame that are introduced as part of the tidyverse
packages i.e. dplyr
in this case to enable ease of data manipulation.
Instructor Switch!
There are six data types in R:
Data Type | Description | Examples |
---|---|---|
Numeric | numerical values | 1, 1.5, 20, pi |
Character | letters or words, enclosed in quotes | “random”, “2”, “TRUE”, “@”, “#” |
Integer | for integer numbers; the L indicates to R that it’s an integer) | 2L, 500L, -17L |
Logical | for boolean type data | TRUE, FALSE, T, F |
Complex | numbers with real and imaginary parts | 4+ 3i |
Raw | contains raw bytes encoding |
# what type of data structure is this ?
class(experiment_info)
There are other useful commands to get attribute information about a dataset.
# how many number of rows ?
nrow(experiment_info)
# how many number of columns ?
ncol(experiment_info)
# what are the column names of your dataset ?
colnames(experiment_info)
# what are the rownames of your dataset ?
rownames(experiment_info)
# what does the last few lines of my dataset look like ?
tail(experiment_info)
# what is the structure of my dataset ?
str(experiment_info)
data.frame
is tabular data similar to an excel spreadsheet which can contain mixed data types.
vectors
are uni-dimensional data structures with uniform data types.
Let's create a vector from the table we have.
a260_vector <- experiment_info$A260
# confirm the data structure
class(a260_vector)
# get the data type of the vector
typeof(a260_vector)
We can access individual elements within the vector using it's position. Indices start at 1 and use the []
notation.
# obtain the fifth element in the vector
a260_vector[5]
# subset the fifth and tenth element
a260_vector[c(5,10)]
# subset the fifth through tenth elements
a260_vector[c(5:10)]
Generate a vector containing the yeast strain.
yeast_strain_vector <- experiment_info$Yeast Strain
# The above code with result in error because R cannot parse spaces in column name. We need to add back command around the name of the column.
yeast_strain_vector <- experiment_info$`Yeast Strain`
Alternatively, you could also use the index of the column:
# dataframe[row number, column number]
# subset based on column index
yeast_strain_vector <- experiment_info[,1]
Another, two dimensional data structure is the matrix
. The major difference between matrix and dataframe is that matrix can have only one type of data type i.e. either character or numeric but not both!
Let's create a matrix containing the Nucleic Acid Conc., 260/280 ratio and Total RNA values.
# Create a matrix by subsetting the data frame by selecting the appropriate columns by column names
yeast_strain_matrix <- data.matrix(experiment_info[,c("Nucleic Acid Conc.",
"260/280",
"Total RNA")])
# View the data
head(yeast_strain_matrix )
- Real world data is messy and needs some formatting before you can use it for analysis.
- Often, you are only interested in a subset of the data and want to work with it without loading the entire set.
- Some programs/ visualizations require data to be in a specific format.
There are many ways to wrangle data. We will show you how to use an R package called dplyr
that contains sets of tools for dataframe manipulation.
For this workshop we will cover a few of them:
dplyr verb | Description |
---|---|
select() |
subset columns |
filter() |
subset rows matching some condition |
mutate() |
create new columns using information from other columns |
arrange() |
sort results |
count() |
count discrete values |
group_by() |
group data for analysis or plotting |
summarize() |
create summary statistics |
There are some empty columns in the experiment data file we have been working with. Let's create a subset of data without those empty columns using the select
function.
experiment_info_cleaned <- select(experiment_info,
Sample,
Yeast Strain,
Nucleic Acid Conc.,
Unit,
A260,
A280,
260/280,
Total RNA)
# As before, since R cannot parse spaces in column names, we need to enclose them in backticks to indicate that these words belong together.
experiment_info_cleaned <- select(experiment_info,
Sample,
`Yeast Strain`,
`Nucleic Acid Conc.`,
Unit,
A260,
A280,
260/280,
`Total RNA`)
# As a general rule, it is best to avoid column names that start with a number; we can use backticks for this column name.
experiment_info_cleaned <- select(experiment_info,
Sample,
`Yeast Strain`,
`Nucleic Acid Conc.`,
Unit,
A260,
A280,
`260/280`,
`Total RNA`)
# Check the dimensions of the subsetted dataframe
dim(experiment_info_cleaned)
You can also use select
to specify the columns you don't want. This can be done with the -
notation.
# Remove two columns: A260 and A280
select(experiment_info, -A260, -A280)
Select the columns Sample, Yeast Strain, A260 and A280 and assign them to a new object called "experiment_data".
experiment_data <- select(experiment_info_cleaned, Sample, `Yeast Strain`, A260, A280)
# Check the data
head(experiment_data)
We learned to subset entire columns. Using the filter
function, we can subset based on conditions, using a logical vector with TRUE and FALSE, or with operators.
Conditional Subsetting | Description |
---|---|
> | greater than the value |
>= | greater than or equal to the value |
< | less than the value |
<= | less than or equal to the value |
| |
applies logical operator OR
|
& |
applies logical operator AND
|
== |
equal to the value |
%in% |
operator used to identify if the value belongs to a vector |
!= |
not equal to the value |
The filter
function works on rows, so we can use it to subset data based on conditions in a column. Here's how you would use the function:
filter(experiment_info_cleaned, `Nucleic Acid Conc.` > 1500)
filter(experiment_info_cleaned, `A260/A280` >= 2.0 & `Nucleic Acid Conc.` > 1500)
Create a new object called wt_high_conc that contains data from WT strains and contain nucleic acid concentration of more than 1500 ng/uL.
wt_high_conc<- filter(experiment_info_cleaned, `Yeast Strain` == 'WT' & `Nucleic Acid Conc.` > 1500)
# Check the data
head(wt_high_conc)
So far, we have applied the actions individually. What if we wanted to apply select
and filter
at the same time?
We can use the concept of a pipe which is represented by %>%
in R. This is equivalent to |
in shell.
Pipes are available via the magrittr
package, installed automatically with dplyr
. If you use RStudio, the shortcut to produce the pipe is:
- Ctrl + Shift + M on a PC or
- Cmd + Shift + M on a Mac
Here's how you would use it:
experiment_info_wt <- experiment_info_cleaned %>%
filter(`Yeast Strain` == 'WT' & `Nucleic Acid Conc.` > 1500) %>%
select(Sample, `Yeast Strain`, A260, A280)
(when using pipes, the order of the command matters!)
mutate
can be used to create new columns in our dataframe using existing data.
mutate(experiment_info_cleaned,
concentration_ug_uL = `Nucleic Acid Conc.` / 1000)
Create a new table called library_start that includes the columns sample, yeast strain and two new columns called RNA_100 with the calculation of microliters to have 100ng of RNA and another column called water that says how many microliters of water we need to add to that to reach 50uL.
# create data object
library_start <- experiment_info_cleaned %>%
mutate(RNA_100 = 100/ `Nucleic Acid Conc.`,
water = 50 - RNA_100) %>%
select(Sample, `Yeast Strain`, RNA_100, water)
# check the data
head(library_start)
The approach of split-apply-combine allows you to split the data into groups, apply a function/analysis and then combine the results. We can do this using group_by()
and summarize()
group_by() takes as argument the column name/s that contain categorical variables that can be used to group the data.
experiment_info_cleaned %>%
group_by(`Yeast Strain`) %>%
summarize(mean_concentration = mean(`Nucleic Acid Conc.`))
# summarize using more than one column
experiment_info_cleaned %>%
group_by(`Yeast Strain`) %>%
summarize(mean_concentration = mean(`Nucleic Acid Conc.`),
mean_total_RNA = mean(`Total RNA`))
Next, we can use arrange()
to sort the data in either ascending (defalt order) or descending order.
# arrange new table in ascending mean concentrations
experiment_info_cleaned %>%
group_by(`Yeast Strain`) %>%
summarize(mean_concentration = mean(`Nucleic Acid Conc.`),
mean_total_RNA = mean(`Total RNA`)) %>%
arrange(mean_concentration)
# arrange new table in descending mean concentrations
experiment_info_cleaned %>%
group_by(`Yeast Strain`) %>%
summarize(mean_concentration = mean(`Nucleic Acid Conc.`),
mean_total_RNA = mean(`Total RNA`)) %>%
arrange(desc(mean_concentration))
We can also examine the number of categorical values using count()
experiment_info_cleaned %>%
count(`Yeast Strain`)
Calculate the mean value for A260/A280 for each of the yeast strains. Can these sample be used for downstream analysis ?
# calculate mean values
experiment_info_cleaned %>%
group_by(`Yeast Strain`) %>%
summarize(mean_ratio = mean(`260/280`))
Instructor switch!
ggplot
and its related packages are a flexible and widely-used plotting R package. You can make standard plots (i.e., scatter, histogram, box, bar, density plots), as well as a variety of other plot types like maps and phylogenetic trees!
It's a very helpful tool for exploring your data/results and generating publication-ready figures.
We're going to create some plots with the tibble dataframe from the previous section. To make it easier to specify column names, let's rename to remove spaces (you could also specify the column names using backticks as we did above):
# current column names
colnames(experiment_info_cleaned)
# new names specified with a vector
colnames(experiment_info_cleaned) <- c('Sample', 'Yeast_strain', 'Nucleic_Acid_Conc.',
'ng/ul', 'A260', 'A280', 'A260_280',
'Total_RNA', 'ugm')
The ggplot syntax always begins with a function that specifies the data and then layers on additional functions to specify a whole bunch of customizable plot aesthetics. Here, we specify the data object as the experiment_info_cleaned
tibble and set the axes with the mapping
flag.
ggplot(data=experiment_info_cleaned,
mapping=aes(x=Total_RNA, y=Nucleic_Acid_Conc.))
Rstudio has some really helpful features for finding help docs, and for autofilling. Use the tab
key to autofill variable or R package function names for example. Find help docs about functions or R packages with the ?
key:
# help docs for ggplot() function
? ggplot
# help docs for ggplot2 R package
? ggplot2
Let's make a scatter plot by adding geom_point()
with a +
to the ggplot()
function:
ggplot(data=experiment_info_cleaned,
mapping=aes(x=Total_RNA, y=Nucleic_Acid_Conc.)) +
geom_point() +
ggtitle('Scatter plot')
Notice the indentation after the +
sign.
Add transparency to points (also works for other plot types!) with the alpha
option (0=completely transparent, 1=solid color) and change point size with size
:
ggplot(data=experiment_info_cleaned,
mapping=aes(x=Total_RNA, y=Nucleic_Acid_Conc.)) +
geom_point(alpha=0.6, size=4) +
ggtitle('Scatter with transparent points')
We can modify the code to fill points by yeast strain and make point size correspond to the 260/280 ratio:
ggplot(data=experiment_info_cleaned,
mapping=aes(x=Total_RNA, y=Nucleic_Acid_Conc., fill=Yeast_strain,
size=A260_280)) +
geom_point(alpha=0.6, pch=21, col='white') +
ggtitle('Scatter: color by strain \n and size by 260/280 ratio')
There are 2 strains in this dataset, WT
and snf2
.
unique(experiment_info_cleaned$Yeast_strain)
Let's make a boxplot by strain:
ggplot(data=experiment_info_cleaned,
mapping=aes(x=Yeast_strain, y=Total_RNA)) +
geom_boxplot() +
ggtitle('Boxplot')
Boxplots can be more informative if we add points to see the distribution of the data:
ggplot(data=experiment_info_cleaned,
mapping=aes(x=Yeast_strain, y=Total_RNA)) +
geom_point(alpha=0.6) +
geom_boxplot() +
ggtitle('Boxplot w/ points')
- What happens if you reverse the order of the
geom_point()
andgeom_boxplot()
functions for the boxplot code above?
Above, the points are plotted first, then the boxplot. If we reverse the order, the points will be plotted on top of the boxplot.
- What happens if you use
geom_jitter()
instead ofgeom_point()
?
geom_jitter()
'jitters' the points so they are not overlapping as much.
- Modifying axis label text:
xlab()
andylab()
- Facets:
facet_wrap()
orfacet_grid()
ggplot(data=experiment_info_cleaned,
mapping=aes(x=Total_RNA, y=Nucleic_Acid_Conc., col=Yeast_strain)) +
geom_point() +
facet_wrap(~Yeast_strain) +
xlab('Total RNA (ugm)') +
ylab('Nucleic Acid Conc. (ng/ul)') +
ggtitle('Facet scatter plot')
- Try making a histogram (
geom_histogram()
) and a density plot (geom_density()
) of theA260_280
ratio values. (Hint: a histogram plots counts on the y-axis so you only need to specify the x-axis variable in theggplot()
function.)
Here is the basic code - you may need to modify the bin size:
ggplot(data=experiment_info_cleaned,
mapping=aes(x=A260_280))+
geom_histogram(bins=20) +
ggtitle('Histogram')
You could add colors and plot by strain type like this:
ggplot(data=experiment_info_cleaned,
mapping=aes(x=A260_280, fill=Yeast_strain))+
geom_histogram(bins=20, col='black') +
ggtitle('Histogram by strain')
And we could facet the plot by strain like this:
ggplot(data=experiment_info_cleaned,
mapping=aes(x=A260_280, fill=Yeast_strain))+
geom_histogram(bins=20, col='black') +
facet_grid(Yeast_strain~.) + #this sets whether the facet is horizontal or vertical. here, we will get 2 rows of histograms by yeast strain
ggtitle('Facet histogram')
One more edit! Let's change the colors on the histogram. There are many built-in color palettes! Check more out here:
ggplot(data=experiment_info_cleaned,
mapping=aes(x=A260_280, fill=Yeast_strain))+
geom_histogram(bins=20, col='black') +
facet_grid(Yeast_strain~.) +
scale_fill_manual(values = c("cornflowerblue", "goldenrod2")) + # Add blue and yellow colors that are more colorblind friendly for plotting
ggtitle('Histogram w/ custom colors')
Now, let's switch this to a density plot and change the plot background
ggplot(data=experiment_info_cleaned,
mapping=aes(x=A260_280, fill=Yeast_strain))+
geom_density(alpha=0.6) +
facet_grid(Yeast_strain~.) +
scale_fill_manual(values = c("cornflowerblue", "goldenrod2")) +
theme_bw() +
ggtitle('Density plot')
Hopefully this exercise demonstrates how flexible ggplot is!
Use the ggsave()
function to save images (variety of file options available), for example:
myplot <- ggplot(data=experiment_info_cleaned,
mapping=aes(x=A260_280, fill=Yeast_strain))+
geom_density(alpha=0.6) +
facet_grid(Yeast_strain~.) +
scale_fill_manual(values = c("cornflowerblue", "goldenrod2")) +
theme_bw() +
ggtitle('Density plot')
ggsave(filename='mydensityplot.jpg', plot=myplot, height=4, width=4, units='in', dpi=600)
There are many more options to edit plot aesthetics and there are lots of excellent tutorials and resources, including:
- Rstudio cheatsheet for ggplot
- R Cookbook graphs section
- It's important to consider colorblind-friendly color palettes for plots too!
- The ColorBrewer website is a fantastic tool for selecting nice color combinations
- Viridis has one of the best color contrasts between light and dark colors; it works reasonably well for printing color plots in black and white
- cowplot is a useful package for further customization of plot arrangements and publication-ready figures
- ggplot tutorial from Cedric Scherer
Be sure to save any work/notes you took in the binder to your computer. Any new files/changes are not available when the binder session is closed!
For example, select a file, click "More", click "Export":
📝 Please fill out our post-workshop survey! We use the feedback to help improve our training materials and future workshops. Click here for survey!
Questions?
Acknowledgements: This lesson was inspired by material from 2019 ANGUS Intro to R lesson
- RStudio cheatsheet for readr
- RStudio cheatsheet for dplyr
- RStudio cheatsheet for data Wrangling with dplyr
- More tutorials:
Q: I am receiving this error when I run the code error: could not find function <name of the function>
A: This generally indicates that the package containing the function or command you are trying to use is not loaded into. You can fix this by running this code:
library(<name of the package>)
in the R Console.
Q: How is the class()
function different from typeof()
?
A: class()
in R informs you of the type of the data type in your dataset while typeof()
determines the (R internal) type or storage mode of any object. You can find useful discussion on this stack exchange thread.
Q: What is the advantage of using <-
as an assignment operator, instead of =
?
A: It is a choice made by the creators of R about the different usage. But typically you use <-
for assignment and =
within functions for arguments.
Q: Can you specify a column name to rename?
A: You can use the column index to rename a specific column. For example colnames(experiment_info_cleaned)[2] <- “yeast_strain”
- Home
- Resources for Attendees
- Resources for Instructors
- Training Workshop Notes
-
HuBMAP Tools
-
R
-
RNA-Seq Concepts, Design and Workflows
-
RNA-Seq in the Cloud
-
Snakemake Part I & II
-
UNIX