tabular_data.Rmd

---
title: 'Working with tabular data in R'
---

> #### Objectives
>
> * Read tabular data into R from a CSV file
> * Filter rows based on conditions and values in columns
> * Select columns of interest or exclude columns that aren't of interest
> * Summarize the values in one or more columns
> * Group data based on one or more categorical variables and create summaries within groups
> * Add new columns to tables derived from the values in other columns
> * Visualize tabular data by creating plots with the popular **ggplot2** package
> * Perform a statistical two-sample test

```{r setup, include=FALSE}
knitr::opts_chunk$set(comment = NA)
```

# Introduction

In this session we'll introduce some of the most commonly used operations for
interacting with and visualizing tabular data. The functions we will look at are
from a collection of packages known as the **tidyverse**.

This will be something of a whistle-stop tour with the aim of giving you an
appreciation of what you can do with R and hopefully inspiring you to start on a
journey towards using R for effective analysis of data in your research.

# Tabular data

Much of the data we work with are in tabular format and R has a data structure,
known as a **data.frame**, for handling this type of data. Let's look at an
example using one of R's built-in data sets.

Open RStudio and type "`iris`" at the command prompt and hit the return key. You
will see a set of measurements of petal and sepal dimensions for various iris
plants displayed as a table.

Writing our commands in a script file will make things much easier as we're
going to build chains of operations, incrementally adding additional operations
as we go along. Create a new R script file by selecting the **File > New File >
R Script** menu item. Save the script on your computer using using the
'**Save**' option from the '**File**' menu.

Type "`iris`" in the script file and run the command by clicking on the **Run**
button in the toolbar located above the script or by pressing <kbd>cmd</kbd> +
<kbd>S</kbd> on MacOS or <kbd>Ctrl</kbd> + <kbd>S</kbd> on Windows.

With large tables it is often more useful just to print the first few lines.

```{r}
head(iris)
```

The **`nrow()`** and **`ncol()`** functions return the number or rows and
columns of the table.

```{r}
nrow(iris)
ncol(iris)
```

## Tidyverse tables (tibbles)

The **tidyverse** provides a special kind of data frame known as a **tibble**
which has some additional behaviour over and above what you get with a regular
data frame. One of those additional features is the way a tibble is printed.

We need to load the tidyverse to make use of its functions.

```{r}
library(tidyverse)
```

Several packages get loaded - we'll be using functions from most of these.

The tidyverse also comes with some built-in data sets that are mainly used to
demonstrate how functions work in example code given in the help documentation.
One of these is the `mpg` data set containing fuel economy data for different
models of cars.

```{r}
mpg
```

Only the first 10 rows are printed and only as many columns as can fit across
the width of the page. The dimensions of the table (number of rows and columns)
are also displayed as well as the types of each column. `<chr>` stands for
character, `<dbl>` for double (floating-point number) and `<int>` for integer.

The help page, accessed using `?mpg`, contains more information about this data
set.

## Data.frame structure

A data frame or table in R is an ordered set of vectors, all of which have the
same length.

The values of a column can be extracted as a vector using the **`$`** operator.

```{r}
city_mpg <- mpg$cty
```

Having extracted the miles per gallon values for city driving as a vector, we
can use it as we would any other vector.

```{r}
length(city_mpg)
```

```{r}
city_mpg[1:50]
```

```{r}
mean(mpg$cty)
```

# NALCN current response data set

In what follows we'll be using a data set generated in the Cancer Research UK
Cambridge Institute and published in the following paper.

_The NALCN channel regulates metastasis and nonmalignant cell dissemination._
[Rahrmann _et al._, Nature Genetics 2022](http://doi.org/10.1038/s41588-022-01182-0)

We'll start by looking at the source data for figure 2b in this paper which can
be downloaded from the journal website by following the 'Source data' link at
the bottom of the figure caption.

Note that this Excel spreadsheet is actually in the newer xlsx format but has
the wrong suffix (xls) so Excel and R may refuse to open it. Rename the file so
it has the xlsx suffix.

# Reading a CSV file

The data for figures 2a and 2b (sheet FIG.2a_b) are not in a very suitable
format for computational analysis (appear to be structured more for human
readability) so we'll use a reformatted version available as a CSV file
[here](data/current_responses.csv).

Download the CSV file and save or move it to the directory in which you are
running your R session or that in which you saved your R script. You can find
out which directory you are working in using the **`getwd()`** function. If you
saved your R script file to a different location you can change the working
directory to the directory in which the script is located by using the
**Sesssion > Set Working Directory > To Source File Location** menu item.

## **`read_csv()`**

We'll use the **`read_csv()`** function from the **`readr`** package (one of the
packages that was loaded when we ran `library(tidyverse)`) to read the contents
of this CSV file into R.

```{r eval = FALSE}
current_responses <- read_csv("current_responses.csv")
```

```{r echo = FALSE}
current_responses <- read_csv("data/current_responses.csv")
```

```{r}
current_responses
```

These data are from a voltage-clamp experiment in which current density was
measured for the NALCN ion channel in _P1^KP^_-GAC cells for various voltage
steps in the &pm;80mV range.

To test whether _Nalcn_ regulates cancer progression, its function was altered
using both Nalcn-short hairpin RNA and NALCN-complementary DNA lentiviral
transduction. This data set contains measurements for 5 replicates each of the
_Nalcn^shRNA^_, _NALCN^cDNA^_ and control groups.

```{r}
summary(current_responses)
mean(current_responses$current)
```

# Slicing and dicing

## **`filter()`**

We'll filter our table so it just contains rows for the control samples.

```{r}
control_responses <- filter(current_responses, group == "Control")
control_responses
```

We can add further conditions, so, for example, to obtain the subset of current
measurements for control samples at a voltage of -80mV:

```{r}
control_minus_80_responses <- filter(current_responses, group == "Control", voltage == -80)
control_minus_80_responses
```

```{r}
mean(control_minus_80_responses$current)
```

## **`select()`**

As well as subsetting our table based on the contents of various columns, we
can also select a subset of columns that we're interested in.

```{r}
control_minus_80_responses <- select(control_responses, id, voltage, current)
control_minus_80_responses
```

Alternatively, we could specify the columns we're not interested in and would
like to exclude, using the **`!`** operator.


```{r eval = FALSE}
control_minus_80_responses <- select(control_responses, !group)
```

# Chaining operations using **`%>%`**

Typically a series of operations will be performed on a data set and assigning
the result of each to an intermediate object can get quite cumbersome and result
in code that isn't easy to read.

The pipe operator, **`%>%`**, can be used to join two or more operations. It
takes the output from one operation (the one that comes before it) and passes
the result into the next function (the one that comes after) as the first
argument.

> **`x %>% f(y)`** is equivalent to **`f(x, y)`**

We'll try this out to rewrite the `filter()` and `select()` operations in the
previous code snippet.

```{r}
filter(current_responses, group == "Control", voltage == -80) %>% select(!group)
```

It is more readable to spread this over multiple lines of code with a separate
operation on each line.

```{r}
current_responses %>%
  filter(group == "Control", voltage == -80) %>%
  select(!group)
```

We can assign the result to an object in the usual way.

```{r}
control_minus_80_responses <- current_responses %>%
  filter(group == "Control", voltage == -80) %>%
  select(!group)
```

# Summarizing tabular data

The `summary()` function we looked at in the introduction section is able to
work on tables and produces a summary for each column.

```{r}
summary(control_minus_80_responses)
```

## **`summarize()`**

The tidyverse has its own summarization function.

```{r}
summarize(control_minus_80_responses, mean_current = mean(current))
```

Unlike computing the mean value of a vector of numbers as we did earlier for
the "current" column extracted as a vector, `summarize()` returns another table.
This is useful when we're summarizing more than one thing at once.

```{r}
summarize(control_minus_80_responses, Current = mean(current), SD = sd(current), N = n())
```

## **`group_by()`**

Usually, you'll want to calculate summary values within groups of data that are
given by categorical variables, i.e. columns containing group names. We can do
this using the **`group_by()`** function.

For example, we can compute the mean current for each of the different voltages
in our control samples.

```{r}
control_responses %>%
  group_by(voltage) %>%
  summarize(Current = mean(current), SD = sd(current), N = n())
```

And we can group by more than one variable, so returning to our original,
unfiltered table:

```{r}
summary <- current_responses %>%
  group_by(group, voltage) %>%
  summarize(Current = mean(current), SD = sd(current), N = n())
summary
```

# Sorting

Sometimes we wish to sort the contents of the table by one or more columns.

## **`arrange()`**

Here are some examples of using the `arrange()` function for sorting.

First, we'll sort the values by increasing current density.

```{r}
arrange(summary, Current)
```
Then by decreasing current using `desc()`:

```{r}
arrange(summary, desc(Current))
```

Finally we arrange the summarized values by increasing voltage step and then
group.

```{r}
arrange(summary, voltage, group)
```

# Modifying a table

The source data Excel spreadsheet contains values for the mean current density
and also the standard error. We've calculated the standard deviation but to get
the standard error we must divide this by the square root of the number of
observations.

## **`mutate()`**

The **`mutate()`** lets us add new variables (columns) to a table or modify
existing ones.

```{r}
mutate(summary, SE = SD /sqrt(N))
```

## **`rename()`**

Sometimes we need to rename columns and can do so using the **`rename()``**
function. Putting it all together, here's an example that includes a renaming
operation for consistent column naming (capitalization) as well as the other
operations we've covered.

```{r}
summary <- current_responses %>%
  group_by(group, voltage) %>%
  summarize(Current = mean(current), N = n(), SD = sd(current), .groups = "drop") %>%
  mutate(SE = SD /sqrt(N)) %>%
  rename(Group = group, Voltage = voltage) %>%
  select(-SD) %>%
  arrange(Voltage, Group)
summary
```
# Visualizing tabular data with ggplot2

So far we've mainly looked at manipulating and performing calculations on our
data to get it in the right shape for the analysis we want to do. Now we'll move
on to creating plots to visualize the data. For this we'll be using the popular
**ggplot2** package that comes as part of the tidyverse.

## Scatter plot

We'll start with a simple scatter plot.

```{r}
ggplot(data = summary, aes(x = Voltage, y = Current)) +
  geom_point()
```

We specified 3 things to create this plot:

1. The **data** -- needs to be a data frame (or a tibble)
2. The type of plot -- this is called a **geom** in ggplot2 terminology
3. The mapping of variables (columns in our table) to visual properties of objects in the plot - these are called **aesthetics** in ggplot2 and specified using the **`aes`** argument

In this case, the type of plot is a **`geom_point`**, ggplot2's function for a
scatter plot, and the voltage and current density are mapped to the **`x`** and
**`y`** coordinates.

## Aesthetics

Look up the `geom_point` help page to see the list of aesthetics that are
available for this geom.

```{r eval = FALSE}
?geom_point
```

We have 3 distinct groups and we can help to distinguish these in the plot by
using another of the aesthetics: colour.

```{r}
ggplot(data = summary, aes(x = Voltage, y = Current, colour = Group)) +
  geom_point()
```

## Line plot

Plots in ggplot2 are built up in layers. We'll add another _geom_ for connecting
our points to create a line plot.

```{r}
ggplot(data = summary, aes(x = Voltage, y = Current, colour = Group)) +
  geom_point() +
  geom_line()
```

## Error bars

This now looks a lot more like figure 2b in the Nature Genetics paper, but in
that figure error bars on each of the points are displayed. We'll add another
layer to our plot using **`geom_errorbar`**. This requires another two aesthetic
mappings for the lower and upper bounds representing the uncertainty of each
current density estimate.

```{r}
ggplot(summary, aes(x = Voltage, y = Current, ymin = Current - SE, ymax = Current + SE, colour = Group)) +
  geom_point() +
  geom_line() +
  geom_errorbar(width = 2)
```

The other thing we did here was to adjust the width of the error bars as the
default setting was not aesthetically pleasing.

Each of the _geoms_ has specific options for changing its appearance. The help
pages are a good place to start to learn about how each works and have lots of
example code snippets that you can easily run to understand better how they
work.

## Customizing plots

You will almost certainly want to customize the appearance of your plots and
**ggplot2** provides lots of options to do so including the option of specifying
a theme.

We'll apply one of these themes, change the order in which the layers (geoms)
are drawn, specify some colours and change the axis labels to make our plot look
a bit more polished.

```{r}
ggplot(summary, aes(x = Voltage, y = Current, ymin = Current - SE, ymax = Current + SE, colour = Group)) +
  geom_errorbar(width = 2, colour = "dimgrey") +
  geom_line() +
  geom_point(size = 2) +
  labs(x = "Voltage (mV)", y = "Current density (pA/pF)", colour = "") +
  scale_color_manual(values = c("dimgrey", "firebrick", "steelblue")) +
  scale_x_continuous(breaks = c(-80, -60, -40, -20, 0, 20, 40, 60, 80)) +
  scale_y_continuous(breaks = c(-40, -20, 0, 20, 40)) +
  theme_minimal() +
  theme(
    panel.grid.major.x = element_blank(),
    panel.grid.minor = element_blank()
  )
```

# Statistical tests

The analysis of these data in the Nature Genetics paper also includes some
statistical tests. For example, the authors tested for a statistically
significant difference in the current response at a voltage step of +80mV
between the _Nalcn^shDNA^_ and control groups.

We'll first filter the values we need for this comparison.

```{r}
test_data <- filter(current_responses, voltage == 80, group %in% c("Control", "Nalcn_shRNA"))
```

Normally we'd visualize our data before applying a statistical test to satisfy
ourselves that the assumptions of the test hold. We'll create a box plot using
another of the geoms available in ggplot2, **`geom_boxplot`**.

```{r}
ggplot(data = test_data, aes(x = group, y = current)) +
  geom_boxplot()
```

We only have 5 observations for each group in this case so the box plot is
perhaps not that useful for looking at the distribution of the data. We'll use
**`geom_jitter`** to show the points as well.

```{r}
ggplot(data = test_data, aes(x = group, y = current)) +
  geom_boxplot(width = 0.4, outlier.shape = NA) +
  geom_jitter(width = 0.1, size = 2) +
  labs(x = "", y = "Current density (pA/pF)") +
  theme_minimal() +
  theme(panel.grid.major.x = element_blank())
```

Finally we'll run a non-parametric Wilcoxon test, also known as the Mann-Whitney
U test.

```{r}
wilcox.test(current ~ group, data = test_data)
```