-
Notifications
You must be signed in to change notification settings - Fork 0
/
tabular_data.Rmd
514 lines (378 loc) · 15.8 KB
/
tabular_data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
---
title: 'Working with tabular data in R'
---
> #### Objectives
>
> * Read tabular data into R from a CSV file
> * Filter rows based on conditions and values in columns
> * Select columns of interest or exclude columns that aren't of interest
> * Summarize the values in one or more columns
> * Group data based on one or more categorical variables and create summaries within groups
> * Add new columns to tables derived from the values in other columns
> * Visualize tabular data by creating plots with the popular **ggplot2** package
> * Perform a statistical two-sample test
```{r setup, include=FALSE}
knitr::opts_chunk$set(comment = NA)
```
# Introduction
In this session we'll introduce some of the most commonly used operations for
interacting with and visualizing tabular data. The functions we will look at are
from a collection of packages known as the **tidyverse**.
This will be something of a whistle-stop tour with the aim of giving you an
appreciation of what you can do with R and hopefully inspiring you to start on a
journey towards using R for effective analysis of data in your research.
# Tabular data
Much of the data we work with are in tabular format and R has a data structure,
known as a **data.frame**, for handling this type of data. Let's look at an
example using one of R's built-in data sets.
Open RStudio and type "`iris`" at the command prompt and hit the return key. You
will see a set of measurements of petal and sepal dimensions for various iris
plants displayed as a table.
Writing our commands in a script file will make things much easier as we're
going to build chains of operations, incrementally adding additional operations
as we go along. Create a new R script file by selecting the **File > New File >
R Script** menu item. Save the script on your computer using using the
'**Save**' option from the '**File**' menu.
Type "`iris`" in the script file and run the command by clicking on the **Run**
button in the toolbar located above the script or by pressing <kbd>cmd</kbd> +
<kbd>S</kbd> on MacOS or <kbd>Ctrl</kbd> + <kbd>S</kbd> on Windows.
With large tables it is often more useful just to print the first few lines.
```{r}
head(iris)
```
The **`nrow()`** and **`ncol()`** functions return the number or rows and
columns of the table.
```{r}
nrow(iris)
ncol(iris)
```
## Tidyverse tables (tibbles)
The **tidyverse** provides a special kind of data frame known as a **tibble**
which has some additional behaviour over and above what you get with a regular
data frame. One of those additional features is the way a tibble is printed.
We need to load the tidyverse to make use of its functions.
```{r}
library(tidyverse)
```
Several packages get loaded - we'll be using functions from most of these.
The tidyverse also comes with some built-in data sets that are mainly used to
demonstrate how functions work in example code given in the help documentation.
One of these is the `mpg` data set containing fuel economy data for different
models of cars.
```{r}
mpg
```
Only the first 10 rows are printed and only as many columns as can fit across
the width of the page. The dimensions of the table (number of rows and columns)
are also displayed as well as the types of each column. `<chr>` stands for
character, `<dbl>` for double (floating-point number) and `<int>` for integer.
The help page, accessed using `?mpg`, contains more information about this data
set.
## Data.frame structure
A data frame or table in R is an ordered set of vectors, all of which have the
same length.
The values of a column can be extracted as a vector using the **`$`** operator.
```{r}
city_mpg <- mpg$cty
```
Having extracted the miles per gallon values for city driving as a vector, we
can use it as we would any other vector.
```{r}
length(city_mpg)
```
```{r}
city_mpg[1:50]
```
```{r}
mean(mpg$cty)
```
# NALCN current response data set
In what follows we'll be using a data set generated in the Cancer Research UK
Cambridge Institute and published in the following paper.
_The NALCN channel regulates metastasis and nonmalignant cell dissemination._
[Rahrmann _et al._, Nature Genetics 2022](http://doi.org/10.1038/s41588-022-01182-0)
We'll start by looking at the source data for figure 2b in this paper which can
be downloaded from the journal website by following the 'Source data' link at
the bottom of the figure caption.
Note that this Excel spreadsheet is actually in the newer xlsx format but has
the wrong suffix (xls) so Excel and R may refuse to open it. Rename the file so
it has the xlsx suffix.
# Reading a CSV file
The data for figures 2a and 2b (sheet FIG.2a_b) are not in a very suitable
format for computational analysis (appear to be structured more for human
readability) so we'll use a reformatted version available as a CSV file
[here](data/current_responses.csv).
Download the CSV file and save or move it to the directory in which you are
running your R session or that in which you saved your R script. You can find
out which directory you are working in using the **`getwd()`** function. If you
saved your R script file to a different location you can change the working
directory to the directory in which the script is located by using the
**Sesssion > Set Working Directory > To Source File Location** menu item.
## **`read_csv()`**
We'll use the **`read_csv()`** function from the **`readr`** package (one of the
packages that was loaded when we ran `library(tidyverse)`) to read the contents
of this CSV file into R.
```{r eval = FALSE}
current_responses <- read_csv("current_responses.csv")
```
```{r echo = FALSE}
current_responses <- read_csv("data/current_responses.csv")
```
```{r}
current_responses
```
These data are from a voltage-clamp experiment in which current density was
measured for the NALCN ion channel in _P1^KP^_-GAC cells for various voltage
steps in the ±80mV range.
To test whether _Nalcn_ regulates cancer progression, its function was altered
using both Nalcn-short hairpin RNA and NALCN-complementary DNA lentiviral
transduction. This data set contains measurements for 5 replicates each of the
_Nalcn^shRNA^_, _NALCN^cDNA^_ and control groups.
```{r}
summary(current_responses)
mean(current_responses$current)
```
# Slicing and dicing
## **`filter()`**
We'll filter our table so it just contains rows for the control samples.
```{r}
control_responses <- filter(current_responses, group == "Control")
control_responses
```
We can add further conditions, so, for example, to obtain the subset of current
measurements for control samples at a voltage of -80mV:
```{r}
control_minus_80_responses <- filter(current_responses, group == "Control", voltage == -80)
control_minus_80_responses
```
```{r}
mean(control_minus_80_responses$current)
```
## **`select()`**
As well as subsetting our table based on the contents of various columns, we
can also select a subset of columns that we're interested in.
```{r}
control_minus_80_responses <- select(control_responses, id, voltage, current)
control_minus_80_responses
```
Alternatively, we could specify the columns we're not interested in and would
like to exclude, using the **`!`** operator.
```{r eval = FALSE}
control_minus_80_responses <- select(control_responses, !group)
```
# Chaining operations using **`%>%`**
Typically a series of operations will be performed on a data set and assigning
the result of each to an intermediate object can get quite cumbersome and result
in code that isn't easy to read.
The pipe operator, **`%>%`**, can be used to join two or more operations. It
takes the output from one operation (the one that comes before it) and passes
the result into the next function (the one that comes after) as the first
argument.
> **`x %>% f(y)`** is equivalent to **`f(x, y)`**
We'll try this out to rewrite the `filter()` and `select()` operations in the
previous code snippet.
```{r}
filter(current_responses, group == "Control", voltage == -80) %>% select(!group)
```
It is more readable to spread this over multiple lines of code with a separate
operation on each line.
```{r}
current_responses %>%
filter(group == "Control", voltage == -80) %>%
select(!group)
```
We can assign the result to an object in the usual way.
```{r}
control_minus_80_responses <- current_responses %>%
filter(group == "Control", voltage == -80) %>%
select(!group)
```
# Summarizing tabular data
The `summary()` function we looked at in the introduction section is able to
work on tables and produces a summary for each column.
```{r}
summary(control_minus_80_responses)
```
## **`summarize()`**
The tidyverse has its own summarization function.
```{r}
summarize(control_minus_80_responses, mean_current = mean(current))
```
Unlike computing the mean value of a vector of numbers as we did earlier for
the "current" column extracted as a vector, `summarize()` returns another table.
This is useful when we're summarizing more than one thing at once.
```{r}
summarize(control_minus_80_responses, Current = mean(current), SD = sd(current), N = n())
```
## **`group_by()`**
Usually, you'll want to calculate summary values within groups of data that are
given by categorical variables, i.e. columns containing group names. We can do
this using the **`group_by()`** function.
For example, we can compute the mean current for each of the different voltages
in our control samples.
```{r}
control_responses %>%
group_by(voltage) %>%
summarize(Current = mean(current), SD = sd(current), N = n())
```
And we can group by more than one variable, so returning to our original,
unfiltered table:
```{r}
summary <- current_responses %>%
group_by(group, voltage) %>%
summarize(Current = mean(current), SD = sd(current), N = n())
summary
```
# Sorting
Sometimes we wish to sort the contents of the table by one or more columns.
## **`arrange()`**
Here are some examples of using the `arrange()` function for sorting.
First, we'll sort the values by increasing current density.
```{r}
arrange(summary, Current)
```
Then by decreasing current using `desc()`:
```{r}
arrange(summary, desc(Current))
```
Finally we arrange the summarized values by increasing voltage step and then
group.
```{r}
arrange(summary, voltage, group)
```
# Modifying a table
The source data Excel spreadsheet contains values for the mean current density
and also the standard error. We've calculated the standard deviation but to get
the standard error we must divide this by the square root of the number of
observations.
## **`mutate()`**
The **`mutate()`** lets us add new variables (columns) to a table or modify
existing ones.
```{r}
mutate(summary, SE = SD /sqrt(N))
```
## **`rename()`**
Sometimes we need to rename columns and can do so using the **`rename()``**
function. Putting it all together, here's an example that includes a renaming
operation for consistent column naming (capitalization) as well as the other
operations we've covered.
```{r}
summary <- current_responses %>%
group_by(group, voltage) %>%
summarize(Current = mean(current), N = n(), SD = sd(current), .groups = "drop") %>%
mutate(SE = SD /sqrt(N)) %>%
rename(Group = group, Voltage = voltage) %>%
select(-SD) %>%
arrange(Voltage, Group)
summary
```
# Visualizing tabular data with ggplot2
So far we've mainly looked at manipulating and performing calculations on our
data to get it in the right shape for the analysis we want to do. Now we'll move
on to creating plots to visualize the data. For this we'll be using the popular
**ggplot2** package that comes as part of the tidyverse.
## Scatter plot
We'll start with a simple scatter plot.
```{r}
ggplot(data = summary, aes(x = Voltage, y = Current)) +
geom_point()
```
We specified 3 things to create this plot:
1. The **data** -- needs to be a data frame (or a tibble)
2. The type of plot -- this is called a **geom** in ggplot2 terminology
3. The mapping of variables (columns in our table) to visual properties of objects in the plot - these are called **aesthetics** in ggplot2 and specified using the **`aes`** argument
In this case, the type of plot is a **`geom_point`**, ggplot2's function for a
scatter plot, and the voltage and current density are mapped to the **`x`** and
**`y`** coordinates.
## Aesthetics
Look up the `geom_point` help page to see the list of aesthetics that are
available for this geom.
```{r eval = FALSE}
?geom_point
```
We have 3 distinct groups and we can help to distinguish these in the plot by
using another of the aesthetics: colour.
```{r}
ggplot(data = summary, aes(x = Voltage, y = Current, colour = Group)) +
geom_point()
```
## Line plot
Plots in ggplot2 are built up in layers. We'll add another _geom_ for connecting
our points to create a line plot.
```{r}
ggplot(data = summary, aes(x = Voltage, y = Current, colour = Group)) +
geom_point() +
geom_line()
```
## Error bars
This now looks a lot more like figure 2b in the Nature Genetics paper, but in
that figure error bars on each of the points are displayed. We'll add another
layer to our plot using **`geom_errorbar`**. This requires another two aesthetic
mappings for the lower and upper bounds representing the uncertainty of each
current density estimate.
```{r}
ggplot(summary, aes(x = Voltage, y = Current, ymin = Current - SE, ymax = Current + SE, colour = Group)) +
geom_point() +
geom_line() +
geom_errorbar(width = 2)
```
The other thing we did here was to adjust the width of the error bars as the
default setting was not aesthetically pleasing.
Each of the _geoms_ has specific options for changing its appearance. The help
pages are a good place to start to learn about how each works and have lots of
example code snippets that you can easily run to understand better how they
work.
## Customizing plots
You will almost certainly want to customize the appearance of your plots and
**ggplot2** provides lots of options to do so including the option of specifying
a theme.
We'll apply one of these themes, change the order in which the layers (geoms)
are drawn, specify some colours and change the axis labels to make our plot look
a bit more polished.
```{r}
ggplot(summary, aes(x = Voltage, y = Current, ymin = Current - SE, ymax = Current + SE, colour = Group)) +
geom_errorbar(width = 2, colour = "dimgrey") +
geom_line() +
geom_point(size = 2) +
labs(x = "Voltage (mV)", y = "Current density (pA/pF)", colour = "") +
scale_color_manual(values = c("dimgrey", "firebrick", "steelblue")) +
scale_x_continuous(breaks = c(-80, -60, -40, -20, 0, 20, 40, 60, 80)) +
scale_y_continuous(breaks = c(-40, -20, 0, 20, 40)) +
theme_minimal() +
theme(
panel.grid.major.x = element_blank(),
panel.grid.minor = element_blank()
)
```
# Statistical tests
The analysis of these data in the Nature Genetics paper also includes some
statistical tests. For example, the authors tested for a statistically
significant difference in the current response at a voltage step of +80mV
between the _Nalcn^shDNA^_ and control groups.
We'll first filter the values we need for this comparison.
```{r}
test_data <- filter(current_responses, voltage == 80, group %in% c("Control", "Nalcn_shRNA"))
```
Normally we'd visualize our data before applying a statistical test to satisfy
ourselves that the assumptions of the test hold. We'll create a box plot using
another of the geoms available in ggplot2, **`geom_boxplot`**.
```{r}
ggplot(data = test_data, aes(x = group, y = current)) +
geom_boxplot()
```
We only have 5 observations for each group in this case so the box plot is
perhaps not that useful for looking at the distribution of the data. We'll use
**`geom_jitter`** to show the points as well.
```{r}
ggplot(data = test_data, aes(x = group, y = current)) +
geom_boxplot(width = 0.4, outlier.shape = NA) +
geom_jitter(width = 0.1, size = 2) +
labs(x = "", y = "Current density (pA/pF)") +
theme_minimal() +
theme(panel.grid.major.x = element_blank())
```
Finally we'll run a non-parametric Wilcoxon test, also known as the Mann-Whitney
U test.
```{r}
wilcox.test(current ~ group, data = test_data)
```