---
title: 'Day 3: Beginner''s statistics in R'
author: "Laurent Gatto and Meena Choi"
output:
html_document:
self_contained: true
toc: true
toc_float: true
fig_caption: no
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Objectives
- Statistical hypothesis testing: t-test
- Sample size calculation
- Analysis for categorical data
- Linear regression and correlation
---
# Part 4: Statistical hypothesis test
First, we are going to prepare the session for further analyses.
```{r}
load("./data/summaryresults.rda")
load("./data/iprg.rda")
```
## Two sample t-test for one protein with one feature
Now, we'll perform a t-test to assess whether protein `sp|P44015|VAC2_YEAST`
shows a change in abundance between Condition 1 and Condition 2.
### Hypothesis
* $H_0$: no change in abundance, mean(Condition1) - mean(Condition2) = 0
* $H_a$: change in abundance, mean(Condition1) - mean(Condition2) $\neq$ 0
### Statistics
* Observed $t = \frac{\mbox{difference of group means}}{\mbox{estimate of variation}} = \frac{(mean_{1} - mean_{2})}{SE} \sim t_{df}$ under $H_0$ (to be compared to the critical value $t_{\alpha/2, df}$)
* Standard error, $SE=\sqrt{\frac{s_{1}^2}{n_{1}} + \frac{s_{2}^2}{n_{2}}}$
with
* $n_{i}$: number of replicates
* $s_{i}^2 = \frac{1}{n_{i}-1} \sum (Y_{ij} - \bar{Y_{i \cdot}})^2$: sample variance
### Data preparation
```{r}
## Let's start with one protein, named "sp|P44015|VAC2_YEAST"
oneproteindata <- iprg[iprg$Protein == "sp|P44015|VAC2_YEAST", ]
## Then, get two conditions only, because t.test only works for two groups (conditions).
oneproteindata.condition12 <- oneproteindata[oneproteindata$Condition %in%
                                             c('Condition1', 'Condition2'), ]
oneproteindata.condition12
table(oneproteindata.condition12[, c("Condition", "BioReplicate")])
```
If we want to remove the levels that are not relevant anymore, we can
use `droplevels`:
```{r}
oneproteindata.condition12 <- droplevels(oneproteindata.condition12)
table(oneproteindata.condition12[, c("Condition", "BioReplicate")])
```
To perform the t-test, we use the `t.test` function. Let's first
familiarise ourselves with it by looking at the manual:
```{r, eval=FALSE}
?t.test
```
And now apply it to our data:
```{r}
# t test for different abundance (log2Int) between Groups (Condition)
result <- t.test(Log2Intensity ~ Condition,
                 data = oneproteindata.condition12,
                 var.equal = FALSE)
result
```
> **Challenge**
>
> Repeat the t-test above but with calculating a 90% confidence interval
> for the log2 fold change.
```{r, eval=FALSE, echo=FALSE}
result.ci90 <- t.test(Log2Intensity ~ Condition,
                      var.equal = FALSE,
                      data = oneproteindata.condition12,
                      conf.level = 0.9)
result.ci90
```
### The `htest` class
The `t.test` function, like other hypothesis testing functions, returns
a result of a type we haven't encountered yet, the `htest` class:
```{r}
class(result)
```
which stores typical results from such tests. Let's have a more
detailed look at what information we can learn from the results of our
t-test. When we type the name of our `result` object, we get a short
textual summary, but the object contains more details:
```{r}
names(result)
```
and we can access each of these by using the `$` operator, like we
used to access a single column from a `data.frame`, but the `htest`
class is not a `data.frame` (it's actually a `list`). For example, to
access the group means, we would use
```{r}
result$estimate
```
> **Challenge**
>
> * Calculate the (log2-transformed) fold change between groups
> * Extract the value of the t-statistics
> * Calculate the standard error (fold-change/t-statistics)
> * Extract the degrees of freedom (parameter)
> * Extract the p values
> * Extract the 95% confidence intervals
```{r, echo=FALSE, include=FALSE}
## log2 fold-change
result$estimate[1]-result$estimate[2]
## test statistic value, T value
result$statistic
## standard error
(result$estimate[1]-result$estimate[2])/result$statistic
## degrees of freedom
result$parameter
## p value for two-sided testing
result$p.value
## 95% confidence interval for log2 fold change
result$conf.int
```
We can also manually compute our t-test statistic using the formulas
we described above and compare it with the `summaryresult`.
Recall the `summaryresult` we generated last section.
```{r}
summaryresult
summaryresult12 <- summaryresult[1:2, ]
## difference of means: opposite sign, but same absolute value as
## result$estimate[1] - result$estimate[2]
diff(summaryresult12$mean)
## standard error: same as computed by t.test above
sqrt(sum(summaryresult12$sd^2/summaryresult12$length))
## t-statistic: opposite sign, same absolute value as result$statistic
diff(summaryresult12$mean)/sqrt(sum(summaryresult12$sd^2/summaryresult12$length))
```
## Re-calculating the p values
See the previous
[*Working with statistical distributions*](https://htmlpreview.github.io/?https://github.com/MayInstitute/MayInstitute2017/blob/master/Program3_Intro%20stat%20in%20R/02-rstats.html#working_with_statistical_distributions)
for a reminder.
Referring back to our t-test results above, we can manually calculate
the one- and two-sided test p-values using the t-statistic and the
test parameter (with the `pt` function).
Our t statistic was `r as.vector(result$statistic)` (accessible
with `result$statistic`). Let's start by visualising it along a t
distribution, creating data from such a distribution and making sure
we set the appropriate parameter, the degrees of freedom.
```{r}
## generate 10^5 values from a t distribution with the same degrees of freedom
xt <- rt(1e5, result$parameter)
plot(density(xt), xlim = c(-10, 10))
abline(v = result$statistic, col = "red") ## where the t statistic is located
abline(h = 0, col = "gray") ## horizontal line at 0
```
**The area on the left** of that point is given by `pt(result$statistic, result$parameter)`, which is `r pt(result$statistic, result$parameter)`. The p-value for a one-sided test, i.e. **the area on the right** of the red line, is thus given by
```{r}
1 - pt(result$statistic, result$parameter)
```
And the p-value for a two-sided test is
```{r}
2 * (1 - pt(result$statistic, result$parameter))
```
which is the same as the one calculated by the t-test.
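As a quick sanity check (a minimal sketch, reusing the `result` object computed above), we can verify programmatically that our manual two-sided p-value matches the one stored in the `htest` object:
```{r}
## compare the manual two-sided p-value with the one computed by t.test
all.equal(as.numeric(2 * (1 - pt(result$statistic, result$parameter))),
          result$p.value)
```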
***
# Part 5a: Sample size calculation
To calculate the required sample size, you’ll need to know four
things:
* $\alpha$: significance level
* $power$: $1 - \beta$, where $\beta$ is the probability of a Type II error (the power is thus the probability of detecting a true effect)
* $\Delta$: anticipated fold change
* $\sigma$: anticipated standard deviation (variability)
## R code
Assuming equal variance and number of samples across groups, the
following formula is used for sample size estimation:
$$\frac{2{\sigma}^2}{n}\leq(\frac{\Delta}{z_{1-\beta}+z_{1-\alpha/2}})^2$$
```{r}
library("pwr")
## ?pwr.t.test
# Significance level alpha
alpha <- 0.05
# Power = 1 - beta
power <- 0.95
# anticipated log2 fold change
delta <- 1
# anticipated variability
sigma <- 0.9
# Effect size
# It quantifies the size of the difference between two groups
d <- delta/sigma
# Sample size estimation
pwr.t.test(d = d, sig.level = alpha, power = power, type = 'two.sample')
```
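As a cross-check, here is a minimal sketch of the closed-form normal approximation above, solved for $n$; it is expected to be slightly smaller than the `pwr.t.test` result, which uses the t rather than the normal distribution:
```{r}
## per-group sample size from the normal-approximation formula above
beta <- 1 - power
n.approx <- 2 * (sigma / delta)^2 * (qnorm(1 - beta) + qnorm(1 - alpha / 2))^2
ceiling(n.approx)
```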
> **Challenge**
>
> * Calculate power with 10 samples and the same parameters as above.
```{r, echo=FALSE, include=FALSE}
## power with n = 10 per group
pwr.t.test(d = d, sig.level = alpha, n=10, power = NULL, type = 'two.sample')
```
Let's investigate the effect of required fold change and variance on the sample size estimation.
```{r, warning=FALSE}
# anticipated log2 fold change
delta <- seq(0.1, 0.7, .1)
# anticipated variability
sigma <- seq(0.1,0.5,.1)
d <- expand.grid(delta = delta, sigma = sigma)
d$d <- d$delta / d$sigma
d$n <- sapply(d$d,
              function(.d) {
                  res <- pwr.t.test(d = .d, sig.level = 0.05, power = 0.95)
                  ceiling(res$n)
              })
d$sigma <- factor(d$sigma)
library("ggplot2")
p <- ggplot(data = d, aes(x = delta, y = n,
                          colour = sigma)) +
    geom_line() + geom_point()
print(p)
```
***
# Part 5b: Choosing a model
The decision of which statistical model is appropriate for a given set of observations depends on the type of data that have been collected.
* Quantitative response with quantitative predictors : regression model
* Categorical response with quantitative predictors : logistic regression model for a binary categorical response (e.g. Yes/No, dead/alive), multinomial logistic regression model when the response variable has more than two possible values.
* Quantitative response with categorical predictors : ANOVA model (quantitative response across several populations defined by one or more categorical predictor variables)
* Categorical response with categorical predictors : contingency table that can be used to draw conclusions about the relationships between variables.
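In base R, these four situations map onto familiar functions. A quick sketch (the data frame `d` and its variables are hypothetical placeholders, not part of our datasets):
```{r, eval=FALSE}
## Hypothetical sketch: d is a placeholder data frame
lm(y ~ x, data = d)                      ## quantitative response, quantitative predictor
glm(y ~ x, family = binomial, data = d)  ## binary categorical response, quantitative predictor
aov(y ~ group, data = d)                 ## quantitative response, categorical predictor
chisq.test(table(d$g1, d$g2))            ## categorical response, categorical predictor
```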
Part 5b is adapted from *Bremer & Doerge*,
[Using R at the Bench: Step-by-Step Data Analytics for Biologists](https://www.amazon.com/dp/1621821129/ref=olp_product_details?_encoding=UTF8&me=),
Cold Spring Harbor Laboratory Press, 2015.
***
# Part 5c: Analysis of categorical data
For this part, we are going to use a new dataset, which contains the
patient information from the TCGA colorectal cohort. The data come from
*Proteogenomic characterization of human colon and rectal cancer*
(Zhang et al. 2014).
Let's read in and explore the TCGA colorectal cancer data:
```{r}
TCGA.CRC <- read.csv("./data/TCGA_sample_information.csv")
head(TCGA.CRC)
```
Rows in the data frame are patients and columns are patient
information. The columns are defined as follows:
| Variable |
|---------------------|
| TCGA participant ID |
| Gender |
| Cancer type |
| BRAF mutation status |
| History of colon polyps |
Let's familiarise ourselves with the data first.
```{r}
## check the dimension of data
dim(TCGA.CRC)
dim(TCGA.CRC$Gender) # dim works for matrices, arrays and data frames; it returns NULL for a vector
## check unique information of Gender column.
unique(TCGA.CRC$Gender)
class(TCGA.CRC$Gender)
```
> **Challenge**
>
> * Get unique information and class for `Cancer` information
> * Get unique information and class for `BRAF.mutation` information
> * Get unique information and class for `history_of_colon_polyps` information
```{r, echo=FALSE, include=FALSE}
## check unique information of Cancer column.
unique(TCGA.CRC$Cancer)
class(TCGA.CRC$Cancer)
## check unique information of BRAF.mutation column.
unique(TCGA.CRC$BRAF.mutation)
class(TCGA.CRC$BRAF.mutation)
## check unique information of history_of_colon_polyps column.
unique(TCGA.CRC$history_of_colon_polyps)
class(TCGA.CRC$history_of_colon_polyps)
```
The `table` function generates a contingency table, i.e. a classification table with the frequency of each combination of outcomes.
```{r}
## check how many female and male are in the dataset
table(TCGA.CRC$Gender)
## check unique participant IDs
unique(TCGA.CRC$TCGA.participant.ID)
length(unique(TCGA.CRC$TCGA.participant.ID))
```
There are 74 unique participant IDs, but 78 rows in total, so we can suspect that some IDs appear in more than one row. Let's check by getting the count per participant ID:
```{r}
countID <- table(TCGA.CRC$TCGA.participant.ID)
countID
```
Some participant IDs have two rows. Let's get the subset with more than one row:
```{r}
unique(countID)
countID[countID > 1]
```
Four participants have two rows each. Let's check the rows for participant ID `TCGA-AA-A00A`:
```{r}
TCGA.CRC[TCGA.CRC$TCGA.participant.ID == 'TCGA-AA-A00A', ]
```
The two rows contain exactly the same information: they are duplicated. Let's remove them:
```{r}
TCGA.CRC <- TCGA.CRC[!duplicated(TCGA.CRC), ]
```
> **Challenge**
>
> * Check whether the dimensions of the data and the number of participant IDs changed after removing the duplicated rows.
```{r, echo=FALSE, include=FALSE}
## check the dimension of data
dim(TCGA.CRC)
## check unique participant IDs
unique(TCGA.CRC$TCGA.participant.ID)
length(unique(TCGA.CRC$TCGA.participant.ID))
countID <- table(TCGA.CRC$TCGA.participant.ID)
countID
unique(countID)
```
## Generate 2-way contingency tables
The `table` function can also compute a two-way contingency table, i.e. the frequency of outcomes for each combination of two categorical variables. Let's get the table and visualise the count data:
```{r}
cancer.polyps <- table(TCGA.CRC$history_of_colon_polyps, TCGA.CRC$Cancer)
cancer.polyps
dotchart(cancer.polyps, xlab = "Observed counts")
```
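It can also be useful to see the row and column totals alongside the counts; `addmargins` adds them to the table (a quick sketch):
```{r}
## add row and column totals to the contingency table
addmargins(cancer.polyps)
```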
## Comparison of two proportions
**Hypothesis in general** :
$H_0$ : each population has the same proportion of observations, $\pi_{j=1|i=1} = \pi_{j=1|i=2}$
$H_a$ : different populations have different proportions of observations.
**Hypothesis of interest in this example**: whether the proportion of patients with a history of colon polyps differs between the patients with colon cancer and the patients with rectal cancer.
```{r}
polyps <- cancer.polyps[2, ]
cancer <- apply(cancer.polyps, 2, sum)
pt <- prop.test(polyps, cancer)
pt
# names of the output elements
names(pt)
# proportion in each group
pt$estimate
# test statistic value
pt$statistic
# degrees of freedom
pt$parameter
```
## Test of independence
**Hypothesis in general** :
$H_0$ : the factors are independent.
$H_a$ : the factors are not independent.
**Hypothesis of interest in this example**: whether history of colon polyps and type of cancer are independent.
### Pearson Chi-squared test
$$\chi^2 =\sum_{i=1}^2 \sum_{j=1}^2 \frac{(O_{ij}-E_{ij})^2}{E_{ij}} \sim \chi^2_{(2-1)(2-1)}$$
$O_{ij}$ : $n_{ij}$, the observed count in cell $(i, j)$
$E_{ij}$ : $n_{i+}n_{+j}/n$, where $n_{i+}$ is the row sum, $n_{+j}$ is the column sum and $n$ is the total count.
**Assumption**: the distribution of the test statistic is approximately chi-squared only if the expected counts are large enough. Use this test only if the expected count in each cell is greater than 5.
```{r}
chisq.test(cancer.polyps)
```
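We can verify the expected-count assumption directly: the `chisq.test` return value stores the expected counts under independence (a quick check, reusing the table above):
```{r}
## expected counts E_ij; the approximation requires all to exceed 5
chisq.test(cancer.polyps)$expected
```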
Mathematically, the two tests above are equivalent: `prop.test()` uses `chisq.test()` internally and prints its output differently.
### Fisher's exact test
Fisher's exact test is a non-parametric test for independence. It
compares the distribution of counts within the 4 cells. The p-value of
Fisher's exact test is the probability of obtaining, by chance, data
more extreme than those observed when the row and column totals are
held fixed.
The test makes no assumptions about distributions or sample sizes, so
it can be used with small sample sizes. Note, however, that with very
small sample sizes the power of the test will be very low.
Apply Fisher's exact test using the `fisher.test` function
```{r}
ft <- fisher.test(cancer.polyps)
ft
```
and extract the odds ratio.
```{r}
ft$estimate
```
> **Challenge**
>
> * Test whether the proportion of male patients among the patients with colon cancer is different from that among the patients with rectal cancer.
```{r, echo=FALSE, include=FALSE}
cancer.gender <- table(TCGA.CRC$Gender, TCGA.CRC$Cancer)
cancer.gender
dotchart(cancer.gender, xlab = "Observed counts")
male <- cancer.gender[2, ]
cancer <- apply(cancer.gender, 2, sum)
pt <- prop.test(male, cancer)
pt
```
### Optional practice : Large-sample Z-test
We could also apply a z-test for the comparison of two proportions, defined as
$$Z=\frac{\widehat{p}_1-\widehat{p}_2}{\sqrt{\widehat{p} (1- \widehat{p}) (\frac{1}{n_1} + \frac{1}{n_2})}}$$
where $\widehat{p}_1 = \frac{x_1}{n_1}$, $\widehat{p}_2 = \frac{x_2}{n_2}$ and $\widehat{p}=\frac{x_1 + x_2}{n_1 + n_2}$.
We are going to use this test to illustrate how to write functions in
R.
An R function is created with the function constructor, named
`function`, and is composed of:
1. a name: we will call our function, which calculates the p-value,
`z.prop.p`;
2. some inputs: our inputs are the numbers of observations for the
outcome of interest, `x1` and `x2`, and the total numbers of
observations in each group, `n1` and `n2`;
3. a returned value (output): the computed p-value, named `pvalue`;
4. a body, i.e. the code that, given the inputs, calculates the output.
```{r}
z.prop.p <- function(x1, x2, n1, n2) {
    pi_1 <- x1/n1
    pi_2 <- x2/n2
    numerator <- pi_1 - pi_2
    p_common <- (x1 + x2)/(n1 + n2)
    denominator <- sqrt(p_common * (1 - p_common) * (1/n1 + 1/n2))
    stat <- numerator/denominator
    pvalue <- 2 * (1 - pnorm(abs(stat)))
    return(pvalue)
}
z.prop.p(cancer.polyps[2, 1], cancer.polyps[2, 2],
         sum(cancer.polyps[, 1]), sum(cancer.polyps[, 2]))
```
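As a sanity check (a sketch): the square of this z statistic equals the chi-squared statistic without continuity correction, so our p-value should match the one from `prop.test` when the correction is turned off:
```{r}
## same comparison without continuity correction; should match z.prop.p
prop.test(cancer.polyps[2, ], colSums(cancer.polyps),
          correct = FALSE)$p.value
```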
> **Challenge**
>
> Write a function named `f2c` (`c2f`) that converts a temperature
> from Fahrenheit to Celsius (Celsius to Fahrenheit) using the
> following formula $F = C \times 1.8 + 32$ ($C = \frac{F - 32}{1.8}$).
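One possible solution, as a minimal sketch:
```{r}
## convert Fahrenheit to Celsius and vice versa
f2c <- function(f) (f - 32) / 1.8
c2f <- function(c) c * 1.8 + 32
f2c(32)   ## freezing point of water: 0
c2f(100)  ## boiling point of water: 212
```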
***
# Part 6: Linear models and correlation
When considering correlations and modelling data, visualisation is key.
Let's use the famous
[*Anscombe's quartet*](https://en.wikipedia.org/wiki/Anscombe%27s_quartet)
data as a motivating example. These data are composed of 4 pairs of
variables, $(x_1, y_1)$ to $(x_4, y_4)$:
```{r anscombe, echo = FALSE, results='asis'}
library("knitr")
kable(anscombe)
```
Each of these $x$ and $y$ sets has the same variance, mean and
correlation:
```{r anscombetab, echo=FALSE}
tab <- matrix(NA, 5, 4)
colnames(tab) <- 1:4
rownames(tab) <- c("var(x)", "mean(x)",
                   "var(y)", "mean(y)",
                   "cor(x,y)")
for (i in 1:4)
    tab[, i] <- c(var(anscombe[, i]),
                  mean(anscombe[, i]),
                  var(anscombe[, i + 4]),
                  mean(anscombe[, i + 4]),
                  cor(anscombe[, i], anscombe[, i + 4]))
```
```{r anstabdisplay, echo=FALSE}
kable(tab)
```
But...
While the *residuals* of the linear regressions clearly indicate
fundamental differences in these data, the simplest and most
straightforward way to highlight those differences is *visualisation*.
```{r anscombefig, echo=FALSE}
ff <- y ~ x
mods <- setNames(as.list(1:4), paste0("lm", 1:4))
par(mfrow = c(2, 2), mar = c(4, 4, 1, 1))
for (i in 1:4) {
    ff[2:3] <- lapply(paste0(c("y", "x"), i), as.name)
    plot(ff, data = anscombe, pch = 19, xlim = c(3, 19), ylim = c(3, 13))
    mods[[i]] <- lm(ff, data = anscombe)
    abline(mods[[i]])
}
```
See also another, more recent example:
[The Datasaurus Dozen dataset](https://www.autodeskresearch.com/publications/samestats).
<details>
![The Datasaurus Dozen dataset](https://d2f99xq7vri1nk.cloudfront.net/DinoSequentialSmaller.gif)
</details>
## Correlation
Here is an example where a wide format comes in very handy. We are going
to convert our iPRG data using the `spread` function from the `tidyr`
package:
```{r}
library("tidyr")
iprg2 <- spread(iprg[, 1:3], Run, Log2Intensity)
rownames(iprg2) <- iprg2$Protein
iprg2 <- iprg2[, -1]
```
```{r, echo = FALSE}
if (!file.exists("./data/iprg2.rda"))
save(iprg2, file = "./data/iprg2.rda")
```
And let's focus on 3 of the runs, i.e. 2 replicates from condition 1
and 1 replicate from condition 4:
```{r}
x <- iprg2[, c(1, 2, 10)]
head(x)
pairs(x)
```
We can use the `cor` function to calculate the Pearson correlation
between two vectors of the same length (making sure the order is
correct), or between the columns of a data frame. But we have missing
values in the data, which propagate into the correlation values:
```{r}
cor(x)
```
We first need to omit the proteins/rows that contain missing values:
```{r}
x2 <- na.omit(x)
cor(x2)
```
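Alternatively, `cor`'s `use` argument can handle the missing values directly, without creating a filtered copy of the data; a quick sketch:
```{r}
## each pair of runs is correlated over the rows where both are observed
cor(x, use = "pairwise.complete.obs")
```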
### A note on correlation and replication
It is often assumed that high correlation is a hallmark of good
replication. Rather than focusing on the correlation of the data, a
better approach is to look at the log2 fold-changes, i.e. the
distances between our repeated measurements. The ideal way to visualise
these is an MA-plot:
```{r, fig.width = 12}
par(mfrow = c(1, 2))
r1 <- x2[, 1]
r2 <- x2[, 2]
M <- r1 - r2
A <- (r1 + r2)/2
plot(A, M); grid()
suppressPackageStartupMessages(library("affy"))
affy::ma.plot(A, M)
```
See also this
[post](http://simplystatistics.org/2015/08/12/correlation-is-not-a-measure-of-reproducibility/)
on the *Simply Statistics* blog.
## Linear modelling
`abline(0, 1)` can be used to add a line with intercept 0 and
slope 1. If we want to add the line that models the data linearly, we
can calculate its parameters using the `lm` function:
```{r}
lmod <- lm(r2 ~ r1)
summary(lmod)
```
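The estimated intercept and slope can be extracted from the fitted model with `coef`:
```{r}
## intercept and slope of the linear model
coef(lmod)
```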
The fitted model can then be used to add a line reflecting the
(linear) relationship between the two variables:
```{r}
plot(r1, r2)
abline(lmod, col = "red")
```
As we have seen at the beginning of this section, it is essential not
to rely solely on the correlation value, but to look at the data. This
also holds true for linear (or any) modelling, which can be checked by
plotting the model:
```{r}
par(mfrow = c(2, 2))
plot(lmod)
```
* *Cook's distance* is a commonly used estimate of the influence of a
data point when performing a least-squares regression analysis and
can be used to highlight points that particularly influence the
regression.
* *Leverage* quantifies the influence of a given observation on the
regression due to its location in the space of the inputs.
See also `?influence.measures`.
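Both quantities can be computed directly from the fitted model (a quick sketch using `lmod` from above):
```{r}
## Cook's distances and leverages (hat values) for each observation
head(cooks.distance(lmod))
head(hatvalues(lmod))
```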
> **Challenge**
>
> 1. Take any of the `iprg2` replicates, model and plot their linear
> relationship. The `iprg2` data is available as an `rda` file, or
> regenerate it as shown above.
> 2. The Anscombe quartet is available as `anscombe`. Load it, create
> a linear model for one $(x_i, y_i)$ pair of your choice and
> visualise/check the model.
```{r}
x3 <- anscombe[, 3]
y3 <- anscombe[, 7]
lmod <- lm(y3 ~ x3)
summary(lmod)
par(mfrow = c(2, 2))
plot(lmod)
```
Finally, let's conclude by illustrating how `ggplot2` can very
elegantly be used to produce similar plots, with useful annotations:
```{r, message=FALSE}
library("ggplot2")
dfr <- data.frame(r1, r2, M, A)
p <- ggplot(aes(x = r1, y = r2), data = dfr) + geom_point()
p + geom_smooth(method = "lm") +
geom_quantile(colour = "red")
```
> **Challenge**
>
> Replicate the MA plot above using `ggplot2`. Then add a
> non-parametric lowess regression using `geom_smooth()`.
```{r}
p <- ggplot(aes(x = A, y = M), data = dfr) + geom_point()
p + geom_smooth() + geom_quantile(colour = "red")
```
---
Back to course [home page](https://github.com/MayInstitute/MayInstitute2017/blob/master/Program3_Intro%20stat%20in%20R/README.md)