# R and RStudio Help {#app-rstudio}
```{r setup, echo = FALSE}
knitr::opts_chunk$set(warning = FALSE, error = TRUE, cache = TRUE)
```
R is a powerful, open-source statistical programming language used by both
professional and academic data scientists. It is among the computer languages
most suited to modern data science, and is growing rapidly in its user base and
available packages.
Some students may not feel comfortable working in a programming language like R or
a console-based application like RStudio, especially if they have used applications
primarily through a GUI.
This appendix provides a basic bootcamp for R and RStudio, but cannot be a
comprehensive manual on RStudio, and it certainly cannot be one for R. Good
places to get more detailed help include:
- R help manuals
- Stack Overflow
Some of the sections in this appendix are text-based, and some contain little
more than links to YouTube videos created by me or someone else.
## Installing
There are two pieces of software you should install:
- **R** [https://cran.r-project.org/](https://cran.r-project.org/): this contains
the system libraries necessary to run R commands in a terminal on your computer,
and contains a few additional helper applications. Install the most recent
stable release for your operating system.
- **RStudio** [https://rstudio.com/products/rstudio/download/](https://rstudio.com/products/rstudio/download/) is an integrated application that makes using R considerably easier
with text completion, file management, and some GUI features.
Both programs are available for Windows, macOS, and Linux. The videos and screenshots
of the application I post will use macOS; the R code for all systems is the same,
and the RStudio interface on all systems is very similar, with minor differences.
## RStudio Orientation
The video below gives a very basic introduction to RStudio.
There is also a very useful
[cheat sheet](https://resources.rstudio.com/rstudio-cheatsheets/rstudio-ide-cheat-sheet)
for working with RStudio on the RStudio website.
```{r orient-video, echo = FALSE}
knitr::include_url("https://www.youtube.com/embed/c3xv8wOIj-g")
```
## R Packages
One of the strengths of R is the ability for anyone to write packages. These
packages make it easier to read, manipulate, and visualize data; to estimate
statistical models; or to communicate results.
There are a number of ways to install additional packages. The most straightforward
is to use the `install.packages()` function in the console. The problems
in this book are solved with two additional packages^[`tidyverse` is actually a collection
of very useful packages, and many R users just load them all at once.]:
```{r neededpackages, eval = FALSE, echo=TRUE}
install.packages("tidyverse") # a suite of tools for data manipulation
install.packages("mlogit") # discrete choice modeling
```
RStudio also provides a graphical interface for installing and updating packages.
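If you rerun a setup script often, you may not want to reinstall packages that are already on your computer. The sketch below is one way to check first. Note that `missing_packages()` is a helper name made up for this illustration, not a function from any package, and the install step is guarded with `interactive()` only so that nothing is downloaded unless you run it yourself.

```r
# Return the subset of `pkgs` that is not yet installed.
# (Hypothetical helper written for this example.)
missing_packages <- function(pkgs) {
  # requireNamespace() returns TRUE if the package can be loaded
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse", "mlogit"))

# Only install when running interactively and something is actually missing
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Because `requireNamespace()` only checks whether a package can be loaded, this does not attach anything; you still need `library()` before using the functions.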
Sometimes you want to use a package that has not yet been pushed to CRAN, the
international repository of "approved" R packages. This may be because the package
is in development, or for one reason or another does not meet CRAN's standards
for completeness, etc. Oftentimes, the package has been made available on GitHub.
You can install a package directly from GitHub with the `remotes` package. One
package you will want for the problems in the book is the `nhts2017` package
on the BYU Transportation GitHub account. This package contains datasets from the 2017
[National Household Travel Survey](https://nhts.ornl.gov/).
```{r github-install, eval = FALSE, echo = TRUE}
install.packages("remotes") # tools for installing development packages
remotes::install_github("byu-transpolab/nhts2017")
```
```{r nhts-video, echo = FALSE}
knitr::include_url("https://www.youtube.com/embed/ULQAbpmPBhk")
```
You only need to **install** a package once on your computer. But every time you
want to **use** a function in a package, you need to load the package with the
`library()` function. To load the `tidyverse` packages, for instance,
```{r library-demo, echo = TRUE}
library(tidyverse)
```
If you get errors when you run the command above, it means that for some reason
you did not install the package correctly. And if you ever get an error like
```{r show-error, error = TRUE, echo = TRUE }
kable(tibble(x = 1:2, y = c("blue", "red")))
```
it often means you didn't load the library. In this case, the `kable()` function
to make pretty tables is part of the `knitr` package.
```{r show-kable, echo = TRUE}
library(knitr)
kable(tibble(x = 1:2, y = c("blue", "red")))
```
You can also use a function from a package without loading the library if you
use the `::` operator, like you did in the `remotes::install_github()` command
earlier. This is handy if you only want to use one function from a package, or
if you have two functions from different packages with the same name. For example,
when you loaded the `tidyverse` package, R told you that `dplyr::filter()` would
mask `stats::filter()`. So if for some reason you wanted to use the `filter` function
from the `stats` package, you would need to use `stats::filter()`.
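As a small illustration, the sketch below calls `stats::filter()` by its full name, so it behaves the same whether or not `tidyverse` is loaded. The vector `x` and the three-point moving average are assumptions chosen just for this example.

```r
x <- c(1, 2, 3, 4, 5)

# stats::filter() applies a linear filter to a numeric series; equal weights
# of 1/3 give a centered three-point moving average.
moving_avg <- stats::filter(x, rep(1 / 3, 3))

# The result is a time series object. The first and last values are NA
# because the averaging window runs off the ends of the data.
as.numeric(moving_avg)
```

This is a very different operation from `dplyr::filter()`, which keeps rows of a table that meet a condition, so it matters which of the two you are calling.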
## Working with Tables
Most data you will work with comes in a *tabular* form, meaning that the data
is formatted in columns of variables and rows of observations.
### Reading Data
Tabular data is often stored in a comma-separated values `.csv` file. To read a
data file like this in R, you can use the `read_csv()` function included in
`tidyverse`.
```{r show-read, echo = TRUE}
trips <- read_csv("data/demo_trips.csv")
print(trips)
```
This function will make a guess as to what the column types should be. Often
we want to keep ID values as characters, even if they are numeric (this preserves
leading `0` values, etc.). We can tell `read_csv()` what types we expect with
the `col_types` argument.
```{r show-readtypes, echo = TRUE}
trips <- read_csv("data/demo_trips.csv", col_types = list(houseid = col_character()))
print(trips)
```
You can also write tables back to `.csv` with the `write_csv()` command.
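A round trip might look like the sketch below. The table `demo` and its values are made up for illustration, and the file is written to a temporary location rather than to the `data/` folder used above. Reading `houseid` back as character keeps the leading zeros.

```r
library(readr)

# A made-up table with ID values that start with zeros
demo <- data.frame(
  houseid = c("00123", "00456"),
  trpmiles = c(2.5, 10)
)

# Write to a temporary file, then read it back
path <- tempfile(fileext = ".csv")
write_csv(demo, path)

# Ask for houseid as character; other columns are still guessed
read_back <- read_csv(path, col_types = cols(houseid = col_character()))
read_back$houseid
```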
### Modifying and Summarizing Tables
In much of this section, we will work with the `nhts_trips` dataset of trips
from the 2017 National Household Travel Survey in the `nhts2017` package you
installed from GitHub above.
```{r trips, echo = TRUE}
library(nhts2017)
trips <- nhts_trips
trips
```
#### Select, Filter, and Chains
This table is pretty overwhelming. But there are two functions that can help
us pare it down:
- `select()` lets you select columns in a table using the names of the columns.
- `filter()` lets you select rows in a table that meet a certain condition.
Let's practice this by paring our `trips` dataset down to just the ID
columns, the trip length, and the trip purpose.
```{r select, echo = TRUE}
select(trips, houseid, personid, trpmiles, trippurp)
```
Let's also practice filtering the `trips` dataset to only include trips
with the purpose `"HBW"` (home-based work). Notice how the number of rows
in the resulting table is much smaller.
```{r filter, echo = TRUE}
filter(trips, trippurp == "HBW") # use double equals as comparison
```
One ***extremely*** useful feature of the `tidyverse` functions is the chain
operator, `%>%`. This operator basically does the opposite of the assignment
operator `<-`. While assignment says "take the thing on the right and put it in
the thing on the left," chain says "take the thing on the left and pass it as
the first argument of the function on the right." What this means in practice is
we can chain R commands together. So we can do the `select` *and* the `filter`
statements in sequence,
```{r chain, echo = TRUE}
trips %>%
  select(houseid, personid, trpmiles, trippurp) %>%
  filter(trippurp == "HBW")
```
Notice that we didn't have to tell the `select` and `filter` functions the
name of the table we were selecting or filtering. The `%>%` chain operator did
that for us.
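To see that the chain is only a different way of writing nested function calls, the sketch below builds the same result both ways. The `toy_trips` table is made up for this example so that it runs without the NHTS data.

```r
library(dplyr)

# A made-up table standing in for the much larger trips data
toy_trips <- tibble(
  houseid = c("1", "1", "2"),
  trpmiles = c(3, 12, 7),
  trippurp = c("HBW", "HBO", "HBW")
)

# Nested form: read from the inside out
nested <- filter(select(toy_trips, houseid, trpmiles, trippurp), trippurp == "HBW")

# Chained form: read from top to bottom
piped <- toy_trips %>%
  select(houseid, trpmiles, trippurp) %>%
  filter(trippurp == "HBW")

# Compare the two results
all.equal(nested, piped)
```

The chained form tends to be easier to read once there are more than two steps, because the order on the page matches the order in which the steps run.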
Once we have the table we want, we can assign it to a new object called `mytrips`.
In this case, let's get `HBO` and `HBW` trips.
```{r mytrips, echo = TRUE}
mytrips <- trips %>%
  select(houseid, personid, trpmiles, trippurp) %>%
  filter(trippurp %in% c("HBW", "HBO")) # use %in% for multiple comparisons.
```
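The reason to use `%in%` rather than `==` when comparing against several values is easiest to see on a short vector. The `purposes` vector below is made up for this illustration.

```r
purposes <- c("HBW", "HBO", "NHB", "HBW")

# %in% asks, for each element, "is this one of the listed values?"
in_set <- purposes %in% c("HBW", "HBO")
in_set
# TRUE TRUE FALSE TRUE

# == recycles the shorter vector and compares position by position:
# "HBW"=="HBW", "HBO"=="HBO", "NHB"=="HBW", "HBW"=="HBO"
by_position <- purposes == c("HBW", "HBO")
by_position
# TRUE TRUE FALSE FALSE
```

The last element is a home-based work trip, but `==` misses it and gives no warning, which is why `%in%` is the safer choice inside `filter()`.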
#### Mutate, Summarize, and Group {#app-mutate}
Sometimes we want to calculate a new column in a table, or recompute an
existing column. We can do that with the `mutate` function, and we can
put more than one calculation in a single `mutate` statement.
```{r mutate, echo = TRUE}
mytrips %>%
  mutate(
    tripkm = trpmiles * 1.60934, # convert miles to km.
    longtrip = tripkm > 50       # is trip longer than 50 km?
  )
```
Other times we want to calculate summary statistics like means.
For this we can use the `summarize()` function.
```{r summarize, echo = TRUE}
mytrips %>%
  summarize(
    mean_trip = mean(trpmiles),
    sd_trip = sd(trpmiles),
    max_trip = max(trpmiles),
    min_trip = min(trpmiles)
  )
```
Finally, we sometimes want to calculate summary statistics for different groups.
We can tell `tidyverse` to group our tables with the `group_by()` function.
```{r groupby, echo = TRUE}
mytrips %>%
  group_by(trippurp) %>%
  summarize(
    mean_trip = mean(trpmiles),
    sd_trip = sd(trpmiles),
    max_trip = max(trpmiles),
    min_trip = min(trpmiles)
  )
```
> As you might expect, work trips are on average longer than other kinds of trips.
But some people report very long trips! You might want to filter your data more
carefully for real analyses.
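One kind of more careful filtering is sketched below on a made-up table small enough to check by hand. It assumes, as with the `-9` values mentioned later in this appendix, that a negative trip length is a missing-data code rather than a real distance, and drops those rows before summarizing.

```r
library(dplyr)

# Made-up trips; the -9 stands in for a missing-data code
toy_trips <- tibble(
  trippurp = c("HBW", "HBW", "HBW", "HBO", "HBO", "HBO"),
  trpmiles = c(10, 20, -9, 2, 4, 6)
)

by_purpose <- toy_trips %>%
  filter(trpmiles >= 0) %>% # drop negative missing-data codes (assumption)
  group_by(trippurp) %>%
  summarize(
    n_trips = n(),          # number of trips left in each group
    mean_trip = mean(trpmiles)
  )

by_purpose
```

Without the `filter()` step, the `-9` would be averaged in as if it were a real distance and pull the mean for `HBW` down.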
## Graphics with `ggplot2`
The `ggplot2` package included in the `tidyverse` is a very powerful graphics
engine with a relatively easy-to-learn grammar. In fact, the `gg` stands for
"grammar of graphics" as it implements the grammar defined by
@wilkinson2012grammar.
The basic structure of a `ggplot2` call is constructed as follows:
```{r ggplot2-example, echo = TRUE, eval = FALSE}
ggplot(data, aes(data aesthetics like x and y coordinates, fill color, etc.)) +
geom_(geometry style like point, bar, or histogram) +
other things like theme, color, and labels
```
For instance, we can create a histogram of trip lengths in the NHTS by giving
the `x` aesthetic as the `trpmiles` column in the `mytrips` dataset.
```{r ggplot2-histogram, echo = TRUE, warning = FALSE}
ggplot(mytrips, aes(x = trpmiles)) +
  geom_histogram()
```
This ends up not being very informative because some trips are very long. We
could filter out the long trips within the data argument (note that we still have
the `-9` values from the missing information).
```{r ggplot2-histogram1, echo = TRUE, warning = FALSE}
ggplot(mytrips %>% filter(trpmiles < 50), aes(x = trpmiles)) +
  geom_histogram()
```
If we wanted to see the difference between lengths of different trip purposes,
we could map the trip purpose to the `fill` aesthetic. By default this stacks
the two categories on top of each other.
```{r ggplot2-histogram2, echo = TRUE, warning = FALSE}
ggplot(mytrips %>% filter(trpmiles < 50), aes(x = trpmiles, fill = factor(trippurp))) +
  geom_histogram()
```
You could also show this with a statistical density (the integral of a density
function is 1). Note that the `alpha` statement for fill opacity is not included
as an aesthetic, because it doesn't vary based on any data elements in the way that
the `x` and `fill` variables do.
```{r ggplot2-histogram3, echo = TRUE}
ggplot(mytrips %>% filter(trpmiles < 50), aes(x = trpmiles, fill = factor(trippurp))) +
  geom_density(alpha = 0.5)
```
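The difference between mapping an aesthetic and setting a fixed value can be tried out on a tiny table. The sketch below uses made-up data, so it runs without the NHTS package; only the placement of `fill` and `alpha` matters here.

```r
library(ggplot2)

# Made-up trip lengths for two purposes
toy_trips <- data.frame(
  trpmiles = c(1, 2, 3, 10, 12, 15),
  trippurp = c("HBO", "HBO", "HBO", "HBW", "HBW", "HBW")
)

# fill is mapped inside aes() because it varies with a column of the data;
# alpha is set outside aes() because it is one fixed value for every group.
p <- ggplot(toy_trips, aes(x = trpmiles, fill = trippurp)) +
  geom_density(alpha = 0.5)

print(p)
```

If `alpha = 0.5` were moved inside `aes()`, `ggplot2` would treat it as data to be mapped and add it to the legend, which is usually not what you want.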
`ggplot2` also excels at building statistical analysis on top of visualization.
For example, we can plot the odometer reading against model year for vehicles
still on the road in 2017, colored by make.
```{r ggplot2-vehicles, echo = TRUE}
set.seed(15) # so that we pull the same random records each time
# sample 15k vehicles built after 1980 with 0 to 500k miles
vehicles <- nhts_vehicles %>%
  # convert numeric make to its labeled name, and then group into manufacturers
  mutate(
    make = as_factor(make, levels = "labels"),
    vehtype = as_factor(vehtype, levels = "labels"),
    make = case_when(
      make %in% c("Toyota", "Lexus", "Subaru") ~ "Toyota",
      make %in% c("Ford", "Lincoln", "Mercury") ~ "Ford",
      make %in% c("Chevrolet", "GMC", "Pontiac", "Buick", "Cadillac", "Saturn") ~ "GM",
      make %in% c("Volkswagen", "Audi", "Porsche") ~ "VW",
      grepl("Jeep", make) | grepl("Chrysler", make) | make %in% c("Ram", "Dodge", "Plymouth") ~ "Chrysler",
      make %in% c("Honda", "Acura") ~ "Honda",
      make %in% c("Nissan/Datsun", "Infiniti") ~ "Nissan",
      TRUE ~ "Other" # all other makes
    ),
    vehtype = case_when(
      grepl("Car", vehtype) ~ "Car",
      grepl("Van", vehtype) ~ "Van",
      grepl("SUV", vehtype) ~ "SUV",
      grepl("Pickup", vehtype) ~ "Pickup",
      TRUE ~ "Other"
    )
  ) %>%
  filter(vehtype != "Other") %>%
  filter(vehyear > 1980) %>%
  filter(od_read > 0, od_read < 500000) %>%
  sample_n(15000)

ggplot(vehicles, aes(x = vehyear, y = od_read, color = make)) +
  geom_point()
```
This is pretty unreadable. But we can add a few things to the figure to make it a little bit
easier to understand, like smooth average lines and point transparency.
```{r ggplot2-vehicles1, echo = TRUE}
ggplot(vehicles, aes(x = vehyear, y = od_read, color = make)) +
  geom_point(alpha = 0.5) +
  stat_smooth(method = "loess")
```
Let's break this out by vehicle type.
```{r ggplot2-vehicles2, echo = TRUE}
ggplot(vehicles, aes(x = vehyear, y = od_read, color = make)) +
  geom_point(alpha = 0.5) +
  stat_smooth(method = "loess") +
  facet_wrap(~vehtype)
```
And let's clean it up a little bit. This is a figure that you could put in a
published journal article or thesis, if it showed something you cared to show.
```{r ggplot2-vehicles3, echo = TRUE}
ggplot(vehicles, aes(x = vehyear, y = od_read, color = make)) +
  geom_point(alpha = 0.5) +
  scale_color_discrete("Manufacturer") +
  stat_smooth(method = "loess") +
  facet_wrap(~vehtype) +
  xlab("Vehicle Model Year") + ylab("Odometer Reading") +
  theme_bw()
```