---
title: "Introduction to R Workshop"
output:
  html_document:
    df_print: paged
    toc: TRUE
---
# Workshop Overview
## Basic R
- `dataframe`: the most common type of data structure in R. Essentially a spreadsheet where each column is a variable and each row is an observed unit. A `tibble` is a slightly fancier dataframe used by the tidyverse.
- `?`: using the `?` operator with a function name will bring up the help page for that function.
- `<-`: assignment operator. Assigns a value to a variable so the variable can be used later.
- `%>%`: pipe operator. Takes the output of one operation and inserts it as the input to the next operation. (A short sketch of these operators follows this list.)
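A minimal sketch of these operators in action (assuming the `tidyverse`, which provides `%>%`, has been loaded; not evaluated here):
```{r operator sketch, eval=FALSE}
# `<-` assigns a value to a variable
x <- c(2, 4, 6)

# `?` brings up the help page for a function
?mean

# `%>%` pipes the output of one call into the next
x %>% mean() # equivalent to mean(x)
```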
## Packages
While a number of functions are available in base R, the beauty of R is that there are a huge number of publicly available, open-source packages to make your data analysis life easier. R packages are generally housed and maintained on the [CRAN](https://cran.r-project.org/) website; however, a number of packages can be installed from GitHub and other hosting sites as well.
### Installation
The packages we will use in this workshop are `rstatix` and the `tidyverse` suite of packages. If you have not installed these previously, they can be installed using the following command:
```{r install packages, eval=FALSE}
install.packages(c('rstatix','tidyverse'))
```
The package installation wizard in RStudio can also be used by clicking the Packages tab in the bottom-right panel and then clicking Install in the panel's control bar.
### Loading Packages
Installed packages are not automatically available in every session. To directly access a package's functions, the package needs to be loaded into the environment using the `library` command.
When working with the tidyverse, it's almost always best to load `tidyverse` last so that its functions are not masked by identically named functions from other packages.
```{r load packages, results='hide', message=FALSE}
library(rstatix)
library(tidyverse)
```
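If a name conflict does occur, any function can be called with its full namespace using `::`. For example, once the tidyverse is loaded, `filter` refers to `dplyr::filter`, but base R's time-series `stats::filter` remains reachable explicitly (a quick sketch using the built-in `mtcars` data; not evaluated here):
```{r namespace sketch, eval=FALSE}
dplyr::filter(mtcars, cyl == 6)  # tidyverse row filtering
stats::filter(1:10, rep(1/3, 3)) # base R moving-average filter
```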
## Loading Data
For the workshop, the data we will be using are stored in basic CSV files. We can use the `read_csv` function to load these data and store them in variables. CSV files can either be stored locally or referenced by a URL.
```{r load data, results='hide', message=FALSE}
# read in tuition cost over time
cost_attendance <- readr::read_csv('https://raw.githubusercontent.com/mdefende/2022_Intro_to_R_Workshop/main/cleaned_data/cost_of_attendance.csv')
# read in graduation rate data over time
grad_rate <- readr::read_csv('https://raw.githubusercontent.com/mdefende/2022_Intro_to_R_Workshop/main/cleaned_data/graduation_rate.csv')
```
```{r, echo=TRUE}
print(cost_attendance)
```
These data look at how tuition and the total cost of attending college in the United States have changed over time. Graduation rate data are also included for investigating possible correlations between graduation rate and tuition cost. All data come from [TuitionTracker.org](https://www.tuitiontracker.org). The variables in each table are explained below:
### Tuition Cost
- `name` (character): The whole name of the college
- `state_code` (character): The two letter abbreviation for the state
- `control` (character): Is the college Private, Public, or For-Profit
- `level` (character): Ideal amount of time to receive the main degree type of the college. Either 2 Year or 4 Year
- `cost_type` (character): What kind of cost is being presented in that row. Either Tuition (plus associated fees) or Total. Total includes tuition as well as costs for a meal plan, textbooks, housing, etc. that are listed on a college's estimated costs.
- `in_state` (character): Are the costs in the row for in-state or out-of-state students.
- `AY2009` through `AY2017` (numeric): Cost to attend the college for whole academic years 2009 through 2017. Academic years begin in the fall of the year listed and extend to spring of the following year. For example, AY2009 shows cost to attend in both Fall 2009 and Spring 2010 combined.
### Graduation Rate
- `unit_id` (numeric): a unique identification code given to each college on the [Tuition Tracker](https://www.tuitiontracker.org) site.
- `name` and `state_code`: the same as above
- `AY2011` through `AY2016`: the same as above
## Basic Look at Data
While we have a data dictionary that tells us what each variable is in the data frame, it's always good to get a broad look at how the data are structured after they've been loaded. To do this, we can use the `glimpse` function:
```{r glimpse}
glimpse(cost_attendance)
```
## Cost of Tuition in Alabama
Using the datasets available, there are a few questions we can ask about how tuition has changed over time, especially as it relates to the location, economic model, and main degree type of the college.
### Summary Statistics
**1. In Alabama, what was the average cost of tuition for each type of college (private, public, for-profit) in 2017-2018?**
Here, we will use three functions from `dplyr`, the main data wrangling package in the `tidyverse`. We will be using:
- `filter` to keep data that matches specific criteria
- `group_by` with `summarize` to calculate some summary statistics within specific groups
```{r average AL tuition by group no pipes}
# Filter the main data, keeping those with 'AL' state_codes, a 'total' cost_type, and 'in'-state costs.
cost_AL <- filter(cost_attendance, state_code == 'AL', cost_type == 'total', in_state == 'in')
# group by type of institutional control so that any summary statistics will be split across Public, Private, and For Profit Universities.
cost_AL_grouped <- group_by(cost_AL,control)
# calculate mean, standard deviation, and number of institutions within the groups defined above
summarize(cost_AL_grouped, AY2017_mean = mean(AY2017), AY2017_sd = sd(AY2017), n = n())
```
This was a fairly simple task: we just wanted to see the average in-state cost for college in Alabama in 2017-2018. However, it took three separate commands and the creation of two extra variables to get there, which isn't very efficient. Instead, we can use the `%>%` operator, part of the `magrittr` package and the `tidyverse`. It takes the output of the command immediately preceding it and inserts that output as the input for the command immediately following it.
So we can write the previous block of code like this instead:
```{r average AL tuition by group with pipes}
cost_attendance %>%
filter(state_code == 'AL', cost_type == 'total', in_state == 'in') %>%
group_by(control) %>%
summarize(AY2017_mean = mean(AY2017),
AY2017_sd = sd(AY2017),
n = n())
```
In this case, we were able to start with the main dataframe, perform our filtering, group by control type, and calculate our summary statistics without needing to create extra variables or execute three separate commands individually.
Overall, it looks like the average cost of attending a public college in Alabama in AY 2017 was a little over \$20,000, while private colleges cost more than 1.5x as much. The deviation in total cost is also twice as large for private colleges in Alabama as for public colleges.
### Density Plots
Means and standard deviations only tell us so much, though. We can also look at the distribution using `ggplot`, the premier plotting tool in the `tidyverse`.
`ggplot` works by assigning variables from a dataframe to aesthetics that make up a plot. For instance, to make a simple density plot from the Alabama tuition data we used earlier:
```{r basic ggplot}
# Density plot of total cost of all colleges in Alabama in AY2017
cost_AL %>%
ggplot(aes(x = AY2017)) +
geom_density()
```
So most of the tuition amounts seem to be concentrated around \$20,000 per year, with some in the \$30,000+ range as well. However, this graph doesn't give us a good idea of which type of institution makes up each part of the density curve. We can use a different aesthetic, `color`, to split the data by group:
```{r AL density color}
# Density plot split by public vs. private institution in Alabama
cost_AL %>%
ggplot(aes(x = AY2017, color = control)) +
geom_density()
```
We can see in this graph that there's a fairly even distribution of cost in AY2017 for private institutions, compared to public colleges, which have a noticeable peak at \$20,000 and a steep drop-off afterwards.
#### Practice: Basic Summary Statistics and Density Plots
1. What was the median total cost to attend college in NY for in-state students in AY2017? Calculate by control type and include the number of colleges in each group in the output.
2. Graph a density plot showing the distribution of total costs for AY2017. Split the graphs by control type.
```{r NY stats and graphs, include = FALSE}
cost_NY <- cost_attendance %>%
filter(state_code == 'NY', cost_type == 'total', in_state == 'in')
cost_NY %>%
group_by(control) %>%
  summarize(AY2017_median = median(AY2017),
            n = n())
cost_NY %>%
ggplot(aes(x = AY2017, color = control)) +
geom_density()
```
### Pivoting, Mutating, and Change in Tuition Over Time
It's informative to see how total costs to attend college differ based on a college's economic model; however, a cross-sectional analysis can only be so informative. Next, we want to see how total costs for college have changed over time from 2009 to 2017.
In order to do this, we will need to do some transformation of our data structure. Ideally, we want a line-and-dot graph with academic year on the X axis and cost on the Y axis. ggplot can only assign column variables to aesthetics, so an X axis containing all of the academic year information isn't possible without changing how the data are structured. To fix this, we are going to pivot our data into a longer form using the `pivot_longer` function. This will take all of the academic year columns and essentially stack them into two columns, one with the year designation and the other with the cost for that year for that college. To see information on how to use the `pivot_longer` function (or any other function), use the command `?pivot_longer`.
Let's continue to use the AL state information we've been using so far:
```{r pivot AL}
cost_AL_long <- cost_AL %>%
pivot_longer(cols = contains('AY'),
names_to = 'acad_year',
values_to = 'cost')
head(cost_AL_long,7)
```
Now we have a single variable for academic year and a single variable for the cost, as opposed to spreading that cost over multiple academic year variables like we had previously. Now, we can set our X axis to be the academic year and our Y axis to be the cost.
```{r AL cost dot plot}
cost_AL_long %>%
ggplot(aes(x = acad_year, y = cost)) +
geom_point()
```
As we can see, there's a large spread of costs within each year, but there's a general increasing trend across years (as anyone who has paid for college recently can attest). As with the density plot, though, we want to be able to differentiate by type of college. We can use the `color` aesthetic just like before.
```{r AL cost dot plot split control}
cost_AL_long %>%
ggplot(aes(x = acad_year, y = cost, color = control)) +
geom_point(position = position_dodge(width = 0.5))
```
To better visually separate the private colleges from the public colleges, we added a `position` option to the point geom to offset the location of the points. The `position_dodge` function dodges based on the most fine-grained grouping specified in the aesthetics for the given geom. If we had specified the `name` variable as one of our grouping aesthetics, `position_dodge` would have dodged the points based on that instead; a sketch of that alternative follows.
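For illustration, here is what that alternative would look like, dodging by individual college rather than by control (not evaluated here, since one point per college per year gets crowded):
```{r AL dodge by name sketch, eval=FALSE}
cost_AL_long %>%
  ggplot(aes(x = acad_year, y = cost, color = control, group = name)) +
  geom_point(position = position_dodge(width = 0.5))
```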
#### Practice: Plotting Over Time
1. Graph a scatterplot for total cost of tuition in NY colleges over each year. Separate the groups of points by the control variable.
```{r cost NY over time}
cost_NY_long <- cost_NY %>%
pivot_longer(contains('AY'),names_to = 'acad_year', values_to = 'cost')
cost_NY_long %>%
ggplot(aes(x = acad_year, y = cost, color = control)) +
geom_point(position = position_dodge(width = 0.8))
```
### Calculating Non-Tuition Costs
In our dataset, we have data on both total costs and tuition-only costs. Until now, we have only been concerned with the total costs; however, tuition and extraneous costs such as housing and meal plans may be contributing differently to the overall rise in the cost of college. Calculating these extraneous costs is simple formula-wise: total cost minus tuition.
To calculate these in R, we will use the `mutate` function, which adds another column to the data. The new column can be set either to a single repeated value (`'A'`, `12`, `TRUE`, or any other single value) or to a function of other columns in the dataset.
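As a minimal sketch of both forms, reusing the `cost_AL` dataframe from earlier (not evaluated here):
```{r mutate sketch, eval=FALSE}
head(cost_AL) %>%
  mutate(country = 'USA',                # a single repeated value
         cost_change = AY2017 - AY2009)  # a function of other columns
```
Before we use `mutate` on our real data, let's take a quick look at our main dataframe again: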
```{r}
head(cost_attendance)
```
In it, we still have our costs per year divided into multiple columns, but `cost_type` is in a single column. To use `mutate` effectively here, we first need to pivot the year and cost information into two columns like we did previously, then perform the opposite pivot using `pivot_wider` to put the total cost and the tuition-only cost into separate columns. For this and future exercises, we will also go ahead and filter out out-of-state costs, since tuition costs are only listed for in-state students.
```{r pivoting main table}
# filter out the out-of-state costs and store only the in-state costs in a separate dataframe for later. Can also remove the in_state variable since it only has a single unique value in it now
cost_in <- cost_attendance %>%
filter(in_state == 'in') %>%
select(-in_state)
# pivot_longer just like previous exercises to get academic year and cost into individual columns
cost_in_long <- cost_in %>%
pivot_longer(contains('AY'), names_to = 'acad_year', values_to = 'cost')
# use pivot_wider to separate total and tuition costs into separate columns
cost_in_wide <- cost_in_long %>%
pivot_wider(names_from = cost_type, values_from = cost)
head(cost_in_wide)
```
Now we have academic year information in a single column and total and tuition costs in separate columns. At this point, we can easily mutate our dataframe to add a column for extraneous costs set to the difference between total and tuition costs.
```{r add extraneous costs}
cost_in_wide <- cost_in_wide %>%
mutate(extraneous = total - tuition)
head(cost_in_wide)
```
Let's see what the average extraneous costs are for schools in Alabama, broken down by academic year and type of college.
```{r extraneous stats}
instate_extraneous_stats_AL <- cost_in_wide %>%
filter(state_code == 'AL') %>%
group_by(control,acad_year) %>%
summarize(mean_ext = mean(extraneous),
sem_ext = sd(extraneous)/sqrt(n()),
n = n())
head(instate_extraneous_stats_AL, n = Inf)
```
Now we can take these descriptive statistics and plot them across group and year, this time adding error bars for the standard error of the mean we calculated previously:
```{r}
instate_extraneous_stats_AL %>%
ggplot(aes(x = acad_year, y = mean_ext, color = control)) +
geom_point(position = position_dodge(width = 0.5)) +
geom_errorbar(aes(ymin = mean_ext-sem_ext, ymax = mean_ext+sem_ext),position = position_dodge(width = 0.5), width = 0.3)
```
We can see that in 2009 the average extraneous expenses were lower for both private and public colleges compared to 2017, but these expenses seem to have increased at a faster rate for public colleges than private ones, contributing to the rise in the overall cost of college.
#### Practice: Calculating and Plotting Group Statistics
1. Calculate the mean and standard error of just the cost of tuition for schools in Georgia. Plot the mean and add the error bars, and split by control.
```{r }
GA_tuition <- cost_attendance %>%
filter(state_code == 'GA', cost_type == 'tuition') %>%
pivot_longer(contains('AY'), names_to = 'acad_year', values_to = 'cost')
GA_tuition_stats <- GA_tuition %>%
group_by(control, acad_year) %>%
summarize(mean_cost = mean(cost),
sem_cost = sd(cost)/sqrt(n()),
n = n())
GA_tuition_stats %>%
ggplot(aes(x = acad_year, y = mean_cost, color = control)) +
geom_point(position = position_dodge(width = 0.5), size = 2) +
geom_errorbar(aes(ymin = mean_cost-sem_cost,
ymax = mean_cost+sem_cost),
position = position_dodge(width = 0.5),
width = 0.4,
size = 1)
```
## Inferential Statistics
R was originally built by statisticians for statisticians. It has a huge variety of open-source statistical packages for basically any type of analysis you can think of. There are some built-in inferential stats functions, such as `lm` for linear modeling and `t.test` for performing a basic t-test. However, for (slightly) more complex stats like the ANOVA family and multilevel modeling, you should use packages outside of base R.
Note: R does come with base `aov` and `anova` functions for performing ANOVAs, but both of these only calculate Type I sums of squares, which is inappropriate for some analyses. The `rstatix` package (more accurately, the `car` package that `rstatix` wraps) provides a more robust framework for ANOVAs and other simple statistical tests than base R does.
Here, we will do a quick example comparing tuition cost between public and private colleges in New York. We will filter out For Profit colleges due to their very small sample size compared to the other groups. We've seen through graphs that private colleges generally charge much higher tuition than public colleges, so this is more a proof of concept than a ground-breaking analysis.
`rstatix` supports the pipe framework so we can easily perform filtering commands before piping the output into the statistical function we want to use.
```{r}
cost_attendance %>%
filter(state_code == 'NY', control != 'For Profit', cost_type == 'tuition') %>%
t_test(AY2017 ~ control)
```
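For comparison, base R's `t.test` runs the same Welch two-sample test; a quick sketch assuming the same filtering (not evaluated here):
```{r base t test sketch, eval=FALSE}
ny_tuition <- cost_attendance %>%
  filter(state_code == 'NY', control != 'For Profit', cost_type == 'tuition')
t.test(AY2017 ~ control, data = ny_tuition)
```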
We could also perform an ANOVA to test the hypothesis that tuition costs for public colleges differed across states in 2017. Again, this isn't a very meaningful statistical test, but it works as an example given our data.
```{r}
cost_attendance %>%
filter(control == 'Public', cost_type == 'tuition') %>%
anova_test(AY2017 ~ state_code) %>%
get_anova_table()
```
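If the omnibus ANOVA comes back significant, a common follow-up is a post-hoc pairwise comparison. `rstatix` also provides `tukey_hsd` for this; a sketch under the same filtering (not evaluated here, and purely illustrative given the toy nature of the test):
```{r tukey sketch, eval=FALSE}
cost_attendance %>%
  filter(control == 'Public', cost_type == 'tuition') %>%
  tukey_hsd(AY2017 ~ state_code)
```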
## Extra: Making a Prettier Graph
The basic ggplot output graphs are better than MATLAB and Excel graphs, but not by much. There is a large amount of power in ggplot to alter pretty much everything about your graph, down to something as small as the pixel size of the tick marks on the axes.
As an example, let's take the graph from the last practice plotting Georgia tuition cost as a function of academic year and institutional control. While plotting a mean and standard error can be informative, plotting the entire distribution of the sample as a backdrop can add context for the viewer.
If you remember, these non-summarized data are stored in the `GA_tuition` dataframe above. We can add a second `geom_point` layer and manually set its data to come from the non-summarized dataframe. This layer comes first so the average value is plotted on top. We will also adjust the transparency of these points by setting the `alpha` aesthetic to 0.5, or half transparent. From there, we can add the summarized mean and SEM layers.
```{r GA tuition with indiv point backdrop}
GA_tuition_stats %>%
ggplot(aes(x = acad_year, y = mean_cost)) +
# plot individual data points for each college
geom_point(data = GA_tuition,
mapping = aes(x = acad_year,
y = cost,
color = control),
position = position_dodge(width = 0.5),
alpha = 0.5) +
# plot the mean and standard error of each group, changing the size of both and adding the `group` aesthetic instead of color so the color of these layers will be black and will stand out from the colored individual points.
geom_point(mapping = aes(group = control),
position = position_dodge(width = 0.5),
size = 2) +
geom_errorbar(aes(ymin = mean_cost-sem_cost,
ymax = mean_cost+sem_cost,
group = control),
position = position_dodge(width = 0.5),
width = 0.4,
size = 1)
```
Now that we've set up the data itself to be more informative while still looking good (depending on who you ask), we need to adjust the axis values and labels, the title, and some general theme elements, such as removing the gray background. All of these changes can be seen in the plot below; the comments above each command explain exactly what is happening.
```{r GA tuition with full formatting}
GA_tuition_stats %>%
ggplot(aes(x = acad_year, y = mean_cost)) +
# plot individual data points for each college using non-summarized data
geom_point(data = GA_tuition,
mapping = aes(x = acad_year,
y = cost,
color = control),
position = position_dodge(width = 0.5),
alpha = 0.5) +
# plot the mean and standard error of each group, changing the size of both and adding the `group` aesthetic instead of color so the color of these layers will be black and will stand out from the colored individual points.
geom_point(mapping = aes(group = control),
position = position_dodge(width = 0.5),
size = 2) +
geom_errorbar(aes(ymin = mean_cost-sem_cost,
ymax = mean_cost+sem_cost,
group = control),
position = position_dodge(width = 0.5),
width = 0.4,
size = 1) +
# set the labels to be just the beginning year, dropping the AY string
scale_x_discrete(labels = 2009:2017) +
# set the y-axis to be formatted as dollars
scale_y_continuous(labels=scales::dollar_format()) +
# add axis and legend labels as well as a plot title
labs(x = 'Academic Year',
y = 'Cost of Tuition (USD)',
title = 'Comparison of Cost of Tuition in Georgia',
color = 'Control of\nCollege') +
# use the standard minimal theme to make the graph look generally better
theme_minimal() +
# set the plot title to be centered and increase the font size
theme(plot.title = element_text(hjust = 0.5, size = 14))
```
## Suggested Resources
There are a huge number of resources out there for learning R and the tidyverse. The tidyverse is much larger than the examples shown here, even as far as basic functions go. For example, during an analysis you may need to combine data from multiple sources, keeping or removing certain rows based on whether data are missing. For that, look into the family of `join` commands that are part of the tidyverse.
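As a minimal sketch of what a join might look like with the workshop data (not evaluated here; note that both tables share `AYxxxx` column names, so dplyr would suffix the overlapping columns with `.x`/`.y` unless you rename or pivot first):
```{r join sketch, eval=FALSE}
# attach graduation rates to the cost data by college name and state
cost_with_grad <- cost_attendance %>%
  left_join(grad_rate, by = c('name', 'state_code'))
```
That is just one example; here is a list of resources I've found helpful for exploring what R and the tidyverse can do for you: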
- [R For Data Science](https://r4ds.had.co.nz/): A free textbook teaching the basics of the tidyverse by the person who created it
- [RStudio Cheatsheets](https://www.rstudio.com/resources/cheatsheets/): A large series of cheatsheets for a variety of packages such as dplyr, ggplot, purrr, and others
- [TidyModels](https://www.tidymodels.org/): A suite of packages designed to make creating machine learning models easier. These models can range from simple linear models to complex tree-based models and beyond. It can also handle cross-validation and hyperparameter tuning using either grid search or Bayesian methods.
- [purrr](https://purrr.tidyverse.org/): Functional programming in R