---
title: "Adjusting variable distribution and exploring data using mass linear regression"
description: |
  Graphs and analysis using the #TidyTuesday data set for week 33 of 2021
  (10/8/2021): "BEA Infrastructure Investment"
author:
  - name: Ronan Harrington
    url: https://github.com/rnnh/
date: 08-15-2021
repository_url: https://github.com/rnnh/TidyTuesday/
preview: bea-infrastructure-investment_files/figure-html5/fig1-1.png
output:
  distill::distill_article:
    self_contained: false
    toc: true
---
```{r knitr, include=FALSE}
knitr::opts_chunk$set(code_folding = TRUE)
knitr::opts_chunk$set(include = TRUE)
```

## Introduction

In this post, the
[BEA Infrastructure Investment](https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-08-10/readme.md)
data set from the [#TidyTuesday](https://github.com/rfordatascience/tidytuesday)
project is used to illustrate variable transformation and the
[exploreR::masslm()](https://rdrr.io/cran/exploreR/man/masslm.html)
function. The inflation-adjusted gross infrastructure investment variable is
transformed to make it less skewed. Using the transformed values, a separate
linear model is then fitted for each of the other variables to quickly see which
of them explain the most variation in infrastructure investment.

## Setup

Loading the `R` libraries and the
[data set](https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-08-10/readme.md).
```{r setup}
# Loading libraries
library(tidyverse)
library(tidytuesdayR)
library(exploreR)
# Loading data
tt <- tt_load("2021-08-10")
```

## Plotting distribution of inflation-adjusted infrastructure investments

In this section, gross infrastructure investment (chained 2012 dollars, in
millions of USD) is plotted with and without a $\log_{10}$ transformation. The
histograms below show that the $\log_{10}$ transformation gives the variable a
less skewed distribution. This transformation should be considered
for statistical testing of inflation-adjusted infrastructure investments.
```{r figure_1, fig.height=5, fig.width=8, fig.cap="The transformed variable is more appropriate for parametric statistical tests."}
# Creating tbl_df with gross_inv_chain values
untransformed_tbl_df <- tibble(
  gross_inv_chain = tt$chain_investment$gross_inv_chain,
  transformation = "Untransformed"
)
# Creating tbl_df with log10(gross_inv_chain) values
log10_tbl_df <- tibble(
  gross_inv_chain = log10(tt$chain_investment$gross_inv_chain),
  transformation = "Log10"
)
# Combining the above tibbles into one tbl_df
gross_inv_chain_tbl_df <- rbind(untransformed_tbl_df, log10_tbl_df)
# Plotting distribution of inflation-adjusted infrastructure investments
# (the legend is hidden with show.legend = FALSE; the facets label each panel)
gross_inv_chain_tbl_df %>%
  ggplot(aes(x = gross_inv_chain, fill = transformation)) +
  geom_histogram(show.legend = FALSE, position = "identity",
                 bins = 12, colour = "black") +
  facet_wrap(~transformation, scales = "free") +
  labs(y = NULL,
       x = "Gross infrastructure investments adjusted for inflation (millions USD)",
       title = "Distributions of untransformed and log transformed infrastructure investments",
       subtitle = "Log transformed investments are more normally distributed") +
  scale_fill_brewer(palette = "Set1") +
  theme_classic()
```
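
The reduction in skew seen in the histograms can also be put into a number. The sketch below is a minimal illustration on simulated log-normal values, not on the BEA data; `simulated_investment` and the helper `sample_skewness()` are hypothetical names introduced here, and the moment-based skewness formula is one of several common definitions.

```r
# Simulated, right-skewed stand-in for investment values (not the BEA data)
set.seed(42)
simulated_investment <- rlnorm(1000, meanlog = 8, sdlog = 1.5)

# Hypothetical helper: moment-based sample skewness, i.e. the mean cubed
# deviation divided by the cubed (population) standard deviation.
# Values near 0 indicate a roughly symmetric distribution.
sample_skewness <- function(x) {
  deviations <- x - mean(x)
  mean(deviations^3) / mean(deviations^2)^1.5
}

skew_untransformed <- sample_skewness(simulated_investment)
skew_log10 <- sample_skewness(log10(simulated_investment))

# Comparing skewness before and after the transformation
c(untransformed = skew_untransformed, log10 = skew_log10)
```

The same helper could be applied to `tt$chain_investment$gross_inv_chain` after removing zero values, since `log10(0)` is `-Inf`.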

## Exploring a data set using mass linear regression

In this section,
[exploreR::masslm()](https://rdrr.io/cran/exploreR/man/masslm.html) is applied
to a copy of the data set with $\log_{10}$ transformed investment values.
The `masslm()` function from the exploreR package quickly fits a separate linear
model of the dependent variable against every other variable in the data set. It
then returns a data frame containing the features of each linear model that are
useful when selecting predictor variables:

- **R squared** The proportion of variation in the dependent (response) variable
  that is explained by the independent (predictor) variable.
- **p-value** The statistical significance of the model. A p-value $< 0.05$ is
  typically considered significant.

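
The idea of fitting one model per predictor and collecting these two quantities can be sketched in base R. This is only an illustration of the technique, not the exploreR implementation: `simple_mass_lm()` is a hypothetical helper, the built-in `mtcars` data stands in for the investment data, and using the overall F-test p-value for each model is an assumption made here.

```r
# Hypothetical helper: fits one simple linear model per predictor and
# collects R squared and a p-value. A sketch of the idea only; it is not
# the exploreR::masslm() implementation.
simple_mass_lm <- function(data, response) {
  predictors <- setdiff(names(data), response)
  results <- lapply(predictors, function(predictor) {
    model_formula <- reformulate(predictor, response = response)
    model_summary <- summary(lm(model_formula, data = data))
    f_stat <- model_summary$fstatistic
    data.frame(
      predictor = predictor,
      r_squared = model_summary$r.squared,
      # p-value of the overall F-test for this model (an assumption here)
      p_value = unname(pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
                          lower.tail = FALSE))
    )
  })
  do.call(rbind, results)
}

# Built-in mtcars data used as a stand-in for the investment data
mass_lm_results <- simple_mass_lm(mtcars[, c("mpg", "wt", "hp", "qsec")], "mpg")

# Ordering the results by R squared (decreasing)
mass_lm_results[order(-mass_lm_results$r_squared), ]
```

Each row describes a model with a single predictor, so the R squared values are not additive and say nothing about how the predictors behave together in one model.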
This function is useful for quickly screening which variables should be
considered for predictive models. Note that the data set used should satisfy the
assumptions of linear models, which include normally distributed residuals; a
heavily skewed response variable makes this unlikely. In this case, the
$\log_{10}$ transformed investment variable is close to normal.
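
One way to examine the normality assumption is to inspect the residuals of a fitted model. The sketch below uses simulated data rather than the BEA data, and the Shapiro-Wilk test is just one possible check chosen for illustration; all variable names are hypothetical.

```r
# Simulated data (not the BEA data): a response that is linear in x on the
# log10 scale, with normally distributed noise
set.seed(1)
x <- runif(200, min = 0, max = 10)
log10_response <- 2 + 0.3 * x + rnorm(200, sd = 0.5)
response <- 10^log10_response

# Fitting the same predictor against the raw and the log10 response
residuals_raw <- residuals(lm(response ~ x))
residuals_log10 <- residuals(lm(log10(response) ~ x))

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
shapiro_raw <- shapiro.test(residuals_raw)
shapiro_log10 <- shapiro.test(residuals_log10)
c(raw = shapiro_raw$p.value, log10 = shapiro_log10$p.value)
```

A normal Q-Q plot of the residuals, for example with `qqnorm()` and `qqline()`, is a common graphical alternative.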

From the mass linear regression results, we can see that investment category
is the single variable that explains the largest proportion of variation in
$\log_{10}$ investment, and that the linear model with group number is the most
significant, followed by year.
```{r masslm}
# Creating a copy of the chain_investment data set with log10 transformed
# gross investment values
chain_investment_df <- tt$chain_investment %>%
  # Creating a log10 transformed copy of gross_inv_chain
  mutate(gross_inv_transformed = log10(gross_inv_chain)) %>%
  # Removing non-finite values (log10(0) is -Inf)
  filter(is.finite(gross_inv_transformed)) %>%
  # Selecting variables to include in the data frame
  select(category, meta_cat, group_num, year, gross_inv_transformed)
# Applying mass linear regression
transformed_investment_masslm <- masslm(chain_investment_df,
                                        dv.var = "gross_inv_transformed")
# Printing the masslm results in order of R squared values (decreasing)
transformed_investment_masslm %>%
  arrange(desc(R.squared))
# Printing the masslm results in order of p-values (increasing)
transformed_investment_masslm %>%
  arrange(P.value)
```
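
The filtering step in the chunk above is needed because a zero investment value has no finite logarithm. A minimal sketch with toy values (not taken from the data set) shows what happens:

```r
# Toy values (not from the data set), including a zero investment
toy_investment <- c(0, 1, 10, 1000)
toy_log10 <- log10(toy_investment)

# The zero becomes -Inf; lm() stops with an error when the response
# contains infinite values
toy_log10

# Keeping only finite values drops -Inf (and would also drop NA, NaN and Inf)
toy_log10_finite <- toy_log10[is.finite(toy_log10)]
toy_log10_finite
```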

## References

- exploreR vignette: [The How and Why of Simple Tools](https://mran.microsoft.com/snapshot/2016-02-27/web/packages/exploreR/vignettes/exploreR.html)