# Spatial Interaction Modelling {#sec-chp5}
This chapter covers spatial interaction flows. Using open data from the city of San Francisco about trips on its bikeshare system, we will estimate spatial interaction models that try to capture and explain the variation in the number of trips on each given route. After visualizing the dataset, we begin with a very simple model and then build complexity progressively by augmenting it with more information, refined measurements, and better modeling approaches. Throughout the chapter, we explore different ways to assess the predictive performance of each model. We finish with a prediction example that illustrates how these models can be deployed in a real-world application.
Content is based on the following references, which are great follow-ups on the topic:
- @fotheringham1989spatial offer a historical overview of spatial interaction models and illustration of use cases.
- @rowe2022 provide a good overview of the existing limitations and opportunities of spatial interaction modelling.
- @gds_ua17, an online short course on R for Geographic Data Science and Urban Analytics. In particular, the section on [mapping flows](https://github.com/alexsingleton/GDS_UA_2017/tree/master/Mapping_Flows) is especially relevant here.
- The predictive checks section draws heavily from @gelman2006data, in particular Chapters 6 and 7.
## Dependencies
We will rely on the following libraries in this section, all of them included in @sec-dependencies:
```{r}
#| message: false
#| warning: false
# Data management
library(tidyverse)
# Spatial Data management
library(sf)
library(sp)
# Pretty graphics
library(ggplot2)
# Thematic maps
library(tmap)
# Add basemaps
library(basemapR)
# Simulation methods
library(arm)
```
In this chapter we will show a slightly different way of managing spatial data in R. Most of the functionality will rely on the "`sf` stack", as in previous chapters, but we also load `sp`, the more traditional package for handling spatial data in R. Although the `sp` approach is being slowly phased out, it is still important to be aware of its existence and of how it differs from more modern approaches.
## Data
In this chapter, we will use data from the city of San Francisco representing bike trips on its public bike share system. The original source is the [SF Open Data portal](https://datasf.org/opendata/) and the dataset comprises both the location of each station in the Bay Area and information on trips (station of origin to station of destination) undertaken in the system from September 2014 to August 2015, and for the following year. Since this chapter is about modeling rather than data preparation, a cleanly reshaped version of the data, together with some additional information, has been created and placed in the `sf_bikes` folder. The data file is named `flows.geojson` and, in case you are interested, the (Python) code required to create it from the original files in the SF Data Portal is also available in the `flows_prep.ipynb` [notebook](https://github.com/darribas/spa_notes/blob/master/sf_bikes/flows_prep.ipynb), in the same folder.
Let us then directly load the file with all the information necessary:
```{r}
db <- st_read('./data/sf_bikes/flows.geojson')
```
Note how the interface is slightly different since we are reading a `GeoJSON` file instead of a shapefile.
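In connection with the note on the `sp` stack above, converting between the two representations is straightforward. Below is a minimal sketch (the rest of the chapter sticks to `sf`):
```{r}
# sf -> sp: coerce to a Spatial* object
db_sp <- as(db, "Spatial")
# sp -> sf: and back again
db_sf <- st_as_sf(db_sp)
```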
The data contains the geometries of the flows, as calculated from the [Google Maps API](https://developers.google.com/maps/), as well as a series of columns with characteristics of each flow:
```{r}
head(db)
```
where `orig` and `dest` are the station IDs of the origin and destination, `street/straight_dist` is the distance in metres between stations measured along the street network or as-the-crow-flies, `total_down/up` is the total downhill and climb in the trip, and `tripsXX` contains the number of trips undertaken in each of the years of study.
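Before visualising anything, a quick numerical summary of the outcome variable (a simple check, not part of the original pipeline) already hints at the right-skewed distribution we will have to deal with when modelling:
```{r}
# Distribution of trip counts in 2015
summary(db$trips15)
```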
## "*Seeing*" flows
The easiest way to get a quick preview of what the data looks like spatially is to make a simple plot:
```{r}
plot(db$geometry)
```
Equally, if we want to visualize a single route, we can simply subset the table. For example, to get the shape of the trip from station `39` to station `48`, we can:
```{r}
db %>%
dplyr::filter(orig == 39 & dest == 48) %>%
ggplot(data = .) +
geom_sf(color = "black",
size = 0.1) +
theme_void()
```
or, for the most popular route, we can:
```{r}
most_pop <- db %>%
dplyr::filter(trips15 == max(trips15))
```
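We can then plot it using the same approach as above (a minimal sketch mirroring the previous chunk):
```{r}
most_pop %>%
  ggplot(data = .) +
  geom_sf(color = "black",
          size = 0.1) +
  theme_void()
```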
These, however, do not reveal a lot: there is no geographical context (*why are there so many routes along the NE?*) and no sense of how the volume of bikers is allocated along the different routes. Let us fix those two.
The easiest way to bring in geographical context is by overlaying the routes on top of a background map of tiles downloaded from the internet. Let us download this using `basemapR`:
```{r}
# create a bounding box
bbox_db <- st_bbox(db)
# download a basemap using ggplot and basemapR
ggplot() +
base_map(bbox_db, increase_zoom = 2, basemap = "positron") +
geom_sf(data = db, fill = NA, colour = "transparent")
```
Now to combine tiles and routes, we need to pull out the coordinates that make up each line. For the route example above, this would be:
```{r}
xys1 <- as.data.frame(st_coordinates(most_pop))
```
Now we can plot the route (note we also dim down the background to focus the attention on flows):
::: column-margin
::: {.callout-tip appearance="simple" icon="false"}
**Task**
Can you plot the route for the largest climb?
:::
:::
```{r}
#| warning: false
ggplot() +
base_map(bbox_db, increase_zoom = 2, basemap = "dark") +
geom_sf( data = db, fill = NA, colour = "transparent") +
geom_path( data = xys1,
aes( x = X, y = Y ),
#size = 1,
color = "green",
lineend ='round'
)
```
Now we can plot all of the lines by using a short `for` loop to build up the table:
```{r}
# Set up shell data.frame
lines <- data.frame(
lat = numeric(0),
lon = numeric(0),
trips = numeric(0),
id = numeric(0)
)
# Run loop
for(x in 1:nrow(db)){
# Pull out row
r <- db[x, ]
# Extract lon/lat coords
xys <- as.data.frame(st_coordinates(r))
names(xys) <- c('lon', 'lat')
# Insert trips and id
xys['trips'] <- r$trips15
xys['id'] <- x
# Append them to `lines`
lines <- rbind(lines, xys)
}
```
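As an aside, the same table can be built without an explicit loop: when applied to the full object, `st_coordinates` returns an `L1` column indexing the feature each vertex belongs to. A vectorised sketch, equivalent to the loop above (not used below):
```{r}
# Vectorised alternative: L1 indexes the line each vertex belongs to
lines_alt <- as.data.frame(st_coordinates(db)) %>%
  rename(lon = X, lat = Y, id = L1) %>%
  mutate(trips = db$trips15[id])
```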
Now we can go on and plot all of them:
```{r}
#| warning: false
ggplot() +
# call basemap
base_map(bbox_db, increase_zoom = 2, basemap = "dark") +
geom_sf(data = db, fill = NA, colour = "transparent") +
# add data
geom_path( data = lines,
aes(x=lon, y=lat,
group=id
),
size = 1,
color = "green",
lineend = 'round')
```
Finally, we can get a sense of the distribution of the flows by associating a color gradient to each flow based on its number of trips:
```{r}
#| message: false
#| warning: false
ggplot() +
# call basemap
base_map(bbox_db, increase_zoom = 2, basemap = "dark") +
geom_sf(data = db, fill = NA, colour = "transparent") +
# add flow data
geom_path( data = lines,
aes(x = lon, y = lat, group = id, colour = trips ),
size=log1p(lines$trips / max(lines$trips)),
lineend = 'round'
) +
# create a colour palette
scale_colour_gradient(
low='#440154FF', high='#FDE725FF'
) +
theme(
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank()
)
```
Note how we transform the line width so it is a proportion of the largest flow, and then compress it with a logarithm (`log1p`).
## Modelling flows
Now that we have an idea of the spatial distribution of flows, we can begin to think about modeling them. The core idea in this section is to fit a model that can capture the particular characteristics of our variable of interest (the volume of trips) using a set of predictors that describe the nature of a given flow. We will start from the simplest model and then progressively build complexity until we get to a satisfying point. Along the way, we will be exploring each model using concepts from @gelman2006data such as predictive performance checks (PPC)[^05-flows-1].
[^05-flows-1]: For a more elaborate introduction to PPC, have a look at Chapters 7 and 8.
Before we start running regressions, let us first standardize the predictors so we can interpret the intercept as the average flow when all the predictors take the average value, and so we can interpret the model coefficients as changes in standard deviation units:
```{r}
# Scale all the table
db_std <- db %>% mutate(across(where(is.numeric), scale))
# Reset trips as we want the original version
db_std$trips15 <- db$trips15
db_std$trips16 <- db$trips16
# Reset origin and destination station and express them as factors
db_std$orig <- as.factor(db$orig)
db_std$dest <- as.factor(db$dest)
```
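As a quick sanity check (not part of the original workflow), the standardised predictors should now have a mean of roughly zero and a standard deviation of one:
```{r}
# Standardised predictors: mean ~0, sd ~1
mean(db_std$straight_dist)
sd(db_std$straight_dist)
```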
**Baseline model**
One of the simplest possible models we can fit in this context is a linear model that explains the number of trips as a function of the straight-line distance between the two stations and the total amount of climb and descent. We will take this as the baseline on which we can build later:
```{r}
m1 <- lm('trips15 ~ straight_dist + total_up + total_down', data=db_std)
summary(m1)
```
To explore how good this model is, we will be comparing the predictions the model makes about the number of trips each flow should have with the actual number of trips. A first approach is to simply plot the distribution of both variables:
```{r}
plot(
density( m1$fitted.values ),
xlim = c(-100, max( db_std$trips15 )),
main=''
)
lines(
density( db_std$trips15 ),
col='red',
main=''
)
legend(
'topright',
c('Predicted', 'Actual'),
col=c('black', 'red'),
lwd=1
)
title(main="Predictive check, point estimates - Baseline model")
```
The plot makes it pretty obvious that our initial model captures very few aspects of the distribution we want to explain. However, we should not get too attached to this plot just yet. What it is showing is the distribution of predicted *point* estimates from our model. Since our model is not deterministic but inferential, there is a certain degree of uncertainty attached to its predictions, and that is completely absent from this plot.
Generally speaking, a given model has two sources of uncertainty: *predictive* and *inferential*. The former relates to the fact that the equation we fit does not capture all the elements of the true data generating process, or not in the exact form they enter it; the latter has to do with the fact that we never get to know the true value of the model parameters, only estimates subject to error and uncertainty. If you think of our linear model above as
$$
T_{ij} = X_{ij}\beta + \epsilon_{ij}
$$ where $T_{ij}$ represents the number of trips undertaken between station $i$ and station $j$, $X_{ij}$ is the set of explanatory variables (length, climb, descent, etc.), and $\epsilon_{ij}$ is an error term assumed to follow a normal distribution $N(0, \sigma)$; then predictive uncertainty comes from the fact that there are elements to some extent relevant for $T_{ij}$ that are not accounted for and are thus subsumed into $\epsilon_{ij}$. Inferential uncertainty comes from the fact that we never get to know $\beta$ but only an estimate of it, which is itself subject to uncertainty.
Taking these two sources into consideration means that the black line in the plot above represents only the behaviour we would expect from our model if the error term were absent (no predictive uncertainty) and the coefficients were the true parameters (no inferential uncertainty). However, this is not necessarily the case: our estimate for the variance of the error term is certainly not zero, and our estimates for each parameter are also subject to a great deal of inferential variability. We do not know to what extent other outcomes would be just as likely. Predictive checking amounts to simulating several feasible scenarios under our model and using those to assess uncertainty and to get a better grasp of the quality of our predictions.
Technically speaking, to do this, we need to build a mechanism to obtain a possible draw from our model and then repeat it several times. The first part of those two steps can be elegantly dealt with by writing a short function that takes a given model and a set of predictors, and produces a possible random draw from such model:
```{r}
generate_draw <- function(m){
# Set up predictors matrix
x <- model.matrix( m )
# Obtain draws of parameters (inferential uncertainty)
sim_bs <- sim( m, 1)
# Predicted value
mu <- x %*% sim_bs@coef[1, ]
# Draw
n <- length( mu )
y_hat <- rnorm( n, mu, sim_bs@sigma[1])
return(y_hat)
}
```
This function takes a model `m`, extracts the set of covariates used to fit it, and returns a random realisation of predictions from the model. To get a sense of how this works, we can get and plot a realisation of the model, compared to the expected one and the actual values:
```{r}
new_y <- generate_draw(m1)
plot(
density( m1$fitted.values ),
xlim = c(-100, max( db_std$trips15 )),
ylim = c(0, max(c(
max( density( m1$fitted.values)$y ),
max( density( db_std$trips15)$y )
)
)
),
col = 'black',
main = ''
)
lines(
density( db_std$trips15 ),
col = 'red',
main = ''
)
lines(
density( new_y ),
col = 'green',
main = ''
)
legend(
'topright',
c('Predicted', 'Actual', 'Simulated'),
col = c( 'black', 'red', 'green' ),
lwd = 1
)
```
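As an aside, `sim` can draw many parameter sets in a single call, so the draw engine could also be vectorised. Below is a sketch under that assumption (the loop used next sticks to the single-draw version for clarity):
```{r}
generate_draws <- function(m, n_sims = 250){
  # Covariate matrix and a batch of parameter draws
  x <- model.matrix( m )
  sims <- sim( m, n_sims )
  # One column of simulated outcomes per parameter draw
  sapply(seq_len(n_sims), function(i){
    mu <- x %*% sims@coef[i, ]
    rnorm( length( mu ), mu, sims@sigma[i] )
  })
}
```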
Once we have this "draw engine", we can set it to work as many times as we want using a simple `for` loop. In fact, we can directly plot these simulated lines alongside the expected distribution and the actual trip counts:
```{r}
plot(
density( m1$fitted.values ),
xlim = c(-100, max( db_std$trips15 )),
ylim = c(0, max(c(
max( density( m1$fitted.values)$y ),
max( density( db_std$trips15)$y )
)
)
),
col='white',
main=''
)
# Loop for realizations
for(i in 1:250){
tmp_y <- generate_draw( m1 )
lines( density( tmp_y ),
col = 'grey',
lwd = 0.1
)
}
#
lines(
density( m1$fitted.values ),
col = 'black',
main = ''
)
lines(
density( db_std$trips15 ),
col = 'red',
main = ''
)
legend(
'topright',
c('Actual', 'Predicted', 'Simulated (n=250)'),
col = c('red', 'black', 'grey'),
lwd = 1
)
title(main="Predictive check - Baseline model")
```
The plot shows there is a significant mismatch between the fitted values, which are much more concentrated around small positive values, and the realisations of our "inferential engine", which depict a much less concentrated distribution of values. This is likely due to a combination of two reasons: on the one hand, the accuracy of our estimates may be poor, causing them to jump around a wide range of potential values and hence resulting in very diverse predictions (inferential uncertainty); on the other hand, the amount of variation we are not able to account for in the model[^05-flows-2] may be so large that the degree of uncertainty contained in the error term is very large, hence resulting in such a flat predictive distribution.
[^05-flows-2]: The $R^2$ of our model is around 2%.
Keep in mind that the issues discussed in the paragraph above relate to the uncertainty behind our model, not to the point predictions derived from it, which are a mechanistic result of the minimisation of the squared residuals and hence not subject to probability or inference. That is what allows the model, in this case, to provide an apparently much more accurate fitted distribution (black line above). The lesson to take from this, however, is that even if the point predictions (fitted values) are artificially accurate[^05-flows-3], our ability to infer about the more general underlying process is fairly limited.
[^05-flows-3]: Which they are not really, in light of the comparison between the black and red lines.
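As a quick confirmation of the figure quoted in the footnote above, we can extract the $R^2$ directly from the fitted object:
```{r}
# Share of the variance in trips explained by the baseline model
summary(m1)$r.squared
```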
**Improving the model**
The bad news from the previous section is that our initial model is not great at explaining bike trips. The good news is that there are several ways in which we can improve it. In this section we will cover three main extensions that exemplify three different routes you can take when enriching and augmenting models in general, and spatial interaction ones in particular[^05-flows-4]. These three routes are aligned around the following principles:
[^05-flows-4]: These principles are general and can be applied to pretty much any modeling exercise you run into. The specific approaches we take in this chapter relate to spatial interaction models.
1. Use better approximations to model your dependent variable.
2. Recognize the structure of your data.
3. Get better predictors.
- **Use better approximations to model your dependent variable**
Standard OLS regression assumes that the error term and, since the predictors are deterministic, the dependent variable follow a normal (Gaussian) distribution. This is usually a good approximation for several phenomena of interest, but maybe not the best one for trips along routes: for one, we know trips cannot be negative, which the normal distribution does not account for[^05-flows-5]; more subtly, their distribution is not really symmetric but skewed, with a very long tail on the right. This is common in variables that represent counts, which is why it is usually more appropriate to fit a model that relies on a distribution other than the normal.
[^05-flows-5]: For an illustration of this, consider the amount of probability mass to the left of zero in the predictive checks above.
One of the most common distributions for these cases is the Poisson, which can be incorporated through a generalised linear model (GLM). The underlying assumption here is that instead of $T_{ij} \sim N(\mu_{ij}, \sigma)$, our model now follows:
$$
T_{ij} \sim \text{Poisson} \left( \exp(X_{ij}\beta) \right)
$$
As usual, such a model is easy to run in R:
```{r}
m2 <- glm(
  'trips15 ~ straight_dist + total_up + total_down',
  data = db_std,
  family = poisson
)
```
Now let's see how much better, if at all, this approach is. To get a quick overview, we can simply plot the point predictions:
```{r}
plot(
density( m2$fitted.values ),
xlim = c( -100, max( db_std$trips15 )),
ylim = c( 0, max(c(
max( density( m2$fitted.values)$y ),
max( density(db_std$trips15)$y )
)
)
),
col = 'black',
main = ''
)
lines(
density( db_std$trips15 ),
col = 'red',
main = ''
)
legend(
'topright',
c('Predicted', 'Actual'),
col = c('black', 'red'),
lwd = 1
)
title(main = "Predictive check, point estimates - Poisson model")
```
To incorporate uncertainty into these predictions, we need to tweak our `generate_draw` function so it accommodates the fact that our model is not linear anymore:
```{r}
generate_draw_poi <- function(m){
# Set up predictors matrix
x <- model.matrix( m )
# Obtain draws of parameters (inferential uncertainty)
sim_bs <- sim( m, 1 )
# Predicted value
xb <- x %*% sim_bs@coef[1, ]
# Transform using the link function
mu <- exp( xb )
# Obtain a random realization
y_hat <- rpois( n = length( mu ), lambda = mu)
return(y_hat)
}
```
And then we can examine both the point predictions and the uncertainty around them:
```{r fig.fullwidth=TRUE}
plot(
density( m2$fitted.values ),
xlim = c(-100, max( db_std$trips15 )),
ylim = c(0, max(c(
max( density( m2$fitted.values)$y ),
max( density( db_std$trips15)$y )
)
)
),
col = 'white',
main = ''
)
# Loop for realizations
for(i in 1:250){
tmp_y <- generate_draw_poi( m2 )
lines(
density( tmp_y ),
col = 'grey',
lwd = 0.1
)
}
#
lines(
density( m2$fitted.values ),
col = 'black',
main = ''
)
lines(
density( db_std$trips15 ),
col = 'red',
main = ''
)
legend(
'topright',
c('Predicted', 'Actual', 'Simulated (n=250)'),
col = c('black', 'red', 'grey'),
lwd = 1
)
title( main = "Predictive check - Poisson model")
```
Voilà! Although the curve is still a bit off, centred too much to the right of the actual data, our predictive simulation leaves the fitted values right in the middle. This speaks to a better fit of the model to the distribution that the original data follow.
- **Recognize the structure of your data**
So far, we have treated our dataset as if it were flat (i.e. comprised of fully independent realizations) when in fact it is not. Most crucially, our baseline model does not account for the fact that every observation in the dataset pertains to a trip between two stations. This means that all the trips from or to the same station probably share elements, which likely help explain how many trips are undertaken between stations. For example, think of trips to and from a station located in the famous Embarcadero, a popular tourist spot. Every route to and from there probably has more trips due to the popularity of the area, and we are currently not acknowledging this in the model.
A simple way to incorporate these effects into the model is through origin and destination fixed effects. This approach shares elements with both spatial fixed effects and multilevel modeling and essentially consists of including a binary variable for every origin and destination station. In mathematical notation, this equates to:
$$
T_{ij} = X_{ij}\beta + \delta_i + \delta_j + \epsilon_{ij}
$$
where $\delta_i$ and $\delta_j$ are origin and destination station fixed effects[^05-flows-6], and the rest is as above. This strategy accounts for all the unobserved heterogeneity associated with the location of the station. Technically speaking, we simply need to introduce `orig` and `dest` into the model:
[^05-flows-6]: In this chapter, $\delta_i$ and $\delta_j$ are estimated as independent variables, so their estimates are interpreted similarly to those in $\beta$. An alternative approach would be to model them as random effects in a multilevel framework.
```{r}
m3 <- glm(
'trips15 ~ straight_dist + total_up + total_down + orig + dest',
data = db_std,
family = poisson
)
```
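Note that the fixed effects substantially increase the number of parameters to estimate: one dummy per origin and per destination station (minus the reference categories). A quick check of the dimensionality:
```{r}
# Number of estimated coefficients, most of them station dummies
length(coef(m3))
```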
And with our new model, we can have a look at how well it does at predicting the overall number of trips[^05-flows-7]:
[^05-flows-7]: Although, theoretically, we could also include simulations of the model in the plot to get a better sense of the uncertainty behind it, in practice this proves troublesome. The problems most likely arise from the fact that many of the origin and destination binary variable coefficients are estimated with a great deal of uncertainty. This causes some of the simulations to generate extreme values that, when passed through the exponential of the Poisson link function, cause problems. If anything, this is testimony to how a simple fixed effect model can sometimes lack accuracy and generate very uncertain estimates. A potential extension to work around these problems could be to fit a multilevel model with two specific levels beyond the trip level: one for origin and another for destination stations.
```{r}
plot(
density( m3$fitted.values ),
xlim = c(-100, max( db_std$trips15 )),
ylim = c(0, max(c(
max( density(m3$fitted.values)$y ),
max( density(db_std$trips15)$y )
)
)
),
col = 'black',
main = ''
)
lines(
density( db_std$trips15 ),
col = 'red',
main = ''
)
legend(
'topright',
c('Predicted', 'Actual'),
col = c('black', 'red'),
lwd = 1
)
title( main = "Predictive check - Orig/dest FE Poisson model")
```
That looks significantly better, doesn't it? In fact, our model now better accounts for the long tail in which a few routes take a lot of trips. This is likely because the distribution of trips is far from random across stations, and our origin and destination fixed effects do a decent job of accounting for that structure. However, our model is still notably underpredicting less popular routes and overpredicting routes with an above-average number of trips. Maybe we should think about improving the measurement of our predictors.
- **Get better predictors**
The final extension is, in principle, always available but, in practice, can be tricky to implement. The core idea is that your baseline model might not include the best measurement of the phenomena you want to account for. In our example, think of the distance between stations. So far, we have been including the distance measured "as the crow flies" between stations. Although in some cases this is a good approximation (particularly when distances are long and the likely route taken is as close to a straight line as possible), in cases like ours, where the street layout and the presence of elevation probably matter more than the straight-line distance, it is not necessarily a safe assumption.
As an example of this approach, we can replace the straight-line distance measurements with more refined ones based on routes from the Google Maps API. This is very easy, as all we need to do (once the distances have been calculated!) is to swap `straight_dist` for `street_dist`:
```{r}
m4 <- glm(
'trips15 ~ street_dist + total_up + total_down + orig + dest',
data = db_std,
family = poisson
)
```
And we can similarly get a sense of our predictive fitting with:
```{r}
plot(
density( m4$fitted.values ),
xlim = c(-100, max( db_std$trips15)),
ylim = c(0, max(c(
max( density(m4$fitted.values)$y ),
max( density(db_std$trips15)$y )
)
)
),
col = 'black',
main = ''
)
lines(
density( db_std$trips15 ),
col = 'red',
main = ''
)
legend(
'topright',
c('Predicted', 'Actual'),
col = c('black', 'red'),
lwd = 1
)
title( main = "Predictive check - Street dist. FE Poisson model")
```
Hard to tell any noticeable difference, right? To see if there is any, we can have a look at the estimates obtained:
```{r}
summary(m4)$coefficients['street_dist', ]
```
And compare this to that of the straight distances in the previous model:
```{r}
summary(m3)$coefficients['straight_dist', ]
```
As we can see, differences exist, but they are not massive. Let's use this example to learn how to interpret coefficients in a Poisson model[^05-flows-8]. Effectively, these estimates can be understood as multiplicative effects. Since our model fits
[^05-flows-8]: See section 6.2 of @gelman2006data for a similar treatment of these.
$$
T_{ij} \sim \text{Poisson} \left( \exp(X_{ij}\beta) \right)
$$
we need to transform $\beta$ through an exponential to get a sense of the effect of distance on the number of trips. For the street distance, our original estimate is $\beta_{street} = -0.0996$, which translates through the exponential into $e^{-0.0996} = 0.906$. In other words, since distance is expressed in standard deviations[^05-flows-9], we can expect a roughly 10% decrease in the number of trips for an increase of one standard deviation (about 1 km) in the distance between the stations. This can be compared with $e^{-0.0782} = 0.925$ for the straight-line distances, or a reduction of about 8% in the number of trips for every increase of one standard deviation (about 720 m).
[^05-flows-9]: Remember the transformation at the very beginning.
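These multiplicative effects can also be pulled directly from the fitted models, which should reproduce the numbers quoted above:
```{r}
# Multiplicative effect of a one standard deviation increase in distance
exp(coef(m4)['street_dist'])   # street network distance
exp(coef(m3)['straight_dist']) # straight-line distance
```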
## Predicting flows
So far, we have put all of our modelling efforts into understanding the model we fit and improving it so that it fits our data as closely as possible. This is essential in any modelling exercise, but it should be far from a stopping point. Once we are confident our model is a decent representation of the data generating process, we can start exploiting it. In this section, we will cover one specific case that showcases how a fitted model can help: out-of-sample forecasts.
It is August 2015, and you have just started working as a data scientist for the bikeshare company that runs the San Francisco system. You join them as they're planning for the next academic year and, in order to plan their operations (re-allocating vans, station maintenance, etc.), they need to get a sense of how many people are going to be pedalling across the city and, crucially, *where* they are going to be pedalling through. What can you do to help them?
The easiest approach is to say "well, a good guess for how many people will be going between two given stations this coming year is how many went through last year, isn't it?". This is one prediction approach. However, even if the same process governs both datasets (2015 and 2016), each year will probably have some idiosyncrasies, and thus looking too closely into one year might not give the best possible answer for the next one. Ideally, you want a good stylised synthesis that captures the bits that stay constant over time, and thus can be applied in the future, while ignoring those aspects that are too particular to a given point in time. That is the rationale behind using a fitted model to obtain predictions.
However good any theory, though, the proof is in the pudding. So, to see if a modelling approach is better at producing forecasts than just using the counts from last year, we can put them to a test. The way this is done when evaluating the predictive performance of a model (as this is called in the literature) relies on two basic steps: a) obtain predictions from a given model, and b) compare those to the actual values (in our case, the counts for 2016 in `trips16`) to get a sense of "how off" they are. We have essentially covered a) above; for b), there are several measures we could use. We will use one of the most common, the root mean squared error (RMSE), which roughly gives a sense of the average difference between a vector of predictions and the actual values:
$$
RMSE = \sqrt{ \frac{1}{N} \sum_{ij} (\hat{T}_{ij} - T_{ij})^2}
$$
where $\hat{T}_{ij}$ is the predicted number of trips between stations $i$ and $j$, and $N$ is the number of flows. RMSE is straightforward to compute in R and, since we will use it a couple of times, let's write a short function to make our lives easier:
```{r}
rmse <- function(t, p){
se <- (t - p)^2
mse <- mean(se)
rmse <- sqrt(mse)
return(rmse)
}
```
where `t` stands for the vector of true values, and `p` is the vector of predictions. Let's give it a spin to make sure it works:
```{r}
rmse_m4 <- rmse(db_std$trips16, m4$fitted.values)
rmse_m4
```
That means that, on average, predictions from our best model, `m4`, are 256 trips off. Is this good or bad? It is hard to say in the abstract but, being practical, what we can check is whether it is better than our alternatives. Let us have a look at the RMSE of the other models, as well as that of simply plugging in last year's counts:
::: column-margin
::: {.callout-tip appearance="simple" icon="false"}
**Task**
Can you create a single plot that displays the distribution of the predicted values of the five different ways to predict trips in 2016 and the actual counts of trips?
:::
:::
```{r}
rmses <- data.frame(
model=c(
'OLS',
'Poisson',
'Poisson + FE',
'Poisson + FE + street dist.',
'Trips-2015'
),
RMSE=c(
rmse(db_std$trips16, m1$fitted.values),
rmse(db_std$trips16, m2$fitted.values),
rmse(db_std$trips16, m3$fitted.values),
rmse(db_std$trips16, m4$fitted.values),
rmse(db_std$trips16, db_std$trips15)
)
)
rmses
```
The table is both encouraging and disheartening at the same time. On the one hand, all the modelling techniques covered above behave as we would expect: the baseline model displays the worst predictive power of all, and every improvement (except the street distances!) results in notable decreases of the RMSE. This is good news. On the other hand, all of our modelling efforts fall short of giving a better guess than simply using the previous year's counts. *Why? Does this mean that we should not pay attention to modelling and inference?* Not really. Generally speaking, a model is as good at predicting as it is at mimicking the underlying process that gave rise to the data in the first place. The results above point to a case where our model is not picking up all the factors that determine the number of trips undertaken on a given route. This could be improved by enriching the model with more and better predictors, as we have seen above. The example also seems to point to a case where the idiosyncrasies of 2015 that the model does not pick up are at work in 2016 as well. This is great news for our prediction efforts this time around, but we have no idea why this is the case and, for all we know, it could change next year. Besides the elegant quantification of uncertainty, the true advantage of a modelling approach in this context is that, if well fit, it is able to pick up the fundamentals that apply over and over. This means that, if next year we are not as lucky and previous counts are not good predictors, but the variables we used in our model continue to play a role in determining the outcome, the data scientist should be luckier and land a better prediction.
## Questions
We will again be using the Madrid Airbnb dataset:
```{r}
mad_abb <- st_read('./data/assignment_1_madrid/madrid_abb.gpkg')
```
The columns to use here are:
- `price_usd`: price expressed in USD
- `log1p_price_usd`: logarithm of the price expressed in USD
- `accommodates`: number of people the property accommodates
- `bathrooms`: number of bathrooms the property includes
- `bedrooms`: number of bedrooms the property includes
- `beds`: number of beds the property includes
With these data at hand, accomplish the following challenges:
1. Set up a baseline regression model where you explain the price of a property as a function of its characteristics:
$$
P_i = \alpha + \beta_1 Acc_i + \beta_2 Bath_i + \beta_3 Bedr_i + \beta_4 Beds_i + \epsilon_i
$$
2. Fit a parallel model that uses the log of price as dependent variable:
$$
\log(P_i) = \alpha + \beta_1 Acc_i + \beta_2 Bath_i + \beta_3 Bedr_i + \beta_4 Beds_i + \epsilon_i
$$
3. Perform a predictive check analysis of both models, discussing how they compare, which one you would prefer, and why.