# Building Blocks {#chap-blocks}
```{r setup-blocks, include = FALSE}
knitr::opts_chunk$set(cache = TRUE)
library(tidyverse)
library(tidycensus)
library(tigris)
library(leaflet)
library(sf)
library(nhts2017)
theme_set(theme_bw())
```
This chapter contains concepts, definitions, and mathematical techniques that will
be used throughout the semester. Critical terms to understand are given in **bold**.
## Planning for Human Systems
If you look out on any sufficiently busy road, you will see a steady stream of
vehicles passing by. Each vehicle is largely indistinguishable from the others,
and it is easy as an engineer responsible for that road to see the cars driving
by as little more than an input to a problem. But the *people* inside the cars
should not be indistinguishable from each other. Each person who is driving or
riding in each of those cars has their own reasons to be driving on that road.
One person might be driving to work; one person might be trying to get home to
his or her family. Another car might hold a family going on vacation, or a group
of friends heading to a movie.
If you don't recognize that each person who travels is different, with different
needs and purposes, then it is easy to look only at the **supply** of transportation
infrastructure. Is the road wide enough? Is the traffic signal timed appropriately?
But as with anything in the economy, transportation is a function of both supply
and **demand**. Why are so many people trying to get down this one road *right now*?
Why didn't more people take transit? Why didn't some people choose a different
destination? Or why didn't some people just stay home in the first place?
Transportation planning therefore must be concerned with both the supply of
infrastructure and the demand for travel. For the most part, economists consider
travel a **derived demand**, which means people only go to the hassle of
traveling somewhere if they have some other reason to be there. No one
typically just drives around (with the possible exception of teenagers on a
weekend night); they are going to work, or school, or a social engagement, or
*something*.
Travel demand has not been stable over time. The availability of inexpensive
automobiles in the 20th century created demand for inter-city and intra-urban
roads that did not exist before. Rising labor force participation rates for
women radically changed the number and types of trips the average household
makes in an average day. Technological developments like teleconferencing and
smartphone-enabled ridehailing could generate different trends. At the same
time, populations in most regions continue to grow. **Planning** for future
transportation infrastructure is difficult because of the uncertainty of the
future, but it is necessary to keep economies rolling and preserve or improve
quality of life.
In the United States and most societies with some democratic process,
**decisions** about what transportation facilities to build, which policies to
implement, and how to build a city generally fall to **decision makers**. These
decision makers consist of mayors, city councils, planning commissions,
state legislatures, Congress, state and federal agencies, and innumerable others
who are elected by the public, or who are accountable to others who have been.
In making decisions about how to spend public money on civil infrastructure or
enact tax or other policies, decision makers consult **plans** developed by
professional engineers and planners.
As engineers and planners, we are rarely in a position to *make* decisions, but
we have a responsibility to provide accurate data and technical analysis to
support decision makers. There is a misconception that transportation planners
must accurately predict the future to be relevant. The purpose of transportation
planning is not to perfectly envision what will happen under every scenario;
it is to provide information that will help make good decisions now so that
the future is at least as pleasant as the present. We all have hopes for what
our lives and community will look like ten or twenty years from now; it may not
be possible for anyone to provide analysis entirely free of all personal bias.
But as you conduct your work as an engineer and planner, you owe the public your
integrity and competence as you provide information to their representatives.
## The Four-Step Transportation Planning Process
How can you know what might happen in the future? And how might that change based
on decisions that you make today? This basic question is at the heart of
transportation planning:
- What might traffic look like if we build nothing and population still grows?
- Can we build less if we change land development patterns?
- How many people will use this new transit line?
In many fields --- including politics, meteorology, economics, etc. ---
professionals who seek answers to questions like this do so with the help of a
**model**. A model is a mathematical representation of a real-world system.
In any model, there are things that need to vary (called inputs), things that
can be estimated or calibrated (called parameters), and results (called outputs).
There are also things that are held fixed. The specific mathematical structure
of the model, and which things get included and which things are excluded or
abstracted away, determine what the model should and should not be used for.
For example, we might try to predict something with a linear model,
\begin{equation}
y = \alpha + \beta x + \varepsilon
(\#eq:basic-model)
\end{equation}
In this case $y$ is the output, $x$ is the input, the
$\beta$ parameter defines how the input influences the outcomes, $\alpha$ is a
fixed value, and $\varepsilon$ accounts for the random influence of all the
factors we did not include. We could add more $\beta$ and $x$ terms to
include more factors in the model. We could also change the mathematical format
of the model to represent different types of outcomes, or chain several smaller
models together to represent more complex relationships. If we wanted to see what
might happen if $x$ changed, we could put in a new value into this equation
and the output result $y$ might be a plausible prediction. The plausibility of
the output is a function of how well the mathematical model actually represents
the reality of the system.
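To make this concrete, the short chunk below plugs a new input into a line with assumed
parameter values; the intercept, slope, and new input are hypothetical numbers chosen
purely for illustration.
```{r basic-model-sketch}
# Hypothetical parameter values for the line y = alpha + beta * x
intercept <- 2.0   # assumed alpha
slope     <- 0.5   # assumed beta
x_new     <- 10    # a new input value
# a plausible prediction of the output, ignoring the random error term
(y_hat <- intercept + slope * x_new)
```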
In this class you will learn the details of the *travel demand modeling process*,
which is a chain of many models, each with different inputs and outputs. A
travel demand model on the whole has two basic inputs:
- Socioeconomic data representing where people live and work and go to school and
do other things.
- Transportation network data representing the roads and transit services and
other methods people use to get between their activities.
The basic outputs of a travel demand model are transportation volumes and levels
of service. There are many ways to design and build travel demand models, but
the traditional way most regions in the United States approach travel demand models
is through a **four step**, trip-based^[The primary alternative to the four-step method is called an "activity-based" method. Activity-based models can be significantly more complex to construct and use, but they are based more concretely in human behavior. Whether they actually result in better decisions by transportation planning agencies is an open question.] process. The four steps are:
1. Trip Generation
2. Trip Distribution
3. Mode Choice
4. Route Assignment
A *trip generation* model determines how many trips are produced in each *zone*
(neighborhood), and how many trips are attracted to each zone. The input to
this model is the socioeconomic data in each zone. Mathematically, trip generation
can be represented as
\begin{equation}
P_i = f(SE_i), A_j = f(SE_j)
(\#eq:tripgen)
\end{equation}
where $i$ and $j$ index the production and attraction zones, respectively.
A *trip distribution* model seeks to pair the productions and attractions in the
zones based on the travel costs $c$ between the two zones. Mathematically, trip
distribution can be represented as
\begin{equation}
T_{ij} = f(P_i, A_j, c_{ij})
(\#eq:tripdist)
\end{equation}
A *mode choice* model estimates how many of the trips from $i$ to $j$ will happen
on each available mode $k$, based on the travel time by each mode and other
attributes of the origin and destination zones. Mathematically,
\begin{equation}
T_{ijk} = f(T_{ij}, c_{ijk}, SE_{i,j})
(\#eq:modechoice)
\end{equation}
A *route assignment* model determines the specific routes that the trips going
between $i$ and $j$ take. This allows us to estimate the volume and level of
service on each highway link and transit system $l$. Mathematically,
\begin{equation}
LOS_l, V_l = f(T_{ijk}, c_{ijk})
(\#eq:assignment)
\end{equation}
On the whole, the travel demand model can be represented mathematically as a
single function where the output transportation volumes and levels of service
are a function of the input socioeconomic information and travel costs.
\begin{equation}
LOS_l, V_l = \mathcal{F}(SE_{i,j}, c_{ij})
(\#eq:tdm)
\end{equation}
The details of each of these models will be the topic for the next several
chapters.
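Although the details come in later chapters, it can help to see how the output of one
step becomes the input to the next. The sketch below chains trip generation and a
simple gravity-style trip distribution for a hypothetical three-zone region; every
number in it (trip rates, travel costs, the cost exponent) is an illustrative
assumption rather than a value from any real model.
```{r four-step-sketch}
# Hypothetical socioeconomic data for three zones
se <- tibble(zone = 1:3, households = c(100, 200, 150), jobs = c(300, 50, 120))
# Trip generation: productions from households, attractions from jobs
productions <- se$households * 9   # assumed 9 trips per household
attractions <- se$jobs * 3         # assumed 3 trips per job
# Assumed travel cost skim between zones, in minutes
cost <- matrix(c(5, 10, 15, 10, 5, 8, 15, 8, 5), nrow = 3)
# Trip distribution with a simple gravity form:
# T_ij = P_i * A_j * c_ij^-2 / sum_j (A_j * c_ij^-2)
weighted_attractions <- sweep(cost^-2, 2, attractions, "*")
trips <- weighted_attractions / rowSums(weighted_attractions) * productions
round(trips)
```
Mode choice and route assignment would continue the chain, splitting each cell of this
trip matrix among modes and loading the resulting vehicle trips onto network paths.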
## Travel Model Building Blocks
In this section, we present some of the terms used in transportation planning
and modeling, as well as some of the data objects used in constructing travel
demand models.
### Travel Analysis Zones and SE Data
The "people" in a model conduct *activities*: work, school, recreation, and
other activities. Because travel is a derived demand, the purpose of travel is to
move between these activities. So a travel model needs a way to represent where
the households, persons, jobs, and activities are located in space.
Activities in travel demand models happen in **Travel Analysis Zones** (TAZs).
The model tries to represent trips between the TAZs. Because trips inside a
TAZ --- called intrazonal trips --- are not included in the travel model, each
TAZ should be sufficiently small such that these trips do not affect the model's
ability to forecast travel on roadways. The following rules are helpful when
drawing TAZs:
- The TAZ should not stretch across major roadways
- The TAZ should contain principally one land use, though in some areas this
is not possible.
- In more densely populated areas, TAZs should be smaller.
Each TAZ is associated with **socioeconomic** (SE) data, or information about
the people, businesses, and other activities that are located in the TAZ.
**Households** are a basic unit of analysis in many economic and statistical
analyses. A household typically consists of one or more **persons** who reside
in the same dwelling. Individuals living in the same dwelling can make up a
family or other group of individuals; that is, a group of roommates is
considered a household. Not everyone lives in households, however; some people
live in what are called group quarters: military barracks, college dormitories,
monasteries, prisons, etc. Travel models need to handle these people as well, but
in this class we will focus on people who live in households.
Households in travel models are often grouped into a classification scheme
based on the number of people in the household, the number of children, the number
of vehicles, etc. Households of different classifications will have different
behavior in the rest of the model.
::::{.rmdthink}
Your lab activity for this unit will walk you through specifying a
household classification model.
::::
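As a small preview, the sketch below assigns a handful of hypothetical households to
classes based on how their vehicle ownership compares to the number of workers; the
categories and thresholds are illustrative only.
```{r hh-class-sketch}
# A handful of hypothetical households
households <- tibble(
  hh_id = 1:5,
  persons = c(1, 2, 4, 3, 5),
  workers = c(1, 2, 2, 0, 1),
  vehicles = c(0, 2, 1, 1, 3)
)
# Classify each household by comparing vehicles to workers
households %>%
  mutate(hh_class = case_when(
    vehicles == 0 ~ "zero-vehicle",
    vehicles < workers ~ "vehicle-insufficient",
    TRUE ~ "vehicle-sufficient"
  ))
```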
**Firms** are another basic unit of analysis in many economic and statistical
analyses. A firm is a profit-seeking person or entity that provides goods or
services in exchange for monetary transactions. A firm can provide raw resources,
manufactured resources, other services, or be a place of employment. In some
cases, a firm may be another household. Each firm will have an *industry* type.
Examples of industry types include office, service, manufacturing, retail, etc.
In many SE data files, firms are simply represented as the total number of
jobs in a TAZ belonging to each industry. Other **Institutions** including
academic, government, and non-profit entities will also be represented in the SE
data in terms of their jobs.
It is important to be precise in our definitions when putting all of these different
items into an analysis. A typical socioeconomic data table for a small region is
given in Table \@ref(tab:setable). Note the following relationships:
- Persons live in Households
- Workers are Persons who have a Job
- Firms have employees who work at Jobs
```{r RoanokeTAZ, fig.cap="Travel Analysis Zones in Central Roanoke.", out.width='100%', echo = FALSE}
knitr::include_graphics("images/RoanokeTAZ.png")
```
When we talk about "how many jobs" are in a TAZ, we mean "How many people do
the firms located in that TAZ employ," and not "how many people who live
in that TAZ are workers."
```{r setable, echo = FALSE}
se_nrow <- 3
tibble(
taz = 1:se_nrow, persons = sample(30:50, se_nrow),
hh = round(persons * rnorm(se_nrow, 0.8, 0.3)),
workers = round(persons * rnorm(se_nrow, 0.4, 0.1)),
retail = sample(100:150, se_nrow), office = sample(50:100, se_nrow),
manufacturing = c(2, 11, 0)
) %>% knitr::kable(caption = "Example SE Table")
```
::::{.rmdexample}
Alice lives with her husband in zone $A$ and works as an accountant in zone $B$.
Her husband does not currently work.
Fill out the SE table from Table \@ref(tab:setable) with *just* this
household's information.
Two persons live in one household with one worker in zone A. The firm Alice works at
has an office job for her in Zone B.
| taz | persons | hh | workers | retail | office | manufacturing |
|-----|---------|------------|---------|--------|--------|---------------|
| A | 2 | 1 | 1| | | |
| B | | | | | 1| |
::::
### Transportation Networks
The purpose of a travel model is to understand how people are likely to use
transportation infrastructure, so there has to be a way to represent
roadway and transit systems. We do this with a
*network*^[Sometimes called a *graph* in mathematics.]. A network consists of
two basic data structures: **Nodes** and **Links**.
Nodes are points in space. In a highway network, almost all nodes represent
intersections between different roads. Some important nodes represent the
**TAZ Centroids**, or the points where the households and jobs in the travel
model are located.
Links connect nodes, and represent roads. Links have many different attributes
describing the characteristics of the roadway they represent. The two most
important link attributes are the link's *speed* and *capacity*, because they
provide the travel costs ($c_{ij}$ above) to the various steps of the model. But
these attributes might not always be known at the outset, so instead we use
attributes of the roadway that influence capacity and speed, and then calculate
these other values.
**Functional Type** or **Functional Class** describes the relative
role each road plays in the transportation system [@2018aashto]. Every street
fills a role on a spectrum from mobility on one end to accessibility on the other:
roads that are good at moving high volumes of vehicles are usually not good at
providing access to homes and businesses. Common functional types include:
- *Freeways* are provided almost exclusively to enhance mobility for through traffic.
Access to freeways is provided only at specific grade-separated interchanges,
with no direct access to the freeway from adjacent land except by way of those
interchanges.
- The primary function of major and minor *arterials* is to provide mobility of through
traffic. However, arterials also connect to both collectors and local roads and
streets and many arterials provide direct access to adjacent development.
- Major and minor *collectors* connect arterials to local roads and provide access
to adjacent development. Mobility for through traffic is less important.
- *Local streets* exist primarily to serve adjacent development. Mobility for
through traffic is not important, or even desired.
Figure \@ref(fig:udot-classes) shows roads in Provo and Orem classified by this
scheme. Streets of a functional class below collector are almost never included
in travel models, unless they provide essential connectivity between other
roads. Entire neighborhoods of local streets may be represented by just a
handful of special links called *centroid connectors*.
```{r udot-classes, fig.cap="UDOT Functional Classes.", out.width='100%', echo = FALSE}
knitr::include_graphics("images/UDOT_Functional_Classes.png")
```
::::{.rmdthink}
Why are local roads not included in travel models?
::::
**Free-flow speed** is the speed vehicles travel when the road is empty.
Historically, travel modelers would use formulas in the Highway Capacity Manual
to estimate the free-flow speed for roadways, or assert a basic calculation like
5 miles per hour over the speed limit. More recently, modelers use the speeds
reported from GPS devices in the middle of the night to establish free-flow
speeds.
The number of **lanes** on a road is fairly self-explanatory, but it plays a
major role in the road's capacity.
Roads located in different **area types** -- urban, suburban, and rural --
operate differently from each other. Sometimes travel models will assert this
value, but more recent models will calculate the area type for each link
based on the density of the surrounding TAZs.
Link **capacity** is the maximum number of vehicles a link can optimally transport
between two nodes. The capacity is a function of functional type, lanes,
free-flow speed, area type, etc. Usually travel models will calculate the
capacity based on the given values for other roadway characteristics, but
sometimes there are ways to override this feature, e.g., if engineers have
developed specific capacity estimates for a new project.
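The chunk below is a minimal sketch of this kind of calculation: hourly capacity
derived from lanes and a per-lane capacity that depends on functional class. The
per-lane values are hypothetical placeholders, not figures from any agency's model.
```{r capacity-sketch}
# Hypothetical per-lane capacities in vehicles per hour per lane
per_lane_capacity <- c(freeway = 2000, arterial = 900, collector = 600)
links <- tibble(
  link_id = 1:4,
  functional_class = c("freeway", "arterial", "collector", "arterial"),
  lanes = c(3, 2, 1, 2)
) %>%
  mutate(capacity = per_lane_capacity[functional_class] * lanes)
links
```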
**Centroid connectors** are special links that connect centroids to a network.
These are different from other links in that they usually don't have a capacity
or a speed (they don't represent real roads).
### Matrices
Travel models need to represent travel times, costs, and flows between zones.
Models store this data in **matrices**, special data structures developed for
this purpose. Each matrix is a square table where the rows $i$ represent origin
zones and the columns $j$ represent the destination zones. Each cell represents
something about the relationship between the two zones.
There are two kinds of information we typically
represent with matrices (a small example follows this list):
- **Cost matrices**, or skims, are matrices where the cells contain estimates of
travel time or cost. They are called *skims* because they are the results of
*skimming* a network to find the shortest path between each pair of TAZ
centroids.
- **Flow matrices**, represent flows of people or vehicles from each origin to
each destination. The number in the corresponding cell $T_{ij}$ is the total
number of trips made, and represents the demand between two zones in a network.
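As a concrete (and entirely hypothetical) illustration, the chunk below builds a tiny
skim and a tiny flow matrix for a three-zone system; rows are origin zones $i$ and
columns are destination zones $j$.
```{r matrix-sketch}
zones <- c("1", "2", "3")
# skim: assumed travel times in minutes between each pair of zones
skim <- matrix(c(2, 8, 12, 8, 3, 6, 12, 6, 2),
               nrow = 3, dimnames = list(zones, zones))
# flow: assumed trips from each origin to each destination
flow <- matrix(c(50, 120, 80, 100, 40, 60, 90, 70, 30),
               nrow = 3, dimnames = list(zones, zones))
skim
flow
```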
## Data Inputs
In the last section we discussed data *structures* like highway networks and
socioeconomic data files. In this section, we are going to talk about the data
*inputs* that can be used for developing travel models. Besides highway networks
--- which usually have to be supplied by the transportation agency --- modelers
frequently gather data from household travel surveys and the US Census Bureau.
::::{.rmdthink}
Obtaining an accurate highway network is one of the most difficult tasks in
travel modeling. It's not conceptually or intellectually difficult, but it is
very difficult to map model networks onto the linear referencing systems or
GIS files used by other agency departments. This is made even more
difficult by model networks needing to be *routeable*: common GIS formats
like shapefiles have no way of representing routability.
::::
### Household Travel Surveys
Travel demand models try to represent individual behavior. How many trips
does the average household make per day? How do people respond to changes in
transit fare? And how can a modeler know if the model accurately reflects
total traffic?
Household travel surveys are a critical component of much travel modeling
practice and research, and are a primary way to answer some of these questions.
In a travel survey, a regional planning
agency^[Like a Metropolitan Planning Organization (MPO).] will recruit households
to participate in the survey. Often there is some kind of reward to encourage
participation, like a gift card or raffle. Once recruited, household members
fill out a diary of their activities on an assigned day; Figure
\@ref(fig:travel-diary) shows an example of one activity from a survey diary.
From the example, you can see the kinds of data that are available: where the
person traveled, which travel mode they used, and what was their reason for
making the trip.
```{r travel-diary, fig.cap="Example travel survey diary entry.", out.width='70%', echo = FALSE}
knitr::include_graphics("images/travel_diary.png")
```
Not all travel surveys are filled in on forms; nowadays telephone interviews or
mobile applications are more common (more on that below). But for decades, paper
travel surveys were the basis of almost all transportation behavior science.
Once the surveys are collected, the data is usually processed into several tables
stored in different files or a database.
- A **Households** table has one row for each household in the dataset, including
information about the number of people in the household, the number of vehicles,
and the household income.
- A **Persons** table has one row for each person in the dataset --- including
which household they are a part of (to link with the households table) --- and
personal attributes like age, student or worker status.
- A **Vehicles** table has one row for each vehicle owned by the households in
the dataset, including attributes like model year, vehicle class, and fuel
efficiency.
- A **Trips** table has one row for each trip taken by each person in the dataset.
This table can be linked against the other tables if necessary, and contains
information like the trip purpose and many other elements collected with the form
in Figure \@ref(fig:travel-diary).
Tables \@ref(tab:show-nhts-hh) through \@ref(tab:show-nhts-trips) show
data collected from one household in the 2017 National Household Travel Survey.
The household contains four people, two of whom are working adults in their late
thirties (the other two are children, and the NHTS did not collect their trip
data). The household has two vehicles, and on the survey travel day person 2
appeared to make a few very long trips. It's impossible to know if this
is a typical day for this person or not, but that's the data that was collected.
```{r show-nhts-hh, echo = FALSE}
nhts_households %>%
mutate_at(c("hhfaminc"), ~as_factor(., levels = "labels")) %>%
filter(houseid == "30000082") %>%
select(houseid, hhsize, numadlt, wrkcount, hhvehcnt, hhfaminc, wthhfin) %>%
knitr::kable(caption = "NHTS Households File")
```
```{r show-nhts-persons, echo = FALSE}
nhts_persons %>%
mutate_at(c("educ", "r_sex"), ~as_factor(., levels = "labels")) %>%
filter(houseid == "30000082") %>%
select(houseid, personid, r_age, educ, r_sex) %>%
knitr::kable(caption = "NHTS Persons File")
```
```{r show-nhts-vehicles, echo = FALSE}
nhts_vehicles %>%
mutate_at(c("make", "model", "fueltype"), ~as_factor(., levels = "labels")) %>%
filter(houseid == "30000082") %>%
select(houseid, vehid, vehyear, make, model, fueltype, od_read) %>%
knitr::kable(caption = "NHTS Vehicles File")
```
```{r show-nhts-trips, echo = FALSE}
nhts_trips %>%
mutate_at(c("trptrans", "trippurp"), ~as_factor(., levels = "labels")) %>%
filter(houseid == "30000082") %>%
select(houseid, personid, strttime, endtime, trpmiles, trptrans, trippurp) %>%
knitr::kable(caption = "NHTS Trips File")
```
Note that the households data in Table \@ref(tab:show-nhts-hh) contains a
numeric column called `wthhfin`. This is a survey *weight*. Because it is impossible
to sample everyone in a population, there needs to be a way to *expand* the survey
to the population. This number means that the selected household *represents*
approximately 1,100 households in the general
population. Also note that not every household's weight will be equal; because
some population groups have different survey response rates, some households
will need to be weighted more heavily so that the survey reflects the general
population. Most software packages have functions that allow you to
calculate statistics or estimate models including weighted values. The code
chunk below shows how to calculate the average number of workers per household
with and without weights in R; as you can see, omitting the weights leads
to a substantial change in the survey analysis.
```{r weighted-mean, echo = TRUE}
# Average workers per household with no weights
mean(nhts_households$wrkcount)
# Average workers per household, weighted
weighted.mean(nhts_households$wrkcount, nhts_households$wthhfin)
```
Travel survey methodology is changing rapidly as a result of mobile devices with
location capabilities. First, most travel surveys are now administered through a
mobile application: respondents are invited to install an app on their smartphone
that tracks the respondent's position and occasionally asks questions about
trip purpose or mode. This makes collecting and cleaning data considerably easier
than traditional paper surveys, and it also lowers the response burden for the
survey participants. Another change that mobile data has brought to travel surveys
is the introduction of large datasets of location information that planners can
purchase directly from cellular providers or third-party providers. Though these
data do not have all the information on demographics and preferences a survey
would provide, they provide a considerably larger and more detailed sample
on things like overall trip flows. As a result, it may be possible to collect
surveys less frequently, or to reduce survey sample sizes.
### US Census Bureau
The primary statistical agency of the United States is the U.S. Census Bureau,
often referred to simply as Census. The need to collect statistics is established in the Constitution:
> Representatives and direct Taxes shall be apportioned among the several States which
may be included within this Union, according to their respective Numbers...
The actual Enumeration shall be made within three Years after the first Meeting
of the Congress of the United States, and within every subsequent Term of ten Years, in
such Manner as they shall by Law direct.
Since the first census in 1790, Census has collected more data than simply the
number of people in each state. Current programs that are especially important
for travel modeling and other related demographic research include:
- The *Decennial Census of Population and Housing* is the thing most people
think of when they think of Census. Every ten years (most recently in 2020),
Census attempts to collect the number, age, and other key population attributes
for every dwelling unit in the United States.
- The *American Community Survey* (ACS) is an annual 1% sample of the US population.
This survey is considerably more detailed than the decennial census, and asks
questions regarding the education and income of household members, how each
worker traveled to work, whether people own or rent their home, etc.
The ACS is a particularly useful data set, especially because the decennial
census can become outdated as ten years go by between collections. To protect
the individual identity of ACS respondents, Census engages in a number of
different schemes.
The simplest scheme is data aggregation. ACS data is usually obtained as tables
representing the number of individuals or households in a geographic area that
match a certain category. The data is aggregated in two ways: First, large geographic
areas are aggregated each year, meaning that up-to-date numbers are always
available; Second, smaller geographic areas contain groups of records collected
over the last five years. In this way there is a tradeoff between temporal and
spatial precision that the researcher needs to consider.
Another basic scheme is **top-coding**, where numeric variables are capped at a common
value. The ACS will report how many people in a neighborhood have an income
over $250,000, but not how many have an income over
$1 million. Census will also **suppress** small counts in a category; they will
not reveal how many households have 8, 9, or 10 people, instead collapsing all
of these households into a "seven or more" group. If too few individuals or
households match that category, Census does not provide a count. For example, if
only one household in a neighborhood makes more than $250,000, the ACS table for
that cell will contain no information. That could mean there are zero, one,
four, or some other small number of households
in that category.
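A tiny sketch of top-coding (with hypothetical incomes) shows the idea: every value
above the cap is reported as the cap itself.
```{r topcode-sketch}
incomes <- c(42000, 87000, 260000, 1200000)  # hypothetical household incomes
pmin(incomes, 250000)                        # top-coded at $250,000
```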
Besides tables, the Census releases the ACS Public Use Micro-Sample (PUMS)
containing ACS responses as disaggregate *microdata*. These data are
geographically located to much larger areas than other ACS records, and Census
does imputation and data swapping on the records to ensure that private
information cannot be disclosed. But studies conducted on ACS PUMS records
reach the same statistical conclusions as studies conducted on the unmodified
raw data, making PUMS a useful tool in studying populations.
#### Geographies
Census data are given in a geographical hierarchy:
- State
- County
- Tract
- Block Group
- Block
The bottom three layers are shown in Figure \@ref(fig:censusgeos). Each layer
nests completely within the layer above it. More detailed attribute data is only
available for larger, less spatially detailed geographies.
```{r censusgeos, fig.cap="US Census Geographies in Central Provo.", echo = FALSE, messages = FALSE, warning = FALSE, cache = TRUE}
ut_tracts <- tracts("UT", "Utah", class = "sf", progress_bar = FALSE) %>%
filter(TRACTCE %in% c("001601", "001504", "001602", "001603", "001701",
"001900", "001801", "001702", "001802", "001803",
"002400", "002500", "002701")) %>%
select(GEOID, TRACTCE) %>%
st_transform(4236)
ut_bg <- block_groups("UT", "Utah", class = "sf", progress_bar = FALSE) %>%
filter(TRACTCE %in% ut_tracts$TRACTCE) %>%
select(GEOID, TRACTCE) %>%
st_transform(4236)
ut_blks <- blocks("UT", "Utah", class = "sf", progress_bar = FALSE) %>%
filter(TRACTCE10 %in% ut_tracts$TRACTCE) %>%
select(GEOID = GEOID10, TRACTCE = TRACTCE10) %>%
st_transform(4236)
list(
"Tracts" = ut_tracts %>% mutate(Layer = "Tracts"),
"Block Groups" = ut_bg %>% mutate(Layer = "Block Groups"),
"Blocks" = ut_blks %>% mutate(Layer = "Blocks")
) %>%
bind_rows() %>%
mutate(Layer = factor(Layer, levels = c("Blocks", "Block Groups", "Tracts"))) %>%
ggplot(aes(color = Layer)) +
geom_sf(alpha = 0.0)
```
## Statistical and Mathematical Techniques
Many elements of travel modeling and forecasting require complex numerical and
quantitative techniques. In this section we will present some of these techniques.
Many of the data tables are in the `nhts2017` package. To install this package,
follow the directions in the [Appendix](#app-rstudio).
### Continuous and Discrete Distributions
In general, statistical variables can fall into one of two categories:
- *Continuous* variables can take any numeric value along some range
- *Discrete* variables can take some limited set of predetermined values
A simplistic definition would be to say that continuous variables are numeric and
discrete variables are non-numeric. A continuous variable has statistics such as
a *mean*, but these statistics do not make sense on discrete variables. In the
NHTS trips dataset, we can compute the mean trip length in miles, but we cannot
compute a mean trip purpose, or at least not one that makes sense.
```{r dc_mean, echo = TRUE, error=TRUE}
# mean of continuous variable: trip length
weighted.mean(nhts_trips$trpmiles, nhts_trips$wttrdfin)
# mean of categorical variable: trip purpose
weighted.mean(nhts_trips$trippurp, nhts_trips$wttrdfin)
```
What we can do, however, is print a summary table showing the number
of observations that fit in each trip purpose category. Note that sometimes there
will be a category devoted to data that is missing or otherwise invalid.
```{r dc_categories, echo = TRUE}
table(nhts_trips$trippurp)
```
Sometimes it is handy to split a continuous variable into categories so that you
can treat it as a discrete variable.
```{r dc-cut, echo = TRUE}
nhts_trips$miles_cat <- cut(nhts_trips$trpmiles, breaks = c(0, 10, 20, 30, 50, 100, Inf))
table(nhts_trips$miles_cat)
```
When we visualize the distribution of a continuous variable, we might
use a histogram or density plot, but with a discrete variable we would use
a bar chart.
```{r dc-histogram, echo = TRUE, fig.cap="Visualizing a continuous distribution with a histogram.", warning=FALSE}
ggplot(nhts_trips, aes(x = trpmiles, weight = wttrdfin)) +
geom_histogram() + xlab("Trip Distance [Miles]") + ylab("Weighted Trips") +
scale_x_continuous(limits = c(0, 50))
```
```{r dc-barchart, echo = TRUE, fig.cap="Visualizing a discrete distribution with a bar chart.", warning=FALSE}
ggplot(nhts_trips, aes(x = as_factor(trippurp, levels = "labels"),
weight = wttrdfin)) +
geom_bar() + xlab("Trip Purpose") + ylab("Weighted Trips")
```
To this point we've only looked at the distribution of one variable at a time.
There are lots of cases where someone might want to consider the *joint*
distribution of two variables. This joint distribution tells you what is happening
with one variable while the other variable changes. In a table like the one below,
the margins of the table (the row and column sums) contain the single variable
distribution. So sometimes we call these the *marginal* distributions.
```{r dc-joint}
table(nhts_trips$miles_cat, nhts_trips$trippurp)
```
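To see the marginal distributions explicitly, we can append the row and column sums to
the joint table with base R's `addmargins()`:
```{r dc-joint-margins, echo = TRUE}
addmargins(table(nhts_trips$miles_cat, nhts_trips$trippurp))
```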
We can visualize joint distributions as well, and sometimes the results are
quite nice.
```{r dc-joint-hist, echo = FALSE, warning = FALSE}
ggplot(nhts_trips %>%
         mutate(trippurp = as_factor(trippurp, levels = "labels")) %>%
         filter(trippurp != "Not ascertained"),
       aes(fill = trippurp, x = trpmiles, weight = wttrdfin)) +
  geom_density(alpha = 0.5) +
  xlab("Trip Distance [miles, log scale]") + ylab("Density of Trips") +
  scale_fill_discrete("Trip Purpose") + scale_x_log10(limits = c(0.1, 100))
```
### Iterative Proportional Fitting {#ipf}
There are times when we know two marginal distributions but do not know the
joint distribution. This can happen for a number of reasons:
- We know how many households there are of each size and how many have each
number of workers, but not how many large households also have multiple workers.
- We have a forecast of truck volumes at external roads, but do not know
how many trucks go between the roads.
In these situations, a convenient technique is *iterative proportional fitting*.
This technique updates a joint distribution to match two or more marginal
distributions within a particular tolerance. IPF is complicated to explain but
easy to demonstrate, so let's go straight into an example.
Let's say we have a forecast for AADT at three external stations in the future.
We can assume that the two-way AADT is roughly even in each direction, so the
inbound volume at each station is half the AADT. Let's also say we have an
estimate of where the trips entering at those stations go *today*. This
matrix is called a **seed**.
```{r ipf-1, echo = TRUE}
# AADT projections
volumes <- tibble(
Station = LETTERS[1:3],
Volume = c(20000, 30000, 35000),
AADT = Volume * 2
)
volumes
# observed distribution for a seed
seed <- matrix(c( 0, 7501 ,12500, 8956, 0, 11879, 9146, 21044, 4687),
nrow = 3, ncol = 3, byrow = TRUE)
rownames(seed) <- colnames(seed) <- volumes$Station
seed
```
Note that the row and column sums of the matrix do not match the forecast.
But we can multiply each row in the matrix by the new volume estimate and the
proportion of the seed matrix row that is in that cell. This gives us a new
estimate of the cell's value. Mathematically,
\begin{equation}
S_{n+1, ij} = \frac{m_i * S_{n, ij}}{\sum_{j}S_{n, ij}}
(\#eq:ipf)
\end{equation}
where $S$ is the seed matrix and $m$ is the marginal vector. We then repeat with
the other marginal,
\begin{equation}
S_{n+2, ij} = \frac{m_j * S_{n+1, ij}}{\sum_{i}S_{n+1, ij}}
(\#eq:ipf2)
\end{equation}
In this example case, the first row is
\begin{align*}
S_{1, 1 1} &= 20000 * 0 / 20001 &= 0\\
S_{1, 1 2} &= 20000 * 7501 / 20001 &= 7500.62\\
S_{1, 1 3} &= 20000 * 12500 / 20001 &= 12499.38\\
\end{align*}
And the whole row iteration is
```{r ipf-row}
# get factor to multiply each row by
row_factor <- volumes$Volume / rowSums(seed)
# multiply across rows, see ?sweep()
sweep(seed, 1, row_factor, "*")
```
We can write a function that does a complete round of row, then column fitting.
```{r ipf-round}
ipf_round <- function(marginal1, marginal2, seed) {
  # multiply the first marginal through the rows (MARGIN = 1)
  seed_rows <- sweep(seed, MARGIN = 1, marginal1 / rowSums(seed), "*")
  # multiply the second marginal through the columns (MARGIN = 2)
  seed_cols <- sweep(seed_rows, MARGIN = 2, marginal2 / colSums(seed_rows), "*")
  # return
  seed_cols
}
ipf_round(volumes$Volume, volumes$Volume, seed)
```
If we repeat this process for several iterations, we can see that the change
between successive values shrinks. We can use this change to set a tolerance
for when we want the process to stop.
```{r ipf-repeat}
change <- vector("numeric")
joint <- seed
for(i in 1:5){
  # update joint table
  new_joint <- ipf_round(volumes$Volume, volumes$Volume, joint)
  # calculate and store the total absolute change at this iteration
  change[i] <- sum(abs(new_joint - joint))
  # update joint
  joint <- new_joint
}
# change between successive iterations
change
# final estimate
joint
```
A few notes:
- IPF is not guaranteed to converge. That is, it is possible to get
stuck in a loop where the successive change between iterations does not get
smaller. It is important to set a maximum number of iterations, as in the sketch
following this list.
- IPF can work in any number of dimensions. A two-dimensional matrix is easy to
visualize and works as a good example, but that's by no means an upper limit.
Just keep repeating.
- If a cell in the seed matrix has a zero value, all successive iterations will
keep that cell at zero. If the zero is structural, keep it in. Otherwise, you
might want to consider overriding the value with a small number.
- The result depends on the seed: a different seed matrix will lead to a different
outcome. If you are uncertain about your seed matrix, you could consider
taking the average of multiple potential seed matrices.
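Putting these pieces together, a complete IPF routine might look something like the
sketch below; the convergence tolerance and iteration limit shown here are illustrative
choices rather than standard values.
```{r ipf-function}
ipf <- function(marginal1, marginal2, seed, tol = 1, max_iter = 50) {
  joint <- seed
  for (i in 1:max_iter) {
    new_joint <- ipf_round(marginal1, marginal2, joint)
    # stop when the total absolute change falls below the tolerance
    change <- sum(abs(new_joint - joint))
    joint <- new_joint
    if (change < tol) break
  }
  joint
}
ipf(volumes$Volume, volumes$Volume, seed)
```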
### Regression Analysis
Consider the basic model in Equation \@ref(eq:basic-model). This is a linear model, meaning
that it fits a line that minimizes the error between the predicted and observed points. Figure
\@ref(fig:lm-mpg) shows the linear relationship between engine displacement (in
cubic inches) and miles per gallon in a sample of cars.
```{r lm-mpg, fig.cap = "Relationship between engine displacement and miles per gallon."}
ggplot(mtcars, aes(y = mpg, x = disp)) +
geom_point() + stat_smooth(method = "lm", color = "red") + theme_bw()
```
How do we calculate this line? In R, the function to compute a basic linear
regression is `lm()`. So to regress the `mpg` variable on the `disp` variable,
```{r lm1}
# y ~ x + x1 + others, from data
lm1 <- lm(mpg ~ disp, data = mtcars)
# print summary
summary(lm1)
```
What this suggests is that for every additional cubic inch of displacement, fuel
economy changes by `r coef(lm1)[2]` miles per gallon.
But how does R calculate this?
Assuming $y = X\beta + \epsilon$, we want the estimate $\hat{\beta}$ that minimizes
the total error $e = y - X\hat{\beta}$. We can measure this total error as
the "sum of squared residuals"
\begin{equation}
SSR(\hat{\beta}) = \sum(y - X\hat{\beta})^2
(\#eq:ssr)
\end{equation}
If we take the derivative of this with respect to $\hat{\beta}$, set it to zero, and solve
for $\hat{\beta}$, we obtain the following estimate:
\begin{equation}
\hat{\beta} = (X'X)^{-1}X'y
(\#eq:betahat)
\end{equation}
We can calculate this manually in R, and verify that it matches the estimates
obtained through `lm()`.
```{r ols}
y <- mtcars$mpg
X <- model.matrix(mpg ~ disp, data = mtcars)
(XTX1 <- solve(t(X) %*% X))
(beta <- XTX1 %*% t(X) %*% y)
```
#### Variance and Confidence Tests
Because $\hat{\beta}$ is an estimate rather than a population statistic, we have
to consider that the estimate might be erroneous. That is, we need to construct
a hypothesis test,
\begin{align*}
h_0:& \beta = 0\\
h_a:& \beta \neq 0\\
\end{align*}
Let's look at the variance of our least-squares estimates:
\begin{align*}
\hat{\beta} &= (X'X)^{-1}X'(X\beta + \epsilon)\\
\hat{\beta} &= \beta + (X'X)^{-1}X'\epsilon\\
Var(\hat{\beta}) &= 0 + (X'X)^{-1}X'\Sigma_{\epsilon} X(X'X)^{-1}\\
\end{align*}
If we assume that $\epsilon$ is *independently and identically distributed* (IID)
with mean 0 and variance $\sigma^2$,
\begin{align*}
\Sigma_\epsilon &= \sigma^2I\\
Var(\hat{\beta}) &= (X'X)^{-1}X'\sigma^2 I X(X'X)^{-1}\\
&= \sigma^2(X'X)^{-1}X'X(X'X)^{-1}\\
&= \sigma^2(X'X)^{-1}
\end{align*}
What does IID mean? It means that each error term is drawn independently from the
same distribution, and that this distribution does not depend on the values of $X$.
```{r olsse}
e <- y - X %*% beta
s <- sum(e^2 / (nrow(X) - ncol(X))) ## SSR / degrees of freedom
(se <- sqrt(diag(s * XTX1)))
```
So, let's talk about confidence tests using our new measure of the distribution
of $\hat{\beta}$. What is the probability of observing an estimate
*at least this extreme* given the null hypothesis that the true value is 0?
The $t$-statistic is the quotient of $\hat{\beta}$ and its standard error, with
the $p$-value determined based on the degrees of freedom of the model.
```{r olst}
(t = beta / se)
# two-tailed P test with certain degrees of freedom
2*(pt(abs(t), nrow(X) - ncol(X), lower.tail = FALSE))
```
Of course, this test is only valid if the errors are IID.
When you `plot` the results of a linear model, you get a series of diagnostic
plots. The first one is a plot of the residuals at different values of $X\beta$,
which allows you to visually inspect whether the IID assumption is valid.
```{r regression-plot}
plot(lm1, which = 1)
```
So, what can you do when the IID assumption is violated? Well, you could try
to transform the data. What if, say, the log of fuel economy depends on the inverse of displacement?
```{r regression-inverse}
# transform the data
lm2 <- lm(log(mpg) ~ I(1 / disp), data = mtcars)
plot(lm2, which = 1)
```
There might also be other variables that matter. We could add the number of
engine cylinders into our estimation. Both of these are better than the original
model.
```{r regression-morevars}
lm3 <- lm(mpg ~ disp + cyl, data = mtcars)
plot(lm3, which = 1)
```
### Numerical Optimization
Let's say you have a function with a minimum (or maximum) that you cannot find
analytically. Numerical optimization routines search for that value iteratively,
evaluating the function at a sequence of trial points.
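As a minimal illustration (not tied to any particular travel modeling application),
base R's `optimize()` searches an interval for the minimizer of a one-dimensional
function.
```{r optimize-sketch}
# a simple function whose minimum we happen to know is at x = 3
f <- function(x) (x - 3)^2 + 2
optimize(f, interval = c(-10, 10))
```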
## Homework {-#hw-blocks}
> Some of these questions require a completed run of the demonstration model.
For instructions on accessing and running the model, see the [Appendix](#app-demomodel).
1. With the TAZ layer and socioeconomic data in the demonstration model, make a
set of choropleth and / or bar chart maps showing the following socioeconomic
variables:
- total households
- household density (per acre)
- total jobs
- job density
- share of manufacturing vs office vs retail employment
Compare your maps with aerial imagery from Google Maps or OpenStreetMap.
Describe the spatial patterns of the socioeconomic data in the model region.
Identify which zones constitute the central business district, and identify any
outlying employment centers.
1. With the highway network layer, create maps
showing: link functional type; link free flow speed; and link hourly capacity.
Compare your maps with aerial imagery from Google Maps or OpenStreetMap. Note
that the hourly capacity is not on the input network, so you will need to use
either the loaded highway network that is output from the model, or an
intermediate network after the initial highway capacity calculations.
Identify the major freeways and principal arterials in the model region.
1. Find the shortest path by free-flow travel *time* along the network between two zones.
Then find the shortest *distance* path between the same two zones. Are the paths the
same? Do the paths match what an online mapping service shows for a trip in the
middle of the night?
1. Open the highway assignment report, which shows vehicle hours and miles