
[1] Activity Generation & Scheduling

Hussein Mahfouz edited this page Nov 12, 2024 · 1 revision


One approach for adding activity schedules to a synthetic population is to match activity diaries from a survey onto it. In the approach below, we use a combination of categorical (exact) matching and statistical matching to match activity diaries from the National Travel Survey (NTS) onto our synthetic population.

We choose a two-step matching approach. The first step is categorical matching at the household level, which ensures that households from the SPC are matched to NTS households with the same number of adults and children. The second step is statistical matching of the individuals within each matched pair of households. The approach is detailed below, along with possible future improvements.

0. Preprocessing NTS sample

To use the NTS, we first need to preprocess it so that the sample is representative of the population we want to model. Some regional variations and temporal changes can be seen in the figures below. To account for them, we apply two preprocessing steps: filter_by_region and filter_by_date.

[Figures: bus01b-facet; bus01b-one_plot-rural_urban; bus01f-facet-local_authorities]

Filter by date

The NTS has data going back many years. We can include survey results from a list of years to increase our sample size, instead of restricting ourselves to one year. We also need to avoid years in which travel patterns were affected by COVID-19 (as shown in the plots above).

Filter by region

The figures above show regional disparities in bus usage. The most striking one in terms of mode share is London, but there are also differences between Metropolitan and Non-metropolitan areas. Other disparities (not shown above) include travel distances.

We mitigate these disparities by filtering the NTS to only include regions we think are representative for our use case. The unique list of valid regions is specified here. We can choose one or more regions to include. Given the small sample size of the NTS, for any area outside of London, one could simply exclude London.

We do not yet allow filtering by Metropolitan / Non-metropolitan areas, but this column is available in the NTS and the functionality will be added. We do, however, match on Urban / Rural classification (see below).
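The two filters can be sketched in plain Python as follows. The names filter_by_date and filter_by_region come from the text above, but the signatures, the record layout, and the "year" and "region" keys are assumptions made for illustration; they are not the project's actual implementation or the real NTS column names.

```python
# Hypothetical NTS trip records; the keys "year" and "region" are
# illustrative, not the actual NTS column names.
nts_trips = [
    {"trip_id": 1, "year": 2019, "region": "London"},
    {"trip_id": 2, "year": 2020, "region": "Yorkshire and the Humber"},
    {"trip_id": 3, "year": 2019, "region": "Yorkshire and the Humber"},
    {"trip_id": 4, "year": 2022, "region": "North West"},
]


def filter_by_date(records, years):
    """Keep records whose survey year is in the given collection of years."""
    years = set(years)
    return [r for r in records if r["year"] in years]


def filter_by_region(records, regions):
    """Keep records whose region is in the given collection of regions."""
    regions = set(regions)
    return [r for r in records if r["region"] in regions]


# Pool several survey years, skipping a year affected by COVID-19
# (which years to skip is a modelling choice, assumed here).
sample = filter_by_date(nts_trips, years=[2019, 2022])

# Model an area outside London, so drop London from the sample.
all_regions = {r["region"] for r in nts_trips}
sample = filter_by_region(sample, regions=all_regions - {"London"})

print([r["trip_id"] for r in sample])
```

Applying the date filter first keeps the region list small, but the order of the two filters does not change the result.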

1. Matching - Household level (Categorical / Exact)

This is exact matching using a join. We choose specific columns that exist in both datasets, and perform an exact match.

1.1 Matching Columns

In general, the fixed_columns and optional_columns can be specified by the user. However, the columns that define a household's structure should be in fixed_columns.

What columns should be matched on? We want to match in a way that minimises the number of SPC households that are left unmatched. I am using the following columns:

| Variable | Name (NTS) | Name (SPC) | Transformation (NTS) | Transformation (SPC) |
|---|---|---|---|---|
| Household income | HHIncome2002_BO2ID | salary_yearly | NA | Group by household ID and sum |
| Number of adults | HHoldNumAdults | age_years | NA | Group by household ID and count |
| Number of children | HHoldNumChildren | age_years | NA | Group by household ID and count |
| Employment status | HHoldEmploy_B01ID | pwkstat | NA | a) match to NTS categories. b) group by household ID |
| Car ownership | NumCar | num_cars | SPC is capped at 2. We change all entries > 2 to 2 | NA |
| Type of tenancy | Ten1_B02ID | tenure | | |
| Urban-Rural classification of residence | Settlement2011EW_B04ID | NA | Convert to census 2011 categories | Spatial join between layer and SPC |
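As a rough illustration of the "group by household ID" transformations in the table, the sketch below aggregates SPC individuals into household-level variables and caps the NTS car count. The column names salary_yearly and age_years are taken from the table; the "household" key, the adult age threshold, and the function names are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical SPC individuals. "salary_yearly" and "age_years" are the SPC
# names from the table; "household" as the ID key is an assumption.
spc_people = [
    {"household": 1, "age_years": 41, "salary_yearly": 30000},
    {"household": 1, "age_years": 39, "salary_yearly": 25000},
    {"household": 1, "age_years": 7, "salary_yearly": 0},
    {"household": 2, "age_years": 68, "salary_yearly": 12000},
]

ADULT_AGE = 16  # assumed adult threshold, for illustration only


def summarise_households(people):
    """Group individuals by household ID, then sum income and count members."""
    households = defaultdict(
        lambda: {"income": 0, "num_adults": 0, "num_children": 0}
    )
    for person in people:
        hh = households[person["household"]]
        hh["income"] += person["salary_yearly"]
        if person["age_years"] >= ADULT_AGE:
            hh["num_adults"] += 1
        else:
            hh["num_children"] += 1
    return dict(households)


def cap_cars(num_cars, cap=2):
    """Recode an NTS car count so it is comparable with the SPC cap of 2."""
    return min(num_cars, cap)


spc_households = summarise_households(spc_people)
print(spc_households[1])
print(cap_cars(4))
```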

1.2 Matching Logic

To increase the proportion of SPC households with matches, we do an iterative join and relax the constraints at each iteration. We specify fixed_columns (columns that are kept in the join at every iteration) and optional_columns (columns that can be removed from the join). The optional columns are listed in order of importance (least important at the end). At each iteration, we remove the least important remaining variable in optional_columns and try to match again. We keep count of the number of matches for each SPC household, and stop trying to match it once it has reached a threshold of n matches.

The pseudocode is described below:

  1. Initialize two lists of matching columns, fixed_columns and optional_columns, and set n (the satisfactory number of matches)

  2. While there are columns left in optional_columns and unmatched households remain:

    1. Perform matching between SPC and NTS using the current list of columns.
    2. For each matched SPC household:
      • Record unique matches for that household.
      • Update the match count for that household.
    3. Identify SPC households with matches >= n and exclude them from the next iteration.
    4. If unmatched households remain:
      • Remove least important column in optional_columns from the list to relax constraints.
  3. Return dictionary of matched households.

Once we have a list of matches for each household in the SPC, we sample one matched NTS household for each SPC household and continue to the individual-level matching.
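The matching logic and the sampling step can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the project's implementation: households are represented as dictionaries keyed by a hypothetical ID, the exact join is emulated with a lookup on tuples of column values, and the function name iterative_match is invented for this example.

```python
import random


def iterative_match(spc, nts, fixed_columns, optional_columns, n):
    """Match SPC households to NTS households, relaxing constraints each round.

    spc, nts: dicts mapping household ID -> dict of attributes.
    Returns a dict mapping SPC household ID -> list of matched NTS IDs.
    """
    matches = {hid: [] for hid in spc}
    optional = list(optional_columns)
    unmatched = set(spc)  # households still below n matches

    while unmatched:
        columns = list(fixed_columns) + optional
        # Index NTS households by their values on the current columns.
        index = {}
        for nts_id, row in nts.items():
            key = tuple(row[c] for c in columns)
            index.setdefault(key, []).append(nts_id)
        for hid in unmatched:
            key = tuple(spc[hid][c] for c in columns)
            for nts_id in index.get(key, []):
                if nts_id not in matches[hid]:  # record unique matches only
                    matches[hid].append(nts_id)
        # Households with enough matches drop out of later iterations.
        unmatched = {hid for hid in unmatched if len(matches[hid]) < n}
        if not optional:
            break  # nothing left to relax
        optional.pop()  # drop the least important remaining column
    return matches


# Hypothetical households with illustrative attribute names.
spc = {
    "s1": {"num_adults": 2, "num_children": 1, "num_cars": 1, "tenure": "owned"},
    "s2": {"num_adults": 1, "num_children": 0, "num_cars": 0, "tenure": "rented"},
}
nts = {
    "n1": {"num_adults": 2, "num_children": 1, "num_cars": 1, "tenure": "owned"},
    "n2": {"num_adults": 2, "num_children": 1, "num_cars": 2, "tenure": "rented"},
    "n3": {"num_adults": 1, "num_children": 0, "num_cars": 0, "tenure": "owned"},
}

matches = iterative_match(
    spc,
    nts,
    fixed_columns=["num_adults", "num_children"],
    optional_columns=["num_cars", "tenure"],  # least important last
    n=1,
)

# Sample one matched NTS household per SPC household.
rng = random.Random(0)
sampled = {hid: rng.choice(ids) for hid, ids in matches.items() if ids}
print(matches)
print(sampled)
```

In this toy data, s2 has no match on all four columns and only finds n3 once tenure has been dropped. The last iteration joins on fixed_columns alone, so the household structure is never relaxed.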

2. Matching at Individual Level (Propensity Score Matching)

Once each household in the SPC is matched to a household in the NTS, we need to match the individuals inside the household. If number of adults and number of children were used as fixed_columns, then there should be the same number of individuals in the SPC household and its corresponding NTS match.

At the individual level, we do propensity score matching (PSM). PSM can match an individual in dataset A to the closest matching individual in dataset B (the match does not have to be exact).

2.1. Matching Logic

Propensity score matching is done using the nearest-neighbours algorithm from sklearn.neighbors, which calculates the distances between rows in two DataFrames based on specified columns. In our case, we use age and sex. Our matching is constrained, i.e. each individual from the SPC household can only be matched to one individual from the NTS household. This is done iteratively: in each iteration we identify the closest remaining pair of individuals (one from each household), match them, remove them, and repeat until all individuals have been matched. This is similar to bipartite matching.
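The constrained, iterative matching described above can be sketched without any dependencies. Note the swap: instead of sklearn.neighbors, this example uses a hand-written distance (absolute age difference plus a penalty when sex differs). The distance, the sex_penalty value, the record layout, and the function name are all assumptions made for illustration.

```python
def match_individuals(spc_people, nts_people, sex_penalty=100.0):
    """Greedy one-to-one matching of SPC individuals to NTS individuals.

    Each person is a dict with "id", "age" and "sex". The distance is the
    absolute age difference plus a penalty when sex differs (an assumption).
    """

    def distance(a, b):
        penalty = sex_penalty if a["sex"] != b["sex"] else 0.0
        return abs(a["age"] - b["age"]) + penalty

    spc_left = list(spc_people)
    nts_left = list(nts_people)
    pairs = {}
    while spc_left and nts_left:
        # Find the closest remaining (SPC, NTS) pair.
        s, t = min(
            ((s, t) for s in spc_left for t in nts_left),
            key=lambda pair: distance(*pair),
        )
        pairs[s["id"]] = t["id"]
        # Remove both so that neither can be matched again.
        spc_left.remove(s)
        nts_left.remove(t)
    return pairs


# Hypothetical matched households with the same number of members.
spc_household = [
    {"id": "s_a", "age": 40, "sex": "f"},
    {"id": "s_b", "age": 42, "sex": "m"},
    {"id": "s_c", "age": 8, "sex": "f"},
]
nts_household = [
    {"id": "n_x", "age": 10, "sex": "m"},
    {"id": "n_y", "age": 38, "sex": "f"},
    {"id": "n_z", "age": 45, "sex": "m"},
]

pairs = match_individuals(spc_household, nts_household)
print(pairs)
```

Because the closest pair is fixed first, the last individuals to be matched may end up with a poor partner (here the child s_c is paired with n_x despite the sex mismatch). An optimal bipartite assignment would minimise the total distance instead.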


3. Validation

Internal consistency checks are done to ensure that the global distributions in the SPC, after matching, are close to those in the NTS. This is done for:

  • Mode share
  • Activity sequences
  • Trip purpose
  • Trip start time by trip purpose
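One simple way to compare a distribution such as mode share between the two datasets is sketched below. The choice of total variation distance as the comparison metric is an assumption made for this example, as are the toy mode lists; the text above does not say which metric is used.

```python
from collections import Counter


def shares(values):
    """Return the share of each category in a list of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(p, q):
    """Half the sum of absolute differences between two share dictionaries."""
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Hypothetical trip modes in the NTS sample and in the matched SPC.
nts_modes = ["car", "car", "bus", "walk", "car", "bus", "walk", "car"]
spc_modes = ["car", "car", "car", "walk", "car", "bus", "walk", "car"]

nts_share = shares(nts_modes)
spc_share = shares(spc_modes)
tvd = total_variation_distance(nts_share, spc_share)
print(nts_share, spc_share, tvd)
```

A distance of 0 means identical shares and 1 means no overlap at all. The same two functions apply to trip purpose or to any other categorical output.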

4. Notes / TODOs for future

4.1 Using PSM directly at the household level

  • PSM can be carried out to match households to ensure that each household in the SPC is matched to a household in the NTS
  • It could allow us to use more variables for matching (e.g. household income) without having to use iterative relaxation
  • The MatchIt R package has an extensive set of functions that could be useful
    • Matching with replacement: This is similar to the approach we are using, where each household in the NTS can be matched to multiple households in the SPC. One big advantage is that we can use calipers. Calipers allow setting a threshold for acceptable distances for each variable (e.g. a person with age = 40 can only be matched to a person aged 35 to 45)
    • Genetic matching
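To make the caliper idea concrete, here is a minimal Python sketch of matching with replacement under per-variable calipers. It does not use or reproduce MatchIt; the function names, the record layout, and the 5-year caliper are assumptions taken from the age example above.

```python
def within_calipers(target, candidate, calipers):
    """True if each covariate difference is within its caliper width."""
    return all(
        abs(target[var] - candidate[var]) <= width
        for var, width in calipers.items()
    )


def caliper_matches(targets, pool, calipers):
    """Match with replacement: every target may reuse any pool record."""
    return {
        t["id"]: [c["id"] for c in pool if within_calipers(t, c, calipers)]
        for t in targets
    }


# Hypothetical individuals; only "age" is used as a covariate here.
spc_people = [{"id": "s1", "age": 40}, {"id": "s2", "age": 43}]
nts_people = [
    {"id": "n1", "age": 33},
    {"id": "n2", "age": 36},
    {"id": "n3", "age": 44},
    {"id": "n4", "age": 50},
]

# A caliper of 5 years: age 40 can only match ages 35 to 45.
matches = caliper_matches(spc_people, nts_people, calipers={"age": 5})
print(matches)
```

Adding another entry to calipers (for example a wider tolerance on income) would make the match stricter on some covariates than on others.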

4.2 Bipartite matching (if maintaining the same two-step matching approach)

4.3 Validation

In categorical or Propensity Score matching, a household in the SPC can be matched to multiple households in the NTS. By extension, an individual in the SPC can be matched to multiple activity chains from the NTS. How do we decide which activity chain is the most accurate for an individual?

Sequence Alignment

One approach is to see how close the breakdown in the matched population is to that in the travel survey (i.e. is the proportion of people doing x trips the same? Is the proportion of people doing similar activity chains the same?)

  • Sequence alignment methods (SAMs) are able to extract patterns of behaviour from large spatiotemporal datasets. These patterns can then be used to group data by similarity
  • Two types of analysis are common using SAMs (Shoval and Isaacson 2007)
    • Construct groups based on their overall activity patterns. Clustering algorithms produce trees (hierarchical clustering)
    • Detect patterns of behaviour in the sequences
  1. Run Sequence Alignment on activity chains in NTS → get clusters of activity participation / time use
    1. Each activity chain is assigned to a cluster
    2. Get relative size of clusters: How big is each cluster (as % of total)? This assumes that we have a representative population
  2. Individuals in SPC are assigned to a cluster based on the activity chain matched to them
    1. With categorical matching, an SPC individual can be matched to multiple activity chains, so they are in a different cluster depending on the activity chain we choose for them.
  3. For each SPC individual, we randomly assign an activity chain from the pool of activity chains matched to it
    1. Get relative size of clusters. Do they match the NTS cluster sizes from step 1.2?
    2. Should we do a brute force analysis where we run (all) different combinations and then find the one where the cluster composition most closely matches with the NTS? Is there a clever way to do this?
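The cluster comparison in steps 1 to 3 can be sketched as follows. This is a simplified stand-in, not a full sequence alignment method: it uses unweighted edit (Levenshtein) distance, and it assigns each chain to the nearest of a few hand-picked representative chains instead of running hierarchical clustering. The activity codes (H = home, W = work, S = shop, E = education), the representatives, and the function names are assumptions made for illustration.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance between two activity chains (sequences)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def assign_cluster(chain, representatives):
    """Assign a chain to the cluster whose representative chain is closest."""
    return min(
        representatives,
        key=lambda name: edit_distance(chain, representatives[name]),
    )


def cluster_shares(chains, representatives):
    """Relative size of each cluster, as a share of all chains."""
    counts = Counter(assign_cluster(c, representatives) for c in chains)
    return {name: counts[name] / len(chains) for name in representatives}


# Hypothetical cluster representatives and activity chains.
representatives = {"worker": "HWH", "shopper": "HSH", "student": "HEH"}
nts_chains = ["HWH", "HWHWH", "HSH", "HEH"]
spc_chains = ["HWH", "HSH", "HSH", "HEH"]  # one random draw per SPC individual

nts_shares = cluster_shares(nts_chains, representatives)
spc_shares = cluster_shares(spc_chains, representatives)
print(nts_shares)
print(spc_shares)
```

Repeating the random draw of spc_chains and keeping the draw whose shares are closest to nts_shares is one cheap alternative to a brute-force search over all combinations.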

Other notes:

  • cluster on activity chains AND household properties
  • You need evaluation metrics
  • Output should match other datasets (e.g. number of GP visits should be similar to reported numbers - similar to constraints on commuting)

5. References

Sequence Alignment

Propensity Score Matching

This YouTube video gives a nice overview of PSM. Check from 13:42 for:

  • greedy vs optimal algorithms
  • caliper vs nearest neighbor
  • One to one vs one to many

Packages

R

  • The MatchIt R package is very comprehensive. It has different matching algorithms, and also allows you to specify different calipers for each covariate. This is very handy because we might want to be stricter on some covariates than others (e.g. for households, we may want the household size to match exactly, but be more forgiving on household income)

Python

I didn't find a Python library that has the same functionality as MatchIt.

Other resources

6. Misc Notes

The approach in this paper uses PSM to:

  • Step 1: match each household from the population to households in the sample (multiple households from the sample can be matched to a population household).
  • Step 2: match individuals from the population household to the closest individual from all matched households. They do not match all individuals from the population household to individuals from one sample household. This means that they do not maintain household integrity: different individuals in the same population household could be matched to individuals from different sample households.

Our approach maintains household integrity: all individuals in an SPC household are matched to individuals from the same NTS household.