
initial documentation #63

Merged
merged 11 commits on Nov 12, 2024
165 changes: 161 additions & 4 deletions README.md
@@ -6,7 +6,39 @@

A package to create activity-based models (for transport demand modelling)

## Installation
- [acbm](#acbm)
- [Motivation and Contribution](#motivation-and-contribution)
- [Installation](#installation)
- [How to Run the Pipeline](#how-to-run-the-pipeline)
- [Step 1: Prepare Data Inputs](#step-1-prepare-data-inputs)
- [Step 2: Setup your config.toml file](#step-2-setup-your-configtoml-file)
- [Step 3: Run the pipeline](#step-3-run-the-pipeline)
- [Future Work](#future-work)
  - [Generative Approaches to activity scheduling](#generative-approaches-to-activity-scheduling)
- [Location Choice](#location-choice)
- [Related Work](#related-work)
- [Synthetic Population Generation](#synthetic-population-generation)
- [Activity Generation](#activity-generation)
- [Deep Learning](#deep-learning)
- [Location Choice](#location-choice-1)
- [Primary Locations](#primary-locations)
- [Secondary Locations](#secondary-locations)
- [Entire Pipeline](#entire-pipeline)
- [Contributing](#contributing)
- [License](#license)


# Motivation and Contribution

Activity-based models have emerged as an alternative to traditional 4-step transport demand models. They provide a more detailed framework by modeling travel as a sequence of activities, accounting for when, how, and with whom individuals participate in them. They can integrate household interactions and spatial-temporal constraints, are well suited to modelling on-demand transport services (which are becoming increasingly common), and can capture the equity implications of different transport scenarios.

Despite being increasingly popular in research, adoption in industry has been slow. A few factors have influenced this. The first is inertia and well-established guidelines on 4-step models. However, this is changing; in 2024, the UK Department for Transport released its first Transport Analysis Guidance on activity and agent-based models (see [TAG unit M5-4 agent-based methods and activity-based demand modelling](https://www.gov.uk/government/publications/tag-unit-m5-4-agent-based-methods-and-activity-based-demand-modelling)). Other initiatives, such as the [European Association of Activity-Based Modeling](https://eaabm.org/), are also being established to increase adoption of activity-based modelling and push research into practice.

Another factor is tool availability. Activity-based modelling involves many steps, including synthetic population generation, activity sequence generation, and (primary and secondary) location assignment. Many tools exist for individual steps, but only a couple of tools exist to run an entire, configurable pipeline, and they tend to be suited to the data of specific countries (see [Related Work](#related-work) for a list of different open-source tools).

To our knowledge, no open-source activity-based modelling pipeline exists for the UK. This repository allows researchers to run the entire pipeline for any region in the UK, with the output being a synthetic population with daily activity diaries and locations for each person. The pipeline is meant to be extensible, and we aim to plug in different approaches developed by others in the future.

# Installation

```bash
python -m pip install acbm
cd acbm
poetry install
```

# How to Run the Pipeline

The pipeline is a series of scripts that are run in sequence to generate the activity-based model. There are a few external datasets that are required. The data and config directories are structured as follows:

```md
├── config
│   ├── <your_config_1>.toml
│   ├── <your_config_2>.toml
├── data
│   ├── external
│   │   ├── boundaries
│   │   │   ├── MSOA_DEC_2021_EW_NC_v3.geojson
│   │   │   ├── oa_england.geojson
│   │   │   ├── study_area_zones.geojson
│   │   ├── census_2011_rural_urban.csv
│   │   ├── centroids
│   │   │   ├── LSOA_Dec_2011_PWC_in_England_and_Wales_2022.csv
│   │   │   └── Output_Areas_Dec_2011_PWC_2022.csv
│   │   ├── MSOA_2011_MSOA_2021_Lookup_for_England_and_Wales.csv
│   │   ├── nts
│   │   │   ├── filtered
│   │   │   │   ├── nts_households.parquet
│   │   │   │   ├── nts_individuals.parquet
│   │   │   │   └── nts_trips.parquet
│   │   │   └── UKDA-5340-tab
│   │   │       ├── 5340_file_information.rtf
│   │   │       ├── mrdoc
│   │   │       │   ├── excel
│   │   │       │   ├── pdf
│   │   │       │   ├── UKDA
│   │   │       │   └── ukda_data_dictionaries.zip
│   │   │       └── tab
│   │   │           ├── household_eul_2002-2022.tab
│   │   │           ├── individual_eul_2002-2022.tab
│   │   │           ├── psu_eul_2002-2022.tab
│   │   │           ├── trip_eul_2002-2022.tab
│   │   │           └── <other_nts_tables>.tab
│   │   ├── travel_times
│   │   │   ├── oa
│   │   │   │   └── travel_time_matrix.parquet
│   │   │   └── msoa
│   │   │       └── travel_time_matrix.parquet
│   │   ├── ODWP01EW_OA.zip
│   │   ├── ODWP15EW_MSOA_v1.zip
│   │   ├── spc_output
│   │   │   ├── <region>_people_hh.parquet (Generated in Script 1)
│   │   │   ├── <region>_people_tu.parquet (Generated in Script 1)
│   │   │   └── raw
│   │   │       ├── <region>_households.parquet
│   │   │       ├── <region>_info_per_msoa.json
│   │   │       ├── <region>.pb
│   │   │       ├── <region>_people.parquet
│   │   │       ├── <region>_time_use_diaries.parquet
│   │   │       ├── <region>_venues.parquet
│   │   │       └── README.md
│   ├── interim
│   │   ├── assigning (Generated in Script 3)
│   │   └── matching (Generated in Script 2)
│   └── processed
│       ├── acbm_<config_name>_<date>
│       │   ├── activities.csv
│       │   ├── households.csv
│       │   ├── legs.csv
│       │   ├── legs_with_locations.parquet
│       │   ├── people.csv
│       │   └── plans.xml
│       └── plots
│           ├── assigning
│           └── validation
```

> **@sgreenbury** (Collaborator, Nov 8, 2024): As discussed, I'll aim to update the notebooks and scripts to read/output files to match this structure (#53)

## Step 1: Prepare Data Inputs

You need to populate the `data/external` directory with the required datasets. A guide on where to find or generate each dataset can be found in [data/external/README.md](data/external/README.md).

## Step 2: Setup your config.toml file

You need to create a config file in the config directory. The config file is a TOML file that contains the parameters for the pipeline. A guide on how to set up the config file can be found in [config/README.md](config/README.md).
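As a purely illustrative sketch (the section and key names below are assumptions, not the actual acbm schema; consult `config/README.md` for the real parameters), a config file might look like:

```toml
# Hypothetical example only; see config/README.md for the real parameters
[study_area]
region = "leeds"   # SPC region to model
zoning = "msoa"    # spatial resolution of the zoning system

[matching]
seed = 0           # RNG seed for statistical matching
```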

## Step 3: Run the pipeline

The scripts are listed in order of execution in the [scripts/run_pipeline.sh](https://github.com/Urban-Analytics-Technology-Platform/acbm/blob/main/scripts/run_pipeline.sh) bash script.

You can run the pipeline by executing the following command in the terminal from the base directory:

```bash
bash ./scripts/run_pipeline.sh config/<your_config_file>.toml
```

where your config file is the file you created in Step 2.

## Future Work

We aim to include different options for each step of the pipeline. Some hopes for the future include:

### Generative Approaches to activity scheduling
- [ ] Bayesian Network approach to generate activities
- [ ] Implement a Deep Learning approach to generate activities (see package below)

### Location Choice
- [ ] Workzone assignment: Plug in Neural Spatial Interaction Approach

## Related Work

There are a number of open-source tools for different parts of the activity-based modelling pipeline. Some of these include:

### Synthetic Population Generation

### Activity Generation

#### Deep Learning
- [caveat](https://github.com/fredshone/caveat)

### Location Choice

#### Primary Locations

- [GeNSIT](https://github.com/YannisZa/GeNSIT)

#### Secondary Locations
- [PAM](https://github.com/arup-group/pam/blob/main/examples/17_advanced_discretionary_locations.ipynb)


### Entire Pipeline
- [Eqasim](https://github.com/eqasim-org/eqasim-java)
- [ActivitySim](https://activitysim.github.io/activitysim/v1.3.1/index.html)
- [PAM](https://github.com/arup-group/pam): PAM has functionality for different parts of the pipeline, but it is not clear how to use it to create an activity-based model for an entire population. Specifically, it does not yet have functionality for activity generation (e.g. statistical matching or generative approaches), or constrained primary location assignment.


# Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for instructions on how to contribute.

# License

Distributed under the terms of the [Apache license](LICENSE).

1 change: 1 addition & 0 deletions config/README.md
@@ -0,0 +1 @@
The config.toml file has an explanation for each parameter. You can copy the toml file, give it a name that is relevant to your project, and modify the parameters as needed.
113 changes: 113 additions & 0 deletions data/external/README.md
@@ -0,0 +1,113 @@
The folder contains all external datasets necessary to run the pipeline. Some can be downloaded, while others need to be generated. The README.md file in this folder provides a guide on where to find / generate each dataset.


## Folder Structure

The structure of the folder is as follows:

```md
.
├── data
│   ├── external
│   │   ├── boundaries
│   │   │   ├── MSOA_DEC_2021_EW_NC_v3.geojson
│   │   │   ├── oa_england.geojson
│   │   │   ├── study_area_zones.geojson
│   │   ├── census_2011_rural_urban.csv
│   │   ├── centroids
│   │   │   ├── LSOA_Dec_2011_PWC_in_England_and_Wales_2022.csv
│   │   │   ├── Output_Areas_Dec_2011_PWC_2022.csv
│   │   ├── MSOA_2011_MSOA_2021_Lookup_for_England_and_Wales.csv
│   │   ├── nts
│   │   │   ├── filtered
│   │   │   │   ├── nts_households.parquet
│   │   │   │   ├── nts_individuals.parquet
│   │   │   │   └── nts_trips.parquet
│   │   │   └── UKDA-5340-tab
│   │   │       ├── 5340_file_information.rtf
│   │   │       ├── mrdoc
│   │   │       │   ├── excel
│   │   │       │   ├── pdf
│   │   │       │   ├── UKDA
│   │   │       │   └── ukda_data_dictionaries.zip
│   │   │       └── tab
│   │   │           ├── household_eul_2002-2022.tab
│   │   │           ├── individual_eul_2002-2022.tab
│   │   │           ├── psu_eul_2002-2022.tab
│   │   │           ├── trip_eul_2002-2022.tab
│   │   │           └── <other_nts_tables>.tab
│   │   ├── travel_times
│   │   │   ├── oa
│   │   │   │   └── travel_time_matrix.parquet
│   │   │   └── msoa
│   │   │       └── travel_time_matrix.parquet
│   │   ├── ODWP01EW_OA.zip
│   │   ├── ODWP15EW_MSOA_v1.zip
│   │   ├── spc_output
│   │   │   ├── <region>_people_hh.parquet (Generated in Script 1)
│   │   │   ├── <region>_people_tu.parquet (Generated in Script 1)
│   │   │   └── raw
│   │   │       ├── <region>_households.parquet
│   │   │       ├── <region>_info_per_msoa.json
│   │   │       ├── <region>.pb
│   │   │       ├── <region>_people.parquet
│   │   │       ├── <region>_time_use_diaries.parquet
│   │   │       ├── <region>_venues.parquet
│   │   │       └── README.md

```

## Data Sources


`spc_output/`

Use the code in the `Quickstart` [here](https://github.com/alan-turing-institute/uatk-spc/blob/55-output-formats-python/python/README.md)
to get a parquet file and convert it to JSON.

You have two options:
1. Slow and memory-hungry: download the `.pb` file directly from [here](https://alan-turing-institute.github.io/uatk-spc/using_england_outputs.html)
and load the `.pb` file with the SPC toolkit python package
2. Faster: Run SPC to generate parquet outputs, and then load using the SPC toolkit python package. To generate parquet, you need to:
1. Clone [uatk-spc](https://github.com/alan-turing-institute/uatk-spc/tree/main/docs)
2. Run:
```shell
cargo run --release -- \
--rng-seed 0 \
--flat-output \
--year 2020 \
config/England/west-yorkshire.txt
```
and replace `west-yorkshire` and `2020` with your preferred option.
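Once the parquet outputs exist, they can be loaded with pandas. The snippet below is a schematic sketch with made-up stand-in data and column names (the real SPC schema differs; use the SPC toolkit python package described above):

```python
import pandas as pd

# Schematic stand-ins for <region>_people.parquet and
# <region>_households.parquet. The column names ("id", "household",
# "msoa") are assumptions, not the real SPC schema.
people = pd.DataFrame({"id": [0, 1, 2], "household": [0, 0, 1]})
households = pd.DataFrame(
    {"household": [0, 1], "msoa": ["E02001234", "E02005678"]}
)

# In practice you would read the parquet outputs instead, e.g.:
# people = pd.read_parquet("data/external/spc_output/raw/<region>_people.parquet")

# Attach household attributes to each person
people_hh = people.merge(households, on="household", how="left")
```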

`boundaries/`
> **Collaborator:** As discussed, add data source
>
> **Author:** done in 2ca4d10


- MSOA_DEC_2021_EW_NC_v3.geojson
- oa_england.geojson
- study_area_zones.geojson

`centroids/`
> **Collaborator:** As discussed, add data source
>
> **Author:** done in 2ca4d10


- LSOA_Dec_2011_PWC_in_England_and_Wales_2022.csv
- Output_Areas_Dec_2011_PWC_2022.csv

`nts/`

UKDA-5340-tab:
- Download the UKDA-5340-tab from the UK Data Service [here](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=5340)
- Step 1: Create an account
- Step 2: Create a project and request access to the data
- We use the `National Travel Survey, 2002-2023` dataset (SN: 5340)
- Step 3: Download TAB file format

`travel_times/`
> **Collaborator:** As discussed, add information on dataframe structure (i.e. expected columns)
>
> **Author:** done in 2ca4d10


- Optional dataset: if it does not exist, it will be generated by the pipeline. Files are placed under `oa/` or `msoa/` subdirectories, e.g. `oa/travel_time_matrix.parquet` or `msoa/travel_time_matrix.parquet`.
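Since the expected dataframe structure is not documented here, the following is a hedged sketch of a travel-time matrix in the general shape the pipeline implies (origin zone, destination zone, mode, travel time). The column names are assumptions, not the pipeline's documented schema:

```python
import pandas as pd

# Illustrative travel-time matrix between zones; the column names
# ("from_id", "to_id", "mode", "travel_time_p50") are assumptions.
travel_times = pd.DataFrame(
    {
        "from_id": ["E00000001", "E00000001", "E00000002"],
        "to_id": ["E00000002", "E00000003", "E00000003"],
        "mode": ["pt", "car", "walk"],
        "travel_time_p50": [22.5, 11.0, 35.0],  # minutes
    }
)

# The pipeline reads a parquet file, so you would save it with e.g.:
# travel_times.to_parquet("data/external/travel_times/oa/travel_time_matrix.parquet")
```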

`ODWP01EW_OA.zip`
`ODWP15EW_MSOA_v1.zip`
`MSOA_2011_MSOA_2021_Lookup_for_England_and_Wales.csv`
`census_2011_rural_urban.csv`

59 changes: 22 additions & 37 deletions scripts/README.md
@@ -1,37 +1,22 @@
# Preparing synthetic population scripts

## Datasets
- [Synthetic Population Catalyst](https://github.com/alan-turing-institute/uatk-spc/blob/55-output-formats-python/python/README.md)
- [National Travel Survey](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=5340)
- [Rural Urban Classification 2011 classification](https://geoportal.statistics.gov.uk/datasets/53360acabd1e4567bc4b8d35081b36ff/about)
- [OA centroids](): TODO

## Loading in the SPC synthetic population

Use the code in the `Quickstart` [here](https://github.com/alan-turing-institute/uatk-spc/blob/55-output-formats-python/python/README.md)
to get a parquet file and convert it to JSON.

You have two options:
1. Slow and memory-hungry: download the `.pb` file directly from [here](https://alan-turing-institute.github.io/uatk-spc/using_england_outputs.html)
and load in the pbf file with the python package
2. Faster: Run SPC to generate parquet outputs, and then load using the SPC toolkit python package. To generate parquet, you need to:
1. Clone [uatk-spc](https://github.com/alan-turing-institute/uatk-spc/tree/main/docs)
2. Run:
```shell
cargo run --release -- \
--rng-seed 0 \
--flat-output \
--year 2020 \
config/England/west-yorkshire.txt
```
and replace `west-yorkshire` and `2020` with your preferred option.


## Matching
### Adding activity chains to synthetic populations
The purpose of this script is to match each individual in the synthetic population to a respondent from the [National Travel Survey (NTS)](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=5340).

### Methods
We will try two methods:
1. categorical matching: joining on relevant socio-demographic variables
2. statistical matching, as described in [An unconstrained statistical matching algorithm for combining individual and household level geo-specific census and survey data](https://doi.org/10.1016/j.compenvurbsys.2016.11.003).
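A minimal pandas sketch of method 1 (categorical matching); the variable names below are invented for illustration and are not the actual SPC or NTS column names:

```python
import pandas as pd

# Categorical matching: join synthetic individuals to NTS respondents
# on shared socio-demographic variables (illustrative columns only).
spc = pd.DataFrame(
    {"pid": [1, 2], "age_band": ["25-34", "35-44"], "sex": ["F", "M"]}
)
nts = pd.DataFrame(
    {
        "nts_id": [10, 11, 12],
        "age_band": ["25-34", "35-44", "25-34"],
        "sex": ["F", "M", "M"],
    }
)

# Each synthetic person gets the pool of NTS respondents that share
# their categories; a matched travel diary can then be drawn from it.
pools = spc.merge(nts, on=["age_band", "sex"], how="left")
matched = pools.groupby("pid", as_index=False).first()
```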
# Scripts

## Synthetic Population Generation

- 1_prep_synthpop.py: Create a synthetic population using the SPC

## Adding Activity Patterns to Population

- 2_match_households_and_individuals.py: Match individuals in the synthetic population to travel diaries in the NTS. This is based on the statistical matching approach described in ...

## Location Assignment

- [3.1_assign_primary_feasible_zones.py](https://github.com/Urban-Analytics-Technology-Platform/acbm/blob/main/scripts/3.1_assign_primary_feasible_zones.py): Obtains, for each activity, the feasible destination zones in which the activity could take place. It uses a travel time matrix between zones to identify the zones that can be reached given the travel time and mode reported in the NTS. A travel time matrix should be provided before running the pipeline (in the correct format); if one does not exist, the code can create travel time estimates based on mode average speeds and crow-fly distance. For tips on creating a travel time matrix, see [this comment](https://github.com/Urban-Analytics-Technology-Platform/acbm/issues/20#issuecomment-2317037441).
- [3.2.1_assign_primary_zone_edu.py](https://github.com/Urban-Analytics-Technology-Platform/acbm/blob/main/scripts/3.2.1_assign_primary_zone_edu.py):
- [3.2.2_assign_primary_zone_work.py](https://github.com/Urban-Analytics-Technology-Platform/acbm/blob/main/scripts/3.2.2_assign_primary_zone_work.py)
- [3.2.3_assign_secondary_zone.py](https://github.com/Urban-Analytics-Technology-Platform/acbm/blob/main/scripts/3.2.3_assign_secondary_zone.py)
- [3.3_assign_facility_all.py](https://github.com/Urban-Analytics-Technology-Platform/acbm/blob/main/scripts/3.3_assign_facility_all.py)
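The feasible-zone idea in 3.1 can be sketched as follows. This is a hedged illustration, not the script's actual implementation: the column names, function signature, and the 1.2 tolerance factor are all assumptions. Keep every destination zone whose matrix travel time, for the trip's mode, is within a tolerance of the NTS-reported travel time:

```python
import pandas as pd

# Illustrative travel-time matrix; column names are assumptions.
travel_times = pd.DataFrame(
    {
        "from_id": ["A", "A", "A"],
        "to_id": ["B", "C", "D"],
        "mode": ["car", "car", "car"],
        "time": [10.0, 25.0, 60.0],  # minutes
    }
)


def feasible_zones(
    origin: str, mode: str, reported_time: float, tolerance: float = 1.2
) -> list[str]:
    """Zones reachable from `origin` within tolerance * reported NTS time."""
    tt = travel_times
    mask = (
        (tt["from_id"] == origin)
        & (tt["mode"] == mode)
        & (tt["time"] <= tolerance * reported_time)
    )
    return tt.loc[mask, "to_id"].tolist()


# A 25-minute car trip from zone A: B and C are reachable, D is not.
zones = feasible_zones("A", "car", reported_time=25.0)
```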

## Validation
- 4_validate.py: Validate the synthetic population by comparing the distribution of activity chains in the NTS to our model outputs.
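A hedged sketch of what such a distributional comparison might look like (the toy activity chains and the total-variation metric here are illustrative, not necessarily what 4_validate.py computes):

```python
from collections import Counter

# Toy data: activity chains in the NTS sample vs. the model output.
nts_chains = ["home-work-home", "home-work-home", "home-shop-home", "home-edu-home"]
model_chains = ["home-work-home", "home-shop-home", "home-shop-home", "home-edu-home"]


def chain_shares(chains: list[str]) -> dict[str, float]:
    """Relative frequency of each activity chain."""
    counts = Counter(chains)
    total = len(chains)
    return {chain: n / total for chain, n in counts.items()}


# Total variation distance between the two chain distributions:
p, q = chain_shares(nts_chains), chain_shares(model_chains)
tvd = 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in set(p) | set(q))
```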

## Output