Skip to content

Commit

Permalink
Merge pull request #9 from Urban-Analytics-Technology-Platform/8-task…
Browse files Browse the repository at this point in the history
…-1-add-activity-patterns-to-synthetic-population

Add activity patterns to synthetic population
  • Loading branch information
Hussein-Mahfouz authored Apr 5, 2024
2 parents cacf64d + 44a143c commit 4292b81
Show file tree
Hide file tree
Showing 8 changed files with 12,091 additions and 790 deletions.
220 changes: 220 additions & 0 deletions notebooks/1_prep_synthpop.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,220 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preparing the Synthetic Population"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will use the spc package for our synthetic population. To add it as a dependancy in this virtual environment, I ran `poetry add git+https://github.com/alan-turing-institute/uatk-spc.git@55-output-formats-python#subdirectory=python`. The branch may change if the python package is merged into the main spc branch. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#import json\n",
"import pandas as pd\n",
"\n",
"#https://github.com/alan-turing-institute/uatk-spc/blob/55-output-formats-python/python/examples/spc_builder_example.ipynb\n",
"from uatk_spc.builder import Builder"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Loading in the SPC synthetic population\n",
"\n",
"I use the code in the `Quickstart` [here](https://github.com/alan-turing-institute/uatk-spc/blob/55-output-formats-python/python/README.md) to get a parquet file and convert it to JSON. \n",
"\n",
"You have two options:\n",
"\n",
"\n",
"1- Slow and memory-hungry: Download the pbf file directly from [here](https://alan-turing-institute.github.io/uatk-spc/using_england_outputs.html) and load in the pbf file with the python package\n",
"\n",
"2- Faster: Covert the pbf file to parquet, and then load it using the python package. To convert to parquet, you need to:\n",
"\n",
"a. clone the [uatk-spc](https://github.com/alan-turing-institute/uatk-spc/tree/main/docs) \n",
"\n",
"b. Run `cargo run --release -- --rng-seed 0 --flat-output config/England/west-yorkshire.txt --year 2020` and replace `west-yorkshire` and `2020` with your preferred option\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Pick a region with SPC output saved\n",
"path = \"../data/external/spc_output/raw/\"\n",
"region = \"west-yorkshire\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### People and household data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# add people and households\n",
"spc_people_hh = (\n",
" Builder(path, region, backend=\"pandas\", input_type=\"parquet\")\n",
" .add_households()\n",
" .unnest([\"health\", \"employment\", \"details\"])\n",
" # remove nssec column\n",
" .build()\n",
")\n",
"\n",
"spc_people_hh.head(5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# we need to unnest the demographic data. If we do this above\n",
"# we get an error as there will be two \"nssec8\" columns.\n",
"\n",
"# Unnest the JSON column\n",
"demographics = pd.json_normalize(spc_people_hh['demographics'])\n",
"\n",
"# Remove the columns we don't want\n",
"spc_people_hh = spc_people_hh.drop(['demographics', 'nssec8'], axis = 1)\n",
"# Add the unnested demographics column\n",
"spc_people_hh = pd.concat([spc_people_hh, demographics], axis = 1)\n",
"\n",
"spc_people_hh.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# save the output\n",
"spc_people_hh.to_parquet('../data/external/spc_output/' + region + '_people_hh.parquet')\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"spc_people_hh['salary_yearly'].hist(bins=100)\n",
"\n",
"\n",
"#plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"spc_people_hh['salary_yearly'].unique()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### People and time-use data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"# Subset of (non-time-use) features to include and unnest\n",
"\n",
"# The features can be found here: https://github.com/alan-turing-institute/uatk-spc/blob/main/synthpop.proto\n",
"features = {\n",
" \"health\": [\n",
" \"bmi\",\n",
" \"has_cardiovascular_disease\",\n",
" \"has_diabetes\",\n",
" \"has_high_blood_pressure\",\n",
" \"self_assessed_health\",\n",
" \"life_satisfaction\",\n",
" ],\n",
" \"demographics\": [\"age_years\",\n",
" \"ethnicity\",\n",
" \"sex\",\n",
" \"nssec8\"\n",
" ],\n",
" \"employment\": [\"sic1d2007\",\n",
" \"sic2d2007\",\n",
" \"pwkstat\",\n",
" \"salary_yearly\"\n",
" ]\n",
"\n",
"}\n",
"\n",
"# build the table\n",
"spc_people_tu = (\n",
" Builder(path, region, backend=\"polars\", input_type=\"parquet\")\n",
" .add_households()\n",
" .add_time_use_diaries(features, diary_type=\"weekday_diaries\")\n",
" .build()\n",
")\n",
"spc_people_tu.head()\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# save the output\n",
"spc_people_tu.write_parquet('../data/external/spc_output/' + region + '_people_tu.parquet')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "acbm-7iKwKWLy-py3.10",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Loading

0 comments on commit 4292b81

Please sign in to comment.