```
├── resources
│   ├── Facebook
│   │   ├── MarketingAPI
│   │   ├── Movement
│   │   ├── Population
│   ├── maps
│   │   ├── copernicus
│   │   ├── geopy
│   │   ├── googleearth
│   │   ├── staticmaps
│   ├── OpenCellid
│   │   ├── full_database
│   │   │   ├── <download_date>
│   │   │   │   ├── cell_towers.csv.gz
│   │   │   │   ├── country
│   │   │   │   │   ├── cell_towers_<country_name>.csv
│   ├── survey
│   │   ├── <country_name>
│   │   │   ├── <source>
│   │   │   │   ├── <year>
│   │   │   │   │   ├── *GC*FL.zip
│   │   │   │   │   ├── *GE*FL.zip
│   │   │   │   │   ├── *HR*DT.zip
```
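The static parts of this tree can be pre-created from the repository root; a minimal sketch (the variable parts, such as `<download_date>` and `<country_name>`, are created later by the pipeline itself):

```bash
# Create the static parts of the resources tree.
mkdir -p resources/Facebook/{MarketingAPI,Movement,Population}
mkdir -p resources/maps/{copernicus,geopy,googleearth,staticmaps}
mkdir -p resources/OpenCellid/full_database
mkdir -p resources/survey
```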
- Download the ground truth GT (e.g., DHS data)
  - The GT must be under `resources/survey/<country>/<source>/<year>/<files>`
- Add the country metadata to the `COUNTRY` constant under `resources/survey/available_metadata.json` (country name, country code, years, capital)
  - If DHS data is used, note the `ccode` used in the first two letters of the files.
- Add the country `code` and ground-truth `source` in `resources/survey/available_source.json`
- Add the country code `ccode` in:
  - `scripts/batch_init.sh`
  - `scripts/batch_preprocessing.sh`
  - `scripts/batch_xgb_train.sh`
- (DHS data only) Add the question IDs to `WATER`, `TOILET`, and `FLOOR` for the respective country.
  - Open `*HR*FL.MAP` from the GT household data (`*HR*DT.zip`)
  - For each question, identify the `high`, `medium`, and `low` categories (hints in `libs/utils/constants.py`); a hypothetical sketch of these code files follows this list:
    - `hv201` for water supply: `../resources/survey/dhs_water_codes.json`
    - `hv205` for toilet facility: `../resources/survey/dhs_toilet_codes.json`
    - `hv213` for floor quality: `../resources/survey/dhs_floor_codes.json`
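A hypothetical sketch of what one of these code files (e.g., `dhs_water_codes.json`) could look like, assuming it maps DHS answer codes to the three quality categories; the keys and codes shown here are assumptions, so take the real ones from the `.MAP` file and `libs/utils/constants.py`:

```json
{
  "high":   ["11", "12", "13"],
  "medium": ["21", "31"],
  "low":    ["42", "43", "96"]
}
```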
- Download conda [guide]
- Go to the project folder: `cd SES-Inference/`
- Create an environment with Python 3.7: `conda create --name myenv python=3.7`
- Install packages: `pip install -r requirements.txt` (the full sequence is summarized below)
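Putting the steps together (assuming the environment name `myenv` from above):

```bash
cd SES-Inference/
conda create --name myenv python=3.7
conda activate myenv
pip install -r requirements.txt
```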
- Download population density (`*_general_2020_csv.zip`): [doc] [data]
  - Place it under `resources/Facebook/Population/<country>/`
- Download populated places (for African countries only: `*_geojson.zip`): [data]
  - Place the files under `resources/OCHA/Population/<country>/`
- Download mobility data:
  - Place it under `resources/Facebook/Movement`
- Download (again, to update) the OpenCellID data under `resources/OpenCellid/full_database/YYYY-MM-DD/`, using the date of download (see the sketch after this list).
  - From `scripts`, run `python batch_pre_oci.py -fnzip ../resources/OpenCellid/full_database/{download_date}/cell_towers.csv.gz -njobs 20`
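A minimal sketch of the OpenCellID update, assuming the commands are run from `scripts/` and today's date is used as the folder name:

```bash
cd scripts/
# Use the download date (YYYY-MM-DD) as the folder name.
download_date=$(date +%F)
mkdir -p ../resources/OpenCellid/full_database/${download_date}
# Download cell_towers.csv.gz into that folder, then preprocess it:
python batch_pre_oci.py -fnzip ../resources/OpenCellid/full_database/${download_date}/cell_towers.csv.gz -njobs 20
```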
- Reactivate the Facebook Marketing API tokens
  - Go to [Meta for developers]
  - Store them in `resources/Facebook/MarketingAPI/tokens/tokens-<FB_account>`
    - Each `FB_account` can generate at most `15` tokens. Use as many as possible.
  - Check the tokens here: `notebooks/_FBM-check.ipynb`
  - Run `resources/Facebook/MarketingAPI/copy_to_tokens.sh <num>`
    - The tokens will be distributed across `<num>` folders, in case you want to run the `batch_features.sh` script for `clusters` and `pplaces` (or for different countries) at the same time: one folder per instance. If so, pass the correct `path` under the argument `-t` in `batch_features.sh` (see the sketch below).
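A minimal sketch for two parallel instances; the token-folder path is a placeholder, since the exact folder names produced by `copy_to_tokens.sh` are not stated here:

```bash
# From the repository root: split the stored tokens across 2 folders,
# one per parallel batch_features.sh instance.
resources/Facebook/MarketingAPI/copy_to_tokens.sh 2

# Point each instance at its own token folder via -t
# (<path_to_tokens_1> is a placeholder; check what copy_to_tokens.sh created).
cd scripts/
./batch_features.sh -r ../data/Uganda -c UG -y 2016,2018 -n 10 -t <path_to_tokens_1>
```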
- Get API keys from Google Earth Engine
  - Go to [Obtaining an API Key]
  - You need to create the following files, adding the respective key to each (a placeholder sketch follows these steps):
    - `api_key`
    - `project_id`
    - `service_account`
    - `clientsecret.json`
      - This file is generated when adding a `key` here
- Get API keys for Google Maps Static API
  - Go to [Use API Keys with Maps Static API]
  - You need to create the following files, adding the respective key to each:
    - `api_key`
    - `secret`
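A placeholder sketch of creating the key files; the directory where the scripts expect them is not stated here, so `<path_to_api_keys>` is an assumption to replace with your actual location:

```bash
cd <path_to_api_keys>/          # placeholder: wherever your setup reads the keys from
echo "<YOUR_API_KEY>"         > api_key
echo "<YOUR_PROJECT_ID>"      > project_id
echo "<YOUR_SERVICE_ACCOUNT>" > service_account
echo "<YOUR_SIGNING_SECRET>"  > secret   # Maps Static API only
# clientsecret.json is downloaded from the Google console when creating the key.
```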
All of the following steps are run from the `scripts` folder; for example, `cd scripts/`.
1. Run init: `./batch_init.sh -r ../data/Uganda -c UG -y 2016,2018 -n 10`
   - Note: it runs 3 scripts, which (i) prepare the folder structure for the given country, (ii) prepare the GT data, and (iii) prepare the PPlaces.
   - If you have your own `PPLACES.csv`, move it to `../data/<country>/features/pplaces/`
2. Run features GT: `./batch_features.sh -r ../data/Uganda -c UG -y 2016,2018 -n 10`
   - Note 1: pass the argument `-z` to specify the size (width x height) of the satellite images, e.g., `-z 160x160` (when using 400x400m grid-cells) or `-z 640x640` (when using OSM pplaces).
   - Note 2: pass the argument `-m` to specify the bounding boxes for the VIIRS features (comma-separated, in meters), e.g., `-m 400,800,1200,1600` (when using 400x400m grid-cells). If nothing is passed, the default is `1600,2000,5000,10000` (e.g., when using OSM pplaces).
   - Note 3: pass the argument `-w` to specify the bounding box for OSM, in meters, e.g., `-w 1600` (when using OSM pplaces) or `-w 400` (when using 400x400m grid-cells).
   - Note 4: pass the correct path to the API keys.
   - After running this step, you can move on to step #4. A grid-cell variant of this command is sketched below.
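For example, a run on 400x400m grid-cells would combine the notes above (the flag values come from the examples given; the token path is a placeholder, and the API-key path from Note 4 must also be set correctly):

```bash
./batch_features.sh -r ../data/Uganda -c UG -y 2016,2018 -n 10 \
  -z 160x160 -m 400,800,1200,1600 -w 400 \
  -t <path_to_tokens>   # token folder (see the Marketing API setup above)
```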
3. Run features PP: `./batch_features.sh -r ../data/Uganda -c UG -n 10`
   - Same as in step #2: be aware of passing the correct values for `-z`, `-m`, and `-w`.
   - Note that this step is not needed to run steps #4 to #9.
4. Run pre-processing: `./batch_preprocessing.sh -r ../data/Uganda -c UG -y 2016,2018 -o none -t all -k 5 -e 3`
   - Note: `-t` is the argument that specifies the recency of the ground-truth data to use for training. The available values are: `all`, `newest`, `oldest`.
5. Run CatBoost training: `./batch_xgb_train.sh -r ../data/Uganda -c UG -y 2016,2018 -l none -t all -a mean_wi,std_wi -f all -k 4 -v 1`
   - Weights: run `notebooks/_CatBoost_Weights_Cat10.ipynb`, then update the weights in `resources/CB_weights.json` and `maxval` in `resources/survey/available_metadata.json` (@TODO: make it a script).
   - Weighted: `./batch_xgb_train.sh -r ../data/Uganda -c UG -y 2016,2018 -l none -t all -a mean_wi,std_wi -f all -k 4 -v 1 -w 1`
   - Note: `-f` is the argument that specifies the source of features. Available: `all`, `FBM`, `FBP`, `FBMV`, `NTLL`, `OCI`, `OSM`, and any combination of the individual sources, sorted alphabetically in ascending order and separated with `_`, e.g., `FBM_FBMV_OCI` (see the example run below).
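For instance, training only on the `FBM`, `FBMV`, and `OCI` feature sources:

```bash
./batch_xgb_train.sh -r ../data/Uganda -c UG -y 2016,2018 -l none -t all \
  -a mean_wi,std_wi -f FBM_FBMV_OCI -k 4 -v 1
```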
6. Run augmentation: `python cnn_augmentation.py -r ../data/Uganda -years 2016,2018 -dhsloc none -probaug 1.0 -njobs 10 -imgwidth 640 -imgheight 640`
   - Note 1: change `imgwidth` and `imgheight` accordingly, e.g., 640 when using OSM pplaces and 160 when using 400x400m grid-cells (the grid-cell variant is shown below).
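The grid-cell variant, applying Note 1:

```bash
python cnn_augmentation.py -r ../data/Uganda -years 2016,2018 -dhsloc none \
  -probaug 1.0 -njobs 10 -imgwidth 160 -imgheight 160
```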
7. Run CNN training (& feature maps): run the scripts in `SLURM` (for `CNN` and `CNNa`); a skeleton of such a script is sketched after this list.
   - Create the folder: `slurm/dhsloc_none/logs-<ccode>`
   - See `../slurm/LB_train_r1_f12.sh` (this runs Liberia, run 1, folds 1 and 2)
   - Send the job to the scheduler: `sbatch LB_train_r1_f12.sh`
   - Run the CNN file merging: see `notebooks/_CNN_Merging_Tuning_Files.ipynb`
   - Run `../slurm/LB_train_r1_r2.sh` (this runs Liberia, runs 1 and 2)
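A minimal skeleton of such a submission script; every `#SBATCH` value and the training invocation itself are assumptions here, so copy an existing script in `../slurm/` (e.g., `LB_train_r1_f12.sh`) and adapt it:

```bash
#!/bin/bash
#SBATCH --job-name=UG_train_r1_f12               # hypothetical: country, run, folds
#SBATCH --output=dhsloc_none/logs-UG/%x_%j.out   # the logs folder created above
#SBATCH --gres=gpu:1                             # CNN/CNNa training requires a GPU
#SBATCH --mem=64G
#SBATCH --time=24:00:00

# Invoke the CNN training entry point here; see the existing scripts in
# ../slurm/ for the exact command and arguments per country/run/fold.
```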
8. Run fmaps: run the scripts in `SLURM` (for `CNN` and `CNNa`)
   - See `../slurm/LB_fmaps_noaug_r12.sh` (this runs Liberia, runs 1 and 2)
   - Send the job to the scheduler: `sbatch LB_fmaps_noaug_r12.sh`
9. Run CNN+CatBoost training: `./batch_xgb_train.sh -r ../data/Uganda -c UG -y 2016,2018 -l none -t all -a mean_wi,std_wi -f all -k 4 -v 1 -n offaug_cnn_mp_dp_relu_sigmoid_adam_mean_std_regression -e 19 -w 1`
   - Note 1: change `-n` (CNN model, with or without augmentation) and `-w` (weighted CB or not) accordingly.
10. Run the poverty map: `python batch_infer_poverty_maps.py -ccode UG -model CB`
    - First, make sure you have collected the features for the PPLACES (step #3).
    - Second, run the models `CB`, `CBw`, `CNN`, and `CNNa`, passing each of them as `-model <model_name>` (see the loop sketched below).
      - Note that running `-model CNN*` will create the feature maps for the pplaces, which are required to run the combined models.
    - Third, run the combined models: `CNN+CB`, `CNNa+CB`, `CNN+CBw`, `CNNa+CBw`.
    - For countries with many `pplaces` (e.g., `>10K`), it is recommended to use a high-memory server. `CNN` and `CNNa` must be run on a GPU.
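For example, running the four base models and then the combined ones in sequence (model names as listed above):

```bash
# Base models first (the CNN models also produce the feature maps needed below).
for m in CB CBw CNN CNNa; do
  python batch_infer_poverty_maps.py -ccode UG -model "$m"
done
# Then the combined models.
for m in CNN+CB CNNa+CB CNN+CBw CNNa+CBw; do
  python batch_infer_poverty_maps.py -ccode UG -model "$m"
done
```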
11. Run cross-country testing: `python batch_cross_predictions.py` (on a GPU)
    - First, update the list of countries in the `COUNTRIES` constant (in the same script) to include those in the transfer.