Skip to content
iabaako edited this page Oct 7, 2024 · 9 revisions

Overview

The IPA Data Cleaning Package is a Stata package developed by Innovations for Poverty Action (IPA) to streamline the process of cleaning and validating survey data. This package includes a suite of commands designed to handle common data cleaning tasks efficiently.

Software Requirements

Some of the commands in the ipaclean program are heavily dependent on Stata’s data frames. Unfortunately, Stata data frames are only available in Stata 17 or later. Consequently, ipaclean requires that the user has Stata 17.0 or later installed on their machine prior to running ipaclean. IPA employees with older versions of Stata should contact IT for access to a newer version.

Installation

To install the package, use the following command in Stata:

* Install ipaclean using
net install ipaclean, all replace from("https://raw.githubusercontent.com/PovertyAction/ipaclean/main")

* After installation, run the following command to install helper commands:
ipaclean update

* check your version of ipaclean with the command:
ipaclean version

Features

The package includes several key commands:

  • ipaappend: Safely append datasets.
  • ipamergerepeats: Safely merge datasets.
  • ipaodksplit: Create dummy variables from SurveyCTO/ODK style select_multiple type questions.
  • ipaodkmergerepeats: Reshape and Merge ODK/SurveyCTO repeat groups.
  • ipacompare: Compare Datasets across multiple rounds of survey data collection.
  • ipacodebook: Describe data content and export codebook to excel.

ipaappend – Safely Append Datasets

The ipaappend command is designed to safely append Stata-format datasets to an existing dataset in memory with additional features to avoid common issues such as mismatched variable types. This command ensures data consistency and optionally allows for a detailed append report to assess potential type conflicts. ipaappend's safely option is an alternative to using the force option with the Stata default append command which will lead to data loss. The safely option checks for the best data type that can accommodate all values and converts the variable in the master or using datasets so the append can happen without data loss.

Use case:

The ipaappend command is especially useful in cases where datasets need to be appended but variables contain inconsistent data types across datasets eg. price variable is numeric in master dataset and string in using dataset. It ensures that appending doesn’t lead to data loss or incompatibilities that could result from variable mismatches.

Example:

Suppose you have two datasets containing car data — one with data for domestic cars, where price is recorded as a string, and one with data for foreign cars, where price is numeric. Using Stata's native append command with the force option could result in data loss. However, ipaappend with the safely option will handle the type mismatch without data loss.

* Prepare datasets
sysuse auto, clear
keep if foreign == 0
tostring price, replace
save domestic

sysuse auto, clear
keep if foreign == 1
save foreign

* Attempting to append with the native append command results in an error
append using domestic

* Using the force option will result in data loss
append using domestic, force

* Using ipaappend's safely option instead
use foreign, clear
ipaappend using domestic, outfile("append_report.xlsx") safely replace

ipamerge – Safely Merge Datasets

The ipamerge command is designed to safely merge Stata-format datasets on one or more key variables while avoiding issues such as mismatched variable types. This command performs match merges (one-to-one, one-to-many, many-to-one, and many-to-many), ensuring data consistency and offering an alternative to Stata's merge command. With the safely option, ipamerge can handle type conflicts that would otherwise result in data loss or errors, converting variable types where necessary to retain all data during the merge process.

Use case:

The ipamerge command is especially useful for merging datasets with inconsistent variable types. For example, if a variable is numeric in the master dataset but a string in the using dataset, ipamerge with the safely option will convert data types as needed to avoid data loss or merge errors.

Example:

Suppose you have two datasets containing car information, one with foreign as a string variable and another with price as a string variable. Attempting a merge with the default Stata merge command may lead to errors or data loss when using force. However, ipamerge with the safely option will manage these inconsistencies without data loss.

* Prepare datasets
sysuse auto, clear
keep make price trunk weight length turn displacement gear_ratio foreign
tostring foreign, force replace
save "using_data", replace

sysuse auto, clear
keep make price mpg rep78 headroom foreign
tostring price, replace
save "master_data", replace

* Attempting a standard merge with the native merge command will cause an error
merge 1:1 make using "using_data"

* Using the force option will lead to missing values
merge 1:1 make using "using_data", force

* Using ipamerge's safely option to handle data type mismatch safely
use "master_data", clear
ipamerge 1:1 make using "using_data", safely outfile("mergereport.xlsx", replace)

ipaodksplit – Create Dummy Variables from SurveyCTO/ODK select_multiple Questions

The ipaodksplit command is designed to process XLSForm files and create dummy variables for select_multiple questions in SurveyCTO or ODK datasets. It reads the specified XLSForm and generates dummy variables based on the choices provided for each select_multiple question. This command can efficiently handle multiple languages and provides flexibility in variable labeling and ordering.

Use case:

The ipaodksplit command is especially useful for survey datasets with select_multiple questions, allowing researchers to create individual dummy variables for each option in the select_multiple questions. The command provides a much more streamlined and easier method for splitiing select_multiple variables during data cleaning.

Example:

In this example, ipaodksplit is used to label the dummy variables and assign value labels for binary responses (e.g., "Yes" and "No"). A prefix is added to avoid naming conflicts.

* Import Data
use "Employment Status and Consumption Patterns in 2023.dta", clear
* Define custom value labels and process select_multiple questions
label define yesno 0 "No" 1 "Yes"
* Split variables and label dummies using yesno value label
ipaodksplit using "Employment Status and Consumption Patterns in 2023.xlsx", order label vallab(yesno) prefix(_)

ipaodkmergerepeats – Reshape and Merge ODK/SurveyCTO Repeat Groups

The ipaodkmergerepeats command is designed to reshape and merge repeat group datasets from ODK or SurveyCTO forms into a main dataset. It simplifies the process of handling nested repeat groups, enabling users to consolidate data collected from repeat groups into a single, merged file.

** Use Case**: The ipaodkmergerepeats command is ideal for merging repeat group datasets that are common in SurveyCTO and ODK data collection. This functionality is particularly useful when you have nested data structures, such as a survey that includes household members and each member's individual responses.

Example:

In this example, ipaodkmergerepeats is used to reshape and merge repeat group data, then save a copy of the merged dataset to disk.

* Unzip and load test data
unzipfile "https://raw.github.com/PovertyAction/ipaclean/main/data/ipamergerepeats_test_data.zip"

* Reshape and merge repeat groups, then save the merged dataset
ipaodkmergerepeats using "Nested Repeat Data.dta", saving("Nested Repeat Data_merged") replace

ipacompare – Compare Datasets Across Multiple Survey Rounds

The ipacompare command is designed to track and compare datasets across multiple rounds of survey data collection. It generates an Excel report that highlights survey completion rates and data completeness across different rounds, providing valuable insights into respondent tracking and survey participation over time.

Use Case:

The ipacompare command is especially useful for longitudinal studies or any survey project involving repeated data collection. It allows researchers to compare respondent data across multiple rounds, enabling them to assess consistency, track consent rates and analyze survey outcomes.

Examples: Example 1: Comparing Four Survey Rounds with a Master Dataset

In this example, ipacompare is used to track data across four rounds of data collection against a master dataset. The master dataset includes respondent demographics, consent, and outcome information, allowing for detailed tracking and reporting.

* Unzip and load test data
unzipfile "https://raw.github.com/PovertyAction/ipaclean/main/data/ipacompare_test_data.zip"

* Compare data from four survey rounds with a master dataset
ipacompare, id(hhid) date(submissiondate) keepmaster(sex) consent(consent, 1) outcome(complete, 1 2) masterdata("Deworming Project - Master Dataset") ///
    s1("Deworming Project - Census", "Census") s2("Deworming Project - Baseline", "Baseline") s3("D Project - Midline", "Midline") s4("Deworming Project - Endline", "Endline") ///
    outfile(compare.xlsx) replace

Example 2: Comparing Four Survey Rounds Without a Master Dataset

In this case, ipacompare compares four rounds of survey data without a master dataset. The command compiles IDs from all rounds, tracking consent and outcome information over time.

* Compare data from four survey rounds without a master dataset
ipacompare, id(hhid) date(submissiondate) consent(consent, 1) outcome(complete, 1 2) ///
    s1("Deworming Project - Census", "Census") s2("Deworming Project - Baseline", "Baseline") s3("Deworming Project - Midline", "Midline") s4("Deworming Project - Endline", "Endline") ///
    outfile(compare.xlsx) replace

ipacodebook – Describe Data Content and Export Codebook to Excel

The ipacodebook command generates an Excel codebook that summarizes the variable names, labels, types, number and percentage of missing values, and the number of distinct values for each variable. Additionally, it can produce a template for variable name and label corrections, which can be applied back to the dataset. Users can also specify statistics to be included in the codebook, such as mean, standard deviation, or percentiles.

Use case:

The ipacodebook command is particularly useful for projects that require detailed documentation of survey or research data. By exporting a comprehensive codebook, it enables researchers to review, document, and share key details about their datasets. The template feature allows for easy adjustments to variable names and labels, which can then be applied back to the data.

Example:

Example 1: Exporting a Simple Codebook for All Variables

This example demonstrates how to export a codebook for all variables in the auto dataset, overwriting any existing file.

* Load example data
sysuse auto, clear

* Export codebook for all variables
ipacodebook _all using "auto_codebook.xlsx", replace

Example 2: Using Notes as Variable Labels

In this example, the codebook is generated using notes as variable labels when the notes are longer than the labels.

* Export codebook using notes as labels if notes are longer
ipacodebook _all using "auto_codebook.xlsx", note(1, longer) replace

Example 3: Generating a Template Codebook and Applying Changes

This example shows how to create a codebook template to edit variable names and labels, then apply the changes back to the dataset.

* Generate a template codebook for editing
ipacodebook _all using "auto_codebook.xlsx", template replace

* After editing the template, apply the changes
ipacodebook _all using "auto_codebook_NEW.xlsx", applyusing("auto_codebook.xlsx") replace

Example 4: Exporting a Codebook with Selected Statistics

In this example, the codebook includes specific statistics (mean, p25, p50, and p75) for all numeric variables.

* Export codebook with selected statistics
ipacodebook _all using "auto_codebook.xlsx", replace statistics(mean p25 p50 p75)

Example 5: Exporting a Codebook with Statistics for Specific Variables

This example shows how to generate a codebook including mean, median, and standard deviation for only the price and mpg variables.

* Export codebook with statistics for selected variables
ipacodebook _all using "auto_codebook.xlsx", replace statistics(mean median sd) statvariables(price 
Clone this wiki locally