This repository collects publicly available datasets for replicability analysis. Currently, we curate a collection of paired individual-level datasets of original and replication studies, and one-sided pairs with individual-level data for the replication study. We are non-selective in collecting these datasets, i.e., both successful and failed studies are included as long as they are available.
This repository accompanies the paper "Diagnosing the role of observable distribution shift in scientific replications" by Ying Jin, Kevin Guo and Dominik Rothenhäusler. [Reference]
Please feel free to contact us at ying531[at]stanford[dot]edu, or open an issue if you have suggestions for replication datasets not collected here!
R package. Our R package repDiagnosis provides statistical tools for estimating the contribution of observable distribution shifts in replication studies, such as covariate difference and mediation shifts. Paired data 1, 3, 8 below are cleaned and pre-loaded in the R package for use.
Interactive diagnosis app. Play with our interactive analysis tools in our online R shiny app! Quick start with pre-loaded datasets in the app (datasets 1, 3, 8 below). You can also diagnose your own replication study, or probe the generalizability of your single study.
Example analysis. In `analysis.html`, we provide an analysis report for other datasets that we did not elaborate on in our paper.
1. Complete, paired datasets. Data list, Data details.
2. One-sided datasets. Data list, Data details.
Below we list links to papers and datasets for original and replication studies where both have individual-level data publicly available. The `Processed` column links to the data folder in this repo (if any) containing data that we processed from publicly available sources. Clicking a link in the `Name` column jumps to the text that summarizes the study.
Below we collect one-sided original-replication study pairs, i.e., pairs where the replication study has individual-level data while the original study has only summary statistics available. We include such datasets if the original paper contains rich summary statistics. These summary statistics, together with the individual-level data of the replication study, are processed and stored in the folders linked in the `Processed` column. Clicking a link in the `Name` column jumps to the text that summarizes the study.
Name | Original paper | Replication paper | Replication data/repo | Processed |
---|---|---|---|---|
1. Climate change misinformation | van der Linden, et al., 2015 | Williams and Bond, 2020 | OSF link | Folder link |
2. Pain-tolerance metaphor | Sierra, et al., 2016 | Pendrous, et al., 2020 | OSF link | Folder link |
3. Body dissatisfaction | Martijn, et al., 2010 | Glashouwer, et al., 2019 | Database link | Folder link |
4. Priming and exercise | Pottratz, et al., 2021 | Timme, et al., 2022 | OSF link | Folder link |
- Background. This study investigates the effect of a 'nudge' to think about the truthfulness of information on truth discernment when sharing COVID-related news. The treated participants were asked to rate the accuracy of several headlines, and all participants rated how likely they were to share them on social media.
- Sample sizes. The original study by Pennycook et al. recruited n = 1145 participants, while the replication study by Roozenbeek et al. had sample size N = 1583.
- Variables. The outcome variable is `ratings`, the rating of willingness to share the headlines. In addition, both studies measured demographic information including age, gender, education, and ethnicity. Other measures include cognitive reflection `crt`, science knowledge `sciknow`, the medical maximizer-minimizer scale `mms`, etc. The binary treatment is encoded in the `treatment` column, and `real` is a binary indicator of whether the information is correct.
- Results. The original study found a statistically significant estimate of the interaction of treatment and news truthfulness, i.e., treated participants were less willing to share headlines that were perceived as less accurate. The replication study failed to detect such an effect in the first stage with N = 701, but found a significant, smaller effect after collecting a second round of data, with pooled N = 1583.
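To make the "interaction of treatment and news truthfulness" concrete, here is a minimal Python sketch. It relies on the fact that, in a saturated linear model with two binary factors, the interaction coefficient equals the difference-in-differences of the four cell means. The column names `treatment`, `real`, and `ratings` follow the cleaned data described above; the rows are synthetic and the helper names are hypothetical. The sketch ignores repeated ratings per participant and is not the analysis used in either paper.

```python
from statistics import mean

# Synthetic rows for illustration only; these are NOT values from either study.
rows = [
    {"treatment": 0, "real": 0, "ratings": 3.0},
    {"treatment": 0, "real": 0, "ratings": 4.0},
    {"treatment": 0, "real": 1, "ratings": 4.0},
    {"treatment": 0, "real": 1, "ratings": 5.0},
    {"treatment": 1, "real": 0, "ratings": 2.0},
    {"treatment": 1, "real": 0, "ratings": 3.0},
    {"treatment": 1, "real": 1, "ratings": 4.0},
    {"treatment": 1, "real": 1, "ratings": 5.0},
]


def cell_mean(rows, treatment, real):
    """Mean sharing rating within one (treatment, real) cell."""
    values = [r["ratings"] for r in rows
              if r["treatment"] == treatment and r["real"] == real]
    return mean(values)


def interaction_estimate(rows):
    """Difference-in-differences of cell means (hypothetical helper).

    Discernment = mean rating for true headlines minus false headlines.
    The interaction is discernment among the treated minus discernment
    among the controls.
    """
    discern_treated = cell_mean(rows, 1, 1) - cell_mean(rows, 1, 0)
    discern_control = cell_mean(rows, 0, 1) - cell_mean(rows, 0, 0)
    return discern_treated - discern_control


interaction = interaction_estimate(rows)
print(interaction)
```

A positive value means the treated group separates true from false headlines more strongly than the control group does.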
- Background. Babcock et al. conducted two replications of one study from Côté et al., regarding the effect of inducing empathy on utilitarian moral judgment across socioeconomic status (SES). Treated participants received an empathy nudge, and then all participants completed an allocation task.
- Sample sizes. The original sample size was n = 91. The first replication study had sample size N1 = 230, and the second had N2 = 300.
- Variables. The primary outcome is `Decision_DV`, i.e., how many dollars they would take away from the 'lose' member in the allocation task, as a measure of utilitarian moral judgment. Control variables including age, gender, ethnicity, income, religiosity, political orientation, etc., were also collected, as were intermediate outcomes on how compassionate, moved, and sympathetic participants felt towards the 'lose' member. We clean the datasets for the two replication studies separately.
- Results. The original study found a significant effect of the interaction of experimental condition and SES. The first replication study did not replicate this result, while the second replication study did.
- Background. This study concerns the effect of eye movement on susceptibility to false memories. These eye movements are a standard component of "eye movement desensitization and reprocessing", a standard intervention for posttraumatic stress disorder.
- Sample sizes. The original study by Houben et al. had sample size n = 82, while the direct replication by Calvillo et al. had sample size N = 120.
- Variables. The outcome variables are the total number of correct answers and the total number of misinformation after the experiment. In addition, both studies collected gender, age, and pre- and post-intervention vividness of memory and emotionality. The measure of depression level differs between the studies, changing from BDI to BDI-II.
- Results. The original study found a statistically significant effect of eye movement on increasing false memories, while the replication study did not.
- Background. This study investigates whether mind-body practices (yoga in experiment 1 and meditation in experiment 2) increase self-enhancement. In experiment 1, waves of local yoga participants were randomly assigned to treatment and control by week. In experiment 2, participants were recruited from an undergraduate psychology subject pool, with two waves completed offline and two online.
- Sample sizes. The original study has n1 = 93 for experiment 1 and n2 = 162 for experiment 2 (potentially with repeated measures over a few weeks). The replication study has N1 = 97 and N2 = 300 for the two experiments.
- Variables. There are several outcome variables, including self-centrality, self-enhancement, self-esteem, etc. In our folder, we cleaned the datasets with easier-to-understand column names, and also provide the data cleaning scripts (adapted from the data sources) for reproducibility.
- Results. In the replication, experiment 1 showed no significant effect of yoga on enhancing self-centrality, but did (largely) replicate the effect on self-enhancement, self-esteem, and communal narcissism. Vaughan-Johnston et al. explained the discrepancy by sampling differences. Experiment 2 showed no significant effect of meditation on self-centrality; frequentist and Bayesian analyses disagreed regarding self-enhancement; however, the replication found much stronger evidence for well-being effects than the original study.
- Background. This study investigates the impact of queue design on worker productivity in service systems that involve human servers, comparing multiple parallel queues with a single pooled queue.
- Sample sizes. The original study recruited n1 = 248 participants from a public university in the US and n2 = 481 participants on M-Turk. The replication study recruited N1 = 246 and N2 = 252 participants for two rounds.
- Variables. The outcome variable is median speed. The treatment variable is the structure of the queue. Other baseline variables were also measured, including age, gender, device used in the experiment, and managerial experience of the participant.
- Results. The original study found that the single-queue structure slows down servers, while the replication study failed to find such an effect.
- Background. This is a multi-lab replication of an original study from Eskine et al. (2011); unit-level data for the original study is not publicly available to our knowledge. They studied the effect of gustatory disgust on moral judgment, where participants were randomly assigned to bitter, neutral (control), or sweet beverages, and then judged the moral wrongness of six vignettes. We follow the ordering on OSF to clean the datasets and preserve common demographic, manipulation check, and outcome variables.
- Sample sizes. The original study had sample size n = 57, while the replication studies had N = 1137 participants in total across k = 11 studies.
- Variables. The outcome variable is the average moral rating of the six vignettes. The treatment variable is `condition`, coded as `dummysweet`, `dummybitter`, and `dummywater` in the cleaned datasets. Baseline covariates include religiosity, gender, age, years in college, major, ethnicity, political orientation, etc. We preserve gender, age, and political orientation for consistency in the cleaned data. As a check of the intended effect of the beverages, subjective ratings (bitter, disgusting, neutral, and sweet) were also assessed; these are named `check_...` in the cleaned data.
- Results. The original study showed that gustatory disgust triggers a significantly heightened sense of moral wrongness. In the multi-lab replication study, the overall estimates of effect sizes were all smaller than in the original study; some were in the opposite direction; all had 95% confidence intervals containing zero.
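As an illustration of the dummy coding of `condition` mentioned above, the following Python sketch maps a three-level condition label to 0/1 indicator columns named `dummysweet`, `dummybitter`, and `dummywater`. The label strings (`"sweet"`, `"bitter"`, `"water"`) and the helper `dummy_code` are assumptions made for this example; the actual labels in the OSF files and the cleaning scripts in this repo may differ.

```python
# Hypothetical condition labels; the raw label strings in the data may differ.
LEVELS = ("sweet", "bitter", "water")


def dummy_code(condition):
    """Return one 0/1 indicator per beverage condition (hypothetical helper)."""
    if condition not in LEVELS:
        raise ValueError(f"unknown condition: {condition!r}")
    return {f"dummy{level}": int(condition == level) for level in LEVELS}


# Synthetic assignment sequence, for illustration only.
conditions = ["bitter", "water", "sweet", "bitter"]
coded = [dummy_code(c) for c in conditions]
print(coded[0])
```

Exactly one indicator equals 1 in each row, so when fitting a regression with an intercept, one of the three columns (for example `dummywater`, the control) would typically be dropped.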
- Background. Experiment 2 of Bastian et al. (2014) studied the effect of sharing painful experience on intergroup cooperation. Small groups (2-6 people each) of participants performed either two painful or two painless tasks and then played an economic game. Prochazka et al. (2022) conducted a non-preregistered pilot direct replication and a second, preregistered direct replication, with group sizes fixed at three.
- Sample sizes. The original study had sample size n = 62. The pilot replication had N1 = 153 participants from the Czech Republic, and the second, preregistered replication had N2 = 158 students from Slovakia.
- Variables. The outcome variable is `cooperation`, the average score from the six games. The treatment variable is `condition`. We cleaned the datasets by preserving overlapping variables, while the original data additionally contains group size information. Baseline covariates include age and gender. After the experiments, intermediate outcomes such as the level of pain and unpleasantness of sensations were measured as a manipulation check.
- Results. The original study found that shared pain increases cooperation among group members. Both replication studies failed to replicate this finding.
- Background. This study investigates the impact of physical cleanliness on the severity of moral judgment. Participants were randomly assigned to be primed with the concept of cleanliness (Exp. 1) or to wash their hands after experiencing disgust (Exp. 2), and then rated six moral vignettes.
- Sample sizes. The original study had n1 = 40 for Exp. 1 and n2 = 44 for Exp. 2. The replication study had N1 = 219 for Exp. 1 and N2 = 132 for Exp. 2.
- Variables. We cleaned the datasets and preserved the covariates common to both studies. The outcome variable is `vignette`, the mean rating across all vignettes. The treatment variable is `condition`, with treatment coded as 1. Other variables include the emotionality measures collected after the experiments.
- Results. The original study found statistically significant effects in both experiments, while Johnson et al. failed to replicate either of them.
- Background. This study investigates the impact of foreign versus native language on lying. In the original study, German-speaking participants took a lie test where questions were presented randomly in German or English, and they answered truthfully or with a lie in the different languages. In the replication study, participants were Dutch-speaking.
- Sample sizes. The original study had n = 41 participants, and the replication study had N = 63.
- Variables. The measured outcome is the response time for truthful and lying answers in both languages. In our cleaned data, each row contains the mean response time of a participant (indicated by `ID`) for questions of different `Emotionality`, `Veracity` (Lie or Truth), and `Language`, as well as the participant's evaluation of emotionality for each category of (`Emotionality` × `Language`). Due to limited access, only the replication data contains demographic features including age, gender, major, and language proficiency, as introduced in Frank, et al., 2019.
- Results. The original study showed smaller reaction time differences between lying and truth telling in the foreign language condition than in the native language condition, which was mostly driven by prolonged truth responses. The replication study found a statistically significant effect in the same direction, yet with a smaller effect size.
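The quantity compared in these results, the lie-minus-truth response time gap within each language, can be sketched from rows shaped like the cleaned data described above. In this Python sketch, `ID`, `Veracity`, and `Language` follow that description, while the response time column name `RT`, the language labels `"Native"` and `"Foreign"`, and the helper `lie_truth_gap` are assumptions for illustration. The rows are synthetic, and this is a descriptive summary, not the statistical test used in either paper.

```python
from collections import defaultdict
from statistics import mean

# Synthetic rows for illustration only; "RT" is a hypothetical column name.
rows = [
    {"ID": 1, "Veracity": "Lie", "Language": "Native", "RT": 1500.0},
    {"ID": 1, "Veracity": "Truth", "Language": "Native", "RT": 1200.0},
    {"ID": 1, "Veracity": "Lie", "Language": "Foreign", "RT": 1600.0},
    {"ID": 1, "Veracity": "Truth", "Language": "Foreign", "RT": 1500.0},
    {"ID": 2, "Veracity": "Lie", "Language": "Native", "RT": 1400.0},
    {"ID": 2, "Veracity": "Truth", "Language": "Native", "RT": 1200.0},
    {"ID": 2, "Veracity": "Lie", "Language": "Foreign", "RT": 1500.0},
    {"ID": 2, "Veracity": "Truth", "Language": "Foreign", "RT": 1400.0},
]


def lie_truth_gap(rows):
    """Average over participants of (mean Lie RT - mean Truth RT), per language."""
    # Pool response times within each (participant, language, veracity) cell,
    # which averages over any other factors such as emotionality.
    cells = defaultdict(list)
    for r in rows:
        cells[(r["ID"], r["Language"], r["Veracity"])].append(r["RT"])

    gaps = defaultdict(list)
    for pid, language, veracity in list(cells):
        # Only use participants who have both Lie and Truth rows in a language.
        if veracity == "Lie" and (pid, language, "Truth") in cells:
            lie = mean(cells[(pid, language, "Lie")])
            truth = mean(cells[(pid, language, "Truth")])
            gaps[language].append(lie - truth)
    return {language: mean(values) for language, values in gaps.items()}


gap_by_language = lie_truth_gap(rows)
print(gap_by_language)
```

In this synthetic example the gap is smaller in the foreign language than in the native language, which is the direction of the effect reported above.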
There are two multi-lab replications. Hagger, et al., [2016] failed, but Dang, et al., [2020] succeeded. Dang, et al., [2020] also pointed out that inconsistent implementation of the intervention may be a reason for the replication failure in Hagger, et al., [2016]. Both OSF links contain datasets for each lab, which include individual-level characteristics.
- Background. This study investigates the impact of time pressure on cheating. In the original study, participants privately rolled a die and were paid according to the number they reported (which did not have to be true). The reported number is used as the outcome.
- Sample sizes. The original study had n = 72. The replication study consisted of two experiments; the first had larger sessions, with N1 = 426, and the second had the same session size as the original study, with N2 = 297.
- Variables. The outcome of interest is the reported die roll. The treatment variable (=1) indicates whether there is time pressure (i.e., having to report the die roll within a short time). Data for the original study contains only gender as demographic information. Data for the replication study contains age, gender, education, etc., as demographics, as well as ratings of participants' belief in the financial incentive and in the anonymity of their die roll. The original study and replication study 1 collected the participants' positive and negative feelings after the experiment; we preserve all such columns and put the common ones before the others.
- Results. The original study found that time pressure increases cheating, while neither of the replication studies replicated this conclusion.
- Background. This study investigates the impact of information communication on protecting against misinformation about climate change.
- Sample sizes. The original study had n = 2167. The replication study had N = 792.
- Variables. The outcome of interest is the perceived consensus, and there are multiple treatment conditions. We clean the replication dataset (with unit-level data) and the sample means of demographic information in the original dataset, with the processing script included for reproducibility.
- Results. The original study had multiple hypotheses; the replication study replicated a subset of them.
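For one-sided pairs like this one, the two sources can be put side by side by comparing a covariate's mean in the unit-level replication data with the sample mean reported for the original study. The Python sketch below computes a simple standardized difference, scaled by the replication sample's standard deviation. All numbers, variable names, and the helper `standardized_shift` are hypothetical; this is a descriptive check only, not the estimator from our paper or the `repDiagnosis` package.

```python
from statistics import mean, stdev


def standardized_shift(replication_values, original_mean):
    """(replication mean - original mean) / replication SD (hypothetical helper).

    Uses the sample standard deviation of the replication data, since only
    the replication study has unit-level observations.
    """
    return (mean(replication_values) - original_mean) / stdev(replication_values)


# Hypothetical values for illustration; NOT taken from either study.
replication_age = [20.0, 22.0, 24.0, 26.0, 28.0]
original_mean_age = 21.0

shift = standardized_shift(replication_age, original_mean_age)
print(round(shift, 3))
```

A value far from zero suggests that the two samples differ on that covariate, which is the kind of observable shift one may want to examine further.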
- Background. This study investigates the impact of common physical properties (such as 'cold') within a perseverance metaphor on increasing pain tolerance. Participants completed a cold pressor task before and after a randomly allocated metaphor exercise intervention.
- Sample sizes. The original study had n = 87. The replication study had N = 89.
- Variables. The outcome of interest is the difference in pain tolerance. We save the replication dataset (with unit-level data) and the sample means of demographic information in the original dataset.
- Results. The original study found that physical metaphor increases pain tolerance, while the replication study did not replicate this result.
- Background. This study investigates the impact of a computer-based evaluative conditioning (EC) procedure using positive social feedback on enhancing body satisfaction.
- Sample sizes. The original study had n = 54. The replication study had N = 129.
- Variables. The outcome of interest is the difference in body satisfaction and self-esteem before and after the intervention. We save the replication dataset (unit-level) and the sample means of demographic information in the original dataset.
- Results. The conclusion in the original study was not successfully replicated.
- Background. This study investigates the impact of affective priming as a behavioral intervention on the enhancement of exercise-related affect.
- Sample sizes. The original study had n = 54. The replication study had N = 53.
- Variables. The outcome of interest is the difference in body satisfaction and self-esteem before and after the intervention. We save the replication dataset (unit-level) and the sample means of demographic information in the original dataset.
- Results. The conclusion in the original study was not successfully replicated. The replication report emphasized heterogeneity among people as a potential factor in the failure.
Please use the following citation if you use this collection in your study, or if you use our software for analyzing replication studies.
@article{jin2023diagnosing,
title={Diagnosing the role of observable distribution shift in scientific replications},
author={Jin, Ying and Guo, Kevin and Rothenh{\"a}usler, Dominik},
journal={arXiv preprint arXiv:2309.01056},
year={2023}
}