Abstract
We organized three evaluations of the paper “Selecting the Most Effective Nudge: Evidence from a Large-Scale Experiment on Immunization”[1]. The evaluations are generally extremely positive. However, evaluator 2 expresses some doubts about the novelty of the authors’ (“treatment variant aggregation”) approach and its practical advantages relative to Bayesian estimators that are more sophisticated than the ones they tested. Evaluator 3 has concerns about the robustness of the assumptions behind the econometric justifications for TVA; they also strongly encourage the authors to share software and provide further guidance. To read these evaluations, please see the links below.
Evaluations
1. Anonymous evaluation 1
2. Anonymous evaluation 2
3. Anonymous evaluation 3
Overall ratings
We asked evaluators to provide overall assessments as well as ratings for a range of specific criteria.
I. Overall assessment: We asked them to rank this paper “heuristically” as a percentile “relative to all serious research in the same area that you have encountered in the last three years.” We requested they “consider all aspects of quality, credibility, importance to knowledge production, and importance to practice.”
II. Journal rank tier, normative rating (0-5): On a ‘scale of journals’, what ‘quality of journal’ should this be published in? (See ranking tiers discussed here.) Note: 0= lowest/none, 5= highest/best.
| | Overall assessment (0-100) | Journal rank tier, normative rating (0-5) |
|---|---|---|
| Anon. Evaluator 1 | 98 | 5.0 |
| Anon. Evaluator 2 | 95 | 4.7 |
| Anon. Evaluator 3 | 85 | 4.5 |
See “Metrics” below for a more detailed breakdown of the evaluators’ ratings across several categories. To see these ratings in the context of all Unjournal ratings, with some analysis, see our data presentation here.
See here for the current full evaluator guidelines, including further explanation of the requested ratings.
Evaluation summaries
Anonymous evaluator 1
This is an absolutely superb paper tackling a hugely important policy question. The authors develop a new econometric approach to aggregating high-dimensional factorial designs in RCTs in order to identify the most effective policies, and apply it to the question of increasing childhood immunization rates. I am thoroughly impressed by every aspect of this paper: the dataset used, the specific treatment arms used, the policy relevance, and the approach to identifying the best policy.
Anonymous evaluator 2
Strengths:
Introduces a new technique - treatment variant aggregation (TVA) - to answer policy-relevant questions
Strong theoretical and simulation results grounded in real-world data
Demonstrates advantages in both selection and estimation of policies by comparing to many alternative estimators
Application to an urgent, real-world problem
Weaknesses:
Simulations rely on parameters from a single dataset, limiting generalizability
Unclear practical advantage of TVA in selecting better policies compared to alternatives
Bayesian estimators [that are more sophisticated than the ones they tested] could rival or surpass TVA
Anonymous evaluator 3
The method is mostly a clever combination of existing methods; the main contribution is showing [the] consistency and normality of their estimator. These results need not always hold, as they depend on assumptions, and so applicability is not universal. The authors provide guidance by presenting simulations showing that for n > 3000, normality seems to hold. The field data reveal no surprise[s], and all the interventions have been tested before. Yet, the paper is strong, and the method will possibly remain relevant despite the recent advances in adaptive experimental design, if the authors provide additional guidance or software.
Metrics
Ratings
See here for details on the categories below, and the guidance given to evaluators.
| Rating category | Evaluator 1 (Anonymous)** Rating (0-100) | 90% CI (0-100)* | Evaluator 2 (Anonymous) Rating (0-100) | 90% CI (0-100)* | Evaluator 3 (Anonymous) Rating (0-100) | 90% CI (0-100)* |
|---|---|---|---|---|---|---|
| Overall assessment | 98 | (96, 100) | 95 | (80, 99) | 85 | (75, 90) |
| Advancing knowledge and practice | 97 | (95, 100) | 95 | (80, 99) | 85 | (80, 90) |
| Methods: justification, reasonableness, validity, robustness | 97 | (95, 100) | 90 | (70, 95) | 80 | (50, 100) |
| Logic & communication | 96 | (93, 100) | 90 | (70, 95) | 80 | (50, 100) |
| Open, collaborative, replicable | 96 | (92, 100) | 90 | (70, 95) | 70 | (25, 80) |
| Real-world relevance | 100 | (100, 100) | 95 | (75, 99) | 95 | (70, 100) |
| Relevance to global priorities | 100 | (100, 100) | 85 | (70, 95) | 95 | (90, 100) |
** Manager’s note: We will not incorporate the ratings from Evaluation 1 into our database/dashboard and analysis, as these show signs of being uncalibrated. In general, we ask our evaluators to take a Bayesian approach to specifying their credible intervals for percentiles and predictions, avoiding degenerate intervals.
Journal ranking tiers
See here for more details on these tiers.
| Judgment | Evaluator 1 (Anonymous)** Ranking tier (0-5) | 90% CI | Evaluator 2 (Anonymous) Ranking tier (0-5) | 90% CI | Evaluator 3 (Anonymous) Ranking tier (0-5) | 90% CI |
|---|---|---|---|---|---|---|
| On a ‘scale of journals’, what ‘quality of journal’ should this be published in? | 5.0 | (5.0, 5.0) | 4.7 | (4.0, 5.0) | 4.5 | (4.0, 5.0) |
| What ‘quality journal’ do you expect this work will be published in? | 5.0 | (5.0, 5.0) | 4.7 | (4.0, 5.0) | 4.7 | (4.0, 5.0) |
We summarize these tiers as:
0.0: Marginally respectable/Little to no value
1.0: OK/Somewhat valuable
2.0: Marginal B-journal/Decent field journal
3.0: Top B-journal/Strong field journal
4.0: Marginal A-journal/Top field journal
5.0: A-journal/Top journal
Evaluation manager’s discussion
Unjournal process notes
Note on versions: Evaluator 1 and Evaluator 2 considered an earlier (Sept. 2022) version of this paper — the version on arXiv here. Evaluator 3 considered the June 2024 version linked here. (Evaluators 1 and 2 were given a chance to adjust their evaluations to the more recent version but declined to do so.)
We shared a set of “bespoke evaluation notes” with the evaluators; these are linked here, and included several specific suggestions for “some key issues and claims to vet”.
Author engagement
We reached out to the authors on April 9, 2024 to let them know we were commissioning evaluations of this paper, asking whether they had any particular requests and whether there were any newer or forthcoming versions. On August 10 we shared these evaluations with the authors; they acknowledged receipt but declined to respond, noting that the paper had been accepted for publication in the journal Econometrica.
Why we chose this paper
This paper already seems to be having a direct influence on funding and policy in impactful areas. According to the member of our team who suggested it, the paper is “already informing a J-PAL project in India, as well as Suvita.” The discussion of incentives vs. reminders seems particularly relevant; it could inform charities such as New Incentives.
The methodological question is how to “select the best policy among potential bundles that combine several interventions” and report a credible measure of the impact of this ‘best’ intervention. This seems valuable for a range of high-impact contexts, from global health interventions to fighting misinformation to promoting effective charitable giving. I expect approaches like TVA (or more Bayesian approaches to this question) to be increasingly used in research and practice.
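As a point of reference, here is a minimal, hypothetical sketch (in Python) of the naive “estimate every bundle and pick the max” baseline that TVA-style or Bayesian approaches aim to improve on. The data layout and all names here are our assumptions for illustration, not the authors’ code.

```python
# A minimal, hypothetical sketch of the policy-selection problem (not the
# authors' TVA estimator). Assumes a pandas DataFrame `df` with one row per
# child, a `bundle` label for the assigned policy bundle, and a binary
# `immunized` outcome -- all names are illustrative assumptions.
import pandas as pd

def naive_best_bundle(df: pd.DataFrame) -> tuple[str, float]:
    """Estimate each bundle's mean outcome and return the apparent winner.

    Picking the max on the same data used for estimation is exactly the step
    that needs more care (TVA-style pooling/pruning, or a Bayesian treatment)
    before the winner's effect can be reported as a credible impact estimate.
    """
    means = df.groupby("bundle")["immunized"].mean()
    winner = means.idxmax()
    return str(winner), float(means.loc[winner])
```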
Issues meriting further evaluation
The “issues and claims” we suggested (in the aforementioned notes) were partially addressed by the evaluators. Some examples:
“Lasso then prune” / results relying on sparsity assumptions
As one example, we asked them to consider whether the authors’ ‘lasso then prune’ approach is justified. Its statistical justification relies on a sparsity assumption, i.e., that in the real world some treatments have precisely zero marginal effect. The third evaluator considered this in part, noting:
Treatments that have no effect whatsoever are either "pooled or pruned (pooled with control)". Scientifically, I am a bit unsure whether this should be allowed; a discussion seems warranted. It is obviously principled and data-driven, but conceptually, it seems wrong to me.
… Whether or not the strong assumptions are likely to be fulfilled in different settings will remain a judgment call.
This issue may merit further, more formal discussion from statisticians and econometricians who work with machine learning/statistical learning.
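For readers unfamiliar with the idea, the following is a rough sketch of what a “lasso then prune” step can look like. It is our illustration under the sparsity assumption discussed above, not the authors’ implementation; the variable names and the cross-validated lasso are assumptions.

```python
# A hedged sketch of a "lasso then prune" step in the spirit of TVA -- not the
# authors' exact procedure. Assumes `X` is an n x k matrix of treatment-variant
# dummies and `y` is the outcome vector; both names are assumptions.
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

def lasso_then_prune(X: np.ndarray, y: np.ndarray):
    """Zero out variants with (approximately) no marginal effect via the lasso
    -- the sparsity assumption discussed above -- then re-estimate the
    surviving variants by unpenalized OLS ("post-lasso")."""
    lasso = LassoCV(cv=5).fit(X, y)
    keep = np.flatnonzero(lasso.coef_)    # variants not pruned/pooled with control
    if keep.size == 0:
        return keep, np.array([])         # everything pooled with control
    post = LinearRegression().fit(X[:, keep], y)
    return keep, post.coef_
```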
We asked about the “scalability of the interventions/economic feasibility for more widespread adoption: Can these strategies be effectively implemented in other regions of India or similar contexts, or are there aspects of the context in Haryana state that might affect the generalizability of the results?…” None of the evaluators discussed these issues or considered the context (India, immunizations) in detail.
The authors report: “The most cost-effective policy (information hubs, SMS reminders, no incentives) increases the number of immunizations per dollar by 9.1%.” This is not a value/cost measure, because it depends on the immunization base rate and other contextual factors. We would encourage the authors to estimate and report even more decision-relevant benefit/cost measures. Future evaluators could offer specific suggestions for this, or estimate these themselves (using the code and data that will presumably be provided along with the final publication in Econometrica).
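To illustrate why the relative figure alone is not a benefit/cost measure, consider the following back-of-the-envelope sketch; every number except the paper’s reported 9.1% gain is a hypothetical placeholder.

```python
# All numbers below are hypothetical placeholders except the paper's reported
# +9.1% relative gain; they only illustrate why that relative figure alone is
# not a benefit/cost measure.
baseline_rate = 0.50            # hypothetical immunization rate under control
cost_per_child = 1.00           # hypothetical programme cost per child (USD)
value_per_immunization = 10.0   # hypothetical monetized benefit per immunization (USD)

imms_per_dollar_control = baseline_rate / cost_per_child
imms_per_dollar_best = imms_per_dollar_control * 1.091   # paper's +9.1% figure

# A benefit/cost ratio additionally requires monetizing the benefit:
benefit_cost_ratio = imms_per_dollar_best * value_per_immunization
print(f"Hypothetical benefit/cost ratio: {benefit_cost_ratio:.2f}")
```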