Evaluation Summary and Metrics: "Selecting the Most Effective Nudge: Evidence from a Large-Scale Experiment on Immunization"

Evaluation Summary and Metrics: "Selecting the Most Effective Nudge: Evidence from a Large-Scale Experiment on Immunization" for The Unjournal. Evaluators: Anonymous (1), Anonymous (2), and Anonymous (3)

Published on Aug 10, 2024

This Pub is a Review of:
Selecting the Most Effective Nudge: Evidence from a Large-Scale Experiment on Immunization
Description

Policymakers often choose a policy bundle that is a combination of different interventions in different dosages. We develop a new technique—treatment variant aggregation (TVA)—to select a policy from a large factorial design. TVA pools together policy variants that are not meaningfully different and prunes those deemed ineffective. This allows us to restrict attention to aggregated policy variants, consistently estimate their effects on the outcome, and estimate the best policy effect adjusting for the winner’s curse. We apply TVA to a large randomized controlled trial that tests interventions to stimulate demand for immunization in Haryana, India. The policies under consideration include reminders, incentives, and local ambassadors for community mobilization. Cross-randomizing these interventions, with different dosages or types of each intervention, yields 75 combinations. The policy with the largest impact (which combines incentives, ambassadors who are information hubs, and reminders) increases the number of immunizations by 44% relative to the status quo. The most cost-effective policy (information hubs, ambassadors, and SMS reminders but no incentives) increases the number of immunizations per dollar by 9.1% relative to status quo.

Abstract

We organized three evaluations of the paper "Selecting the Most Effective Nudge: Evidence from a Large-Scale Experiment on Immunization"[1]. The evaluations are generally extremely positive. However, evaluator 2 expresses some doubts about the novelty of the authors’ treatment variant aggregation (TVA) approach and its practical advantages relative to Bayesian estimators more sophisticated than the ones the authors tested. Evaluator 3 has concerns about the robustness of the assumptions behind the econometric justifications for TVA; they also strongly encourage the authors to share software and provide further guidance. To read these evaluations, please see the links below.

Evaluations

1. Anonymous evaluation 1

2. Anonymous evaluation 2

3. Anonymous evaluation 3

Overall ratings

We asked evaluators to provide overall assessments as well as ratings for a range of specific criteria.

I. Overall assessment: We asked them to rank this paper “heuristically” as a percentile “relative to all serious research in the same area that you have encountered in the last three years.” We requested they “consider all aspects of quality, credibility, importance to knowledge production, and importance to practice.”

II. Journal rank tier, normative rating (0-5): On a ‘scale of journals’, what ‘quality of journal’ should this be published in? (See ranking tiers discussed here.) Note: 0 = lowest/none, 5 = highest/best.

  • Anon. evaluator 1: Overall assessment (0-100): 98; Journal rank tier, normative rating (0-5): 5.0

  • Anon. evaluator 2: Overall assessment (0-100): 95; Journal rank tier, normative rating (0-5): 4.7

  • Anon. evaluator 3: Overall assessment (0-100): 85; Journal rank tier, normative rating (0-5): 4.5

See “Metrics” below for a more detailed breakdown of the evaluators’ ratings across several categories. To see these ratings in the context of all Unjournal ratings, with some analysis, see our data presentation here.

See here for the current full evaluator guidelines, including further explanation of the requested ratings.

Evaluation summaries

Anonymous evaluator 1

This is an absolutely superb paper tackling a hugely important policy question. The authors develop a new econometric approach to aggregating high-dimensional factorial designs in RCTs in order to identify the most effective policies, and apply it to the question of increasing childhood immunization rates. I am thoroughly impressed by every aspect of this paper: the dataset used, the specific treatment arms used, the policy relevance, and the approach to identifying the best policy.

Anonymous evaluator 2

Strengths:

  • Introduces a new technique - treatment variant aggregation (TVA) - to answer policy-relevant questions

  • Strong theoretical and simulation results grounded in real-world data

  • Demonstrates advantages in both selection and estimation of policies by comparing to many alternative estimators

  • Application to an urgent, real-world problem

Weaknesses:

  • Simulations rely on parameters from a single dataset, limiting generalizability

  • Unclear practical advantage of TVA in selecting better policies compared to alternatives

  • Bayesian estimators [that are more sophisticated than the ones they tested] could rival or surpass TVA

Anonymous evaluator 3

The method is mostly a clever combination of existing methods; the main contribution is showing [the] consistency and normality of their estimator. These results need not always hold, as they depend on assumptions, and so applicability is not universal. The authors provide guidance by presenting simulations showing that for n > 3000, normality seems to hold. The field data reveal no surprise[s], and all the interventions have been tested before. Yet, the paper is strong, and the method will possibly remain relevant despite the recent advances in adaptive experimental design, if the authors provide additional guidance or software.

Metrics

Ratings

See here for details on the categories below, and the guidance given to evaluators.

Evaluator 1: Anonymous**; Evaluator 2: Anonymous; Evaluator 3: Anonymous

Each category below lists, in order, Evaluator 1 | Evaluator 2 | Evaluator 3, giving the rating (0-100) followed by its 90% credible interval (0-100) in parentheses.

  • Overall assessment: 98 (96, 100) | 95 (80, 99) | 85 (75, 90)

  • Advancing knowledge and practice: 97 (95, 100) | 95 (80, 99) | 85 (80, 90)

  • Methods: Justification, reasonableness, validity, robustness: 97 (95, 100) | 90 (70, 95) | 80 (50, 100)

  • Logic & communication: 96 (93, 100) | 90 (70, 95) | 80 (50, 100)

  • Open, collaborative, replicable: 96 (92, 100) | 90 (70, 95) | 70 (25, 80)

  • Real-world relevance: 100 (100, 100) | 95 (75, 99) | 95 (70, 100)

  • Relevance to global priorities: 100 (100, 100) | 85 (70, 95) | 95 (90, 100)

** Manager’s note: We will not incorporate the ratings from Evaluation 1 into our database/dashboard and analysis, as these show signs of being uncalibrated. In general, we ask our evaluators to take a Bayesian approach to specifying their credible intervals for percentiles and predictions, avoiding degenerate intervals.

Journal ranking tiers

See here for more details on these tiers.

Evaluator 1: Anonymous**; Evaluator 2: Anonymous; Evaluator 3: Anonymous

Each judgment below lists, in order, Evaluator 1 | Evaluator 2 | Evaluator 3, giving the ranking tier (0-5) followed by its 90% credible interval in parentheses.

  • On a ‘scale of journals’, what ‘quality of journal’ should this be published in? 5.0 (5.0, 5.0) | 4.7 (4.0, 5.0) | 4.5 (4.0, 5.0)

  • What ‘quality journal’ do you expect this work will be published in? 5.0 (5.0, 5.0) | 4.7 (4.0, 5.0) | 4.7 (4.0, 5.0)

See here for more details on these tiers.

We summarize these as:

  • 0.0: Marginally respectable/Little to no value

  • 1.0: OK/Somewhat valuable

  • 2.0: Marginal B-journal/Decent field journal

  • 3.0: Top B-journal/Strong field journal

  • 4.0: Marginal A-Journal/Top field journal

  • 5.0: A-journal/Top journal

Evaluation manager’s discussion

Unjournal process notes

Note on versions: Evaluator 1 and Evaluator 2 considered an earlier (Sept. 2022) version of this paper, the version on arXiv here. Evaluator 3 considered the June 2024 version linked here. (Evaluators 1 and 2 were given a chance to adjust their evaluations to the more recent version but declined to do so.)

We shared a set of “bespoke evaluation notes” with the evaluators; these are linked here, and included several specific suggestions for “some key issues and claims to vet”.

Author engagement

We reached out to the authors on April 9, 2024 to let them know we were commissioning the evaluation of this paper, to ask whether they had any particular requests, and to ask about any newer or forthcoming versions. On August 10 we shared these evaluations with the authors; they acknowledged receipt but declined to respond, noting that the paper had been accepted for publication in the journal Econometrica.

Why we chose this paper

This paper seems to be already having a direct influence on funding and policy in impactful areas. According to the member of our team who suggested it, the paper is “already informing a J-PAL project in India, as well as Suvita.” The discussion of incentives versus reminders seems particularly relevant; it would seem to inform charities like New Incentives.

The methodological question is how to “select the best policy among potential bundles that combine several interventions” and report a credible measure of the impact of this ‘best’ intervention. This seems valuable for a range of high-impact contexts, from global health interventions to fighting misinformation to promoting effective charitable giving. I expect approaches like TVA (or more Bayesian approaches to this question) to be increasingly used in research and practice.
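For intuition on why reporting the ‘best’ policy’s effect is not trivial, here is a toy Monte Carlo sketch (in Python, with entirely hypothetical effect sizes and noise levels) of the winner’s curse that the paper’s procedure adjusts for: when many noisy arm estimates are compared, the arm that looks best tends to have an inflated estimated effect.

```python
# A toy Monte Carlo sketch of the winner's curse, with entirely hypothetical
# numbers: across 75 noisy arm estimates, the arm that *looks* best tends to
# have an estimated effect well above its true effect.
import numpy as np

rng = np.random.default_rng(0)
true_effects = np.linspace(0.0, 0.10, 75)  # hypothetical true effects of 75 variants
se = 0.05                                  # hypothetical sampling noise per arm

naive_winner_estimates, true_effects_of_winner = [], []
for _ in range(5000):
    estimates = rng.normal(true_effects, se)   # one simulated experiment
    winner = estimates.argmax()                # select the best-looking arm
    naive_winner_estimates.append(estimates[winner])
    true_effects_of_winner.append(true_effects[winner])

print(f"mean naive estimate for the selected arm: {np.mean(naive_winner_estimates):.3f}")
print(f"mean true effect of the selected arm:     {np.mean(true_effects_of_winner):.3f}")
# The first number exceeds the second, which is why a winner's-curse
# adjustment is needed before reporting the best policy's effect.
```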

Issues meriting further evaluation

The “issues and claims” we suggested (in the aforementioned notes) were partially addressed by the evaluators. Some examples:

  1. “Lasso then prune”/results relying on sparsity assumptions.
    As one example, we asked the evaluators to consider whether the authors’ ‘lasso then prune’ approach is justified (a minimal illustrative sketch appears after this list). The statistical justification relies on a sparsity assumption, i.e., that in the real world some treatments have precisely zero marginal effect. The third evaluator considered this in part, noting:

Treatments that have no effect whatsoever are either "pooled or pruned (pooled with control)". Scientifically, I am a bit unsure whether this should be allowed; a discussion seems warranted. It is obviously principled and data-driven, but conceptually, it seems wrong to me.

… Whether or not the strong assumptions are likely to be fulfilled in different settings will remain a judgment call.

This issue may merit further, more formal discussion from statisticians and econometricians who work with machine learning/statistical learning.

  2. We asked about the “scalability of the interventions/economic feasibility for more widespread adoption: Can these strategies be effectively implemented in other regions of India or similar contexts, or are there aspects of the context in Haryana state that might affect the generalizability of the results?…” None of the evaluators discussed these issues or considered the context (India, immunizations) in detail.

  3. The authors report that “The most cost-effective policy (information hubs, SMS reminders, no incentives) increases the number of immunizations per dollar by 9.1%.” This is not a value/cost measure, because it depends on the immunization base rate, among other things (see the illustrative arithmetic after this list). We would encourage the authors to estimate and report more directly relevant benefit/cost measures. Future evaluators could offer specific suggestions for this, or estimate these themselves, using the code and data that will presumably be provided along with the final publication in Econometrica.
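On point 1 above, the following is a minimal sketch, under an assumed sparse data-generating process, of what a generic ‘LASSO then prune’ step can look like. It is not the authors’ TVA implementation (which also pools similar variants and adjusts the best-policy estimate for the winner’s curse), and every variable name and number in it is hypothetical.

```python
# A minimal sketch of a "LASSO then prune" workflow on simulated data,
# assuming a sparse setting: most treatment variants truly have zero effect.
# This illustrates the general idea only, not the authors' TVA procedure;
# all names and numbers here are hypothetical.
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(1)
n_variants, n_per_arm = 12, 300
true_effects = np.array([0.0] * 8 + [0.10, 0.15, 0.15, 0.20])  # sparse truth

# One-hot design: column j is 1 if the unit received variant j+1 (arm 0 = control).
arm = rng.integers(0, n_variants + 1, size=(n_variants + 1) * n_per_arm)
X = np.zeros((arm.size, n_variants))
treated = np.flatnonzero(arm > 0)
X[treated, arm[treated] - 1] = 1.0
y = X @ true_effects + rng.normal(0, 0.5, size=arm.size)

# Step 1: LASSO shrinks apparently ineffective variants to exactly zero.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
kept = np.flatnonzero(np.abs(lasso.coef_) > 1e-8)
pruned = np.setdiff1d(np.arange(n_variants), kept)
print("variants kept:", kept, "| pruned (pooled with control):", pruned)

# Step 2: re-estimate the kept variants' effects by OLS on the pruned design.
ols = LinearRegression().fit(X[:, kept], y)
print("post-selection OLS estimates:", np.round(ols.coef_, 3))
```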
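On point 3 above, a short arithmetic sketch with entirely hypothetical figures shows why a change in “immunizations per dollar relative to status quo” can diverge from an incremental cost-effectiveness measure such as the cost per additional immunization.

```python
# Hypothetical numbers only, to illustrate why a change in "immunizations per
# dollar relative to status quo" is not the same thing as an incremental
# cost-effectiveness measure. None of these figures come from the paper.
status_quo_immunizations = 1_000
status_quo_cost = 50_000          # includes delivery costs incurred anyway

policy_immunizations = 1_250      # 25% more immunizations under the policy
policy_cost = 55_000              # status quo cost plus the intervention cost

per_dollar_change = (policy_immunizations / policy_cost) / (
    status_quo_immunizations / status_quo_cost) - 1
cost_per_extra_immunization = (policy_cost - status_quo_cost) / (
    policy_immunizations - status_quo_immunizations)

print(f"change in immunizations per dollar vs. status quo: {per_dollar_change:+.1%}")
print(f"incremental cost per additional immunization: ${cost_per_extra_immunization:.2f}")
# The first metric moves with the base rate and how fixed costs are counted;
# the second prices the marginal benefit directly and is closer to a
# benefit/cost comparison once a value per immunization is assumed.
```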
