Evaluation Summary and Metrics: "When do "Nudges" Increase Welfare?"

Published on May 15, 2024
We organized two evaluations of the paper "When do "Nudges" Increase Welfare?" [1]. The evaluators agree that the paper provides a novel, valuable contribution to the understanding of how nudges impact welfare in equilibrium. They acknowledge the paper’s substantive theoretical modeling and its tentative empirical validation. Nonetheless, they highlight several limitations. To read these evaluations, please see the links below.

This paper was selected as part of our (NBER) direct evaluation track.


Anonymous Evaluation 1

Anonymous Evaluation 2

Overall ratings

We asked evaluators to provide overall assessments as well as ratings for a range of specific criteria.

I. Overall assessment: We asked them to rank this paper “heuristically” as a percentile “relative to all serious research in the same area that you have encountered in the last three years.” We requested they “consider all aspects of quality, credibility, importance to knowledge production, and importance to practice.”

II. Journal rank tier, normative rating (0-5): On a ‘scale of journals’, what ‘quality of journal’ should this be published in? (See ranking tiers discussed here.) Note: 0 = lowest/none, 5 = highest/best.

Overall assessment (0-100)

Journal rank tier, normative rating (0-5)

Anon. Evaluator 1



Anon. Evaluator 2



See “Metrics” below for a more detailed breakdown of the evaluators’ ratings across several categories. To see these ratings in the context of all Unjournal ratings, with some analysis, see our data presentation here.

See here for the current full evaluator guidelines, including further explanation of the requested ratings.

Evaluation summaries

Anonymous evaluator 1

The present paper reports on nudges failing to improve utilitarian welfare because they increase higher-order choice distortions. While the paper is highly complex in practically all regards, their general idea is important for public policy. The welfare losses implied by nudges may not be large, and real-life policymaking emphatically does not proceed by maximizing utilitarian welfare, but this paper could represent a step forward in how we think about policy incidence over the whole distribution of behaviors.

Anonymous evaluator 2

This is a very strong and interesting paper. More consideration of the welfare impacts of nudges is to be welcomed, and there is clearly a gap in the market for considering heterogeneous effects. However, there are concerns about whether or not the experimental method is really designed to answer questions of this type.

Metrics



See here for details on the categories below, and the guidance given to evaluators.

| Rating category | Evaluator 1: Rating (0-100) | Evaluator 1: 90% CI | Evaluator 2: Rating (0-100) | Evaluator 2: 90% CI |
| Overall assessment | | (70, 95) | | (76, 96) |
| Advancing knowledge and practice | | (10, 20) | | (45, 55) |
| Methods: Justification, reasonableness, validity, robustness | | (50, 90) | | (66, 84) |
| Logic & communication | | (75, 90) | | (66, 74) |
| Open, collaborative, replicable | | (90, 100) | | (93, 100) |
| Real-world relevance | | (10, 20) | | (17, 32) |
| Relevance to global priorities | | (10, 20) | | (1, 20) |

Journal ranking tiers

See here for more details on these tiers.

| | Evaluator 1: Ranking tier (0-5) | Evaluator 1: 90% CI | Evaluator 2: Ranking tier (0-5) | Evaluator 2: 90% CI |
| On a ‘scale of journals’, what ‘quality of journal’ should this be published in? | | (3.5, 4.2) | | (3.3, 4.1) |
| What ‘quality journal’ do you expect this work will be published in? | | (4.0, 5.0) | | (4.0, 4.9) |


We summarize these as:

  • 0.0: Marginally respectable/Little to no value

  • 1.0: OK/Somewhat valuable

  • 2.0: Marginal B-journal/Decent field journal

  • 3.0: Top B-journal/Strong field journal

  • 4.0: Marginal A-Journal/Top field journal

  • 5.0: A-journal/Top journal

Evaluation manager’s discussion

The two evaluations underscore the paper’s important theoretical contributions, formally highlighting and experimentally underlining the role that (nudge) treatment effect heterogeneity plays, in contrast to traditional welfare analysis focusing on changes in average outcomes. Even if nudges cause average behavior to move in the desired welfare-improving direction, they might lead to greater variance in distortions across individuals, which could lead to net welfare losses. In addition to the impacts through the ‘variance’ channel, the authors’ analysis highlights the importance of (i) the supply side with market power, where firms adjust price in response to changes in demand, and of (ii) considering the presence of distortion-correcting taxation. Both of these tend to reduce the average effect of nudges in equilibrium (considering the tax and pricing responses) so that the nudge’s effects on variance more easily dominate. 
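The ‘variance’ channel can be illustrated with a minimal sketch. It assumes a stylized setting that is not taken from the paper: prices are fixed, and each person’s welfare loss is proportional to the square of their own distortion, so the average loss is proportional to the squared mean distortion plus its variance. The function names and all numbers are hypothetical, chosen only for illustration.

```python
from statistics import fmean, pvariance


def loss_index(distortions):
    """Mean squared distortion: a stylized stand-in for average welfare loss."""
    return fmean([d * d for d in distortions])


def summarize(distortions):
    """Return (mean, variance, loss index) for per-person distortions."""
    mean = fmean(distortions)
    variance = pvariance(distortions)
    # By construction, loss index = mean**2 + variance.
    return mean, variance, loss_index(distortions)


# Hypothetical distortions (assumed values, not data from the paper).
before_nudge = [1.0, 1.0, 1.0, 1.0]   # everyone over-consumes by the same amount
after_nudge = [1.0, 1.0, 1.0, -1.5]   # one person in four overreacts to the nudge

for label, sample in (("before", before_nudge), ("after", after_nudge)):
    mean, variance, loss = summarize(sample)
    print(f"{label}: mean={mean:.3f} variance={variance:.3f} loss index={loss:.3f}")
```

In this example the nudge moves the average distortion toward zero (from 1.0 to 0.375), yet the loss index rises (from 1.0 to 1.3125) because the variance of distortions increases by more than the squared mean falls.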

The evaluators agree that the paper provides a novel, valuable contribution to the understanding of how nudges impact welfare in equilibrium. They acknowledge the paper’s substantive theoretical modeling and its tentative empirical validation. Nonetheless, they highlight several limitations.

Evaluator 1 found the presentation difficult and suggested more directness, clarity, and simpler terminology.

Both evaluators raised concerns about the experimental methods. At least one of the experiments (fuel efficiency) involved a rather complex choice task in an unfamiliar, hypothetical situation (the value of a car if gasoline did not matter). In one experiment, the incentives seemed rather weak (buying a beverage to be delivered with ⅕ probability). 

‘Research degrees of freedom’ were also highlighted, with one evaluator noting some author discretion in dropping “outlying” observations. This raised questions about the reliability of the experiments’ main results, particularly because some of the estimates that find a welfare loss (see the paper’s Table 2) appeared statistically and economically insignificant (or only marginally significant).

Another lingering concern (largely from one of our evaluation managers): “How relevant is this to typical domains of interest?” Is it plausible in practice that the highlighted (negative) effect of increasing variance could dominate the desired effect of the nudge on average behavior? In many cases, nudges may be used where taxation is impractical or politically infeasible. With little or no distortion-correcting taxation, welfare falls on net only if a large enough share of people overreact in the nudge’s desired direction. In addition to the cases used in the experiments (car fuel efficiency and sugary drinks), the authors cite nudge policies aiming at increasing retirement savings, healthy eating, exercise, environmental conservation, and organ donation. Is it plausible that nudges would compel large shares of the population to save for retirement, eat healthily, exercise, protect the environment, or donate organs too much relative to their own (or social) interest? For some of these domains it seems unlikely that many people will be ‘overnudged’, so a focus on the average effects of nudging may be reasonable.
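To make the “high enough share” condition concrete, here is a rough sketch under assumptions of our own, not the paper’s model: no corrective tax, fixed prices, welfare loss proportional to the mean squared distortion, and everyone starting with the same distortion (normalized to 1). After the nudge, a share of people overshoot to a distortion of size `overshoot` in the opposite direction, while the rest keep a `residual` fraction of their initial distortion. Both parameter names are hypothetical.

```python
def break_even_share(overshoot, residual):
    """Share of overreactors at which the nudge leaves the loss index unchanged.

    Loss before the nudge is 1 (normalized). Loss after is
    share * overshoot**2 + (1 - share) * residual**2.
    Welfare falls only if the share of overreactors exceeds the returned value.
    Returns None when no share of overreactors can make the nudge harmful.
    """
    if not 0.0 <= residual < 1.0:
        raise ValueError("residual must lie in [0, 1)")
    if overshoot <= 1.0:
        # Overshooting by no more than the initial distortion never raises the loss.
        return None
    return (1.0 - residual ** 2) / (overshoot ** 2 - residual ** 2)


# Hypothetical parameter values, for illustration only.
print(break_even_share(overshoot=2.0, residual=0.5))
print(break_even_share(overshoot=1.0, residual=0.5))
```

With these assumed values, overreactors who end up twice as far from the optimum as they started would need to exceed 20% of the population before the nudge reduces welfare; if overreactors end up no further from the optimum than they started, the nudge cannot reduce welfare in this sketch.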

Unjournal process

Note: Paper updates, versions considered

Each evaluator considered the most recent NBER or non-NBER version of the paper available at the time of their evaluation: Evaluator 1 considered the 20 Sep 2023 version of the working paper, and Evaluator 2 considered the April 2024 version of NBER Working Paper 30740, "When do "Nudges" Increase Welfare?".

In May 2024, a newer version was published on NBER that addressed at least one concern raised in earlier evaluations—the use of the term 'non-standard policy instrument' (NPI)—as noted by the authors.

Why we chose this paper

We chose this paper because nudges are widely used, including in impactful areas like global health and managing pandemics (see e.g., this Guardian article), and understanding their effects on welfare is important. The paper's findings may affect how policies are considered in the context of the emergence of public sector ‘nudge units’ as well as an important pushback against the ‘nudge agenda’. The importance and potential impact of the paper suggested that a thorough evaluation could provide valuable insights into both the theoretical framework and the credibility of the experiments as a source of evidence.

How we chose the evaluators

We looked for evaluators with:

  1. an applied understanding of public finance/public econ. theory, considering normative issues

  2. experience considering the value of framed choice experiments and elicitations like this, and how generalizable they are to real-world markets of interest.

Evaluation process: Suggestions

We suggested to evaluators that they might consider some of the issues below:

  1. Is a model in which government adjusts taxes/subsidies optimally a relevant one for this discussion? Do their arguments meaningfully carry over to a more general environment?

    • We had a question regarding the relevance of a model where there is zero (or low) pass-through. And to put it very explicitly: Don’t we think of nudges usually exactly in places where we have reason to believe behavior can have an impact, and where we see no other easy way to steer people in the right direction, e.g., as adequate taxes are politically infeasible?!

  2. “Welfare also depends on whether the NPI reduces the variance of distortions from heterogenous biases and externalities, and the average effect becomes irrelevant with zero pass-through or optimal taxes.”

    • Is this main argument correct, are there flaws in it, is it robust to relevant small modifications in the theoretical model?

  3. “We apply our framework to randomized experiments evaluating automotive fuel economy labels and sugary drink health labels ...” These experiments were run on a particular pool of participants, asking questions about particular contexts.

    • Are the inferences made from these participants in this context meaningful? Does the experiment itself add any ‘value of information’? I.e., do we know more about something important in the real world, adjusting or concentrating our beliefs, after having seen the results of the experiment?

    • (If so) do the results in this context meaningfully generalize in ways that will help inform (globally) important real-world policies?

  4. There were some real incentives to choose according to one’s true preferences (in the drinks case?) or to state WTP ‘correctly’ as imputed from their earlier survey responses (in the autos case).

    • Were these incentives strong, meaningful, and correctly aligned?

  5. Statistical modeling of the results, statistical inferences, structure of any structural model used to derive welfare implications; consistency with pre-registration/pre-analysis plan

Author engagement

We reached out to the authors on Feb. 2, 2024 to let them know we were evaluating their paper, asking them about updated versions or other concerns. They did not respond at that time. We notified them again on May 3, sharing the evaluations. They responded on May 9, noting that they did not have the bandwidth to provide a detailed response. They let us know about a version that had been released on NBER in the last few days, noting that it clarified some of the issues the evaluators had raised.

