
Evaluation 1 of “When Do ‘Nudges’ Increase Welfare?”

Evaluation of “When Do ‘Nudges’ Increase Welfare?” for The Unjournal. Evaluator: Anonymous

Published on May 15, 2024


The present paper reports that nudges can fail to improve utilitarian welfare because they increase higher-order choice distortions. While the paper is highly complex in practically all regards, its general idea is important for public policy. The welfare losses implied by nudges may not be large, and real-life policymaking emphatically does not proceed by maximizing utilitarian welfare, but this paper could represent a step forward in how we think about policy incidence over the whole distribution of behaviors.

Summary Measures

We asked evaluators to give some overall assessments, in addition to ratings across a range of criteria. See the evaluation summary “metrics” for a more detailed breakdown of this. See these ratings in the context of all Unjournal ratings, with some analysis, in our data presentation here.1


90% Credible Intervals

Overall assessment: 70 – 95

Journal rank tier, normative rating: 3.5 – 4.2

Overall assessment: We asked evaluators to rank this paper “heuristically” as a percentile “relative to all serious research in the same area that you have encountered in the last three years.” We requested they “consider all aspects of quality, credibility, importance to knowledge production, and importance to practice.”

Journal rank tier, normative rating (0-5): “On a ‘scale of journals’, what ‘quality of journal’ should this be published in? (See ranking tiers discussed here.)” Note: 0 = lowest/none, 5 = highest/best.

See here for the full evaluator guidelines, including further explanation of the requested ratings.

Written report

This is a review of ‘When Do “Nudges” Increase Welfare?’ by Allcott et al. (September 20, 2023 version). I review this version rather than the NBER version because it is more recent.


The authors show theoretically how nudges change prices in equilibrium and how their economic effects differ from those of taxes. Even when nudges change average behavior in beneficial ways, they may increase the variance of distortions, thereby decreasing welfare. They use two artefactual incentivized experiments to test these theoretical predictions.


  1. This paper appears to be a very substantial contribution to the study of nudges and similar behavioral interventions. While the theoretical analysis hinges on the assumption that taxes are set optimally, there is of course a large literature on the heterogeneous effects of policy. To pick one example, Manning, Blumberg, and Moulton (1995)[1] show that different types of alcohol consumers respond differently to increases in alcohol taxes. That policy has heterogeneous effects is by itself not a novel insight, but it is nonetheless important.

  2. The most significant finding in this paper is in section 1.4, equation 6. That nudges can influence prices in equilibrium is a result of high practical relevance. Previous work may have implicitly assumed nudges to be “too small.” Section 1.4 would deserve a more central place in the paper. Policymakers simply cannot know (Hayek 1945)[2] what the boundaries of interventions are, even when their effects are thought to be local.

  3. The presentation of the paper is somewhat difficult. Throughout, the paper uses highly technical and complicated language. But simple is better than complex, and complex is better than complicated. An example is “The remaining bars present conditional ATEs within the sample of participant-by-product pair observations with below-median or above-median bias or externality estimates” (p. 22). In other cases, the paper uses nonstandard terms such as “highly statistically insignificant” (sic, emphasis mine, p. 22). The introduction is already quite technical; section 1, in my opinion, is appropriate. Section 2 is almost unintelligible, because the experimental design is never clearly stated. Even though the “cars” and “drinks” experiments share many features (as hinted at in section 2 after the heading), readers require clarity about the multiple elicitations and the exact procedure that subjects underwent. Understanding of the “cars” experiment is further complicated by the unceasing use of brand and product names that are unnecessary for comprehension. Section 3 is highly technical; many econometric details could be moved to the appendix.

  4. Once the experiments are understood, they certainly make sense. However, I do have issues with how subjects would perceive certain aspects of them. Most striking is the WTP elicitation for car leases under the assumption that gas is free. This seems difficult to make sense of intuitively. Because this setup is so unusual, the experiment will inevitably pick up differences in numeracy, and any elicitation will also be affected by cognitive uncertainty (Enke and Graeber 2019)[3]. Bias is only identified under the strong assumption that such considerations are irrelevant. While recent work has demonstrated that demand effects may not matter much or at all in experimental designs (De Quidt, Haushofer, and Roth 2018)[4], they may still be significant here due to the multiple elicitations.

  5. In general, I have to praise the transparency with which the researchers state their assumptions. Even though sections 3 and 4 are highly technical—and contain some of the densest writing I have ever encountered—I believe the authors take assumptions and models seriously. On the one hand, the models described in section 3 are clearly in reduced form. On the other hand, the authors make substantial assumptions to identify otherwise structural parameters such as Var[τ]. This positions their econometric model somewhere between “reduced form” and “structural” approaches; it also means that the paper takes a “model” approach, in which many objects of interest are not directly observable through the “design” but only implicitly contained.

  6. It should be clarified that some of the effects seem quite small, especially for cars. The abstract already states that “labels […] may decrease welfare.” The authors use figures effectively to communicate their ideas, but they should seek plausible numerical configurations that allow them to make clearer statements about the magnitude of nudge effects in their examples. It is important to note that this paper represents a “proof of concept”—and that is a true strength of this work. It is still important to provide readers with tangible, real-life ideas about the effect of some sample interventions.

  7. In general, I do not believe it to be appropriate to drop observations without preregistration. (At the time of review, a full and transparent preregistration was not publicly available.) This point becomes even clearer when the authors write that “excluding these outlying responses is conservative” (p. 15). I obviously understand the wish for precision in estimation, but if it doesn’t matter, then don’t do it. The preregistration should always be followed. If there is no preregistration, then no data should be dropped.

  8. In general, if a difference is “not statistically significant,” then no conclusion should be drawn as if it were. The manuscript (p. 22) contains several instances of suggestive phrases that frame the experimental results in terms favorable to the authors’ interpretation. That is not permissible. However, I am not sure whether section 3.2.2 is necessary. Its many caveats paired with overconfident statements certainly make for a confusing read. The entire paper could benefit from clarifications and simplifications.

Global priorities

How does this paper help us deal with global priorities, such as statecraft, governance, and democracy? It certainly highlights the challenges of policymaking. While the paper itself represents a “proof of concept” that nudges can diminish welfare through unexpected channels, we have to ask what this implies. On the one hand, optimal policies are shown to be difficult to design (and the paper’s application of optimal tax theory should not be taken as a prediction that taxes always work optimally).

However, the public finance literature recognized early that policymakers can insert their own views into policymaking, even if such insertion is justified on the presumption of asymmetric information (e.g., the case of “merit goods”: Musgrave 1959, 1956)[5][6]. Furthermore, Peltzman (1976)[7] considered a situation in which policymakers work to improve their own lot. Recent research by Ambuehl, Bernheim, and Ockenfels (2021)[8] extends this line to paternalism, with paternalists imposing their own preferences. To that extent, the present paper does not necessarily describe policymaking processes as they currently exist; and because the authors (understandably) miss the true motivations for governance and intervention, this paper is unlikely to improve what is globally important. Their demonstration is still significant, as it shows how optimality is easily missed even with well-intentioned, utilitarian policymakers.

Open science and replicability

The authors provide a full replication package. Furthermore, their experimental design—once fully understood—is easy to replicate materially. The authors are transparent about their assumptions vis-à-vis econometrics (despite the substantial complexity of their analyses). In this regard, the authors made laudable efforts.


The present paper reports that nudges can fail to improve utilitarian welfare because they increase higher-order choice distortions. This is a highly original contribution. While the paper is highly complex in practically all regards, its general idea is a curious addition to what we know about public policy. The welfare losses implied by nudges may not be large, and real-life policymaking emphatically does not proceed by maximizing utilitarian welfare, but this paper could represent a step forward in how we think about policy incidence over the whole distribution of behaviors.

Further discussion in response to management notes/questions2

Managers’ notes: We normally provide evaluators with a set of ‘bespoke evaluation notes’ in advance. We neglected to share these in advance, but the evaluator graciously responded to these afterwards. We excerpt these below, leaving out those questions which the evaluator already addressed above.

Managers: Is a model in which government adjusts taxes/subsidies optimally a relevant one for this discussion? …

Evaluator: This article is embedded in a recent literature that tries to compare the relative efficacy of taxes vis-à-vis alternative policy measures. One of the earliest works that clearly articulates this line of research is Loewenstein & Chater (2017, reference in original paper), which has been followed up by various theoretical and empirical investigations. To properly compare the performance of alternative policy measures (such as nudges), it is necessary and proper to fix the alternative that nudges are compared against at the optimal level. Of course, taxes may (and do) reduce welfare; they are emphatically not neutral. But this literature seeks to “steelman” taxes and ascertain whether nudges hold up to the theoretical optimum achievable with taxes.

Managers: …Don’t we think of nudges usually exactly in places where we have reason to believe behavior can have an impact, and where we see no other easy way to steer people in the right direction, e.g., as adequate taxes are politically infeasible?

Evaluator: I do not see it that way. Governments everywhere have many tools at their disposal; nudging is a recognition that there may be additional “knobs” available. The selling point of nudges is that they have minimal cost yet still are effective. If high taxes on sugary drinks or cars are politically infeasible, then nudges compare favorably, but that is only one possible alternative. Nudging may simply be used to steer a different kind of consumer. In fact, the recognition of the heterogeneous effects of policy tools is one of the major contributions of the authors.

Managers: “We apply our framework to randomized experiments evaluating automotive fuel economy labels and sugary drink health labels ...” These experiments were run on a particular pool of participants, asking questions about particular contexts. Are the inferences made from these participants in this context meaningful?…

(If so) do the results in this context meaningfully generalize in ways that will help inform (globally) important real-world policies? 

Evaluator: See the main review. The experiment is necessarily highly stylized and artefactual. Insofar as the authors attempt to verify theoretical predictions, the experiment is conceptually sound. However, as stated in the main review, the design is quite complex and the presentation could be much improved.

Neither the experiment nor the theoretical exhibition can help with global priorities because governments are by their nature ignorant of welfare; and welfare is indeed not what is optimized.3

Managers: There were some real incentives to choose according to one’s true preferences (in the drinks case?) or to state WTP ‘correctly’ as imputed from their earlier survey responses (in the autos case). Were these incentives strong, meaningful, and correctly aligned?

Evaluator: The incentives are of a typical size for experiments, but they are not as “high stakes” as real-world decisions. Conceptually, the idea of rewarding participants according to their consistency is sound, but the complexity of the experimental design may inhibit rational responses in the first place. Issues with MPLs [multiple price lists] are well known. However, such a skew would be systematic and independent of treatment. Sadly, the authors did not present any discussion of the inherent cognitive load imposed by their design.

Managers: [Please consider issues surrounding] the statistical modeling of the results, statistical inferences, the structure of any structural (?) model used to derive welfare implications, and consistency with the pre-registration/pre-analysis plan.

Evaluator: The most important issue I see is that the authors never give an example for plausible parameter values. To gauge the importance of welfare losses of nudging, we need to know some values. Indeed, the example that they give hints at tiny losses (section 4.4). See the main review.

The statistical model is quite sophisticated and indeed almost of a structural nature. This makes it susceptible to modeling choices and “researcher degrees of freedom,” but we cannot know to what extent. I was not able to find a meaningful preregistration or preanalysis plan. This must count against the authors, as do the significant exclusions that they applied seemingly without reason (see section 2.1: “Excluding these outlying responses is conservative”). If sample sizes and the exclusions were preregistered (and perhaps even the analyses committed to via a preanalysis plan), I will happily withdraw this point. Nonetheless, all assumptions are made transparently and a replication package is provided.


[1]Manning, Willard G, Linda Blumberg, and Lawrence H Moulton. 1995. “The Demand for Alcohol: The Differential Response to Price.” Journal of Health Economics 14 (2): 123–48.

[2]Hayek, Friedrich August von. 1945. “The Use of Knowledge in Society.” The American Economic Review 35 (4): 519–30.

[3]Enke, Benjamin, and Thomas Graeber. 2019. “Cognitive Uncertainty.” National Bureau of Economic Research.

[4]De Quidt, Jonathan, Johannes Haushofer, and Christopher Roth. 2018. “Measuring and Bounding Experimenter Demand.” American Economic Review 108 (11): 3266–3302.

[5]Musgrave, Richard A. 1959. “The Theory of Public Finance: A Study in Public Economy.”

[6]Musgrave, Richard A. 1956. “A Multiple Theory of Budget Determination.” FinanzArchiv/Public Finance Analysis, no. H. 3: 333–43.

[7]Peltzman, Sam. 1976. “Toward a More General Theory of Regulation.” The Journal of Law and Economics 19 (2): 211–40.

[8]Ambuehl, Sandro, B Douglas Bernheim, and Axel Ockenfels. 2021. “What Motivates Paternalism? An Experimental Study.” American Economic Review 111 (3): 787–830.

Evaluator details

  1. How long have you been in this field?

    • [5-10 years]4

  2. How many proposals and papers have you evaluated?

    • About 8
