
Evaluation 2 of "When do "Nudges" Increase Welfare?"

Evaluation of "When do "Nudges" Increase Welfare?" for The Unjournal. Evaluator: Anonymous

Published on May 15, 2024


This is a very strong and interesting paper. More consideration of the welfare impacts of nudges is to be welcomed, and there is clearly a gap in the market for considering heterogeneous effects. However, there are concerns about whether the experimental method is really designed to answer questions of this type.

Summary Measures

We asked evaluators to give some overall assessments, in addition to ratings across a range of criteria. See the evaluation summary “metrics” for a more detailed breakdown of this. See these ratings in the context of all Unjournal ratings, with some analysis, in our data presentation here.1


90% Credible Interval

Overall assessment: 76 - 96

Journal rank tier, normative rating: 3.3 - 4.1

Overall assessment: We asked evaluators to rank this paper “heuristically” as a percentile “relative to all serious research in the same area that you have encountered in the last three years.” We requested they “consider all aspects of quality, credibility, importance to knowledge production, and importance to practice.”

Journal rank tier, normative rating (0-5): “On a ‘scale of journals’, what ‘quality of journal’ should this be published in? (See ranking tiers discussed here.)” Note: 0 = lowest/none, 5 = highest/best.

See here for the full evaluator guidelines, including further explanation of the requested ratings.

Written report

As articulated by the authors of this paper, there has been a dramatic rise in the use of interventions based on behavioural insights, or nudges, over the last decade or more, starting with governmental efforts in the United Kingdom. Understanding the welfare effects of these ‘nudges’ is valuable, insofar as it allows us to better understand the impacts of the behavioural science movement more generally.

In general, I think the authors have set themselves a challenging task, and have approached it in a defensible and sensible way. I do not feel particularly well qualified to comment on the theoretical section of the work, but there is nothing in this that seems prima facie problematic.

My comments on the remainder of the paper come under a small number of headings: Pedantry; Improvements on previous work; Samples and Populations; Experiment Design and analysis; Public Policy.


Pedantry

The term ‘non-standard policy instruments (NPIs)’ is not one I am familiar with. It invokes DellaVigna’s (2009) [1] model of non-standard preferences, beliefs and decision-making. I am not a fan: there are many interventions (behavioural and otherwise) that are non-standard instruments in policy terms but are not nudges. This feels like overclaiming, although I understand that the word nudge can seem silly. I would like a clearer understanding of the scope of this project.

Improvements on previous work

Recent meta-analyses of the use of nudges, and assessments of their cost-effectiveness, have been conducted in a fairly blunt way, asking, for example, whether nudges have, on average, positive effects on their outcomes of interest. This tells us whether, on average, behavioural interventions are making a positive change, but it tells us little about the effectiveness of individual nudges and cannot usefully inform forecasting; it is therefore of fairly limited value. Implicit in this paper’s approach is the recognition that governments rarely care only about average effects on the population, but rather about effects on particular groups. In this sense, the effort to consider non-mean effects on welfare is welcome.

There is also considerable attraction to a paper that considers a small number (two) of examples in depth, rather than trying to analytically ‘boil the ocean’.

Samples and Populations

The paper’s findings are reasonable on their own terms, and recognise the value of looking at heterogeneity of effects. From the perspective of a policymaker, it would be useful to consider policymakers’ objective function when thinking about welfare improvements. For example, policymakers might value welfare benefits for poor or disadvantaged groups much more highly than those for richer people. Estimating group-specific welfare effects is therefore paramount, and these effects then need appropriate weighting.

However, in many experiments, particularly field experiments where budgets are constrained, the sample has already been narrowed to a group for which the main question of interest is the average effect of the intervention. For example, Burland et al. (2023) [2] narrow their study sample to the participants of most interest. Against this backdrop, decision rules are needed for when to consider heterogeneous welfare impacts and when not to.

Experiment Design and analysis

The experiments are well designed to answer their research questions, and per the references in the paper itself, are mostly conducted in ways that are similar to other published work. There is hence relatively little value in nitpicking elements of the study design.

As such, I focus my review on those areas of the experimental design and analysis that are particular to this paper, which emerge as a consequence of this paper’s aims in contrast to other papers with similar experimental designs.

My first concern is about multiple comparisons, particularly when looking for heterogeneous treatment effects. Doing so involves far more potential comparisons than looking at mean effects. This is problematic: given enough tests, some group for whom a positive welfare effect can be found is likely to exist, and the process of searching for it reduces our confidence both in those effects and in all other tests. A rigorous approach to identifying heterogeneous effects requires either a gateway (whereby we look for heterogeneity only in the presence of a mean effect), which seems problematic given the paper’s arguments about the flaws of mean effects for assessing welfare; or much larger sample sizes married to pre-registration of a priori hypotheses about differential impacts.
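To illustrate the scale of the problem, here is a back-of-the-envelope sketch (my own illustration, not an analysis from the paper): under k independent subgroup tests at significance level α, the probability of at least one false positive is 1 − (1 − α)^k, and a Bonferroni correction caps it by testing each hypothesis at α/k instead.

```python
# Illustration of how the family-wise error rate (FWER) grows with the
# number of independent subgroup tests, and how Bonferroni caps it.
# Assumes independent tests; real subgroup tests are often correlated,
# which changes the exact numbers but not the qualitative point.

def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """P(at least one false positive) across n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests

alpha = 0.05
for k in (1, 5, 20):
    # For 20 tests at alpha = 0.05 the FWER is about 0.64.
    print(f"{k:2d} tests: FWER = {familywise_error_rate(alpha, k):.3f}")

# Bonferroni correction: test each hypothesis at alpha / n_tests.
k = 20
adjusted = alpha / k
print(f"Bonferroni threshold for {k} tests: {adjusted:.4f}; "
      f"FWER <= {familywise_error_rate(adjusted, k):.3f}")
```

With 20 subgroup comparisons, uncorrected testing finds at least one spurious “positive welfare effect” roughly two times in three, which is why pre-registered hypotheses or explicit corrections matter here.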

My second concern is about what is being measured in the car experiment. Participants’ incentives are tied to their ability to approximate the willingness to pay (WTP) of participants in another, hypothetical survey. In effect, this is a beauty contest in which participants attempt to approximate the preferences (biases and all) of other participants, rather than necessarily expressing their own preferences.

Third, while the drink experiment is in theory incentive compatible, the stakes are low, with a one-in-five chance of receiving a drink and paying for it. Although this is a reasonable way to elicit preferences while minimising participant burden, low stakes are a recognised identification problem for experiments of this kind, and so the welfare implications are likely to be miscalculated. This matters particularly when we are considering heterogeneous effects, as it adds another source of noise.

Public Policy

The authors propose a novel way of estimating the welfare benefits of nudges, or non-standard policy instruments, which evaluates heterogeneous impacts for different groups of participants based on their characteristics or on where they sit in the distribution of prior beliefs. This is a valuable contribution to the way in which we estimate the welfare benefits of these kinds of interventions.

However, we should be cautious about holding low-cost, modest-impact interventions like nudges to a higher standard than that to which we hold other, more ‘standard’ policy interventions. As in other recent debates, such as Chater and Loewenstein (2023) [3], the burden for the success of nudge interventions, in terms of their impacts, their legitimacy, and their welfare implications, is set far higher than for e.g. taxes or regulation.


[1] DellaVigna, S. (2009). Psychology and economics: Evidence from the field. Journal of Economic Literature, 47(2), 315-372.

[2] Burland, E., Dynarski, S., Michelmore, K., Owen, S., & Raghuraman, S. (2023). The power of certainty: Experimental evidence on the effective design of free tuition programs. American Economic Review: Insights, 5(3), 293-310.

[3] Chater, N., & Loewenstein, G. (2023). The i-frame and the s-frame: How focusing on individual-level solutions has led behavioral public policy astray. Behavioral and Brain Sciences, 46, e147.

Evaluator details

  1. How long have you been in this field?

    • 15 Years

  2. How many proposals and papers have you evaluated?

    • ~200
