Abstract
This research provides valuable empirical insights into population ethics intuitions through well-designed experiments. The methodological progression from simple to complex scenarios demonstrates important patterns in moral reasoning. While the utilitarian framework dramatically overlooks crucial dimensions of human flourishing, reducing the validity of these studies, the data nonetheless show how ordinary people resist purely arithmetic approaches to population welfare.
Summary Measures
We asked evaluators to give some overall assessments, in addition to ratings across a range of criteria. See the evaluation summary “metrics” for a more detailed breakdown of this. See these ratings in the context of all Unjournal ratings, with some analysis, in our data presentation here.
| | Rating | 90% Credible Interval |
| --- | --- | --- |
| Overall assessment | 65/100 | 60–80 |
| Journal rank tier, normative rating | 4.1/5 | 3.4–4.4 |
Overall assessment (See footnote)
Journal rank tier, normative rating (0-5): On a ‘scale of journals’, what ‘quality of journal’ should this be published in? Note: 0= lowest/none, 5= highest/best.
Claim identification and assessment
I. Identify the most important and impactful factual claim this research makes
“Next, a one-sample t-test against the midpoint 4 revealed that participants on average judged it as an improvement to add one neutral person into the world, (M = 4.23, SD = 0.67), t(156) = 4.40, p < .001, d = 0.35. This suggests the existence of a weak general preference to create a new person, even if their happiness level is neutral.” (p. 9).
My report states why I believe this claim is important. It could induce effective altruists to take increasing the birth rate more seriously.
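A quick arithmetic check of the reported statistics is possible from the rounded summary figures alone (n = 157 follows from df = 156; small discrepancies against the reported t = 4.40 and d = 0.35 are attributable to rounding of M and SD):

```python
import math

# Rounded summary statistics as reported for Study 2a (p. 9)
n = 157            # df = 156 implies n = 157
mean, sd = 4.23, 0.67
midpoint = 4       # scale midpoint the t-test is run against

# One-sample t statistic and Cohen's d recomputed from the rounded values
t = (mean - midpoint) / (sd / math.sqrt(n))
d = (mean - midpoint) / sd

print(round(t, 2), round(d, 2))
```

The recomputed values (t ≈ 4.30, d ≈ 0.34) are consistent with the reported figures up to rounding of the inputs.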
II. To what extent do you *believe* the claim you stated above?
Credible Interval: [0.8, 1]
III. Suggested robustness checks
Study 2a should be globally replicated; whether a new neutral person is seen as beneficial to this world should be correlated with life conditions in each country that is studied.
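The suggested cross-country analysis amounts to a simple correlation. A sketch with entirely invented country-level numbers (the `endorsement` and `conditions` values are fabricated for illustration only):

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented country-level data: mean endorsement of adding a neutral person
# (1-7 scale) against a hypothetical life-conditions index (0-1); better
# conditions go with lower endorsement in this fabricated illustration
endorsement = [4.6, 4.3, 4.1, 3.9, 4.4]
conditions = [0.45, 0.60, 0.75, 0.85, 0.55]

r = pearson(endorsement, conditions)
print(round(r, 2))
```

A strongly negative correlation in real data would suggest that the perceived value of a new neutral life tracks local living conditions.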
Written report
This research evaluates how ordinary people weigh human happiness on a population level. The authors conduct experiments to understand how various factors shape such preferences.
Important assumptions made by this research
This research is able to inform us about how people think about trade-offs between happiness and population size under the critical assumption that their choices within the experimentally provided settings are meaningful.
Experiments and their analyses are constructed by researchers to study a particular aspect of reality. In order to do so, a number of important choices have to be made. The problem is that simple results can often be obtained with fewer assumptions, while less simple results require more “structure,” which makes the experiment more arbitrary and complex. For example, an experiment to see whether people believe that more good is required to cancel out bad (both within a population and within a lifetime, study 1a) requires fewer instructions than an experiment that carefully attempts to account for the intensity of good and bad (study 1b and others).

The introduction of the happiness scale pins down subjects’ preferences only under the assumption that the scale is accepted by subjects. In study 1b, however, we do not know whether subjects actually accept the scale; behavior that does not fit neatly within the utilitarian framework is nonetheless identified and analyzed as meaningful. This problem is not present in later studies, where subjects had to accept such underlying assumptions before proceeding. But such an acceptance requirement is a double-edged sword. On the one hand, it ensures some basic validity of responses. On the other hand, it selects away those respondents who are not ready to accept researcher suppositions, perhaps with good reason for not doing so.
Second, this research rests on the implicit assumption that choices between populations are somehow legitimate (see also the next section of this evaluation). The experimental subjects had no prior experience choosing between populations, nor did the experiments contain any variable payment for choosing correctly. What, then, do these experiments measure? They might measure intuitions, but they could also measure social norms, or what respondents deemed appropriate. It is important to note that this issue exists in all treatments and hence does not affect the identification of differences between treatments. But it does affect baseline values, which are heavily discussed throughout the paper.
Choices between populations
One hopes that no one should ever have to, let alone wish to, choose between populations. As Jean-Luc Picard once so eloquently put it, “I refuse to let arithmetic decide questions like that.” The choices presented in the experiments are, thankfully, hypothetical and lack any kind of realism.
Nonetheless, while the paper discusses at great length the varieties of utilitarian thought on this issue, the fact that utilitarianism in any variety frames such questions as askable and answerable is itself a severe indictment of it. Perhaps most fundamentally, such questions can only be considered in the abstract. To what extent such judgments, removed from any relevant experience, are “axiological” rather than simply arbitrary is open to debate.
This points to a deeper problem with the utilitarian framework employed and accepted by this paper. It assumes that human flourishing can be reduced to a single metric that can be meaningfully aggregated, compared, and traded off across populations. This reduction obliterates the rich complexity of what makes a life worth living. The experiments force participants into a framework where qualities like courage, wisdom, justice, and temperance (what non-utilitarian philosophers understood as virtues) simply disappear from view.
Even if one were to accept utilitarianism as true, there is a deeper challenge to the entire enterprise of measuring and comparing happiness across populations: the well-documented phenomenon of hedonic adaptation. Humans adjust remarkably to both positive and negative changes in their circumstances, with lottery winners returning to baseline happiness and accident victims often reporting life satisfaction levels similar to their pre-accident state. This psychological reality undermines the very foundation of these experiments, which ask participants to imagine and evaluate fixed levels of happiness as if these were stable, measurable states rather than fleeting experiences to which people inevitably adapt. The researchers’ happiness scale implies a kind of hedonic arithmetic that simply doesn’t correspond to lived human experience. If a person at +100 happiness will likely adapt downward and a person at −50 will likely adapt upward, what exactly are we measuring when we ask people to choose between populations with different distributions of these ephemeral states? This is not merely a technical problem but points to a deeper issue: the utilitarian framework’s impoverished understanding of human flourishing, which mistakes temporary hedonic states for the kind of lasting eudaimonia that comes from the cultivation of virtue and the life well lived.
The knowledge problem in population ethics
The research encounters what philosopher and economist F. A. Hayek would identify as a fatal conceit: the assumption that we can possess the kind of knowledge necessary to make these judgments. The dispersed, tacit, and highly contextualized knowledge that individuals possess about their own circumstances, relationships, and conceptions of the good life cannot be captured in a happiness scale, let alone aggregated. For that reason, too, the idea of constructing “symmetricity” in happiness and suffering must be rejected.
Consider Study 3c, where participants are asked to choose between populations of different sizes with varying average happiness levels. The experiment assumes that “happiness” means the same thing to a subsistence farmer in Bangladesh as it does to a software engineer in Silicon Valley, or that these experiences can be meaningfully compared on a single scale. This is not merely a measurement problem. It reflects a fundamental epistemological error about the nature of human experience and the limits of our knowledge. Needless to say, the authors are not to blame for this error. Rather, it is their utilitarian ancestors who are responsible.
Furthermore, the experiments reveal an interesting tension: when prompted to think “reflectively” rather than “intuitively,” participants’ responses shifted toward totalism. But what if this shift represents not greater wisdom but rather the corrosive influence of abstract theorizing divorced from lived experience? Here lie the dangers of constructivist rationalism: the belief that we can design optimal outcomes through abstract reasoning and deliberation alone.
Implications for governance
Perhaps most concerning are the potential policy implications that might be drawn from this research. The authors note that understanding population ethics has “direct implications for decision-making” and “global priority setting.”
The experiments inadvertently demonstrate why judgments based on abstract aggregate calculations are so dangerous. In Study 3d, participants exhibited what the authors call “averagist” tendencies even in cases that lead to the “Sadistic Conclusion.” That ordinary people might endorse such conclusions when thinking abstractly should give us pause about using these frameworks to guide real policy.
Nowhere in these 24 pages do we encounter discussion of individual rights, consent, or the relationships that give meaning to human lives. The experimental subjects are asked to evaluate populations as if they were spreadsheet entries rather than communities of people bound together by love, friendship, obligation, and a shared history and future. This reflects the fundamental poverty of the utilitarian approach.
The question “should we create a new happy person?” cannot be meaningfully answered in the abstract: it depends entirely on whether there are people ready and willing to love, nurture, and guide that person toward virtue. The experiments’ finding that people don’t view creating new people as “morally neutral” might reflect this intuition, though the utilitarian framework cannot properly capture it, let alone reduce it to a single statistic.
The findings of these nine studies are certainly informative, to some extent, about people’s (average, as discussed below) value judgments. I believe they could be used to inform certain parameters of effective altruist priorities where no strong normative “reference points” exist; just as surveys may be used to ascertain acceptable discount rates, the present studies could be used to re-evaluate the relative value of charitable activities that involve the creation of new human life (study 2a). Studies like this can help reveal truths that are somehow missed by academics, and perhaps effective altruists should spend more time advocating for increasing birth rates. I do not believe, however, that these studies tell us much about current vs. future populations (no discount rates), or that they can inform deeply normative issues (the is–ought problem). That judgments are common does not make them right; but it may well make them more acceptable, and that can truly matter.
Design aspects, construct and construal validity
Within the utilitarian framework, these experiments demonstrate reasonable construct validity. The authors’ scenarios do map onto the philosophical distinctions they aim to test. The distinction between averagism and totalism is cleanly operationalized, the asymmetry between creating happy versus unhappy lives is directly tested, and the researchers carefully control for intensity levels in later studies. If one accepts that population ethics questions are meaningful and that happiness can be meaningfully scaled and aggregated, then these experiments successfully instantiate the relevant philosophical thought experiments. The progression from simple trade-offs to more complex scenarios with explicit intensity controls shows methodological sophistication in isolating the conceptual distinctions that matter to utilitarian theory. To that extent, the experiments are appropriate, minimal and well-designed in general.
The evidence suggests participants struggled not with understanding the scenarios but with their mathematical implications. In Study 3b, fewer than half of participants correctly identified that two populations had equal total happiness, with only 35% in happiness conditions and 26% in unhappiness conditions accurately calculating total welfare levels. This reveals not conceptual confusion but computational difficulty. More fundamentally, as discussed above, the question is whether participants’ responses represent meaningful moral intuitions or merely random behavior when forced into the utilitarian framework. The systematic patterns in the data (consistent preference for averages, asymmetric weighting of suffering) suggest these responses reflect genuine moral commitments, even if those commitments cannot be coherently expressed within received utilitarian arithmetic. The problem is not that people misunderstand what they’re being asked, but that what they’re being asked may be a category error for those who don’t share utilitarian assumptions. However, we cannot know the true extent of this issue.
Aggregation in analysis, heterogeneity
The aggregation of responses across participants reveals a methodological flaw that potentially threatens the paper’s central analyses. When the authors report that “people believe” a certain trade-off ratio between happiness and suffering, they commit a statistical fallacy: the average of heterogeneous moral positions does not represent any actual person’s view, much less “people’s” views (see Michel Regenwetter’s work on this issue). This becomes particularly problematic when comparing across scenarios, as the mean response may shift not because individuals are changing their judgments (as in a kind of unified movement), but because different scenarios activate different subpopulations of moral reasoners. The paper hints at this heterogeneity (noting that 37% of participants in Study 1c believed only 50-51% happiness was needed to outweigh suffering, while most required substantially more), yet fails to pursue this crucial insight systematically.
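A toy illustration of this fallacy, with entirely made-up numbers: if half the sample requires a 1:1 happiness-to-suffering ratio and half requires 3:1, the mean of 2:1 describes no actual respondent.

```python
# Hypothetical bimodal sample: two moral "types" with different
# happiness-to-suffering trade-off ratios (numbers are invented)
ratios = [1.0] * 50 + [3.0] * 50   # 50 respondents at 1:1, 50 at 3:1

mean_ratio = sum(ratios) / len(ratios)
print(mean_ratio)

# The mean is a ratio held by no one in the sample
assert mean_ratio not in ratios
```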
It is deeply regrettable that Figures 6 and 9 present only averages rather than distributions, obscuring what one would expect to be rich, meaningfully clustered within-sample variation. The scattered mentions of response patterns suggest the existence of distinct moral types, perhaps consistent utilitarians, threshold deontologists, and various hybrid reasoners, each applying different decision rules across scenarios. Without identifying these behavioral types, the paper cannot distinguish whether cross-scenario differences reflect genuine context-sensitivity in moral reasoning or merely compositional effects, as different questions activate different moral frameworks in different subsets of participants. The authors’ claim that “people show averagist tendencies” is thus doubly misleading: not only does the average not represent any individual’s consistent view, but it actively conceals principled disagreement about fundamental moral questions. A proper analysis would have used latent class analysis or similar techniques to identify moral types, then traced how these types respond across scenarios. This would have allowed the authors to reveal the structure of population ethics intuitions rather than bury it in broad aggregate statistics.
Statistics and methods
The paper follows generally good practices in its statistical analyses, though it relies almost exclusively on null-hypothesis significance testing: p-values are reported for dozens of individual t-tests and ANOVAs, yet confidence intervals or other direct characterisations of uncertainty are almost entirely absent. The authors do correct for multiple hypothesis testing using Tukey post-hoc tests where necessary. While I did not check every detail in the authors’ preregistrations (where available), I like their style: they contain the essential study information, and at least the most important details do not contradict or deviate from what is reported in the paper.
Some studies include an a priori power analysis. Several others admit that the sample size was set by rough approximations. Because these latter samples lose up to 26% of respondents to exclusions (see below), the detectable effect sizes are unclear. The upshot is that many “no noteworthy differences” may simply be under‑powered comparisons rather than tightly bounded null effects. Moreover, it is regrettable that several studies were not preregistered at all.
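To make the power concern concrete, the minimum detectable effect size under a normal approximation (α = .05 two-sided, 80% power) can be sketched for a two-group comparison; the even two-group split of Study 3c’s pre- and post-exclusion samples is my assumption, purely for illustration.

```python
import math

def mde_two_sample(n_per_group, z_alpha=1.96, z_power=0.8416):
    """Minimum detectable Cohen's d for a two-sample test,
    normal approximation (alpha = .05 two-sided, 80% power)."""
    return (z_alpha + z_power) * math.sqrt(2 / n_per_group)

# Illustrative: Study 3c before (622) and after (461) exclusions,
# assuming an even split into two groups
for total in (622, 461):
    print(total, round(mde_two_sample(total // 2), 2))
```

Under these assumptions the minimum detectable d grows from roughly 0.22 to roughly 0.26, so null results on smaller effects are uninformative.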
High and uneven exclusion rates create a second layer of uncertainty. Study 3c removes 161 of 622 respondents (26%) for failing to accept premises or attention checks, Study 1b drops 75 of 431 (17%) on similar grounds, and Study 3d discards 67 premise-rejecters and 68 attention-check failures (22% in total). As noted above, these exclusions systematically favour respondents willing to accept the researchers’ utilitarian framing, a form of selection bias. Most critically, premise rejection is treated as participant error rather than as meaningful dissent from the underlying assumptions. This decision filters out precisely those intuitions that conflict most strongly with the experimenters’ worldview.
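The reported exclusion rates check out against the raw counts (Study 3d’s initial N is not restated, so only the two fully specified studies are verified):

```python
# Reported exclusions as (excluded, initial N) per study
counts = {"3c": (161, 622), "1b": (75, 431)}

rates = {study: round(100 * dropped / n)
         for study, (dropped, n) in counts.items()}
print(rates)
```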
The current state of the literature does not allow me to assess past uses of online convenience samples. I can only note that at the time these experiments were run (around 2019), these pools were considered reliable and acceptable. That such pools were always non-representative is true but fundamentally irrelevant, as these studies could easily be replicated with any kind of representative sample, perhaps even in different countries.
Conclusion
While this research provides interesting data about how people respond to abstract population ethics scenarios, it demonstrates the limitations of the utilitarian framework rather than its strengths. The careful experimental design and generally appropriate methodological choices cannot overcome the fundamental problems: the illegitimate reduction of human flourishing to a single metric, the knowledge problem inherent in population-level calculations, and the absence of any consideration of what else makes a life worth living.
Most tellingly, the research shows that ordinary people’s intuitions often conflict with utilitarian conclusions. They weigh suffering more heavily than happiness, they care about averages even when the logic suggests they shouldn’t, and they resist certain implications of the framework.
The authors conclude by noting that population ethics intuitions “can be biased and inconsistent.” But perhaps what appears as inconsistency is actually evidence that human moral reasoning cannot and should not be reduced to the kind of arithmetic that utilitarianism demands. This piper need not be paid.
Evaluator details
What is your research field or area of expertise, as relevant to this research?
How long have you been in your field of expertise?
How many proposals, papers, and projects have you evaluated/reviewed (for journals, grants, or other peer-review)?