Evaluation 1 of The Long-Run Effects of Psychotherapy on Depression, Beliefs, and Economic Outcomes

Evaluation of The Long-Run Effects of Psychotherapy on Depression, Beliefs, and Economic Outcomes

Published on Jun 28, 2024


This paper’s most important contribution is the evidence for sustained positive impacts on mental health after 5 years (one of the longest follow-ups in the field), following short, lower-cost, community-based therapy. As critical comments, I don't believe the 'depressive realism' task was well selected, and I see opportunities to better contextualize the effects for the reader, to improve the figures, and to make stronger policy recommendations.

Summary Measures

We asked evaluators to give some overall assessments, in addition to ratings across a range of criteria. See the evaluation summary “metrics” for a more detailed breakdown of this. See these ratings in the context of all Unjournal ratings, with some analysis, in our data presentation here.


                                       90% Credible Interval

Overall assessment                     70 - 84

Journal rank tier, normative rating    3.5 - 4.5

Overall assessment: We asked evaluators to rank this paper “heuristically” as a percentile “relative to all serious research in the same area that you have encountered in the last three years.” We requested they “consider all aspects of quality, credibility, importance to knowledge production, and importance to practice.”

Journal rank tier, normative rating (0-5): “On a ‘scale of journals’, what ‘quality of journal’ should this be published in? (See ranking tiers discussed here.)” Note: 0 = lowest/none, 5 = highest/best.

See here for the full evaluator guidelines, including further explanation of the requested ratings.

Written report

  1. Main contributions

This work has many strengths, including:

  • Evidence for positive impacts on mental health following therapy 4-5 years later, one of the longest follow-ups in the field

  • Consideration of wider outcomes, including null results for economic outcomes, which sheds important light on the limits of what therapy can do

  • Very valuable to see low-cost and relatively short therapeutic approaches be effective in the long run. The fact that these interventions were community- or peer-delivered is great both in terms of cost savings and in terms of community empowerment

  • The collection of expert opinions is not often seen, and I think it nicely complements the surprising, positive nature of the results

The two main policy-relevant findings are (1) that therapy has sustained positive effects in the long run, thus highlighting therapy as relevant for future global health prioritization, and (2) that therapy can be delivered in lower-cost and shorter durations that are impactful. There is existing evidence on the second point, so this triangulation is welcome to see in shaping a robust body of evidence. On point one, future replication will be needed to better understand these effects, but this is a promising result in an overall well done study.

  2. Critical points

  • One of the main results is that, after pooling the two trials, the treatment groups had PHQ-9 scores 0.15 SDs lower than the control groups, with p = 0.08. This is the result I’m least confident in and most interested in seeing replicated. While p < .1 is a conventional significance threshold in economics, it is not in many medical and epidemiological fields, and I too prefer a more conservative cutoff of p < .05, especially for novel or surprising effects. Importantly, recent large-scale robustness replications of work published in top journals suggest that effects with p between .05 and .1 were the least likely to replicate, and that effect sizes would frequently diminish very notably (approximately halve, I believe; see Brodeur et al. 2024).[1]

  • Decomposition of PHQ: good to see this. In practice, specific symptoms are sometimes measured with a single item or very few items, making these less robust measures overall. It’s hard to get a sense of how robust each measure was from the current write-up.

  • Measure of mental health: in my opinion, it would have been better to survey a wider battery of mental health measures, rather than just depression, especially since mental health is the focal point of the paper. In some ways I think the language of the paper may overstate benefits and under-specify their domain; we’re really looking at depression, and most likely targeting behavior and mood here. I appreciate the difficulty of administering many measures, especially in LMICs, but there are now good, validated, and even short instruments that can capture broader mental health outcomes (e.g. GAD-2 or GAD-7).

  • Would be nice to see Cronbach’s alphas or similar for PHQ-9.

  • Very brief mention of missing data. Could this be reported more clearly: how much data was missing, was it missing at random, which measures were affected, and was missingness high for any of them? Something like Little’s MCAR test could be a quick start here.

  • With some uncertainty, it seems to me that baseline mental health might have mattered, considering HAP adults had moderately severe depression whereas the THPP participants had ‘lower (moderate)’ depression. This might be interesting to unpack, as recent interventions have found no heterogeneity based on baseline symptomatology (Barker et al. 2022)[2], but other meta-analytic work has suggested that interventions seem to be more effective when baseline problems are more severe (Bower et al., 2013).[3]

  • It was very interesting to read the discussion around the THPP and lack of stronger effects, particularly around high rates of spontaneous recovery. I wonder if more specific measures for postpartum depression at pre- and post- would help in the future.

  • On the note of population characteristics, I imagine the authors would be underpowered to test for gender differences given women are more widely surveyed across two interventions. This should be a target for future work, as well as capturing a more general population. As above, pregnancy is associated with particular patterns of mental health risk and recovery that make generalization difficult.

  • I found it difficult to get a sense of population characteristics beyond what is in the tables, which was brief - e.g. a sense of income?

  • Overall, I think the findings need to be better contextualized. The authors define them as ‘substantial’ but what does that really mean? A few ideas to make this clearer to readers could be:

    • The classic epidemiological argument that at population level this matters a lot (Rose 2001)[4]

    • Reference and directly compare the effects achieved by other comparable interventions, such as behavioral change interventions (e.g. around substance misuse, sexual risk, bullying), preferably in the long run, and also interventions focused primarily on general health (e.g. obesity, ADHD). There are many systematic reviews on these topics, and my reading is that mental health interventions are as tractable as, and comparable to or not too far behind, general physical health interventions. This might not be clear to a reader or to someone in policy, the latter of whom might be facing difficult decisions on triage and funding prioritization.

  • Further on the population-level point: I would have loved to have seen a stronger conclusion, based on the evidence, on how we should address the inverse care problem. Therapy may seem like an individual-level approach to health, but when scaled to a governmental health system, it will be more impactful, and that’s where effort should go. See further Chater & Loewenstein 2023[5] on why individual-level framed interventions and recommendations misdirect funding.

  • More detail on why pooling the two trials is appropriate, and what limitations that may have, would be welcome for readers without a clinical background. The shared treatment approach allows this, but examining very different populations jointly could make precise interpretation and understanding of mechanisms difficult. For instance, the authors’ broad interpretation is that HAP drives the overall effects, but someone could certainly say the result is population dependent, e.g. that not being pregnant contributes or matters more than the intervention.

  • A little bit more on why behavioral activation matters would be good. For instance, it is not clear from the current writeup why treatment primarily targeting behavior and then affecting mood would be expected to change different types of cognitive outcomes, including some of the examined ones.

  • If tasks were financially incentivized, was it clear to participants (was there any formal test?) which tasks were linked to receiving or not receiving the financial payment? From past experience in global health, instructions around payment in the context of experiments can be very hard for participants to understand.

  • A few points on the bead bracelet making task. Overall, I don’t think this is a suitable task for testing for depressive realism, though it is an interesting way of showcasing the Dunning-Kruger effect. These points are relatively more minor, as I don’t see this as the main focus of the paper, and indeed the authors do not frame it as such. This is also a controversial literature, where many of us have different views, though there is a building consensus from reviews that if there is such an effect, it is likely more constrained to certain circumstances and domains rather than universal.

    • Deficits in self-perception, negative views of self, rumination are fairly established for depressed individuals.

    • Tasks should not be self-relevant if we want to isolate the effects of such deficits from perception about others or the environment.

    • Tasks where there is no clear standard for reality, such as relative positioning to others, are a weak benchmark (Moore & Fresco, 2012[6] - see this review on this point and further critical points)

    • If participants were informed they could get a job later, reporting overconfidence is not illogical. How was this accounted for?
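On the earlier request for Cronbach’s alpha for the PHQ-9, the calculation is simple enough to sketch. The snippet below is only an illustration under stated assumptions: the function name `cronbach_alpha` and the toy scores are hypothetical, the data are made up (three items rather than the full nine, each scored 0-3), and nothing here reflects the authors’ data or code. It uses the standard formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores).

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores.

    Hypothetical helper for illustration; assumes no missing item responses.
    """
    k = len(responses[0])
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance (n - 1 denominator) of each item across respondents.
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Made-up scores: 5 hypothetical respondents, 3 items scored 0-3.
toy_scores = [
    [0, 1, 0],
    [1, 1, 2],
    [2, 2, 2],
    [3, 2, 3],
    [1, 0, 1],
]
alpha = cronbach_alpha(toy_scores)
```

In a real analysis one would pass the item-level PHQ-9 responses at each survey wave, and decide beforehand how incomplete responses are handled, since this sketch assumes complete data.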


[1] Brodeur, A., et al. (2024). Mass Reproducibility and Replicability: A New Hope. I4R Discussion Paper Series, No. 107. Institute for Replication (I4R).

[2] Barker, N., Bryan, G., Karlan, D., Ofori-Atta, A., & Udry, C. (2022). Cognitive Behavioral therapy among Ghana's rural poor is effective regardless of baseline mental distress. American Economic Review: Insights, 4(4), 527-545.

[3] Bower, P., Kontopantelis, E., Sutton, A., Kendrick, T., Richards, D. A., Gilbody, S., ... & Liu, E. T. H. (2013). Influence of initial severity of depression on effectiveness of low intensity interventions: meta-analysis of individual patient data. BMJ, 346.

[4] Rose, G. (2001). Sick individuals and sick populations. International Journal of Epidemiology, 30(3), 427-432.

[5] Chater, N., & Loewenstein, G. (2023). The i-frame and the s-frame: How focusing on individual-level solutions has led behavioral public policy astray. Behavioral and Brain Sciences, 46, e147.

[6] Moore, M. T., & Fresco, D. M. (2012). Depressive realism: A meta-analytic review. Clinical Psychology Review, 32(6), 496-509.

Evaluator details

  1. How long have you been in this field?

    • ~5 years in global mental health specifically

  2. How many proposals and papers have you evaluated?

    • ~50
