Abstract
Pro:
- Raises important points and brings them to wider attention in simple language.
- Useful for considering individual RCTs.

Con:
- Not clear enough about intended use cases and framing.
- The writing should be clearer, shorter, and more purposeful.
- The guidelines need more clarity and precision before they can genuinely be used.
- I think it is best to reframe this as a research note rather than a ready-to-use ‘guideline’.
- Unclear whether this is applicable to considering multiple studies and doing meta-analysis.
Note: Applied and Policy Stream
This paper was evaluated as part of our “applied and policy stream”, described here. The ratings should not be directly compared to those in our main academic stream.
Summary Measures
We asked evaluators to give some overall assessments, in addition to ratings across a range of criteria. See the evaluation summary “metrics” for a more detailed breakdown of this. See these ratings in the context of all Unjournal ratings, with some analysis, in our data presentation here.
|  | Rating | 90% Credible Interval |
| --- | --- | --- |
| Overall assessment | 25 | (0, 50) |
Overall assessment: We asked evaluators to rank this paper “heuristically” as a percentile “relative to applied and policy research you have read aiming at a similar audience, and with similar goals.” We requested they “consider all aspects of quality, credibility, importance to future impactful applied research, and practical relevance and usefulness.”
Written report
Summary
Overall, I agree on the importance of the points the author raises, and it is very good to bring them to wider attention. I will mainly focus on ways to improve the text, but that should not take away from the value already present. For example, I critique the clarity of the writing in several places, but overall it is very good; my suggestions are about making improvements at the margins.
Crucially, it is hard for me to assess the usefulness of this note because I do not know what a typical user already knows, and the note itself does not really define its readership or the situations where it should be applied. I think it would be great if the author started the note by stating this very clearly. For example, there are several sentences early on written in the first person singular, but I could not quite parse who is meant: the team at FP, effective altruist decision-makers, or people working in global health more broadly?
I’d recap the main problems as follows:
To this reader, it was not clear who the intended audience is, and the scope of situations where this report applies felt poorly defined. For the intended end user (and here again I am guessing who that is), these guidelines will not yet be clear. They read more like a list of considerations to apply when evaluating individual RCTs (more on that below).
The document is clearly a work in progress. More care is needed in preparing the figures and tables, and in drafting the guidelines more precisely, before they could genuinely be used. They would also need more (and more comprehensive) case studies. Alternatively, the document could be reframed as a research note or a case study, dropping the “guidelines” framing.
I think methodological revisions are best done only after there is more clarity on the scope and purpose of the document. Nevertheless, I have tried to list suggestions throughout my review.
I think the very last sentence of the report (“I think this work is likely to be most helpful for establishing discounts for ‘risky’ interventions that are comparatively less well-studied”) would work well right at the beginning. It is important to recognise that different situations will require different approaches.
Details
I believe that throughout the document, every time an effect size is mentioned, it is implied that the author is interested in deriving an average effect size (as opposed to, e.g., calculating the probability of the effect exceeding some threshold, or working with the entire predictive distribution). If that is the case, it would be good to state it very clearly right at the beginning, and even better to explain why. I assume this implicit assumption will be clear to nearly all readers, but even so it is worth spelling out what is in scope of this report and prompting readers to consider the alternatives.
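To make the distinction concrete, here is a minimal sketch (my own illustration, with hypothetical numbers, not something from the note) of the different framings a decision-maker might use once they have a predictive distribution for the effect:

```python
# Hypothetical predictive distribution of the effect (in SD units); a decision-maker
# could summarise it by its mean, by a threshold-exceedance probability,
# or keep the whole distribution.
import numpy as np

rng = np.random.default_rng(seed=1)
effect_draws = rng.normal(loc=0.15, scale=0.10, size=100_000)

average_effect = effect_draws.mean()                  # "average effect size" framing
prob_above_threshold = (effect_draws > 0.10).mean()   # threshold-exceedance framing

print(f"mean effect: {average_effect:.3f}, P(effect > 0.1): {prob_above_threshold:.2f}")
```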
I am unsure about the way Type S errors are incorporated into the practical instructions. First, why would you check for them AFTER assessing adjustments to effect size? I think this ordering would feel confusing to a hypothetical reader. More generally, the Type M / Type S error dichotomy makes sense when evaluating the quality of whole domains of scientific literature (“look, studies make this type of error this often”), but if the focus is on evaluating a single RCT, then I think it is most natural to frame the internal validity adjustment as a single process.
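For readers who have not met these terms, here is a minimal sketch of a Gelman and Carlin-style “retrodesign” calculation, with hypothetical numbers; it is my illustration of the concepts, not the author's procedure:

```python
# Given a hypothesised true effect and the study's standard error, estimate power,
# the Type S error rate (probability a significant estimate has the wrong sign),
# and the Type M exaggeration ratio (expected inflation of significant estimates).
import numpy as np
from scipy import stats

def retrodesign(true_effect, se, alpha=0.05, n_sims=100_000, seed=1):
    z = stats.norm.ppf(1 - alpha / 2)                    # two-sided critical value
    power = (1 - stats.norm.cdf(z - true_effect / se)
             + stats.norm.cdf(-z - true_effect / se))
    type_s = stats.norm.cdf(-z - true_effect / se) / power
    rng = np.random.default_rng(seed)
    draws = rng.normal(true_effect, se, n_sims)          # simulated estimates
    significant = np.abs(draws) > z * se
    type_m = np.mean(np.abs(draws[significant])) / true_effect
    return power, type_s, type_m

# e.g. a true effect of 0.1 SD estimated with a standard error of 0.08
print(retrodesign(true_effect=0.1, se=0.08))
```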
Relatedly, why say “baseline estimates” and not priors? I am somewhat agnostic about this choice, but curious: is it because the intention is to keep things as simple as possible and the worry is that readers would be confused by Bayesian concepts? Or does the author think it is not a good idea to view the problem through a Bayesian lens in these situations?
I would be more careful with claims about what is underappreciated: the problems described in this text have been part of the “zeitgeist” for about a decade now (look at the publication years of the works cited and consider that each of them was the result of at least a few years of work on these topics). I grant that this is a minor comment, but it is part of the more important question of who this report is for and how it frames its message.
Some writing is a bit unclear or redundant:
A good example of repetition: pages 7–8 repeat much of the information from the beginning of the document, which is then explained again on page 9; at that point, internal replicability is being explained for the third time. This is followed by another repeat of what was already explained (1.5 pages earlier, in fact): “I have split Internal replicability into two components”. On the other hand, some material that belongs together is strewn around: pages 10 and 12 contain separate references to low power in large sets of published studies. I would scrap nearly all of the introduction to avoid the repetition and make the note much shorter. Note that the Gelman and Carlin paper is only about 10 pages, including a couple of case studies.
In several places I saw a reference to “a discount” where the author in fact meant the adjusted value (i.e. the value remaining after the discount is applied).
I don’t think there is a need to capitalise “Internal replicability”, or to use phrasing like “increases inflation” and “The ‘winner’s curse’ means that the filter of 0.05 significance level means that the under-powered studies…”.
In the part early on which starts “I split Internal replicability”, there is talk of Type M errors and adjustments. These terms sometimes seem to be used interchangeably in ways that may not be intuitive to people who have not encountered them before. A careful explanation of the “error” and “adjustment” terminology may be helpful to some non-technical readers.
Some statistical language is a bit imprecise (although this is a very minor point): power is described as the “probability that hypothesis test will successfully identify an effect”, and elsewhere “due to low power, the study does not successfully detect the 0.1 or 0.2 effect sizes”. It would be better to say this in the language of hypothesis testing, i.e. power is the probability of rejecting the null hypothesis at the chosen significance level when the true effect takes a given value.
I think the main blind spot of this document is what happens when there are multiple studies. There is some discussion of that, and a mention of “averaging effects” after adjusting each RCT separately, but I find it a bit confusing. There is a whole literature (and there are whole journals) on systematic reviews and evidence synthesis/meta-analysis. I don’t know whether that is in scope for this report, so I won’t go into it here, but this topic either has to be put explicitly “out of scope” or completely reworked with better referencing of the existing approaches.
I think a proper treatment of this topic would require a whole separate review. At this stage I think it is best to assume that it is not in the scope of the report. But here are some pointers in case the author decides it should be in scope:
I’d start by reviewing the Cochrane Handbook and the grading of evidence as it is done in meta-analyses. The publication bias assessment methods cited are quite new, and I am not sure they are a genuine improvement over what already exists in the literature (this is an expression of ignorance, not skepticism), so it may be better to stick initially to the standard methods used to assess publication bias. There is also a lot of debate in the literature on whether to down-weight low-quality studies or simply exclude them; any potential guidelines would have to address that problem.
Next, the author would have to consider an operational problem: at what stage is it better to outsource a meta-analysis to experts rather than do it “in-house”.
Lastly, when dealing with multiple studies, the division between internal and external validity is no longer clear-cut: in the use cases the author has in mind, one would typically be averaging over studies whose effects can be expected to be heterogeneous. So the relevant adjustments may have to happen within a meta-analysis model (a minimal sketch of what I mean follows below). My personal feeling, given this very abbreviated “tour”, is that such situations become too complex for an “operationalised” approach, but I am perhaps bound to be biased that way.
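For concreteness, a minimal sketch of a random-effects meta-analysis (DerSimonian–Laird) with entirely made-up effects and standard errors; this is the kind of model in which between-study heterogeneity, and hence part of any validity adjustment, would naturally be estimated:

```python
# Random-effects meta-analysis (DerSimonian-Laird) with made-up inputs,
# purely to illustrate where between-study heterogeneity enters the pooling.
import numpy as np

def dersimonian_laird(effects, ses):
    effects, ses = np.asarray(effects), np.asarray(ses)
    w = 1.0 / ses**2                                    # fixed-effect weights
    fixed = np.sum(w * effects) / np.sum(w)
    q = np.sum(w * (effects - fixed) ** 2)              # Cochran's Q
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)       # between-study variance
    w_re = 1.0 / (ses**2 + tau2)                        # random-effects weights
    pooled = np.sum(w_re * effects) / np.sum(w_re)
    return pooled, tau2

# three hypothetical RCTs: adjusted effect estimates and their standard errors
print(dersimonian_laird(effects=[0.20, 0.05, 0.12], ses=[0.05, 0.04, 0.08]))
```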
I think the worked example is great and a must. But if these are really guidelines, then there should be several worked examples at different levels of “difficulty”, highlighting how the rules can be applied in diverse and ambiguous cases. I think this is imperative for making the guidelines widely usable.
Still on the case study: it is better to calculate power yourself, even if a study has a sensible-looking pre-registered power calculation. For the example used in the article, power/MDE depends primarily on the number of women randomised into the trial and on how many of them use contraception in the absence of treatment. Both of these quantities are, of course, the authors’ predictions made before the trial was conducted, and they may differ a lot from what actually happened.
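Such a re-check is quick to do from first principles. A minimal sketch with entirely hypothetical numbers (not those of the study under review), using a normal approximation for a difference in two proportions:

```python
# Approximate power of a two-sided test for a difference in two proportions
# (e.g. contraceptive use), via the normal approximation; all numbers hypothetical.
import numpy as np
from scipy import stats

def power_two_proportions(p_control, p_treat, n_per_arm, alpha=0.05):
    diff = abs(p_treat - p_control)
    se = np.sqrt(p_control * (1 - p_control) / n_per_arm
                 + p_treat * (1 - p_treat) / n_per_arm)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    return (1 - stats.norm.cdf(z_crit - diff / se)
            + stats.norm.cdf(-z_crit - diff / se))

# e.g. 500 women per arm, 30% contraceptive use in the control arm,
# aiming to detect an increase of 6 percentage points
print(power_two_proportions(p_control=0.30, p_treat=0.36, n_per_arm=500))
```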
If researchers are really supposed to use Fig 9, then there should be a table from which they can read the right value.
I do not like that “Hawthorne effects” are assessed twice. I think there should be more consistency.
I don’t understand the role of the recommended sanity checks. I think it is better to be familiar with the individual studies and assess their merits on a case-by-case basis, or at least this is how I try to do it in my own practice. There is, I presume, little value in applying the same discount across all studies to bring them closer to zero: even if you believe that “most published findings are false”, if your funding process is based on reviewing published findings under a fixed budget, then applying a skepticism correction indiscriminately to all studies will not change your decisions (a uniform discount leaves the ranking of studies unchanged).
The report talks about “attending to” sample differences, but I feel the detail there is lacking.
I think you need a mechanistic model for external validity (how different population factors impact the effects; how implementation factors impact the effects). For example, if you think that future implementations are going to differ in terms of take-up, then you need to postulate a simple model of how you think take-up impacts the effects of the intervention; a “blanket” prior on how much variance there is across studies will not suffice. In my experience, subject-matter experts can provide domain-specific heuristics (“if take-up halved, then I think the effect size would roughly halve, because x, y, and z”) which are sufficient to postulate such a model, and a lot of it is also common sense. On the other hand, heterogeneity in effects across a large body of literature is itself heterogeneous. Moreover, these mechanistic models are also amenable to statistical analysis in cases where you can obtain extra data, or the relevant work may already have been done by other authors (to give a made-up example, for modelling the effects of a malaria vaccine roll-out there is a whole set of epidemiological modelling papers on how benefits scale as a function of coverage).
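A toy sketch of the kind of heuristic adjustment I have in mind, under the assumed (domain-specific, not from the paper) heuristic that the intention-to-treat effect scales roughly linearly with take-up:

```python
# Toy mechanistic adjustment (my illustration, not the author's method): assume the
# intention-to-treat effect scales linearly with take-up of the intervention.
def adjusted_effect(trial_effect, trial_takeup, target_takeup):
    """Scale the trial's effect by the ratio of expected to observed take-up."""
    return trial_effect * (target_takeup / trial_takeup)

# e.g. a trial effect of 0.20 SD with 60% take-up, expecting 30% take-up at scale
print(adjusted_effect(trial_effect=0.20, trial_takeup=0.60, target_takeup=0.30))  # 0.10
```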
On that note, I’d like much more detail in section 3b, Step 1. What is the list of factors compared between RCT and target population? Why was the discount 20%?
Evaluator details
How long have you been in this field?
How many proposals and papers have you evaluated?