Description
Evaluation Summary and Metrics: "Replicability & Generalisability: A Guide to CEA discounts" for The Unjournal (Applied stream).
Evaluation of "Replicability & Generalisability: A Guide to CEA discounts" for The Unjournal (Applied Stream).
The proposal makes an important practical contribution to the question of how to evaluate effect size estimates in RCTs. I also think that, overall, the evaluation steps are plausible and well justified and will lead to a big improvement in comparison to using an unadjusted effect size. However, I am unsure whether they will lead to an improvement over simpler adjustment rules (e.g., dividing the effect size by 2), and I see serious potential problems when applying this process in practice, especially related to the treatment of uncertainty.
This paper was evaluated as part of our “applied and policy stream”, described here. The ratings should not be directly compared to those in our main academic stream.
We asked evaluators to give some overall assessments, in addition to ratings across a range of criteria. See the evaluation summary “metrics” for a more detailed breakdown of this. See these ratings in the context of all Unjournal ratings, with some analysis, in our data presentation here.1
| | Rating | 90% Credible Interval |
| --- | --- | --- |
| Overall assessment | 40/100 | 10 - 60 |
Overall assessment: We asked evaluators to rank this paper “heuristically” as a percentile “relative to applied and policy research you have read aiming at a similar audience, and with similar goals.” We requested they “consider all aspects of quality, credibility, importance to future impactful applied research, and practical relevance and usefulness.”
First, I want to say that I think the proposal is a good starting point on the important topic of how to shrink study estimates in practice and I appreciate the opportunity to review this important work. I think this topic is difficult to tackle in a numerically quantifiable way and the author made a very important contribution by starting this process. Further, I think all points outlined below can be addressed and I would be happy to see how this guideline develops further.
My main concern is that at each step of the evaluation there is quite a lot of uncertainty about the correct shrinkage estimate for that step. Using this multi-step process with very uncertain estimates at each step may therefore actually worsen the estimates in comparison to using a simpler rule (e.g., ½ of the primary-study effect size), because each step introduces an additional possibility for error or misspecification. I created a small simulation to illustrate this point. For example, assuming the distribution of true shrinkage factors is beta(2, 2) and the six-step process is on average unbiased (i.e., equal to the true shrinkage in expectation), but error from normal(0, 0.2) is added at each step, I find that using ½ leads to a better replicability assessment than following the six-step proposal here (I am attaching the R script with my report).
Manager note: this script is given at the bottom, and we have also hosted it in a Google Colab here
Of course, the simulation depends a lot on the assumptions about the distribution of the true shrinkage and the error, and with other assumptions I would likely find the opposite. I am not proposing that ½ is certainly better. Instead, I am only proposing that it is a serious possibility, worthy of consideration, that the six-step process will introduce too much noise to be worthwhile overall. Note also that this could be tested empirically by having Founders Pledge members predict the replicability of already-replicated studies using the tool and then comparing their forecasts against ½. The problem discussed here is also amplified by two other major concerns discussed below.
All the adjustments only operate on point estimates, and the uncertainty is largely ignored. I think it would be very valuable to take uncertainty into account in this approach. For example, researchers could specify a range of values for the plausible effect size and the different plausible adjustments and then do the other steps (e.g., power analysis) for this range. Then you could use a probabilistic tool such as Squiggle to get to an estimate for the final shrinkage that takes uncertainty into account. The reason why I am emphasizing this is that I think when selecting between multiple donation or investment options (which I assume is ultimately the goal of the cost-effectiveness assessment), a selection only on point estimates can introduce a variety of serious problems, in particular, a selection in favour of more speculative and less effective interventions (see for example here).
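To make this concrete, here is a minimal R sketch (my own illustration, not part of the guide) of propagating hypothetical ranges for the individual adjustment factors through to the final estimate by Monte Carlo; a probabilistic tool such as Squiggle would work analogously.
#minimal sketch: propagate hypothetical ranges for the adjustment factors by
#Monte Carlo instead of multiplying point estimates (all numbers are made up)
set.seed(1)
n_draws <- 100000
study_estimate <- 0.30 #hypothetical primary-study effect size
infl_power <- runif(n_draws, 1.1, 1.4) #plausible inflation due to low power
infl_social <- runif(n_draws, 1.0, 1.2) #plausible inflation due to social desirability
infl_other <- runif(n_draws, 1.0, 1.3) #other plausible inflation factors
adjusted <- study_estimate / (infl_power * infl_social * infl_other)
quantile(adjusted, c(0.05, 0.5, 0.95)) #an interval for the adjusted effect, not just a point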
I know that you could argue as a counterpoint that the proposed guidance already shrinks more speculative effects more; however, I think that is the case only for effects that are likely to be null rather than effects that have a wider range of plausible effect sizes, which is what I am referring to here and which is a separate issue. For example, if one effect is likely between 0 and 1 and another likely between 0.4 and 0.6, these two would be treated similarly in the proposed analysis. I would really recommend reading the blog post (if you are not familiar with it already) and thinking about this issue, as this may otherwise introduce big problems in the later point estimate-based selection of effect sizes.
The worked example, which would be an opportunity to give more specific guidance on how to get from the general guidelines to specific estimates for the amount of shrinkage, is very short, and it is often unclear how the numbers are derived. This is especially important as the theoretical steps in the main part are quite abstract and, therefore, I imagine that the worked example will be crucial for other researchers to actually be able to apply this method in practice. Currently, several things are not explained in enough detail in the worked example; I list most of them below, but I would recommend reworking this in general and adding much more detail.
Step 1: Are those 16 studies that you found on exactly the same effect or on related effects? If it is exactly the same effect, why not use a meta-analytic estimate across all of them? Also, how did you arrive at the 50% shrinkage applied to those studies? Could there be a problem of infinite regress, where technically you would need to go through the stepwise process to adjust those studies as well, which in turn requires primary studies to get an initial guess, and so on?
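For illustration only (the numbers below are made up, and this is not part of the proposal), a random-effects meta-analysis across such studies could look like this in R using the metafor package:
#hypothetical sketch: pool related studies with a random-effects model instead
#of an informal 50% discount (yi and vi are made-up effect sizes and variances)
library(metafor)
yi <- c(0.25, 0.10, 0.40, 0.05) #hypothetical standardised effect sizes
vi <- c(0.02, 0.03, 0.05, 0.01) #hypothetical sampling variances
rma(yi = yi, vi = vi) #random-effects pooled estimate and heterogeneity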
Step 2: You refer to Figure 9 to get the adjusted estimate due to the interaction of power and publication bias, but it would be good to give more detail on how to read off the values. Further, Figure 9 appears between Figures 3 and 4 in the manuscript; I would order the figures so that they ascend from 1 to 9.
How did you estimate that social desirability increases the estimate by 10%? Where does this number come from? Is there any work that can be referred to on how much social desirability would usually inflate effect sizes?
Step 5: If it is likely that you did not sufficiently discount the estimated effect size in the baseline estimate, then why not use a lower baseline estimate to begin with?
Step 6: Needs more detail on how exactly you arrived at the 5% shrinkage.
It would be good to also have an example of a Type-S adjustment, perhaps for a different study if this one is already well powered.
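As an illustration of what such an example could contain, here is a minimal R sketch in the spirit of Gelman and Carlin's Type-S ("retrodesign") calculations; the true_effect and se values are hypothetical and not taken from the worked example:
#minimal sketch of a Type-S (wrong-sign) calculation in the spirit of
#Gelman & Carlin (2014); true_effect and se are hypothetical inputs
retrodesign <- function(true_effect, se, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)
  p_hi <- 1 - pnorm(z - true_effect / se) #P(significant and correct sign)
  p_lo <- pnorm(-z - true_effect / se) #P(significant and wrong sign)
  power <- p_hi + p_lo
  type_s <- p_lo / power #P(wrong sign | statistically significant)
  list(power = power, type_s = type_s)
}
retrodesign(true_effect = 0.1, se = 0.15) #low power, so Type-S risk is non-trivial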
I am wondering whether, in the internal validity guidance for researchers, step 1 and step 5 essentially constitute updating on the prior twice. One first informs one's estimate of the power with one's prior intuition, discounts the effect size if the expected effect size (and consequently power) is low, and then discounts this already-discounted estimate again in step 5. This seems to violate ideas of Bayesian updating, where the same data should not be used twice.
I think 2b(ii) in the internal validity guidance for researchers could be more specific. For many people in the social sciences without extensive statistical training, these signs may be hard to spot, so I would add at least a short list of some of the common issues (e.g., hypotheses that seem to have been formulated only after seeing the data, extensive subgroup analyses without adjustment for multiple comparisons, p-values just below the .05 threshold).
For 2b(i) I would emphasize always going back to the preregistration and checking that the researchers actually did what they said they would do. I know it already says “did the researchers stick to this”, but I would really emphasize this, as researchers very often deviate from the preregistration without saying anything, so merely noting that there is a preregistration does not add much robustness without actually checking it.
Further, for 2b(i) I would also mention registered reports: if the article is a registered report, that would increase my confidence that there is little researcher bias considerably, much more than a preregistration alone.
For point 3, it took me a moment to realise that those are the example numbers from above rather than intrinsically meaningful numbers. I think that could be addressed with a simple rewrite, for example:
Original (from the guide): "Calculate the total discount, as 1/ the product of the individual discounts above. So here, this would be 1/ (1.2 * 1.1 * 1.15) = 0.66. This suggests that your total adjustment is 66% x [study estimate of effect size]."
Suggested rewrite: "Calculate the total discount as 1/ the product of the individual discounts. Using the example numbers above (20% inflation due to power, 10% inflation due to XXX …), this would result in an adjusted estimate of 1/ (1.2 * 1.1 * 1.15) = 0.66. This suggests that your total adjustment is 66% x [study estimate of effect size]."
The Bartoš et al.[1] paper on comparing meta-analyses is now peer-reviewed. The appendix also includes a matched-k analysis comparing meta-analyses of similar size, which should help with some of the concerns about meta-analysis type. However, I agree that there could be differences between the Cochrane meta-analyses in medicine and the meta-analyses in psychology and economics that are based on journal articles.
On p.18: “If the ratio of the standard deviation to the mean is >1, this suggests that the data is noisy.” I am not aware of this guidance or where it may come from; it would be good to provide a reference and an explanation.
On p.17, “One difficulty of this method is the estimation of an unbiased sample size”: should this say unbiased effect size?
For Figure 4, I would increase the size of the figure labels.
Another source of evidence for Table 1 could be comparing effect sizes from registered reports to unadjusted effect sizes in the same field; see Kvarven, Strømland, and Johannesson (2020).[2]
While internal validity often depends mostly on statistical issues that can be identified by following the guidance in the document, assessing external validity would usually require much more domain expertise, so it might be good to recommend consulting an expert in the study's domain area for the external validity assessment.
The first time you introduce the power calculation, I would probably add a few notes on the problems of post hoc power calculations, just to make completely sure that users are aware of these problems and are not tempted to use the study's own effect size estimate for the power analysis.
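As a quick illustration of the problem (my own simulation, not from the guide), even when true power is only around 40%, post hoc power computed from the observed estimates looks much higher among the significant results:
#minimal sketch: post hoc power computed from the observed estimate is misleading
set.seed(1)
n_sims <- 10000
true_effect <- 0.2
se <- 0.115 #chosen so that true power is roughly 40%
z_crit <- qnorm(0.975)
obs <- rnorm(n_sims, true_effect, se) #observed estimates
#approximate post hoc power from each observed estimate (ignoring the far tail)
post_hoc_power <- 1 - pnorm(z_crit - abs(obs) / se)
true_power <- 1 - pnorm(z_crit - true_effect / se)
sig <- abs(obs) / se > z_crit
mean(post_hoc_power[sig]) #typically far above true_power
true_power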
Despite this mixed evaluation, I want to emphasize again that I think the proposal was a good and important starting point on a topic that is also very difficult to tackle. I think all of the issues raised here are addressable, and I would be very happy to read a revised version in the future.
#this vector stores the distance between the estimates from the six step process and the true values
dist_process <- c()
#this vector stores the distance between 1/2 and the true values
dist_half <- c()
epsilon <- 0.2 #noise added in each step
n_steps <- 6 #number of steps
for(i in 1:1000){
  #sample true value
  true <- rbeta(1, 2, 2)
  #add noise for each step (this could be rewritten more simply based on sum of normals)
  estimate <- true + rnorm(1, 0, epsilon)
  for(j in 1:(n_steps-1)){
    estimate <- estimate + rnorm(1, 0, epsilon)
  }
  #add bounds between 0 and 1
  if(estimate < 0){
    estimate <- 0
  }
  if(estimate > 1){
    estimate <- 1
  }
  #store the results from this simulation run
  dist_process <- c(dist_process, abs(true - estimate))
  dist_half <- c(dist_half, abs(true - 0.5))
}
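The script stores the two distance vectors but does not print the comparison; after running the loop above, the mean absolute errors of the two rules can be compared as follows:
#compare mean absolute error of the six-step estimates versus simply using 1/2
mean(dist_process)
mean(dist_half)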
[1]Bartoš, František, Maximilian Maier, Eric-Jan Wagenmakers, Franziska Nippold, Hristos Doucouliagos, John P. A. Ioannidis, Willem M. Otte, et al. 2024. ‘Footprint of Publication Selection Bias on Meta-Analyses in Medicine, Environmental Sciences, Psychology, and Economics’. Research Synthesis Methods 15 (3): 500–511.
[2]Kvarven, Amanda, Eirik Strømland, and Magnus Johannesson. 2020. ‘Comparing Meta-Analyses and Preregistered Multiple-Laboratory Replication Projects’. Nature Human Behaviour 4 (4): 423–34.
How long have you been in this field?
~3 years
How many proposals and papers have you evaluated?
~10
Evaluation of "Replicability & Generalisability: A Guide to CEA discounts" for The Unjournal. Evaluator: Anonymous