Abstract
I really appreciate the team’s ambition in tackling difficult long-term forecasts of important existential risks. Just as importantly, they explore how experts might update their forecasts when confronted with arguments from peers. There are, of course, many methodological challenges to credibly eliciting forecasts for events far into the future; the researchers do their best to meet them, but their approach may still fall short. The policy implications of the results so far could also be further clarified.
Summary Measures
We asked evaluators to give some overall assessments, in addition to ratings across a range of criteria. See the evaluation summary “metrics” for a more detailed breakdown of these ratings. See them in the context of all Unjournal ratings, with some analysis, in our data presentation here.
| | Rating | 90% Credible Interval |
| --- | --- | --- |
| Overall assessment | 75/100 | 60 - 90 |
| Journal rank tier, normative rating | 4.0/5 | 3.5 - 5.0 |
Overall assessment: We asked evaluators to rank this paper “heuristically” as a percentile “relative to all serious research in the same area that you have encountered in the last three years.” We requested they “consider all aspects of quality, credibility, importance to knowledge production, and importance to practice.”
Journal rank tier, normative rating (0-5): “On a ‘scale of journals’, what ‘quality of journal’ should this be published in? (See ranking tiers discussed here.)” Note: 0 = lowest/none, 5 = highest/best.
See here for the full evaluator guidelines, including further explanation of the requested ratings.
Written report [1]
I really appreciate the team’s ambition in tackling difficult long-term forecasts of important existential risks. Just as importantly, they explore how experts might update their forecasts when confronted with arguments from peers. There are, of course, many methodological challenges to credibly eliciting forecasts for events far into the future; the researchers do their best to meet them, but their approach may still fall short. The policy implications of the results so far could also be further clarified. The team could do a lot more analysis with the rich data, but I understand that this report was aimed at a general audience.
Methodology
How might we think about the reciprocal scoring results in light of the robustly documented false consensus effect? It would be good to spell out the theoretical properties of reciprocal scoring and why it is supposed to be superior to other elicitation methods, both when forecasters are rational agents and when they are behavioral agents. For these low-probability events, it has been shown that asking participants to list the possible ways an outcome could come about, or the reasons it could not happen, can respectively increase or decrease their forecasted probability of that outcome.
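To make this concrete, here is a minimal sketch of reciprocal scoring as usually described: a forecaster is rewarded for accurately predicting another group’s (e.g. the superforecasters’) central forecast, rather than being scored against the far-future outcome itself. The function name, the quadratic loss, and the numbers are illustrative assumptions, not the report’s exact scoring rule.

```python
# Illustrative sketch of reciprocal scoring (not the report's exact rule):
# score a forecaster's guess of another group's median forecast against
# that group's realized median, using a quadratic loss.
import statistics

def reciprocal_score(guess_of_other_median, other_group_forecasts):
    """Higher (less negative) is better: negative squared distance from
    the other group's realized median forecast."""
    realized_median = statistics.median(other_group_forecasts)
    return -(guess_of_other_median - realized_median) ** 2

# Hypothetical example: an expert guesses the superforecaster median is 1%,
# while the superforecasters' actual median turns out to be 0.4%.
print(reciprocal_score(0.01, [0.002, 0.004, 0.006]))
```

The worry is that, under the false consensus effect, a forecaster’s guess of the other group’s median may be anchored on their own belief, so the elicited “predictions of others” and the forecasters’ own views are not independent signals.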
Analysis
Given the target audience, it may make sense that the report does not contain more nonparametric tests or regression analysis. However, the current tables do not show the individual heterogeneity that may be driving very different forecasts, beyond the domain expert vs. superforecaster distinction or the type of domain expertise. Regression analysis could properly control for other important differences between the forecasters. For example, is previous performance as a superforecaster correlated with the forecasts given in the current contest?
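As an illustration, a regression along the following lines could absorb observable forecaster characteristics; the data frame, column names, and values below are hypothetical placeholders, not the study’s actual variables.

```python
# A minimal sketch (hypothetical variables and toy values) of a regression
# controlling for forecaster characteristics beyond the
# expert/superforecaster split.
import pandas as pd
import statsmodels.formula.api as smf

# One row per forecaster; in practice this would come from the study data.
df = pd.DataFrame({
    "log_risk_forecast":  [-4.6, -5.3, -3.9, -6.2, -4.1, -5.8],
    "is_superforecaster": [0, 1, 0, 1, 0, 1],
    "domain":             ["AI", "AI", "bio", "bio", "nuclear", "nuclear"],
    "past_brier_score":   [0.25, 0.12, 0.30, 0.10, 0.22, 0.15],
})

# Does prior accuracy (past_brier_score) predict current forecasts once the
# group indicator and domain are controlled for?
model = smf.ols(
    "log_risk_forecast ~ is_superforecaster + C(domain) + past_brier_score",
    data=df,
).fit()
print(model.summary())
```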
Policy
To understand whether the differences between domain experts’ and superforecasters’ forecasts matter for policy, it would be interesting to know for what ranges of existential risk estimates policymakers would choose the same or different options. While the differences in forecasts are very interesting from an academic perspective, if both sets of risk assessments would still lead to the same policy choices, then perhaps we need not be as concerned from an applied perspective.
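As a stylized illustration of this point (the thresholds and risk numbers are invented, not the report’s estimates): under a threshold-based decision rule, what matters is whether the two groups’ estimates fall on the same side of the policy threshold.

```python
# Stylized decision rule: a policymaker acts only if the estimated risk
# exceeds a threshold. Do expert and superforecaster estimates imply the
# same action? (All numbers below are hypothetical.)
def same_policy_choice(expert_risk, superforecaster_risk, threshold):
    """True if both estimates fall on the same side of the action threshold."""
    return (expert_risk >= threshold) == (superforecaster_risk >= threshold)

expert_risk, superforecaster_risk = 0.03, 0.005  # hypothetical estimates
for threshold in (0.001, 0.01, 0.05):
    agree = same_policy_choice(expert_risk, superforecaster_risk, threshold)
    print(f"threshold={threshold:.3f}: same policy choice -> {agree}")
```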
The team could use simple natural language processing (NLP) methods to detect and categorize the arguments that forecasters made in their team discussions. This could give clues about the types of arguments that succeeded in changing a few minds versus the many that did not. There are a number of theoretical models of which kinds of narratives or arguments are most persuasive, e.g. the one that best fits the past data, the one that gives the most optimistic outlook for the future, and so on. Perhaps changing the incentive structure could also lead to more persuasion: what if one forecast were picked at random to count for the whole team?
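For the NLP suggestion, a minimal sketch along the following lines could surface candidate argument categories for subsequent hand-labelling; the example comments, cluster count, and variable names are hypothetical, and any serious version would use the actual discussion transcripts.

```python
# Toy sketch: cluster discussion comments into candidate argument types
# using TF-IDF features and a small topic model (NMF), then inspect the
# top terms of each cluster. All example comments are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

comments = [
    "base rates of past pandemics suggest a lower probability",
    "compounding model uncertainty makes the tails fatter than you think",
    "expert surveys have historically overestimated these risks",
    "capabilities could emerge faster than the historical trend suggests",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(comments)

nmf = NMF(n_components=2, init="nndsvda", random_state=0)
weights = nmf.fit_transform(X)  # comment-by-cluster loadings

terms = tfidf.get_feature_names_out()
for k, component in enumerate(nmf.components_):
    top_terms = [terms[i] for i in component.argsort()[-5:][::-1]]
    print(f"Argument cluster {k}: {', '.join(top_terms)}")
```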