Abstract
We organized two evaluations of the paper “Forecasting Existential Risks: Evidence from a Long-Run Forecasting Tournament”[1]. This paper was evaluated as part of our “applied and policy stream”, described here. To read these evaluations, please see the links below.
23 Sep 2024 note: The above link to the paper is not currently working. As a backup, see here, or the Web Archive (‘Wayback Machine’) version here.
Evaluations
1. Anonymous evaluation 1
2. Anonymous evaluation 2
Overall ratings
Note: Applied and Policy Stream
This paper was evaluated as part of our “applied and policy stream”, described here. The ratings should not be directly compared to those in our main academic stream.
We asked evaluators to provide overall assessments as well as ratings for a range of specific criteria. Note: The second evaluator was given an updated form geared specifically for our Applied and Policy stream, while the first evaluator used the (very similar) interface intended for the academic stream.
I. Overall assessment: We asked them to rank this paper “heuristically” as a percentile “relative to all serious research in the same area that you have encountered in the last three years.” We requested they “consider all aspects of quality, credibility, importance to knowledge production, and importance to practice.”
II. Journal rank tier, normative rating (0-5): On a ‘scale of journals’, what ‘quality of journal’ should this be published in? (See ranking tiers discussed here.) Note: 0= lowest/none, 5= highest/best.
|  | Overall assessment (0-100) | Journal rank tier, normative rating (0-5) |
|---|---|---|
| Anonymous evaluation 1 | 75 | 4.0 |
| Anonymous evaluation 2 | 80 | N/A |
See “Metrics” below for a more detailed breakdown of the evaluators’ ratings across several categories. To see these ratings in the context of all Unjournal ratings, with some analysis, see our data presentation here.
See here for the current full evaluator guidelines, including further explanation of the requested ratings.
Evaluation summaries
Anonymous evaluator 1
I really appreciate the team’s ambitions for tackling difficult long-term forecasts of important existential risks. Just as importantly, they are exploring how experts might update their forecasts when confronted with arguments from peers. Of course there are many methodological challenges to credibly eliciting forecasts for events far into the future, and while the researchers do their best to meet them, their efforts might still fall short. The policy implications of the results so far could also be further clarified.
Anonymous evaluator 2
This report considers a forecasting tournament in which experts in relevant domains and super-forecasters from previous tournaments are asked to predict a large number of quantities of interest related to existential threats to human existence on Earth. A third group, drawn from the wider public, is also consulted. The tournament structure involves several rounds of interaction within and between groups in online forums, with anonymised predictions and rationales shared, rewards for active participants, and assessment of prediction quality in terms of how well individuals could predict the values given by others and their accuracy on quantities to be revealed over short time horizons.
As an exercise it is very interesting and could potentially be valuable, both to society in general and perhaps to decision makers at various levels. It is written appropriately for a relatively non-technical audience, with most of the detail contained in the appendices. The methodology used is consistent with best practice in the literature for this type of tournament, although, as the report points out, there are question areas where they had to make decisions based on inconclusive research, such as predictions for low-probability events. Overall, I enjoyed reading the document and found that the summaries do for the most part match the data collected. I do think the report would benefit from revision, based on the comments in my full report.
Metrics
Ratings
| Rating category | Evaluator 1 (Anonymous)*: Rating (0-100) | Evaluator 1: 90% CI (0-100) | Evaluator 2 (Anonymous)*: Rating (0-100) |
|---|---|---|---|
| Overall assessment | 75 | (60, 90) | 80 |
| Advancing knowledge and practice | 75 | (60, 90) | 85 |
| Methods: Justification, reasonableness, validity, robustness | 70 | (60, 80) | 75 |
| Logic & communication | 80 | (70, 90) | 75 |
| Open, collaborative, replicable | 70 | (60, 80) | 85 |
| Real-world relevance / Relevance to global priorities** | 90 | (85, 95) | 95 |
* Evaluator 1 was given a form following the categories described below, and explained in detail here, with guidance. Evaluator 2 was given the updated form here, with slightly different categories and guidelines, aimed at our Applied and Policy stream. Evaluator 2 opted to give midpoint ratings only, not CIs.
** The final categories (Real-world relevance / Relevance to global priorities) were separate for Evaluator 1 (although they gave the same rating for both). These were combined for Evaluator 2, as per our new evaluation template.
Journal ranking tiers
See here for more details on these tiers.
| Judgment (Evaluator 1, Anonymous) | Ranking tier (0-5) | 90% CI |
|---|---|---|
| On a ‘scale of journals’, what ‘quality of journal’ should this be published in? | 4.0 | (3.5, 5.0) |
| What ‘quality journal’ do you expect this work will be published in? | 4.0 | (3.5, 5.0) |

We summarize these tiers as:
0.0: Marginally respectable/Little to no value
1.0: OK/Somewhat valuable
2.0: Marginal B-journal/Decent field journal
3.0: Top B-journal/Strong field journal
4.0: Marginal A-journal/Top field journal
5.0: A-journal/Top journal
Evaluation manager’s discussion
23 Aug 2024: We are posting this without extensive comments for now, but we intend to add further discussion here within the next month or so.
Why we prioritized this paper (in brief)
Potential to provide credible measures of existential and catastrophic risk, and insight into key debates (e.g. over AI risk)
Already seems to be influencing other analyses and decisions (such as Clancy 2023 [2])
The XPT data and results seem likely to be used widely in modeling for global prioritization
Understanding and harnessing expert judgment is key to other research and policy areas
Understanding persuasion and consensus-building may help foster cooperation on global issues
Author engagement, suggestions for evaluators
Ezra Karger supported our efforts and offered suggestions for particular areas where the author team would like feedback. We shared these suggestions, along with a detailed list of our own, in the Bespoke Evaluation Notes linked here. The author team was particularly interested in feedback on the importance of this work, the usefulness of the data, and the characterization of the relevant debates. Also see the authors’ EA Forum post.
The authors have not provided a written response to these evaluations. They are invited to do so, and if they do, we will integrate it here.
How we chose the evaluators
We sought evaluators with expertise and experience in
Expert elicitation and aggregation
Rare events elicitation methods
Existential and catastrophic risks in particular areas, and their quantification (note: neither of the evaluators we recruited had particular expertise in the specific risk areas).
Issues meriting further evaluation
The evaluators (especially Evaluator 2) addressed many of the suggested issues (here). However, this was a long paper and our list of suggestions was extensive. We suspect some issues merit further detailed evaluation [3], including…
Data sharing: Is the data shared in a clear and useful way? How could it be made more useful?
Design and implementation, context
Did they correctly describe the debates in these areas? Did they ask the right questions? Were the prediction questions well framed?
Was the choice of ‘experts’ reasonable? Were these and other groups recruited in ways that would tilt them towards one side or the other, or towards being intransigent?
Could there be substantial anchoring behavior as a result of their displaying the “Prior Forecasts”?
Did they do a good job of framing the questions about how participants would allocate resources to mitigate potential risks? (Note: there is prior literature on ‘eliciting policy choices’ from public and policymaker participants.)
What specifically are the most appropriate methods for eliciting these rare event forecasts?
Statistical/quantitative analysis
Are their aggregation methods appropriate? If not, what approach would be more appropriate?
Lack of statistical modeling and inference (this was noted by the evaluators; further evaluators could propose or implement specific approaches)
Did they adequately consider the potential for attrition bias?