
Evaluation Summary and Metrics: "Forecasting Existential Risks: Evidence from a Long Run Forecasting Tournament", Applied Stream


Published on Aug 23, 2024
This Pub is a Review of:
Forecasting Existential Risks: Evidence from a Long-Run Forecasting Tournament
Description

The Existential Risk Persuasion Tournament (XPT) aimed to produce high-quality forecasts of the risks facing humanity over the next century by incentivizing thoughtful forecasts, explanations, persuasion, and updating from 169 forecasters over a multi-stage tournament. In this first iteration of the XPT, we discover points where historically accurate forecasters on short-run questions (superforecasters) and domain experts agree and disagree in their probability estimates of short-, medium-, and long-run threats to humanity from artificial intelligence, nuclear war, biological pathogens, and other causes. We document large-scale disagreement and minimal convergence of beliefs over the course of the XPT, with the largest disagreement about risks from artificial intelligence. The most pressing practical question for future work is: why were superforecasters so unmoved by experts’ much higher estimates of AI extinction risk, and why were experts so unmoved by the superforecasters’ lower estimates? The most puzzling scientific question is: why did rational forecasters, incentivized by the XPT to persuade each other, not converge after months of debate and the exchange of millions of words and thousands of forecasts?

Abstract

We organized two evaluations of the paper “Forecasting Existential Risks: Evidence from a Long-Run Forecasting Tournament”[1]. This paper was evaluated as part of our “applied and policy stream”, described here. To read these evaluations, please see the links below.

23 Sep 2024 note: The above link to the paper is not currently working. As a backup, see here or the Web Archive (‘Wayback Machine’) copy here.

Evaluations

1. Anonymous evaluation 1

2. Anonymous evaluation 2

Overall ratings

Note: Applied and Policy Stream

This paper was evaluated as part of our “applied and policy stream”, described here. The ratings should not be directly compared to those in our main academic stream.

We asked evaluators to provide overall assessments as well as ratings for a range of specific criteria. Note: The second evaluator was given an updated form geared specifically for our Applied and Policy stream, while the first evaluator used the (very similar) interface intended for the academic stream.

I. Overall assessment: We asked evaluators to rank this paper “heuristically” as a percentile “relative to all serious research in the same area that you have encountered in the last three years.” We requested that they “consider all aspects of quality, credibility, importance to knowledge production, and importance to practice.”

II. Journal rank tier, normative rating (0-5):1 On a ‘scale of journals’, what ‘quality of journal’ should this be published in? (See ranking tiers discussed here.) Note: 0 = lowest/none, 5 = highest/best.

|                        | Overall assessment (0-100) | Journal rank tier, normative rating (0-5) |
| ---------------------- | -------------------------- | ----------------------------------------- |
| Anonymous evaluation 1 | 75                         | 4.0                                       |
| Anonymous evaluation 2 | 80                         | N/A                                       |

See “Metrics” below for a more detailed breakdown of the evaluators’ ratings across several categories. To see these ratings in the context of all Unjournal ratings, with some analysis, see our data presentation here.2

See here for the current full evaluator guidelines, including further explanation of the requested ratings.3

Evaluation summaries

Anonymous evaluator 1

I really appreciate the team’s ambition in tackling difficult long-term forecasts of important existential risks. Just as importantly, they are exploring how experts might update their forecasts when confronted with arguments from peers. There are, of course, many methodological challenges to credibly eliciting forecasts for events far into the future; the researchers do their best to meet them, but their efforts may still fall short. The policy implications of the results so far could also be further clarified.

Anonymous evaluator 2

This report considers a forecasting tournament in which experts in relevant domains and superforecasters from previous tournaments are asked to predict a large number of quantities of interest related to existential threats to humanity. A third group, drawn from the wider public, is also consulted. The tournament involves several rounds, with interaction within and between groups in online forums, sharing of anonymised predictions and rationales, rewards for active participants, and assessment of prediction quality in terms of how well individuals could predict the values given by others and their accuracy on quantities to be revealed over short time horizons.

As an exercise it is very interesting and could be valuable, both to society in general and perhaps to decision-makers at various levels. It is written appropriately for a relatively non-technical audience, with most of the detail contained in the appendices. The methodology is consistent with best practice in the literature for this type of tournament, although, as the report points out, there are question areas where the authors had to make decisions based on inconclusive research, such as predictions for low-probability events. Overall, I enjoyed reading the document and found that the summaries for the most part match the data collected. I do think the report would benefit from revision, based on the comments in my full report.

Metrics

Ratings

| Rating category*                                              | Evaluator 1* (Anonymous): Rating (0-100) | Evaluator 1: 90% CI (0-100)* | Evaluator 2 (Anonymous): Rating (0-100) |
| ------------------------------------------------------------- | ---------------------------------------- | ---------------------------- | ---------------------------------------- |
| Overall assessment4                                            | 75                                       | (60, 90)                     | 80                                       |
| Advancing knowledge and practice5                              | 75                                       | (60, 90)                     | 85                                       |
| Methods: Justification, reasonableness, validity, robustness6  | 70                                       | (60, 80)                     | 75                                       |
| Logic & communication7                                         | 80                                       | (70, 90)                     | 75                                       |
| Open, collaborative, replicable8                               | 70                                       | (60, 80)                     | 85                                       |
| Real-world relevance9 / Relevance to global priorities10 **    | 90                                       | (85, 95)                     | 95                                       |

* Evaluator 1 was given a form following the categories described below, and explained in detail here, with guidance. Evaluator 2 was given the updated form here, with slightly different categories and guidelines, aimed at our Applied stream. Evaluator 2 opted to give midpoint ratings only, not CIs.

** The final categories (Real-world…/Relevance) were separate for Evaluator 1 (although they gave the same ratings for both). These were combined for Evaluator 2, as per our new evaluation template.

Journal ranking tiers

See here for more details on these tiers.11

| Judgment (Evaluator 1, Anonymous)                                                 | Ranking tier (0-5) | 90% CI     |
| ---------------------------------------------------------------------------------- | ------------------ | ---------- |
| On a ‘scale of journals’, what ‘quality of journal’ should this be published in?    | 4.0                | (3.5, 5.0) |
| What ‘quality journal’ do you expect this work will be published in?                | 4.0                | (3.5, 5.0) |


We summarize these as:

  • 0.0: Marginally respectable/Little to no value

  • 1.0: OK/Somewhat valuable

  • 2.0: Marginal B-journal/Decent field journal

  • 3.0: Top B-journal/Strong field journal

  • 4.0: Marginal A-Journal/Top field journal

  • 5.0: A-journal/Top journal

Evaluation manager’s discussion

23 Aug 2024: We are posting this without extensive comments for now, but we intend to add further discussion here within the next month or so.

Why we prioritized this paper (in brief)

  1. Potential to provide credible measures of existential and catastrophic risk, and insight into key debates (e.g. over AI risk)

  2. Already seems to be influencing other analyses and decisions (such as Clancy 2023 [2])

  3. The XPT data and results seem likely to be used widely in modeling for global prioritization

  4. Understanding and harnessing expert judgment is key to other research and policy questions

  5. Understanding persuasion and consensus-building may help foster cooperation on global issues

Author engagement, suggestions for evaluators

Ezra Karger supported our efforts and suggested particular areas where the author team would like feedback. We shared these, along with a detailed list of our own suggestions, in the Bespoke Evaluation Notes linked here. The author team was particularly concerned with the importance of this work, the usefulness of the data, and the characterization of the relevant debates. Also see the authors’ EA Forum post.

The authors have not provided a written response to these evaluations. They are invited to do so, and if they do, we will integrate it here.

How we chose the evaluators

We sought evaluators with expertise and experience in

  1. Expert elicitation and aggregation

  2. Rare events elicitation methods

  3. Existential and catastrophic risks in particular areas, and their quantification (note: neither of the evaluators we recruited had particular expertise in the specific risk areas).

Issues meriting further evaluation

The evaluators (especially Evaluator 2) addressed many of the suggested issues (here). However, this was a long paper with an extensive list of suggestions, and we suspect some issues merit further detailed evaluation [3], including:

Data sharing: Is the data shared in a clear and useful way? How could it be made more useful?

Design and implementation, context

  • Are they correctly describing the debates in these spaces? Did they ask the right questions?  Were the prediction questions well-framed?

  • Was the choice of ‘experts’ reasonable?  Were these and other groups recruited in ways that would tilt them towards one side or the other, or towards being intransigent?

  • Could there be substantial anchoring behavior as a result of their displaying the “Prior Forecasts”?

  • Did they do a good job of framing the ‘questions about how they would allocate resources to mitigate potential risks’? (Note: there is some prior literature on ‘eliciting policy choices’ from the public and from policymaker participants.)

  • What specifically are the most appropriate methods for eliciting these rare event forecasts?

Statistical/quantitative analysis

  • Are their aggregation methods appropriate? If not, what approach would be more appropriate? (An illustrative sketch of two common pooling rules appears after this list.)

  • Lack of statistical modeling and inference (this was noted by the evaluators; further evaluators could propose or implement specific approaches)

  • Did they adequately consider the potential for attrition bias?
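
To make the aggregation question above concrete, here is a minimal sketch, assuming nothing about the paper’s actual procedure, comparing two common ways to pool individual probability forecasts of a rare event: the arithmetic mean of probabilities and the geometric mean of odds. The function names and example forecasts are hypothetical.

```python
# Illustrative sketch only (not the XPT's aggregation procedure):
# two common ways to pool individual probability forecasts of a rare event.
# The example forecasts below are hypothetical.
import numpy as np


def arithmetic_mean(probs):
    """Simple average of the individual probabilities."""
    return float(np.mean(probs))


def geometric_mean_of_odds(probs, eps=1e-9):
    """Convert probabilities to odds, take the geometric mean, convert back.

    Compared with the arithmetic mean, this pooling rule is less dominated
    by a single unusually high forecast, which matters for rare events.
    """
    p = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    odds = p / (1 - p)
    pooled_odds = np.exp(np.mean(np.log(odds)))
    return float(pooled_odds / (1 + pooled_odds))


if __name__ == "__main__":
    forecasts = [0.001, 0.002, 0.005, 0.05]  # hypothetical risk forecasts
    print("Arithmetic mean of probabilities:", round(arithmetic_mean(forecasts), 4))
    print("Geometric mean of odds:", round(geometric_mean_of_odds(forecasts), 4))
```

On forecasts like these, the two rules can differ by a factor of several, which is one reason the choice of pooling rule itself merits evaluation.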
