
Evaluation of "Forecasting Existential Risks: Evidence from a Long-Run Forecasting Tournament" for The Unjournal (Applied Stream)

Published on Aug 23, 2024
This Pub is a Review of
Forecasting Existential Risks: Evidence from a Long-Run Forecasting Tournament
Description

The Existential Risk Persuasion Tournament (XPT) aimed to produce high-quality forecasts of the risks facing humanity over the next century by incentivizing thoughtful forecasts, explanations, persuasion, and updating from 169 forecasters over a multi-stage tournament. In this first iteration of the XPT, we discover points where historically accurate forecasters on short-run questions (superforecasters) and domain experts agree and disagree in their probability estimates of short-, medium-, and long-run threats to humanity from artificial intelligence, nuclear war, biological pathogens, and other causes. We document large-scale disagreement and minimal convergence of beliefs over the course of the XPT, with the largest disagreement about risks from artificial intelligence. The most pressing practical question for future work is: why were superforecasters so unmoved by experts’ much higher estimates of AI extinction risk, and why were experts so unmoved by the superforecasters’ lower estimates? The most puzzling scientific question is: why did rational forecasters, incentivized by the XPT to persuade each other, not converge after months of debate and the exchange of millions of words and thousands of forecasts?

Abstract

I really appreciate the team’s ambition in tackling difficult long-term forecasts of important existential risks. Just as importantly, they are exploring how experts might update their forecasts when confronted with arguments from peers. Of course, there are many methodological challenges to credibly eliciting forecasts for events far into the future, and the researchers do their best to meet them, but their efforts might still fall short. The policy implications of the results so far could also be further clarified.

Summary Measures

We asked evaluators to give some overall assessments, in addition to ratings across a range of criteria. See the evaluation summary “metrics” for a more detailed breakdown of this. See these ratings in the context of all Unjournal ratings, with some analysis, in our data presentation here.1

Overall assessment: 75/100 (90% credible interval: 60–90)

Journal rank tier, normative rating: 4.0/5 (90% credible interval: 3.5–5.0)

Overall assessment: We asked evaluators to rank this paper “heuristically” as a percentile “relative to all serious research in the same area that you have encountered in the last three years.” We requested they “consider all aspects of quality, credibility, importance to knowledge production, and importance to practice.”

Journal rank tier, normative rating (0-5): “On a ‘scale of journals’, what ‘quality of journal’ should this be published in? (See ranking tiers discussed here)” Note: 0 = lowest/none, 5 = highest/best.

See here for the full evaluator guidelines, including further explanation of the requested ratings.

Written report [1]

I really appreciate the team’s ambition in tackling difficult long-term forecasts of important existential risks. Just as importantly, they are exploring how experts might update their forecasts when confronted with arguments from peers. Of course, there are many methodological challenges to credibly eliciting forecasts for events far into the future, and the researchers do their best to meet them, but their efforts might still fall short. The policy implications of the results so far could also be further clarified. The team could do a lot more analysis with the rich data, but I understand that this report was aimed at a general audience.

Methodology

How might we think about the reciprocal scoring results in light of the robustly documented false consensus effect? It would be good to spell out the theoretical properties of reciprocal scoring and why it is supposedly superior to other methods, both when forecasters are rational agents and when they are behaviorally biased. For these low-probability events, it has been shown that asking participants to list the possible ways in which an outcome could be reached, or to list reasons why it cannot happen, can respectively increase or decrease their forecasted probability of that outcome.
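
To make the false-consensus concern concrete, here is a minimal sketch of how a reciprocal score might penalize a forecaster who simply projects their own belief onto the other group. The mechanics (scoring a guess of the other group's median forecast with a quadratic penalty) and all numbers are my own illustration, not taken from the report.

```python
# Minimal sketch of reciprocal scoring as I understand it (an assumed
# mechanic for illustration, not code from the paper): each forecaster is
# scored on how close their guess of the *other* group's median forecast
# comes to that group's realized median, with a quadratic penalty.
import statistics

def reciprocal_score(guess_of_other_median, other_group_forecasts):
    realized_median = statistics.median(other_group_forecasts)
    return (guess_of_other_median - realized_median) ** 2  # lower is better

expert_forecasts = [0.02, 0.05, 0.10, 0.03]  # hypothetical AI-risk estimates

# A forecaster subject to the false consensus effect might report their own
# belief (say 0.005) as their guess of the experts' median, and would be
# penalized relative to a better-calibrated guess (0.04, the true median).
print(reciprocal_score(0.005, expert_forecasts))  # 0.001225
print(reciprocal_score(0.040, expert_forecasts))  # 0.0
```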

Analysis

Given the target audience, it may make sense that the report does not contain more nonparametric tests or regression analysis. However, the current tables do not reveal the individual heterogeneity that may be driving very different forecasts, beyond the domain expert vs. superforecaster distinction or the type of domain expertise. Regression analysis could properly control for other important differences between forecasters. For example, is previous performance as a superforecaster correlated with the forecasts given in the current tournament?
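
As a sketch of what such a regression might look like (column names and data are entirely hypothetical, purely to illustrate the specification):

```python
# Illustrative sketch, not the authors' analysis: regress individual
# forecasts on forecaster characteristics to expose heterogeneity beyond
# the domain expert vs. superforecaster split.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    # log-odds of the forecasted probability, to keep the outcome unbounded
    "logit_forecast": rng.normal(-4, 1, n),
    "group": rng.choice(["superforecaster", "domain_expert"], n),
    "domain": rng.choice(["AI", "nuclear", "bio", "other"], n),
    "past_brier": rng.uniform(0.05, 0.4, n),  # hypothetical prior accuracy
})

# OLS with group, domain, and past-performance controls; mixed-effects or
# clustered standard errors would be natural extensions.
model = smf.ols("logit_forecast ~ C(group) * C(domain) + past_brier", data=df).fit()
print(model.summary())
```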

Policy

To understand whether the differences between the forecasts of domain experts and superforecasters matter for policy, it would be useful to know over what ranges of existential risk policy makers would choose the same or different options. While the differences in forecasts are very interesting from an academic perspective, if both sets of risk assessments would still lead to the same policy choices, then perhaps we need not be as concerned from an applied perspective.
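
A toy break-even calculation illustrates the point; the cost and loss figures below are hypothetical and not drawn from the report.

```python
# If a mitigation costs `cost` and averts an expected loss of `averted_loss`
# when the catastrophe occurs, a risk-neutral policy maker adopts it only
# when the risk estimate exceeds cost / averted_loss. Forecast disagreement
# therefore changes the decision only when the two estimates straddle that
# threshold. All numbers here are made up for illustration.
def same_policy_choice(p_superforecaster, p_expert, cost, averted_loss):
    threshold = cost / averted_loss
    return (p_superforecaster > threshold) == (p_expert > threshold)

# Example: mitigation costs 1 unit and averts a 1,000-unit loss,
# so the break-even risk is 0.1%.
print(same_policy_choice(0.0001, 0.003, cost=1, averted_loss=1000))  # False: choices diverge
print(same_policy_choice(0.002, 0.03, cost=1, averted_loss=1000))    # True: both adopt
```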

The team could use simple natural language processing methods to detect and categorize the arguments that forecasters made in their team discussions. This could offer clues about the types of arguments that succeeded in changing a few minds versus the many that did not. There are a number of theoretical models of which narratives or arguments are most persuasive, e.g., the one that best fits the past data, the one that gives the most optimistic outlook for the future, and so on. Changing the incentive structure could also lead to more persuasion: what if one forecast were picked at random to count for the whole team?
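
As a rough illustration of the NLP suggestion (with made-up rationales rather than the actual XPT discussion text), even off-the-shelf tools can group written arguments into coarse categories:

```python
# Cluster forecasters' written rationales into argument types using
# TF-IDF features and k-means. A real analysis would use the tournament's
# discussion transcripts and validate clusters against a hand-coded scheme.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

rationales = [  # hypothetical examples, not quotes from the tournament
    "base rates of past pandemics suggest the risk is low",
    "rapid AI capability gains make extrapolation from history unreliable",
    "historical near-misses show nuclear deterrence is fragile",
    "expert consensus reports point to rising biosecurity threats",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(rationales)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for label, text in zip(kmeans.labels_, rationales):
    print(label, text)
```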
