
Evaluation 2 of "Forecasting Existential Risks: Evidence from a Long Run Forecasting Tournament", Applied Stream

Evaluation of "Forecasting Existential Risks: Evidence from a Long Run Forecasting Tournament" for The Unjournal.

Published on Aug 23, 2024
This Pub is a Review of
Forecasting Existential Risks: Evidence from a Long-Run Forecasting Tournament
Description

The Existential Risk Persuasion Tournament (XPT) aimed to produce high-quality forecasts of the risks facing humanity over the next century by incentivizing thoughtful forecasts, explanations, persuasion, and updating from 169 forecasters over a multi-stage tournament. In this first iteration of the XPT, we discover points where historically accurate forecasters on short-run questions (superforecasters) and domain experts agree and disagree in their probability estimates of short-, medium-, and long-run threats to humanity from artificial intelligence, nuclear war, biological pathogens, and other causes. We document large-scale disagreement and minimal convergence of beliefs over the course of the XPT, with the largest disagreement about risks from artificial intelligence. The most pressing practical question for future work is: why were superforecasters so unmoved by experts’ much higher estimates of AI extinction risk, and why were experts so unmoved by the superforecasters’ lower estimates? The most puzzling scientific question is: why did rational forecasters, incentivized by the XPT to persuade each other, not converge after months of debate and the exchange of millions of words and thousands of forecasts?

Abstract

This report considers a forecasting tournament in which experts in relevant domains and superforecasters from previous tournaments are asked to predict a large number of quantities of interest related to threats to human existence on Earth. A third group, drawn from the wider public, is also consulted. The tournament structure involves several rounds with interaction within and between groups in online forums, the sharing of anonymised predictions and rationales, rewards for active individuals, and assessment of the quality of predictions both in terms of how well individuals could predict the values given by others and in terms of accuracy on quantities to be revealed over short time horizons.

As an exercise it is very interesting and could potentially be valuable, both to society in general and to decision makers at various levels. It is written appropriately for a relatively non-technical audience, with most of the detail contained in the appendices. The methodology used is consistent with best practice in the literature for this type of tournament, although, as the report points out, there are question areas, such as predictions for low-probability events, where the authors had to make decisions based on inconclusive research. Overall, I enjoyed reading the document and found that the summaries do, for the most part, match the data collected. I do think the report would benefit from revision, based on the comments in my full report.

Summary Measures

We asked evaluators to give some overall assessments, in addition to ratings across a range of criteria. See the evaluation summary “metrics” for a more detailed breakdown of this. See these ratings in the context of all Unjournal ratings, with some analysis, in our data presentation here.1

Overall assessment: 80/100

Overall assessment: We asked evaluators to rank this paper “heuristically” as a percentile “relative to all serious research in the same area that you have encountered in the last three years.” We requested they “consider all aspects of quality, credibility, importance to knowledge production, and importance to practice.”

Journal rank tier, normative rating (0-5): “On a ‘scale of journals’, what ‘quality of journal’ should this be published in? (See ranking tiers discussed here.)” Note: 0 = lowest/none, 5 = highest/best.

See here for the full evaluator guidelines, including further explanation of the requested ratings.

Claim identification

I. Identify the most important and impactful factual claim this research makes – e.g., a binary claim or a point estimate or prediction.2

“The median expert predicted a 20% chance of catastrophe and a 6% chance of human extinction by 2100. Superforecasters saw the chances of both catastrophe and extinction as considerably lower than did experts. The median superforecaster predicted a 9% chance of catastrophe and a 1% chance of extinction.” 

This claim comes directly from the results of the tournament conducted. It is accurate with respect to the data collected (presuming the authors analysed the data correctly). These probabilities are relatively large, so if they can be regarded as reasonable estimates, governments may want to take action to reduce these risks where possible.

 

II. To what extent do you believe this claim?3

 I certainly believe that these were the medians from the groups. I have more concerns about how well these numbers summarise the views of these two groups.

 

III. What additional information, evidence, replication, or robustness check would make you substantially more (or less) confident in this claim?

 There are two potential issues with this claim:

  1. The claim cites medians from the two groups, but there was substantial (orders of magnitude) variation in the estimates made by different experts and superforecasters.

  2. It isn’t completely clear, as I discuss in my review, whether the estimates elicited from the groups were of high quality. More information is required to evaluate this, as I discuss in my comments.

 

Written report4

This report considers a forecasting tournament in which experts in relevant domains and superforecasters from previous tournaments are asked to predict a large number of quantities of interest related to threats to human existence on Earth. A third group, drawn from the wider public, is also consulted. The tournament structure involves several rounds with interaction within and between groups in online forums, the sharing of anonymised predictions and rationales, rewards for active individuals, and assessment of the quality of predictions both in terms of how well individuals could predict the values given by others and in terms of accuracy on quantities to be revealed over short time horizons.

As an exercise it is very interesting and could potentially be valuable, both to society in general and to decision makers at various levels. It is written appropriately for a relatively non-technical audience, with most of the detail contained in the appendices. The methodology used is consistent with best practice in the literature for this type of tournament, although, as the report points out, there are question areas, such as predictions for low-probability events, where the authors had to make decisions based on inconclusive research. Overall, I enjoyed reading the document and found that the summaries do, for the most part, match the data collected. I do think the report would benefit from revision, based on the comments below.

Main comments[1]

  1. [Bootstrapped CIs/medians] I’m not convinced by the use of bootstrap confidence intervals to represent uncertainty/variation. They provide an assessment of variability in the group median, which isn’t of primary interest as a measure of uncertainty here. What is of interest is the differences in predictions between individuals, not how the median would change with a different group of individuals. It would make a lot more sense to replace all bootstrap confidence intervals in tables and text with relevant quantiles (e.g., quartiles) from the empirical distribution of individual predictions.
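
For illustration, here is a minimal sketch of the contrast I have in mind, using simulated numbers in place of the actual XPT forecasts (the group size, distribution, and values are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated individual forecasts (%) from one group; placeholder values, not XPT data.
forecasts = rng.lognormal(mean=0.0, sigma=1.5, size=80)

# Bootstrap 95% CI for the group median: variability of the summary statistic.
boot_medians = [
    np.median(rng.choice(forecasts, size=forecasts.size, replace=True))
    for _ in range(10_000)
]
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])

# Quartiles of the individual forecasts: the spread of opinion within the group.
q1, med, q3 = np.percentile(forecasts, [25, 50, 75])

print(f"Bootstrap 95% CI for the median: ({ci_low:.2f}, {ci_high:.2f})")
print(f"Median and quartiles of individual forecasts: {med:.2f} ({q1:.2f}, {q3:.2f})")
```

With realistic group sizes, the bootstrap interval for the median will typically be far narrower than the interquartile range of individual forecasts, which is exactly why it understates the disagreement I am referring to.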

  2. I have some concerns about the process of preparing the experts and eliciting the quantities of interest. There is a large body of evidence which demonstrates that training experts in making probabilistic predictions is critical for them to provide accurate predictions. It would be good to add some information on how the experts who had not taken part in such an exercise before were trained in making predictions of the types required. Without such an extensive training process I don’t believe that the predictions from the experts will be of sufficiently high quality to be trustworthy.

  3. For the questions in which you ask individuals for probabilities, please make clear what you mean by probability and how the experts were asked to think about probability.

  4. How did you ensure that a forecaster understood the question? Some of the [questions] are complex and could be interpreted in various ways. How did you ensure that all experts understood the question in the same way, and in the way that was intended? In my experience of conducting elicitations in person, no matter how carefully a question is constructed, different experts will interpret it differently.

  5. When asking the experts for continuous quantities, are you asking them for their mean value, their median, their modal value or something else? How was this communicated to the experts?

  6. Is the large number of questions a problem? Can anyone really be expected to provide high quality forecasts over such a large number of questions over a range of different specialisms?

  7. [Oversimplified language, inaccuracy] I’m aware that this isn’t for a technical audience, but I’m a little concerned that some of the reporting risks the reader coming away with an inaccurate impression of the results of the elicitation exercise. An example is the box on page 15: “Superforecasters estimated a 1% chance of human extinction by 2100, while experts estimated a 6% chance.” I don’t think that this statement is true. This applies to other summaries of the results in the report.

  8. Many of the tables in the report only provide median predictions from each of the groups. Whenever a median is provided in a table, an interval estimate should also be provided.

  9. [Group differences vs framing differences] The statement “Members of the public estimated probabilities of extinction in between the estimates of experts and superforecasters, but diverged significantly with alternative probability elicitation formats” is interesting, and is certainly backed up by the numbers provided. However, how interesting it is depends on whether the same was found when asking experts and superforecasters the questions in different ways. Were they also susceptible to framing in this way?

  10. [Need for statistical inference] In Figure 7, the increase in AI extinction risk by quintile is not as clear as for previous figures, particularly for the experts. If the authors wish to make this claim I think they need to provide some formal inferential evidence to back it up. They could then apply this inferential technique to the data in the previous figures too.
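
If the authors want a formal check, one option (sketched here with invented quintile labels and forecasts, not the XPT data) would be a rank-based trend test across quintiles, together with a Kruskal–Wallis test for any difference at all:

```python
import numpy as np
from scipy.stats import spearmanr, kruskal

rng = np.random.default_rng(1)

# Simulated data: quintile membership (1-5) and AI-extinction forecasts (%)
# for one group; the real per-forecaster data would be substituted here.
quintile = np.repeat([1, 2, 3, 4, 5], 20)
forecast = rng.lognormal(mean=np.log(1 + 0.3 * quintile), sigma=1.0)

# Monotone-trend check: Spearman rank correlation between quintile and forecast.
rho, p_trend = spearmanr(quintile, forecast)

# Any-difference check: Kruskal-Wallis across the five quintiles.
groups = [forecast[quintile == q] for q in range(1, 6)]
h_stat, p_kw = kruskal(*groups)

print(f"Spearman rho = {rho:.2f} (p = {p_trend:.3f}); Kruskal-Wallis p = {p_kw:.3f}")
```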

  11. In the section on the “Relationships between risk predictions in different domains”, some correlations need to be reported, possibly in the form of a coloured correlation matrix. If the relationships exist then this is the best way to demonstrate them. I don’t find the partitioning analysis convincing about the relationships, as I will discuss below.
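
A minimal sketch of what I have in mind, with simulated responses standing in for the XPT data (the domain labels and values are placeholders):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)

# Simulated per-forecaster risk predictions (%) by domain; placeholder values only.
df = pd.DataFrame(
    rng.lognormal(mean=0.0, sigma=1.0, size=(100, 4)),
    columns=["AI", "Nuclear", "Bio", "Other"],
)

# Spearman (rank) correlations are a natural choice given how skewed the forecasts are.
corr = df.corr(method="spearman")

fig, ax = plt.subplots()
im = ax.imshow(corr.values, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax, label="Spearman correlation")
plt.show()
```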

  12. In particular, I am not convinced by any of the comparisons between the AI skeptics and the AI concerned. Since the makeup of the two groups is very different in terms of the proportions of experts and superforecasters, I am not convinced that the differences found are a result of the individuals’ attitudes to AI. Instead, I think they are likely to be a result of the differing compositions of the two groups. Or at the very least, I don’t think that this can be ruled out.
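
One way to check this, sketched below with simulated data (the variable names, effect sizes, and group labels are invented), is to adjust the skeptic/concerned comparison for forecaster type in a simple regression, or equivalently to compare the groups within experts and within superforecasters separately:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 200

# Simulated data: forecaster type, AI-attitude group, and a log-scale forecast
# of some non-AI risk; all values are placeholders, not XPT data.
df = pd.DataFrame({
    "expert": rng.integers(0, 2, size=n),        # 1 = domain expert, 0 = superforecaster
    "ai_concerned": rng.integers(0, 2, size=n),  # 1 = "AI concerned", 0 = "AI skeptic"
})
df["log_forecast"] = (
    0.5 * df["expert"] + 0.2 * df["ai_concerned"] + rng.normal(scale=1.0, size=n)
)

# Does the skeptic/concerned difference survive adjustment for forecaster type?
model = smf.ols("log_forecast ~ ai_concerned + expert", data=df).fit()
print(model.summary().tables[1])
```

If the skeptic/concerned coefficient shrinks substantially once forecaster type is included, that would support my concern that the comparison mainly reflects group composition.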

  13. [Need for statistical inference] There are some relatively strong statements made in the section on “How time spent thinking about existential risk relates to forecasts of those risks”. I don’t think the results presented, particularly given the small sample sizes in each group, support the statements. If the authors wish to keep these statements, I suggest they perform some inferential statistics to demonstrate that the results they see by eye are supported by more formal evidence.
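
For example, a permutation test on the difference in group medians makes no distributional assumptions and behaves sensibly at small sample sizes; the sketch below uses invented forecasts and group sizes purely for illustration:

```python
import numpy as np
from scipy.stats import permutation_test

rng = np.random.default_rng(4)

# Simulated forecasts (%) from two small "time spent on x-risk" groups;
# the real per-forecaster values would be substituted here.
low_time = rng.lognormal(mean=0.0, sigma=1.2, size=12)
high_time = rng.lognormal(mean=0.4, sigma=1.2, size=15)

def median_diff(x, y):
    # Test statistic: difference in group medians.
    return np.median(x) - np.median(y)

res = permutation_test(
    (low_time, high_time), median_diff,
    permutation_type="independent", alternative="two-sided",
    n_resamples=10_000, random_state=rng,
)
print(f"median difference = {res.statistic:.2f}, p = {res.pvalue:.3f}")
```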

  14. On the lack of convergence of the different groups over the different rounds: consensus-based elicitation methods encourage individuals to think not just about their own beliefs, but about what an impartial individual exposed to the range of estimates, rationales and evidence would think. This approach would be interesting here. An example of it is:

O’Hagan, A. (2019). Expert Knowledge Elicitation: Subjective but Scientific. The American Statistician, 73(sup1), 69–81. https://doi.org/10.1080/00031305.2018.1518265

In addition, individuals may have been more likely to change their views if a skilled moderator had been employed in the project. In general, a lack of “real” interaction could be a barrier to changing individual opinions.

  15. [Between-group differences: need to report uncertainty] There are lots of statements in the results about the differences between the groups. These are of course differences between the medians of the groups. I would like to see more reporting of uncertainty within the groups in the narrative. The spread of probabilities within groups was very wide for almost all of the quantities elicited, so reporting the median as if it represents all of the individuals in a group is somewhat disingenuous.

  16. In terms of better methods for eliciting very low probability events, I suggest the authors perform a literature review in fields where such assessments are commonly required, such as nuclear safety and food safety.

Minor points5

  1. Please explain what is meant by a “pundit” and how it differs from an expert.

  2. 6

  3. With regard to the recruitment of domain experts, there is some evidence in the literature that those with the “most expertise” are the least accurate experts, and that it is best to recruit experts who are relatively early/mid-career, rather than those who have reached very senior positions. For example, see:

Burgman MA, McBride M, Ashton R, Speirs-Bridge A, Flander L, Wintle B, et al. (2011) Expert Status and Performance. PLoS ONE 6(7): e22998.

  4. On page 11, scoring is only described for binary variables. How was scoring done for continuous quantities?

  5. The y-axis on the various boxplots in the report is given on the log scale. This should be explained somewhere.

  6. I’m interested in the risk avoidance question: “If you could allocate an additional $10,000 to x-risk avoidance, how would you divide the money among the following topics?” What is the rationale for using such a small amount of money (given the scales of the risks involved)?

  7. 7

  8. I suggest replacing “intersubjective” with another phrase. It is very jargony for this type of report.

  9. In the sentence: “In other words, the better experts were at discerning what other people would predict, the less concerned they were about extinction.” I don’t think this applied only to the expert group?

  10. Page 31: “Within both groups, those who did best on reciprocal scoring had lower forecasts of extinction risk on average.”

  11. 8

  12. 9

  13. In terms of “Global temperature changes”, is the main message the similarity between the two groups? Is this also true of the range of values in the two groups?

  14. In the results for “Renewable energy”, could you provide a rationale for why the two groups of forecasters have been combined?

  15. “Nuclear fusion reactors will deliver 1% of all utility-scale power consumed in the U.S. by 2075, according to median forecasts. Superforecasters predicted a median of 2077, while subject matter experts’ median was 2100.” I can’t work out how all three of these figures can be true at the same time.

  16. “The median superforecaster prediction was that 80% of an expert panel would, for Russia and North Korea separately, agree that that country had a bioweapons program between 2022 and 2050”. I don’t understand what this means.

  17. Have the experts been given feedback on the questions to which the answers are now known?

  18. 10
