Super-Forecasting

“Perhaps most troubling, we found an Intelligence Community in which analysts have a difficult time stating their assumptions up front, explicitly explaining their logic, and, in the end, identifying unambiguously for policymakers what they do not know.” https://www.createbetterreasoning.com/about_ext.html

“… the importance of good analysis goes beyond accuracy. Decision makers need to understand the reasoning behind good conclusions, including what alternative explanations were considered, what assumptions were made, how evidence was evaluated, and how confident analysts are in their findings”.

1       Aim and potential outcomes

The purpose of this activity was:

  • To understand the super-forecasting literature and its potential value to O&A
  • To get some hands-on experience with this approach (how to set it up, run it, monitor it and evaluate it)
  • To train the project team members in carrying out a forecasting activity.

Outcomes of this project include:

  • A better understanding of the forecasting process
  • The development of a software platform designed for forecasting exercises
  • A comparison between forecasting and foresighting activities
  • An appreciation of the potential benefits and challenges involved in porting the learning to decision-makers and stakeholders as part of an OF outreach process.

2       Introduction

Our Ocean Futures project uses several approaches to imagine the future, including (i) models, (ii) foresighting activities, and (iii) expert-based methods.

An example of the third category is super-forecasting (SF). The rationale for including a super-forecasting element in the project includes (i) the belief that better forecasting leads to better decision making and (ii) the empirical finding that “… forecasts can be improved with behavioural interventions” (Mellers et al., 2014).

In particular, the SF literature suggests that forecasting skills can be improved by

  1. training, to address well-known cognitive biases
  2. teaming, to encourage sharing information and discussing the rationales behind the forecast
  3. tracking the forecasters' performance, used as feedback for learning, and
  4. elite teaming: placing the highest performers in elite teams.

An alternative set of instructions to improve forecasting skills (summarised by the acronym HABIT) is included in the Hybrid Forecast Competition (HFC) training material:

  1. Hunt for the right information
  2. Answer the right question (poor forecasters answer the wrong question)
  3. Base rate awareness (inside vs outside view; the outside view sees a question as an instance of a wider set of questions)
  4. Infer wisely (combine consistent and inconsistent evidence)
  5. Track news and adjust your forecast (belief updating by small, frequent adjustments was the strongest mark of accurate forecasts)

3       What we have done

We decided to use this time to gain experience with different types of SF activities rather than focussing on a single exercise. There were three main reasons for this decision:

  • This activity had only half a year to run;
  • We wished to learn as much as possible about this body of work;
  • We did not have enough forecasters or time to develop effective teaming and tracking components for an internally driven exercise.

As a result, three types of activities were carried out:

  • Two team members joined the Good Judgement Project (https://www.gjopen.com/). This competition consists of forecasting the outcome of a number of geopolitical events.
  • Two team members joined the Hybrid Forecasting Competition (HFC, https://www.gjopen.com/challenges/19-coming-soon-hfc-challenge). This competition also consists of forecasting the outcome of a number of geopolitical events. The main difference from the Good Judgement Project is that forecasting is carried out at the interface between human judgement and machine learning.
  • Eleven members participated in an internally run forecast exercise, consisting of forecasting the price of BitCoin one week ahead.

The reason for participating in three exercises is that they provide different challenges and different experiences:

  • The Good Judgement Project provides the opportunity to interact with other forecasters around the world. This interaction occurs mainly through sharing and discussing the rationales for the forecasts, which is the essence of the teaming component of the approach; it was important for us to gain experience of how it works. In addition, by participating in the competition, our forecasters are ranked against other competitors. The focus here is not on individual performance per se, so much as on getting a rough idea of whether the time we planned to devote to the exercise (0.5-1 hour per week) is sufficient for a reasonable forecast.
  • The Hybrid Forecasting Competition (HFC) provides the same benefits as above, in addition to showing how the human-computer interaction is managed. It includes two types of questions: (i) geopolitical ones, as in the GJP, and (ii) ‘time series’ questions (predicting the price of gold, or interest rates in various countries, on a certain day), for which ~1 year of data is provided.
  • Our own exercise can provide more learning to the participants, since it involves more forecasts than both GJP and HFC, and thus more feedback. It also provides information about the type of infrastructure we need to carry out these exercises internally and, possibly, to involve external stakeholders.

4       Results and feedback

This section includes the results from the BitCoin forecast exercise and some feedback from the participants in the BitCoin, GJP and HFC exercises.

4.1  BitCoin forecast exercise

Chris Moeseneder developed a platform for the forecast, http://aqua.marine.csiro.au:8888/apex/f?p=167. This platform allowed participants to provide forecasts of the value of BitCoin on a weekly basis. Specifically, the target of the forecast was the closing value of BitCoin on Sunday. We aimed to follow the protocol used by the GJP to score the forecasts; as a result, multiple forecasts were allowed, with each forecast updating the previous one. Participants were provided with a plot of a year-long time series of BitCoin values. They were also provided with five bins, each covering a possible range of future BitCoin values, and assigned a probability of occurrence to each of the five bins. The bins were designed so that, according to past records, the probability of the outcome falling into each bin was uniform. This made the forecast particularly difficult, but also reduced problems with the forecast base rate.
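
As an illustration, a minimal sketch of one way such equiprobable bins could be constructed is given below. This is not the platform's actual implementation: it assumes the bin edges are simply the quintiles of the historical weekly closing values, and the function and variable names are our own.

    import numpy as np

    def equiprobable_bins(history, n_bins=5):
        """Derive bin edges such that, according to the historical
        record, each bin is equally likely to contain the next value."""
        # Interior edges at the 20th, 40th, 60th and 80th percentiles.
        interior = np.quantile(history, np.linspace(0, 1, n_bins + 1)[1:-1])
        # Open-ended outermost bins catch values outside the historical range.
        return np.concatenate(([-np.inf], interior, [np.inf]))

    # Hypothetical example: one year of weekly BitCoin closing values.
    rng = np.random.default_rng(0)
    weekly_closes = rng.lognormal(mean=9.0, sigma=0.3, size=52)
    print(equiprobable_bins(weekly_closes))  # 6 edges -> 5 equally likely bins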

Two measures were used to assess the forecasts:

  • The Brier score (Yaniv et al., 1991). In order to simplify the analysis, this was normalised between 0 (exact forecast) and 1 (maximally wrong forecast). It is important to remember that this was not a binary (yes/no) forecast task, but one involving 5 options (represented by the 5 BitCoin bins/intervals). In this case, a maximally wrong forecast is one which assigns full confidence (probability = 1) to a bin different from the correct one[1].
  • Forecast Confidence. In a binary forecast, the higher probability assigned to an outcome, max(p, 1-p), can be used as a measure of confidence. This is not the case in non-binary forecast problems, where more than 2 options are available. Here, we have used the entropy of the forecast as a measure of Confidence. To simplify the joint analysis of the Brier score and Confidence, Confidence was normalised between 0 (full confidence, all bets on a single bin) and 1 (fully uncommitted, bets spread equally among all options). A computational sketch of both measures follows this list.
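
The sketch below is a minimal reconstruction of the two measures from the definitions above, assuming the standard multi-category Brier score (maximum 2, hence division by 2) and the Shannon entropy normalised by the log of the number of bins; the function names and the example forecast are illustrative only.

    import numpy as np

    def normalised_brier(probs, outcome_bin):
        """Multi-category Brier score rescaled to [0, 1]:
        0 = exact forecast; 1 = maximally wrong, i.e. probability 1
        on a single incorrect bin (raw multi-category score of 2)."""
        p = np.asarray(probs, dtype=float)
        outcome = np.zeros_like(p)
        outcome[outcome_bin] = 1.0
        return np.sum((p - outcome) ** 2) / 2.0

    def normalised_confidence(probs):
        """Shannon entropy of the forecast divided by log(K):
        0 = full confidence (all bets on a single bin);
        1 = fully uncommitted (bets spread equally among all bins)."""
        p = np.asarray(probs, dtype=float)
        nonzero = p[p > 0]                      # treat 0 * log(0) as 0
        entropy = -np.sum(nonzero * np.log(nonzero))
        return entropy / np.log(len(p))

    # Example: a forecast over the 5 BitCoin bins, with the outcome in bin 2.
    forecast = [0.1, 0.2, 0.4, 0.2, 0.1]
    print(normalised_brier(forecast, outcome_bin=2))  # 0.23
    print(normalised_confidence(forecast))            # ~0.91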

The results from this exercise are summarised in this report. Overall, the team displayed little learning, because:

  • The task was very difficult;
  • 12 weeks is a very short time to allow for significant learning;
  • Participants dedicated only around 30-60 minutes per week, on average, to the task (see below);
  • Participants were instructed to use only the time series provided in the forecaster platform to carry out the forecast. This was done to prevent forecasts from being made very close to the sampling event. However, it effectively prevented participants from seeking additional information, as is expected in an SF exercise. The platform can be modified to circumvent this issue, should it be used for future applications;
  • No communication occurred between participants. This prevented not only teaming (as understood in the SF literature) but also the sharing of, and feedback on, the mental models used for the forecasts.

4.2  Feedback on BitCoin, GJP and HFC exercises

Feedback on the exercises was provided by a number of participants and is summarised below:

  • Most people dedicated around 30-60 minutes per week to the task. Most felt this was not sufficient to digest all the relevant material needed to make a meaningful forecast.
  • Most people felt humbled by the task.
  • Most people believe this exercise is worth pursuing. Learning is cited as the main potential benefit of continuing the exercise.
  • Most people believe that, since a considerable time investment is needed to carry out the exercise properly, the forecasts should target topics closer to O&A business.
  • Most people believe that the SF platform Chris has developed could be used for future exercises, and that two main improvements may be necessary:
    1. The provision of feedback on past forecasts
    2. The provision of the opportunity to post notes on the forecasts, so that discussions can occur.

5       References

Mellers, B., Ungar, L., Baron, J., Ramos, J., Gurcay, B., Fincher, K., Scott, S.E., Moore, D., Atanasov, P., Swift, S.A., 2014. Psychological strategies for winning a geopolitical forecasting tournament. Psychological Science 25, 1106-1115.

Yaniv, I., Yates, J.F., Smith, J.K., 1991. Measures of discrimination skill in probabilistic judgment. Psychological Bulletin 110, 611.