futuresearch evals

Want to run on our benchmarks? Please contact us at evals@futuresearch.ai.

How our forecasting and research agents perform on our own benchmarks, on live public leaderboards, and on real markets.

Bench to the Future 3 (BTF-3): latest pastcasting benchmark, with binary and numeric forecasting questions
Bench to the Future 2 (BTF-2): "past-casting" forecasts measured by accuracy
Deep Research Bench (DRB): hard open-web tasks with curated answers
Live forecasting tournaments: our live standings on Metaculus and ForecastBench
Kalshi, Polymarket, S&P 500: our forecasts live on real prediction and stock markets

Bench to the Future 3 (BTF-3)

BTF-3 is the third edition of our pastcasting benchmark: 1,907 resolved forecasting questions — 1,515 binary and 392 numeric — researched and forecast against a frozen web corpus. Paper and dataset to follow.

BTF-3 Leaderboard

Evaluated: June–July 2026

All scores are on the Brier scale; lower is better, and the best score in each column is bolded.

Rank	Agent	Pooled score (n=1,907)	Binary (Brier, n=1,515)	Numeric (RPS, n=392)
1	FutureSearch SOTA*	0.122 [0.115–0.129]	0.120 [0.110–0.130]	0.124 [0.114–0.135]
2	Claude Opus 4.8 (xhigh)	0.130 [0.123–0.138]	0.131 [0.121–0.142]	0.129 [0.119–0.139]
3	Claude Fable 5 (high)	0.131 [0.123–0.138]	0.132 [0.122–0.143]	0.129 [0.119–0.139]
4	GPT-5.5 (high, agent SDK)‡	0.134 [0.128–0.140]	0.142 [0.134–0.150]	0.124 [0.114–0.135]
5	GPT-5.6 Sol (high)	0.135 [0.128–0.143]	0.141 [0.132–0.150]	0.129 [0.117–0.141]
6	Claude Opus 4.8 (high, agent SDK)‡	0.137 [0.129–0.145]	0.135 [0.124–0.146]	0.140 [0.130–0.151]
7	Claude Opus 4.8 (high)	0.140 [0.132–0.147]	0.135 [0.125–0.146]	0.145 [0.134–0.157]
8	GPT-5.5 (high)	0.143 [0.136–0.149]	0.148 [0.140–0.156]	0.136 [0.125–0.147]
9	Claude Sonnet 5 (xhigh)	0.154 [0.146–0.162]	0.154 [0.144–0.164]	0.154 [0.143–0.166]

Binary questions are scored by the Brier score (mean squared error of the forecast probability), numeric questions by a normalized ranked probability score (RPS), which generalizes the Brier score to distributional forecasts. The pooled score averages across all questions, counting each numeric question three times as much as a binary one (numeric forecasts are more informative per question).
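The scoring rules above can be sketched in a few lines. This is an illustration, not the benchmark's scoring code: the function names are hypothetical, and the RPS normalization shown here (an ordered set of bins, divided by the number of bins minus one) is an assumption, since the exact normalization is not specified in this section.

```python
from typing import Sequence


def brier(prob: float, outcome: int) -> float:
    """Squared error of a binary forecast; outcome is 0 or 1."""
    return (prob - outcome) ** 2


def normalized_rps(bin_probs: Sequence[float], true_bin: int) -> float:
    """Ranked probability score over ordered bins.

    Assumption: normalized by (number of bins - 1) so the score lies in [0, 1].
    """
    n_bins = len(bin_probs)
    cumulative = 0.0
    total = 0.0
    # Compare forecast and observed CDFs at every bin edge except the last,
    # where both CDFs equal 1 by construction.
    for i, p in enumerate(bin_probs[:-1]):
        cumulative += p
        observed_cdf = 1.0 if i >= true_bin else 0.0
        total += (cumulative - observed_cdf) ** 2
    return total / (n_bins - 1)


def pooled_score(
    binary_scores: Sequence[float],
    numeric_scores: Sequence[float],
    numeric_weight: float = 3.0,
) -> float:
    """Weighted mean over all questions; each numeric question counts 3x."""
    weighted_sum = sum(binary_scores) + numeric_weight * sum(numeric_scores)
    total_weight = len(binary_scores) + numeric_weight * len(numeric_scores)
    return weighted_sum / total_weight
```

With two bins the normalized RPS reduces to the Brier score. Feeding a leaderboard row's rounded column means into `pooled_score` approximately reproduces its pooled column: binary 0.131 on 1,515 questions and numeric 0.129 on 392 questions give about 0.130.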

Brackets are 95% confidence intervals, computed by percentile bootstrap (5,000 resamples of the question set).
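A minimal sketch of a percentile bootstrap over the question set, using only the standard library. It assumes one score per question and an unweighted mean; a pooled-score interval would need the same 3× numeric weighting inside each resample. The function name and the fixed seed are illustrative choices, not details from the benchmark.

```python
import random
from typing import Sequence, Tuple


def percentile_bootstrap_ci(
    scores: Sequence[float],
    n_resamples: int = 5000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile-bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping the sample size fixed.
        resample = rng.choices(scores, k=n)
        means.append(sum(resample) / n)
    means.sort()
    # Take symmetric percentiles of the bootstrap distribution.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = n_resamples - 1 - lo_idx
    return means[lo_idx], means[hi_idx]
```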

* FutureSearch SOTA synthesizes forecasts from multiple FutureSearch agent runs.
‡ Self-driving run via the model vendor's agent SDK (Claude Agent SDK / OpenAI Agents SDK) instead of our forecasting agent.

Missing questions: FutureSearch SOTA is missing 88 binary questions (n=1,427) and 9 numeric questions (n=383). Claude Fable 5 (high) is missing 29 binary questions (n=1,486) and 3 numeric questions (n=389). GPT-5.5 (high, agent SDK) is missing 50 binary questions (n=1,465) and one numeric question (n=391). Claude Opus 4.8 (high, agent SDK) is missing 8 binary questions (n=1,507).

CHAMPS KNOW strategic emphasis

Mean Borda score per dimension (rank 1 = 10 … rank 10 = 1); higher means the dimension is more prominent in the agent's rationales.
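The Borda scoring described above can be illustrated as follows. This sketch assumes each rationale yields a full ranking of the ten dimensions from most to least prominent; how those rankings are produced is not covered here, and the single-letter labels are placeholders rather than the actual dimension names.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def mean_borda(rankings: Sequence[List[str]], n_dimensions: int = 10) -> Dict[str, float]:
    """Average Borda points per dimension: rank 1 earns 10, rank 10 earns 1."""
    totals: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for position, dimension in enumerate(ranking, start=1):
            totals[dimension] += n_dimensions + 1 - position
    return {dim: total / len(rankings) for dim, total in totals.items()}


# Two hypothetical rationales that differ only in their top two dimensions.
example = mean_borda([list("CHAMPSKNOW"), list("HCAMPSKNOW")])
```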

Pairwise comparisons

Paired bootstrap on pooled scores (numeric weighted 3×)

Agent (columns use the same numbering)	1	2	3	4	5	6	7	8	9
1. FutureSearch SOTA	—	-.010***	-.010***	-.012***	-.015***	-.016***	-.019***	-.021***	-.033***
2. Claude Opus 4.8 (xhigh)	.010***	—	-.001	-.003	-.005	-.007***	-.009***	-.012***	-.024***
3. Claude Fable 5 (high)	.010***	.001	—	-.003	-.004	-.006*	-.009**	-.012***	-.023***
4. GPT-5.5 (high, agent SDK)	.012***	.003	.003	—	-.002	-.004	-.006*	-.009***	-.020***
5. GPT-5.6 Sol (high)	.015***	.005	.004	.002	—	-.002	-.004	-.007**	-.018***
6. Claude Opus 4.8 (high, agent SDK)	.016***	.007***	.006*	.004	.002	—	-.003	-.005	-.017***
7. Claude Opus 4.8 (high)	.019***	.009***	.009**	.006*	.004	.003	—	-.003	-.014***
8. GPT-5.5 (high)	.021***	.012***	.012***	.009***	.007**	.005	.003	—	-.011***
9. Claude Sonnet 5 (xhigh)	.033***	.024***	.023***	.020***	.018***	.017***	.014***	.011***	—

Each cell is the difference in pooled score (row − column) on the questions both agents forecast; negative means the row agent is more accurate. Asterisks mark statistically significant differences (two-sided paired bootstrap: * p<.05, ** p<.01, *** p<.001); cells without asterisks are not significant.
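One way to compute such a cell is sketched below: restrict both agents to their shared questions, resample those questions with replacement, and recompute the weighted score difference each time. The p-value construction (twice the smaller tail of the bootstrap distribution around zero) is an assumption, as are the function name and the per-question `weights` mapping, which would hold 3 for numeric questions and 1 for binary ones.

```python
import random
from typing import Dict, Optional, Tuple


def paired_bootstrap(
    row_scores: Dict[str, float],
    col_scores: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
    n_resamples: int = 5000,
    seed: int = 0,
) -> Tuple[float, float, int]:
    """Return (row - column difference, two-sided p-value, shared question count)."""
    shared = sorted(set(row_scores) & set(col_scores))
    if not shared:
        raise ValueError("the two agents share no questions")
    diffs = [row_scores[q] - col_scores[q] for q in shared]
    w = [1.0 if weights is None else weights.get(q, 1.0) for q in shared]

    def weighted_mean(indices) -> float:
        return sum(w[i] * diffs[i] for i in indices) / sum(w[i] for i in indices)

    observed = weighted_mean(range(len(shared)))
    rng = random.Random(seed)
    at_or_below = at_or_above = 0
    for _ in range(n_resamples):
        # Resample question indices so each pair of scores stays together.
        resample = rng.choices(range(len(shared)), k=len(shared))
        stat = weighted_mean(resample)
        at_or_below += stat <= 0
        at_or_above += stat >= 0
    p_value = min(1.0, 2 * min(at_or_below, at_or_above) / n_resamples)
    return observed, p_value, len(shared)
```

Pairing matters because both agents face the same questions: resampling questions rather than scores removes question difficulty from the comparison.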

Bench to the Future 2 (BTF-2)

BTF-2 evaluates agents on 1,417 hard forecasting questions. Agents research and forecast offline against a frozen 15M-document corpus. Rationales and reasoning traces are evaluated for strategic reasoning.

BTF-2 Leaderboard

Last updated: 2026-04-20

Agent	Brier (accuracy)	Calibration	Refinement
FutureSearch Agent	0.119	0.002	0.081
Opus 4.6 Agent	0.130	0.005	0.075
Gemini 3.1 Pro Agent	0.141	0.012	0.069
GPT-5.4 Agent	0.152	0.010	0.056
Grok 4.20 Beta Agent	0.165	0.003	0.039

Brier scores on 1,417 pastcasting questions (lower is better). The FutureSearch Agent is an ensemble significantly more accurate than any single frontier agent. Radar chart shows CHAMPS KNOW strategic emphasis (Borda scores, 8 of 10 dimensions).
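The Calibration and Refinement columns are not defined in this section. One common reading, offered here purely as an assumption, is the Murphy decomposition of the Brier score: Brier = reliability − resolution + uncertainty, with Calibration as the reliability term (lower is better) and Refinement as the resolution term (higher is better). The sketch below groups forecasts into ten equal-width bins; the identity is exact only when all forecasts in a bin are equal.

```python
from collections import defaultdict
from typing import Sequence, Tuple


def murphy_decomposition(
    forecasts: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> Tuple[float, float, float]:
    """Return (reliability, resolution, uncertainty) for binary forecasts."""
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    bins = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, o))
    reliability = 0.0
    resolution = 0.0
    for members in bins.values():
        k = len(members)
        mean_forecast = sum(p for p, _ in members) / k
        observed_freq = sum(o for _, o in members) / k
        reliability += k * (mean_forecast - observed_freq) ** 2
        resolution += k * (observed_freq - base_rate) ** 2
    return reliability / n, resolution / n, base_rate * (1 - base_rate)
```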

Papers

Evaluating Strategic Reasoning in Forecasting Agents (Apr 2026)
Automating Forecasting Question Generation and Resolution for AI Evaluation (Jan 2026)
Bench to the Future: A Pastcasting Benchmark for Forecasting Agents (Jun 2025)

Datasets

BTF-2 Questions and Forecasts (Hugging Face)

Deep Research Bench (DRB)

DRB benchmarks how well LLM agents do research on the web. Each of its diverse, real-world tasks provides 10-100k webpages stored offline for search and reasoning, accompanied by carefully curated answers.

DRB Leaderboard


Agent	Score	Cost ($)	Runtime (s)

Scores averaged first per task category (radar chart), then across all tasks (table). Runtime is estimated from ReAct steps, not wall-clock time.

Papers

Deep Research Bench: Evaluating AI Web Research Agents (May 2025)
Towards a Realistic Long-Term Benchmark for Open-Web Research Agents (Sep 2024)


Metaculus AI Forecasting Tournaments

Metaculus runs live bot tournaments where forecasting agents predict real, unresolved questions and are scored against the field by spot peer score. Our standing in the tournaments we take part in, refreshed at each deploy:

Tournament	Our standing	Leader
Summer 2026 FutureEval Bot Tournament (live)	#1 of 163	FutureSearch (1079.70)
MiniBench - 2026-06-29	#1 of 134	FutureSearch (732.76)
MiniBench - 2026-06-15	#1 of 118	FutureSearch (1267.95)
MiniBench - 2026-06-01	#2 of 113	laertes (740.66)
MiniBench - 2026-05-18	#3 of 114	Preseen-Chestnut (1142.57)
MiniBench - 2026-05-04	#7 of 119	mmBot (1766.40)

Standings are pulled from the Metaculus API at deploy time. Bots are scored by spot peer score (a per-question comparison against every other forecaster on the same question); higher is better. The leader column shows the top-ranked bot and its score for context. MiniBench tournaments run on a rolling two-week cadence.
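The per-question comparison can be illustrated with a log-score-based peer score for a binary question. The formula below (100 times the mean log ratio between our probability for the realized outcome and each other forecaster's) follows the general shape of Metaculus's published peer score, but the exact constants and the handling of the "spot" timing are assumptions here, and the function name is hypothetical.

```python
import math
from typing import Sequence


def spot_peer_score(my_prob: float, other_probs: Sequence[float]) -> float:
    """Peer score on one binary question.

    All probabilities are those assigned to the outcome that actually
    occurred, taken at the single scoring time.
    """
    if not other_probs:
        return 0.0
    log_ratios = [math.log(my_prob / p) for p in other_probs]
    return 100 * sum(log_ratios) / len(log_ratios)
```

Under this form, matching every other forecaster scores zero, and a tournament total would be the sum of per-question scores.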

ForecastBench

ForecastBench is a dynamic, contamination-free benchmark of AI forecasting accuracy run by the Forecasting Research Institute. Bots forecast hundreds of unresolved real-world questions, scored on a Brier Index (0–100, higher is better). FutureSearch's forecasting agent currently ranks #14 of 273 submitted models, at a Brier Index of 63.7.

Preliminary leaderboard

Updated: 2026-07-17

Brier Index; higher is better. Showing the top 15 of 273 models.

Rank	Model	Brier Index (95% CI)	N
1	Torchcast AI: rice-demon	65.4 [63.9–67.0]	477
2	Torchcast AI: captain-jack	65.1 [63.6–66.8]	477
3	Torchcast AI: dragon-brother	65.0 [63.6–66.5]	477
4	Voicetree: voicetree-axiom-2	64.9 [63.8–66.1]	477
5	Voicetree: voicetree-axiom-0	64.6 [63.4–65.7]	477
6	Google DeepMind: blue croc	64.5 [63.0–66.2]	477
7	Google DeepMind: big green leaf / fire hedgehog / silver-anchor	64.4 [62.9–66.1]	477
10	Google DeepMind: green tree	64.2 [63.2–65.3]	724
10	Voicetree: voicetree-axiom-1	64.2 [63.0–65.3]	477
12	Torchcast AI: carb-bomb-nano	64.0 [62.0–66.0]	243
13	Torchcast AI: wyrm-warlord-nano	63.8 [61.8–65.8]	243
14	Superforecaster median forecast: ForecastBench	63.7 [62.5–65.0]	521
14	FutureSearch: fb_early_closer_v2 / fb_forecaster_v2	63.7 [62.5–65.0]	477
17	Google DeepMind: ceramic-kettle / iron-compass	63.5 [61.8–65.3]	477
17	Torchcast AI: skipper-morgan-nano	63.5 [61.5–65.5]	243

The preliminary leaderboard ranks models on questions from the current dataset that have already resolved; it is regenerated nightly from the public dataset repository. The Brier Index rescales the mean Brier score to a 0–100 scale (100 = perfect, 50 = uninformed) and adjusts for question difficulty. Brackets are 95% confidence intervals; N is the number of resolved questions scored.
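As an illustration of the rescaling, the mapping below sends a mean Brier score of 0 to 100 and the uninformed score of 0.25 (always forecasting 50%) to 50. The square-root form is an assumption: it matches both anchor points stated above, but this section does not give the formula, and the difficulty adjustment is omitted entirely.

```python
import math


def brier_index(mean_brier: float) -> float:
    """Rescale a mean Brier score to 0-100 (assumed form: 100 * (1 - sqrt(Brier)))."""
    if not 0.0 <= mean_brier <= 1.0:
        raise ValueError("mean Brier score must lie in [0, 1]")
    return 100 * (1 - math.sqrt(mean_brier))
```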

RetroSearch

DRB and BTF-2 use RetroSearch, a system designed to serve agents a frozen, previously scraped version of the internet instead of the live pages, allowing reproducible runs even as the internet changes, and enabling forecasting tasks to be run as "pastcasting".

RetroSearch aims to emulate Google search (specifically, the Serper search API) as closely as possible, to minimize differences between live and "retro" agent runs. A single RetroSearch query proceeds as follows:

1. Run a live Serper search for the query
2. Look up the pages returned by the live search in the RetroSearch database and other archive sources
3. If a page is not found in the RetroSearch database, remove it from the results
4. Write new snippets from a sample of page content using a simple LLM
5. Return the results in the original format of the Google results
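The steps above can be sketched as a small pipeline. Everything named here is hypothetical: `live_search`, `lookup_frozen`, and `write_snippet` stand in for the Serper call, the frozen-page lookup, and the LLM snippet writer, and are passed in as plain callables so the sketch makes no claim about the real RetroSearch interfaces. The 2,000-character sample size is likewise an arbitrary illustrative choice.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional


@dataclass(frozen=True)
class SearchResult:
    url: str
    title: str
    snippet: str


def retro_search(
    query: str,
    live_search: Callable[[str], List[SearchResult]],
    lookup_frozen: Callable[[str], Optional[str]],
    write_snippet: Callable[[str, str], str],
    sample_chars: int = 2000,
) -> List[SearchResult]:
    """Answer a query using only pages that have a frozen copy."""
    results = []
    for hit in live_search(query):            # step 1: live search
        page_text = lookup_frozen(hit.url)    # step 2: look up the frozen page
        if page_text is None:                 # step 3: drop pages with no copy
            continue
        sample = page_text[:sample_chars]
        # Step 4: rewrite the snippet from frozen content, not the live page.
        results.append(replace(hit, snippet=write_snippet(query, sample)))
    return results                            # step 5: same shape as live results
```

Because live ranking drives the result order while content comes only from the frozen store, an agent sees a familiar result list but cannot read anything newer than the snapshot.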

This approach ensures a search experience for agents that is consistent with real search, but backed exclusively by pages we have a frozen candidate for. The following diagram from the paper illustrates the process:

Figure: System architecture of Deep Research Bench using RetroSearch. It shows the flow from task definition through the scraping pipeline that populates the RetroSearch database before the benchmark runs, and how agents then use RetroSearch via an API at task evaluation time.