Evaluation Results Leaderboard
This is the leaderboard for Deep Research Bench, FutureSearch's benchmark for deep research agents. It is updated automatically as we add new tasks and improve existing ones. See the Deep Research Bench paper for details.
Model | Architecture | Average Score |
---|---|---|
Gemini 2.5 Pro | Implicit ReAct | 0.46 |
o3 | Implicit ReAct | 0.46 |
Claude 3.7 Sonnet Thinking | Implicit ReAct | 0.44 |
GPT-4.1 | Explicit ReAct | 0.40 |
Claude 3.7 Sonnet Non-thinking | Explicit ReAct | 0.39 |
Gemini 2.5 Flash Thinking | Implicit ReAct | 0.36 |
Gemini 2.5 Flash Non-thinking | Explicit ReAct | 0.34 |
DeepSeek-R1 | Implicit ReAct | 0.30 |
Mistral Small | Explicit ReAct | 0.27 |
Gemma 3 | Explicit ReAct | 0.24 |
GPT-4 Turbo | Explicit ReAct | 0.23 |
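
To compare the two architecture labels at a glance, the rows above can be grouped and averaged. The following is a minimal sketch, not part of the benchmark itself: the names `LEADERBOARD` and `mean_score_by_architecture` are hypothetical, and the values are copied by hand from the table, so they will go stale when the leaderboard updates.

```python
from collections import defaultdict

# Rows copied by hand from the leaderboard table: (model, architecture, average score).
LEADERBOARD = [
    ("Gemini 2.5 Pro", "Implicit ReAct", 0.46),
    ("o3", "Implicit ReAct", 0.46),
    ("Claude 3.7 Sonnet Thinking", "Implicit ReAct", 0.44),
    ("GPT-4.1", "Explicit ReAct", 0.40),
    ("Claude 3.7 Sonnet Non-thinking", "Explicit ReAct", 0.39),
    ("Gemini 2.5 Flash Thinking", "Implicit ReAct", 0.36),
    ("Gemini 2.5 Flash Non-thinking", "Explicit ReAct", 0.34),
    ("DeepSeek-R1", "Implicit ReAct", 0.30),
    ("Mistral Small", "Explicit ReAct", 0.27),
    ("Gemma 3", "Explicit ReAct", 0.24),
    ("GPT-4 Turbo", "Explicit ReAct", 0.23),
]


def mean_score_by_architecture(rows):
    """Return {architecture: mean of the listed average scores for that group}."""
    grouped = defaultdict(list)
    for _model, architecture, score in rows:
        grouped[architecture].append(score)
    return {arch: sum(scores) / len(scores) for arch, scores in grouped.items()}


if __name__ == "__main__":
    means = mean_score_by_architecture(LEADERBOARD)
    for arch in sorted(means):
        print(f"{arch}: {means[arch]:.3f}")
```

On the figures listed here, the Implicit ReAct rows average about 0.40 and the Explicit ReAct rows about 0.31. The table does not vary architecture and model independently (each model appears with one architecture only), so this is a descriptive summary of the rows, not evidence about the effect of either architecture.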