Evaluation Results Leaderboard

This is the leaderboard for Deep Research Bench, FutureSearch's benchmark for deep research agents. It's automatically kept up to date as we add more tasks and improve existing ones. See the Deep Research Bench Paper for more!

Model                           Architecture    Average Score
Gemini 2.5 Pro                  Implicit ReAct  0.46
o3                              Implicit ReAct  0.46
Claude 3.7 Sonnet Thinking      Implicit ReAct  0.44
GPT-4.1                         Explicit ReAct  0.40
Claude 3.7 Sonnet Non-thinking  Explicit ReAct  0.39
Gemini 2.5 Flash Thinking       Implicit ReAct  0.36
Gemini 2.5 Flash Non-thinking   Explicit ReAct  0.34
DeepSeek-R1                     Implicit ReAct  0.30
Mistral Small                   Explicit ReAct  0.27
Gemma 3                         Explicit ReAct  0.24
GPT-4 Turbo                     Explicit ReAct  0.23