
Benchmarks

This page reports side-by-side runs where baseline and Raysurfer modes are evaluated on the same task sets with the same budgets. Compared modes:
  • claude-agent-sdk baseline
  • Raysurfer reuse mode
The benchmark goal is straightforward: on similar tasks, Raysurfer should finish more work with fewer LLM interaction calls and less elapsed time.

Headline Results (February 20, 2026)

Public one-shot implementation tasks (20 tasks)
  • Consistency: baseline 5.0% (1/20), Raysurfer 100.0% (20/20)
  • Interaction calls: baseline 81 total (4.05/attempt), Raysurfer 0 total (0.00/attempt)
  • Total time: baseline 860.9s, Raysurfer 14.7s

Existing benchmark tasks, 10 HumanEval + 10 MBPP (20 tasks)
  • Consistency: baseline 0.0% (0/20), Raysurfer 100.0% (20/20)
  • Interaction calls: baseline 68 total (3.40/attempt), Raysurfer 20 total (1.00/attempt)
  • Total time: baseline 291.3s, Raysurfer 0.436s
These are the runs to share when the goal is to demonstrate more consistent, faster, and cheaper agent execution.

Why This Shows Cheaper Agents

  1. Fewer interaction calls means less model/tool loop work per attempt.
  2. Higher consistency means fewer retries and fewer failed runs.
  3. Lower elapsed time means lower end-to-end latency for the same task set.
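To see how these effects compound, here is a back-of-the-envelope calculation using the public one-shot numbers from the table above. Only the call counts and elapsed times come from this page; COST_PER_CALL is a hypothetical placeholder, not a measured price.

# cost_sketch.py -- rough savings estimate from the headline numbers above.
COST_PER_CALL = 0.01  # hypothetical dollars per interaction call, not a measured value

baseline_calls, raysurfer_calls = 81, 0    # public one-shot run set, 20 attempts each
baseline_secs, raysurfer_secs = 860.9, 14.7

print(f"calls saved: {baseline_calls - raysurfer_calls} "
      f"({baseline_calls / 20:.2f} -> {raysurfer_calls / 20:.2f} per attempt)")
print(f"time saved: {baseline_secs - raysurfer_secs:.1f}s "
      f"(~{baseline_secs / raysurfer_secs:.0f}x faster)")
print(f"illustrative cost delta: ${(baseline_calls - raysurfer_calls) * COST_PER_CALL:.2f}")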

Methodology

  1. Use the same task list for baseline and Raysurfer runs.
  2. Keep model, turn limits, and timeout budgets fixed between modes.
  3. Seed Raysurfer with verified snippets before the Raysurfer run.
  4. Record per-attempt completion, elapsed seconds, and the interaction-call metric.
  5. Score consistency as completed_within_180_seconds / total_attempts (sketched below).
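A minimal sketch of the step-5 score, assuming each attempt record carries completed and elapsed_seconds fields (assumed names, not the scripts' actual schema):

# consistency_sketch.py -- consistency as completed_within_budget / total_attempts.
def consistency(attempts, budget_seconds=180.0):
    """Fraction of attempts that completed within the time budget."""
    ok = sum(1 for a in attempts
             if a["completed"] and a["elapsed_seconds"] <= budget_seconds)
    return ok / len(attempts)

attempts = [
    {"completed": True, "elapsed_seconds": 12.3},
    {"completed": True, "elapsed_seconds": 240.0},  # over budget: not counted
    {"completed": False, "elapsed_seconds": 95.0},  # failed: not counted
]
print(f"{consistency(attempts):.1%}")  # 33.3%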

Interaction-Call Metric

  • In examples/raysurfer-public-oneshot-eval, calls are read from the tools= field in each run's details (run_agent_eval.py).
  • In examples/raysurfer-existing-benchmarks-eval, calls are read from the metric= field in each run's details (run_benchmark_eval.py): in baseline mode this counts Claude tool-loop calls, and in Raysurfer mode it counts retrieved candidates evaluated.
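As a rough illustration of how the per-run counts roll up into the totals reported above, the sketch below assumes each run file is a JSON list of attempt records with the call count stored under details.tools; the actual output schema of the eval scripts may differ.

# calls_sketch.py -- aggregate interaction calls across a run file.
# Assumes runs/baseline.json is a JSON list of attempt records shaped like
# {"details": {"tools": <int>}, ...}; the real schema may differ.
import json

with open("runs/baseline.json") as f:
    attempts = json.load(f)

total = sum(a["details"]["tools"] for a in attempts)
print(f"{total} total ({total / len(attempts):.2f}/attempt)")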

Re-run Commands

Public one-shot benchmark (shareable showcase)

cd examples/raysurfer-public-oneshot-eval
PYTHONPATH=../../raysurfer-python/src uv run python scripts/seed_verified_snippets.py --tasks tasks/tasks.json
uv run python scripts/run_agent_eval.py --tasks tasks/tasks.json --mode baseline --out runs/baseline.json --model haiku --max-turns 4 --timeout-seconds 180
RAYSURFER_BASE_URL=http://127.0.0.1:8000 RAYSURFER_API_KEY=your_key uv run python scripts/run_agent_eval.py --tasks tasks/tasks.json --mode raysurfer --out runs/with_raysurfer.json --model haiku --max-turns 4 --timeout-seconds 180
uv run python scripts/score_eval.py --tasks tasks/tasks.json --baseline-runs runs/baseline.json --raysurfer-runs runs/with_raysurfer.json --json-out runs/summary.json
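Once score_eval.py has written runs/summary.json, the scored comparison can be inspected directly; the snippet below assumes nothing about the file beyond it being JSON.

# inspect_summary.py -- pretty-print the scored comparison.
import json
import pprint

with open("runs/summary.json") as f:
    pprint.pprint(json.load(f))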

Existing benchmarks, HumanEval + MBPP (shareable showcase)

cd examples/raysurfer-existing-benchmarks-eval
uv run python scripts/build_tasks.py --out tasks/existing_benchmarks_20.json --humaneval-limit 10 --mbpp-limit 10
PYTHONPATH=../../raysurfer-python/src uv run python scripts/seed_reference_solutions.py --tasks tasks/existing_benchmarks_20.json
uv run python scripts/run_benchmark_eval.py --tasks tasks/existing_benchmarks_20.json --mode baseline --out runs/baseline.json --model haiku --max-turns 4 --timeout-seconds 180 --validation-timeout-seconds 20 --raysurfer-source reference
uv run python scripts/run_benchmark_eval.py --tasks tasks/existing_benchmarks_20.json --mode raysurfer --out runs/with_raysurfer.json --model haiku --max-turns 4 --timeout-seconds 180 --validation-timeout-seconds 20 --raysurfer-source reference
uv run python scripts/score_eval.py --tasks tasks/existing_benchmarks_20.json --baseline-runs runs/baseline.json --raysurfer-runs runs/with_raysurfer.json --json-out runs/summary.json

Artifacts

  • examples/raysurfer-public-oneshot-eval/runs/baseline.json
  • examples/raysurfer-public-oneshot-eval/runs/with_raysurfer.json
  • examples/raysurfer-public-oneshot-eval/runs/summary.json
  • examples/raysurfer-existing-benchmarks-eval/runs/baseline.json
  • examples/raysurfer-existing-benchmarks-eval/runs/with_raysurfer.json
  • examples/raysurfer-existing-benchmarks-eval/runs/summary.json