
Benchmarks

This page reports side-by-side runs where baseline and Raysurfer modes are evaluated on the same task sets with the same budgets. Compared modes:
  • claude-agent-sdk baseline
  • Raysurfer reuse mode
The benchmark goal is straightforward: on similar tasks, Raysurfer should finish more work with fewer LLM interaction calls and less elapsed time.

Headline Results (February 20, 2026)

Public one-shot implementation tasks (20 tasks)
  • Consistency: baseline 5.0% (1/20), Raysurfer 100.0% (20/20)
  • Interaction calls: baseline 81 total (4.05/attempt), Raysurfer 0 total (0.00/attempt)
  • Total time: baseline 860.9s, Raysurfer 14.7s

Existing benchmark tasks, 10 HumanEval + 10 MBPP (20 tasks)
  • Consistency: baseline 0.0% (0/20), Raysurfer 100.0% (20/20)
  • Interaction calls: baseline 68 total (3.40/attempt), Raysurfer 20 total (1.00/attempt)
  • Total time: baseline 291.3s, Raysurfer 0.436s
These are the runs to share when the goal is to demonstrate more consistent, faster, and cheaper agent execution.

Why This Shows Cheaper Agents

  1. Fewer interaction calls means less model/tool loop work per attempt.
  2. Higher consistency means fewer retries and fewer failed runs.
  3. Lower elapsed time means lower end-to-end latency for the same task set.
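To see how these effects compound, here is a back-of-the-envelope calculation using the public one-shot numbers from the table above. Only the call counts and elapsed times come from this page; COST_PER_CALL is a hypothetical placeholder, not a measured price.

# cost_sketch.py -- rough savings estimate from the headline numbers above.
COST_PER_CALL = 0.01  # hypothetical dollars per interaction call, not a measured value

baseline_calls, raysurfer_calls = 81, 0    # public one-shot run set, 20 attempts each
baseline_secs, raysurfer_secs = 860.9, 14.7

print(f"calls saved: {baseline_calls - raysurfer_calls} "
      f"({baseline_calls / 20:.2f} -> {raysurfer_calls / 20:.2f} per attempt)")
print(f"time saved: {baseline_secs - raysurfer_secs:.1f}s "
      f"(~{baseline_secs / raysurfer_secs:.0f}x faster)")
print(f"illustrative cost delta: ${(baseline_calls - raysurfer_calls) * COST_PER_CALL:.2f}")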

Methodology

  1. Use the same task list for baseline and Raysurfer runs.
  2. Keep model, turn limits, and timeout budgets fixed between modes.
  3. Seed Raysurfer with verified snippets before the Raysurfer run.
  4. Record per-attempt completion, elapsed seconds, and the interaction-call metric.
  5. Score consistency as completed_within_180_seconds / total_attempts (sketched below).
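A minimal sketch of the step-5 score, assuming each attempt record carries completed and elapsed_seconds fields (assumed names, not the scripts' actual schema):

# consistency_sketch.py -- consistency as completed_within_budget / total_attempts.
def consistency(attempts, budget_seconds=180.0):
    """Fraction of attempts that completed within the time budget."""
    ok = sum(1 for a in attempts
             if a["completed"] and a["elapsed_seconds"] <= budget_seconds)
    return ok / len(attempts)

attempts = [
    {"completed": True, "elapsed_seconds": 12.3},
    {"completed": True, "elapsed_seconds": 240.0},  # over budget: not counted
    {"completed": False, "elapsed_seconds": 95.0},  # failed: not counted
]
print(f"{consistency(attempts):.1%}")  # 33.3%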

Interaction-Call Metric

  • In examples/raysurfer-public-oneshot-eval, calls are read from the tools= field in each run's details (run_agent_eval.py).
  • In examples/raysurfer-existing-benchmarks-eval, calls are read from the metric= field in each run's details (run_benchmark_eval.py): in baseline mode this counts Claude tool-loop calls, and in Raysurfer mode it counts retrieved candidates evaluated.
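As a rough illustration of how the per-run counts roll up into the totals reported above, the sketch below assumes each run file is a JSON list of attempt records with the call count stored under details.tools; the actual output schema of the eval scripts may differ.

# calls_sketch.py -- aggregate interaction calls across a run file.
# Assumes runs/baseline.json is a JSON list of attempt records shaped like
# {"details": {"tools": <int>}, ...}; the real schema may differ.
import json

with open("runs/baseline.json") as f:
    attempts = json.load(f)

total = sum(a["details"]["tools"] for a in attempts)
print(f"{total} total ({total / len(attempts):.2f}/attempt)")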

Re-run Commands

Public one-shot benchmark (shareable showcase)

cd examples/raysurfer-public-oneshot-eval
PYTHONPATH=../../raysurfer-python/src uv run python scripts/seed_verified_snippets.py --tasks tasks/tasks.json
uv run python scripts/run_agent_eval.py --tasks tasks/tasks.json --mode baseline --out runs/baseline.json --model haiku --max-turns 4 --timeout-seconds 180
RAYSURFER_BASE_URL=http://127.0.0.1:8000 RAYSURFER_API_KEY=your_key uv run python scripts/run_agent_eval.py --tasks tasks/tasks.json --mode raysurfer --out runs/with_raysurfer.json --model haiku --max-turns 4 --timeout-seconds 180
uv run python scripts/score_eval.py --tasks tasks/tasks.json --baseline-runs runs/baseline.json --raysurfer-runs runs/with_raysurfer.json --json-out runs/summary.json
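Once score_eval.py has written runs/summary.json, the scored comparison can be inspected directly; the snippet below assumes nothing about the file beyond it being JSON.

# inspect_summary.py -- pretty-print the scored comparison.
import json
import pprint

with open("runs/summary.json") as f:
    pprint.pprint(json.load(f))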

Existing benchmarks, HumanEval + MBPP (shareable showcase)

cd examples/raysurfer-existing-benchmarks-eval
uv run python scripts/build_tasks.py --out tasks/existing_benchmarks_20.json --humaneval-limit 10 --mbpp-limit 10
PYTHONPATH=../../raysurfer-python/src uv run python scripts/seed_reference_solutions.py --tasks tasks/existing_benchmarks_20.json
uv run python scripts/run_benchmark_eval.py --tasks tasks/existing_benchmarks_20.json --mode baseline --out runs/baseline.json --model haiku --max-turns 4 --timeout-seconds 180 --validation-timeout-seconds 20 --raysurfer-source reference
uv run python scripts/run_benchmark_eval.py --tasks tasks/existing_benchmarks_20.json --mode raysurfer --out runs/with_raysurfer.json --model haiku --max-turns 4 --timeout-seconds 180 --validation-timeout-seconds 20 --raysurfer-source reference
uv run python scripts/score_eval.py --tasks tasks/existing_benchmarks_20.json --baseline-runs runs/baseline.json --raysurfer-runs runs/with_raysurfer.json --json-out runs/summary.json

Artifacts

  • examples/raysurfer-public-oneshot-eval/runs/baseline.json
  • examples/raysurfer-public-oneshot-eval/runs/with_raysurfer.json
  • examples/raysurfer-public-oneshot-eval/runs/summary.json
  • examples/raysurfer-existing-benchmarks-eval/runs/baseline.json
  • examples/raysurfer-existing-benchmarks-eval/runs/with_raysurfer.json
  • examples/raysurfer-existing-benchmarks-eval/runs/summary.json