Benchmarks
This page reports side-by-side runs where baseline and Raysurfer modes are evaluated on the same task sets with the same budgets. Compared modes:
- `claude-agent-sdk` baseline
- Raysurfer reuse mode
Headline Results (February 20, 2026)
| Run Set | Tasks | Baseline Consistency | Raysurfer Consistency | Baseline Interaction Calls | Raysurfer Interaction Calls | Baseline Total Time | Raysurfer Total Time |
|---|---|---|---|---|---|---|---|
| Public one-shot implementation tasks | 20 | 5.0% (1/20) | 100.0% (20/20) | 81 total (4.05/attempt) | 0 total (0.00/attempt) | 860.9s | 14.7s |
| Existing benchmark tasks (10 HumanEval + 10 MBPP) | 20 | 0.0% (0/20) | 100.0% (20/20) | 68 total (3.40/attempt) | 20 total (1.00/attempt) | 291.3s | 0.436s |
These are the benchmark runs to share when the goal is demonstrating more consistent, faster, and cheaper agent execution.
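As a quick sanity check, the per-attempt averages in the table follow directly from the totals. The sketch below recomputes them from the headline figures; the tuple layout is illustrative, not part of any run artifact.

```python
# Recompute the per-attempt interaction-call averages from the
# headline totals. Figures are copied from the table above.
runs = [
    # (total_calls, attempts, reported_per_attempt)
    (81, 20, 4.05),  # public one-shot, baseline
    (0, 20, 0.00),   # public one-shot, Raysurfer
    (68, 20, 3.40),  # existing benchmarks, baseline
    (20, 20, 1.00),  # existing benchmarks, Raysurfer
]

for total, attempts, reported in runs:
    per_attempt = total / attempts
    # Each table cell should match total / attempts exactly.
    assert abs(per_attempt - reported) < 1e-9, (total, attempts)

print("per-attempt averages check out")
```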
Why This Shows Cheaper Agents
- Fewer interaction calls means less model/tool loop work per attempt.
- Higher consistency means fewer retries and fewer failed runs.
- Lower elapsed time means lower end-to-end latency for the same task set.
Methodology
- Use the same task list for baseline and Raysurfer runs.
- Keep model, turn limits, and timeout budgets fixed between modes.
- Seed Raysurfer with verified snippets before the Raysurfer run.
- Record per-attempt completion, elapsed seconds, and interaction-call metric.
- Score consistency as `completed_within_180_seconds / total_attempts`.
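The consistency score above can be sketched as a small function. The attempt-record shape (`completed`, `elapsed_s`) is an assumption for illustration; real runs store per-attempt data in the `runs/*.json` artifacts.

```python
# Minimal sketch of the consistency score described in the methodology:
# the fraction of attempts that completed within the time budget.
# The attempt-record field names are illustrative assumptions.
def consistency(attempts, budget_s=180.0):
    """completed_within_budget / total_attempts."""
    completed = sum(
        1 for a in attempts
        if a["completed"] and a["elapsed_s"] <= budget_s
    )
    return completed / len(attempts)

attempts = [
    {"completed": True, "elapsed_s": 12.3},
    {"completed": True, "elapsed_s": 240.0},  # over budget: not counted
    {"completed": False, "elapsed_s": 95.0},  # failed: not counted
    {"completed": True, "elapsed_s": 44.1},
]
print(consistency(attempts))  # → 0.5
```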
Interaction-Call Metric
- In `examples/raysurfer-public-oneshot-eval`, calls come from `tools=` in run details (`run_agent_eval.py`).
- In `examples/raysurfer-existing-benchmarks-eval`, calls come from `metric=` in run details (`run_benchmark_eval.py`): baseline counts Claude tool-loop calls, Raysurfer counts retrieved candidates evaluated.
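Totalling the metric from run details might look like the sketch below. The `details` field and the `tools=` / `metric=` substrings follow the text above, but the exact record layout is an assumption.

```python
# Hypothetical sketch of summing the interaction-call metric from run
# records. The "details" field and the key=<n> convention follow the
# description above; the JSON layout itself is an assumption.
import re

def total_calls(run_records, key):
    """Sum `key=<n>` values (e.g. key="tools" or key="metric")
    found in each record's details string."""
    total = 0
    for rec in run_records:
        m = re.search(rf"{key}=(\d+)", rec.get("details", ""))
        if m:
            total += int(m.group(1))
    return total

records = [{"details": "ok tools=4"}, {"details": "ok tools=5"}]
print(total_calls(records, "tools"))  # → 9
```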
Re-run Commands
Public one-shot benchmark (shareable showcase)
Existing benchmarks (shareable showcase)
Artifacts
- examples/raysurfer-public-oneshot-eval/runs/baseline.json
- examples/raysurfer-public-oneshot-eval/runs/with_raysurfer.json
- examples/raysurfer-public-oneshot-eval/runs/summary.json
- examples/raysurfer-existing-benchmarks-eval/runs/baseline.json
- examples/raysurfer-existing-benchmarks-eval/runs/with_raysurfer.json
- examples/raysurfer-existing-benchmarks-eval/runs/summary.json
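Loading these artifacts for further analysis might look like the sketch below. The file paths come from the list above; the keys inside each summary are an assumption, so inspect the JSON before relying on any field.

```python
# Sketch of loading a run summary for analysis. Paths match the
# artifact list above; the JSON contents are not assumed here.
import json
from pathlib import Path

def load_summary(example_dir):
    """Read runs/summary.json from one of the example directories."""
    path = Path(example_dir) / "runs" / "summary.json"
    with path.open() as f:
        return json.load(f)

# e.g. load_summary("examples/raysurfer-public-oneshot-eval")
```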
