Real-World Results

Customer Results

Cartage: 70% to 95% Accuracy

Cartage integrated Raysurfer into their production agent workflow and saw accuracy improve from 70% to 95% on repetitive multi-step tasks. By retrieving proven code instead of regenerating from scratch each run, their agent produced consistent, correct results — even on complex tool chains that previously failed intermittently.

Benchmark Results

Side-by-side runs where baseline and Raysurfer modes are evaluated on the same task sets with the same budgets. Compared modes:
  • claude-agent-sdk baseline
  • Raysurfer reuse mode
On similar tasks, Raysurfer finishes more work with fewer LLM interaction calls and less elapsed time.

Headline Numbers (February 20, 2026)

Public one-shot implementation tasks (20 tasks)
  • Consistency: 5.0% (1/20) baseline vs 100.0% (20/20) Raysurfer
  • Interaction calls: 81 total (4.05/attempt) baseline vs 0 total (0.00/attempt) Raysurfer
  • Total time: 860.9s baseline vs 14.7s Raysurfer

Existing benchmark tasks, 10 HumanEval + 10 MBPP (20 tasks)
  • Consistency: 0.0% (0/20) baseline vs 100.0% (20/20) Raysurfer
  • Interaction calls: 68 total (3.40/attempt) baseline vs 20 total (1.00/attempt) Raysurfer
  • Total time: 291.3s baseline vs 0.436s Raysurfer

What This Means

  1. Consistent — 100% consistency on cached tasks vs 0-5% without caching
  2. Faster — seconds instead of minutes for the same workloads
  3. Cheaper — fewer interaction calls mean less model/tool-loop work per attempt
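The speed and cost claims follow directly from the headline numbers above (February 20, 2026 run):

```python
# Derived from the headline table: baseline vs Raysurfer elapsed time
# and interaction calls. These are the published numbers, recomputed.
oneshot_speedup = 860.9 / 14.7    # public one-shot tasks
bench_speedup = 291.3 / 0.436     # existing benchmark tasks
call_reduction = (81 - 0) / 81    # public one-shot interaction calls

print(f"{oneshot_speedup:.0f}x, {bench_speedup:.0f}x, {call_reduction:.0%}")
# → 59x, 668x, 100%
```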

Methodology

  1. Use the same task list for baseline and Raysurfer runs.
  2. Keep model, turn limits, and timeout budgets fixed between modes.
  3. Seed Raysurfer with verified snippets before the Raysurfer run.
  4. Record per-attempt completion, elapsed seconds, and the interaction-call metric.
  5. Score consistency as completed_within_180_seconds / total_attempts.
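The scoring rule in step 5 can be sketched as follows. The per-attempt record fields here are illustrative assumptions, not the eval scripts' actual schema:

```python
def consistency(runs, timeout_seconds=180.0):
    """Fraction of attempts that completed within the timeout.

    `runs` is a list of per-attempt records; the "completed" and
    "elapsed_seconds" field names are assumptions for illustration.
    """
    completed = sum(
        1 for r in runs
        if r.get("completed")
        and r.get("elapsed_seconds", float("inf")) <= timeout_seconds
    )
    return completed / len(runs) if runs else 0.0

runs = [{"completed": True, "elapsed_seconds": 42.0},
        {"completed": False, "elapsed_seconds": 180.0}]
print(f"{consistency(runs):.1%}")  # → 50.0%
```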

Interaction-Call Metric

  • In examples/raysurfer-public-oneshot-eval, calls come from tools= in run details (run_agent_eval.py).
  • In examples/raysurfer-existing-benchmarks-eval, calls come from metric= in run details (run_benchmark_eval.py): baseline uses Claude tool-loop calls, Raysurfer uses retrieved candidates evaluated.
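Whichever source the calls come from, they are aggregated the same way into the "N total (avg/attempt)" form used in the headline numbers. A minimal sketch, with a stand-in record layout rather than the scripts' real run-details format:

```python
def summarize_calls(runs):
    """Sum per-attempt interaction calls and format as in the results
    table. The "interaction_calls" field name is a hypothetical
    stand-in for the per-run call counts."""
    total = sum(r.get("interaction_calls", 0) for r in runs)
    per_attempt = total / len(runs) if runs else 0.0
    return f"{total} total ({per_attempt:.2f}/attempt)"

runs = [{"interaction_calls": 4}, {"interaction_calls": 5}, {"interaction_calls": 3}]
print(summarize_calls(runs))  # → 12 total (4.00/attempt)
```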

Re-run Commands

Public one-shot benchmark

cd examples/raysurfer-public-oneshot-eval
PYTHONPATH=../../raysurfer-python/src uv run python scripts/seed_verified_snippets.py --tasks tasks/tasks.json
uv run python scripts/run_agent_eval.py --tasks tasks/tasks.json --mode baseline --out runs/baseline.json --model haiku --max-turns 4 --timeout-seconds 180
RAYSURFER_BASE_URL=http://127.0.0.1:8000 RAYSURFER_API_KEY=your_key uv run python scripts/run_agent_eval.py --tasks tasks/tasks.json --mode raysurfer --out runs/with_raysurfer.json --model haiku --max-turns 4 --timeout-seconds 180
uv run python scripts/score_eval.py --tasks tasks/tasks.json --baseline-runs runs/baseline.json --raysurfer-runs runs/with_raysurfer.json --json-out runs/summary.json

Existing benchmark

cd examples/raysurfer-existing-benchmarks-eval
uv run python scripts/build_tasks.py --out tasks/existing_benchmarks_20.json --humaneval-limit 10 --mbpp-limit 10
PYTHONPATH=../../raysurfer-python/src uv run python scripts/seed_reference_solutions.py --tasks tasks/existing_benchmarks_20.json
uv run python scripts/run_benchmark_eval.py --tasks tasks/existing_benchmarks_20.json --mode baseline --out runs/baseline.json --model haiku --max-turns 4 --timeout-seconds 180 --validation-timeout-seconds 20 --raysurfer-source reference
uv run python scripts/run_benchmark_eval.py --tasks tasks/existing_benchmarks_20.json --mode raysurfer --out runs/with_raysurfer.json --model haiku --max-turns 4 --timeout-seconds 180 --validation-timeout-seconds 20 --raysurfer-source reference
uv run python scripts/score_eval.py --tasks tasks/existing_benchmarks_20.json --baseline-runs runs/baseline.json --raysurfer-runs runs/with_raysurfer.json --json-out runs/summary.json
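Both pipelines end with score_eval.py writing runs/summary.json. A small helper for pulling per-mode numbers out of such a file; the per-mode key layout is an assumption about score_eval.py's output, not a documented schema:

```python
import json

def load_mode_stats(path, modes=("baseline", "raysurfer")):
    """Read a scoring summary and return {mode: stats-dict}.

    Assumes the summary JSON has one top-level object per mode;
    missing modes map to an empty dict.
    """
    with open(path) as f:
        summary = json.load(f)
    return {m: summary.get(m, {}) for m in modes}

# Usage, after a scoring run:
# stats = load_mode_stats("runs/summary.json")
# print(stats["baseline"], stats["raysurfer"])
```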

Artifacts

  • examples/raysurfer-public-oneshot-eval/runs/baseline.json
  • examples/raysurfer-public-oneshot-eval/runs/with_raysurfer.json
  • examples/raysurfer-public-oneshot-eval/runs/summary.json
  • examples/raysurfer-existing-benchmarks-eval/runs/baseline.json
  • examples/raysurfer-existing-benchmarks-eval/runs/with_raysurfer.json
  • examples/raysurfer-existing-benchmarks-eval/runs/summary.json