Real-World Results

Customer Results

Cartage: 70% to 95% Accuracy

Cartage integrated Raysurfer into their production agent workflow and saw accuracy improve from 70% to 95% on repetitive multi-step tasks. By retrieving proven code instead of regenerating from scratch each run, their agent produced consistent, correct results — even on complex tool chains that previously failed intermittently.

Benchmark Results

Side-by-side runs where baseline and Raysurfer modes are evaluated on the same task sets with the same budgets. Compared modes:
  • claude-agent-sdk baseline
  • Raysurfer reuse mode
On similar tasks, Raysurfer finishes more work with fewer LLM interaction calls and less elapsed time.

Headline Numbers (February 20, 2026)

Public one-shot implementation tasks (20 tasks)
  • Consistency: 5.0% (1/20) baseline vs 100.0% (20/20) Raysurfer
  • Interaction calls: 81 total (4.05/attempt) baseline vs 0 total (0.00/attempt) Raysurfer
  • Total time: 860.9s baseline vs 14.7s Raysurfer

Existing benchmark tasks, 10 HumanEval + 10 MBPP (20 tasks)
  • Consistency: 0.0% (0/20) baseline vs 100.0% (20/20) Raysurfer
  • Interaction calls: 68 total (3.40/attempt) baseline vs 20 total (1.00/attempt) Raysurfer
  • Total time: 291.3s baseline vs 0.436s Raysurfer

What This Means

  1. Consistent — 100% consistency on cached tasks vs 0-5% without caching
  2. Faster — seconds instead of minutes for the same workloads
  3. Cheaper — fewer interaction calls mean less model/tool-loop work per attempt
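The speed and cost claims follow directly from the headline numbers above (February 20, 2026 run):

```python
# Derived from the headline table: baseline vs Raysurfer elapsed time
# and interaction calls. These are the published numbers, recomputed.
oneshot_speedup = 860.9 / 14.7    # public one-shot tasks
bench_speedup = 291.3 / 0.436     # existing benchmark tasks
call_reduction = (81 - 0) / 81    # public one-shot interaction calls

print(f"{oneshot_speedup:.0f}x, {bench_speedup:.0f}x, {call_reduction:.0%}")
# → 59x, 668x, 100%
```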

Methodology

  1. Use the same task list for baseline and Raysurfer runs.
  2. Keep model, turn limits, and timeout budgets fixed between modes.
  3. Seed Raysurfer with verified snippets before the Raysurfer run.
  4. Record per-attempt completion, elapsed seconds, and the interaction-call metric.
  5. Score consistency as completed_within_180_seconds / total_attempts.
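The scoring rule in step 5 can be sketched as follows. The per-attempt record fields here are illustrative assumptions, not the eval scripts' actual schema:

```python
def consistency(runs, timeout_seconds=180.0):
    """Fraction of attempts that completed within the timeout.

    `runs` is a list of per-attempt records; the "completed" and
    "elapsed_seconds" field names are assumptions for illustration.
    """
    completed = sum(
        1 for r in runs
        if r.get("completed")
        and r.get("elapsed_seconds", float("inf")) <= timeout_seconds
    )
    return completed / len(runs) if runs else 0.0

runs = [{"completed": True, "elapsed_seconds": 42.0},
        {"completed": False, "elapsed_seconds": 180.0}]
print(f"{consistency(runs):.1%}")  # → 50.0%
```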

Interaction-Call Metric

  • In examples/raysurfer-public-oneshot-eval, calls come from tools= in run details (run_agent_eval.py).
  • In examples/raysurfer-existing-benchmarks-eval, calls come from metric= in run details (run_benchmark_eval.py): baseline uses Claude tool-loop calls, Raysurfer uses retrieved candidates evaluated.
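Whichever source the calls come from, they are aggregated the same way into the "N total (avg/attempt)" form used in the headline numbers. A minimal sketch, with a stand-in record layout rather than the scripts' real run-details format:

```python
def summarize_calls(runs):
    """Sum per-attempt interaction calls and format as in the results
    table. The "interaction_calls" field name is a hypothetical
    stand-in for the per-run call counts."""
    total = sum(r.get("interaction_calls", 0) for r in runs)
    per_attempt = total / len(runs) if runs else 0.0
    return f"{total} total ({per_attempt:.2f}/attempt)"

runs = [{"interaction_calls": 4}, {"interaction_calls": 5}, {"interaction_calls": 3}]
print(summarize_calls(runs))  # → 12 total (4.00/attempt)
```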

Re-run Commands

Public one-shot benchmark

cd examples/raysurfer-public-oneshot-eval
PYTHONPATH=../../raysurfer-python/src uv run python scripts/seed_verified_snippets.py --tasks tasks/tasks.json
uv run python scripts/run_agent_eval.py --tasks tasks/tasks.json --mode baseline --out runs/baseline.json --model haiku --max-turns 4 --timeout-seconds 180
RAYSURFER_BASE_URL=http://127.0.0.1:8000 RAYSURFER_API_KEY=your_key uv run python scripts/run_agent_eval.py --tasks tasks/tasks.json --mode raysurfer --out runs/with_raysurfer.json --model haiku --max-turns 4 --timeout-seconds 180
uv run python scripts/score_eval.py --tasks tasks/tasks.json --baseline-runs runs/baseline.json --raysurfer-runs runs/with_raysurfer.json --json-out runs/summary.json

Existing benchmark

cd examples/raysurfer-existing-benchmarks-eval
uv run python scripts/build_tasks.py --out tasks/existing_benchmarks_20.json --humaneval-limit 10 --mbpp-limit 10
PYTHONPATH=../../raysurfer-python/src uv run python scripts/seed_reference_solutions.py --tasks tasks/existing_benchmarks_20.json
uv run python scripts/run_benchmark_eval.py --tasks tasks/existing_benchmarks_20.json --mode baseline --out runs/baseline.json --model haiku --max-turns 4 --timeout-seconds 180 --validation-timeout-seconds 20 --raysurfer-source reference
uv run python scripts/run_benchmark_eval.py --tasks tasks/existing_benchmarks_20.json --mode raysurfer --out runs/with_raysurfer.json --model haiku --max-turns 4 --timeout-seconds 180 --validation-timeout-seconds 20 --raysurfer-source reference
uv run python scripts/score_eval.py --tasks tasks/existing_benchmarks_20.json --baseline-runs runs/baseline.json --raysurfer-runs runs/with_raysurfer.json --json-out runs/summary.json
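Both pipelines end with score_eval.py writing runs/summary.json. A small helper for pulling per-mode numbers out of such a file; the per-mode key layout is an assumption about score_eval.py's output, not a documented schema:

```python
import json

def load_mode_stats(path, modes=("baseline", "raysurfer")):
    """Read a scoring summary and return {mode: stats-dict}.

    Assumes the summary JSON has one top-level object per mode;
    missing modes map to an empty dict.
    """
    with open(path) as f:
        summary = json.load(f)
    return {m: summary.get(m, {}) for m in modes}

# Usage, after a scoring run:
# stats = load_mode_stats("runs/summary.json")
# print(stats["baseline"], stats["raysurfer"])
```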

Artifacts

  • examples/raysurfer-public-oneshot-eval/runs/baseline.json
  • examples/raysurfer-public-oneshot-eval/runs/with_raysurfer.json
  • examples/raysurfer-public-oneshot-eval/runs/summary.json
  • examples/raysurfer-existing-benchmarks-eval/runs/baseline.json
  • examples/raysurfer-existing-benchmarks-eval/runs/with_raysurfer.json
  • examples/raysurfer-existing-benchmarks-eval/runs/summary.json