# Real-World Results

## Customer Results
### Cartage: 70% to 95% Accuracy
Cartage integrated Raysurfer into their production agent workflow and saw accuracy improve from 70% to 95% on repetitive multi-step tasks. By retrieving proven code instead of regenerating it from scratch on each run, their agent produced consistent, correct results — even on complex tool chains that previously failed intermittently.

## Benchmark Results
Side-by-side runs evaluate baseline and Raysurfer modes on the same task sets with the same budgets. Compared modes:

- `claude-agent-sdk` baseline
- Raysurfer reuse mode
### Headline Numbers (February 20, 2026)
| Run Set | Tasks | Baseline Consistency | Raysurfer Consistency | Baseline Interaction Calls | Raysurfer Interaction Calls | Baseline Total Time | Raysurfer Total Time |
|---|---|---|---|---|---|---|---|
| Public one-shot implementation tasks | 20 | 5.0% (1/20) | 100.0% (20/20) | 81 total (4.05/attempt) | 0 total (0.00/attempt) | 860.9s | 14.7s |
| Existing benchmark tasks (10 HumanEval + 10 MBPP) | 20 | 0.0% (0/20) | 100.0% (20/20) | 68 total (3.40/attempt) | 20 total (1.00/attempt) | 291.3s | 0.436s |
### What This Means
- Consistent — 100% consistency on cached tasks vs 0-5% without caching
- Faster — seconds instead of minutes for the same workloads
- Cheaper — fewer interaction calls mean less model/tool-loop work per attempt
### Methodology
- Use the same task list for baseline and Raysurfer runs.
- Keep model, turn limits, and timeout budgets fixed between modes.
- Seed Raysurfer with verified snippets before the Raysurfer run.
- Record per-attempt completion, elapsed seconds, and the interaction-call metric.
- Score consistency as `completed_within_180_seconds / total_attempts`.
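The scoring step above can be sketched in a few lines. This is a minimal illustration of the `completed_within_180_seconds / total_attempts` ratio; the per-attempt record fields (`completed`, `elapsed_s`) are illustrative names, not the actual run schema.

```python
# Sketch of the consistency score: attempts that completed within the
# 180-second budget, divided by total attempts. Field names are
# illustrative, not the actual run-record schema.

def consistency(attempts: list[dict], budget_s: float = 180.0) -> float:
    """Fraction of attempts that completed within the time budget."""
    if not attempts:
        return 0.0
    ok = sum(
        1
        for a in attempts
        if a["completed"] and a["elapsed_s"] <= budget_s
    )
    return ok / len(attempts)

# Example: 1 of 20 attempts completes in time -> 5% consistency,
# matching the baseline column of the public one-shot run set.
runs = [{"completed": i == 0, "elapsed_s": 12.0} for i in range(20)]
print(f"{consistency(runs):.1%}")  # 5.0%
```

An attempt that completes after the budget counts as a failure, which is why a slow-but-correct baseline run still scores zero here.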
### Interaction-Call Metric
- In `examples/raysurfer-public-oneshot-eval`, calls come from `tools=` in run details (`run_agent_eval.py`).
- In `examples/raysurfer-existing-benchmarks-eval`, calls come from `metric=` in run details (`run_benchmark_eval.py`): baseline counts Claude tool-loop calls, Raysurfer counts retrieved candidates evaluated.
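As a sketch of how a `key=N` value can be pulled out of run-detail lines: the `tools=N` / `metric=N` line format here is an assumption about what the logs look like, not the actual output of `run_agent_eval.py` or `run_benchmark_eval.py`.

```python
import re

# Sketch: extract per-attempt interaction-call counts from run-detail
# lines. The "tools=N" / "metric=N" format is an assumed log layout.

def call_counts(lines: list[str], key: str) -> list[int]:
    """Return every integer `key=N` value found in the given lines."""
    pattern = re.compile(rf"\b{re.escape(key)}=(\d+)\b")
    return [int(m.group(1)) for line in lines for m in pattern.finditer(line)]

details = [
    "task=add_two tools=4 elapsed=41.2s",
    "task=fizzbuzz tools=5 elapsed=39.0s",
]
totals = call_counts(details, "tools")
print(sum(totals), sum(totals) / len(totals))  # 9 4.5
```

Summing the list gives the "total" column of the headline table; dividing by the number of attempts gives the per-attempt figure.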
### Re-run Commands

#### Public one-shot benchmark

#### Existing benchmark
### Artifacts
- `examples/raysurfer-public-oneshot-eval/runs/baseline.json`
- `examples/raysurfer-public-oneshot-eval/runs/with_raysurfer.json`
- `examples/raysurfer-public-oneshot-eval/runs/summary.json`
- `examples/raysurfer-existing-benchmarks-eval/runs/baseline.json`
- `examples/raysurfer-existing-benchmarks-eval/runs/with_raysurfer.json`
- `examples/raysurfer-existing-benchmarks-eval/runs/summary.json`
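A pair of run files can be diffed with a short loader. This is a sketch under stated assumptions: the field names `consistency` and `total_time_s` are hypothetical, not the actual schema of the JSON artifacts above.

```python
import json
from pathlib import Path

# Sketch: compare a baseline/with-Raysurfer run pair from a runs/ dir.
# The "consistency" and "total_time_s" fields are assumed names, not
# the actual artifact schema.

def compare(runs_dir: str) -> dict:
    base = json.loads(Path(runs_dir, "baseline.json").read_text())
    rays = json.loads(Path(runs_dir, "with_raysurfer.json").read_text())
    return {
        "consistency_delta": rays["consistency"] - base["consistency"],
        "speedup": base["total_time_s"] / rays["total_time_s"],
    }
```

Pointing this at a `runs/` directory (e.g. `compare("examples/raysurfer-public-oneshot-eval/runs")`) yields the consistency gain and wall-clock speedup for that run set, assuming the files carry those fields.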
