❯ /benchmarks — verified numbers
Every number has a command.
Nothing on this page is simulated. Two real runs back it: a library verification pass on a rented RTX A4000 ($0.16 end to end) and the HumanEval cascade on a rented RTX 3090 ($0.39 for the autoresearch stage). Every table names the command or artifact that produced it.
❯HumanEval on Qwen3.5-4B — the whole story in one chart
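- Baseline (greedy, no NPCoT): 96/164
- Vanilla NPCoT: −4.2 pt, a regression, published on purpose
- + compounding retry: 111/164 · +9.2 pt
- + autoresearch: 141/164 · +27.5 pt · $0.39
- Qwen3.5-9B baseline: 117/164, the bigger model this run beats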
traces: training_results/realworld_vastai/humaneval_agent_4B.json · solved_programs.jsonl · rented RTX 3090
❯ read this before quoting 86%
This is a system result, not a model result.
The two big deltas come from verifier-gated sampling, not from the coprocessor weights. Per-problem analysis of the traces shows it plainly:
- +15 — every retry-stack rescue came from the sampled arm (gate 0.05, temp 0.5). The greedy NPCoT arms rescued zero problems.
- +30 — every autoresearch rescue is best-of-16 sampling (4 temperatures × 4 samples) filtered by the HumanEval test suite. The model already contained the correct programs; greedy decoding just never found them.
A harness like this would lift any model. The missing control — a gate=0, temp=0.5 arm with the same sampling budget — has not been run yet, so the NPCoT coprocessor's marginal contribution to the retry wins is unproven. That ablation is queued; this paragraph changes when it lands.
What is proven: the safety floor (first-try passes exactly match baseline's 96, by construction), the vanilla regression (−4.2 pt — we show it instead of hiding it), the cost ($0.39), and the persistence contract — every verified solve is cached, so repeat prompts return in O(1) without touching the model.
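The verifier-gated sampling described in this section (4 temperatures × 4 samples, filtered by the test suite) can be sketched as below. This is a minimal illustration and not the project's code: `best_of_n`, `generate`, `passes_tests`, the toy stand-ins, and the temperature grid are all hypothetical names and values chosen for the example. The source states only the 4 × 4 budget.

```python
import random
from typing import Callable, Optional, Sequence, Tuple


def best_of_n(
    generate: Callable[[str, float, random.Random], str],
    passes_tests: Callable[[str], bool],
    prompt: str,
    temperatures: Sequence[float] = (0.2, 0.5, 0.8, 1.0),  # hypothetical grid
    samples_per_temp: int = 4,
    seed: int = 0,
) -> Optional[Tuple[str, float, int]]:
    """Return (program, temperature, attempts) for the first verified sample."""
    rng = random.Random(seed)
    attempts = 0
    for temp in temperatures:
        for _ in range(samples_per_temp):
            attempts += 1
            candidate = generate(prompt, temp, rng)
            # The verifier is the gate: unverified samples are discarded.
            if passes_tests(candidate):
                return candidate, temp, attempts
    return None  # budget exhausted without a verified program


# Toy stand-ins so the sketch runs without a model (assumptions, not real APIs).
def toy_generate(prompt: str, temp: float, rng: random.Random) -> str:
    # Higher temperature makes the correct body more likely in this toy.
    body = "a + b" if rng.random() < temp else "a - b"
    return f"def add(a, b):\n    return {body}\n"


def toy_passes_tests(program: str) -> bool:
    scope: dict = {}
    try:
        exec(program, scope)
        return scope["add"](1, 2) == 3 and scope["add"](10, -5) == 5
    except Exception:
        return False
```

The sketch makes the caveat above concrete: nothing in `best_of_n` depends on which model sits behind `generate`, which is why the gate=0 control is needed to isolate any coprocessor contribution.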
❯The cascade, layer by layer
| Configuration | pass@1 | passed/total | Δ vs baseline | cost or effort |
|---|---|---|---|---|
| Baseline (greedy, no NPCoT) | 58.5% | 96/164 | — | — |
| Vanilla NPCoT (regressed) | ~54.3% | 76/140 | −4.2 pt | — |
| + compounding retry | 67.68% | 111/164 | +9.2 pt | 2.24× attempts avg |
| + autoresearch | 85.98% | 141/164 | +27.5 pt | $0.39 GPU |
(1) Vanilla hurts. Applied unconditionally, the library fires on unrelated problems and drags pass@1 down 4.2 points. This is why naive "just bolt it on" integration is wrong.
(2) The retry stack can't lose. Baseline runs first; escalation happens only on a verified failure. The first-try pass count (96) matches baseline exactly — regression is impossible by construction.
(3) Autoresearch widens the search. 53 hard-fails remained after the retry stack (164 − 111). 51 of them exposed test signal the miner could use; 16-sample sweeps across four temperatures rescued 30 of those in two hours. Every +1 is a problem that greedy baseline decoding failed to solve, and every solve persists.
The cascade did not add parameters. It widened the search and kept the receipts.
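The "can't lose" property in point (2) follows from control flow alone, which a short sketch can show. The function name `run_cascade` and the solver and verifier callables are hypothetical; this illustrates the ordering described above under those assumptions, not the project's implementation.

```python
from typing import Callable, List, Optional, Tuple

Solver = Callable[[str], Optional[str]]  # returns a program, or None


def run_cascade(
    prompt: str,
    verify: Callable[[str], bool],
    baseline: Solver,
    escalations: List[Tuple[str, Solver]],
) -> Tuple[Optional[str], str]:
    """Run the baseline first; escalate only after a verified failure."""
    program = baseline(prompt)
    if program is not None and verify(program):
        # Baseline passes are returned untouched, so the first-try pass
        # count can never fall below the baseline's own count.
        return program, "baseline"
    for name, solver in escalations:
        program = solver(prompt)
        if program is not None and verify(program):
            return program, name
    return None, "unsolved"
```

Because later stages never see a prompt the baseline already solved, they can only add passes. The vanilla configuration lacks this guard, which is consistent with the regression in point (1).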
❯Qwen3.5 family baselines (no NPCoT)
Full HumanEval, greedy decoding, rented RTX 3090. These are the honest reference points every number above is measured against.
| Model | pass@1 | pass count | status |
|---|---|---|---|
| Qwen/Qwen3.5-0.8B | 23.2% | 38/164 | complete |
| Qwen/Qwen3.5-2B | 37.8% | 62/164 | complete |
| Qwen/Qwen3.5-4B | 58.5% | 96/164 | complete |
| Qwen/Qwen3.5-9B | 71.3% | 117/164 | complete |
❯The compounding store: every solve persists
Every verified solve writes three artifacts to disk. The next run short-circuits the cascade on any prompt hash it has seen before: solved once means free forever.
| Artifact | Purpose | Update rate |
|---|---|---|
| solved_programs.jsonl | append-only fact log — source of truth | 1 row per solve |
| prompt_cache.json | hash(prompt, entry_point) → program | 1 entry per unique prompt |
| temperature_stats.json | per-temp solve counts across sessions | +1 on successful solve |
Resumable and process-safe. The cache is rebuildable from the log via ncpu.autoresearch.cli rebuild.
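The relationship between the append-only log and the two derived files can be sketched as follows. The row schema, the SHA-256 key scheme, and the function names (`prompt_key`, `record_solve`, `rebuild`) are assumptions made for illustration. The source specifies only that the cache maps hash(prompt, entry_point) to a program and that it is rebuildable from the log.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, Tuple


def prompt_key(prompt: str, entry_point: str) -> str:
    # Assumed key scheme: SHA-256 over a JSON encoding of both fields.
    payload = json.dumps([prompt, entry_point]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def record_solve(log_path: Path, prompt: str, entry_point: str,
                 program: str, temperature: float) -> None:
    """Append one row per verified solve; the log is the source of truth."""
    row = {
        "key": prompt_key(prompt, entry_point),
        "program": program,
        "temperature": temperature,
    }
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(row) + "\n")


def rebuild(log_path: Path) -> Tuple[Dict[str, str], Dict[str, int]]:
    """Derive the prompt cache and per-temperature counts from the log."""
    cache: Dict[str, str] = {}
    temp_stats: Dict[str, int] = {}
    if not log_path.exists():
        return cache, temp_stats
    with log_path.open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            row = json.loads(line)
            cache[row["key"]] = row["program"]  # one entry per unique prompt
            temp = str(row["temperature"])
            temp_stats[temp] = temp_stats.get(temp, 0) + 1
    return cache, temp_stats
```

Keeping the log as the only file written on the hot path means the cache and statistics can be deleted and regenerated at any time, which is one plausible way to get the resumability described above.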
❯From benchmark to coding assistant
The same cascade handles free-form prompts once the implicit tests are pulled out. The parser handles four patterns with no LLM: explicit asserts, doctest blocks, arrow notation (fn(x) → y), and "returns" prose.
❯ echo 'def add(a, b): """Return the sum.""" add(1, 2) -> 3 add(10, -5) -> 5' | python -m ncpu.autoresearch.cli user
[user] entry_point=add io_pairs=2 sources={'arrow': 2}
[user] SOLVED by template_match in 0.01s
def add(a, b):
    """Implement add."""
    return a + b
A prompt with any example I/O becomes a cascade-solvable work item, the solve persists into the compounding store, and the same prompt never costs anything again.
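As an illustration of how one of the four patterns could be extracted without an LLM, here is a sketch for arrow notation only. The regular expression and the function name `extract_arrow_pairs` are hypothetical, and the sketch assumes one example per line with literal arguments; the project's actual parser may work differently.

```python
import ast
import re
from typing import Any, List, Tuple

# Matches lines such as "add(1, 2) -> 3" or "sq(3) → 9".
ARROW = re.compile(r"^\s*(\w+)\((.*)\)\s*(?:->|→)\s*(.+?)\s*$")


def extract_arrow_pairs(text: str) -> List[Tuple[str, tuple, Any]]:
    """Return (function name, argument tuple, expected value) per example."""
    pairs = []
    for line in text.splitlines():
        match = ARROW.match(line)
        if not match:
            continue
        name, args_src, expected_src = match.groups()
        try:
            # literal_eval accepts only Python literals, so nothing is executed.
            args = ast.literal_eval(f"({args_src},)") if args_src.strip() else ()
            expected = ast.literal_eval(expected_src)
        except (ValueError, SyntaxError):
            continue  # not a literal example, e.g. a signature like "f(a, b) -> int"
        pairs.append((name, args, expected))
    return pairs
```

Each extracted triple can then serve as an implicit test: call the candidate function with the arguments and compare the result with the expected value.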
❯Runtime verification
These tables verify the runtime — the library executor, its ports, and its packaging — not the LLM story above. The self-consistency check is a CI canary against hand-coded buggy answers, not a model comparison.
Library self-consistency (200 problems)
python3 -m benchmarks.benchmark_npcot_coding_bench --n-problems 200
| System | pass@1 | MAE | wall time |
|---|---|---|---|
| Ground-truth reference | 100.0% | 0.000 | 0 ms |
| Synthetic noise floor | 22.0% | 2.585 | 0 ms |
| NPCoT library consult | 60.5% | 0.740 | 7 ms |
Scale practicality (1,000 unseen)
python3 -m demos.npcot_scale_practicality
| Path | time per problem | MAE |
|---|---|---|
| Soft forward | 0.094 ms | 0.694 |
| Library hit (Python) | 0.038 ms | 0.560 |
| Library hit (Rust) | 2 µs | 0.560 |
The discrete program beats its soft parent (0.560 vs 0.694 MAE) — no sigmoid-relaxation drift.
Cross-platform correctness
| Platform | Library MAE | Soft MAE |
|---|---|---|
| macOS Apple Silicon (MPS) | 0.560 | 0.694 |
| Linux x86_64 CUDA 12.1 | 0.560 | 0.694 |
| Linux x86_64 CPU | 0.560 | 0.694 |
| Bit-for-bit identical | ✓ | ✓ |
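For readers who want to run a similar check on their own outputs, the two quantities in these tables can be computed along the following lines. This is a generic sketch under the assumption that outputs are float64 values; the helper names are hypothetical and the project's own harness is not shown in the source.

```python
import struct
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float],
                        expected: Sequence[float]) -> float:
    """Average of |predicted - expected| over paired values."""
    if len(predicted) != len(expected) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - e) for p, e in zip(predicted, expected)) / len(predicted)


def bit_identical(a: Sequence[float], b: Sequence[float]) -> bool:
    """True only when every pair has the same IEEE-754 float64 bit pattern."""
    return len(a) == len(b) and all(
        struct.pack("<d", x) == struct.pack("<d", y) for x, y in zip(a, b)
    )
```

Comparing packed bytes is stricter than `==`: it distinguishes `0.0` from `-0.0` and catches last-bit differences that a tolerance-based comparison would hide.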
Distribution artifacts
| Artifact | Size | Runs on |
|---|---|---|
| WASM (npcot_wasm.wasm) | 130 KB | any browser |
| Native binary (npcot_run) | 475 KB | macOS, Linux x86 |
| Release tarball | 224 KB | — |
| Typical library JSON | 2.2 KB | — |
458 tests across Python (436), Rust (18), and WASM (4) targets. No NPCoT test fails on any platform.