❯ /benchmarks — verified numbers

Every number has a command.

Nothing on this page is simulated. Two real runs back it: a library verification pass on a rented RTX A4000 ($0.16 end to end) and the HumanEval cascade on a rented RTX 3090 ($0.39 for the autoresearch stage). Every table names the command or artifact that produced it.

❯ HumanEval on Qwen3.5-4B: the whole story in one chart

● pass@1 on 164 problems, same 4B weights

baseline · greedy: 58.5% (96/164)
vanilla NPCoT · greedy: 54.3% (−4.2 pt; a regression, published on purpose)
+ verified retry · sampled: 67.68% (111/164, +9.2 pt)
+ autoresearch · best-of-16: 85.98% (141/164, +27.5 pt, $0.39)
Qwen3.5-9B baseline · greedy (reference): 71.3% (117/164; the bigger model this run beats)

traces: training_results/realworld_vastai/humaneval_agent_4B.json · solved_programs.jsonl · rented RTX 3090

❯ read this before quoting 86%

This is a system result, not a model result.

The two big deltas come from verifier-gated sampling, not from the coprocessor weights. Per-problem analysis of the traces shows it plainly:

+15 — every retry-stack rescue came from the sampled arm (gate 0.05, temp 0.5). The greedy NPCoT arms rescued zero problems.
+30 — every autoresearch rescue is best-of-16 sampling (4 temperatures × 4 samples) filtered by the HumanEval test suite. The model already contained the correct programs; greedy decoding just never found them.

A harness like this would likely lift any model whose samples already contain correct programs. The missing control, a gate=0, temp=0.5 arm with the same sampling budget, has not been run yet, so the NPCoT coprocessor's marginal contribution to the retry wins is unproven. That ablation is queued; this paragraph changes when it lands.

What is proven: the safety floor (first-try passes exactly match baseline's 96, by construction), the vanilla regression (−4.2 pt — we show it instead of hiding it), the cost ($0.39), and the persistence contract — every verified solve is cached, so repeat prompts return in O(1) without touching the model.
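The best-of-16 rescue described above (4 temperatures × 4 samples, filtered by the test suite) can be sketched roughly as follows. This is a minimal illustration, not the project's harness: the temperature values, the `sample` callable, and the toy sampler are all hypothetical, the NPCoT gate is not modelled, and the test source is assumed to define a HumanEval-style `check(candidate)` function.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical grid: the page says "4 temperatures" but does not list them.
TEMPERATURES: Sequence[float] = (0.2, 0.5, 0.8, 1.0)
SAMPLES_PER_TEMP = 4  # 4 x 4 = 16 candidates per problem

Sampler = Callable[[str, float], str]  # (prompt, temperature) -> program text


def passes(program: str, test_src: str, entry_point: str) -> bool:
    """Return True if the candidate survives the test suite.

    Any exception (syntax error, failed assert, missing name) counts as a
    failure. A real harness should run this in a sandbox with a timeout.
    """
    namespace: dict = {}
    try:
        exec(program, namespace)
        exec(test_src, namespace)
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        return False


def best_of_n(
    sample: Sampler, prompt: str, test_src: str, entry_point: str
) -> Optional[Tuple[str, float, int]]:
    """Sweep the grid and return (program, temperature, attempts) for the
    first verified candidate, or None if all 16 fail."""
    attempts = 0
    for temperature in TEMPERATURES:
        for _ in range(SAMPLES_PER_TEMP):
            attempts += 1
            candidate = sample(prompt, temperature)
            if passes(candidate, test_src, entry_point):
                return candidate, temperature, attempts
    return None


def make_toy_sampler(fail_first: int) -> Sampler:
    """Stand-in for a model: emits a wrong program `fail_first` times,
    then a correct one. Purely illustrative."""
    calls = 0

    def sample(prompt: str, temperature: float) -> str:
        nonlocal calls
        calls += 1
        op = "-" if calls <= fail_first else "+"
        return f"def add(a, b):\n    return a {op} b\n"

    return sample


TEST_SRC = (
    "def check(candidate):\n"
    "    assert candidate(1, 2) == 3\n"
    "    assert candidate(10, -5) == 5\n"
)

result = best_of_n(make_toy_sampler(fail_first=5), "add two numbers", TEST_SRC, "add")
```

The verifier is what makes the result trustworthy: a candidate is kept only if the tests pass, so extra samples can add solves but cannot add false positives beyond what the test suite itself lets through.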

❯ The cascade, layer by layer

Configuration	pass@1	pass count	Δ vs baseline	cost
Baseline (greedy, no NPCoT)	58.5%	96/164	—	—
Vanilla NPCoT (regressed)	~54.3%	76/140	−4.2 pt	—
+ compounding retry	67.68%	111/164	+9.2 pt	2.24× attempts avg
+ autoresearch	85.98%	141/164	+27.5 pt	$0.39 GPU
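As a quick arithmetic check, the 164-problem rows can be recomputed from their pass counts. This is only a sketch of the bookkeeping, not the project's scoring script. It suggests that the printed deltas correspond to subtracting the rounded 58.5% baseline; deltas taken directly from raw counts come out a few hundredths of a point lower.

```python
TOTAL = 164
pass_counts = {"baseline": 96, "retry": 111, "autoresearch": 141}


def pct(passed: int, total: int = TOTAL) -> float:
    """pass@1 as a percentage, rounded to two decimals."""
    return round(100 * passed / total, 2)


rates = {name: pct(n) for name, n in pass_counts.items()}

# Deltas against the rounded baseline figure shown in the table (58.5).
printed_deltas = {
    name: round(rate - 58.5, 1) for name, rate in rates.items() if name != "baseline"
}

# Deltas computed from raw pass counts, before any rounding.
exact_deltas = {
    name: round(100 * (n - pass_counts["baseline"]) / TOTAL, 2)
    for name, n in pass_counts.items()
    if name != "baseline"
}

print(rates)           # {'baseline': 58.54, 'retry': 67.68, 'autoresearch': 85.98}
print(printed_deltas)  # {'retry': 9.2, 'autoresearch': 27.5}
print(exact_deltas)    # {'retry': 9.15, 'autoresearch': 27.44}
```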

(1) Vanilla hurts. Applied unconditionally, the library fires on unrelated problems and drags pass@1 down 4.2 points. This is why naive "just bolt it on" integration is wrong.

(2) The retry stack can't lose. Baseline runs first; escalation happens only on a verified failure. The first-try pass count (96) matches baseline exactly — regression is impossible by construction.

(3) Autoresearch widens the search. 53 hard-fails remained after the retry stack (164 − 111). 51 of them exposed test signal the miner could use; best-of-16 sweeps (4 temperatures × 4 samples) rescued 30 of those in two hours. Every +1 is a problem that greedy decoding never solved, and every solve persists.

The cascade did not add parameters. It widened the search and kept the receipts.
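The "can't lose" property in (2) follows from the ordering alone, which a small sketch can make concrete. Everything here is an assumption for illustration: the `Stage` and `Verifier` types, the `run_cascade` name, and the toy lookup tables standing in for the greedy baseline and the sampled retry.

```python
from typing import Callable, Optional, Sequence, Tuple

Stage = Callable[[str], Optional[str]]   # prompt -> candidate program (or None)
Verifier = Callable[[str, str], bool]    # (prompt, candidate) -> tests passed?


def run_cascade(
    prompt: str, stages: Sequence[Stage], verify: Verifier
) -> Optional[Tuple[int, str]]:
    """Try stages in order; return (stage_index, program) for the first
    verified candidate. Later stages run only after a verified failure."""
    for index, stage in enumerate(stages):
        candidate = stage(prompt)
        if candidate is not None and verify(prompt, candidate):
            return index, candidate
    return None


# Toy stand-ins: "programs" are single letters, and the verifier is a lookup.
ANSWERS = {"p1": "a", "p2": "b", "p3": "c", "p4": "d"}
BASELINE_OUTPUT = {"p1": "a", "p2": "b", "p3": "x", "p4": "x"}
RETRY_OUTPUT = {"p3": "c", "p4": "y"}


def verify(prompt: str, candidate: str) -> bool:
    return ANSWERS[prompt] == candidate


stages = [BASELINE_OUTPUT.get, RETRY_OUTPUT.get]  # stage 0 is always the baseline
results = {p: run_cascade(p, stages, verify) for p in ANSWERS}

first_try = {p for p, r in results.items() if r is not None and r[0] == 0}
baseline_alone = {p for p in ANSWERS if verify(p, BASELINE_OUTPUT[p])}
```

Because stage 0 is the unmodified baseline and its verified output is returned immediately, `first_try` equals `baseline_alone` for any choice of later stages. Escalation can only add solves (`p3` here) or leave a failure as a failure (`p4`).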

❯ Qwen3.5 family baselines (no NPCoT)

Full HumanEval, greedy decoding, rented RTX 3090. These are the honest reference points every number above is measured against.

Model	pass@1	pass count	status
Qwen/Qwen3.5-0.8B	23.2%	38/164	complete
Qwen/Qwen3.5-2B	37.8%	62/164	complete
Qwen/Qwen3.5-4B	58.5%	96/164	complete
Qwen/Qwen3.5-9B	71.3%	117/164	complete

❯ The compounding store: every solve persists

Every verified solve updates three artifacts on disk. The next run short-circuits the cascade on any prompt hash it has seen before: solved once means free forever.

Artifact	Purpose	Update rate
solved_programs.jsonl	append-only fact log — source of truth	1 row per solve
prompt_cache.json	hash(prompt, entry_point) → program	1 entry per unique prompt
temperature_stats.json	per-temp solve counts across sessions	+1 on successful solve

Resumable and process-safe. The cache is rebuildable from the log via ncpu.autoresearch.cli rebuild.
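The log-as-source-of-truth design can be sketched in a few lines. This is an assumption-laden illustration, not the `ncpu` implementation: the row fields, the SHA-256 key, and the function names are hypothetical, and no file locking is shown, so it does not demonstrate the process-safety claim.

```python
import hashlib
import json
from typing import Dict


def prompt_key(prompt: str, entry_point: str) -> str:
    """Stable key for (prompt, entry_point). Python's built-in hash() is
    salted per process, so a cryptographic digest is used instead."""
    payload = json.dumps([prompt, entry_point]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def record_solve(
    log_path: str, cache: Dict[str, str], prompt: str, entry_point: str,
    program: str, temperature: float,
) -> None:
    """Append one row to the fact log, then update the in-memory cache."""
    row = {
        "key": prompt_key(prompt, entry_point),
        "entry_point": entry_point,
        "program": program,
        "temperature": temperature,
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(row) + "\n")
    cache[row["key"]] = program


def rebuild_cache(log_path: str) -> Dict[str, str]:
    """Replay the append-only log; the last row for a key wins."""
    cache: Dict[str, str] = {}
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            if line.strip():
                row = json.loads(line)
                cache[row["key"]] = row["program"]
    return cache


def lookup(cache: Dict[str, str], prompt: str, entry_point: str):
    """Dictionary lookup: a repeat prompt never reaches the model."""
    return cache.get(prompt_key(prompt, entry_point))
```

Keeping the JSONL log append-only means the cache file is disposable: if it is lost or corrupted, replaying the log reproduces it.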

❯ From benchmark to coding assistant

The same cascade handles free-form prompts once the implicit tests are pulled out. The parser handles four patterns with no LLM: explicit asserts, doctest blocks, arrow notation (fn(x) → y), and "returns" prose.

● zsh — autoresearch cli

❯ echo 'def add(a, b):
    """Return the sum."""
add(1, 2) -> 3
add(10, -5) -> 5' | python -m ncpu.autoresearch.cli user

[user] entry_point=add  io_pairs=2  sources={'arrow': 2}
[user] SOLVED by template_match in 0.01s

def add(a, b):
    """Implement add."""
    return a + b

A prompt with any example I/O becomes a cascade-solvable work item, the solve persists into the compounding store, and the same prompt never costs anything again.
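To give a feel for how implicit tests can be pulled out without an LLM, here is a rough sketch covering two of the four patterns (arrow notation and explicit asserts). The regexes and the `extract_io_pairs` name are assumptions made for illustration; the real parser's rules are not shown on this page and are likely more thorough.

```python
import ast
import re
from collections import Counter
from typing import Any, List, Tuple

# name(args) -> expected   (ASCII arrow or the Unicode arrow)
ARROW = re.compile(r"^\s*(\w+)\((.*)\)\s*(?:->|→)\s*(.+?)\s*$")
# assert name(args) == expected
ASSERT = re.compile(r"^\s*assert\s+(\w+)\((.*)\)\s*==\s*(.+?)\s*$")

IOPair = Tuple[str, str, tuple, Any]  # (source, function name, args, expected)


def _literal_args(args_src: str) -> tuple:
    """Parse a comma-separated argument list of Python literals."""
    if not args_src.strip():
        return ()
    return ast.literal_eval(f"({args_src},)")


def extract_io_pairs(text: str) -> List[IOPair]:
    """Scan line by line; keep only lines whose arguments and expected
    value are plain literals, and skip anything that fails to parse."""
    pairs: List[IOPair] = []
    for line in text.splitlines():
        for source, pattern in (("arrow", ARROW), ("assert", ASSERT)):
            match = pattern.match(line)
            if match is None:
                continue
            name, args_src, expected_src = match.groups()
            try:
                args = _literal_args(args_src)
                expected = ast.literal_eval(expected_src)
            except (ValueError, SyntaxError):
                continue
            pairs.append((source, name, args, expected))
            break
    return pairs


PROMPT = '''def add(a, b):
    """Return the sum."""
add(1, 2) -> 3
add(10, -5) -> 5'''

pairs = extract_io_pairs(PROMPT)
sources = dict(Counter(source for source, *_ in pairs))
print(len(pairs), sources)  # 2 {'arrow': 2}
```

Using `ast.literal_eval` rather than `eval` keeps the parser from executing anything in the prompt; only literal values survive, which is a reasonable restriction for example I/O.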

❯ Runtime verification

These tables verify the runtime — the library executor, its ports, and its packaging — not the LLM story above. The self-consistency check is a CI canary against hand-coded buggy answers, not a model comparison.

Library self-consistency (200 problems)

python3 -m benchmarks.benchmark_npcot_coding_bench --n-problems 200

System	pass@1	MAE	wall
Ground-truth reference	100.0%	0.000	0 ms
Synthetic noise floor	22.0%	2.585	0 ms
NPCoT library consult	60.5%	0.740	7 ms

Scale practicality (1,000 unseen)

python3 -m demos.npcot_scale_practicality

Path	per-problem	MAE
Soft forward	0.094 ms	0.694
Library hit (Python)	0.038 ms	0.560
Library hit (Rust)	2 µs	0.560

The discrete program beats its soft parent (0.560 vs 0.694 MAE) — no sigmoid-relaxation drift.

Cross-platform correctness

Platform	Library MAE	Soft MAE
macOS Apple Silicon (MPS)	0.560	0.694
Linux x86_64 CUDA 12.1	0.560	0.694
Linux x86_64 CPU	0.560	0.694
Bit-for-bit identical	✓	✓

Distribution artifacts

Artifact	Size	Runs on
WASM (npcot_wasm.wasm)	130 KB	any browser
Native binary (npcot_run)	475 KB	macOS, Linux x86
Release tarball	224 KB	—
Typical library JSON	2.2 KB	—

458 tests across Python (436), Rust (18), and WASM (4) targets. No NPCoT test fails on any platform.