/benchmarks — verified numbers

Every number has a command.

Nothing on this page is simulated. Two real runs back it: a library verification pass on a rented RTX A4000 ($0.16 end to end) and the HumanEval cascade on a rented RTX 3090 ($0.39 for the autoresearch stage). Every table names the command or artifact that produced it.

HumanEval on Qwen3.5-4B — the whole story in one chart

pass@1 on 164 problems, same 4B weights

  • baseline · greedy: 58.5% (96/164)
  • vanilla NPCoT · greedy: 54.3% (−4.2 pt; a regression, published on purpose)
  • + verified retry · sampled: 67.68% (111/164 · +9.2 pt)
  • + autoresearch · best-of-16: 85.98% (141/164 · +27.5 pt · $0.39)
  • Qwen3.5-9B baseline · greedy (reference): 71.3% (117/164; the bigger model this run beats)

traces: training_results/realworld_vastai/humaneval_agent_4B.json · solved_programs.jsonl · rented RTX 3090

❯ read this before quoting 86%

This is a system result, not a model result.

The two big deltas come from verifier-gated sampling, not from the coprocessor weights. Per-problem analysis of the traces shows it plainly:

  • +15 — every retry-stack rescue came from the sampled arm (gate 0.05, temp 0.5). The greedy NPCoT arms rescued zero problems.
  • +30 — every autoresearch rescue is best-of-16 sampling (4 temperatures × 4 samples) filtered by the HumanEval test suite. The model already contained the correct programs; greedy decoding just never found them.
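As a rough illustration of what "best-of-16 sampling filtered by a test suite" means, here is a minimal sketch. It is not the project's implementation: `best_of_n`, the stub sampler, the toy verifier, and the four temperature values are all assumptions made for this example (the traces name only temp 0.5 for the retry arm).

```python
import itertools
from typing import Callable, Optional, Sequence


def best_of_n(
    sample: Callable[[float], str],
    passes_tests: Callable[[str], bool],
    temperatures: Sequence[float] = (0.2, 0.5, 0.8, 1.0),  # assumed values
    samples_per_temperature: int = 4,  # 4 temperatures x 4 samples = 16 draws
) -> Optional[str]:
    """Return the first sampled program the verifier accepts, else None."""
    for temperature in temperatures:
        for _ in range(samples_per_temperature):
            candidate = sample(temperature)
            if passes_tests(candidate):
                return candidate
    return None


def make_stub_sampler(correct_on_draw: int) -> Callable[[float], str]:
    """Stand-in for a model call: only one draw yields a correct program."""
    draws = itertools.count()

    def sample(temperature: float) -> str:
        body = "a + b" if next(draws) == correct_on_draw else "a - b"
        return f"def add(a, b):\n    return {body}\n"

    return sample


def passes_tests(program: str) -> bool:
    """Toy verifier: execute the candidate and check two I/O pairs."""
    namespace: dict = {}
    exec(program, namespace)
    return namespace["add"](1, 2) == 3 and namespace["add"](10, -5) == 5


# The sixth draw is correct, so the sweep rescues this problem.
rescued = best_of_n(make_stub_sampler(correct_on_draw=5), passes_tests)
# No correct draw within the 16-sample budget: the problem stays unsolved.
missed = best_of_n(make_stub_sampler(correct_on_draw=99), passes_tests)
```

Nothing in the loop depends on which model sits behind `sample`. The gain comes from drawing more candidates and letting the verifier choose among them.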

A harness like this would lift any model. The missing control — a gate=0, temp=0.5 arm with the same sampling budget — has not been run yet, so the NPCoT coprocessor's marginal contribution to the retry wins is unproven. That ablation is queued; this paragraph changes when it lands.

What is proven: the safety floor (first-try passes exactly match baseline's 96, by construction), the vanilla regression (−4.2 pt — we show it instead of hiding it), the cost ($0.39), and the persistence contract — every verified solve is cached, so repeat prompts return in O(1) without touching the model.

The cascade, layer by layer

| Configuration | pass@1 | pass/164 | Δ vs baseline | cost |
| --- | --- | --- | --- | --- |
| Baseline (greedy, no NPCoT) | 58.5% | 96/164 | | |
| Vanilla NPCoT (regressed) | ~54.3% | 76/140 | −4.2 pt | |
| + compounding retry | 67.68% | 111/164 | +9.2 pt | 2.24× attempts avg |
| + autoresearch | 85.98% | 141/164 | +27.5 pt | $0.39 GPU |

(1) Vanilla hurts. Applied unconditionally, the library fires on unrelated problems and drags pass@1 down 4.2 points. This is why naive "just bolt it on" integration is wrong.

(2) The retry stack can't lose. Baseline runs first; escalation happens only on a verified failure. The first-try pass count (96) matches baseline exactly — regression is impossible by construction.
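The "can't lose" property in (2) follows from control flow alone, as this minimal sketch shows. The names `solve_with_floor`, `arm`, and `is_good` are hypothetical and stand in for the real retry stack: the baseline always runs first, and an escalation arm is called only after the verifier has rejected everything before it.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def solve_with_floor(
    baseline: Callable[[], str],
    escalations: Sequence[Callable[[], str]],
    verified: Callable[[str], bool],
) -> Tuple[Optional[str], int]:
    """Run the baseline first; escalate only on a verified failure.

    Returns (accepted program or None, number of attempts used).
    """
    attempts = 1
    candidate = baseline()
    if verified(candidate):
        return candidate, attempts  # first-try pass: identical to baseline
    for arm_fn in escalations:
        attempts += 1
        candidate = arm_fn()
        if verified(candidate):
            return candidate, attempts
    return None, attempts


calls: List[str] = []


def arm(name: str, output: str) -> Callable[[], str]:
    """Build a fake generation arm that records when it is invoked."""
    def run() -> str:
        calls.append(name)
        return output
    return run


def is_good(program: str) -> bool:
    """Toy verifier for the demo."""
    return program == "good"


first_try = solve_with_floor(arm("baseline", "good"), [arm("sampled", "good")], is_good)
calls_after_first_try = list(calls)  # the sampled arm was never invoked
rescued = solve_with_floor(arm("baseline", "bad"), [arm("sampled", "good")], is_good)
```

Because a problem the baseline solves returns before any other arm runs, the first-try pass count cannot drop below the baseline's.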

(3) Autoresearch widens the search. 53 hard-fails remained after the retry stack (164 − 111). 51 of them exposed test signal the miner could use; 16-sample sweeps across four temperatures rescued 30 of those in two hours. Every +1 is a problem that greedy baseline decoding did not solve, and every solve persists.

The cascade did not add parameters. It widened the search and kept the receipts.

Qwen3.5 family baselines (no NPCoT)

Full HumanEval, greedy decoding, rented RTX 3090. These are the honest reference points every number above is measured against.

| Model | pass@1 | pass count | status |
| --- | --- | --- | --- |
| Qwen/Qwen3.5-0.8B | 23.2% | 38/164 | complete |
| Qwen/Qwen3.5-2B | 37.8% | 62/164 | complete |
| Qwen/Qwen3.5-4B | 58.5% | 96/164 | complete |
| Qwen/Qwen3.5-9B | 71.3% | 117/164 | complete |
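The percentages in this table can be recomputed from the pass counts. The snippet below is only an arithmetic check on the published figures, not part of the benchmark harness; `pass_at_1` is a helper name chosen for this example.

```python
TOTAL_PROBLEMS = 164  # full HumanEval

# Pass counts copied from the table above.
pass_counts = {
    "Qwen/Qwen3.5-0.8B": 38,
    "Qwen/Qwen3.5-2B": 62,
    "Qwen/Qwen3.5-4B": 96,
    "Qwen/Qwen3.5-9B": 117,
}


def pass_at_1(passed: int, total: int = TOTAL_PROBLEMS) -> float:
    """Greedy pass@1 as a percentage: one attempt per problem."""
    return 100.0 * passed / total


recomputed = {model: f"{pass_at_1(n):.1f}%" for model, n in pass_counts.items()}
for model, value in recomputed.items():
    print(model, value)
```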

The compounding store: every solve persists

Every verified solve writes three indices to disk. The next run short-circuits the cascade on any prompt hash it has seen before — solved once means free forever.

| Artifact | Purpose | Update rate |
| --- | --- | --- |
| solved_programs.jsonl | append-only fact log (source of truth) | 1 row per solve |
| prompt_cache.json | hash(prompt, entry_point) → program | 1 entry per unique prompt |
| temperature_stats.json | per-temp solve counts across sessions | +1 on successful solve |

Resumable and process-safe. The cache is rebuildable from the log via ncpu.autoresearch.cli rebuild.
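One way to picture the log-as-source-of-truth design is the sketch below. It is an assumption-laden illustration, not the actual on-disk schema: the row fields (`key`, `entry_point`, `program`), the SHA-256 key, and the function names are all hypothetical. Only the shape follows the text above: an append-only JSONL log, and a prompt cache that can be rebuilt from it.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict


def prompt_key(prompt: str, entry_point: str) -> str:
    """Stable key for (prompt, entry_point); the hash choice is an assumption."""
    payload = json.dumps([prompt, entry_point]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def record_solve(log_path: Path, prompt: str, entry_point: str, program: str) -> None:
    """Append one verified solve to the log; earlier rows are never rewritten."""
    row = {
        "key": prompt_key(prompt, entry_point),
        "entry_point": entry_point,
        "program": program,
    }
    with log_path.open("a", encoding="utf-8") as log:
        log.write(json.dumps(row) + "\n")


def rebuild_cache(log_path: Path) -> Dict[str, str]:
    """Replay the log into a key -> program map; later rows win."""
    cache: Dict[str, str] = {}
    with log_path.open(encoding="utf-8") as log:
        for line in log:
            if line.strip():
                row = json.loads(line)
                cache[row["key"]] = row["program"]
    return cache
```

With this layout the cache file is disposable. Losing it costs one replay of the log, and a repeat prompt becomes a single dictionary lookup on `prompt_key(prompt, entry_point)`.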

From benchmark to coding assistant

The same cascade handles free-form prompts once the implicit tests are pulled out. The parser recognizes four patterns without any LLM call: explicit asserts, doctest blocks, arrow notation (fn(x) -> y), and "returns" prose.

zsh — autoresearch cli
 echo 'def add(a, b):
    """Return the sum."""
add(1, 2) -> 3
add(10, -5) -> 5' | python -m ncpu.autoresearch.cli user

[user] entry_point=add  io_pairs=2  sources={'arrow': 2}
[user] SOLVED by template_match in 0.01s

def add(a, b):
    """Implement add."""
    return a + b

A prompt with any example I/O becomes a cascade-solvable work item, the solve persists into the compounding store, and the same prompt never costs anything again.
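To make the arrow pattern concrete, here is a minimal sketch of how `fn(x) -> y` lines could be turned into I/O pairs with a regular expression. This is not the project's parser: `extract_arrow_pairs` and the regex are assumptions for illustration, and the other three patterns (asserts, doctests, "returns" prose) are not handled.

```python
import ast
import re
from typing import List, Tuple

# A call such as add(1, 2), then "->" or "→", then the expected value.
ARROW = re.compile(r"^\s*(?P<call>\w+\(.*\))\s*(?:->|→)\s*(?P<expected>.+?)\s*$")


def extract_arrow_pairs(prompt: str) -> List[Tuple[str, object]]:
    """Collect (call text, expected value) pairs from arrow-notation lines."""
    pairs: List[Tuple[str, object]] = []
    for line in prompt.splitlines():
        match = ARROW.match(line)
        if match is None:
            continue
        try:
            # Accept only Python literals as expected values; never eval().
            expected = ast.literal_eval(match.group("expected"))
        except (ValueError, SyntaxError):
            continue
        pairs.append((match.group("call"), expected))
    return pairs


prompt = (
    "def add(a, b):\n"
    '    """Return the sum."""\n'
    "add(1, 2) -> 3\n"
    "add(10, -5) -> 5"
)
pairs = extract_arrow_pairs(prompt)
```

For the prompt used in the session above, this yields two pairs, in line with the `io_pairs=2` the CLI reports. The `def` line is skipped because the pattern requires the opening parenthesis to follow the function name directly.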

Runtime verification

These tables verify the runtime — the library executor, its ports, and its packaging — not the LLM story above. The self-consistency check is a CI canary against hand-coded buggy answers, not a model comparison.

Library self-consistency (200 problems)

python3 -m benchmarks.benchmark_npcot_coding_bench --n-problems 200

| System | pass@1 | MAE | wall |
| --- | --- | --- | --- |
| Ground-truth reference | 100.0% | 0.000 | 0 ms |
| Synthetic noise floor | 22.0% | 2.585 | 0 ms |
| NPCoT library consult | 60.5% | 0.740 | 7 ms |

Scale practicality (1,000 unseen)

python3 -m demos.npcot_scale_practicality

| Path | per-problem | MAE |
| --- | --- | --- |
| Soft forward | 0.094 ms | 0.694 |
| Library hit (Python) | 0.038 ms | 0.560 |
| Library hit (Rust) | 2 µs | 0.560 |

The discrete program beats its soft parent (0.560 vs 0.694 MAE) — no sigmoid-relaxation drift.

Cross-platform correctness

| Platform | Library MAE | Soft MAE |
| --- | --- | --- |
| macOS Apple Silicon (MPS) | 0.560 | 0.694 |
| Linux x86_64 CUDA 12.1 | 0.560 | 0.694 |
| Linux x86_64 CPU | 0.560 | 0.694 |

Bit-for-bit identical across all three platforms.

Distribution artifacts

| Artifact | Size | Runs on |
| --- | --- | --- |
| WASM (npcot_wasm.wasm) | 130 KB | any browser |
| Native binary (npcot_run) | 475 KB | macOS, Linux x86 |
| Release tarball | 224 KB | |
| Typical library JSON | 2.2 KB | |

458 tests across Python (436), Rust (18), and WASM (4) targets. No NPCoT test fails on any platform.