KiloBench - Because Your Benchmark Score Doesn't Pay the Bill

Jun 08, 2026

Last month I was reviewing our model evaluation results for a new release, and I caught myself doing something absurd: comparing two models on a benchmark that neither of them would ever encounter in our actual product. One model scored 3 points higher. It also cost 8x more per task when we actually ran it through Kilo’s pipeline. The “worse” model was the obvious production choice, and no leaderboard on earth would have told me that.

This is the problem KiloBench exists to solve.

The benchmark ceiling

SWE-bench Verified — the benchmark that defined AI coding capability for two years — is approaching saturation. The top six models are separated by 1.3 percentage points. Claude Opus 4.5 leads at 80.9%, with Gemini 3.1 Pro at 80.6% and GPT-5.2 at 80.0%. When OpenAI ran a contamination audit, they found models could reproduce gold patches verbatim just from being given a task ID. Diagnostic subtasks showed 76% accuracy through memorization alone.

OpenAI’s response was to stop reporting SWE-bench Verified scores entirely.

The problem isn’t just contamination. It’s scaffold inflation. The choice of agent orchestration layer — how it decomposes problems, which tools it invokes, how it handles failures — can inflate raw model scores by 12+ points. When vendors publish SWE-bench numbers, they’re publishing a model+scaffold combination. Change the scaffold, change the number.

Scale AI’s SWE-bench Pro tries to fix this with harder, multi-file tasks. The performance gap is dramatic: GPT-5.3 Codex scores ~80% on Verified but drops to 56.8% on Pro. Claude Opus 4.5, which leads Verified at 80.9%, hits 45.9% on the SEAL standardized leaderboard. That 35-point gap for the same model represents the difference between optimized-for-benchmark and genuine capability.

SWE-bench Pro still doesn’t answer the question I actually need answered: which model should I use in my harness, at what cost?

Generic benchmarks answer generic questions

The major benchmarks each test something different:

SWE-bench Verified: Can a model fix Python bugs from GitHub issues? (500 tasks, Python-only, showing signs of saturation)
SWE-bench Pro: Can a model handle harder, multi-file, multi-language tasks? (1,865 tasks, resists gaming better)
Terminal-Bench: Can a model complete real terminal-based workflows — not just patch code, but actually operate in a shell? (89 tasks from Stanford/Laude Institute)
Aider Polyglot: Can a model generate correct code edits? (Model-only evaluation, no agent framework)
PinchBench: Can a model perform real-world OpenClaw agent tasks? (23 tasks spanning scheduling, research, email, code)

Each benchmark reveals something useful. None of them answers the question that engineers making procurement decisions actually face: given my specific agent framework, my task distribution, and my budget, which model delivers the most value?

As MorphLLM noted in March: “No benchmark measures cost per task or time to completion. A model that scores 80% but costs $2 per task may be worse than one scoring 75% at $0.20.” That observation is what led us to build KiloBench.

The sticker price problem

Model pricing in 2026 looks straightforward on paper. DeepSeek V4-Flash charges $0.14 per million input tokens. Claude Opus 4.7 charges $5.00. A 35x difference, easy to model — except that comparison means almost nothing in practice.

Reasoning models create an internal chain of thought before producing visible output. Those thinking tokens are billed at output rates — you never see them, but you pay for them. On a hard coding problem, a model might burn 20,000 reasoning tokens before writing a 400-token response. The visible output suggests a cheap call. The actual bill reflects something very different.

The headline per-token rate understates real cost by 5x to 30x depending on model and task difficulty. Claude Sonnet 4.6’s thinking tokens are billed at $15/MTok — same rate as visible output. On a complex multi-step coding task where it reasons extensively, a single API call can cost more than several non-reasoning calls from a model with a higher sticker price.
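To make the arithmetic concrete, here is a minimal sketch using the figures quoted above: 400 visible tokens, 20,000 reasoning tokens, and a $15/MTok output rate. The function name is hypothetical, and the sketch covers only the output side of a single call. Input tokens are left out, so the multiplier on a whole bill will differ.

```python
# Illustrative only: token counts and the $15/MTok rate are the figures quoted
# in the text; call_output_cost is a hypothetical helper, not a Kilo API.

def call_output_cost(visible_tokens: int, reasoning_tokens: int,
                     output_rate_per_mtok: float) -> float:
    """Dollar cost of one call's output, with reasoning tokens billed as output."""
    billed_tokens = visible_tokens + reasoning_tokens
    return billed_tokens * output_rate_per_mtok / 1_000_000


# What the visible response alone would suggest.
visible_only = call_output_cost(400, 0, 15.00)
# What is actually billed once hidden reasoning tokens are counted.
with_reasoning = call_output_cost(400, 20_000, 15.00)

print(f"visible output only: ${visible_only:.4f}")
print(f"with reasoning:      ${with_reasoning:.4f}")
print(f"output-side multiplier: {with_reasoning / visible_only:.0f}x")
```

The visible response looks like a fraction of a cent. The billed output is about fifty times that, before any input tokens are counted.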

Then add the agent loop problem. AI agents don’t consume tokens like chatbots. Every step in an agent loop sends the entire accumulated context back to the model. By step 20 of a debugging session, you’re paying for the same system prompt and conversation history 20 times. Audits of production teams show that re-sent context accounts for 62% of the total bill, with actual useful reasoning output comprising just 11%.
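A rough sketch of why re-sent context dominates. It assumes a fixed system prompt, a constant number of new tokens added to the history at each step, and no prompt caching or context truncation. All the numbers are hypothetical and are not meant to reproduce the 62% figure, which describes whole bills rather than input tokens alone.

```python
# Hypothetical model of an agent loop: every step re-sends the system prompt
# plus everything accumulated so far. Caching and truncation are ignored.

def agent_loop_input_tokens(system_prompt: int, tokens_added_per_step: int,
                            steps: int) -> int:
    """Total input tokens billed across all steps of the loop."""
    total = 0
    history = 0
    for _ in range(steps):
        total += system_prompt + history   # the whole context goes in again
        history += tokens_added_per_step   # this step's output joins the history
    return total


SYSTEM_PROMPT = 3_000   # assumed size, in tokens
PER_STEP = 1_500        # assumed tokens appended per step
STEPS = 20

total_billed = agent_loop_input_tokens(SYSTEM_PROMPT, PER_STEP, STEPS)
# Tokens that would be billed if each token were only ever sent once:
# the context as it stands on the final step.
sent_once = SYSTEM_PROMPT + PER_STEP * (STEPS - 1)
resent_share = 1 - sent_once / total_billed

print(f"input tokens billed: {total_billed:,}")
print(f"distinct tokens:     {sent_once:,}")
print(f"re-sent share:       {resent_share:.0%}")
```

Billed input grows roughly with the square of the step count, which is why long sessions get expensive faster than their visible output suggests.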

One developer hit $4,200 in API fees over a single weekend during an autonomous refactoring run. No benchmark would have warned them.

The cheapest model for a workload isn’t the one with the lowest per-token rate — it’s the one that gets the job done for the least money.

What KiloBench measures

KiloBench is built on Terminal-Bench’s terminal-dense task data — 89 real-world tasks spanning everything from git operations to cryptanalysis to QEMU automation. But instead of running them through a generic scaffold, KiloBench runs them specifically through the Kilo harness. Multiple trials per model. With cost tracking on every single run.

The key metrics:

Cost per attempt. Not “how much does this model cost per million tokens.” How much does it cost each time you ask it to do a task through Kilo. This captures the full picture: reasoning tokens, re-sent context, tool call overhead, retries, everything.

Cost to complete. Some models pass a task on the first try. Others need three attempts. A model that’s cheap per attempt but needs five tries can end up more expensive than a model that costs more per attempt but passes on the first one. KiloBench tracks both cost per attempt and the total cost to reach a pass.

Harness-specific pass rate. The same model scores differently depending on which agent framework wraps it. We saw this clearly in our MiniMax M2.7 evaluation — the model consumed 2.8M input tokens per trial on average because its exploration-heavy behavior interacted with Kilo’s tool pipeline in a specific way. That’s not a model problem or a harness problem. It’s a combination that you can only measure by running both together.

Behavioral fingerprints. Models don’t just differ in capability. They differ in how they work. M2.7 reads extensively before writing — that thoroughness finds bugs other models miss, but it also means more tokens and potential timeouts on time-sensitive tasks. Kimi K2.6 sustains 1,000+ tool calls across 13-hour sessions. Claude tends toward shorter, more targeted terminal commands. These behavioral patterns matter enormously for cost, and they only emerge under harness-specific testing.
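The first three metrics above can be derived from a plain log of trials. The sketch below is one possible way to do it, not KiloBench’s actual implementation. The `Trial` record, the field names, and the toy data are all assumptions, and "cost to complete" is approximated here as total spend divided by the number of distinct tasks solved.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trial:
    """One run of one model on one task (hypothetical record shape)."""
    model: str
    task: str
    cost_usd: float
    passed: bool


def summarize(trials):
    """Per-model cost per attempt, pass rate, and cost to complete."""
    by_model = defaultdict(list)
    for t in trials:
        by_model[t.model].append(t)

    summary = {}
    for model, runs in by_model.items():
        spend = sum(r.cost_usd for r in runs)
        solved_tasks = {r.task for r in runs if r.passed}
        summary[model] = {
            "cost_per_attempt": spend / len(runs),
            "pass_rate": sum(r.passed for r in runs) / len(runs),
            # Failed attempts still cost money, so they count toward the total.
            "cost_to_complete": (spend / len(solved_tasks)
                                 if solved_tasks else float("inf")),
        }
    return summary


# Toy data: "cheap" needs three tries on the task, "pricey" passes first time.
trials = [
    Trial("cheap", "t1", 0.20, False),
    Trial("cheap", "t1", 0.20, False),
    Trial("cheap", "t1", 0.20, True),
    Trial("pricey", "t1", 0.50, True),
]

summary = summarize(trials)
for model, stats in summary.items():
    print(model, {k: round(v, 2) for k, v in stats.items()})
```

In this toy log the cheaper-per-attempt model ends up costing more to reach a pass, which is the case the cost-to-complete metric exists to expose.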

Why harness-specific matters

We already knew this from our own testing. When we evaluated MiniMax M2.7 across both PinchBench and KiloBench, we found that every model in the comparison solved tasks that no other model could. A hypothetical oracle that picks the best model per task would solve 67% of tasks, a 36% improvement over the best single model.

The models are complementary, not interchangeable — and which combination works best depends entirely on the harness.
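The oracle comparison is a set union: a task counts as solved if any model solved it. A minimal sketch follows, with made-up model names and task IDs. The toy numbers are hypothetical and do not correspond to the 67% result.

```python
# Hypothetical data: which tasks each model solved out of a 10-task suite.

def oracle_pass_rate(solved_by_model: dict[str, set[str]], n_tasks: int):
    """Return (best single-model pass rate, oracle pass rate)."""
    best_single = max(len(tasks) for tasks in solved_by_model.values()) / n_tasks
    # The oracle picks the right model per task, so it solves the union.
    solved_by_any = set().union(*solved_by_model.values())
    return best_single, len(solved_by_any) / n_tasks


solved = {
    "model_a": {"t1", "t2", "t3", "t4", "t5"},
    "model_b": {"t4", "t5", "t6", "t7"},
    "model_c": {"t1", "t8"},
}

best_single, oracle = oracle_pass_rate(solved, n_tasks=10)
print(f"best single model: {best_single:.0%}")
print(f"oracle:            {oracle:.0%}")
print(f"relative gain:     {oracle / best_single - 1:.0%}")
```

The gain comes entirely from tasks that only one model clears, so it shrinks to zero if the models all solve the same subset.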

SWE-bench Pro acknowledged this problem: the gap between their SEAL standardized scaffold (45.9% for Opus 4.5) and the best custom agent system (57.0% for GPT-5.3 Codex) is 11 points. Scaffolding adds 11 points of real capability — but only if it’s the right scaffold for the right model on the right tasks. Generic benchmarks can’t tell you that. They don’t run your harness.

The premise

As benchmarks converge, they become less useful for practical model selection. The information content of “this model scores 80.9% and that one scores 80.0%” is approximately zero for someone choosing a production model.

What still has information content:

This model costs $0.47 per task in Kilo and passes 52% of terminal-dense tasks.
That model costs $1.83 per task and passes 55%.
A third model costs $0.22 per task, passes 44%, but clears a different subset of tasks — including three that neither of the first two can solve.

Those numbers are actionable in a way that pass rate alone isn’t.
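One way to act on figures like these is to divide cost by pass rate, which gives average spend per solved task across the suite. The sketch below uses the three illustrative price and pass-rate pairs from above. Reading "cost per task" as cost per attempted task is an assumption, and the labels are placeholders.

```python
# Figures are the illustrative ones from the text; labels are placeholders.
models = {
    "first":  {"cost_per_task": 0.47, "pass_rate": 0.52},
    "second": {"cost_per_task": 1.83, "pass_rate": 0.55},
    "third":  {"cost_per_task": 0.22, "pass_rate": 0.44},
}


def cost_per_solved_task(cost_per_task: float, pass_rate: float) -> float:
    """Average dollars spent for each task that actually passes."""
    return cost_per_task / pass_rate


def marginal_cost_per_extra_solve(cheaper: dict, pricier: dict) -> float:
    """Extra dollars paid for each additional task the pricier model solves."""
    extra_cost = pricier["cost_per_task"] - cheaper["cost_per_task"]
    extra_solves = pricier["pass_rate"] - cheaper["pass_rate"]
    return extra_cost / extra_solves


for name, m in models.items():
    print(f"{name}: ${cost_per_solved_task(**m):.2f} per solved task")

upgrade = marginal_cost_per_extra_solve(models["first"], models["second"])
print(f"first -> second: ${upgrade:.2f} per additional solved task")
```

On these numbers the third model is cheapest per solved task, and the three extra points from the second model cost roughly $45 per additional solve. Whether that is worth paying depends on what an unsolved task costs you.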

KiloBench is our commitment to producing those numbers. Generic benchmarks served their purpose when models were separated by 20 points. Now that they’re separated by fractions of a point, the useful signal has shifted to cost, behavior, and harness fit — which is exactly what we’re measuring.

We’ll be publishing initial results at kilo.ai/leaderboard. If you’re running Kilo and wondering why your monthly bill doesn’t match the pricing page — you’re asking the right question, and we’re building the tool to answer it.

Ken Lyle

Jun 8

Nemotron doesn't seem to get proper appreciation on the leaderboard. It's not on the Kilo Bench Cost vs performance graph. Usage seems low because I get the impression there's a capacity limit of some kind...but it should get more love. I think a lot of people will look at that graph and make a selection.

Kilo Blog
