6 Comments
Mushegh Gevorgyan

Great writeup. Any chance you could share the SPEC.md? Would love to reproduce on a few other models.

richardstevenhack

I use Kimi 2.5 on Nvidia's Developer program - which means FREE API access.

I assume 2.6 will be available there at some point.

Something to keep in mind.

Manuel Gollner

As Mushegh said, if you could share your test environment along with your SPEC.md, that would be very helpful for setting up tests ourselves and understanding your benchmarks better, and for contributing to this great article! What really annoys me is the happy hallucination of the "stupid" models. They always claim to be finished, with a "Yeah! Tested everything - looks great!" attitude, but if you look closely, they messed up. I can hardly find a project where state-machine correctness does not matter. So I keep on with Opus with thinking on high... and pay a lot... A test with Opus 4.6 and the new Kimi, or Opus 4.5, would have been interesting too, to see how these models evolve and how far the Chinese models are behind the leaderboard in "months" of release. I personally don't trust these benchmarks. What good does it do me to see how they scored on Humanity's Last Exam if they fail big time on my easy coding tasks :-D

A friend set up a similar environment for testing - you can check it out at https://GitHub.com/jannismain/ccbench

So thank you very much for this test!

KrisFromFuture

vs GLM 5.1 and MiniMax M2.7 please

Zen Equity

It's nice to see the open-weight models catching up with top-tier proprietary models so quickly

Ken Lyle

So, as a concrete example, I have a non-critical task: review my docs against my app, fill the doc gaps, and build the links in the app so that each screen and important field links to the right elements or queries in my MCP documentation system. I am thinking of using Kimi (or Elephant) over Claude to get a free or nearly free scaffold without paying Claude to compact the conversation 7 times?