Kimi K2.6 launched on April 20, 2026, four days after Anthropic released Claude Opus 4.7. We gave both models the same spec for FlowGraph, a persistent workflow orchestration API with DAG validation, atomic worker claims, lease expiry recovery, pause/resume/cancel, and SSE event streaming. Then we reviewed the code and reproduced the edge cases the models’ own tests did not cover.
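The spec is not reproduced here, so as a rough illustration of what "atomic worker claims" and "lease expiry recovery" mean in a system like this, here is a minimal Python sketch. Everything in it is a hypothetical stand-in (the `ClaimStore` class, the `claim` method, the in-memory dict, the explicit `now` argument). It is not FlowGraph's actual API and not code produced by either model.

```python
import threading
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Task:
    task_id: str
    owner: Optional[str] = None      # worker currently holding the lease
    lease_expires_at: float = 0.0    # time after which the claim is void


class ClaimStore:
    """Hypothetical in-memory store; a real service would use a database."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._tasks: Dict[str, Task] = {}

    def add(self, task_id: str) -> None:
        with self._lock:
            self._tasks[task_id] = Task(task_id)

    def claim(self, task_id: str, worker: str, now: float, lease_s: float) -> bool:
        """Claim a task if it is unowned or its previous lease has expired."""
        with self._lock:  # the check and the write happen under one lock
            task = self._tasks[task_id]
            if task.owner is not None and now < task.lease_expires_at:
                return False  # another worker still holds a live lease
            task.owner = worker
            task.lease_expires_at = now + lease_s
            return True


store = ClaimStore()
store.add("t1")
first = store.claim("t1", "worker-a", now=0.0, lease_s=30.0)    # task is unowned
second = store.claim("t1", "worker-b", now=10.0, lease_s=30.0)  # lease still live
third = store.claim("t1", "worker-b", now=31.0, lease_s=30.0)   # lease has expired
```

Passing `now` in explicitly keeps the sketch deterministic. The edge cases worth reproducing in a real implementation are the ones around that single comparison: two workers claiming at the same instant, and a worker that finishes after its lease has already been handed to someone else.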
Great writeup. Any chance you could share the SPEC.md? Would love to reproduce on a few other models.
I use Kimi 2.5 on Nvidia's Developer program - which means FREE API access.
I assume 2.6 will be available there at some point.
Something to keep in mind.
Like Mushegh said, if you could share your test environment along with your SPEC.md, that would be very helpful for setting up the tests ourselves and understanding your benchmarks better. And for contributing to this great article! What really annoys me is the happy hallucination of the "stupid" models. They always claim to be finished, with a "Yeah! Tested everything - looks great!" attitude, but if you look closely, they messed up. I can hardly find a project where state-machine correctness does not matter. So I keep on with Opus thinking high... and pay a lot... A test with Opus 4.6 and the new Kimi, or Opus 4.5, would have been interesting too, to see how these models evolve and how far the Chinese models are behind the leaderboard in "months" of release. I personally don't trust these benchmarks. What does it help me to see how they scored on Humanity's Last Exam if they fail big time on my easy coding tasks :-D
A friend set up a similar environment for testing - you can check it out at https://GitHub.com/jannismain/ccbench
So thank you very much for this test!
vs GLM 5.1 and MiniMax M2.7 please
It's nice to see the open-weight models catching up with the top-tier proprietary ones so quickly.
So, as a concrete example, I have a non-critical task: review my docs against my app, fill the doc gaps, and build the links in the app so that each screen and important field links to the right element or query in my MCP documentation system. I am thinking of using Kimi (or Elephant) over Claude to get a free or nearly free scaffold, without paying Claude to compact the conversation 7 times?