12 Comments
Rinmi khamrang

Love these comparisons. It helps us decide which agent to use for what. This will save me tons of $. Thank you, keep em coming 😍

Ahmet Sezen

Thank you for the tests, they help a lot! Do you mind sharing the prompts you used in this article?

Gabriele Tomberli

Very interesting article, thank you!

It would be nice to add:

- comparison with previous generation models, to understand how much they have improved

- comparison with more cost-effective models like GLM, DeepSeek, etc., to understand if the quality justifies the cost

If the prompts are publicly shared somewhere, I can try and share some of the results.

Neal Tibrewala

Were all of these a single test of each model, or did you run it a few times for each model on each scenario?

Gwenaël Nardin

Can you share the prompts used?

Seg

Wait, what do you mean by "Both GPT-5.1 and Gemini 3.0 hardcoded the JWT secret"? This is seriously bad practice that should be penalized.

Marina Spricigo Azevedo

Nice, but you should have used 5.1-codex max and explicitly shown us the reasoning effort.

H1D

Great work! Confirms my gut feeling overall. Use opus as a default now.

What would be nice to see:

- prompts and code shared

- multiple runs, not one. Models still aren't that deterministic

- same tests but comparing to other IDEs (Cursor, TRAE...) since the LLM "harness" could play a bigger role than the prompt and the model itself

Suhrab Khan

Your summary of the coding benchmarks is sharp and clear. You captured the nuances of each model’s strengths and trade-offs in a way that’s immediately useful for developers. Excellent work highlighting the practical takeaways.

I talk about the latest AI trends and insights. If you’re interested in practical strategies for using AI to optimize coding workflows, model selection, and software development efficiency, check out my Substack. I’m sure you’ll find it very relevant and relatable.

SOL

Love this structured comparison series. Very helpful.

Dr. Daniel Bender

Very helpful article. It shows how close the top three closed models are in coding tasks.

At the same time, it is interesting to see the special aspects of their coding styles and things to watch out for.

Keep up the good work!

Dev Zu

Amazing! My friend Miguel loved this information.
