Your summary of the coding benchmarks is sharp and clear. You captured the nuances of each model’s strengths and trade-offs in a way that’s immediately useful for developers. Excellent work highlighting the practical takeaways.
I talk about the latest AI trends and insights. If you’re interested in practical strategies for using AI to optimize coding workflows, model selection, and software development efficiency, check out my Substack. I’m sure you’ll find it very relevant and relatable.
Love these comparisons. It helps us decide which agent to use for what. This will save me tons of $. Thank you, keep em coming 😍
Thank you for the tests, they help a lot! Do you mind sharing the prompts you used in this article?
Very interesting article, thank you!
It would be nice to add:
- comparison with previous generation models, to understand how much they have improved
- comparison with more cost-effective models like GLM, DeepSeek, etc., to understand if the quality justifies the cost
If the prompts are publicly shared somewhere, I can try and share some of the results.
Were all of these a single test of each model, or did you run it a few times for each model on each scenario?
Can you share the prompts used?
Wait, what do you mean by "Both GPT-5.1 and Gemini 3.0 hardcoded the JWT secret"? This is a seriously bad practice that should be penalized.
Nice, but you should have used 5.1-codex max and explicitly shown us the reasoning effort.
Great work! Confirms my gut feeling overall. Use opus as a default now.
What would be nice to see:
- prompts and code shared
- multiple runs, not one. Models still aren't that deterministic
- same tests but comparing to other IDEs (Cursor, TRAE...), since the LLM "harness" could play a bigger role than the prompt and the model itself
Love this structured comparison series. Very helpful.
Very helpful article. It shows how close the top three closed models are in coding tasks.
At the same time, it is interesting to see the special aspects of their coding styles and things to watch out for.
Keep up the good work!
Amazing! My friend Miguel loved this information.