SOTA Models Should Learn to KISS (Fable 5 Experiment)

Jun 11, 2026

One thing I keep noticing with the latest coding models is that their mistakes are getting more… interesting.

With older AI models, it was pretty easy to spot a mistake. The model hallucinated an API, misunderstood the framework, wrote code that did not compile, or confidently worked on the wrong part of the app. You could reject its suggestions quickly because the problem was obvious.

With newer models like Claude Fable 5, the failure modes are more subtle.

The setup

I kept seeing overhyped claims about how Fable 5 can one-shot 3D games, cool graphics, animations, etc. A lot of people use Claude models for coding though, so I decided to benchmark it by giving it, well, a coding task.

To get things started, I created a basic Next.js app where you could list and add products.

The app had 2 main routes: /products and /admin. It also had an internal API route at /api/products (I know this isn’t necessary for production apps most of the time, but I wanted to make things more complicated for the model).

The /admin page contained a form for adding products. To make things more complex (for the model, that is), I made the form post to the /api/products route handler, which added the product to an in-memory data store. That change then needed to show up whenever you returned to /products expecting to see the newly added item. The read side of the bug looked like this:

const res = await fetch(`${baseUrl}/api/products?${params}`, {
  next: { revalidate: 60 },
});

Just to clarify, the bug was not the cached fetch itself (nothing’s wrong with this code in isolation). The bug was that /products cached its read for 60 seconds while the admin write path never invalidated that cached result. This broke the flow: if you added a product under /admin and immediately went back to /products, there was no guarantee you’d see the new listing.
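To make the timing concrete, here is a minimal sketch of a time-based read cache with no invalidation on write. It is a toy model only: TtlCache, loadProducts, and the sample products are hypothetical names, and this is not how Next.js implements its data cache.

```typescript
// Toy model of a time-based (TTL) read cache. All names are hypothetical;
// this does not reproduce Next.js internals.
type Product = { id: number; name: string };

class TtlCache<T> {
  ttlMs: number;
  entry: { value: T; storedAt: number } | null = null;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Return the cached value while it is fresh; otherwise reload and store it.
  read(now: number, load: () => T): T {
    if (this.entry !== null && now - this.entry.storedAt < this.ttlMs) {
      return this.entry.value;
    }
    const value = load();
    this.entry = { value, storedAt: now };
    return value;
  }
}

const store: Product[] = [{ id: 1, name: "Keyboard" }];
const listCache = new TtlCache<Product[]>(60000); // 60-second window
const loadProducts = (): Product[] => store.slice(); // snapshot of the store

const beforeWrite = listCache.read(0, loadProducts); // t = 0s: cache miss
store.push({ id: 2, name: "Mouse" }); // admin write, nothing invalidated
const rightAfterWrite = listCache.read(5000, loadProducts); // t = 5s: stale hit
const afterExpiry = listCache.read(61000, loadProducts); // t = 61s: reloaded

console.log(beforeWrite.length, rightAfterWrite.length, afterExpiry.length); // prints: 1 1 2
```

The read at t = 5s still returns one product even though the store already holds two, which is the same symptom the /products page showed.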

I liked testing how Fable 5 would handle this “bug” because I knew it would almost certainly find it and solve it. The more interesting part was the approach it would take and how it would decide what to do. To solve this problem, you need to connect several pieces together and pick an approach that doesn’t drastically alter the rendering process.

By the way, here’s a list of the architectural choices, each of which IMO made sense in isolation:

A server-rendered catalog page
URL-based filters for the listing
A 60-second cache on product data (arguably a reasonable performance choice)
A router.refresh() call after mutating

The bug lived in the gap between those decisions. If someone visited /products, then added a product from /admin, then returned to /products before the 60-second window expired, the page could still show the old product list.

Here’s the prompt I gave Fable 5:

I have a Next.js App and the problem is that when I add a product from /admin, then go back to /products, the new item sometimes doesn’t appear immediately.

What’s the issue and can you fix this?

What Fable 5 did

Fable 5 did fix the bug pretty quickly.

It read the files and recognized that the core issue was cache invalidation after product creation. It did not do obviously wrong things like modifying the filters, refactoring to a client-side component, or disabling caching entirely. It understood the important part: the product was being saved, but /products could still be showing a cached product list from before the write.

So far, so good. But also expected for a SOTA model that was hyped to death.

Now, the fix itself. This is what we got from its thinking process:

The core issue is that I need to invalidate the cache after creating a product. The cleanest approach is to tag the fetch with next: { tags: ["products"] } and then call revalidateTag("products") in the POST handler.

The fix looked like this:

// app/products/page.tsx
next: { revalidate: 60, tags: ["products"] }

// app/api/products/route.ts
import { revalidateTag } from "next/cache";

const product = addProduct(...);
revalidateTag("products");
return NextResponse.json({ product }, { status: 201 });

This kept the read requests cached for normal browsing, invalidated product data after writes, and preserved the server-rendered products page. Overall, the solution wasn’t bad, and it was what you’d expect from a frontier model.

The overengineering was subtle

For this specific app, I would have expected the simplest fix to be:

revalidatePath("/products")

After the POST succeeds, invalidate the route that shows the product list…and that’s pretty much it.
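As a rough illustration of why one path-level invalidation is enough here, the sketch below models a page cache keyed by route path. Everything in it (pageCache, renderProducts, invalidatePath, createProduct) is a hypothetical stand-in for the idea, not the Next.js API or its internals.

```typescript
// Toy model of path-based invalidation. All names are hypothetical;
// invalidatePath is a stand-in for the idea, not the Next.js function.
type Product = { id: number; name: string };

const store: Product[] = [{ id: 1, name: "Keyboard" }];
const pageCache = new Map<string, Product[]>();

// Serve the cached list for a path, or snapshot the store on a miss.
function renderProducts(path: string): Product[] {
  const cached = pageCache.get(path);
  if (cached !== undefined) return cached;
  const fresh = store.slice();
  pageCache.set(path, fresh);
  return fresh;
}

// Drop the cached result for exactly one route.
function invalidatePath(path: string): void {
  pageCache.delete(path);
}

// Write path: save the product, then invalidate the one route that lists it.
function createProduct(name: string): Product {
  const product: Product = { id: store.length + 1, name };
  store.push(product);
  invalidatePath("/products");
  return product;
}

renderProducts("/products"); // warms the cache with one product
createProduct("Mouse");
const names = renderProducts("/products").map((p) => p.name).join(", ");
console.log(names); // prints: Keyboard, Mouse
```

With a single route reading the data, the write path only has to know about that one route.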

A tag-based solution is more scalable, but it is also over-engineered for what this specific app really needs. It is what you reach for when product data appears in many places: a homepage, category pages, related-product sections, sitemaps, admin views, maybe multiple layouts. In that world, invalidating the concept of products is cleaner than remembering every path.

But this was a very simple app that anyone with a few months of Next.js experience could write. In this situation, tags are not wrong per se; they just introduce a bit more architecture than the problem needs.
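For contrast, the multi-route case where tags do pay off can be sketched as a toy model too. The routes, renderCache, cachedRender, and invalidateTag below are all assumptions made for illustration; none of this reproduces how Next.js tracks tags.

```typescript
// Toy model of tag-based invalidation. All names are hypothetical stand-ins;
// this does not reproduce Next.js internals.
type CacheEntry = { html: string; tags: string[] };

const renderCache = new Map<string, CacheEntry>();

// Cache a rendered page under its path and remember which data tags it used.
function cachedRender(path: string, tags: string[], render: () => string): string {
  const hit = renderCache.get(path);
  if (hit !== undefined) return hit.html;
  const html = render();
  renderCache.set(path, { html, tags });
  return html;
}

// Drop every cached page that depends on the tag; returns how many were dropped.
function invalidateTag(tag: string): number {
  let dropped = 0;
  renderCache.forEach((entry, path) => {
    if (entry.tags.indexOf(tag) !== -1) {
      renderCache.delete(path);
      dropped += 1;
    }
  });
  return dropped;
}

// Three hypothetical routes read product data; one does not.
cachedRender("/", ["products"], () => "home");
cachedRender("/products", ["products"], () => "listing");
cachedRender("/category/audio", ["products"], () => "category");
cachedRender("/about", [], () => "about");

const droppedPages = invalidateTag("products");
console.log(droppedPages, renderCache.has("/about")); // prints: 3 true
```

The write path names the data once instead of listing three paths. With a single /products route, that indirection buys nothing.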

I’ve seen people make a similar point across social media: SOTA models are overengineering. The model chooses an architectural pattern, but does not quite justify why this app needs that pattern yet.

To me, the best answer would have said:

For this small app, use revalidatePath("/products").

If product data is reused across multiple routes, use tags and revalidateTag("products").

Important update: To be clear, the model’s fix was valid. My criticism is not that it’s wrong, but that it reached for a scalable abstraction before considering the smallest fix that matched the requirements.

A simple example, but a recurring pattern

I know that this is a pretty simple example for a senior software engineer. Concluding that SOTA models like Fable 5 are overengineering based on this example alone would be…well, stupid, to put it plainly.

However, I’m not alone in this observation. Here are a few more people who noticed the same pattern.

Anthropic also said that “Claude Opus 4.5 and Claude Opus 4.6 have a tendency to overengineer by creating extra files, adding unnecessary abstractions, or building in flexibility that wasn’t requested.” They recommend adjusting that with prompting, but I wonder if prompting is enough.

It’s also interesting how most model evals follow a specific pattern:

Task -> Evaluate a task -> Compare it against tests

As models get better, those benchmark tasks get more complicated. I wonder if new models are optimizing for this “loop” without being aware of it. More complex tasks call for more complex solutions, and complicated problems usually do need patterns. As a result, simpler apps suffer because the model reaches for patterns when there’s no need for them in the first place.

What do you think?

Andrew Simard

Jun 11

As an aging solo developer, I never had much interest in testing or testing frameworks, that sort of thing, as I worked on my own stuff. With AI though, I'm all-in on testing everything, even in simpler apps, as it provides a natural mechanism to help put blinders on models so they get to their destination without taking a wrecking ball through a project. So this evolution to Task > Evaluate > Test is I think exactly what we want models to be doing? And if we compartmentalize our code enough, then we'll always seemingly be giving them "simple" tasks to implement when in reality these are likely to often be smaller pieces of a larger project where that extra bit of effort is not wasted. If anything, I see these models doing more as a way to combat complacency in our own coding efforts. `I *could* do that but I'm in a rush and it's not MVP so let's skip it` kind of mindset versus `what would I do in this situation if I had enough talent and resources`. I'm liking the latter approach myself.

Η Προώθηση Της Γνώσης

Jun 12

Usually a better workflow is to ask the agent to create a plan and wait for review. That way you can read what the agent intends to implement before it starts implementing. You would have seen the over-engineering intention early on.

For fixing bugs: it can happen that we simply ask an agent to fix a bug. Usually, if the bug is complex, we first ask the agent to investigate, find the root cause, or debug. Modifying the request to:

'Find the root cause for [X], show me the problem, suggest a fix, and wait for review'

may have handled your complaint? The idea is to treat the agent as a co-worker with whom we have to communicate, but I guess we all tend to forget occasionally.

Kilo Blog
