We Asked Grok Build 0.1 to Recover a Secret Hidden in Git History

Jun 10, 2026

We wanted to test Grok Build 0.1 on a real agentic coding task from one of the world’s most popular coding benchmarks (Terminal-Bench).

The background: Kilo Bench is the internal benchmark we use to evaluate new models. It’s a fork of Terminal-Bench 2.0, integrated with the Kilo CLI.

To test Grok Build 0.1, we took one of the most popular Terminal-Bench tasks, git-leak-recovery, and asked Grok to complete it.

The task

Imagine this: a secret got committed to a Git repository by mistake. Someone noticed and tried to make it disappear by rewinding the branch to the commit before it. The current files and history look clean at first glance. But the secret, of course, is still in there, sitting in a part of Git that does not show up in the normal view.

We wanted to see whether Grok could handle this: go into a messy repository, find what was hidden, fix it carefully, and not break anything else along the way.

The model had three jobs:

Recover the secret and write it to /app/secret.txt.
Clean up the repository so the secret could not be found anywhere in it.
Leave unrelated files and commit messages untouched.

In plain English: find the secret the repo tried to forget, save it, then actually erase the remaining traces.

That is harder than it sounds. The task ships with time estimates for a human doing it by hand: about 30 minutes for an expert, an hour for a junior engineer. Grok finished the task in 27 steps, roughly 41 seconds of agent work, for $0.09.

The repo looked clean at first glance

At first glance, there was almost nothing in the repository.

The repo lived in /app/repo. It held a .git directory and a README.md.

The visible commit history looked normal too. There were two commits, both with ordinary “init” messages. The current README said:

demo project
some changes

Nothing about the current, visible files hinted at a secret. To a person glancing at the repo, nothing looked unusual.

But, as you know, Git isn’t just a simple folder of files. It keeps history, and that history can hold far more data than the current files show. Would this AI model be able to figure that out on its own?

Git remembers more than you think

If you are new to Git, think of it as the tool developers use to track changes in code over time.

Every time a developer saves a meaningful checkpoint, Git creates a “commit.” A commit is a snapshot of the project at that moment.

That history is useful. It lets teams see what changed, recover old work, compare versions, and undo mistakes.

However, that same history can work against you.

Say a developer accidentally commits a file with a password in it. They notice, delete the file, and commit again. The latest version of the project looks clean, but the earlier commit is still sitting in the history, and the password is still in it. Anyone with access to the repository can find it with git log -p, which prints the contents of every change. To really get rid of it, you have to rewrite history and remove that old commit.

But even after someone does that, Git often keeps local copies around for a while: old commits, deleted objects, and reflog entries that no longer show up in the normal history but are still recoverable.

That is why in many cases “I deleted the file” is not the same as “the secret is gone.”

The task tested exactly that distinction. The secret was gone from the visible repo, and Grok had to figure out whether Git still remembered it.
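To make the “deleted is not gone” point concrete, here is a minimal sketch of the delete-and-recommit scenario described above, run in a throwaway repository. The file name config.txt, the password value, and the commit messages are all made up for illustration.

```shell
#!/bin/sh
set -eu

# Throwaway repository; every name and value below is hypothetical.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

# Commit a file containing a (fake) password by mistake.
echo "password=example-only" > config.txt
git add config.txt
git commit -q -m "add config"

# "Fix" it by deleting the file and committing again.
git rm -q config.txt
git commit -q -m "remove config"

# The working tree no longer contains the file...
if [ -e config.txt ]; then in_worktree=yes; else in_worktree=no; fi

# ...but the patch that added the password is still in the history.
in_history="$(git log -p | grep -c '^+password=example-only' || true)"

echo "file in working tree: $in_worktree"
echo "matching added lines in history: $in_history"
```

The working tree comes back clean while git log -p still prints the line that added the password, which is the gap the task is built around.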

Grok started with the obvious checks

Grok did not jump straight into cleaning.

It listed the contents with ls -la, checked the commit history with git log, and pulled the full patches with git log -p.

That is what a developer would do first: understand the current state before touching anything.

The visible history showed only README changes. Nothing suspicious so far.

A simpler tool might have stopped there. A basic text search would have come up empty too, because the secret was not in the current files.

But the task said the secret had been committed and then hidden by rewinding the branch. That clue matters, so Grok looked deeper.

The hidden commit

Grok ran git fsck --unreachable to look for objects no branch points to anymore.

An unreachable object is a piece of Git data that is no longer connected to the normal history.

Picture Git history as a visible timeline. Most commits sit on that timeline, and you see them when you run the normal history commands. But Git holds onto old pieces that are not on the timeline anymore. Maybe a commit got reset away, or maybe a branch was deleted. Those pieces are “unreachable”: they do not show up in the usual history, but they are still sitting inside the repo.

That is where Grok found the clue.

Git reported an unreachable commit, so Grok inspected it with git show.

The hidden commit had the message:

feat: add scratch notes

And it added a file:

secret.txt

That was the first breakthrough. The current repo did not show a secret.txt. The visible history did not show it either. But an old unreachable commit still had it.
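The search can be reproduced in a scratch repository. The sketch below rebuilds the same shape of history (commit, hard reset, new commit) with a made-up secret value, then looks for the orphaned commit. One assumption worth flagging: by default git fsck treats commits that a reflog still references as reachable, so in a freshly built repo like this one the sketch passes --no-reflogs. The article does not say whether that flag was part of the original run.

```shell
#!/bin/sh
set -eu

# Scratch repo mirroring the scenario; the secret value is hypothetical.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

echo "demo project" > README.md
git add README.md
git commit -q -m "init"

echo "secret[example-value]" > secret.txt
git add secret.txt
git commit -q -m "feat: add scratch notes"

# Rewind the branch one commit, then commit again on top.
git reset -q --hard HEAD~1
echo "some changes" >> README.md
git commit -q -am "init"

# List objects no ref points to (ignoring reflog entries) and keep commits.
hidden="$(git fsck --unreachable --no-reflogs 2>/dev/null |
  awk '$1 == "unreachable" && $2 == "commit" { print $3 }')"

# Inspect the orphaned commit: its message, its files, and the file content.
subject="$(git log -1 --format=%s "$hidden")"
files="$(git diff-tree --no-commit-id --name-only -r "$hidden")"
leaked="$(git show "$hidden:secret.txt")"

echo "hidden commit: $hidden"
echo "message: $subject"
echo "files: $files"
```

The awk filter keeps only commit objects; the orphaned tree and blob that belong to the same commit are reported by git fsck as well.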

The secret

Grok opened the secret.txt from that hidden commit.

Inside was the leaked value. It matched the format the task described, a short tagged string, and it was the only thing in the repo shaped that way.

Now Grok had done the first hard part. It found a secret that was not in any visible file.

But the overall task was not done. Grok had to copy the secret into /app/secret.txt, then clean the repo so the same secret could not be found anywhere.

This order matters: if the agent cleaned the Git objects first, it might delete the only remaining copy before saving it.

Grok wrote the value to /app/secret.txt, then read the file back to confirm.

Only after that did it start cleaning.

Cleaning the repo for real

This is where the task moves from “find the secret” to “make sure it is actually gone.”

Grok checked the reflog first with git reflog. A reflog is Git’s local memory of where your branch used to point. Think of it like a safety net; if you reset a commit away by accident, the reflog helps you get it back.

That is great when the lost commit holds the work you want. It is a problem when the lost commit holds a secret you need to destroy.

The reflog told Grok the whole story. It had an initial commit, then a commit that added scratch notes. Then a hard reset that rewound the branch one commit, orphaning the notes. Then another normal commit on top. The secret had been committed, then dropped out of the visible history when the branch was reset. That leftover local metadata is what keeps an orphaned commit recoverable, which is why it had to go.

So Grok expired the reflog with git reflog expire --expire=now --all. Then, instead of assuming that worked, it ran the unreachable-object check again with git fsck --unreachable.

The three objects were still there.

Expiring the reflog had not been enough on its own. A less careful agent might have stopped at the reflog step, declared victory, and left the secret sitting in the repo, recoverable by anyone who knew where to look. Grok caught it because it verified in the middle of the job instead of only at the end.

So it ran Git garbage collection with pruning, git gc --prune=now. In plain terms, it told Git to throw out the discarded objects for good.

Then it checked again with git fsck. No unreachable objects were left. The secret string was gone from the .git directory. A broader search for the secret[...] pattern across the repo returned nothing.

The secret had been recovered to the output file and removed from the repository.
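Put together, the save-then-clean sequence looks roughly like this. It is a sketch against a scratch repository with a made-up secret, not a transcript of Grok’s run: the output path is a temporary file standing in for /app/secret.txt, and the commit hash is captured at setup time rather than discovered.

```shell
#!/bin/sh
set -eu

# Scratch repo: commit a hypothetical secret, reset it away, commit again.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"
echo "demo project" > README.md
git add README.md
git commit -q -m "init"
echo "secret[example-value]" > secret.txt
git add secret.txt
git commit -q -m "feat: add scratch notes"
hidden="$(git rev-parse HEAD)"
git reset -q --hard HEAD~1
echo "some changes" >> README.md
git commit -q -am "init"

# Count the objects that git fsck reports as unreachable.
count_unreachable() {
  git fsck --unreachable 2>/dev/null | grep -c '^unreachable' || true
}

# 1. Save the secret before deleting anything, then read it back.
out="$(mktemp)"
git show "$hidden:secret.txt" > "$out"
saved="$(cat "$out")"

# 2. Drop the reflog entries that still reference the orphaned commit.
git reflog expire --expire=now --all
after_expire="$(count_unreachable)"

# 3. The objects are still on disk, so prune them for good.
git gc -q --prune=now
after_gc="$(count_unreachable)"

# 4. Confirm the commit is gone and the visible history is intact.
if git cat-file -e "$hidden" 2>/dev/null; then gone=no; else gone=yes; fi
commits="$(git rev-list --count HEAD)"

echo "saved=$saved after_expire=$after_expire after_gc=$after_gc gone=$gone commits=$commits"
```

In this scratch repo the unreachable count should stay at three after the reflog expiry (commit, tree, and blob) and drop to zero only after the prune, which mirrors the mid-job check described above.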

It preserved the project

The last requirement was easy to overlook: leave irrelevant files and commit messages untouched.

This matters because a careless cleanup can remove the secret while wrecking the repo.

An agent could delete the whole .git directory. That removes the secret, and the entire history with it. It could rewrite commit messages it did not need to touch. It could edit the README even though the README had nothing to do with the task.

Grok did none of that (fortunately). After cleanup, the commit history still showed the expected commits, the README was unchanged, and a full integrity check came back clean.

The benchmark takes a checksum of every real file in the repo and compares it against a known-good value. Change a byte in the README, drop a commit you should have kept, and the checksum no longer matches.
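The article does not describe how the benchmark computes its checksum, so the following is only one plausible way to fingerprint the working files: hash each file outside .git with sha256sum, then hash the sorted list. The tree_checksum helper and the sample files are hypothetical, and the sketch covers file contents only, not commits or commit messages.

```shell
#!/bin/sh
set -eu

# Fingerprint every regular file under a directory, skipping .git.
# Hypothetical helper; assumes sha256sum (GNU coreutils) is available
# and that file names contain no whitespace.
tree_checksum() {
  (
    cd "$1"
    find . -path ./.git -prune -o -type f -print |
      LC_ALL=C sort |
      xargs sha256sum |
      sha256sum |
      cut -d' ' -f1
  )
}

# Sample working directory with a stand-in .git folder.
work="$(mktemp -d)"
printf 'demo project\nsome changes\n' > "$work/README.md"
mkdir "$work/.git"

expected="$(tree_checksum "$work")"

# Changes inside .git (such as pruning objects) do not move the checksum.
echo "internal data" > "$work/.git/scratch"
after_cleanup="$(tree_checksum "$work")"

# A single extra byte in the README does.
printf 'x' >> "$work/README.md"
after_edit="$(tree_checksum "$work")"

echo "expected:      $expected"
echo "after cleanup: $after_cleanup"
echo "after edit:    $after_edit"
```

Excluding .git is a deliberate choice in this sketch: garbage collection rewrites files under .git, so a fingerprint that included them would change even after a correct cleanup.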

That’s the outcome we want here: the secret recovered, the repo cleaned, and no modifications to the working directory.

Similar to a real-world task

The repo in this benchmark was tiny on purpose: that removes noise and makes the task easy to describe. The underlying problem is one many software developers run into regularly.

Secrets leak into Git repositories all the time. Say a developer commits an API token by mistake, then another developer removes it from the current files and assumes the problem is solved. The problem is that anyone with the repository history can still dig out the old value.

Secret cleanup is not just about editing text; it also means inspecting the Git history. You have to know where Git stores old information and what the visible history does and does not show. You also have to know that deleted files can still be recovered, and that reflogs and unreachable objects keep old commits alive locally.

Why this is a good test for a coding model

For a few reasons: there is a clear goal, a hidden target, an expected output file, real constraints, and a way to check whether the agent actually pulled it off.

That makes it different from most “flashy” demos, where you ask a model to build a website and the result you get is evaluated (mostly) subjectively.

I like this experiment because the task is concrete and the evaluation is straightforward: Did the agent recover the secret? Is the repo actually clean? Did the unrelated files and commit messages come through untouched? Each answer boils down to a clear yes or no.

What Grok did well

Grok figured out the right commands for the task, and also the overall sequence to arrive at the solution.

The thing I also liked is that it followed the “shape” of the problem:

Inspect the visible repo
Confirm the obvious files did not hold the answer
Look into the history
Find the unreachable objects
Inspect the hidden commit
Recover the secret
Save it
Clear the reflogs
Check and notice the objects were still there
Prune them
Check again
Confirm the visible project was intact

That is what an agentic workflow is usually about: using the tools, reading the environment, making decisions, and checking its own work along the way.

Limitations

I feel I need to add a disclaimer before ending the article.

None of this means AI agents can fix every leaked-secret incident.

In production, if a secret is committed, you should assume it may already be exposed. The usual response is to revoke the key and rotate it. Treat cleaning the repo as just one small step on your overall to-do list.

This benchmark also uses an artificial secret in a controlled environment. Real repos are bigger and messier, with remote copies, forks, CI logs, package caches, and other places a secret can hide.

Kilo Blog
