We Asked Grok Build 0.1 to Plan and Build a Webhook Service
xAI released Grok Build 0.1 as a new coding model priced at $1 per 1M input tokens and $2 per 1M output tokens with a 256K context window. We ran it through Kilo Code (the VS Code extension) on a real backend task to see how it handles planning, implementation, and the long tool-calling sequence in between.
TL;DR: Grok Build 0.1 planned and built a working webhook delivery service in TypeScript, Bun, and SQLite for $1.65 total ($0.17 to plan, $1.48 to implement) at roughly 120 tokens per second, with zero tool-calling failures across the run.
Pricing and Context
For cost context, Grok Build 0.1 sits at $1/$2 per 1M input/output. GPT-5.5 is $5/$30 and Claude Opus 4.7 is $5/$25.
The Setup
We tested Grok Build 0.1 across two phases inside Kilo Code: Plan Mode first, then Code Mode. We ran each phase in its own Kilo Code chat so the plan and the implementation each started with a clean context. Same model in both, same project directory.
The Task
We asked Grok Build 0.1 to build a webhook delivery service. Webhook delivery is the kind of backend nobody enjoys writing but every product needs. Events come in over HTTP, the service has to deliver them to subscriber URLs registered against event types, retry failed deliveries with backoff, sign outgoing requests so subscribers can verify they’re real, and isolate broken subscribers so they don’t take the queue down with them.
Here is the exact prompt we used:
“Build a webhook delivery service in TypeScript using Bun and SQLite. The service should accept webhook events via an HTTP API, deliver them to subscriber URLs registered against event types, retry failed deliveries with backoff, sign outgoing requests so subscribers can verify they came from us, and isolate deliveries that keep failing so they don’t block healthy traffic. Operators should be able to register subscribers, list pending and failed deliveries, and replay deliveries that previously failed. Include a working example that simulates a flaky subscriber and shows the retry behavior end to end.”
This prompt is intentionally open-ended. The model picks the HTTP framework, the database access layer, the retry policy, how secrets are stored, the operator API surface, and how the example is structured. That’s closer to how a senior engineer actually receives a feature request than a tightly-scoped spec is.
Phase 1: Plan Mode
The first thing Grok Build 0.1 did was open a web search. It pulled current references on webhook reliability patterns before writing anything (Stripe’s signature format, GitHub’s retry behavior, the Standard Webhooks spec). That’s a reasonable opening move on this kind of task because the conventions for webhook signing (HMAC headers, timestamp tolerance, header naming) are easy to get subtly wrong if you’re working from memory.
Once it had context, it asked nine multiple-choice questions before producing any plan. The questions covered:
The HTTP framework (Hono vs Elysia vs raw Bun.serve)
The database access layer (raw SQL vs Drizzle vs Kysely)
The worker design (DB polling vs in-memory queue with setTimeout)
The API auth approach (none, per-subscriber secrets only, etc.)
The secret storage strategy (plaintext vs encrypted at rest)
The retry policy (hardcoded global vs per-subscriber)
The demo structure (separate flaky server vs in-process toggle)
The project name and any other constraints
The questions mapped directly to architectural choices the prompt left open. The model didn’t ask things it could have inferred from the prompt. It asked things where the answer materially shapes the architecture.
We answered with Hono, Drizzle ORM, a DB-polling worker, no API auth (signing only), encrypted secrets at rest, a hardcoded global retry policy, and a separate flaky-subscriber script for the demo.
Grok Build 0.1 then produced a written plan covering the full system. The plan included an ASCII architecture diagram, the database schema written out as real Drizzle TypeScript (not pseudocode), an explicit out-of-scope list, a phased implementation breakdown with time estimates, a risks-and-tradeoffs section, and explicit success criteria.
A few things stood out in what it chose:
It defaulted to Standard Webhooks-style headers (X-Webhook-Id, X-Webhook-Timestamp, X-Webhook-Signature with t=...,v1=...), not a custom scheme. This is the header shape Stripe, Svix, and a growing list of providers use, and the one most subscriber libraries already know how to read.
It planned AES-GCM encryption of subscriber signing secrets at rest, using the Web Crypto API with a master key from an environment variable.
It planned an SSRF guard in front of every outbound fetch to a subscriber URL, blocking cloud metadata endpoints by default.
It planned a separate immutable delivery_attempts table so the full attempt history is preserved even when replay creates a new delivery row.
The plan generated in roughly 90 seconds. Cost: $0.17.
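To make the planned SSRF guard concrete, here is a minimal sketch of a hostname check that could run before each outbound fetch. It is not the code Grok Build 0.1 generated: the function names, the blocklist, and the decision to reject IPv6 literals outright are all assumptions made for illustration.

```typescript
// Hypothetical SSRF guard. Names and the blocklist are illustrative assumptions,
// not taken from the generated project.
const BLOCKED_HOSTNAMES = new Set<string>([
  "localhost",
  "metadata.google.internal", // GCP metadata service hostname
]);

// True when an IPv4 literal is unspecified, loopback, private, or link-local.
// Link-local (169.254.0.0/16) covers the 169.254.169.254 cloud metadata address.
function isBlockedIPv4(host: string): boolean {
  const parts = host.split(".");
  if (parts.length !== 4 || parts.some((p) => !/^\d{1,3}$/.test(p))) {
    return false; // not an IPv4 literal
  }
  const [a, b] = parts.map(Number);
  return (
    a === 0 ||
    a === 10 ||
    a === 127 ||
    (a === 169 && b === 254) ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168)
  );
}

function isSafeSubscriberUrl(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // unparseable input is rejected
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;
  const host = url.hostname.toLowerCase();
  if (BLOCKED_HOSTNAMES.has(host)) return false;
  // Simplification: reject every IPv6 literal (hostname is bracketed, e.g. "[::1]").
  if (host.startsWith("[")) return false;
  return !isBlockedIPv4(host);
}
```

A check like this only inspects the URL as written. It does not resolve DNS, so a hostname that points at a private address would still pass, and a production guard would need to validate the resolved IP as well.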
Phase 2: Code Mode
We then opened a fresh Kilo Code chat with Grok Build 0.1 in Code Mode and asked it to implement the plan.
The model read the plan file, then started executing the phases in order: bootstrap the project, write the crypto layer, build the DB schema and migrations, wire up the delivery engine, add the HTTP routes, build the flaky subscriber, write the demo orchestrator.
The agentic run was smooth. No tool-calling retries, no malformed arguments, no hallucinated file paths. When it hit environment issues during setup (a drizzle-kit ABI mismatch with Bun, a type error on a Zod schema, an import path it got slightly wrong on the first pass), it diagnosed and fixed each one without us intervening. The run completed in a single uninterrupted session.
The final output was 26 files across src/, scripts/, tests/, and a generated drizzle/ migration folder. The project structure matched the plan exactly. The model didn’t drift, didn’t quietly simplify the spec, and didn’t drop pieces along the way.
Cost: $1.48.
Verification
After the run finished we checked the project against the basics:
The demo runs cleanly on a fresh DB. On a stale DB (a webhooks.db left over from a previous run with a different master key), one stale delivery failed with “secret decrypt failed”, the replayed delivery stayed pending, and the subscriber ended up paused. The demo script still exits 0 because it doesn’t assert final state strongly enough. The runtime behavior is correct (a decrypt failure should fail the delivery), but the demo script presents itself as a passing check when it isn’t really one.
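The stale-DB failure is what AES-GCM is expected to do: decryption authenticates the ciphertext, so a secret encrypted under one master key cannot be decrypted under another. The sketch below illustrates that behavior with the Web Crypto API. The helper names and the "iv.ciphertext" base64 storage format are assumptions, not the generated project's actual layout.

```typescript
import { webcrypto } from "node:crypto";

const subtle = webcrypto.subtle;

// Import a 32-byte master key (assumed to be decoded from an environment variable).
async function importMasterKey(raw: Uint8Array) {
  return subtle.importKey("raw", raw, "AES-GCM", false, ["encrypt", "decrypt"]);
}

type MasterKey = Awaited<ReturnType<typeof importMasterKey>>;

// Encrypt a subscriber secret. A fresh random 12-byte IV is generated per call
// and stored next to the ciphertext (hypothetical "iv.ciphertext" format).
async function encryptSecret(key: MasterKey, plaintext: string): Promise<string> {
  const iv = webcrypto.getRandomValues(new Uint8Array(12));
  const ciphertext = await subtle.encrypt(
    { name: "AES-GCM", iv },
    key,
    new TextEncoder().encode(plaintext),
  );
  return `${Buffer.from(iv).toString("base64")}.${Buffer.from(ciphertext).toString("base64")}`;
}

// Decrypt a stored secret. Rejects if the key is wrong or the data was tampered with.
async function decryptSecret(key: MasterKey, stored: string): Promise<string> {
  const [ivB64, ctB64] = stored.split(".");
  if (!ivB64 || !ctB64) throw new Error("malformed stored secret");
  const plaintext = await subtle.decrypt(
    { name: "AES-GCM", iv: Buffer.from(ivB64, "base64") },
    key,
    Buffer.from(ctB64, "base64"),
  );
  return new TextDecoder().decode(plaintext);
}
```

With this shape, a row written under an old master key makes decryptSecret reject, which matches the stale delivery failing instead of being signed with garbage.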
What the Code Looks Like
The code Grok Build 0.1 produced lined up with the plan. It produced the expected modules, passed typecheck, passed 14 helper tests, and ran the happy-path demo on a fresh DB.
A few things worth calling out:
SQLite is configured correctly on open. WAL mode, busy_timeout, and foreign keys are all set before the first query runs. These are easy to forget and they’re the difference between a SQLite project that survives concurrent writes and one that doesn’t.
Migrations are real. It generated a proper Drizzle migration (a checked-in SQL file under drizzle/) and wired the migrator into the startup path. It did not just CREATE TABLE IF NOT EXISTS inline.
The signing layer uses Standard Webhooks-style headers. HMAC-SHA256 over id.timestamp.canonical_payload, the same header names most subscriber libraries already know, Retry-After respected on 429 responses, 30-second timeout via AbortSignal.timeout on every outbound fetch.
Fan-out is transactional. When an event comes in, the matching deliveries are inserted in a single DB transaction so a crash mid-fanout doesn’t leave the system half-delivered.
Replay is non-destructive. Replaying a failed delivery creates a new row instead of mutating the original, so the attempt history stays intact as audit data.
A working demo ships with the project. A flaky-subscriber.ts script spins up a Hono server on a separate port that fails the first N attempts on purpose, verifies signatures, and logs everything. A demo.ts orchestrator spawns both processes, fires an event, polls the deliveries table, watches the retries happen, triggers the auto-pause, then replays. The demo runs on a fresh database and shows the intended retry/replay path, but it is not a real test harness yet (see Verification above).
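As an illustration of the signing scheme described in the list above, the following sketch computes HMAC-SHA256 over id.timestamp.payload and emits the three headers. The function name, the hex digest encoding, and the assumption that the payload string is already canonicalized are ours, not details confirmed from the generated code.

```typescript
import { createHmac } from "node:crypto";

// Hypothetical signing helper. The signed content is "id.timestamp.payload",
// where payload is assumed to be the canonical JSON string sent as the body.
function signDelivery(
  secret: string,
  id: string,
  timestamp: number, // Unix seconds
  payload: string,
): Record<string, string> {
  const signedContent = `${id}.${timestamp}.${payload}`;
  // Hex encoding is an assumption; some schemes use base64 instead.
  const digest = createHmac("sha256", secret).update(signedContent).digest("hex");
  return {
    "X-Webhook-Id": id,
    "X-Webhook-Timestamp": String(timestamp),
    "X-Webhook-Signature": `t=${timestamp},v1=${digest}`,
  };
}

// Usage: build headers for one outbound delivery.
const exampleHeaders = signDelivery(
  "whsec_demo",
  "evt_123",
  1700000000,
  JSON.stringify({ type: "order.created" }),
);
console.log(exampleHeaders["X-Webhook-Signature"]);
```

Including the timestamp in the signed content is what lets a subscriber reject replayed requests that fall outside its tolerance window.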
Structured JSON logs are consistent throughout. Error handling follows the same pattern in every code path (record the attempt, then update the delivery row, then update the subscriber). The unit tests cover the parts where a silent mistake would never be caught at runtime (crypto roundtrip, backoff math, signing and verification, URL safety). The shape is solid; what’s missing is coverage of the actual worker and replay path.
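For readers unfamiliar with the "backoff math" those unit tests exercise, here is a rough sketch of a hardcoded global retry policy that also honors Retry-After on 429 responses. The constants and function names are hypothetical and do not reflect the schedule the generated project ships.

```typescript
// Hypothetical global retry policy. These values are illustrative assumptions.
const MAX_ATTEMPTS = 6;
const BASE_DELAY_MS = 1_000;
const MAX_DELAY_MS = 60_000;

// Exponential backoff with a cap. `attempt` is the number of attempts already made.
function backoffDelayMs(attempt: number): number {
  return Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** (attempt - 1));
}

// Parse a Retry-After header in delta-seconds form.
// Returns null when the header is missing or not a plain integer
// (the HTTP-date form is deliberately not handled in this sketch).
function retryAfterMs(header: string | null): number | null {
  if (header === null) return null;
  const trimmed = header.trim();
  if (!/^\d+$/.test(trimmed)) return null;
  return Number(trimmed) * 1000;
}

// Decide how long to wait before the next attempt, or null to stop retrying.
function nextDelayMs(
  attempt: number,
  status: number,
  retryAfter: string | null,
): number | null {
  if (attempt >= MAX_ATTEMPTS) return null; // attempts exhausted
  const backoff = backoffDelayMs(attempt);
  const hinted = status === 429 ? retryAfterMs(retryAfter) : null;
  // Respect the subscriber's hint by never retrying sooner than it asks.
  return hinted !== null ? Math.max(hinted, backoff) : backoff;
}
```

Keeping these as pure functions is what makes them cheap to unit test. The worker that calls them is the part that needs integration coverage.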
What We’d Flag
A few things didn’t land cleanly:
GET /subscribers/:id returns the encrypted secret. The route returns the full subscriber row, which includes the encryptedSecret column. The plan was explicit that the plaintext secret should only ever be returned once (at creation time), and the README repeats that promise. But the get-by-id endpoint hands back the encrypted form on every request. With the master key, that’s the secret. This is the kind of thing a human code reviewer catches in 30 seconds.
The signature comparison isn’t constant-time. It’s a plain string equality check. For a webhook verifier this matters because constant-time comparison is what protects against timing-attack signature recovery.
Integration coverage is thin. Unit tests cover the math (crypto, backoff, signing, URL safety). There are no integration tests for the actual delivery loop, the auto-pause behavior, or the replay flow. The demo script doubles as an integration check, but it isn’t wired into bun test and (as Verification above notes) it doesn’t assert final state strongly enough.
A few demo shortcuts ship in the code. The retry schedule is short (0 to 6 seconds across all six attempts) to keep the demo fast, the SSRF guard has its private-IP block commented out so the localhost flaky subscriber can be reached, and there’s no rate limit on ingest. These are called out in the README as demo choices, but a production user would need to revisit them before shipping.
These are real code-review notes. The code runs and the unit tests pass, but a developer doing a review pass on this PR would close these by tightening the subscriber endpoint, swapping the comparison for crypto.timingSafeEqual, and writing a couple of integration tests around the worker loop.
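The first two fixes are small. The sketch below shows one possible shape for each: a verifier that compares digests with crypto.timingSafeEqual, and a response mapper that drops the encryptedSecret column. Apart from encryptedSecret, the field names, function names, and hex digest encoding are assumptions for illustration.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Constant-time signature check. `signedContent` is the "id.timestamp.payload"
// string and `receivedHex` is the v1 value from the signature header
// (hex encoding is an assumption).
function verifySignature(
  secret: string,
  signedContent: string,
  receivedHex: string,
): boolean {
  const expected = createHmac("sha256", secret).update(signedContent).digest();
  const received = Buffer.from(receivedHex, "hex");
  // timingSafeEqual throws on length mismatch, so reject those up front.
  if (received.length !== expected.length) return false;
  return timingSafeEqual(received, expected);
}

// Hypothetical row shape. Only `encryptedSecret` is named in the article.
interface SubscriberRow {
  id: string;
  url: string;
  eventTypes: string[];
  encryptedSecret: string;
}

// Strip the encrypted secret before a subscriber row leaves the API.
function toPublicSubscriber(row: SubscriberRow): Omit<SubscriberRow, "encryptedSecret"> {
  const { encryptedSecret: _omitted, ...rest } = row;
  return rest;
}
```

Routing every subscriber response through a single mapper like toPublicSubscriber keeps the secret column from leaking again when a new endpoint is added.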
Speed
Kilo Code reported throughput of roughly 120 tokens per second during the run, with short pauses between tool calls instead of the long thinking phases we usually see on agentic runs. We’ve seen higher numbers on inference platforms optimized for raw throughput (Cerebras-hosted models can push close to 1,000 tps, for example), but 120 tps on a 256K-context model with this much tool-calling on a single task is fast.
The plan generated in about 90 seconds end to end. The implementation finished in a single uninterrupted run.
Cost
The full job came to $1.65:
$0.17 for the plan (web research, nine clarifying questions, the full written plan)
$1.48 for the implementation (26 files, 3 tables, working demo, passing unit tests)
For a service of this scope (HTTP API, encrypted secret storage, retry-with-backoff worker, per-subscriber isolation, replay, signing, demo harness), $1.65 is in the range where running the same task two or three times to compare attempts is still inexpensive.
What Grok Build 0.1 Felt Like to Use
A few patterns from this run worth naming:
It asks before it builds. On an open-ended prompt, the model spent its first turn on research and its second turn on questions. It didn’t guess at a framework, pick a retry policy, or commit to a secret-storage approach without checking first.
The plan it produces is the plan it executes. When we switched to Code Mode and pointed it at the plan, the final project structure matched the plan exactly. Nothing was lost, nothing was quietly downgraded.
It recovers from environment friction on its own. The Bun ABI issue with drizzle-kit, the Zod type error, the import path correction: it diagnosed each one, fixed it, and kept going.
Tool calling held up across a long run. No malformed file edits, no hallucinated paths, no infinite retry loops on broken commands. The agent loop ran cleanly from start to finish.
In previous cheaper-model runs, these were the places where tool use broke down first. Grok Build 0.1 avoided them in this run.
Takeaway
Grok Build 0.1 planned, scoped, and built a webhook delivery service in one sitting for $1.65 at around 120 tokens per second, with zero tool-calling failures and a demo that runs on a fresh database. The code lined up with the plan, the conventions it picked were sensible (Standard Webhooks-style headers, AES-GCM secret storage, a basic SSRF guard), and the review notes that came out of it are the kind a reviewer can close in a follow-up PR rather than a rewrite.
At $1 per 1M input and $2 per 1M output with 256K of context, this run cost $1.65. Running it two or three times on the same prompt to compare attempts is still cheaper than a single attempt on a frontier model at current prices. On a well-scoped backend with clear primitives, this run suggests the model can carry the first pass on its own, as long as a reviewer checks the worker path, API surface, and demo assertions before treating it as done.
The testing was performed using Kilo Code, a free open-source AI coding assistant for VS Code and JetBrains with 3,000,000+ installs across all platforms.






