We tested Devstral 2 on real coding tasks

Dec 05, 2025

Note: This model was previously released as a stealth model called “Spectre,” so you may see that name referenced in the images below. Spectre and Devstral 2 are the same model.

We ran Devstral 2 through three coding tasks to see how it handles real development work, including building an API from scratch, finding bugs in Go code, and writing documentation.

Let’s dive deeper.

What We Tested

We created three tests that cover common development workflows:

Code generation: Build a bookmarking API with TypeScript and Hono
Bug detection: Find and fix issues in Go code with concurrency bugs
Documentation: Generate JSDoc and README for a complex TypeScript function

All tests ran in Kilo Code using Code Mode with a clean setup for each test. We deliberately used different languages and frameworks to see how Devstral 2 performs across different development scenarios.

Test 1: Bookmarking API

We wanted to test how closely Devstral 2 follows instructions. We gave it specific requirements and checked whether it implemented exactly what we asked for, added unrequested features, or missed requirements. This reflects a common development workflow where you already know exactly what you need to build and you want the model to execute without diverging from the spec.

Prompt:

Build a bookmarking API using TypeScript and Hono with the following requirements:

1. Use better-sqlite3 for persistence
2. Endpoints:
   - POST /bookmarks - create a bookmark (url, title, tags[], notes)
   - GET /bookmarks - list all bookmarks with optional tag filter
   - GET /bookmarks/:id - get single bookmark
   - PUT /bookmarks/:id - update bookmark
   - DELETE /bookmarks/:id - delete bookmark
3. Input validation using Zod
4. Proper error handling with appropriate status codes
5. Include a health check endpoint at GET /health

The results: Devstral 2 set up the entire project from scratch with package.json, dev and start scripts, TypeScript configuration, and a multi-file architecture all in one go. After generating the code, it started the server and ran curl commands to verify each endpoint worked correctly.

It’s worth noting that, in our experience running these tests, models often get stuck setting up Node project structures. Version mismatches, wrong package.json configurations, and TypeScript issues usually take a few back-and-forths to fix. Devstral 2 was one of the few models that got the entire setup right in one shot.

The project structure Devstral 2 created:

The validation layer shows clean Zod usage with separate schemas for create and update operations.

The repository layer handles database operations with proper JSON serialization for the tags array.
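The generated repository code is not reproduced in this post. As a rough sketch of what JSON serialization for the tags array can look like, assuming tags are stored as a JSON string in a single text column: the Bookmark and BookmarkRow types and the toRow and fromRow helpers below are hypothetical names for illustration, not Devstral 2's output.

```typescript
// Hypothetical shapes: the in-memory bookmark and the row as stored in SQLite.
interface Bookmark {
  id: number;
  url: string;
  title: string;
  tags: string[];
  notes: string | null;
}

interface BookmarkRow {
  id: number;
  url: string;
  title: string;
  tags: string; // JSON-encoded string array (assumed column layout)
  notes: string | null;
}

// Serialize the tags array before writing the row.
function toRow(bookmark: Bookmark): BookmarkRow {
  return { ...bookmark, tags: JSON.stringify(bookmark.tags) };
}

// Parse the stored JSON on read, falling back to an empty array
// if the column holds malformed JSON or something other than an array.
function fromRow(row: BookmarkRow): Bookmark {
  let tags: string[] = [];
  try {
    const parsed: unknown = JSON.parse(row.tags);
    if (Array.isArray(parsed)) {
      tags = parsed.filter((t): t is string => typeof t === "string");
    }
  } catch {
    tags = [];
  }
  return { ...row, tags };
}
```

Keeping the conversion in two small functions means the route handlers only ever see a real string array.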

What Devstral 2 did well:

Created proper separation of concerns (models, repositories, validation)
All five required endpoints implemented with correct HTTP methods
Zod validation with .safeParse() and formatted error responses
Try/catch blocks on all routes with appropriate status codes (400, 404, 500, 201)
Self-tested the endpoints with curl after implementation

Minor issues:

Uses any type in a few places in the repository layer
The update logic passes empty strings for undefined optional fields instead of preserving existing values
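The second issue is easiest to see with a small merge function. The sketch below shows one way to preserve existing values when an optional field is omitted from the update, assuming a partial update object; StoredBookmark, BookmarkUpdate, and applyUpdate are hypothetical names, and this is not the code Devstral 2 generated.

```typescript
// Hypothetical shapes for illustration only.
interface StoredBookmark {
  url: string;
  title: string;
  tags: string[];
  notes: string;
}

type BookmarkUpdate = Partial<StoredBookmark>;

// Merge an update into the existing record.
// `??` falls back to the existing value only when the field is
// undefined (or null), so an explicit empty string still clears a field.
function applyUpdate(existing: StoredBookmark, update: BookmarkUpdate): StoredBookmark {
  return {
    url: update.url ?? existing.url,
    title: update.title ?? existing.title,
    tags: update.tags ?? existing.tags,
    notes: update.notes ?? existing.notes,
  };
}
```

With this shape, sending only a new title leaves notes and tags untouched, while sending notes as an empty string is still treated as an intentional change.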

Devstral 2 generating the bookmarking API in Kilo Code

Test 2: Bug Detection in Go

Bug detection is one of those tasks where we’ve seen mixed results between frontier models and smaller models. Some bugs are obvious pattern matches that any model can catch, but others require deeper reasoning about control flow and edge cases. We wanted to see where Devstral 2 lands on this spectrum.

We wrote a Go session management system with intentional bugs including race conditions, nil pointer issues, and missing error handling.

Prompt: “Review this Go code and find all bugs and issues. Fix them.”

Bugs planted: 9
Bugs found: 7

Devstral 2 ran go build and go vet after making changes to verify that the fixes compiled. As with the curl checks in Test 1, it tends to verify its own work, either with language tooling such as go build and go vet or with external tools such as curl.

Here’s how Devstral 2 fixed the nil pointer dereference in GetSession:

Before:

After:

What Devstral 2 missed:

The login handler has a critical authentication bypass bug. When the password check fails, the code sends an error response but doesn’t return from the function, so execution continues and creates a session for the user anyway.

This means any login attempt creates a valid session regardless of the password. Devstral 2 fixed the mutex bugs and nil pointer issues but missed this logic error.
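The Go handler itself is not reproduced in this post. The same class of bug can be sketched in TypeScript, the language of the other two tests. Everything in the sketch is a hypothetical stand-in (the Send callback, the in-memory user table, the session set); it only illustrates how a missing return after an error response lets execution fall through.

```typescript
// Hypothetical stand-ins; the original test code was Go and is not shown.
type Send = (status: number, body: string) => void;

// Demo-only credential table. Real code would never store plaintext passwords.
const USERS = new Map<string, string>([["alice", "s3cret"]]);

function passwordMatches(user: string, password: string): boolean {
  return USERS.get(user) === password;
}

// Buggy shape: the error response is sent, but execution continues.
function loginBuggy(user: string, password: string, sessions: Set<string>, send: Send): void {
  if (!passwordMatches(user, password)) {
    send(401, "invalid credentials");
    // BUG: no `return` here, so the session is still created below.
  }
  sessions.add(user);
  send(200, "session created");
}

// Fixed shape: return immediately after sending the error.
function loginFixed(user: string, password: string, sessions: Set<string>, send: Send): void {
  if (!passwordMatches(user, password)) {
    send(401, "invalid credentials");
    return;
  }
  sessions.add(user);
  send(200, "session created");
}
```

This kind of bug compiles cleanly and passes static checks like go vet in the Go original, which may be part of why build-based self-verification did not surface it.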

Devstral 2 also identified the race condition in the rate limiter but left a comment instead of implementing the fix.

Devstral 2 analyzing and fixing Go bugs in Kilo Code

Test 3: Documentation

Documentation is another task that’s often offloaded to smaller, faster, and more affordable models. You already have working code and you just need it explained clearly. We wanted to see how Devstral 2 handles this, so we gave it a complex TypeScript token management function (about 100 lines) that handles four different actions: generate, validate, revoke, and refresh tokens.

Prompt: “Write comprehensive documentation for this function including JSDoc comments, a README file explaining usage, and examples for each action type.”

Devstral 2 produced both inline JSDoc and a separate README with 256 lines of documentation.

JSDoc output:

The README includes a token types table, usage examples for each action, and a complete authentication workflow.

Devstral 2 documented all interfaces (TokenConfig, GeneratedToken, ValidationResult), explained the security features (SHA-256 hashing, timing-safe comparison), and included practical examples showing common patterns like token refresh flows and error handling.

Minor issues:

One JSDoc example shows validation.expiresAt but the validation result object doesn’t have that property (it has an expired boolean instead)
The README includes npm install crypto in the setup instructions, but crypto is a built-in Node.js module that doesn’t need installation
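To make the second point concrete: both security features named above are available from Node's built-in crypto module with no install step. The sketch below is a minimal illustration, not the function under test; hashToken and tokenMatches are hypothetical names, and storing the hash as a hex string is an assumption.

```typescript
// Built-in module: no `npm install` needed.
import { createHash, timingSafeEqual } from "node:crypto";

// Hash a token with SHA-256 and return the raw digest bytes.
function hashToken(token: string): Buffer {
  return createHash("sha256").update(token, "utf8").digest();
}

// Compare a presented token against a stored hex-encoded hash.
function tokenMatches(presented: string, storedHashHex: string): boolean {
  const presentedHash = hashToken(presented);
  const storedHash = Buffer.from(storedHashHex, "hex");
  // timingSafeEqual throws if the buffers differ in length, so check first.
  if (presentedHash.length !== storedHash.length) {
    return false;
  }
  return timingSafeEqual(presentedHash, storedHash);
}
```

Comparing fixed-length digests rather than the raw tokens keeps the inputs to timingSafeEqual the same size for any well-formed stored hash.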

Devstral 2 generating documentation in Kilo Code

Observations

Devstral 2 is a fast model that is reliable during tool calls. All three tests completed in a single pass with no tool calling failures or retries needed. And since it’s free, we ran all tests without spending anything.

One behavior that stood out was self-verification. Devstral 2 ran verification commands after each task. For the API, it started the server and tested endpoints with curl. For the Go code, it ran go build and go vet to confirm fixes compiled. This caught issues before we had to review the output.

Devstral 2 is currently in stealth mode, and the team behind it is actively collecting feedback during this testing phase. We expect the model to improve based on real-world usage data.

Where Devstral 2 Fits

Based on these tests, Devstral 2 handles implementation work well. The code generation test showed it can scaffold a complete project with proper structure. The bug detection test found 7 of the 9 planted bugs but missed a critical auth bypass. The documentation output was thorough, with practical examples.

The self-verification behavior is useful. Having the model test its own output reduces back-and-forth debugging cycles.

How to Start Using Devstral 2 For Free

Devstral 2 is free in Kilo Code with no rate limits.

Install our Kilo agent (available for VS Code, JetBrains, or as a CLI).
Select “Devstral 2” from the model dropdown.
Start coding!

Dmenis

Dec 8

The article says the model is free; however, that is not what I see. In the model dropdown in VSCode it is not shown as a free model.

PS: I read several comments on this model and they were pretty negative, nowhere near the positivity mentioned by KiloCode. I haven't seen KiloCode responding to this.

matt wilkie

Dec 5 (edited)

this post is timely, because I was just in the process of signing in to substack to add a comment _"Like the other commenters, my experience with Spectre is middling to not very good. @Darko, can you share how you worked with it to get good results?"_ to https://blog.kilo.ai/p/spectre-stealth-model. So, you answered my prompt before I hit submit! heh.

My takeaway from this more in-depth blog post is: with Spectre we **must** use Architect with a frontier model first for good results, whereas the other stealth models in the last couple of months have been a bit more forgiving with less forethought.

Kilo Blog
