The New AI Problem Is a Lack of New Data
Why AI labs are buying coding tools and giving away free products
All indicators point to the AI coding market just heating up. Last week, SpaceX announced it had secured an option to buy Cursor for $60 billion. Last year, OpenAI tried to acquire Windsurf for $3 billion before the deal collapsed and Google swooped in for a licensing agreement. Google launched Gemini CLI as a free, open-source coding agent. GitHub Copilot has a free tier. OpenAI and Anthropic give away Codex and Claude Code usage for way less than market pricing.
But what’s happening one layer under the surface is more than a simple competition for users. These companies are competing for your data—subscription revenue is secondary. The old adage that if you’re not paying, you’re the product is as true as ever.
The data is gone
Epoch AI estimated that the total stock of public human-generated text on the internet is roughly 300 trillion tokens. Their projection: at current training rates, AI labs will exhaust that entire supply between 2026 and 2032. If you account for overtraining—running through the same data multiple times with different configurations—the timeline moves closer to 2026.
We’re there now. Every major lab has already ingested essentially all of Wikipedia, all of Reddit, all of Stack Overflow, most of GitHub, most published books, most news articles. The low-hanging fruit was picked years ago. What’s left is either low-quality, paywalled, or already synthetic (AI-generated content polluting the very sources that models need to learn from).
The result is that frontier models from OpenAI, Anthropic, Google, and others have been converging, with the gaps between them narrowing with every release cycle. Per-token API costs dropped 60-80% between early 2025 and now. Google’s Gemini Flash-Lite costs $0.25 per million input tokens. DeepSeek V4 is at $0.30.
Models are becoming commodities, and when you can’t differentiate your product on quality because everyone trained on the same internet, you have to find new data sources.
Strategy 1: Build tools that generate data
Google is giving away Gemini CLI. GitHub Copilot has a free tier. OpenAI is handing out Codex credits. Anthropic gives you way more than your subscription’s worth of tokens in Claude Code. The reason is the same in every case: every time you use an AI coding assistant, you’re generating training signal. The prompts you write, the code you accept or reject, the edits you make after the AI suggests something—all of that is precisely the kind of high-quality, expert-annotated data that doesn’t exist on the public internet.
From a lab’s perspective, a Stack Overflow answer tells you what correct code looks like. A coding assistant interaction tells you what correct code looks like in context: what problem the developer was actually solving, what they tried first, what they accepted, what they modified, and what they threw away. That’s a much richer training signal.
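To make that contrast concrete, here is a toy sketch of how such interaction logs could be converted into preference pairs for fine-tuning. The schema and field names are invented for illustration; no vendor's actual pipeline is being described.

```python
from dataclasses import dataclass

# Hypothetical schema -- field names are illustrative, not any vendor's format.
@dataclass
class Interaction:
    prompt: str       # what the developer asked, with surrounding code context
    suggestion: str   # what the model proposed
    accepted: bool    # did the developer take the suggestion?
    final_code: str   # what actually ended up in the file after edits

def to_preference_pair(event: Interaction) -> dict:
    """Turn one logged interaction into a (chosen, rejected) training pair.

    If the developer edited the suggestion before keeping it, the
    human-corrected version is the preferred completion and the raw
    suggestion is the dispreferred one -- exactly the expert annotation
    a public Stack Overflow answer can't provide.
    """
    if event.accepted and event.final_code == event.suggestion:
        return {"prompt": event.prompt, "chosen": event.suggestion, "rejected": None}
    return {"prompt": event.prompt, "chosen": event.final_code, "rejected": event.suggestion}

pair = to_preference_pair(Interaction(
    prompt="write a null-safe getter",
    suggestion="def get(d, k): return d[k]",
    accepted=True,
    final_code="def get(d, k): return d.get(k)",
))
# The developer's edit becomes the 'chosen' completion; the raw
# suggestion becomes 'rejected'.
```

Even in this stripped-down form, the log captures what the developer tried, what they kept, and what they fixed—signal that simply doesn't exist on the public internet.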
The free tier is a data collection strategy dressed up as a product offering. You get free AI assistance; they get millions of real-world developer workflows they can use to train the next generation of models.
Kilo’s open-source coding agent runs locally: your code context goes to whichever model provider you choose, and we never train on it because we don’t have a model to train. That’s a deliberate architectural decision. When I look at how the big labs are structuring their free tiers, I’m glad we made that call. Users should be paying closer attention to whether their coding assistant is serving them—or quietly serving as a data collection layer for the next foundation model.
Strategy 2: Acquire the data moat
If you can’t generate enough new data with free tools, you can buy a company that already has it.
That’s the SpaceX/Cursor story. Cursor has widespread adoption among developers, including senior engineers at well-funded companies writing production code. SpaceX’s post on X described Cursor’s appeal in terms of its “distribution to expert software engineers.” Viewed from that angle, it reads more like a description of a data asset than a compliment about UX or product-market fit.
The same logic drove OpenAI’s failed $3 billion bid for Windsurf. OpenAI didn’t need another IDE. All the memes about “Why would you pay $3B for a VS Code fork?” were funny but unfounded. What OpenAI wanted—what they need—is Windsurf’s user base, and the interaction data that comes with it. When that deal collapsed (reportedly over Microsoft tensions about IP access), Google immediately moved in with a $2.4 billion licensing deal.
These acquisitions make zero sense as product plays. Cursor at $60 billion? The AI coding assistant market is valuable, sure, but not that valuable on subscription revenue alone. The valuations reflect the value of what’s actually being bought: a pipeline of continuous, high-quality human data that can’t be scraped from the public internet.
Strategy 3: Distill from competitors
There’s a third approach, and it’s messier. Instead of building tools to collect data or acquiring companies that have it, you can extract capabilities directly from stronger models.
Anthropic accused DeepSeek, Moonshot AI, and MiniMax of running “industrial-scale” distillation campaigns against Claude—approximately 24,000 accounts generating over 16 million exchanges, using commercial proxy services to bypass China access restrictions. MiniMax alone drove 13 million of those exchanges.
OpenAI made similar allegations to U.S. legislators, claiming it observed DeepSeek employees “developing methods to circumvent OpenAI’s access restrictions” and “developing code to access U.S. AI models and obtain outputs for distillation in programmatic ways.”
Distillation isn’t inherently sketchy—Anthropic acknowledged in its own statement that AI firms “routinely distill their own models to create smaller, cheaper versions.” But there’s a difference between distilling your own work and systematically extracting a competitor’s capabilities through 24,000 fake accounts.
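The technical distinction matters here. When a lab distills its own models, it can match the student's full token distribution to the teacher's; an outsider hitting a competitor's API only sees sampled text. A minimal sketch of the in-house version, using a temperature-softened KL divergence in plain Python (the function names and values are illustrative):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Minimizing this pushes the student's next-token distribution toward
    the teacher's -- the core of classic in-house distillation. API-based
    distillation has to settle for a coarser signal: the teacher's sampled
    text outputs, since competitors don't expose their logits.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Identical distributions -> zero loss; any gap -> positive loss.
same = distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1])
gap = distillation_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0])
```

The 16 million exchanges Anthropic describes are the output-only variant at scale: no logits, just millions of sampled completions used as supervised training targets.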
The motivation, though, is the same as strategies 1 and 2: when you can’t get new training data from the public internet, you find other sources. Even if those sources are your competitors’ models.
What this means for developers
If you’re a developer, the race for your data puts you in a strange position: the companies building your tools need your workflows as much as you need their models. Your code patterns, debugging approaches, and architectural decisions all have value, and you trade that value for the convenience of free autocomplete.
That doesn’t mean you should stop using these tools. I use AI coding assistants constantly—I’d be a hypocrite to suggest otherwise. But it’s worth being deliberate about which ones you use and what data you’re comfortable sharing. Read the terms of service, understand the privacy policies, and consider whether a tool that runs locally and lets you choose your model provider gives you more control than one that routes everything through a single vendor’s cloud.
The commodity floor
Models are converging because they trained on the same data. The labs that acquire the best new data—through free tools, acquisitions, or distillation—will pull ahead temporarily. But even that advantage is fleeting: every major lab is building free coding tools, bidding on coding startups, and fending off distillation attacks, so the strategies are converging as fast as the models.
My bet: the models themselves will fully commoditize within the next year or two. The differentiation will shift to execution speed, tool integration, privacy guarantees, and ecosystem support—basically everything except raw model intelligence. The difference between Claude, GPT, and Gemini on most coding tasks is already close to negligible. What matters is how they integrate into your workflow, and at Kilo, we believe that means you should have the freedom to use the right model for the job.
The labs will keep looking for new sources, expanding beyond coding into design, writing, data analysis—anywhere experts make decisions that generate valuable signal.
For the rest of us, the play is straightforward: use the subsidies, be aware of the trade-off, and don’t get locked in. The tools are good and getting better. Just be clear-eyed about why they’re free.



"The free tier is a data collection strategy dressed up as a product offering."
And then they bitch and moan about China distilling their models...
What's wrong with this picture?