A new Stanford study shows once again that vibe coding has a serious problem...
Coding Agent Interactions From Real Users in the Wild
https://arxiv.org/pdf/2604.20779
Abstract
AI coding agents are being adopted at scale, yet we lack empirical evidence on how people actually use them and how much of their output is useful in practice. We present SWE-chat, the first large-scale dataset of real coding agent sessions collected from open-source developers in the wild.
The dataset currently contains 6,000 sessions, comprising more than 63,000 user prompts and 355,000 agent tool calls. SWE-chat is a living dataset; our collection pipeline automatically and continually discovers and processes sessions from public repositories. Leveraging SWE-chat, we provide an initial empirical characterization of real-world coding agent usage and failure modes. We find that coding patterns are bimodal: in 41% of sessions, agents author virtually all committed code (“vibe coding”), while in 23%, humans write all code themselves. Despite rapidly improving capabilities, coding agents remain inefficient in natural settings. Just 44% of all agent-produced code survives into user commits, and agent-written code introduces more security vulnerabilities than code authored by humans. Furthermore, users push back against agent outputs - through corrections, failure reports, and interruptions - in 44% of all turns. By capturing complete interaction traces with human vs. agent code authorship attribution, SWE-chat provides an empirical foundation for moving beyond curated benchmarks towards an evidence-based understanding of how AI agents perform in real developer workflows.
They also discovered that:
We identify sessions with a low success rating, revealing cases where agents fail to complete the user requests appropriately (Figure 6). In addition to that, we find that less than half of all agent-produced code survives into user commits (Table 3). Vibe coding is particularly inefficient, consuming roughly 3× more tokens and dollars per committed line than collaborative coding (Figures 7 and 29). Vibe-coded code is also substantially less safe. It introduces roughly 9× more security vulnerabilities per committed line than code that humans write themselves and about 5× more than code they co-author with the agent (Table 4). Agents are working autonomously for longer - the 99.9th-percentile turn duration now exceeds 100 minutes - yet they rarely stop to ask users for clarification (Figure 30). Users compensate by interrupting agents in 5% of turns and by pushing back against agent outputs in 39% of turns, often providing corrections and failure reports (Figure 8).
I've read the paper. The SWE-chat study is real data and worth taking seriously, but using it as a flat "vibe coding has a serious problem" verdict skips past several methodological gaps that matter a lot for how you interpret those numbers.
The dataset is 6,000 sessions from open-source developers who opted into Entire.io's public checkpoint logging on public GitHub repos. The paper itself acknowledges this "selects for early adopters of a new open-source tool" and explicitly says findings "may not generalize." There is zero stratification by developer experience level anywhere in the paper — no junior vs. senior, no novice vs. principal, nothing. The words "junior," "senior," "principal," or "expert" in the sense of career level don't appear as analytical dimensions. That omission is enormous, because the population is almost certainly weighted toward developers early in their practice with these tools, and the headline numbers reflect that population - not the ceiling of what's possible.
First, the 44% survival figure is the overall average. Vibe coding sessions actually have a 59% survival rate. The paper notes this "may reflect lower user scrutiny" - meaning vibe coders are accepting more output with less review, not that the output is higher quality. More critically: an experienced engineer reviewing agent output and discarding the 40% that doesn't meet the bar is the system working correctly. Treating generated code as a proposal to be curated is the right mental model. The "waste" framing only makes sense if you think the agent's first output should be production-ready, which it shouldn't.
Vibe coding consumes roughly 3x more tokens per committed line than collaborative coding. But tokens are not the binding constraint for professional developers - time is. The paper's own time efficiency data tells a completely different story: collaborative sessions (where a human is actively directing and co-authoring) clock in at a median of 4.8 minutes per 100 committed lines, while vibe coding sessions run 12.6 minutes. Collaborative coding - meaning an experienced engineer staying engaged and steering tightly - is 2.6x faster by time and the most cost-efficient mode. That finding is buried in Section 4.2 and doesn't make it into the abstract.
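For anyone who wants to check the 2.6x figure, here is the arithmetic as a quick sketch. The two medians are the ones quoted above; the function and variable names are mine, not the paper's.

```python
# Median minutes per 100 committed lines, as quoted in the comment above.
COLLABORATIVE_MIN_PER_100 = 4.8
VIBE_MIN_PER_100 = 12.6


def speedup(slow: float, fast: float) -> float:
    """Return how many times faster the `fast` mode is than the `slow` mode."""
    return slow / fast


ratio = speedup(VIBE_MIN_PER_100, COLLABORATIVE_MIN_PER_100)
print(f"collaborative is {ratio:.1f}x faster per committed line")
# prints: collaborative is 2.6x faster per committed line
```

Note that this compares medians only, so it says nothing about how widely individual sessions vary.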
The security finding is the one I take most seriously - but it's measured by Semgrep static analysis only, on files modified by the commit, counting post-commit findings that weren't present pre-commit. It's a proxy metric, not a complete security audit. More importantly: it measures what entered commits, not what entered production after code review. In a professional setting with an experienced engineer, agent output gets reviewed before it ships. The vulnerabilities are a flag for the review process, not a verdict on the output itself. The finding is actually the best argument for what I've been writing about regarding diff discipline and human gatekeeping - not against agentic coding as such.
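As a rough sketch of what a pre/post comparison like that looks like (my own illustration, not the paper's code): treat the scanner findings before and after the commit as multisets and keep only what is new in the modified files. The `Finding` shape, the rule names, and the choice to match on rule plus path are all assumptions on my part; the paper's exact matching logic may differ.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    rule_id: str  # hypothetical: identifier of the static-analysis rule
    path: str     # file the finding was reported in


def introduced(pre: list[Finding], post: list[Finding], modified: set[str]) -> Counter:
    """Findings present after the commit but not before, limited to modified files."""
    pre_counts = Counter(f for f in pre if f.path in modified)
    post_counts = Counter(f for f in post if f.path in modified)
    # Counter subtraction keeps only positive counts, i.e. net-new findings.
    return post_counts - pre_counts


# Made-up example data.
pre = [Finding("sql-injection", "app/db.py")]
post = [
    Finding("sql-injection", "app/db.py"),         # already there before the commit
    Finding("hardcoded-secret", "app/config.py"),  # new, in a modified file
    Finding("eval-use", "vendor/lib.py"),          # file not touched by the commit
]
new = introduced(pre, post, modified={"app/db.py", "app/config.py"})
print(sum(new.values()), "new finding(s):", sorted(f.rule_id for f in new))
# prints: 1 new finding(s): ['hardcoded-secret']
```

The point of the sketch is the limitation: a diff like this only counts what the scanner's rules can see at commit time, which is why I call it a proxy.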
Nowhere in the paper - not in the methodology, not in the limitations, not in future work - is there any discussion of session pre-configuration. No mention of CLAUDE.md, rules files, system prompts, or any of the setup work that experienced practitioners do before a single prompt is typed. An experienced engineer doesn't open a blank agent session and say "build me a feature." They've set up behavioral defaults, scoped the context, defined no-fly zones, and structured the task before the first turn. The study captures nothing about this. It's measuring cold sessions from a population that skews toward people who haven't done that work yet.
Another important gap the paper acknowledges - sessions where the agent completely fails and the user abandons the output are not captured at all. As the limitations section states: "If the user abandons the agent's output entirely, session logs are not committed and thus not captured by our data." This means the dataset is biased toward sessions that produced some committed code. Complete failures are systematically excluded, which means even the headline numbers are more optimistic than the actual experience of the broader population.
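To make the direction of that bias concrete, here is a toy calculation. Every number in it is invented for illustration (I picked the captured sessions so they land on 44%); none of it comes from the paper, and the function name is mine.

```python
def survival_rate(sessions: list[tuple[int, int]]) -> float:
    """Fraction of agent-produced lines that end up committed.

    Each session is (agent_lines_produced, agent_lines_committed).
    """
    produced = sum(p for p, _ in sessions)
    committed = sum(c for _, c in sessions)
    return committed / produced


# Invented numbers, for illustration only.
captured = [(1000, 440), (500, 220)]  # sessions that ended in a commit
abandoned = [(800, 0), (200, 0)]      # sessions whose output was thrown away

observed = survival_rate(captured)            # what a commit-based dataset sees
actual = survival_rate(captured + abandoned)  # what users experienced overall
print(f"observed {observed:.0%}, including abandoned sessions {actual:.0%}")
# prints: observed 44%, including abandoned sessions 26%
```

How large the real gap is depends entirely on how many sessions get abandoned, which is exactly the thing the dataset can't tell us.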
For what it's worth from a practitioner standpoint: I'm a principal engineer, and I rarely need more than 1-2 well-structured prompts with a short test cycle before a feature is complete and requires minimal rework. Not because the tools are magic, but because setup matters enormously - scoped prompts, pre-configured agent behavior, explicit constraints, and staying in the collaborative loop rather than delegating entirely. The paper's own data supports this: collaborative coding (active human engagement + agent) is the most efficient mode across time, tokens, and cost. It just doesn't generate the alarming headline numbers.
The study is doing something genuinely valuable - real empirical data from real sessions is exactly what the field needs. But "vibe coding has a serious problem" as a takeaway conflates undisciplined use by a mixed-experience population with the ceiling of what experienced practitioners achieve. Those are different things, and the study doesn't - and can't, given its design - distinguish between them.
I agree entirely that a professional developer would do the necessary pre-code design work to maintain control of the project.
I grew up during the period in the 70’s, 80’s and 90’s when structured programming and design principles were being developed.
When I read about LLMs “refactoring an entire codebase”, let alone generating an entire app as a one-shot, I shudder.
“In a professional setting with an experienced engineer, agent output gets reviewed before it ships.”
And yet Anthropic engineers were asked how many of them push code without review. Quite a few said they did.
And that’s in a presumably professional environment?
What happens when NON-professional amateur engineers are encouraged by “AI influencers” to engage in vibe-coding?
“The vulnerabilities are a flag for the review process, not a verdict on the output itself.”
If there IS a review process. My entire point is that new vibe-coders don’t know to do one, let alone how to do one, let alone a competent security audit - even by just running the LLM and telling it to do one.
“The finding is actually the best argument for what I've been writing about regarding diff discipline and human gatekeeping - not against agentic coding as such.”
I’m not against AI-generated code as such. I’m against people being encouraged to generate code that is 1) not maintainable, 2) insecure, 3) not reviewed.
We all know management wants the code out yesterday - regardless of quality. AI coding has made this an even bigger problem. The security industry is clawing its eyes out over the potential for even more insecure code and vulnerable AI-generated Web sites than the industry was producing before.
I’ve long said that software engineering - isn’t. It’s still a craft profession without serious engineering principles. And since LLMs were trained on it, we should not be surprised if the code quality isn’t any better than that produced by humans.
What I advocate is that deterministic AI systems must be developed to aid software engineers trained in true engineering in the process of software design and production. Non-deterministic LLMs will never be as good.
But that’s not where we’re headed, fueled by “tokenmaxxing.”
I agree with a lot of what you've written here. For example, the security situation was already a mess pre-Gen AI tooling stepping up, and now I've even written about it! >> https://compositecode.blog/2026/03/09/security-was-already-a-mess-generative-ai-is-about-to-prove-it/
We are indeed heading in that direction, and because the SWE-chat study is so often misrepresented, it seems to push us further that way more than it helps. I hope some people are reflecting on it the way you've written about, and I hope the discipline gets put into process, though I kind of doubt it will be. But it's the world we're left with. As for those of us who do apply that discipline, we'll be able to create more, faster, and at higher quality than we've ever had the chance to.
Eventually more of the story will be told, the parts the SWE-chat study just doesn't and can't tell. Until then, I feel it's important to point out the more myopic or misleading narratives.