In June 2025, Linus Torvalds, the inventor of Linux and Git, was asked about “vibe coding”, where non-programmers use large language models to write code. He memorably described it as “Very Inefficient But Entertaining”.
However, the world of AI is far from static. The release of Claude Opus 4.5 in November 2025 seems to have been well received in the software development community, and alongside Google’s Gemini series, the latest ChatGPT 5 model and competitors such as Llama, it is clear that the ability of LLMs to write software is developing apace and has moved on significantly since the early days of generative AI. Combined with complementary tools such as the Cursor AI code editor, software developers now have a range of evolving tools to choose from, and can move beyond producing prototype systems, useful though those can be.
The productivity effect of generating code via an LLM, compared to traditional coding, is much debated. Of course, an LLM can produce a thousand lines of code in moments in response to a text prompt, and no human can do that. However, actual coding is only a fraction of a software developer’s job: they also have to deal with user requirements, design, code review, testing, debugging and so on. Different studies and articles give different numbers, but the consensus seems to be that a developer spends around 20-30% of their time actually writing code. So even if you eliminated coding entirely, with no other offsetting costs, you would make a developer at best around 25-40% more productive. Very good for sure, but not the “10x” figures breathlessly bandied about in LinkedIn posts.
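To make that arithmetic explicit, here is a back-of-the-envelope sketch. The 20-30% figure is the range quoted above; the bound itself is just Amdahl’s law applied to the coding slice of the job, so this is an upper limit, not a prediction.
```python
def best_case_speedup(coding_fraction: float) -> float:
    # Amdahl's law: if only this fraction of the job is sped up (here, made
    # effectively free), overall speed-up is bounded by 1 / (1 - fraction).
    return 1.0 / (1.0 - coding_fraction)

for fraction in (0.20, 0.25, 0.30):
    gain = (best_case_speedup(fraction) - 1.0) * 100
    print(f"coding = {fraction:.0%} of the job -> at most {gain:.0f}% more productive")

# coding = 20% of the job -> at most 25% more productive
# coding = 25% of the job -> at most 33% more productive
# coding = 30% of the job -> at most 43% more productive
```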
A key question, though, is what is the quality of that code, and how maintainable is it? Somewhere between half and two-thirds of corporate IT budgets are spent on support and operations, far more than on new software development. If software is developed twice as fast but the resulting code takes twice as much effort to maintain, that is a net loss, since support costs are ongoing rather than one-off. Consequently, the maintainability of AI-generated code is crucial.
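A toy total-cost-of-ownership comparison makes the point. The numbers below are entirely made up for illustration; only the shape of the trade-off matters.
```python
def total_cost(build_cost: float, yearly_maintenance: float, years: int) -> float:
    # Lifetime cost = one-off build cost plus recurring maintenance.
    return build_cost + yearly_maintenance * years

LIFETIME_YEARS = 10  # assumed lifetime; real enterprise systems often last far longer

traditional = total_cost(build_cost=100, yearly_maintenance=20, years=LIFETIME_YEARS)
fast_build = total_cost(build_cost=50, yearly_maintenance=40, years=LIFETIME_YEARS)

print(f"traditional build:           {traditional:.0f}")  # 300
print(f"2x faster build, 2x upkeep:  {fast_build:.0f}")   # 450 -- a net loss
```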
Some large surveys of developers have so far produced intriguing results. The December 2025 DORA survey found that 90% of developers use AI, yet 30% report little or no trust in the code it generates; only 20% report that they trust it a lot. A July 2025 randomised controlled trial by METR (Model Evaluation and Threat Research) found that developers who believed AI had made them about 20% more productive had actually been 19% less productive.
What does this mean for the use of LLMs for coding in enterprises? To begin with, very few corporates are greenfield sites awaiting brand-new software applications. Large companies typically have millions of lines of code spread across hundreds of applications in a variety of programming languages, as well as commercial software packages. Some of these applications have been running for decades, and it is very hard to justify replacing an old, large system that works just fine. Many banks and insurance companies have systems that are 20-40 years old, often comprising millions of lines of code. It has been estimated that perhaps 200-800 billion lines of COBOL code are still running operationally today; bear in mind that COBOL was first specified in 1959. No one is going to try replacing a ten-million-line COBOL system with an LLM-generated replacement, if for no other reason than that an LLM cannot see millions of lines of code at once, owing to context window limits.
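A rough sanity check shows why. The tokens-per-line figure and the window size below are assumptions (COBOL is verbose, and even a 1M-token window is at the generous end), but the orders of magnitude are what matter.
```python
TOKENS_PER_LINE = 12            # rough assumption for verbose COBOL source
LINES_OF_CODE = 10_000_000      # the ten-million-line system mentioned above
CONTEXT_WINDOW = 1_000_000      # tokens; an optimistic, state-of-the-art window

total_tokens = LINES_OF_CODE * TOKENS_PER_LINE
print(f"codebase:       ~{total_tokens:,} tokens")        # ~120,000,000 tokens
print(f"context window:  {CONTEXT_WINDOW:,} tokens")
print(f"ratio:           ~{total_tokens // CONTEXT_WINDOW}x too large to fit")
```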
One important thing to understand is that an LLM does not keep a persistent model of the system it generates. If it is later asked to modify a system it once generated, it first assesses the current code structure, modifies just the parts relevant to the change, and then outputs a patch or edited sections. Someone, either a human or possibly another software agent, then reviews the proposed edits, runs any tests that are needed and, once the tests pass, commits the new code to production. The context window limits how much of a system an LLM can see at once, so for a very large application the LLM will make edits one module at a time. This has issues, since there may well be dependencies between modules, and it relies on there being well-defined schemas, interfaces, APIs and so on. To be fair, the same is true of human developers modifying a large system: they don’t hold the whole codebase in their heads either. But LLMs will struggle when dependencies are semantic or behaviour is implicit. Consequently, LLMs work well with well-architected systems that have clear interfaces and defined types (so you can’t accidentally add a number to a character string; the compiler would reject it). They struggle with more spaghetti-like systems, and where typing is implicit. Again, so do humans, but humans can build a long-term memory and understanding of a complex system’s architecture, which LLMs lack.
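A minimal sketch of why explicit types help, in Python with type annotations; the Invoice type and apply_discount function are hypothetical. With annotations in place, a static checker such as mypy rejects a type-mixing edit before it is ever run, whether the edit came from a human or a model.
```python
from dataclasses import dataclass

@dataclass
class Invoice:
    invoice_id: str
    amount_pence: int          # money held as integer pence, never a float

def apply_discount(invoice: Invoice, discount_pence: int) -> Invoice:
    """Return a new invoice with the discount deducted."""
    return Invoice(invoice.invoice_id, invoice.amount_pence - discount_pence)

# A careless edit that mixes up the types:
#
#     apply_discount(invoice, "5.00")
#
# is flagged by a static checker such as mypy ('str' is incompatible with
# 'int') without running anything. In an untyped, implicitly converting
# codebase, the same mistake would only surface at runtime, if at all.
```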
Because LLMs cannot hold a complete global dependency graph of a large system, errors may creep in when they generate code. For example, an LLM may not spot every place a changed data structure is used. Another issue is that a human-coded system in a particular company may follow specific local naming and layout conventions. An LLM generating modifications may not be aware of these, so its code may be correct yet harder for a human to read and understand, because it follows different conventions from the rest of the system being modified.
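One practical mitigation is to derive the dependency information mechanically rather than hoping the model notices it. A minimal sketch, assuming a Python codebase and a hypothetical field name, that lists every access to a field being changed so the full list can be fed to the model or the reviewer:
```python
import ast
from pathlib import Path

FIELD = "amount_pence"   # hypothetical name of the field being changed

def find_field_uses(root: str) -> list[tuple[str, int]]:
    """Return (file, line) pairs for every attribute access on FIELD."""
    hits: list[tuple[str, int]] = []
    for path in Path(root).rglob("*.py"):
        tree = ast.parse(path.read_text(), filename=str(path))
        for node in ast.walk(tree):
            if isinstance(node, ast.Attribute) and node.attr == FIELD:
                hits.append((str(path), node.lineno))
    return hits

if __name__ == "__main__":
    for filename, lineno in find_field_uses("src"):
        print(f"{filename}:{lineno} uses {FIELD}")
```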
In general, LLMs seem to work well for small edits, adding features, fixing simple bugs, and even refactoring specific code modules. They struggle with big architectural changes, cross-file refactoring and other complex changes, and they may introduce new bugs, logic errors and security vulnerabilities. This reinforces the need for comprehensive automated testing, and indeed an LLM can help with defining and setting up such test harnesses.
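For example, a small pytest harness of this kind, written before a generated patch is applied, pins down the behaviour the change must preserve; the apply_discount function here is a hypothetical stand-in for whatever is being modified.
```python
import pytest

def apply_discount(amount_pence: int, discount_pence: int) -> int:
    """Hypothetical function under test."""
    if discount_pence > amount_pence:
        raise ValueError("discount exceeds invoice amount")
    return amount_pence - discount_pence

def test_discount_is_deducted():
    assert apply_discount(10_000, 1_500) == 8_500

def test_discount_cannot_exceed_amount():
    # Pins down an edge case that a generated edit might silently change.
    with pytest.raises(ValueError):
        apply_discount(1_000, 2_000)
```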
LLMs will hallucinate in code just as they hallucinate in text. This means that they will, from time to time, invent functions, variables or APIs that do not exist, or fill gaps in their knowledge with something plausible but wrong. This is especially the case where the LLM lacks the context it needs, is given ambiguous specifications or is dealing with a long chain of reasoning. Humans can help by being very specific in their prompts, for example by being explicit about modifying only one particular function at a time.

LLM-generated code also still has many security vulnerabilities. Models are trained on vast libraries of existing code, and many of those examples are outdated or have security gaps. LLMs do not spontaneously produce threat models unless told to, and usually assume input data is well formed and trustworthy. Generated code may use weak hash functions or insecure defaults, or even embed passwords and API keys in plain sight. The October 2025 Veracode study tested 100 different LLMs across 80 different coding tasks, and fully 45% of the generated code contained vulnerabilities; several other studies have come to similar conclusions. The need for security reviews of generated code clearly counts against the gains of rapid code generation.

One thing that does not seem to be an issue is the performance of the code: a University of Lille study of 18 LLMs found that LLM-generated code ran at least as fast as human-written code.
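Returning to the security point, the vulnerability patterns described above are easy to show side by side. A contrast sketch with hypothetical names: the first style hardcodes a secret and uses a weak, unsalted hash, while the hardened version takes the secret from the environment and uses a salted, purpose-built password hash.
```python
import hashlib
import hmac
import os

# --- the insecure style often seen in generated code ------------------------
API_KEY = "sk-live-123456"                               # secret embedded in source

def store_password_weak(password: str) -> str:
    return hashlib.md5(password.encode()).hexdigest()    # weak, unsalted hash

# --- a hardened equivalent ---------------------------------------------------
def get_api_key() -> str:
    key = os.environ.get("API_KEY")                      # secret supplied at runtime
    if not key:
        raise RuntimeError("API_KEY is not configured")
    return key

def store_password(password: str) -> str:
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
    return salt.hex() + ":" + digest.hex()

def verify_password(password: str, stored: str) -> bool:
    salt_hex, digest_hex = stored.split(":")
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode(), bytes.fromhex(salt_hex), 600_000
    )
    return hmac.compare_digest(candidate.hex(), digest_hex)
```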
At this point, the jury is still out on the true productivity effect of LLMs for coding, particularly once you consider the ongoing support and maintenance effort as well as the initial coding. It will be intriguing to see what the major industry developer surveys such as DORA look like in 2026, and what they reveal about the practical use of LLMs compared with the surveys of 2025. This is clearly a rapidly evolving area, and an important one, since coding is emerging as one of the main use cases for generative AI. However, more research is needed into the maintainability of generated code, and into the effects of factors like governance, testing and architecture, rather than just how fast code can be generated.







