Large language models (LLMs) are the underlying technology of popular generative AI applications such as ChatGPT, Gemini, Claude, Perplexity and Grok. They are well known for their ability to generate fluent text in response to natural language prompts from users, such as: “write me an essay on the industrial revolution”, or “draft me a letter for an email marketing campaign as follows…” As well as generating content in English, Chinese, French, Spanish and other languages, they can also generate program code in popular programming languages such as Python, C++, JavaScript and more. Indeed, software engineering has emerged as one of the most promising use cases for LLM technology, alongside image recognition and generation, and language translation.
So, how good are these tools in practice? Software engineering is more than just writing code. Indeed, programming represents only a fraction of the job of a typical software developer. There is also analysis of user requirements, design, testing, documentation, debugging and research, quite apart from meetings, communication, training and admin. A Microsoft study found that programmers would ideally spend about 25% of their time coding, but in practice it was 14%. Even if coding were entirely automated, then, the overall productivity gain would be capped by the small share of a developer's time that coding actually occupies. Moreover, producing code very fast is not necessarily a good idea if it harms code quality, stability or maintainability. Corporate IT budgets typically allocate half or more of their money to support and maintenance, so producing buggy code faster may actually be counter-productive.
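To put a rough upper bound on this (my own back-of-the-envelope illustration using the 14% figure above, in the spirit of Amdahl's law, not a number taken from the studies cited here): if coding occupies 14% of a developer's time and everything else stays the same, then even reducing coding time to zero gives an overall speedup of at most

\[
\text{maximum overall speedup} = \frac{1}{1 - 0.14} \approx 1.16,
\]

that is, roughly a 16% reduction in total time, even in the best case where coding itself becomes instantaneous.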
The picture for the productivity of AI-generated code is nuanced and only beginning to emerge. The very large annual DORA survey of 39,000 software developers found in 2025 that AI-generated code was less stable than human-written code, and that using it slightly worsened software delivery productivity. Three-quarters of respondents felt that AI had given them at least some productivity gains, but over 39% had little or no trust in AI-generated code.
When I recently interviewed the chief technology officer of a large software company, he explained that his engineering teams had found that LLMs excelled at certain tasks but not others. They had been excellent at relatively straightforward, repetitive tasks, were good at documentation and had been surprisingly helpful with debugging. They also excelled at generating comprehensive automated tests. On the other hand, they struggled with large, complex existing code bases, and their success rates dropped off rapidly when presented with novel or unusual problems, presumably because such problems were not well represented in their training data. Research has shown that, in general, LLMs are weak at precision and spatial reasoning, and tend to generate code with security lapses. One study found that GitHub Copilot introduced security issues in almost a third of its Python code and a quarter of its generated JavaScript code.
It is well known that LLMs hallucinate fabricated or meaningless answers at a steady rate, perhaps 20% or more on average, with newer models actually being worse. This has had serious consequences in the legal field, where hundreds of court cases have already been affected by lawyers using LLMs to generate documents containing plausible but fictitious case-law precedents. The same issue occurs in code. LLMs will hallucinate some software package names when asked to write code in languages like Python and Ruby. What is intriguing is that some of the hallucinated names crop up repeatedly. This was tested by Lasso Security, who found that LLMs often (around 19% of the time) hallucinate the same software library, framework or package. This creates a worrying security loophole: hackers can register the names of commonly hallucinated software libraries and publish real packages under them that contain malware. In one test, thousands of downloads were made of such a dummy Trojan-horse package that had been uploaded to AI infrastructure and tools company Hugging Face. The research was repeated in 2025 by the security company Socket, which found even higher consistency in the package names that LLMs hallucinated.
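One practical line of defence is simply to screen dependency names that appear in AI-generated code before installing anything. The sketch below is a minimal illustration of the idea, not a vetted security tool: it assumes PyPI's public JSON endpoint (https://pypi.org/pypi/<name>/json), which returns HTTP 404 for names that have never been published, and the package names used are made-up examples. Crucially, a name existing on PyPI is no guarantee of safety, since, as described above, an attacker may already have registered a commonly hallucinated name; unfamiliar packages still deserve manual review.

```python
# Minimal sketch: flag dependency names from AI-generated code that do not
# exist on PyPI at all (likely hallucinations). Existence alone is NOT proof
# of safety -- an attacker may already have registered a commonly hallucinated
# name -- so unfamiliar packages should still be reviewed by a human.
import urllib.error
import urllib.request


def exists_on_pypi(name: str) -> bool:
    """Return True if `name` is a published PyPI package, False if it is not."""
    url = f"https://pypi.org/pypi/{name}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False  # never published: possibly a hallucinated name
        raise  # other failures (e.g. rate limiting) need a human to look


if __name__ == "__main__":
    # "llm-suggested-helper" is a hypothetical example of a hallucinated name.
    suggested = ["requests", "numpy", "llm-suggested-helper"]
    for pkg in suggested:
        verdict = ("found on PyPI (still review it)" if exists_on_pypi(pkg)
                   else "NOT on PyPI - do not install")
        print(f"{pkg}: {verdict}")
```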
Overall, the picture for AI-generated code is complex. There is no doubt that LLMs can generate code very quickly, and they can be effective in other software engineering tasks such as documentation, debugging and testing. They can produce prototypes rapidly, which can help to pin down user requirements faster than before. On the other hand, the code they produce has substantial security issues and is less stable than good-quality human code. The situation is evolving rapidly, with huge amounts of venture capital being invested in AI start-ups, many of them operating in the software engineering area. Consequently, it is likely that early lessons will be learnt, tools will improve, and software engineers will get better at figuring out which tasks LLMs excel at and which they do not. At present, though, it would be prudent to have realistic expectations about the effect of LLMs on software productivity. They undoubtedly have a place, but they will not be replacing programmers en masse in the way that some newspaper headlines have breathlessly claimed. Instead, LLMs will augment programmers, helping with the particular tasks they are well suited to and taking on the mundane, repetitive work that human programmers often dislike. Although a high proportion of programmers currently distrust the code that LLMs produce, it is likely that, as time passes, software engineers will become smarter about how to use them most effectively. The future of programming will involve AI, but working alongside humans rather than replacing them.







