How would you react if your AI costs went up an order of magnitude next month? AI promises productivity improvements in areas such as software code generation, but at what economic cost? Software budgets in large enterprises have traditionally been heavily skewed towards support and maintenance, which typically absorb 60% or more of the total. Hardware compute costs, by contrast, have been relatively well understood: on-premises data centres carry known maintenance costs, while public cloud services charge monthly for processing. The latter can cause unpleasant surprises – bills for AWS or Azure can spike if something unexpected happens to application workloads – but enterprises have had some years to get to grips with managing and controlling these, given broadly predictable workloads. With AI chatbots, agents and applications, and AI-generated code, there is a new factor: inference costs – the costs of executing AI models. These may include the price of tokens consumed within the models (assuming an API call to a hosted service) or direct processing costs if the AI models are hosted on-premises.
Traditional IT applications had mostly predictable running costs once they were in production. This is not the case with AI models, where customers may interact with chatbots in unpredictable ways, and AI agents may make large numbers of API calls. There is increasing evidence that enterprises are struggling with unpredictable AI inference costs. In one example revealed by Forbes magazine, a company that spent $200 a month on AI while building an application saw its bill rise to $10,000 a month once it went into production – a 50-fold increase. Multi-step AI agents may call models many times for each task, swelling the inference cost. The way that prompts are phrased can significantly affect token costs, and you may not be able to control this factor if you are exposing your models to interaction with customers or employees. Although the unit cost of inference is falling, total consumption of inference is still increasing. This is an example of the economic phenomenon known as the “Jevons paradox”, where efficiency gains in a resource lead to an increase, rather than a decrease, in total consumption.
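The multiplier effect of agents is easy to see with a back-of-envelope calculation. The sketch below uses purely illustrative figures (task volumes, token counts and a token price are assumptions for the example, not quoted vendor prices) to show how the same workload, reworked as a multi-step agent with longer prompts, multiplies the monthly bill:

```python
# Back-of-envelope estimate of monthly inference spend.
# All numbers below are illustrative assumptions, not real vendor prices.

def monthly_inference_cost(tasks_per_month: int,
                           model_calls_per_task: int,
                           tokens_per_call: int,
                           price_per_million_tokens: float) -> float:
    """Total monthly token spend in dollars."""
    total_tokens = tasks_per_month * model_calls_per_task * tokens_per_call
    return total_tokens / 1_000_000 * price_per_million_tokens

# A single-shot chatbot: one model call per task.
chatbot = monthly_inference_cost(10_000, 1, 2_000, 10.0)   # $200/month

# The same workload as a 10-step agent with longer prompts per step.
agent = monthly_inference_cost(10_000, 10, 5_000, 10.0)    # $5,000/month

print(f"chatbot: ${chatbot:,.0f}/month, agent: ${agent:,.0f}/month")
```

Nothing about the workload itself changed here – only the number of calls per task and the prompt length – yet the bill grows 25-fold, which is why agentic architectures make inference budgets so hard to forecast.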
The situation is exacerbated by the fact that the frontier labs producing large language models such as ChatGPT and Claude are currently subsidising inference costs. Only around 5% of ChatGPT users currently pay a licence fee. OpenAI and Anthropic are losing billions of dollars a month at present in an effort to gain market share, supported by their venture capital investors. At some point these investors will expect to see a return in the form of profits, and at that point inference prices will likely increase sharply. This means that enterprises budgeting for inference costs now may be in for some unpleasant surprises down the line. What is troubling is that enterprises seem to be experiencing unexpectedly high inference costs right now, even with the subsidised token fees that they are paying. AI budgets now involve as much as 85% in inference costs (up from 20% in 2023). In 2026, more money (over $50 billion) is being spent on inference than on training AI models, which is itself notoriously expensive. Although there is limited formal research data, reports abound on social media of spikes in inference costs and budgets being spent dramatically faster than expected. To be sure, companies are still learning about the effects of AI systems in production, and doubtless they will find ways to limit unexpected budget spikes, but it is worrying that budget surprises are happening in a period when inference is being priced by the frontier labs at loss-leading levels. When economic gravity kicks in and frontier labs start to charge realistic prices, the problem will worsen.
Much has been made of the increasingly high quality of code generated by Claude in particular since late 2025. However, as more and more applications are deployed – especially AI agents whose code has been written by AI rather than by humans – it is to be expected that the inference costs of that code will be harder to predict. If humans are no longer writing the code, and AI agents are deciding how many API calls to make without human intervention, then inference costs are likely to spike. This increase in costs may heavily offset any coding productivity gains experienced.
Data sovereignty is another issue to consider as AI applications roll out. The EU AI Act imposes a number of obligations regarding data governance and, in some cases (such as regulated industries and public sector contracts), the location of European data in cloud-based data centres. The act takes effect from August 2026, and heavy fines can be imposed on companies that flout the rules. We can expect this to have a further cost impact on AI budgets, as enterprises seek to avoid the legislative risk – introducing additional indirect inference cost overheads through data residency constraints and infrastructure duplication.
One consequence of all this is that we can expect to see heightened interest in low-cost LLMs. Examples include the open-source models from China such as DeepSeek, Qwen and Kimi K2. These models might not be quite as capable as Claude or ChatGPT, but for many purposes they are good enough, and the performance gap between these Chinese models and the leading US ones appears to be closing.
We are living in unpredictable times, and AI inference budgets are just one aspect of that unpredictability. It is important that enterprises examine this area carefully when setting AI budgets, and keep a tight leash on inference costs. This may mean looking at cheaper AI models, negotiating fixed-price contracts, or carefully examining the efficiency of AI applications. In particular, enterprises need to be aware of the spike in inference prices that is likely to occur in due course, as investors begin to demand that AI companies turn a profit rather than subsidise growth.
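One concrete way to keep that leash tight is a hard spending cap enforced in the application itself, rather than waiting for the end-of-month bill. The sketch below is a minimal, hypothetical budget guard – the class name, token price and limits are assumptions for illustration; a real deployment would reconcile against the provider's actual billing data:

```python
# A minimal sketch of a hard monthly budget guard around model calls.
# The price and limits are hypothetical; real systems should reconcile
# this running estimate against the provider's billing records.

class BudgetExceeded(RuntimeError):
    """Raised when estimated spend crosses the configured cap."""

class InferenceBudget:
    def __init__(self, monthly_limit_usd: float,
                 price_per_million_tokens: float):
        self.limit = monthly_limit_usd
        self.price = price_per_million_tokens
        self.spent = 0.0

    def record(self, tokens_used: int) -> float:
        """Record one call's token usage; raise once the cap is breached."""
        self.spent += tokens_used / 1_000_000 * self.price
        if self.spent > self.limit:
            raise BudgetExceeded(
                f"spent ${self.spent:.2f} of ${self.limit:.2f} budget")
        return self.spent

# Example: a $500/month cap at an assumed $10 per million tokens.
budget = InferenceBudget(monthly_limit_usd=500.0,
                         price_per_million_tokens=10.0)
budget.record(2_000_000)   # $20 of the budget consumed so far
```

A guard like this turns a surprise invoice into an application-level event that can trigger throttling, a fallback to a cheaper model, or an alert – all options the article's recommendations imply.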