The Box is Open
Published:
The myth of Pandora is not a story about curiosity or punishment. It is a story about irreversibility. Once the box is opened, the contents are out. Nothing that follows changes that fact. The story does not end with someone finding a way to put everything back.
Spend enough time in technology and you recognise the pattern. The personal computer could not be uninvented. The internet could not be uninvented. The smartphone could not be uninvented. Each of these transitions produced a period of resistance, a period of hype, and then a period of reckoning, where the genuine economics of the technology became clear and the businesses that had built on convenient fictions had to adapt or be overtaken by those that had not.
We are in that reckoning period now for artificial intelligence, and I think most organisations are not yet looking at the right problem.
The question most businesses are asking is whether to adopt AI. That question is largely settled. The evidence is sufficiently clear that teams using capable AI tools are more productive, that certain categories of knowledge work are being transformed, and that the gap between AI-augmented and non-AI-augmented competitors will compound over time. Debating whether to adopt is fighting yesterday’s battle.
The question that actually matters is more structural: who pays for this AI, on whose terms, from where does it run, and what happens to your organisation when the pricing assumptions you have built upon change?
My argument is this: Cloud AI is not priced at its true cost. It is subsidised infrastructure, built on speculative capital, and that subsidy has a finite lifespan. The use cases that make AI genuinely transformative, specifically the agentic, iterative, autonomous workflows that produce the biggest productivity gains, are also the most expensive to run in the cloud. Local and on-premises deployment is no longer a fringe option reserved for privacy enthusiasts and research labs. The hardware is serious, the models are capable, and the economics are compelling for a wider range of workloads than most businesses have considered. The organisations that understand this and plan accordingly will have a structural advantage over those that do not.
This is not an argument against cloud AI. It is an argument for being deliberate about where AI runs and why, rather than defaulting to cloud because that is where it started.
The Subsidised Era
Start with the numbers, because they are striking.
OpenAI, the company that has done more than any other to define the public perception of what AI is and what it costs, is projected to lose approximately $14 billion in 2026. This is against annualised revenue of around $20 billion. Cumulative losses through 2028 are projected to reach $44 billion. The company does not expect to reach profitability until 2029, a timeline that depends on continued revenue growth at a rate that would be extraordinary for any business, let alone one operating at this scale of infrastructure cost.
The pricing you pay today for OpenAI’s API or Claude’s API does not reflect the cost of delivering that inference. It reflects the price at which those companies have chosen to acquire market share while they are able to fund the gap from investor capital. This is not a criticism of the business model; it is a description of it. The logic is coherent: establish deep adoption, build ecosystem lock-in, demonstrate capability, and then either reach the scale at which unit economics improve or raise prices once the switching costs are prohibitive. This is a well-understood playbook in technology.
Anthropic presents a partial counterpoint. The company is growing faster than its projections suggested and has indicated a path to profitability sooner than OpenAI’s timeline. I use Anthropic’s products every day and think highly of what the company has built. But the underlying dynamic is the same: current pricing is not a steady-state reflection of compute cost. It is a position on a pricing curve that has not yet reached its natural equilibrium. The difference between OpenAI and Anthropic on this question is one of degree and timeline, not direction.
Both companies face mounting IPO pressure. Both have investors who have deployed capital at valuations that require either sustained high-growth revenue or, eventually, profitability. The mechanism by which losses are resolved is not complicated: costs come down as scale improves infrastructure efficiency, and prices go up as the competitive moat deepens and the subsidy becomes less necessary. You can debate the relative contribution of each, but the direction is not ambiguous.
I explored the fragmented and often opaque nature of AI billing models in an earlier piece on the true cost of AI inference. The point there was about the difficulty of understanding what you are currently paying. The point here is about what you will be paying in three to five years. The answer is: more. Probably significantly more, for the workloads where you are most dependent.
Building your AI strategy on the assumption that today’s pricing persists is not prudent planning. It is an undeclared bet on someone else’s balance sheet.
The Agentic Paradox
There is a particular irony at the centre of the current moment. The AI capabilities that are most transformative are also the most expensive to deliver via the cloud.
Think about the spectrum of AI use. At one end: a single completion. You send a prompt, receive a response, the interaction is done. A few hundred tokens, in and out. At the other end: a fully autonomous agentic loop. The model receives a high-level objective, breaks it into subtasks, reads files, runs tools, evaluates outputs, revises its approach, and iterates until the objective is achieved. Each step in that loop is an API call. A complex agentic task might involve dozens of planning steps, hundreds of tool invocations, and thousands of tokens of context being passed back and forth at each stage.
The relationship between capability and token consumption is not linear. It is steep. And the capabilities that are generating genuine productivity transformations for engineering teams, things like agentic coding, automated testing, iterative document synthesis, multi-step research, sit at the expensive end of that spectrum.
I use Claude Code extensively in my own work, and I have written about how agentic tools are reshaping professional workflows in ways that go well beyond code generation. What that experience has made viscerally clear is the token economics. A simple code completion: a few thousand tokens. An agentic session working through a complex feature implementation, with file reads, test runs, plan revisions, and context accumulation: potentially tens of thousands of tokens, for a single developer, in a single session.
Subscription pricing obscures this today. A monthly flat fee smooths the variability and makes the marginal cost of each session feel like zero. That is convenient, and it is also misleading. The actual compute cost of that session is not zero, and as the profitability pressure on frontier model companies intensifies, the gap between the cost they bear and the price you pay will narrow. For light users, this may not matter. For the organisations where AI-assisted development has become the primary mode of software production, where developers are running agentic sessions for hours every day, the per-session economics matter enormously.
This is what I mean by the agentic paradox. The more seriously you use AI, the more your workload resembles the use case that is most financially unsustainable for cloud delivery. The subscription model is a gift to heavy users today. It is not a long-term structural feature of the pricing landscape.
AI is not an engineering strategy in isolation, as I have argued elsewhere. But deploying AI seriously, at depth, in the ways that actually transform how engineering teams work, requires thinking carefully about the infrastructure that underpins it. Right now, most organisations are not.
What Local Actually Means
When people hear “local AI” they tend to picture something small and disappointing. A chatbot that runs on a laptop, noticeably less capable than the cloud alternatives, acceptable only if you are unusually privacy-conscious or unusually cost-sensitive. That picture was accurate two years ago. It is not accurate today.
The open-weight model landscape has matured considerably and rapidly. Let me be specific about what is available now, because the abstraction “open-weight models exist” undersells how significant the current state is.
The current leading models, with honest notes on each:
- Qwen3 (Alibaba Cloud, Apache 2.0) is, as of mid-2026, the strongest open-weight family available. The range spans 8 billion to 235 billion parameters, with the larger models competitive with frontier closed models on coding and reasoning benchmarks. Qwen3-Coder, the purpose-built agentic coding variant, is a serious alternative to frontier coding models for iterative, tool-using development work.
- Meta Llama 4 offers strong general capability with multimodal input support and a commercially permissive licence. For organisations building products with multilingual requirements, it is a robust foundation.
- DeepSeek V3 and R1 stand out for reasoning. R1 performs at a level competitive with o1-class reasoning models on complex mathematical and scientific tasks. The caveat for Western organisations is real: the provenance of a model trained and released by a Chinese company is a legitimate data sovereignty consideration. Whether it is disqualifying depends on your use case, your regulatory environment, and your threat model. It is worth naming rather than eliding.
- Microsoft Phi-4, at around 14 billion parameters, achieves performance per parameter that is remarkable for its size. A sensible entry point for organisations with constrained hardware; for a significant proportion of structured business tasks, Phi-4 on a decent workstation delivers results genuinely difficult to distinguish from much larger cloud models.
- Mistral Large 3 is a capable, permissively licensed all-rounder, well-suited to European organisations where GDPR considerations make cloud AI complicated.
On the tooling side, two pieces of infrastructure matter most:
- Ollama is the developer-friendly entry point. A single command downloads and serves any of the major open-weight models, handles quantisation transparently, and gives a developer a local inference environment in minutes. It is not a production inference server, but it does not need to be. For individual developer use and for organisations evaluating local deployment before committing to infrastructure, Ollama is the right starting point.
- vLLM is the answer for production deployment serving multiple concurrent users. It is a high-throughput inference server with PagedAttention for efficient memory management, continuous batching for concurrent requests, and the operational characteristics you need to serve an organisation rather than a single developer.
Now for the honest assessment of where the capability gap sits, because pretending it does not exist would be unhelpful.
Frontier models from Anthropic, OpenAI, and Google retain a genuine advantage in a specific set of demanding tasks: complex multi-step reasoning over novel problems, highly ambiguous creative generation, tasks that benefit from very recent world knowledge, and edge cases in domains where the training data for open-weight models is thinner. If you are building a system that needs to reason through genuinely novel scientific problems, or generate sophisticated legal arguments about recent developments, or handle the long tail of unusual customer queries with full contextual nuance, the frontier models remain meaningfully better.
What that gap does not cover is the majority of what most businesses are actually doing with AI:
- Document summarisation and extraction
- Internal question-and-answer systems operating over known corpora
- Code review and test generation
- Structured data analysis
- First-draft report generation from templates
- Sentiment classification
For this category of work, which represents the bulk of current business AI deployment, the performance difference between a well-chosen open-weight model and a frontier model is negligible in practice. The gap is real; it is also narrowing fast, and it has already ceased to matter for a substantial proportion of business use cases.
The question is not whether local models are as good as frontier models in every respect. They are not. The question is whether they are good enough for your specific workloads, and whether the cost and sovereignty advantages of running them locally outweigh the remaining capability gap. For a growing number of workloads, the answer is yes.
Deployment Follows the Problem
This is where I want to spend the most time, because I think it is where most organisations are getting the analysis wrong.
The dominant assumption in AI adoption right now is that cloud is the default, and local deployment is the exception, chosen when you have a specific reason to deviate. I think this gets the logic exactly backwards. The right question is not “why would we not use cloud?” It is “what does this specific workload require, and which deployment model best serves those requirements?”
That reframing matters because it changes what you optimise for. A default-to-cloud approach optimises for convenience of adoption. A problem-first approach optimises for fit. Here is how the analysis breaks down across the four natural categories of business AI use.
Deployment model by workload type. Regulated and sensitive data may require on-premises regardless of quadrant.
-
Developer tooling. Agentic coding, code review, automated test generation, refactoring assistance. These workloads have three characteristics that make local deployment compelling. Token volumes per session are high; a serious agentic coding session generates more tokens than almost any other business use case. The input material, the codebase itself, is typically among the most sensitive data in the organisation. And the primary consumer is a single developer at a workstation, not a concurrent multi-user system requiring high availability.
Put those together and the case for local deployment is strong. A developer running Qwen3-Coder locally on a capable workstation gets a private agentic coding environment with zero marginal cost per session, no data leaving the organisation, no dependency on network availability, and response latency determined by local hardware rather than API queuing. The crossover point at which the capital cost of that hardware becomes cheaper than the equivalent cloud API spend arrives faster than most people expect, particularly as agentic usage patterns drive session token consumption into the tens of thousands.
-
Internal business workflows. Document processing, internal question-and-answer systems, summarisation, data analysis, internal chatbots operating over company knowledge bases. This category typically involves multiple concurrent users rather than a single developer, which changes the infrastructure picture. A single developer can be well-served by a desktop machine; a team of fifty people querying an internal knowledge base simultaneously requires a more capable inference server.
The right answer here is an on-premises inference server, running vLLM on suitable hardware, serving the whole organisation. This satisfies compliance requirements structurally rather than through contractual assurances. Data never leaves the organisation’s infrastructure. The cost per query, once the hardware is amortised, approaches zero. For regulated industries in particular, this is not just cost-efficient; it eliminates an entire category of compliance and legal exposure that cloud AI creates. I have written separately about the risks of building AI on ungoverned data, and an on-premises deployment addresses many of those risks at the infrastructure level rather than requiring data governance controls to compensate for a cloud dependency.
-
Customer-facing features. Chatbots, recommendation engines, real-time personalisation, customer service automation. This is the category where cloud retains a genuine structural advantage for most businesses, and I want to be honest about that rather than overstating the local case.
Customer-facing AI workloads have characteristics that make cloud attractive: demand is variable and potentially very spiky, geographic distribution matters for latency, uptime requirements are typically higher than for internal tooling, and the concurrency requirements at peak can be substantial. Replicating the elasticity of a major cloud provider’s inference infrastructure with owned hardware requires engineering overhead that most organisations cannot justify unless they are operating at significant scale. For customer-facing AI, cloud is still the right default for most businesses today.
The qualification is “most businesses.” Organisations with sufficient scale, existing investment in data centre infrastructure, and engineering teams capable of operating it are increasingly deploying on-premises inference clusters with proper redundancy for customer-facing workloads as well. This is viable; it is not yet the default, and it should not be recommended as the starting point for most organisations.
-
Regulated or sensitive data processing. Healthcare, legal, financial services, government, and any other context where the data being processed has compliance requirements that constrain where it can be processed. This category cuts across the other three. An internal knowledge base in a healthcare organisation is subject to HIPAA requirements regardless of whether that knowledge base is an “internal workflow” in abstract terms. A legal firm’s document processing is governed by client confidentiality requirements that do not yield to a cloud provider’s data processing agreement, however well-drafted.
For these organisations, cloud AI was never a fully clean option for sensitive data, and many have been navigating that complexity through a combination of careful data classification, contractual arrangements, and frankly some deliberate ambiguity about what data was and was not being processed. What has changed is that the on-premises alternative is now capable enough that the compliance-driven choice does not require a significant capability sacrifice. Previously, organisations in regulated industries faced a genuine dilemma: the capable AI was in the cloud, and the cloud was problematic from a compliance standpoint. That dilemma is resolving. The capable AI is increasingly available on-premises.
The realistic near-term strategy for most businesses is hybrid: local or on-premises for developer tooling and internal workflows, particularly where data sensitivity is high or token volumes are significant, and cloud for customer-facing features and for the specific frontier-model tasks that genuinely require capabilities not yet available in open-weight models. This is not a permanent state; it is the appropriate posture for 2026, and it will shift towards more local deployment as the capability gap continues to close.
The hidden costs of AI adoption extend beyond token pricing, and the deployment model affects almost all of them. Organisations that have mapped these costs honestly are better positioned to make these decisions well.
The Hardware Signal
Something interesting has happened in the hardware market in the first half of 2026, and it deserves more attention than it has received.
AMD opened pre-orders for the Ryzen AI Halo in June 2026. The specification is notable: Ryzen AI Max+ 395 processor, 128 gigabytes of unified memory, at a retail price of US$3,999. That memory figure is the critical one. The practical constraint on running large language models locally is memory bandwidth and capacity; a 70 billion parameter model requires roughly 40 gigabytes to run in reasonable quantisation, and a 128 billion parameter model requires roughly 70 gigabytes. At 128 gigabytes of unified memory, the Ryzen AI Halo can run models in the 100 to 200 billion parameter range at a workstation price point. This is a desktop machine capable of running the largest open-weight models in the Qwen3 family.

The AMD Ryzen AI Halo developer platform. US$3,999. 128GB unified memory. Capable of running models up to approximately 200 billion parameters. Image: AMD.
NVIDIA’s DGX Spark launched at US$3,999 but the price was raised to US$4,699 in February 2026 following memory supply constraints. The DGX brand has historically been associated with data centre hardware; the Spark is NVIDIA’s entry into the desktop inference market, explicitly designed for local AI workloads. The positioning is deliberate. NVIDIA is not selling a gaming machine that can also run AI. It is selling an AI inference workstation with the DGX brand attached, signalling that this is a serious product for serious AI workloads, not a hobbyist device.
AMD, for its part, pitched the Ryzen AI Halo directly against the DGX Spark at a US$700 discount. Two of the largest hardware manufacturers in the world have independently arrived at the same product category, at the same price bracket, at the same moment. This is not coincidence. Hardware development cycles are long. The decisions to build these products were made eighteen months to two years ago, based on projections about where the AI deployment market would be heading. Both AMD and NVIDIA saw the same signal: a coming shift in where inference happens, from cloud to local, driven by economics and capability convergence. They have built the infrastructure in anticipation of the migration.
The infrastructure arriving ahead of the wave is itself a signal about the wave.
For on-premises deployments at organisational scale, the relevant hardware sits a tier up: NVIDIA H100 and H200 GPU clusters, AMD Instinct MI300X, and the infrastructure products built around them. These are not cheap; a multi-GPU server with H100s costs six figures. But for an organisation processing significant AI workloads across a team, the amortised cost over three to five years looks very different from the headline capital figure, particularly when compared against the cloud API spend those workloads would otherwise generate.
The Cost Structure Inversion
Cloud AI and local AI have fundamentally different cost structures, and understanding that difference is prerequisite to making good deployment decisions.
Cloud AI is a variable cost. The upfront commitment is low: an API key and a credit card. The marginal cost accrues per token, per request, per session. This makes the initial adoption easy and the total cost opaque. A team of five developers using an AI coding tool all day generates token costs that compound invisibly behind a subscription fee. When the subscription model shifts towards consumption-based pricing, or when subscription prices increase to reflect actual cost, that invisibility becomes a problem.
Local AI is a capital cost. A significant upfront expenditure on hardware, near-zero marginal cost thereafter. The depreciation on a US$4,000 workstation over three years is roughly US$1,300 per year. The inference cost for any number of sessions run on that workstation during that period is the electricity bill. For a developer running meaningful agentic coding sessions, the crossover point at which local infrastructure becomes cheaper than equivalent cloud API spend can arrive within months.
At organisational scale the arithmetic is more compelling still. An on-premises inference server costing US$50,000 serving fifty developers for three years has a total infrastructure cost of roughly US$16,000 per year, plus electricity and maintenance. Fifty developers running Claude Code or equivalent at meaningful agentic workload levels through cloud APIs would generate costs that, depending on usage patterns, could reach multiples of that figure annually and would increase as pricing pressure normalises.
I want to be honest that this arithmetic is sensitive to assumptions about usage intensity, model choice, and pricing trajectories. I am not presenting it as a precise financial model; I am making the point that the crossover exists, that it arrives sooner than most organisations expect, and that current subscription pricing is not a reliable basis for projecting future costs.
Illustrative cost crossover for a single developer: US$500/month cloud equivalent versus US$4,000 hardware plus US$100/month local. Heavy agentic usage accelerates the crossover.
Beyond the direct cost comparison, local and on-premises deployment offers structural advantages that have value independent of the per-token economics:
- Data sovereignty: your proprietary code, your customer data, your internal documents, never traverse a third-party network.
- Privacy: no query telemetry, no training data contribution questions, no ambiguity about what a vendor’s data processing agreement actually covers.
- Vendor independence: if a model provider changes its pricing, changes its usage policies, or makes decisions about what its model will and will not do, your local deployment is unaffected.
For regulated industries in particular, the alternative to local deployment is a recurring legal and compliance cost that the capital expenditure on owned infrastructure simply eliminates.
What Businesses Should Do Now
The immediate practical question is where to start, and my strong view is that the right starting point is a use case audit, not an infrastructure decision.
Before you evaluate hardware or choose models or design deployment architecture, you need to understand what your organisation is actually doing with AI today and what it is planning to do. Map your current and planned AI usage against the four categories described above: developer tooling, internal workflows, customer-facing features, regulated or sensitive data. Be honest about which workloads are generating the most token volume and where your data classification concerns are highest.
Most organisations I speak with do not have a clear picture of their AI spending at a workload level. They have a subscription line on a credit card and a rough sense that developers are using AI assistants. They do not know which teams are the heaviest users, which use cases are generating the most tokens, or whether the workloads generating the most cost are actually the ones generating the most value. That information gap makes it impossible to make good infrastructure decisions.
Once you have mapped your usage, identify the candidates for local or on-premises deployment:
- Strong candidates: developer tooling with sensitive codebases; internal knowledge base and document processing workloads, particularly where compliance considerations apply.
- Poor candidates (for most businesses): customer-facing features, unless you are operating at a scale that justifies the infrastructure and operational overhead.
Build a roadmap before pricing pressure forces the decision reactively. A reactive decision made under cost pressure is a worse decision than one made with time to evaluate options, run pilots, and build internal capability. The organisations that will navigate the normalisation of AI pricing best are those that have already understood their dependency, identified their migration path, and begun building the internal competence to operate local inference infrastructure. The organisations that will struggle are those that discover their cloud AI bill has doubled when they have no alternative in place and no internal knowledge of how to build one.
This is a strategic planning question, not a hardware procurement question. The right owner is the engineering or technology leadership function, in conversation with finance and legal, with a clear brief that connects AI infrastructure decisions to the organisation’s broader technology strategy. I have written about what it means to have an AI strategy rather than just AI adoption, and this is precisely the kind of decision that distinguishes one from the other.
The Box Stays Open
Let me close by being clear about what this argument is not.
It is not a case for retreating from AI. The productivity evidence is sufficiently strong, and the competitive implications sufficiently clear, that treating AI as optional is not a strategy available to most businesses. The organisations I have seen attempt it are not neutral; they are falling behind.
It is not a case against cloud AI. Cloud providers offer real advantages for specific workloads, and for certain use cases, particularly those with high concurrency requirements and variable demand, cloud remains the right answer. The hybrid model I have described is a genuine hybrid, not local-first absolutism.
What it is: an argument for treating AI infrastructure as a deliberate strategic choice rather than a default assumption. The default assumption, that cloud is where AI lives and the pricing you see today is roughly what you will pay in future, is not well-founded. The financial structure of the frontier model companies makes pricing normalisation inevitable. The capability of open-weight models makes local deployment viable for a significant and growing proportion of business workloads. The hardware arriving in 2026 makes the infrastructure accessible at a price point that changes the capital expenditure calculus.
Pandora’s box, in the original myth, retained one thing after everything else had escaped: hope. I think the appropriate AI-era reading of that is this. The technology is out, it is permanent, and the question is whether you engage with it deliberately or reactively. The organisations that are deliberate, that understand the dependency they are building and have thought carefully about its infrastructure, its costs, and its control, will be better positioned when the economics shift than those that simply followed the default until it became untenable.
The box is open. The question is who controls what comes out of it.
About the Author
Tim Huegdon is the founder of Wyrd Technology, a consultancy that helps technology leaders make sound infrastructure and AI strategy decisions. With over 25 years of experience in software engineering and technical leadership, he works with organisations navigating the shift from default cloud AI adoption to deliberate, cost-aware deployment strategies that match infrastructure choices to the specific requirements of each workload, and that hold up when the economics change.