Inference Is the New AI Bottleneck, and NVIDIA’s Grip Is Starting to Slip

OpenAI cut inference costs 50% and Etched raised $800M, reshaping AI economics. Goldman Sachs sees token demand growing 24x by 2030.

Published July 4, 2026

Logan Pierce

OpenAI engineers found a way to run the company’s AI models at half the cost earlier this month, according to The Information. In the same week, chip startup Etched emerged from stealth with $800 million raised and more than $1 billion in orders for a transformer-only chip that the company claims runs at twenty times the speed of an Nvidia H100.

The two stories look unrelated. They are not. Inference, the work a model does every time it answers a prompt, now eats roughly two-thirds of all AI compute and accounts for 80% to 90% of what a model costs across its life. NVIDIA’s training moat does not extend there.

Two Headlines That Are Actually One Story

OpenAI’s optimization was disclosed to colleagues earlier this month and applied to specific models, letting the free version of ChatGPT run on a couple hundred Nvidia GPUs where it once needed tens of thousands.

That single change could save OpenAI $10 million to $100 million per month, per the estimate that has circulated alongside The Information’s reporting. It is one line item, but it is the line item that decides whether an AI lab turns a profit.

Etched, founded by Harvard dropouts and based in San Jose, broke cover on Tuesday with $800 million raised across multiple rounds, $1 billion-plus in signed customer contracts, and an investor list that includes Peter Thiel, Jane Street (more than $100 million invested in total, per Bloomberg), Geoffrey Hinton, Fei-Fei Li, Andrej Karpathy, and Stanley Druckenmiller. The company’s CEO, Gavin Uberti, has framed the bet in stark terms.

What ties the two announcements together is the layer of the AI stack they target. Both are about the cost of running a model, not the cost of training one.

Inference Now Eats Most of the Bill

For most of the last decade, training was the headline number. A new flagship model needed a giant cluster for a few months, and that was the story. Inference was the part that ran quietly in the background. The split has flipped. Inference now accounts for roughly two-thirds of all AI compute capacity and somewhere between 80% and 90% of what a model costs across its life. It passed training in data center revenue late last year, according to one analysis making the case for betting on inference economics.

The reason is reasoning. The smartest models now spend more compute thinking before they answer, using 5 to 10 times more inference per query than older models. The weights themselves matter less than how long the model is allowed to think, and that thinking is billed by the token.

Inference: roughly two-thirds of all AI compute
Inference: 80% to 90% of a model’s lifecycle cost
Token prices have fallen roughly 1,000x in three years
Inference cost per token is falling 60% to 70% per year, per Goldman Sachs Research
Token demand is forecast to grow 24x by 2030, to 120 quadrillion per month
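
The figures above mix different time spans, so it can help to put them on a per-year footing. The sketch below does that under one loud assumption that is not in the reporting: each change compounds at a constant yearly rate, and 2026 to 2030 is treated as four compounding years. The helper name `annual_multiplier` is illustrative only.

```python
def annual_multiplier(total_multiple: float, years: float) -> float:
    """Constant yearly multiplier that compounds to total_multiple over the given years."""
    return total_multiple ** (1.0 / years)


# Token prices: roughly 1,000x cheaper over three years (figure from the list above).
price_drop_per_year = annual_multiplier(1000, 3)
price_decline_pct = (1 - 1 / price_drop_per_year) * 100

# Token demand: forecast to grow 24x by 2030, assumed here to span four years.
demand_growth_per_year = annual_multiplier(24, 4)

print(f"price: about {price_drop_per_year:.1f}x cheaper per year "
      f"({price_decline_pct:.0f}% annual decline)")
print(f"demand: about {demand_growth_per_year:.2f}x per year")
```

Under those assumptions, a 1,000x price drop over three years works out to about 10x per year, and 24x demand growth over four years to a little over 2.2x per year. Note that the first figure is a price and Goldman’s 60% to 70% figure is a cost, so the two rates are not directly comparable.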

What Cheaper Inference Does to a Lab’s Books

OpenAI closed the first quarter on a 39% gross margin and has set a target of 52% by December. Cheaper inference is the only path that gets it there. Anthropic is closer: it is currently running an 80% gross margin on every Opus 4.8 inference call, which would make it the first profitable AI lab in the world.
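
To see what the margin target implies for costs, here is a minimal sketch. It assumes revenue and pricing stay fixed and that the whole improvement comes from lowering cost of revenue, which is a simplification and not something OpenAI has stated. The function names are illustrative.

```python
def cost_share(gross_margin: float) -> float:
    """Cost of revenue as a fraction of revenue, given gross margin as a fraction."""
    return 1.0 - gross_margin


def required_cost_cut(current_margin: float, target_margin: float) -> float:
    """Fractional cut in cost per revenue dollar needed to move between two margins.

    Assumes revenue is unchanged, so only the cost side moves.
    """
    return 1.0 - cost_share(target_margin) / cost_share(current_margin)


# 39% gross margin today, 52% target by December (figures from the paragraph above).
cut = required_cost_cut(0.39, 0.52)
print(f"cost per revenue dollar must fall by about {cut:.1%}")
```

On those assumptions, moving from 39% to 52% means cost of revenue has to drop from 61 cents to 48 cents per dollar of revenue, a cut of roughly 21%.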

The math has flipped from one of growth-at-any-cost to one of margin discipline.

That is why OpenAI unveiled a custom inference chip called Jalapeño in June 2026, co-developed with Broadcom, designed to deliver better performance-per-watt for ChatGPT and reduce dependence on Nvidia’s GPUs. It is also why every hyperscaler is now designing its own inference silicon, from Amazon’s Inferentia to Google’s TPU to OpenAI’s Broadcom project. The 80% gross margin NVIDIA historically captured on inference is the margin everyone else is now trying to keep.

OpenAI already offers its Batch API at a 50% discount compared with standard API pricing. That price cut only works if the underlying cost is also falling. The Information’s scoop points to exactly that combination: software-driven cost reduction layered on top of custom silicon.

NVIDIA’s Open Side

NVIDIA’s dominance is built on CUDA, a software and library stack accumulated over roughly twenty years. PyTorch and TensorFlow depend on it, and that dependence is the moat that makes Nvidia hard to displace in training. That moat does not extend automatically to inference.

Inference is a different workload: more deterministic, more batchable, more amenable to specialized silicon. That is why challengers have flooded into the field. Groq and Cerebras are already deployed at scale by frontier labs. Etched, the most concentrated bet, designs its chip to do one thing well. Analysts cited by Limitless expect NVIDIA’s share of the inference market to fall from north of 90% today to somewhere between 20% and 30% by 2028. The drop sounds dramatic, but the structural reason is simple: general-purpose GPUs are not the cheapest way to run a transformer, and the rest of the chip industry has noticed.

NVIDIA could in principle build a more specialized inference accelerator, but doing so would be a public concession that general-purpose flexibility is no longer worth paying for, and that concession would weaken CUDA’s own lock-in.

Every Hyperscaler Is Building Its Own Chip

The chip startups are not the only ones piling in. Amazon, Google, Meta, and Microsoft all have in-house inference programs. Etched is the most concentrated of the bets: its Sohu chip hard-codes the transformer computation graph into silicon, runs on TSMC’s N4P (4nm) process, and pairs 144 gigabytes of HBM3E memory per chip with a proprietary interconnect.

Etched says first racks ship this summer. The latest funding round, $500 million led by Stripes, closed in December 2025 at a $5 billion post-money valuation. Sohu’s transformer ASIC architecture is single-purpose in a way Nvidia’s general-purpose GPUs are not.

Chip (8-unit server)	Tokens per second on Llama 70B	Per-chip H100 equivalent
Etched Sohu	500,000	20 GPUs (Etched claim)
Nvidia H100	23,000	–
Nvidia B200	43,000 to 45,000	–
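
The ratios implied by the table can be checked with a few lines of arithmetic. This is only a sketch over the numbers as listed; the Sohu throughput is Etched’s own claim, and the constant and function names are illustrative.

```python
# Throughput for an 8-unit server running Llama 70B, copied from the table above.
# The Sohu number is Etched's own claim, not an independent measurement.
SOHU_TPS = 500_000
H100_TPS = 23_000
B200_TPS_RANGE = (43_000, 45_000)
UNITS_PER_SERVER = 8
H100_EQUIV_PER_SOHU = 20  # Etched's per-chip claim


def ratio(a: float, b: float) -> float:
    """How many times larger a is than b."""
    return a / b


sohu_vs_h100 = ratio(SOHU_TPS, H100_TPS)
# Divide by the high end of the B200 range first, so the tuple runs low to high.
sohu_vs_b200 = tuple(ratio(SOHU_TPS, tps) for tps in reversed(B200_TPS_RANGE))
h100s_per_sohu_server = H100_EQUIV_PER_SOHU * UNITS_PER_SERVER

print(f"Sohu vs H100 server: {sohu_vs_h100:.1f}x")
print(f"Sohu vs B200 server: {sohu_vs_b200[0]:.1f}x to {sohu_vs_b200[1]:.1f}x")
print(f"H100s implied per Sohu server: {h100s_per_sohu_server}")
```

The listed throughputs imply a gap of a little under 22x against the H100 server, in line with the 20x per-chip claim, and between 11x and 12x against the B200 server. Twenty H100 equivalents per chip across eight chips gives 160 H100s per Sohu server.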

Jevons Hits the Token

Cheaper tokens do not mean less usage. They mean different usage. The price of a token has fallen roughly 1,000-fold in three years, and total spending on inference has risen the entire way. Every cost reduction makes a previously uneconomic use case viable.

Agents are the clearest example. A single agentic task consumes 5 to 30 times the tokens of a normal chat, and some agents run continuously in the background. Jim Schneider, a senior equity analyst at Goldman Sachs Research, projects that monthly token consumption will multiply 24 times between 2026 and 2030, to 120 quadrillion tokens per month.

The math is uncomfortable for incumbents and inviting for challengers. Inference cost per token is falling 60% to 70% per year, but token volumes are growing faster than unit prices are falling, which is why Schneider calls the next 3 to 12 months a “margin inflection” for AI players that own their compute stack. Developers comparing AI tools already feel this in their API bills. See Claude Code vs OpenAI Codex in 2026 coding benchmarks for how that math plays out on the tool side.
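
How volume growth and falling unit costs combine depends on the product of the two yearly factors, not on which percentage is larger. The sketch below works that through under assumptions that are not in the Goldman forecast as quoted here: the 24x volume growth is spread evenly over four years, and the 60% to 70% yearly cost decline holds constant. The function name is illustrative.

```python
def yearly_cost_multiplier(volume_growth: float, unit_cost_decline: float) -> float:
    """Year-over-year change in total serving cost: volume factor times unit-cost factor."""
    return volume_growth * (1.0 - unit_cost_decline)


# Assumption: 24x token volume growth compounds evenly over four years (2026 to 2030).
volume_growth = 24 ** (1 / 4)

# Unit-cost decline at which total serving cost stays flat year over year.
break_even_decline = 1.0 - 1.0 / volume_growth

print(f"implied volume growth: {volume_growth:.2f}x per year")
print(f"break-even unit-cost decline: {break_even_decline:.1%} per year")

for decline in (0.60, 0.70):
    multiplier = yearly_cost_multiplier(volume_growth, decline)
    print(f"{decline:.0%} decline -> total serving cost x{multiplier:.2f} per year")
```

With even compounding, the break-even sits near a 55% yearly decline in unit cost. A different growth path, or a different start year for the 24x figure, would move that threshold.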

Etched: Sohu ASIC, $800M raised, $1B+ contracts, $5B valuation
Cerebras: wafer-scale engine, in talks at roughly $22B pre-money valuation
Groq: LPU inference, deployed at frontier labs
Amazon: Inferentia, in-house AWS silicon
Google: TPU, in-house inference and training
OpenAI: Jalapeño chip, co-developed with Broadcom (June 2026)

The Claims Still Unproven

Etched’s headline figures are company-reported, not third-party-verified. The 500,000-token-per-second figure on Llama 70B, the claim that one eight-chip server replaces 160 H100s, and the 80% peak FLOP utilization on sparse mixture-of-experts models all come from Etched’s own published materials. The benchmark conditions Etched’s architecture is optimized for (high batch throughput, 2,048 input tokens, 128 output tokens) are not the average production workload, and an eight-GPU Nvidia B200 server produces roughly 45,000 tokens per second, a narrower gap than the headline H100 comparison suggests.

The bet is binary, and the founders have said so in writing.

If the transformer architecture endures at scale, Etched has the potential to become an extraordinarily large business. If it does not, the chips become expensive paperweights.

Etched CEO Gavin Uberti, on the company’s binary bet, in Tuesday’s stealth exit materials.

Frequently Asked Questions

What does “inference” mean in AI, and why is it suddenly the cost story?

Inference is the work a trained AI model does each time it answers a prompt. It is distinct from training, the heavier one-time cost of building the model in the first place. The bill has flipped because every ChatGPT reply, every API call, and every agent run draws on inference, and the cost shows up every time a query is made rather than once at the lab. OpenAI, Anthropic, and Google now spend more on inference than on training across their fleets.

Why is NVIDIA’s grip on inference weaker than its grip on training?

Training workloads run on a deeply entrenched CUDA software stack that every major AI framework depends on, a moat two decades in the making. Inference is a more open workload, and a wave of ASIC startups (Etched, Cerebras, Groq) plus every hyperscaler building its own silicon (Google TPU, Amazon Inferentia, OpenAI’s Broadcom chip) are now targeting it directly. The supplier concentration that protected NVIDIA’s training business has no equivalent on the inference side.

What is Etched’s Sohu chip, and how is it different from an Nvidia GPU?

Sohu is a transformer-only application-specific integrated circuit. The chip bakes attention and matrix math into silicon rather than running it as software on general-purpose cores, which Etched says lets an eight-chip server push 500,000 tokens per second on Meta’s Llama 70B against roughly 23,000 tokens per second for an eight-GPU Nvidia H100 server. It cannot run non-transformer architectures and it is not programmable for general AI workloads, a trade-off the company has framed as a deliberate bet on the durability of the transformer.

How much could AI token demand actually grow by 2030?

Goldman Sachs Research expects monthly token consumption to grow 24-fold between 2026 and 2030, to 120 quadrillion tokens per month. The forecast, attributed to senior equity analyst Jim Schneider, is driven by agentic AI: a single agent task consumes 5 to 30 times the tokens of a normal chat, and Goldman models daily LLM queries compounding at a 40% CAGR to 11 billion by 2030.

What would it take for ASIC chips like Sohu to displace Nvidia in inference?

Independent third-party benchmarks on production hardware. Etched’s 500,000-token-per-second Llama 70B figure is company-reported and was measured under benchmark conditions (2,048 input tokens, 128 output tokens, high batch throughput) that favor its architecture. The displacement case rests on whether transformers remain dominant through the end of the decade, and Etched CEO Gavin Uberti has openly described the bet as binary: either Sohu becomes core inference infrastructure, or it becomes unsellable if a different model architecture takes over.

Disclaimer: This article discusses AI chip industry dynamics, corporate financials, and projected market share figures. Statements about future market share, valuations, and demand projections are sourced to named analysts and companies and represent their forecasts, not certainties. Figures are accurate as of publication and may have changed. This content is for informational purposes only and does not constitute investment advice. Consult a qualified financial professional before making investment decisions related to AI, semiconductor, or technology companies.