AI

Subquadratic Launches A 12-Million-Token AI Model And Says The Wall Is Gone

Published

3 months ago

May 6, 2026

A Miami startup called Subquadratic launched its first model on Tuesday with a 12-million-token context window, twelve times the ceiling that every frontier lab has settled near for two years. The company says its proprietary attention architecture scales linearly in compute and memory, runs 52 times faster than dense attention at 1 million tokens, and beats OpenAI’s GPT-5.5 on multi-document recall by nine points. The product launched as an API, a coding agent called SubQ Code, and a deep-research tool called SubQ Search, all running on neoclouds rather than the hyperscalers. Subquadratic has raised $29 million at a $500 million valuation.

The pitch is the kind that surfaces every twelve months in this corner of AI. Magic.dev claimed a 100-million-token window in August 2024, raised more than $500 million, and has produced no public evidence the model is being used outside its own building. Subquadratic’s benchmarks, named experts, and architectural specifics are different. Whether the math survives outside controlled testing is the question every CTO will ask this week, with the company’s launch post detailing SubQ’s SSA architecture and benchmark methodology serving as the opening exhibit.

12 million tokens of context, roughly 9 million words or 120 books in a single prompt.
52x faster than dense attention at 1 million tokens, by the company’s own measurement.
92.1% accuracy on needle-in-a-haystack retrieval at the full 12-million-token length.
$29 million raised at a $500 million valuation from Javier Villamizar, Justin Mateen, and early backers of Anthropic, OpenAI, Stripe, and Brex.

Why The Million-Token Wall Has Held This Long

Quadratic attention has been the bottleneck since the 2017 transformer paper. Doubling the input quadruples the work. The original architecture compares every token to every other token, so a million-token prompt costs a trillion pairwise comparisons. Every workaround the industry has shipped, from retrieval-augmented generation to agentic decomposition to KV-cache offloading, exists to dodge that cost.

The major frontier models have stalled at the same number. Anthropic’s Claude Opus 4.7, Google’s Gemini 3.1 Pro, and OpenAI’s GPT-5.5 all advertise context windows of around one million tokens. None of them work especially well at the upper end. On MRCR v2, the multi-reference recall benchmark labs report, GPT-5.5 leads at 74.0 percent. Opus 4.7 trails at 32.2 percent. Gemini 3.1 Pro lands at 26.3 percent. The advertised window and the usable window are not the same thing.

That gap is what the labs are quietly acknowledging when they pair their long-context offerings with retrieval pipelines. RAG was never a feature; it was a workaround. Subquadratic’s argument is that if attention itself stops scaling quadratically, the workarounds stop being necessary.

Fractured chrome cube bursting open with glowing text fragments representing a 12 million token AI context window.

What Selective Attention Actually Picks

The architecture Subquadratic calls Subquadratic Selective Attention, or SSA, is a learned sparsity mechanism. For any given query token, the model picks which positions in the input matter, conditioned on what the query and keys actually contain. The selection itself does not run quadratic, which is the trap most prior sparse-attention work fell into.

Alex Whedon, Subquadratic’s CTO and a former Meta engineer who ran enterprise AI at TribeAI, described the mechanism in plain terms.

“Sparse attention basically means instead of doing what transformers do, which is if you have 1,000 words, you look at every possible relationship between all 1,000 words, which is 1,000 squared combinations. You realize that only a portion of those actually matter and you only process the portion that matter,” Whedon said in an interview.

The catalogue of prior approaches reads like a graveyard of clever tradeoffs:

Fixed-pattern sparse attention, used in models like Longformer, scales linearly by attending only to a sliding window. It works when relevant information sits nearby. It fails when it doesn’t.
State-space models like Mamba, Mamba-2, RWKV, and RetNet trade dense attention for a recurrent state. The state is lossy. Nvidia’s 8-billion-parameter study found pure Mamba-2 lagged transformers on MMLU and basic phonebook lookup until attention layers were stitched back in.
Hybrid architectures, including Jamba, Kimi Linear, Qwen3-Next, and Nvidia’s Nemotron v3, mix cheap layers with a few dense layers. The dense layers still do quadratic work, so a hybrid that’s three times cheaper at 32K tokens is still three times cheaper at 10 million tokens. The asymptotics never improve.

DeepSeek’s Native Sparse Attention won the ACL 2025 best paper award, and its successor DeepSeek Sparse Attention now ships in DeepSeek’s V3.2-Exp release notes describing the DSA lightning indexer. Sebastian Raschka’s technical breakdown of DeepSeek V3.2 sparse attention notes that DSA reduces complexity from quadratic to linear in the number of selected tokens. The catch, as independent analysts have flagged, is that the lightning indexer that picks those tokens still has to score every query against every key. The selection step itself stays quadratic.

Whedon argues SSA does what DSA tried to do without the indexer trap. “For prompt A, words one and six are going to be important to each other,” he said. “For prompt B, maybe it’s words two and three. It’s different for every single input.” Hybrids, he said, deliver “a scalar benefit.” A pure subquadratic mechanism delivers a scaling-law advantage.

The Benchmarks, And Where They Land

Subquadratic’s own technical paper publishes the headline numbers, with one third-party verification on MRCR v2. On RULER at 128,000 tokens, SubQ scores 97.1 against Opus 4.6’s 94.8. On MRCR v2 the company reports a research result of 83 and a verified production score of 65.9, both ahead of GPT-5.5’s 74. On SWE-Bench Verified, the long-running coding benchmark, SubQ logs 82.4 percent against Opus 4.6’s 81.42 and Gemini 3.1 Pro’s 80.6. At 12 million tokens, where no other frontier model operates at all, SubQ holds 92.1 percent on a needle-in-a-haystack retrieval task. The cost figures are larger than the accuracy figures. Subquadratic’s RULER comparison reports SubQ hit 95 percent at $8 of inference, against Claude Opus’s 94 percent at roughly $2,600.

Benchmark	SubQ	Opus 4.6 / 4.7	GPT-5.5	Gemini 3.1 Pro
RULER 128K	97.1	94.8	not reported	not reported
MRCR v2	83.0 (research)	32.2	74.0	26.3
SWE-Bench Verified	82.4%	81.42%	not reported	80.6%
Needle at 12M	92.1%	not operable	not operable	not operable

Where Subquadratic’s Own Paper Slows Down

The technical paper is unusually candid about its caveats, which is the part most write-ups have skipped. Each model run was performed once. Inference costs at this scale make repeats prohibitive. Standard practice in academic ML is to run benchmarks several times and report median or mean. A single run leaves wider error bars than any of the published deltas.

The SWE-Bench result is also, by the paper’s own description, “harness as much as model.” SWE-Bench scores depend heavily on the agentic scaffolding wrapped around the model: how the harness reads the repository, how it iterates on patches, how it validates tests. A one-point margin over Opus 4.6 may reflect harness design rather than raw model capability.

Whedon also acknowledged that the SubQ model itself is, in his words, “way smaller than the big labs.” Parameter counts have not been disclosed. A subquadratic architecture that performs at frontier scores with fewer parameters is the strongest possible result. A subquadratic architecture that wins because it was carefully tuned for the published benchmarks, while a larger dense competitor runs out of the box, is a much weaker one.

There is also a theoretical ceiling worth flagging. The 2024 paper on fundamental limitations on subquadratic alternatives to transformers proved that for certain reasoning tasks, no truly subquadratic architecture can match dense attention without sacrificing capability. Whether SSA threads that needle, or whether it pays the price on tasks not in the current benchmark suite, will only show up in external use.

What Ships This Week

The launch package is three products. The SubQ API exposes the full 12-million-token window to developers in beta. SubQ Code is a CLI agent that loads an entire repository into a single context call, sidestepping the chunk-and-rerank pipelines most coding agents rely on. SubQ Search is a deep-research tool that runs free at launch as a customer acquisition lever. All three sit on neoclouds, the GPU-specialty providers like CoreWeave and Lambda, rather than AWS or Google Cloud. CEO Justin Dangel told reporters the major hyperscalers are “very expensive.”

The company is not open-sourcing weights. Enterprises that want their own fine-tuned version will get post-training tooling, but the base architecture stays closed. The 50-million-token window target is set for the fourth quarter of 2026. Whether that lands or slips is the first real test of whether SSA scales the way the paper claims. If it does, every retrieval pipeline built in the last three years has a competing answer to consider.

The Magic.dev Shadow Hangs Over Every Big Context Pitch

Subquadratic is not the first startup to claim the ceiling has been broken. The category has a recent and unflattering history that any investor in this round had to weigh.

August 2024, And A Number Nobody Could Test

Magic.dev announced LTM-2-mini’s 100-million-token context window in August 2024, with a claimed 1,000-fold efficiency advantage over Llama 3.1 405B’s attention. The company posted internal benchmarks and raised more than $500 million on the strength of the announcement.

Twenty months later, there is no public evidence of LTM-2-mini being used at scale outside Magic. No third-party benchmark replications. No production customers willing to be named. No follow-up model has shipped publicly. The company’s product page still markets the figure.

What Subquadratic Has To Avoid Repeating

Independent ML analysts have spent the intervening period auditing the broader subquadratic claim. Vladimir Ivanov, in a February 2026 audit of subquadratic attention claims published on LessWrong, surveyed the field and concluded that most reported breakthroughs are best understood as “incremental improvement number 93595 to the transformer architecture” rather than fundamental shifts. Ivanov’s read on Kimi Linear, DeepSeek Sparse Attention, and the state-space family was that each delivered constant-factor speedups while remaining quadratic in their selection or hybrid layers.

Subquadratic’s defense is that SSA is genuinely sparse end-to-end, including the selection mechanism, and that the published benchmarks reflect that. The paper publishes the math. What it cannot publish is the proof the architecture survives at the much larger parameter counts the frontier labs train. That proof comes from external use, or it doesn’t come at all.

$29 Million, A Pivot, And A Speech Model In The Drawer

The $29 million seed at a $500 million valuation is its own data point about the AI funding cycle. The capital came from Javier Villamizar, formerly of SoftBank Vision Fund, and Justin Mateen, the Tinder co-founder who runs JAM Fund. Several individual backers from early Anthropic, OpenAI, Stripe, and Brex rounds also wrote checks. The valuation is roughly 17 times the round size, which is aggressive for a pre-revenue company in any other market and merely typical for AI infrastructure right now.

Subquadratic is a pivot. The company was previously called Aldea and worked on speech models before redirecting to attention architecture. The speech work is shelved. The team that remains is 11 PhD researchers from Meta, Google, Oxford, Cambridge, ByteDance, Adobe, and Microsoft, plus a CEO with five prior companies in health-tech, insurance, and consumer goods, and a CTO who ran enterprise AI at TribeAI after leaving Meta.

Pivots in AI are common right now because the capital is patient and the architecture space is still moving. Whether SSA is the mechanism that finally retires dense attention, or the latest in a long line of constant-factor improvements, is what the next two quarters will tell.

Frequently Asked Questions

Can I Use SubQ Today?

Yes, in beta. The SubQ API opened to developers on May 5, 2026, with the full 12-million-token window exposed. Sign-up runs through subq.ai with an early-access waitlist. SubQ Code is a separate CLI agent built on the same model. SubQ Search runs free at launch. All three sit on neoclouds, not AWS or Google Cloud, so latency and region availability differ from what you’d get from a hyperscaler-hosted model.

How Much Does The SubQ API Cost?

Subquadratic has not posted a public price sheet yet. The company says SubQ runs roughly 50 times cheaper than frontier models at 1 million tokens, with one published RULER comparison showing $8 per benchmark run versus about $2,600 for Claude Opus on the same test. Specific per-token rates will land at general availability. Early access pricing is being negotiated case-by-case for enterprise customers through the company’s sales contact form.

Is SubQ Open Source?

No. Subquadratic has stated it will not release model weights. Enterprises that want their own fine-tuned version will get post-training tooling, but the base architecture stays closed. That puts SubQ in the same camp as Anthropic and OpenAI rather than DeepSeek or Meta. If open weights are a hard requirement, DeepSeek V3.2-Exp on Hugging Face is the closest sparse-attention alternative currently available, though its window tops out around 128K.

When Does The 50-Million-Token Window Launch?

Fourth quarter of 2026, per the company’s published roadmap. The current 12-million-token product is the first commercial release of SSA. The 50-million-token version is described as an extension of the same architecture rather than a new model family. Whether the schedule holds is the most direct test of whether linear scaling holds in production. Subquadratic has not committed to a specific month, and Q4 dates in AI tend to slip.

How Does SubQ Compare To Claude Opus 4.6 On Coding?

Subquadratic’s published numbers show SubQ at 82.4 percent on SWE-Bench Verified, against Opus 4.6 at 81.42. The margin is one point and the paper acknowledges the score is “harness as much as model,” meaning the agentic scaffolding around the model affects the result. For real coding work, the bigger differentiator is SubQ Code’s ability to load a full repository into one context call, which Opus 4.6’s 1-million-token window cannot match.

Will SubQ Work With My Existing IDE?

Through SubQ Code, yes for command-line workflows. Native plugins for Cursor, Cline, or VS Code’s Copilot extension are not yet shipping. You can route the SubQ API into any tool that accepts an OpenAI-compatible endpoint, but the full 12-million-token codebase-loading workflow currently lives only inside the SubQ Code CLI. Plugin integrations are on the public roadmap but not dated, so plan around the CLI for now.

The independent benchmarks will arrive when researchers outside Miami get the API in their own hands and run tests on prompts the company never saw. Until then, the math is on paper and the cautionary tales are documented. Subquadratic has the harder job ahead: producing the receipts.