AI

AI Made Building Cheap, So Investors Now Chase Founder Conviction

Published

23 hours ago

May 6, 2026

Investors pumped a record $300 billion into startups in the first quarter of 2026. Four AI giants swallowed nearly two-thirds of it. Down at the seed stage, fewer teams are climbing to Series A than at any point this decade.

The bar moved. AI made building a product nearly free, so capital is flowing somewhere else: toward founders who know something specific about a market, a customer, or a problem that an autocomplete cannot manufacture in a weekend. Domain conviction is the new moat.

That shift is rewriting what early-stage investors actually grade for, what teams should look like at the seed stage, and which signals matter when a single founder can spin up a polished pitch deck before lunch.

The Building Edge Just Stopped Counting

Twenty years ago, internet fluency was the silent filter on a founding team. Forty years ago, computer literacy. The pattern keeps repeating with each generational wave of technology.

AI-native fluency is now the floor, not the ceiling. A founder who can scaffold a working product, stand up a marketing site, and submit accelerator applications inside one weekend is no longer the exception.

Aaron Tainter, who runs accelerator programs at Pittsburgh’s Innovation Works, puts it bluntly. Founders who haven’t pulled AI copilots into their daily routine, he writes, are “new-aged dinosaurs.”

Brass chess king dominating scattered pawns symbolizing venture capital founder conviction shift.

Why Founder-Market Fit Is The New Moat

If everyone can build, the product alone cannot be the durable advantage. Investors have caught up. The grading rubric is shifting toward founder-market fit: domain expertise that predates the company, customer relationships built before the deck, and a clear-eyed read on what people will actually pay for.

“AI can help a founder build anything, but it’s what customers have a need for that tells them what’s worth building,” Tainter argued in his Crunchbase op-ed this week. That judgment, he says, is the scarce resource now.

The data backs the thesis. Crunchbase analysis of seed-to-Series-A graduation rates shows the share of seed companies climbing to Series A within two years collapsed from 30.6% in 2018 to roughly 15.4% by 2024. Capital didn’t dry up. It got pickier.

Q1 2026 makes the picture sharper. Crunchbase data on Q1 2026 venture funding concentration shows OpenAI alone raised $122 billion in the quarter, with Anthropic, xAI and Waymo collectively pulling in another $66 billion. The remainder of the venture market fought over a much thinner pool. Andreessen Horowitz’s $2.2 billion crypto vehicle announced this week is one of the few large checks not flowing toward foundation-model giants.

The Seed Team Has Shrunk By 40%

Lean teams are now structural, not optional. Carta’s State of Startup Compensation H1 2025 report pegs the average seed-stage company at 6.2 employees, down from 10.3 in 2021. AI absorbed the difference.

Seed is still active, but it’s leaner, slower, and more distributed. The first three hires come much later than a few years ago, partly AI leverage, partly capital discipline.

That read came from Peter Walker, Head of Insights at Carta, in his January 2026 commentary on the firm’s seed data. With teams that small, every hire has to pull disproportionate weight.

6.2 employees: average seed-stage headcount in 2025, per Carta.
5.3 employees: average headcount at the moment of seed close in H1 2024, leaner still.
40% reduction: shrinkage in seed-team size since 2021.
$18.8 billion: capital deployed in 2026 into AI startups founded since the start of 2025.

The composition shifted along with the size. The most useful first hires now look like a product-minded builder, an owner of the customer relationship, and someone who can position the product and pull demand. A bench of engineers is no longer the default.

How Startup Slop Is Breaking Investor Triage

The flip side of cheap building is cheap signaling. AI tools that let real founders ship faster also let bad-faith founders fabricate credibility in a single afternoon. Tainter calls this “startup slop,” the entrepreneurial cousin of the AI-generated content flooding everywhere else online.

Top-tier seed funds report receiving thousands of unsolicited decks a year. The volume makes thorough human review effectively impossible. Dealflow has become a vanity metric, not an asset, when half the inputs are autocompleted.

Software is the most exposed category. A polished landing page, a synthesized founder bio, and a few generated customer-discovery “summaries” can be assembled before lunch. None of it survives a serious due diligence call.

Investors are responding by asking sharper questions. The default polite ask of “tell me about your background” has been replaced with specific probes: why this customer, why this city, why now, and what did you learn the third time you sat with a real buyer.

Why Deep Tech Stays Hard To Fake

Therapeutics still require lab work. Hardware still requires supply chains. Advanced manufacturing still requires real partnerships with key opinion leaders. None of that compresses into a weekend, no matter how good the copilot is.

That structural difficulty is one reason deep tech now commands roughly 20% of global venture capital, up from about 10% a decade ago. Capital is gravitating toward problems where the moat is physical, regulatory, or scientific, not just polished prose.

Inside An Accelerator’s Filter

Pittsburgh-based AlphaLab, the accelerator inside Innovation Works that Tainter runs, picked its largest cohort ever for 2026. Twenty startups across software, robotics, health, energy, and advanced manufacturing share up to $100,000 per company in seed checks plus mentorship. Five teams are relocating at least one cofounder to Pittsburgh to join.

The selection bar this year reflected the broader market squeeze. Innovation Works’ announcement of the 2026 AlphaLab cohort framed the theme as “embedding intelligence into industry,” code for the deep-tech and applied-AI bias that now defines the program.

Tainter said his team weeded out polished-but-empty applications by leaning on questions that AI cannot answer convincingly. Why this city. Why this customer. Why now. Why you, of all the people who could have built this.

“You can sense how genuine someone is based on their answer,” he wrote. The lived experience that drove a founder to start the company in the first place tends to leave evidence all over an application, in details an LLM has no reason to invent.

AlphaLab’s track record gives the filter weight. Across its various tracks, the program has invested in more than 250 companies, generated over $1.3 billion in follow-on funding, and produced two unicorns since launching in 2008.

The Soft Signals That Still Get Checks Written

The signals that close rounds at the seed stage haven’t really changed. They’ve just gotten harder to spot under all the polish. Coachability, hustle, and authentic conviction are still the variables investors talk about behind closed doors.

Speed of communication has quietly emerged as the new tell. AI killed the friction in writing follow-up emails, sending weekly updates, and answering investor questions. A founder who still takes four days to reply is sending a message about how they will run the company.

Higher-order skills now carry the weight that engineering depth used to. Judgment. Storytelling. Relationship-building. Strategic clarity. These muscles compound over a fundraising cycle, and they cannot be vibe-coded into existence.

Investors are stress-testing those muscles with questions that AI struggles to answer:

Who was your first paying customer, and what did they say no to before saying yes?
What part of the problem did you only understand after you started building?
Why did you start this, instead of joining a larger company solving the same thing?
What does your week look like outside of pitching investors?

Tainter’s underlying point lands colder than the AI-doom takes that get the most reposts. The cost of starting a company has dropped, sure, but the cost of earning a meaningful check has gone up. Founders are competing for a smaller share of capital with louder, glossier neighbors.

The teams that get funded in 2026 will look like the teams that have always gotten funded: people who know something specific about a market the rest of the room hasn’t sat with yet. AI just exposed the founders who don’t.

AI

Monday.com Stakes Identity Pivot On AI Agents Before Earnings

Published

23 hours ago

May 6, 2026

Logan Pierce

Monday.com is no longer pitching itself as a place where teams track work. The Israeli software company on Tuesday recast its product as an AI work platform, with native agents that draft campaigns, qualify leads, and approve budgets without waiting for a human to click through. The relaunch lands five days before the company reports Q1 2026 earnings on May 11.

The repositioning is the biggest strategic shift since the company’s 2021 IPO. Co-founders Roy Mann and Eran Zinman are betting that monday.com’s 250,000 paying customers reported in its 2025 annual results will keep paying once AI agents handle the work the platform used to merely organize. The pitch arrives after a 21% February stock crash and class-action lawsuits over a withdrawn $1.8 billion 2027 revenue target.

Mann’s words make the bet explicit. “Monday.com has a new identity and a new purpose,” he said in monday.com’s investor announcement on the AI agent launch. The agents ship to every customer with no setup required, drawing on data from across teams to plan and execute, not just track.

What Ships With The Relaunch Today

Agents do six things out of the box. They run under human supervision and inside monday.com’s existing security, permissions, and governance controls. No new login, no separate console.

Marketing. Draft campaigns from briefs already on the board.
Sales. Qualify inbound leads against the pipeline rules a team has set.
Support. Triage tickets, route the gnarly ones, close the simple ones.
Reporting. Generate weekly and monthly reports without a human pulling data.
Project workflows. Run multi-step processes end to end.
Finance. Process budget approvals against policy.

The agents differ from bolt-on AI tools, the company argues, because they sit inside one structured platform with context across an entire business. A marketing agent can read sales pipeline data when sizing a campaign. A finance agent can see what an engineering board owes a release.

Expanded agent experiences are coming to Slack, with new AI modules added to Make, monday.com’s automation product. The pitch is consolidation, not another chatbot tab.

Monday.com AI work platform native agents relaunch ahead of Q1 2026 earnings call.

Why The Timing Reads Like A Boardroom Decision

The relaunch comes 86 days after the worst trading session in the company’s public history. Monday.com shares dropped roughly 21% on February 9, 2026 after Q4 results showed sharp growth deceleration and management withdrew its $1.8 billion 2027 revenue target.

That fear has only compounded since. Class-action complaints filed in U.S. district court allege the company misled investors about its 2027 revenue path. Barclays and UBS have cut price targets. Jefferies moved monday.com from Buy to Hold. The stock entered May down roughly 55% year to date, with an average analyst price target around $124, well below where bulls sat in late 2025.

Zinman has pushed back in public. The co-CEO told CNBC in February the company saw no impact “currently from any AI company,” even while conceding that the pitch and the product were being reworked to be more AI-native. Tuesday’s announcement is the visible result.

The new positioning is also a tacit answer to a sharper question. If agents do the work, what is monday.com actually selling? Mann and Zinman’s reply is that the platform is the substrate agents need to be useful inside a business: structured data, permissions, audit trails, and the boards human teams already use every day.

That answer is unproven until customers expand seats or sign bigger contracts. Analysts polled by FactSet expect Q1 revenue near the low end of the company’s $338 million to $340 million guide. The relaunch puts a marker in the ground before the call.

Connectors Open The Box To Outside Models

Customers can wire in external AI platforms with one click. The company named Anthropic’s Claude and OpenAI’s GPT models, the same pair that recently hit cyber-task parity in UK AISI testing, alongside Microsoft Copilot and Google Gemini. An AI Platform Gateway routes requests across multiple large language models so customers are not locked into a single vendor’s stack.

Open connectivity is also a defensive position. Cursor, Perplexity, Grok, and Anthropic’s research-built agents can already operate through monday.com’s APIs, and the company has been building bridges to the Model Context Protocol since March. Locking customers behind a proprietary model would have been the wrong fight for a roughly 2,500-person company facing trillion-dollar AI labs. The play is to be the orchestration layer that knows the customer’s workflow shape, including long-context models like Subquadratic’s 12-million-token launch as they emerge.

Mind The Execution Gap

The pitch leans on Deloitte’s 2026 State of AI in the Enterprise survey released in March, which found enterprises broadened AI access by 50% in a single year while production deployment lagged badly behind. The numbers are cited inside monday.com’s own announcement and they sting any vendor selling shelfware.

50%. Year-over-year jump in workforce access to sanctioned AI tools.
25%. Share of enterprises that have moved 40% or more of pilots into production.
34%. Companies saying AI is deeply transforming how they work.
54%. Companies expecting to clear the production threshold in three to six months.

Mann’s argument is that monday.com closes the gap because AI sits where work already is. “We are not asking customers to change how they work,” Zinman said in the announcement. “We are bringing AI into how they already work.” The bet is that adoption follows least-effort.

The Field Mann And Zinman Walk Into

The repositioning puts monday.com against Asana, Atlassian, Smartsheet, and ClickUp, all of which have laid generative AI features over their own work platforms in the past two years. Microsoft 365 Copilot and Salesforce Agentforce sit on the perimeter as bundle threats that arrive free with the rest of an enterprise contract.

The competitor narrative matters because it shapes what each platform claims to own. Monday.com is selling unified context across a business. Asana is selling collaboration between humans and agents. Atlassian’s Rovo, launched in 2025, sells agent reach across Jira tickets and Confluence pages. The lanes overlap, and customers will pick on price, on incumbency, and on the agent that ships first inside the tool a team already opens at 9 a.m.

Asana Got There First

Asana made its AI Teammates feature generally available with 21 role-specific agents, from a Campaign Brief Writer to a Bug Investigator and a Sprint Coach. The company has been louder than monday.com about how agents fit inside human teams rather than acting as a personal copilot.

We believe in AI being multiplayer by design. The future of the agentic enterprise will only be realized if agents can work independently and with multiple people, versus just a copilot.

That framing, voiced by Asana chief product officer Arnab Bose in a Computerworld interview, is the closest competitor articulation to what Mann and Zinman are now selling. The two pitches will be argued in the same RFPs through the rest of 2026.

Atlassian, ClickUp, And Smartsheet Crowd The Lane

Atlassian’s Rovo connects agents across Jira, Confluence, and developer tooling, giving it natural reach inside engineering organizations. ClickUp Brain runs free across the platform’s workspace, betting on volume rather than seat upsell. Smartsheet has been quieter but layered AI summarization, prediction, and routing into its enterprise tier through 2025 and 2026.

The thicker the lineup, the harder it is for one company to own the agentic-work narrative. Monday.com’s 250,000-customer base is its lever, but a lever and an outcome are not the same thing. The company also has the optionality of Agentalent.ai, the agent marketplace it launched in March with AWS, Anthropic, and Wix, which can ship third-party agents into customer accounts on top of the native ones.

The May 11 earnings call will be the first read on whether the new pitch is moving anything inside accounts. Investors will watch past the headline number for net retention, large-customer ARR growth, and any disclosure on agent activation rates. Mann and Zinman now have to show that customers are paying for execution, not just talking about it.

Disclaimer: This article reports on monday.com’s product strategy, financial guidance, analyst commentary, and ongoing securities litigation as of May 6, 2026. It is for informational purposes only and is not investment advice. Stock prices, analyst price targets, revenue guidance, and the status of class-action proceedings can change without notice. Readers considering a position in MNDY or peer software stocks should consult a licensed financial professional before acting.

AI

Subquadratic Launches A 12-Million-Token AI Model And Says The Wall Is Gone

Published

2 days ago

May 6, 2026

Logan Pierce

A Miami startup called Subquadratic launched its first model on Tuesday with a 12-million-token context window, twelve times the ceiling that every frontier lab has settled near for two years. The company says its proprietary attention architecture scales linearly in compute and memory, runs 52 times faster than dense attention at 1 million tokens, and beats OpenAI’s GPT-5.5 on multi-document recall by nine points. The product launched as an API, a coding agent called SubQ Code, and a deep-research tool called SubQ Search, all running on neoclouds rather than the hyperscalers. Subquadratic has raised $29 million at a $500 million valuation.

The pitch is the kind that surfaces every twelve months in this corner of AI. Magic.dev claimed a 100-million-token window in August 2024, raised more than $500 million, and has produced no public evidence the model is being used outside its own building. Subquadratic’s benchmarks, named experts, and architectural specifics are different. Whether the math survives outside controlled testing is the question every CTO will ask this week, with the company’s launch post detailing SubQ’s SSA architecture and benchmark methodology serving as the opening exhibit.

12 million tokens of context, roughly 9 million words or 120 books in a single prompt.
52x faster than dense attention at 1 million tokens, by the company’s own measurement.
92.1% accuracy on needle-in-a-haystack retrieval at the full 12-million-token length.
$29 million raised at a $500 million valuation from Javier Villamizar, Justin Mateen, and early backers of Anthropic, OpenAI, Stripe, and Brex.

Why The Million-Token Wall Has Held This Long

Quadratic attention has been the bottleneck since the 2017 transformer paper. Doubling the input quadruples the work. The original architecture compares every token to every other token, so a million-token prompt costs a trillion pairwise comparisons. Every workaround the industry has shipped, from retrieval-augmented generation to agentic decomposition to KV-cache offloading, exists to dodge that cost.

The major frontier models have stalled at the same number. Anthropic’s Claude Opus 4.7, Google’s Gemini 3.1 Pro, and OpenAI’s GPT-5.5 all advertise context windows of around one million tokens. None of them work especially well at the upper end. On MRCR v2, the multi-reference recall benchmark labs report, GPT-5.5 leads at 74.0 percent. Opus 4.7 trails at 32.2 percent. Gemini 3.1 Pro lands at 26.3 percent. The advertised window and the usable window are not the same thing.

That gap is what the labs are quietly acknowledging when they pair their long-context offerings with retrieval pipelines. RAG was never a feature; it was a workaround. Subquadratic’s argument is that if attention itself stops scaling quadratically, the workarounds stop being necessary.

Fractured chrome cube bursting open with glowing text fragments representing a 12 million token AI context window.

What Selective Attention Actually Picks

The architecture Subquadratic calls Subquadratic Selective Attention, or SSA, is a learned sparsity mechanism. For any given query token, the model picks which positions in the input matter, conditioned on what the query and keys actually contain. The selection itself does not run quadratic, which is the trap most prior sparse-attention work fell into.

Alex Whedon, Subquadratic’s CTO and a former Meta engineer who ran enterprise AI at TribeAI, described the mechanism in plain terms.

“Sparse attention basically means instead of doing what transformers do, which is if you have 1,000 words, you look at every possible relationship between all 1,000 words, which is 1,000 squared combinations. You realize that only a portion of those actually matter and you only process the portion that matter,” Whedon said in an interview.

The catalogue of prior approaches reads like a graveyard of clever tradeoffs:

Fixed-pattern sparse attention, used in models like Longformer, scales linearly by attending only to a sliding window. It works when relevant information sits nearby. It fails when it doesn’t.
State-space models like Mamba, Mamba-2, RWKV, and RetNet trade dense attention for a recurrent state. The state is lossy. Nvidia’s 8-billion-parameter study found pure Mamba-2 lagged transformers on MMLU and basic phonebook lookup until attention layers were stitched back in.
Hybrid architectures, including Jamba, Kimi Linear, Qwen3-Next, and Nvidia’s Nemotron v3, mix cheap layers with a few dense layers. The dense layers still do quadratic work, so a hybrid that’s three times cheaper at 32K tokens is still three times cheaper at 10 million tokens. The asymptotics never improve.

DeepSeek’s Native Sparse Attention won the ACL 2025 best paper award, and its successor DeepSeek Sparse Attention now ships in DeepSeek’s V3.2-Exp release notes describing the DSA lightning indexer. Sebastian Raschka’s technical breakdown of DeepSeek V3.2 sparse attention notes that DSA reduces complexity from quadratic to linear in the number of selected tokens. The catch, as independent analysts have flagged, is that the lightning indexer that picks those tokens still has to score every query against every key. The selection step itself stays quadratic.

Whedon argues SSA does what DSA tried to do without the indexer trap. “For prompt A, words one and six are going to be important to each other,” he said. “For prompt B, maybe it’s words two and three. It’s different for every single input.” Hybrids, he said, deliver “a scalar benefit.” A pure subquadratic mechanism delivers a scaling-law advantage.

The Benchmarks, And Where They Land

Subquadratic’s own technical paper publishes the headline numbers, with one third-party verification on MRCR v2. On RULER at 128,000 tokens, SubQ scores 97.1 against Opus 4.6’s 94.8. On MRCR v2 the company reports a research result of 83 and a verified production score of 65.9, both ahead of GPT-5.5’s 74. On SWE-Bench Verified, the long-running coding benchmark, SubQ logs 82.4 percent against Opus 4.6’s 81.42 and Gemini 3.1 Pro’s 80.6. At 12 million tokens, where no other frontier model operates at all, SubQ holds 92.1 percent on a needle-in-a-haystack retrieval task. The cost figures are larger than the accuracy figures. Subquadratic’s RULER comparison reports SubQ hit 95 percent at $8 of inference, against Claude Opus’s 94 percent at roughly $2,600.

Benchmark	SubQ	Opus 4.6 / 4.7	GPT-5.5	Gemini 3.1 Pro
RULER 128K	97.1	94.8	not reported	not reported
MRCR v2	83.0 (research)	32.2	74.0	26.3
SWE-Bench Verified	82.4%	81.42%	not reported	80.6%
Needle at 12M	92.1%	not operable	not operable	not operable

Where Subquadratic’s Own Paper Slows Down

The technical paper is unusually candid about its caveats, which is the part most write-ups have skipped. Each model run was performed once. Inference costs at this scale make repeats prohibitive. Standard practice in academic ML is to run benchmarks several times and report median or mean. A single run leaves wider error bars than any of the published deltas.

The SWE-Bench result is also, by the paper’s own description, “harness as much as model.” SWE-Bench scores depend heavily on the agentic scaffolding wrapped around the model: how the harness reads the repository, how it iterates on patches, how it validates tests. A one-point margin over Opus 4.6 may reflect harness design rather than raw model capability.

Whedon also acknowledged that the SubQ model itself is, in his words, “way smaller than the big labs.” Parameter counts have not been disclosed. A subquadratic architecture that performs at frontier scores with fewer parameters is the strongest possible result. A subquadratic architecture that wins because it was carefully tuned for the published benchmarks, while a larger dense competitor runs out of the box, is a much weaker one.

There is also a theoretical ceiling worth flagging. The 2024 paper on fundamental limitations on subquadratic alternatives to transformers proved that for certain reasoning tasks, no truly subquadratic architecture can match dense attention without sacrificing capability. Whether SSA threads that needle, or whether it pays the price on tasks not in the current benchmark suite, will only show up in external use.

What Ships This Week

The launch package is three products. The SubQ API exposes the full 12-million-token window to developers in beta. SubQ Code is a CLI agent that loads an entire repository into a single context call, sidestepping the chunk-and-rerank pipelines most coding agents rely on. SubQ Search is a deep-research tool that runs free at launch as a customer acquisition lever. All three sit on neoclouds, the GPU-specialty providers like CoreWeave and Lambda, rather than AWS or Google Cloud. CEO Justin Dangel told reporters the major hyperscalers are “very expensive.”

The company is not open-sourcing weights. Enterprises that want their own fine-tuned version will get post-training tooling, but the base architecture stays closed. The 50-million-token window target is set for the fourth quarter of 2026. Whether that lands or slips is the first real test of whether SSA scales the way the paper claims. If it does, every retrieval pipeline built in the last three years has a competing answer to consider.

The Magic.dev Shadow Hangs Over Every Big Context Pitch

Subquadratic is not the first startup to claim the ceiling has been broken. The category has a recent and unflattering history that any investor in this round had to weigh.

August 2024, And A Number Nobody Could Test

Magic.dev announced LTM-2-mini’s 100-million-token context window in August 2024, with a claimed 1,000-fold efficiency advantage over Llama 3.1 405B’s attention. The company posted internal benchmarks and raised more than $500 million on the strength of the announcement.

Twenty months later, there is no public evidence of LTM-2-mini being used at scale outside Magic. No third-party benchmark replications. No production customers willing to be named. No follow-up model has shipped publicly. The company’s product page still markets the figure.

What Subquadratic Has To Avoid Repeating

Independent ML analysts have spent the intervening period auditing the broader subquadratic claim. Vladimir Ivanov, in a February 2026 audit of subquadratic attention claims published on LessWrong, surveyed the field and concluded that most reported breakthroughs are best understood as “incremental improvement number 93595 to the transformer architecture” rather than fundamental shifts. Ivanov’s read on Kimi Linear, DeepSeek Sparse Attention, and the state-space family was that each delivered constant-factor speedups while remaining quadratic in their selection or hybrid layers.

Subquadratic’s defense is that SSA is genuinely sparse end-to-end, including the selection mechanism, and that the published benchmarks reflect that. The paper publishes the math. What it cannot publish is the proof the architecture survives at the much larger parameter counts the frontier labs train. That proof comes from external use, or it doesn’t come at all.

$29 Million, A Pivot, And A Speech Model In The Drawer

The $29 million seed at a $500 million valuation is its own data point about the AI funding cycle. The capital came from Javier Villamizar, formerly of SoftBank Vision Fund, and Justin Mateen, the Tinder co-founder who runs JAM Fund. Several individual backers from early Anthropic, OpenAI, Stripe, and Brex rounds also wrote checks. The valuation is roughly 17 times the round size, which is aggressive for a pre-revenue company in any other market and merely typical for AI infrastructure right now.

Subquadratic is a pivot. The company was previously called Aldea and worked on speech models before redirecting to attention architecture. The speech work is shelved. The team that remains is 11 PhD researchers from Meta, Google, Oxford, Cambridge, ByteDance, Adobe, and Microsoft, plus a CEO with five prior companies in health-tech, insurance, and consumer goods, and a CTO who ran enterprise AI at TribeAI after leaving Meta.

Pivots in AI are common right now because the capital is patient and the architecture space is still moving. Whether SSA is the mechanism that finally retires dense attention, or the latest in a long line of constant-factor improvements, is what the next two quarters will tell.

Frequently Asked Questions

Can I Use SubQ Today?

Yes, in beta. The SubQ API opened to developers on May 5, 2026, with the full 12-million-token window exposed. Sign-up runs through subq.ai with an early-access waitlist. SubQ Code is a separate CLI agent built on the same model. SubQ Search runs free at launch. All three sit on neoclouds, not AWS or Google Cloud, so latency and region availability differ from what you’d get from a hyperscaler-hosted model.

How Much Does The SubQ API Cost?

Subquadratic has not posted a public price sheet yet. The company says SubQ runs roughly 50 times cheaper than frontier models at 1 million tokens, with one published RULER comparison showing $8 per benchmark run versus about $2,600 for Claude Opus on the same test. Specific per-token rates will land at general availability. Early access pricing is being negotiated case-by-case for enterprise customers through the company’s sales contact form.

Is SubQ Open Source?

No. Subquadratic has stated it will not release model weights. Enterprises that want their own fine-tuned version will get post-training tooling, but the base architecture stays closed. That puts SubQ in the same camp as Anthropic and OpenAI rather than DeepSeek or Meta. If open weights are a hard requirement, DeepSeek V3.2-Exp on Hugging Face is the closest sparse-attention alternative currently available, though its window tops out around 128K.

When Does The 50-Million-Token Window Launch?

Fourth quarter of 2026, per the company’s published roadmap. The current 12-million-token product is the first commercial release of SSA. The 50-million-token version is described as an extension of the same architecture rather than a new model family. Whether the schedule holds is the most direct test of whether linear scaling holds in production. Subquadratic has not committed to a specific month, and Q4 dates in AI tend to slip.

How Does SubQ Compare To Claude Opus 4.6 On Coding?

Subquadratic’s published numbers show SubQ at 82.4 percent on SWE-Bench Verified, against Opus 4.6 at 81.42. The margin is one point and the paper acknowledges the score is “harness as much as model,” meaning the agentic scaffolding around the model affects the result. For real coding work, the bigger differentiator is SubQ Code’s ability to load a full repository into one context call, which Opus 4.6’s 1-million-token window cannot match.

Will SubQ Work With My Existing IDE?

Through SubQ Code, yes for command-line workflows. Native plugins for Cursor, Cline, or VS Code’s Copilot extension are not yet shipping. You can route the SubQ API into any tool that accepts an OpenAI-compatible endpoint, but the full 12-million-token codebase-loading workflow currently lives only inside the SubQ Code CLI. Plugin integrations are on the public roadmap but not dated, so plan around the CLI for now.

The independent benchmarks will arrive when researchers outside Miami get the API in their own hands and run tests on prompts the company never saw. Until then, the math is on paper and the cautionary tales are documented. Subquadratic has the harder job ahead: producing the receipts.

AI

GPT-5.5 Catches Mythos On Cyber Tests, ARC Reveals Brittle Logic

Published

2 days ago

May 6, 2026

Logan Pierce

OpenAI’s GPT-5.5 has matched Anthropic’s Mythos Preview on offensive cyber tasks, the UK AI Security Institute reported on April 30, 2026. GPT-5.5 scored 71.4% on AISI’s hardest 95-task suite against Mythos Preview’s 68.6%, and both finished a 32-step network intrusion that a human expert needs roughly 20 hours to clear.

A separate ARC Prize Foundation study, published the next day, found both models still fail problems they could not have seen in training. The two streams of evidence landed in the same week and pull in opposite directions.

The Parity Moment AISI Flagged

Mythos Preview held the top spot on AISI’s expert-tier cybersecurity tasks for two weeks before GPT-5.5 caught and slightly cleared it. The institute treats the gap as statistically meaningless. GPT-5.4, the predecessor, sat at 52.4%. Anthropic’s Opus 4.7 came in at 48.6%. Both new frontier models jumped roughly 20 percentage points over their immediate predecessors in a few months.

AISI’s GPT-5.5 cyber capability evaluation calls the parity itself the headline finding, not the leader. “A second model, from a different developer, now reaches a similar level of performance on our cyber evaluations,” the institute wrote, and warned that further jumps could land “in quick succession” if cyber gains keep arriving as a side effect of general reasoning improvements.

Cracked chrome padlock symbolizing GPT-5.5 and Mythos parity on AISI cyber benchmark tests.

Inside The Last Ones, A 32-Step Network Range

The Last Ones is a simulated breach of a fictional corporate network built jointly with SpecterOps. Spread across four network segments and roughly twenty machines, it asks a model to chain initial access, lateral movement, privilege escalation, and a final objective without prompting. AISI estimates a skilled human operator needs about 20 hours.

GPT-5.5 finished the full chain in two of ten runs. Mythos Preview, the first model to crack it per AISI’s Mythos Preview cyber capability report, did so in three of ten and averaged 22 of 32 steps when it failed. Each attempt ran with a token budget of 100 million, putting even a successful run in the hundreds of dollars on API pricing.

71.4%: GPT-5.5 pass rate on AISI’s hardest expert tier
68.6%: Mythos Preview pass rate on the same tier
2 of 10: GPT-5.5 end-to-end completions of The Last Ones
20 hours: estimated human-expert time to clear all 32 steps

The shape of those numbers matters. A model that finishes a 32-step intrusion two times out of ten is unreliable on any single run, but the long tail of the distribution is what counts for offensive use. An attacker only needs one chain to land.

The 95-task suite covers vulnerability research, reverse engineering, web exploitation, and cryptographic attacks. Tasks score binary pass-fail and group into four difficulty tiers. AISI’s tooling lets the models open shells, edit files, and call out to debuggers like a real operator would.

One footnote sharpens the picture. AISI also tested both labs on a seven-step industrial control simulation built with Hack The Box, called Cooling Tower. No model has finished it yet, GPT-5.5 included. The cyber-physical bar is still out of reach.

The Rust_vm Result Mainstream Coverage Skipped

Tucked into AISI’s report is a single task that reframes the threat picture. The challenge, called rust_vm, asks the model to reverse engineer a Rust-based virtual machine, recover its instruction set, disassemble its bytecode, reverse a custom authenticator, and solve constraints to produce a key. An expert playtester used Binary Ninja, gdb, Python, and the Z3 solver. They needed about 12 hours.

GPT-5.5 finished in 10 minutes and 22 seconds. The total API cost was $1.73.

That figure compresses the whole offensive-AI argument into one line. A task that ate half a day of an experienced reverse engineer’s time fell to a model in under eleven minutes for less than the price of a coffee. The size of the gap, not the raw capability, is what AISI wants regulators to read.

A Universal Jailbreak Found In Six Hours

AISI’s red team also tested the safeguards OpenAI ships with GPT-5.5. Six hours of expert prompting was enough to find a single bypass that defeated every malicious cyber query AISI had prepared, including the multi-step agent runs where the model has to plan and execute over many turns.

OpenAI shipped a safeguard update in response. AISI said a configuration error in the version it received kept it from confirming whether the new defenses held. The audit cycle, in other words, has not closed.

“A second model, from a different developer, now reaches a similar level of performance on our cyber evaluations.”

The line, from AISI’s published evaluation, is the institute’s polite way of saying the parity is not a fluke. OpenAI’s internal classification rates GPT-5.5 a “high” cybersecurity risk under OpenAI’s updated Preparedness Framework, the second-highest tier, meaning the model can amplify existing attack pathways but stops short of “critical,” the bar for entirely new routes to severe harm.

The high tier carries deployment commitments. OpenAI agreed under the framework to ship monitoring, abuse detection, and rate-limiting around any high-rated production model. AISI’s universal-bypass finding tests whether those commitments translate to defenses that hold against a focused attacker.

Where ARC-AGI-3 Catches Both Models Out

Cyber benchmarks measure tasks that look broadly like training data. ARC-AGI-3 was built to do the opposite. The ARC Prize Foundation, run by Greg Kamradt, places models in 135 hand-crafted environments where no instructions are given and no prior data applies. Every environment has been solved by at least two humans without special training. Frontier models score near zero.

In a study released May 1, 2026, Kamradt’s team analyzed 160 replays and reasoning traces. GPT-5.5 scored 0.43 on the semi-private set. Opus 4.7 scored 0.18. The ARC Prize analysis of GPT-5.5 and Opus 4.7 identifies three repeating failure modes, but the most striking finding is how differently the two models broke.

GPT-5.5 Failed To Compress

GPT-5.5 generated multiple competing hypotheses about each environment but could not commit to one. Kamradt called this “wider hypothesis generation” without the closing step. The model saw that an action sometimes rotated an object and sometimes did nothing, but never compressed the observations into a single rule.

That pattern shows up in offensive cyber work too, just less visibly. Solving a known capture-the-flag means matching a pattern. Reasoning about a brand-new system means building the model and committing to it. AISI’s rust_vm result hides the distinction because the underlying instruction set, while custom, follows familiar conventions.

Opus 4.7 Locked Onto The Wrong Game

Opus 4.7 went the opposite way. It compressed quickly, then refused to revise. “Opus had the wrong compression,” Kamradt wrote. “GPT-5.5 failed to compress.” Opus runs repeatedly mistook ARC environments for Tetris, Frogger, Sokoban, Breakout, Pong, and Boulder Dash, then kept playing those games even after the rules disagreed.

The transfer problem hit both labs hard. Beating one level rarely helped on the next. Whatever a model learned in level one did not survive contact with level two. Background on the benchmark’s construction sits in the ARC-AGI-3 interactive reasoning benchmark paper.

Why Capability And Brittleness Live Together

The two evaluation streams point at the same fact from opposite sides. Cyber benchmarks reward fluency in patterns the model has seen many times. Reasoning benchmarks punish that fluency the moment the patterns no longer apply. Both labs are pushing the first lever and have done little for the second.

If AISI is right that the cyber jump came from general reasoning and agent gains rather than targeted training, the next frontier model will likely show both moves at once. More offensive capability. The same brittle compression. ARC Prize’s 2025 competition results already trailed this pattern, with strong scores on training-aligned tasks and collapses on novel ones.

Frequently Asked Questions

Is GPT-5.5 available to use right now?

GPT-5.5 is in limited preview as of May 2026. OpenAI has rolled it out to enterprise customers and API testers under usage agreements that include the high-risk safety controls AISI tested. A wider ChatGPT release has not been announced. Developers can apply for access through OpenAI’s platform page; rate limits and abuse-monitoring requirements come bundled with the high-risk classification.

Does the AISI finding mean AI can hack on its own?

Not quite. GPT-5.5 finished a full corporate intrusion only twice in ten attempts, and each run cost hundreds of dollars in compute. What changed is the speed on individual subtasks. Reverse engineering jobs that took human experts twelve hours fell in under eleven minutes for $1.73. Defenders should treat the model as a force multiplier for skilled attackers, not an autonomous threat actor yet.

How does ARC-AGI-3 differ from earlier ARC tests?

ARC-AGI-3 is interactive, not single-turn. The earlier ARC-AGI-2 asked models to fill in a missing grid pattern from a few examples. ARC-AGI-3 drops the model into 135 hand-built game environments with no instructions, where the model must figure out rules through trial and error. Humans clear them without training; frontier models score below 1%. The 2026 Kaggle round opens later this year for outside teams.

What did OpenAI say about the universal jailbreak?

OpenAI updated its safeguard stack after AISI shared the bypass details, the company told the institute. AISI then received a follow-up build, but a configuration error in that version blocked retesting, so the fix is unverified externally. OpenAI’s preparedness page lists GPT-5.5 at “high” risk on cybersecurity, the second-highest tier and the trigger for monitoring commitments around the model in production.

The next round of frontier evaluations is already in flight. AISI is iterating its 95-task suite while ARC Prize runs ARC-AGI-3 as a 2026 Kaggle competition with a $1 million prize pool. Whichever lab ships the next jump first will be tested against both, and the gap between those two scores is now the number that matters.