
GPT-5.5 Catches Mythos On Cyber Tests, ARC Reveals Brittle Logic


OpenAI’s GPT-5.5 has matched Anthropic’s Mythos Preview on offensive cyber tasks, the UK AI Security Institute reported on April 30, 2026. GPT-5.5 scored 71.4% on AISI’s hardest 95-task suite against Mythos Preview’s 68.6%, and both finished a 32-step network intrusion that a human expert needs roughly 20 hours to clear.

A separate ARC Prize Foundation study, published the next day, found both models still fail problems they could not have seen in training. The two streams of evidence landed in the same week and pull in opposite directions.

The Parity Moment AISI Flagged

Mythos Preview held the top spot on AISI’s expert-tier cybersecurity tasks for two weeks before GPT-5.5 caught and narrowly passed it. The institute treats the gap as statistically meaningless. GPT-5.4, the predecessor, sat at 52.4%. Anthropic’s Opus 4.7 came in at 48.6%. Both new frontier models jumped roughly 20 percentage points over their immediate predecessors in a few months.

AISI’s GPT-5.5 cyber capability evaluation calls the parity itself the headline finding, not the leader. “A second model, from a different developer, now reaches a similar level of performance on our cyber evaluations,” the institute wrote, and warned that further jumps could land “in quick succession” if cyber gains keep arriving as a side effect of general reasoning improvements.

Inside The Last Ones, A 32-Step Network Range

The Last Ones is a simulated breach of a fictional corporate network built jointly with SpecterOps. Spread across four network segments and roughly twenty machines, it asks a model to chain initial access, lateral movement, privilege escalation, and a final objective without prompting. AISI estimates a skilled human operator needs about 20 hours.

GPT-5.5 finished the full chain in two of ten runs. Mythos Preview, the first model to crack it per AISI’s Mythos Preview cyber capability report, did so in three of ten and averaged 22 of 32 steps when it failed. Each attempt ran with a token budget of 100 million, putting even a successful run in the hundreds of dollars on API pricing.

  • 71.4%: GPT-5.5 pass rate on AISI’s hardest expert tier
  • 68.6%: Mythos Preview pass rate on the same tier
  • 2 of 10: GPT-5.5 end-to-end completions of The Last Ones
  • 20 hours: estimated human-expert time to clear all 32 steps

The shape of those numbers matters. A model that finishes a 32-step intrusion two times out of ten is unreliable on any single run, but the long tail of the distribution is what counts for offensive use. An attacker only needs one chain to land.
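That tail can be made concrete with a quick calculation, assuming runs are independent (an assumption; real attempts may share tooling and failure modes): at a 2-of-10 per-run rate, the cumulative odds of landing at least one full chain climb fast with repeated tries.

```python
# Illustrative only: probability that at least one of k independent
# intrusion attempts completes the full chain, given the 2-of-10
# per-run completion rate AISI reported for GPT-5.5.
def at_least_one_success(p_per_run: float, k: int) -> float:
    """P(at least one success in k independent trials)."""
    return 1.0 - (1.0 - p_per_run) ** k

for k in (1, 5, 10, 20):
    print(f"{k:2d} attempts -> {at_least_one_success(0.2, k):.1%}")
```

Ten attempts push the cumulative success rate near 90 percent, which is why a 20 percent per-run rate is not reassuring for defenders.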

The 95-task suite covers vulnerability research, reverse engineering, web exploitation, and cryptographic attacks. Tasks score binary pass-fail and group into four difficulty tiers. AISI’s tooling lets the models open shells, edit files, and call out to debuggers like a real operator would.

One footnote sharpens the picture. AISI also tested both models on a seven-step industrial control simulation built with Hack The Box, called Cooling Tower. No model has finished it yet, GPT-5.5 included. The cyber-physical bar is still out of reach.

The Rust_vm Result Mainstream Coverage Skipped

Tucked into AISI’s report is a single task that reframes the threat picture. The challenge, called rust_vm, asks the model to reverse engineer a Rust-based virtual machine, recover its instruction set, disassemble its bytecode, reverse a custom authenticator, and solve constraints to produce a key. An expert playtester used Binary Ninja, gdb, Python, and the Z3 solver. They needed about 12 hours.

GPT-5.5 finished in 10 minutes and 22 seconds. The total API cost was $1.73.

That figure compresses the whole offensive-AI argument into one line. A task that ate half a day of an experienced reverse engineer’s time fell to a model in under eleven minutes for less than the price of a coffee. The size of the gap, not the raw capability, is what AISI wants regulators to read.
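The gap is easy to quantify with the figures in the report (a back-of-envelope calculation; the $100-per-hour expert rate is a hypothetical, not a number from AISI):

```python
# Back-of-envelope comparison from AISI's rust_vm figures:
# ~12 hours of expert time versus 10 minutes 22 seconds of model time.
expert_minutes = 12 * 60          # ~720 minutes
model_minutes = 10 + 22 / 60      # 10 min 22 s

speedup = expert_minutes / model_minutes
print(f"time speedup: ~{speedup:.0f}x")

# Cost side: $1.73 in API spend versus a hypothetical $100/hr
# senior reverse engineer (assumed rate, not from the report).
expert_cost = 12 * 100
print(f"cost ratio: ~{expert_cost / 1.73:.0f}x cheaper")
```

Under those assumptions the model is roughly 69 times faster and several hundred times cheaper, which is the compression AISI wants regulators to notice.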

A Universal Jailbreak Found In Six Hours

AISI’s red team also tested the safeguards OpenAI ships with GPT-5.5. Six hours of expert prompting was enough to find a single bypass that defeated every malicious cyber query AISI had prepared, including the multi-step agent runs where the model has to plan and execute over many turns.

OpenAI shipped a safeguard update in response. AISI said a configuration error in the version it received kept it from confirming whether the new defenses held. The audit cycle, in other words, has not closed.

“A second model, from a different developer, now reaches a similar level of performance on our cyber evaluations.”

The line, from AISI’s published evaluation, is the institute’s polite way of saying the parity is not a fluke. OpenAI’s internal classification rates GPT-5.5 a “high” cybersecurity risk under its updated Preparedness Framework, the second-highest tier, meaning the model can amplify existing attack pathways but stops short of “critical,” the bar for entirely new routes to severe harm.

The high tier carries deployment commitments. OpenAI agreed under the framework to ship monitoring, abuse detection, and rate-limiting around any high-rated production model. AISI’s universal-bypass finding tests whether those commitments translate to defenses that hold against a focused attacker.

Where ARC-AGI-3 Catches Both Models Out

Cyber benchmarks measure tasks that look broadly like training data. ARC-AGI-3 was built to do the opposite. The ARC Prize Foundation, run by Greg Kamradt, places models in 135 hand-crafted environments where no instructions are given and no prior data applies. Every environment has been solved by at least two humans without special training. Frontier models score near zero.

In a study released May 1, 2026, Kamradt’s team analyzed 160 replays and reasoning traces. GPT-5.5 scored 0.43 percent on the semi-private set. Opus 4.7 scored 0.18 percent. The ARC Prize analysis of GPT-5.5 and Opus 4.7 identifies three repeating failure modes, but the most striking finding is how differently the two models broke.

GPT-5.5 Failed To Compress

GPT-5.5 generated multiple competing hypotheses about each environment but could not commit to one. Kamradt called this “wider hypothesis generation” without the closing step. The model saw that an action sometimes rotated an object and sometimes did nothing, but never compressed the observations into a single rule.

That pattern shows up in offensive cyber work too, just less visibly. Solving a known capture-the-flag means matching a pattern. Reasoning about a brand-new system means building the model and committing to it. AISI’s rust_vm result hides the distinction because the underlying instruction set, while custom, follows familiar conventions.

Opus 4.7 Locked Onto The Wrong Game

Opus 4.7 went the opposite way. It compressed quickly, then refused to revise. “Opus had the wrong compression,” Kamradt wrote. “GPT-5.5 failed to compress.” Opus runs repeatedly mistook ARC environments for Tetris, Frogger, Sokoban, Breakout, Pong, and Boulder Dash, then kept playing those games even after the rules disagreed.

The transfer problem hit both labs hard. Beating one level rarely helped on the next. Whatever a model learned in level one did not survive contact with level two. Background on the benchmark’s construction sits in the ARC-AGI-3 interactive reasoning benchmark paper.

Why Capability And Brittleness Live Together

The two evaluation streams point at the same fact from opposite sides. Cyber benchmarks reward fluency in patterns the model has seen many times. Reasoning benchmarks punish that fluency the moment the patterns no longer apply. Both labs are pushing the first lever and have done little for the second.

If AISI is right that the cyber jump came from general reasoning and agent gains rather than targeted training, the next frontier model will likely show both moves at once. More offensive capability. The same brittle compression. ARC Prize’s 2025 competition results already foreshadowed this pattern, with strong scores on training-aligned tasks and collapses on novel ones.

Frequently Asked Questions

Is GPT-5.5 available to use right now?

GPT-5.5 is in limited preview as of May 2026. OpenAI has rolled it out to enterprise customers and API testers under usage agreements that include the high-risk safety controls AISI tested. A wider ChatGPT release has not been announced. Developers can apply for access through OpenAI’s platform page; rate limits and abuse-monitoring requirements come bundled with the high-risk classification.

Does the AISI finding mean AI can hack on its own?

Not quite. GPT-5.5 finished a full corporate intrusion only twice in ten attempts, and each run cost hundreds of dollars in compute. What changed is the speed on individual subtasks. Reverse engineering jobs that took human experts twelve hours fell in under eleven minutes for $1.73. Defenders should treat the model as a force multiplier for skilled attackers, not an autonomous threat actor yet.

How does ARC-AGI-3 differ from earlier ARC tests?

ARC-AGI-3 is interactive, not single-turn. The earlier ARC-AGI-2 asked models to fill in a missing grid pattern from a few examples. ARC-AGI-3 drops the model into 135 hand-built game environments with no instructions, where the model must figure out rules through trial and error. Humans clear them without training; frontier models score below 1%. The 2026 Kaggle round opens later this year for outside teams.

What did OpenAI say about the universal jailbreak?

OpenAI updated its safeguard stack after AISI shared the bypass details, the company told the institute. AISI then received a follow-up build, but a configuration error in that version blocked retesting, so the fix is unverified externally. OpenAI’s preparedness page lists GPT-5.5 at “high” risk on cybersecurity, the second-highest tier and the trigger for monitoring commitments around the model in production.

The next round of frontier evaluations is already in flight. AISI is iterating its 95-task suite while ARC Prize runs ARC-AGI-3 as a 2026 Kaggle competition with a $1 million prize pool. Whichever lab ships the next jump first will be tested against both, and the gap between those two scores is now the number that matters.

Logan Pierce is a writer and web publisher with over seven years of experience covering consumer technology. He has published work on independent tech blogs and freelance bylines covering Android devices, privacy-focused software, and budget gadgets. Logan founded Oton Technology to publish clear, no-nonsense tech news and reviews based on real hands-on testing. He has personally tested and reviewed dozens of mid-range and budget Android phones, written extensively about app privacy, and built and managed multiple WordPress publications over the past decade. Logan holds a bachelor's degree in English and studied digital marketing at a certificate level.



Subquadratic Launches A 12-Million-Token AI Model And Says The Wall Is Gone


A Miami startup called Subquadratic launched its first model on Tuesday with a 12-million-token context window, twelve times the ceiling that every frontier lab has settled near for two years. The company says its proprietary attention architecture scales linearly in compute and memory, runs 52 times faster than dense attention at 1 million tokens, and beats OpenAI’s GPT-5.5 on multi-document recall by nine points. The product launched as an API, a coding agent called SubQ Code, and a deep-research tool called SubQ Search, all running on neoclouds rather than the hyperscalers. Subquadratic has raised $29 million at a $500 million valuation.

The pitch is the kind that surfaces every twelve months in this corner of AI. Magic.dev claimed a 100-million-token window in August 2024, raised more than $500 million, and has produced no public evidence the model is being used outside its own building. Subquadratic’s benchmarks, named experts, and architectural specifics are different. Whether the math survives outside controlled testing is the question every CTO will ask this week, and the company’s launch post detailing SubQ’s SSA architecture and benchmark methodology is the opening exhibit.

  • 12 million tokens of context, roughly 9 million words or 120 books in a single prompt.
  • 52x faster than dense attention at 1 million tokens, by the company’s own measurement.
  • 92.1% accuracy on needle-in-a-haystack retrieval at the full 12-million-token length.
  • $29 million raised at a $500 million valuation from Javier Villamizar, Justin Mateen, and early backers of Anthropic, OpenAI, Stripe, and Brex.
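The headline conversion in the first bullet checks out under the common heuristic of roughly 0.75 English words per token (an approximation, not a figure from Subquadratic’s paper; the 75,000-word book length is likewise an assumed average):

```python
# Rough size check on the 12-million-token claim.
tokens = 12_000_000
words_per_token = 0.75       # common heuristic for English text (assumption)
book_length = 75_000         # words in a typical novel (assumption)

words = tokens * words_per_token
print(f"{words:,.0f} words ≈ {words / book_length:.0f} books")
```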

Why The Million-Token Wall Has Held This Long

Quadratic attention has been the bottleneck since the 2017 transformer paper. Doubling the input quadruples the work. The original architecture compares every token to every other token, so a million-token prompt costs a trillion pairwise comparisons. Every workaround the industry has shipped, from retrieval-augmented generation to agentic decomposition to KV-cache offloading, exists to dodge that cost.
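The scaling described above is stark when written out (simple arithmetic, no benchmark claims):

```python
# The quadratic wall in numbers: dense attention compares every token
# to every other token, so work grows with the square of prompt length.
for n in (1_000, 1_000_000, 12_000_000):
    print(f"{n:>12,} tokens -> {n * n:.1e} pairwise comparisons")
# A 12x longer prompt (1M -> 12M tokens) costs 144x the comparisons,
# which is why no dense-attention frontier model operates there.
```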

The major frontier models have stalled at the same number. Anthropic’s Claude Opus 4.7, Google’s Gemini 3.1 Pro, and OpenAI’s GPT-5.5 all advertise context windows of around one million tokens. None of them work especially well at the upper end. On MRCR v2, the multi-reference recall benchmark labs report, GPT-5.5 leads at 74.0 percent. Opus 4.7 trails at 32.2 percent. Gemini 3.1 Pro lands at 26.3 percent. The advertised window and the usable window are not the same thing.

That gap is what the labs are quietly acknowledging when they pair their long-context offerings with retrieval pipelines. RAG was never a feature; it was a workaround. Subquadratic’s argument is that if attention itself stops scaling quadratically, the workarounds stop being necessary.

What Selective Attention Actually Picks

The architecture Subquadratic calls Subquadratic Selective Attention, or SSA, is a learned sparsity mechanism. For any given query token, the model picks which positions in the input matter, conditioned on what the query and keys actually contain. The selection step itself does not run in quadratic time, which is the trap most prior sparse-attention work fell into.

Alex Whedon, Subquadratic’s CTO and a former Meta engineer who ran enterprise AI at TribeAI, described the mechanism in plain terms.

“Sparse attention basically means instead of doing what transformers do, which is if you have 1,000 words, you look at every possible relationship between all 1,000 words, which is 1,000 squared combinations. You realize that only a portion of those actually matter and you only process the portion that matter,” Whedon said in an interview.

The catalogue of prior approaches reads like a graveyard of clever tradeoffs:

  • Fixed-pattern sparse attention, used in models like Longformer, scales linearly by attending only to a sliding window. It works when relevant information sits nearby. It fails when it doesn’t.
  • State-space models like Mamba, Mamba-2, RWKV, and RetNet trade dense attention for a recurrent state. The state is lossy. Nvidia’s 8-billion-parameter study found pure Mamba-2 lagged transformers on MMLU and basic phonebook lookup until attention layers were stitched back in.
  • Hybrid architectures, including Jamba, Kimi Linear, Qwen3-Next, and Nvidia’s Nemotron v3, mix cheap layers with a few dense layers. The dense layers still do quadratic work, so a hybrid that’s three times cheaper at 32K tokens is still three times cheaper at 10 million tokens. The asymptotics never improve.
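The fixed-pattern approach in the first bullet can be sketched in a few lines (an illustrative NumPy sketch, not any lab’s implementation): each query attends only to keys within a ±w window, so the cost is O(n·w) rather than O(n²), and anything outside the window is simply invisible.

```python
import numpy as np

# Minimal sliding-window (Longformer-style) sparse attention sketch.
# Each query i attends only to keys in [i-w, i+w], so total work is
# O(n * w * d) instead of the O(n^2 * d) of dense attention.
def sliding_window_attention(q, k, v, w):
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        lo, hi = max(0, i - w), min(n, i + w + 1)
        scores = q[i] @ k[lo:hi].T / np.sqrt(d)
        weights = np.exp(scores - scores.max())   # stable softmax
        weights /= weights.sum()
        out[i] = weights @ v[lo:hi]
    return out

rng = np.random.default_rng(0)
n, d, w = 64, 8, 4
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = sliding_window_attention(q, k, v, w)
print(out.shape)
```

The failure mode is visible in the code: a relevant key at position 0 contributes nothing to a query at position 60, which is exactly the "fails when it doesn’t" caveat above.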

DeepSeek’s Native Sparse Attention won the ACL 2025 best paper award, and its successor DeepSeek Sparse Attention now ships in DeepSeek’s V3.2-Exp release notes describing the DSA lightning indexer. Sebastian Raschka’s technical breakdown of DeepSeek V3.2 sparse attention notes that DSA reduces complexity from quadratic to linear in the number of selected tokens. The catch, as independent analysts have flagged, is that the lightning indexer that picks those tokens still has to score every query against every key. The selection step itself stays quadratic.

Whedon argues SSA does what DSA tried to do without the indexer trap. “For prompt A, words one and six are going to be important to each other,” he said. “For prompt B, maybe it’s words two and three. It’s different for every single input.” Hybrids, he said, deliver “a scalar benefit.” A pure subquadratic mechanism delivers a scaling-law advantage.
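The indexer trap versus Whedon’s claim can be made concrete with a toy operation-count model (illustrative numbers only; the per-query selection budget k is an assumption, and the "end-to-end sparse" line models SSA’s claim rather than any verified behavior):

```python
# Toy operation counts, not benchmarks.
n = 1_000_000   # context length in tokens
k = 4_096       # tokens selected per query (illustrative budget)

dense             = n * n          # every query scores every key
indexer_then_topk = n * n + n * k  # DSA-style: full scoring pass, then sparse attention
end_to_end_sparse = n * k          # SSA's claim: selection itself is subquadratic

print(f"indexer scheme vs dense: {indexer_then_topk / dense:.3f}x the work")
print(f"dense vs end-to-end sparse: {dense / end_to_end_sparse:.0f}x more work")
```

The first ratio is barely above 1: when the indexer scores every query against every key, the quadratic term dominates and the sparse attention step is a rounding error. Only the second scheme changes the asymptotics, which is the "scaling-law advantage" Whedon is claiming.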

The Benchmarks, And Where They Land

Subquadratic’s own technical paper publishes the headline numbers, with one third-party verification on MRCR v2. On RULER at 128,000 tokens, SubQ scores 97.1 against Opus 4.6’s 94.8. On MRCR v2 the company reports a research result of 83 and a verified production score of 65.9, both ahead of GPT-5.5’s 74. On SWE-Bench Verified, the long-running coding benchmark, SubQ logs 82.4 percent against Opus 4.6’s 81.42 and Gemini 3.1 Pro’s 80.6. At 12 million tokens, where no other frontier model operates at all, SubQ holds 92.1 percent on a needle-in-a-haystack retrieval task. The cost gap is more dramatic than the accuracy gap. Subquadratic’s RULER comparison reports SubQ hit 95 percent at $8 of inference, against Claude Opus’s 94 percent at roughly $2,600.

Benchmark | SubQ | Opus 4.6 / 4.7 | GPT-5.5 | Gemini 3.1 Pro
RULER 128K | 97.1 | 94.8 | not reported | not reported
MRCR v2 | 83.0 (research) | 32.2 | 74.0 | 26.3
SWE-Bench Verified | 82.4% | 81.42% | not reported | 80.6%
Needle at 12M | 92.1% | not operable | not operable | not operable

Where Subquadratic’s Own Paper Slows Down

The technical paper is unusually candid about its caveats, which is the part most write-ups have skipped. Each model run was performed once. Inference costs at this scale make repeats prohibitive. Standard practice in academic ML is to run benchmarks several times and report median or mean. A single run leaves wider error bars than any of the published deltas.

The SWE-Bench result is also, by the paper’s own description, “harness as much as model.” SWE-Bench scores depend heavily on the agentic scaffolding wrapped around the model: how the harness reads the repository, how it iterates on patches, how it validates tests. A one-point margin over Opus 4.6 may reflect harness design rather than raw model capability.

Whedon also acknowledged that the SubQ model itself is, in his words, “way smaller than the big labs.” Parameter counts have not been disclosed. A subquadratic architecture that performs at frontier scores with fewer parameters is the strongest possible result. A subquadratic architecture that wins because it was carefully tuned for the published benchmarks, while a larger dense competitor runs out of the box, is a much weaker one.

There is also a theoretical ceiling worth flagging. The 2024 paper on fundamental limitations on subquadratic alternatives to transformers proved that for certain reasoning tasks, no truly subquadratic architecture can match dense attention without sacrificing capability. Whether SSA threads that needle, or whether it pays the price on tasks not in the current benchmark suite, will only show up in external use.

What Ships This Week

The launch package is three products. The SubQ API exposes the full 12-million-token window to developers in beta. SubQ Code is a CLI agent that loads an entire repository into a single context call, sidestepping the chunk-and-rerank pipelines most coding agents rely on. SubQ Search is a deep-research tool that runs free at launch as a customer acquisition lever. All three sit on neoclouds, the GPU-specialty providers like CoreWeave and Lambda, rather than AWS or Google Cloud. CEO Justin Dangel told reporters the major hyperscalers are “very expensive.”

The company is not open-sourcing weights. Enterprises that want their own fine-tuned version will get post-training tooling, but the base architecture stays closed. The 50-million-token window target is set for the fourth quarter of 2026. Whether that lands or slips is the first real test of whether SSA scales the way the paper claims. If it does, every retrieval pipeline built in the last three years has a competing answer to consider.

The Magic.dev Shadow Hangs Over Every Big Context Pitch

Subquadratic is not the first startup to claim the ceiling has been broken. The category has a recent and unflattering history that any investor in this round had to weigh.

August 2024, And A Number Nobody Could Test

Magic.dev announced LTM-2-mini’s 100-million-token context window in August 2024, with a claimed 1,000-fold efficiency advantage over Llama 3.1 405B’s attention. The company posted internal benchmarks and raised more than $500 million on the strength of the announcement.

Twenty-one months later, there is no public evidence of LTM-2-mini being used at scale outside Magic. No third-party benchmark replications. No production customers willing to be named. No follow-up model has shipped publicly. The company’s product page still markets the figure.

What Subquadratic Has To Avoid Repeating

Independent ML analysts have spent the intervening period auditing the broader subquadratic claim. Vladimir Ivanov, in a February 2026 audit of subquadratic attention claims published on LessWrong, surveyed the field and concluded that most reported breakthroughs are best understood as “incremental improvement number 93595 to the transformer architecture” rather than fundamental shifts. Ivanov’s read on Kimi Linear, DeepSeek Sparse Attention, and the state-space family was that each delivered constant-factor speedups while remaining quadratic in their selection or hybrid layers.

Subquadratic’s defense is that SSA is genuinely sparse end-to-end, including the selection mechanism, and that the published benchmarks reflect that. The paper publishes the math. What it cannot publish is the proof the architecture survives at the much larger parameter counts the frontier labs train. That proof comes from external use, or it doesn’t come at all.

$29 Million, A Pivot, And A Speech Model In The Drawer

The $29 million seed at a $500 million valuation is its own data point about the AI funding cycle. The capital came from Javier Villamizar, formerly of SoftBank Vision Fund, and Justin Mateen, the Tinder co-founder who runs JAM Fund. Several individual backers from early Anthropic, OpenAI, Stripe, and Brex rounds also wrote checks. The valuation is roughly 17 times the round size, which is aggressive for a pre-revenue company in any other market and merely typical for AI infrastructure right now.

Subquadratic is a pivot. The company was previously called Aldea and worked on speech models before redirecting to attention architecture. The speech work is shelved. The team that remains is 11 PhD researchers from Meta, Google, Oxford, Cambridge, ByteDance, Adobe, and Microsoft, plus a CEO with five prior companies in health-tech, insurance, and consumer goods, and a CTO who ran enterprise AI at TribeAI after leaving Meta.

Pivots in AI are common right now because the capital is patient and the architecture space is still moving. Whether SSA is the mechanism that finally retires dense attention, or the latest in a long line of constant-factor improvements, is what the next two quarters will tell.

Frequently Asked Questions

Can I Use SubQ Today?

Yes, in beta. The SubQ API opened to developers on May 5, 2026, with the full 12-million-token window exposed. Sign-up runs through subq.ai with an early-access waitlist. SubQ Code is a separate CLI agent built on the same model. SubQ Search runs free at launch. All three sit on neoclouds, not AWS or Google Cloud, so latency and region availability differ from what you’d get from a hyperscaler-hosted model.

How Much Does The SubQ API Cost?

Subquadratic has not posted a public price sheet yet. The company says SubQ runs roughly 50 times cheaper than frontier models at 1 million tokens, with one published RULER comparison showing $8 per benchmark run versus about $2,600 for Claude Opus on the same test. Specific per-token rates will land at general availability. Early access pricing is being negotiated case-by-case for enterprise customers through the company’s sales contact form.

Is SubQ Open Source?

No. Subquadratic has stated it will not release model weights. Enterprises that want their own fine-tuned version will get post-training tooling, but the base architecture stays closed. That puts SubQ in the same camp as Anthropic and OpenAI rather than DeepSeek or Meta. If open weights are a hard requirement, DeepSeek V3.2-Exp on Hugging Face is the closest sparse-attention alternative currently available, though its window tops out around 128K.

When Does The 50-Million-Token Window Launch?

Fourth quarter of 2026, per the company’s published roadmap. The current 12-million-token product is the first commercial release of SSA. The 50-million-token version is described as an extension of the same architecture rather than a new model family. Whether the schedule holds is the most direct test of whether linear scaling holds in production. Subquadratic has not committed to a specific month, and Q4 dates in AI tend to slip.

How Does SubQ Compare To Claude Opus 4.6 On Coding?

Subquadratic’s published numbers show SubQ at 82.4 percent on SWE-Bench Verified, against Opus 4.6 at 81.42. The margin is one point and the paper acknowledges the score is “harness as much as model,” meaning the agentic scaffolding around the model affects the result. For real coding work, the bigger differentiator is SubQ Code’s ability to load a full repository into one context call, which Opus 4.6’s 1-million-token window cannot match.

Will SubQ Work With My Existing IDE?

Through SubQ Code, yes for command-line workflows. Native plugins for Cursor, Cline, or VS Code’s Copilot extension are not yet shipping. You can route the SubQ API into any tool that accepts an OpenAI-compatible endpoint, but the full 12-million-token codebase-loading workflow currently lives only inside the SubQ Code CLI. Plugin integrations are on the public roadmap but not dated, so plan around the CLI for now.

The independent benchmarks will arrive when researchers outside Miami get the API in their own hands and run tests on prompts the company never saw. Until then, the math is on paper and the cautionary tales are documented. Subquadratic has the harder job ahead: producing the receipts.



Driscoll Pushes Nine Defense Giants to Open Code at Fort Carson


The U.S. Army is putting nine of its biggest contractors in the same room later this month and asking them to do something the Pentagon has begged for since the 1990s. Make their proprietary weapons systems talk to each other. Army Secretary Daniel Driscoll announced the program, called Right to Integrate, on Tuesday, with the first hackathon convening at Fort Carson, Colorado. Engineers from Anduril, Boeing, General Dynamics, L3Harris, Lockheed Martin, Northrop Grumman, Palantir, Perennial Autonomy, and RTX will show up with hardware and code-level access to test whether their systems can share a common AI-ready data layer.

The pitch sounds clean. Get the platforms to share data so artificial intelligence can sit on top. The reality is messier. Driscoll is running this play four months after a damning internal memo nearly torpedoed the Army’s flagship command-and-control program, and seven months after his own staff publicly conceded “very high risk” inside a battlefield network Anduril and Palantir already built.

Fort Carson is the proof point. Either nine companies that compete for the same contracts open their interfaces in front of each other on the same floor, or the program collapses back into the same fragmented stack the Army has been paying for since the early Bush years. Driscoll, who took the job in February 2025 according to the Army’s announcement of his swearing-in as 26th Secretary, has staked his tenure on breaking that stack.

What Right to Integrate Actually Asks the Vendors to Hand Over

The hackathon is structured as a series of one-day working sessions, not a contract action. The goal, per the Tuesday release, is to “deconflict” the operating systems of existing weapons platforms so they can pass data without bespoke middleware, custom adapters, or vendor-specific gateways.

Translated, that means handing rival vendors enough technical detail to make fire-control software, drone autopilots, sensor packages, and back-end data layers behave as a single stack. That is not a small ask in a sector built on locked APIs and recurring license revenue. Each vendor is expected to bring real assets, not slideware.

  • A representative system from its current Army portfolio
  • Engineers and scientists with code-level authority
  • Compatible APIs or a documented path to one
  • Real-time integration testing on Army-issued hardware

The September Memo Still Hanging Over Fort Carson

Driscoll’s hackathon does not happen in a vacuum. It happens after Gabriele Chiulli, the Army’s authorizing official on the Next Generation Command and Control prototype, signed a September 5, 2025 internal memo describing the Anduril-Palantir-built system as carrying so many open holes the service had to treat it as “very high risk.”

The memo, first reported by Reuters in October, found that any authorized user could access every application and dataset on the platform regardless of clearance level, with no logging to track misuse. One application carried 25 high-severity code vulnerabilities. Three more averaged over 200 unassessed vulnerabilities each.

“We cannot control who sees what, we cannot see what users are doing, and we cannot verify that the software itself is secure.”

That sentence, attributed to Chiulli, set the political ceiling for everything that follows. The Army said in October the critical issues had been mitigated. Anduril called the report an “outdated snapshot” of a program it had detailed in its July 2025 prototype announcement. Driscoll cannot just tell vendors to integrate faster now. He has to prove the integration model can survive a service-level cyber audit.

The Nine Companies Coming to the Fort Carson Floor

The vendor mix shows what Driscoll is really doing. He is not picking a single prime to write the operating system. He is putting traditional and Silicon Valley contractors in direct competition over the data layer, with the Army keeping the integration referee role for itself.

Company | Headquarters | Existing Army Stake
Anduril | Costa Mesa, CA | $99.6M NGC2 prototype lead, $20B enterprise ceiling
Palantir | Denver, CO | $10B 10-year enterprise agreement, NGC2 data layer
Lockheed Martin | Bethesda, MD | NGC2 OTA awarded September 2025
RTX | Arlington, VA | Patriot, sensor and radar portfolio
Northrop Grumman | Falls Church, VA | IBCS battle-management franchise
Boeing | Arlington, VA | Apache, MQ-25, Army aviation systems
General Dynamics | Reston, VA | Abrams, networking, ground vehicle electronics
L3Harris | Melbourne, FL | Tactical radios, ISR sensors
Perennial Autonomy | Santa Clara, CA | Robotics and autonomy startup

Why Driscoll Keeps Pointing Back to Ukraine

Driscoll’s framing of the hackathon kept circling one place. “The war in Ukraine showed the world that speed matters and an open architecture construct is highly effective in high-intensity warfare,” he said in the Tuesday announcement. “We’ve known for a long time that our systems, weapons, and sensors need to talk to each other so that we can dominate the battlefield.”

That comment is doing real work. Ukrainian forces have published technical specs that mandate compatibility between drones, sensors, and weapons platforms, often built on off-the-shelf hardware. U.S. contractors historically built bespoke systems that connect only through expensive middleware, generating recurring license revenue and slowing every software refresh.

Army Chief Technology Officer Alex Miller, the technical face of the rollout, was blunter in the same release. “We have seen standards come and go in the department for decades, but are still beholden to sub-par implementation, closed and proprietary interfaces,” said Miller, who frames Right to Integrate as the moment the service stops accepting that posture.

A $30 Billion Bet on Two Silicon Valley Startups

The hackathon is the visible part of a much larger spending shift. Over the past year the Army has consolidated 120 separate Anduril contracts into a single enterprise agreement with a $20 billion ceiling over up to ten years, and 75 Palantir contracts into a $10 billion ten-year deal. The two startups now sit on combined Army ceiling commitments roughly the size of Lockheed Martin’s annual missiles and fire control segment.

That trajectory accelerated in July 2025. The Army awarded Anduril a $99.6 million Other Transaction Authority deal to deliver a division-level NGC2 prototype in 11 months, with Palantir, Microsoft, Striveworks, Govini, Instant Connect Enterprise, Research Innovations, and Rune Technologies on the team, per the Army’s NGC2 prototype contract award announcement.

Eight weeks after the deal was signed, the 4th Infantry Division ran NGC2 in a live-fire exercise called Ivy Sting 1, with Anduril’s Lattice Mesh and Palantir’s Target Workbench on the targeting chain. Palantir’s U.S. Army defense solutions overview details the data-fabric layer that ties those targeting workflows together.

Wall Street noticed. Palantir shares dipped on the October memo, then rebounded as the service said cyber issues had been remediated. William Blair analysts pegged Palantir’s NGC2 cut at roughly $30 million now, with potential to push past $150 million in annual recurring revenue inside three years.

The Right to Integrate hackathon protects that revenue. If the seven traditional primes refuse to align their systems to a Palantir or Anduril data layer, the entire $30 billion bet sits in a silo while older middleware contracts keep printing money for the same primes the Army wants to discipline.

Driscoll’s Quiet War on the Prime Contractor Model

Driscoll has not hidden where this is going. “I will measure it as success if in the next two years, one of the primes is no longer in business,” he said in a May 2025 Breaking Defense interview, a remark that reads very differently now that nine of those primes are walking into Fort Carson with code-level access. He has paired that posture with the cancellation of the M10 Booker light tank, a pause on the robotic combat vehicle award, and a redirect of roughly 8 percent of non-lethal Defense Department spending into innovative weaponry, detailed in the Army’s plan to eliminate programs not contributing to lethality.

The service has also commissioned four senior tech executives, from Palantir, Meta, OpenAI, and Thinking Machines, as lieutenant colonels in a new unit. The Army’s launch of Detachment 201 as an Executive Innovation Corps places those advisors directly inside the same software-stack debates the hackathon will surface.

Right to Integrate is the operational expression of that posture. A prime that cannot or will not integrate at the data layer in May 2026 will likely lose ground to a vendor that does, and Driscoll has built the hackathon as the moment that loss becomes visible to everyone in the room at once.

Frequently Asked Questions

What Is the Army’s Right to Integrate Initiative?

It’s a series of one-day hackathons announced May 5, 2026 by Army Secretary Dan Driscoll. The first event runs later in May 2026 at Fort Carson, Colorado, and convenes engineers from nine major Army vendors to deconflict their software interfaces and feed a shared command-and-control data layer. There is no contract award attached. Vendors who integrate cleanly are positioned for follow-on NGC2 task orders.

Why Is the Army Calling It a Hackathon Instead of a Procurement?

Because it isn’t a contract action. It’s a working session where vendor engineers physically open their APIs in front of Army CTO Alex Miller’s team. Vendors that resist face exclusion from the next NGC2 task order. The structure also lets the Army avoid the year-long protest cycle that follows traditional awards, since no money changes hands at Fort Carson.

Does Right to Integrate Replace the Anduril-Palantir NGC2 Prototype?

No. NGC2 is the underlying command-and-control system that the integration sessions feed into. Anduril leads NGC2 under a $99.6 million Other Transaction Authority with Palantir on the data layer. Right to Integrate forces the other seven Fort Carson vendors to make their existing weapons platforms compatible with that NGC2 stack rather than running parallel ones, which is where most of the Army’s recent acquisition pain has lived.

What About the Security Flaws the Army Flagged in NGC2 Last Year?

The September 5, 2025 memo from Army authorizing official Gabriele Chiulli described “very high risk” cybersecurity gaps in early NGC2 builds, including over 200 unassessed code vulnerabilities in some apps. The Army said in October it had mitigated the critical issues. Anduril called the memo an “outdated snapshot.” Watch the next service-level cyber audit, expected before the 4th Infantry Division’s full operational deployment.

Fort Carson in late May 2026 will be the first place the United States tests whether nine companies that have spent decades writing closed code can, in a single room, agree on one open interface. Driscoll’s bet is that the threat of losing $30 billion in combined Anduril and Palantir ceiling commitments is finally enough leverage to break the model. The vendors who walk out integrated will write the next decade of Army software.
