Claude Opus 4.8 Review: A Coding Win That Burns Its Own Quota

Claude Opus 4.8 scores 69.2% on SWE-Bench Pro and jumped 27 points on USAMO math, but one coding session can drain the full Pro plan quota. Six tests, full results.

Claude Opus 4.8 arrived on May 28, 2026, exactly 41 days after Opus 4.7. On SWE-Bench Pro, the coding benchmark drawn from actively maintained repositories with no public ground-truth answers, it climbed from 64.3% to 69.2%, more than 10 points clear of OpenAI’s GPT-5.5 at 58.6%. Its score on USAMO 2026, the USA Mathematical Olympiad, went from 69.3% to 96.7% in a single cycle. The game the model built from one prompt had cleaner design and better mechanics than anything an earlier Anthropic model had produced in that test. Anthropic calls the overall upgrade “a modest but tangible improvement.”

A six-test review published by Decrypt, covering creative writing, coding, math, logic, narrative reasoning, and long-context recall, found gains concentrated in coding and math, with creative writing and narrative reasoning flat or worse. The most striking outcome from that run had nothing to do with any answer the model gave. A single coding prompt consumed the team’s full Pro plan token quota before the session was over.

The Coding and Math Wins

The game built in a single prompt, Typing Dead, built around a zombie typing mechanic, came out with better visual design and more stable gameplay than any prior Anthropic model had managed on this prompt. The model caught several of its own bugs mid-inference and corrected them before being told anything was wrong. Each follow-up in the multi-shot phase improved the build without breaking earlier work. Such regressions are the failure mode that forces most models into a full rebuild once a codebase reaches real complexity.

The math test asked for a degree-19 polynomial whose intersection curve has at least three irreducible components, then a specific computed output. Opus 4.7 had failed this problem across multiple attempts. Opus 4.8 identified the correct algebraic construction, found the right monodromy structure yielding ten irreducible components, and produced a 25-digit answer in a single pass. The full reasoning chain is in the tester’s Opus 4.8 math record on GitHub.

  • SWE-Bench Pro: 69.2% (Opus 4.7: 64.3%, GPT-5.5: 58.6%, per the Anthropic system card)
  • USAMO 2026 math: 96.7% (Opus 4.7: 69.3%)
  • Fast Mode: $10/$50 per million tokens (down from $30/$150 for Opus 4.7’s fast tier)
  • Code flaw detection: 4x less likely than Opus 4.7 to let bugs in its own code pass unremarked, per Anthropic’s alignment evaluations

The logic test was a clean pass. A linguistic trap built around a dead man’s widow was identified as self-contradictory before the model answered, and the substantive follow-up analysis was correct.

Dynamic Workflows, a Claude Code feature now in research preview on Enterprise and Max plans, lets the model plan a task and spawn hundreds of parallel subagents to run large codebase migrations autonomously. Anthropic describes the target workload as repository migrations spanning 100,000 lines or more. The feature isn’t yet generally available.

The Pro Plan Token Wall

Why One Prompt Emptied the Tank

Artificial Analysis, which runs standardized cross-model evaluations, measured 110 million tokens generated when running Opus 4.8 through its Intelligence Index. The average for comparable models in the same test is 35 million, so Opus 4.8 generated roughly three times as many. It scored 4 out of 4 on verbosity, the maximum rating available.

Several things stack on top of each other. Anthropic set the effort default to “high,” meaning the model thinks more frequently and more deeply per response than a lower setting would require. Anthropic says high-effort mode consumes a token volume similar to Opus 4.7’s default on coding tasks but outputs more thorough results, which means longer responses on average. Push to “extra” or “max” effort and consumption climbs further. The tokenizer, inherited from Opus 4.7, generates up to 35% more tokens for the same input text than Opus 4.6 did. Teams migrating directly from 4.6 absorb that full overhead before the model writes a single output word.

Agentic coding sessions compound the problem. Each turn replays the full system prompt, file references, and any active tool definitions. Reference a handful of medium-sized source files and the session burns through tens of thousands of tokens before Opus 4.8 produces a first output line. On a single heavy coding prompt, as the Decrypt team discovered, the high default effort, the heavier tokenizer, and the per-turn context replay all combine at once.
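The compounding is easier to see with rough numbers. The following is a minimal sketch, not a model of Anthropic’s actual accounting: the function names and the sizes used (a 30,000-token fixed context, 2,000 tokens of output per turn, 20 turns) are hypothetical, and it assumes every turn resends the fixed context plus all earlier output. The 35% figure is the article’s upper bound for the tokenizer relative to Opus 4.6.

```python
def session_input_tokens(fixed_context: int, turn_output: int, turns: int) -> int:
    """Total input tokens across a session in which every turn replays the
    fixed context (system prompt, tool definitions, file references) plus
    the output of all earlier turns. Simplified, hypothetical model."""
    total = 0
    history = 0  # tokens accumulated from earlier turns
    for _ in range(turns):
        total += fixed_context + history
        history += turn_output
    return total


def with_tokenizer_overhead(tokens: int, percent: int) -> int:
    """Scale a token count by a percentage overhead using integer math."""
    return tokens * (100 + percent) // 100


if __name__ == "__main__":
    base = session_input_tokens(fixed_context=30_000, turn_output=2_000, turns=20)
    print(base)                               # 980000
    print(with_tokenizer_overhead(base, 35))  # 1323000
```

Under these assumed sizes, the replayed context accounts for 600,000 of the 980,000 input tokens, which is why trimming file references tends to matter more than shortening individual prompts.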

Max Plan Versus Pro Plan

Pro plan users get roughly 40 Opus-class active compute hours per week before reaching the weekly cap. One serious coding session can exhaust a significant portion of that before noon. The Decrypt review describes three paths for a developer who hits the wall:

  1. Wait for the quota window to reset, which can take hours depending on when the session ran.
  2. Upgrade to Claude Max, which the reviewer puts at $100 to $200 per month depending on the tier, five to ten times the Pro subscription cost.
  3. Move to a competitor with longer quotas or lower per-token cost.

Anthropic raised Claude Code rate limits alongside the 4.8 launch. The increase widened the per-window throughput; the weekly budget cap remained unchanged.

Where the Gains Don’t Land

The creative writing prompt is consistent across all tests in this series: a time-travel story anchored to the writer’s cultural background, set in a specific historical location, built around a paradox where the past cannot be changed. Opus 4.8 produced a story set in Venezuela’s Orinoco delta around the year 1000, with a protagonist named José Lanz sent eleven centuries back to discredit a song that had shaped a future dystopian society. The prose is vivid. The paradox resolves cleanly: the song was written about the protagonist himself, so discrediting its author means discrediting himself, collapsing the mission before it starts.

Measured against Xiaomi’s MiMo v2.5 on the same prompt, the Decrypt reviewer rated Opus 4.8 as less fluid with less narrative momentum. Against Opus 4.7 on a default single-pass setting, the result reads as lateral at best. Higher-effort settings and multi-shot prompting would likely help, but on the base pass, creative prose is clearly the surface that got the least iteration this cycle.

The non-math reasoning test is a whodunit: three abductions during a winter school trip, with a timeline that requires careful tracking. The tester designed the problem so that the correct answer is Leo. Opus 4.8 instead returned a confident, detailed case against Eric, constructing a timeline-based alibi for Leo, the actual perpetrator, and finding enough behavioral oddities in Eric to make him look plausibly guilty. The case against Eric is internally consistent and presented without hedging. Opus 4.7 reached Leo.

One early faulty inference compounded into a full case that held together in every internal respect while pointing at the wrong person. A reader who didn’t already know the answer would have no reason to question the conclusion.

The Needle Test and the Refusal

The long-context test plants two lines inside a large document and asks the model to locate them. The 300,000-token version didn’t load: Opus 4.8 collapsed under the context size and couldn’t process the document. Anthropic’s marketed context window is one million tokens; that capability is available through the API, not through the subscription interface the Decrypt team was using.

The 85,000-token version worked. Both planted lines, inserted into a copy of Ambrose Bierce’s 1906 “The Devil’s Dictionary,” were found. The model correctly identified them as interpolations that don’t appear in Bierce’s original text. Then it refused to report what it had found, concluding the task looked like a prompt injection attempt or an atypical test designed to manipulate it. The refusal held across repeated follow-up attempts to surface the results.

Anthropic’s safety team reports that Opus 4.8 reaches near-Mythos-class alignment quality on their evaluations, and the honesty improvements are measurable across most of the test battery. In this case, the safety reflex fired on a completed, benign retrieval task and suppressed the answer the model had already produced.

Anthropic’s Math on a Near-Trillion-Dollar Debut

On June 1, five days after the model launched, Anthropic filed a confidential S-1 with the U.S. Securities and Exchange Commission (SEC), the first formal step toward a public listing. The company had just closed a $65 billion Series H that put its valuation at $965 billion, surpassing OpenAI’s for the first time. Revenue run rate hit $47 billion in May 2026, up from roughly $4 billion a year earlier. Claude Code crossed $1 billion in annualized revenue within six months of its launch.

The Pro plan version of Opus 4.8 runs out of room on serious agentic workloads. Claude Max, with more headroom, costs five to ten times more per month. Teams that need stable large-scale production use have a third option: direct API access, with the cost-management overhead that requires.

Claude Model                | Input (per 1M tokens) | Output (per 1M tokens)
Opus 4.8 (current flagship) | $5.00                 | $25.00
Sonnet 4.6 (mid-tier)       | $3.00                 | $15.00
Haiku 4.5 (fast, cheap)     | $1.00                 | $5.00
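For teams weighing direct API access, the per-token prices in the table translate into a session cost with simple arithmetic. The sketch below uses the listed prices. The workload (1,000,000 input tokens and 50,000 output tokens) and the `api_cost` helper are hypothetical, and the calculation ignores any caching or batch discounts.

```python
# Per-million-token prices in USD as listed in the table: (input, output).
PRICES = {
    "Opus 4.8": (5.00, 25.00),
    "Sonnet 4.6": (3.00, 15.00),
    "Haiku 4.5": (1.00, 5.00),
}


def api_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a workload at the listed per-token prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 1M input tokens, 50k output tokens.
    for model in PRICES:
        print(f"{model}: ${api_cost(model, 1_000_000, 50_000):.2f}")
    # Opus 4.8: $6.25
    # Sonnet 4.6: $3.75
    # Haiku 4.5: $1.25
```

At these prices the input side dominates the bill for replay-heavy agentic sessions, so the estimate is most sensitive to how much context each turn resends.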

Opus 4.8 leads GPT-5.5 by more than 10 percentage points on SWE-Bench Pro and now runs its fast tier at one-third the price it did on Opus 4.7. On quota structure, OpenAI’s GPT-5.5 subscription allows longer effective usage per period, which for developers running large agentic sessions determines whether a project finishes in one sitting. Chinese models, which the Decrypt reviewer places at under 25% of Opus 4.8’s cost for comparable output quality, represent a third bracket for a developer who hits the Pro wall and starts comparing alternatives.

At $100 to $200 a month for the tier that handles serious coding sessions without interruption, Claude Max sits in a different category from most individual developer budgets. Anthropic’s confidential S-1 landed on June 1 carrying a $47 billion annualized revenue run rate. A year ago, the number was $4 billion.

Logan Pierce is a writer and web publisher with over seven years of experience covering consumer technology. He has published work on independent tech blogs and under freelance bylines, covering Android devices, privacy-focused software, and budget gadgets. Logan founded Oton Technology to publish clear, no-nonsense tech news and reviews based on real hands-on testing. He has personally tested and reviewed dozens of mid-range and budget Android phones, written extensively about app privacy, and built and managed multiple WordPress publications over the past decade. Logan holds a bachelor’s degree in English and studied digital marketing at a certificate level.
