AI Outages Drive Downtime Costs to a Record $600 Billion

Companies bought artificial intelligence (AI) to make outages rare. Instead, they are managing more of them. A new Splunk study across the Global 2000, the world’s largest public companies, puts the yearly cost of unplanned downtime at a record $600 billion, up by half in two years, even as those firms spend a median of $24.5 million each on AI meant to keep systems running.

Beneath that figure sits a problem the dollar signs obscure. Only 38% of technology executives say they can consistently identify what caused an outage once one lands. As automation absorbs the routine decisions, the engineers who once diagnosed failures are losing the reflexes to do it.

A $600 Billion Bill the Automation Was Meant to Prevent

Downtime, the unplanned interruption of the software that runs sales, logistics and customer support, has always cost money. What changed is the size of the bill. Splunk, now part of Cisco, surveyed 2,000 executives with the research firm Oxford Economics for its Hidden Costs of Downtime report and landed on a figure that comes to roughly $300 million for the average company each year.

The pain is sharpest by the minute. While a system is down the meter runs fast, and a single incident now dents a company’s market value before the engineers have even found the cause. Nine in ten technology leaders say an outage spikes customer-support demand, and 47% admit customers are often the first to notice something is wrong.

  • $600 billion a year in downtime across the Global 2000, a 50% jump in two years.
  • $15,000 a minute is the average cost while systems are dark, about $900,000 an hour.
  • 3.4% is the average slide in a company’s share price after one major incident.
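The headline figures hang together arithmetically. A minimal Python sketch, assuming the 2,000 companies of the Global 2000 as the base and treating the per-minute figure as a flat average (the variable and function names are illustrative, not taken from the report):

```python
# Figures quoted in the article.
TOTAL_ANNUAL_COST = 600e9      # dollars per year across the Global 2000
COMPANIES = 2_000              # assumption: one entry per Global 2000 firm
COST_PER_MINUTE = 15_000       # average dollars per minute of downtime
TWO_YEAR_GROWTH = 0.50         # "a 50% jump in two years"

# Average annual cost per company.
per_company = TOTAL_ANNUAL_COST / COMPANIES

# Hourly cost implied by the per-minute average.
per_hour = COST_PER_MINUTE * 60

# Total implied for two years earlier, if the 50% jump is exact.
implied_prior_total = TOTAL_ANNUAL_COST / (1 + TWO_YEAR_GROWTH)


def outage_cost(minutes: float) -> float:
    """Rough cost of one outage, assuming the flat per-minute average applies."""
    return minutes * COST_PER_MINUTE


print(f"Per company: ${per_company:,.0f} a year")
print(f"Per hour: ${per_hour:,.0f}")
print(f"Implied total two years ago: ${implied_prior_total:,.0f}")
print(f"Four-hour outage: ${outage_cost(240):,.0f}")
```

On that flat-rate assumption a four-hour outage would run to about $3.6 million. Real incidents will vary widely around the average.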

Direct costs are climbing on every line. Regulatory fines now average $51 million as enforcement tightens under the European Union’s General Data Protection Regulation (GDPR) and breach-disclosure rules from the U.S. Securities and Exchange Commission (SEC), while ransomware payouts have risen nearly threefold, to $40 million. The spending meant to hold all of this back keeps growing too, which is the heart of what the report calls the reliability paradox. “Downtime is inevitable; prolonged disruption is not,” says Kamal Hathi, a senior vice president and general manager at Splunk, who puts it plainly: the harder companies lean on AI to wipe out operational risk, the more of a new and less predictable kind they create.

Only 38% Can Find What Broke

Heavy spending on monitoring has not bought clarity. In the survey, only 38% of technology executives said they can consistently trace an outage back to its root cause, even with dashboards and alerting platforms layered across their systems. When the cause does surface, it is rarely quick; downtime traced to human error takes most teams the better part of a day just to detect, and days more to fully resolve. The other side of that number is the worrying one: most of the time, something breaks and no one is sure why.

Part of the reason is cultural, and it compounds quietly. As automated systems take over the routine operational calls, fewer engineers spend their days building the deep familiarity with how the pieces fit, the instinct that lets a veteran look at a symptom and name the cause. When the automation itself stumbles, that instinct is what is missing.

“If engineering teams aren’t measuring reliability with the same rigor they measure velocity, governance frameworks will always lose to ship timelines.”

The warning comes from Greg Leffler, Splunk’s director of developer evangelism, who argues that the engineering discipline built for software releases, the staged rollouts, the canary tests, the ready rollback, now has to cover every AI model that carries decision-making authority. Too often, he says, those models ship on the assumption that they will correct themselves, a courtesy traditional infrastructure was never given.

That dynamic is the cost the headline number hides. Every automated decision shifts a little more diagnostic muscle from people to systems. So when an AI process finally misfires, the firm has both a harder failure to untangle and fewer engineers with the instinct to untangle it, even as the systems they oversee grow more interconnected.

The Outages That Never Look Like Outages

An AI-driven outage rarely arrives as a dramatic crash. More often it looks like a slow erosion of good behavior that spreads well before anyone thinks to investigate. Leffler points to two patterns the report sees again and again. One is model drift, which he describes as “an automation pipeline making correct decisions six months ago whose training data no longer reflects current traffic. By the time anyone notices, the damage is already spreading across interconnected services.” The other is broken integrations, where a system acts on incomplete data and sets off a chain of failures across services that no single team fully owns. Both erode trust gradually, until something critical finally tips over.

Those failure modes show up at scale. Every technology leader surveyed reported at least one AI-related outage in the past year, even though 56% also credited AI with lowering their overall risk. Prompt injection and data poisoning, where attackers feed a model bad inputs or quietly corrupt its training so it acts in their favor, are the newest entries on the list, and 77% of leaders expect AI-armed criminals to drive their downtime higher still. Splunk’s catalog of AI-driven outage modes sorts the damage by how widely each has been felt.

| AI failure mode | Share of organizations hit | What goes wrong |
| --- | --- | --- |
| Incorrect AI automation | 50% | Automated actions fire on faulty logic and push errors straight into production |
| Model drift | 50% | Training data stops matching live traffic, so decisions quietly degrade over months |
| Bugs from embedding AI | About a third | New AI code dropped into live systems breaks connected services |
| Prompt injection and data poisoning | 26% | Attackers manipulate what a model reads or learns to bend its behavior |
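Model drift, as described above, is a mismatch between the data a model was trained on and the traffic it now sees. One common way to watch for it is the Population Stability Index (PSI), which compares how a feature is distributed in a baseline sample and in a live one. The sketch below is an illustration only, not a method the report prescribes; the function name, the ten-bin layout and the 0.2 alert threshold (a widely used rule of thumb) are all assumptions.

```python
import math
from typing import List, Sequence


def population_stability_index(baseline: Sequence[float],
                               live: Sequence[float],
                               bins: int = 10) -> float:
    """Compare two samples of one feature; larger values mean more drift."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp so live values outside the baseline range land in an edge bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor each share at a tiny value so the logarithm is always defined.
        return [max(c / len(values), 1e-6) for c in counts]

    base_shares, live_shares = shares(baseline), shares(live)
    return sum((l - b) * math.log(l / b)
               for b, l in zip(base_shares, live_shares))


# Hypothetical data: training-time values spread over 0..1,
# live traffic that has shifted into the upper half of that range.
training = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]

ALERT_THRESHOLD = 0.2  # assumption: conventional rule of thumb, tune per system

for name, sample in (("unchanged", training), ("shifted", shifted)):
    score = population_stability_index(training, sample)
    status = "investigate" if score > ALERT_THRESHOLD else "ok"
    print(f"{name}: PSI={score:.3f} -> {status}")
```

A check like this only flags that inputs have moved. Deciding whether the model's decisions have actually degraded still takes someone who knows the system.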

Agentic Systems Raise the Blast Radius

Technology is not standing still, and it is moving toward less human supervision. What started with chatbots and copilots is shifting to autonomous agents that act on their own, and 44% of organizations now run some form of agentic AI. The appeal is obvious: software that does not just suggest a fix but applies it.

The worry travels with it. Some 68% of technology leaders said they fear their agents will behave unpredictably and cause outages, and the math of autonomy is unforgiving. When an agent is wrong, it tends to be wrong at machine speed and machine scale, executing the mistake across connected systems faster than a person can step in.

That risk is already showing up. A McKinsey playbook on deploying agentic AI safely reports that 80% of organizations have already seen risky agent behavior, from improper data exposure to systems being reached without authorization. The agents do not need a malicious hand to cause harm; a misread instruction or a bad chain of context is enough.

Hathi’s prescription is to slow the handover of control. “Agentic systems need to earn their autonomy incrementally,” he says, governed by visibility and accountability at each step rather than turned loose and watched after the fact.

That is not how most deployments work today. Agents are often pushed into mission-critical systems with no clear escalation path and no monitoring tuned to catch drift, then audited only once something has already gone wrong. The budgets reflect the unease: 85% of leaders are prioritizing AI-driven security automation and 65% are buying AI-powered observability to watch their own systems more closely. The speed that makes AI attractive is the same speed that turns a small error into an incident.

The AI Layer Nobody Can See

Some of the hardest risk to measure never appears on an official systems diagram. Fully 66% of organizations said employees are using unapproved AI tools, the so-called shadow AI, to write code, draft work and automate decisions, usually with no central record of what data those tools touch. Unlike the shadow IT of a decade ago, a rogue tool here can reshape how work gets done while leaving almost no trace of how a decision was reached. That is what makes shadow AI so hard to measure: the tools that matter most are the ones nobody logged.

The visibility gap widens further out, at the third-party layer. Most enterprises now lean on outside AI providers and cloud services they cannot see into, which Leffler calls a compounding opacity problem: tiers of interconnected risk sitting beyond anything a team can directly observe. The exposure is not hypothetical. When OpenAI’s ChatGPT went dark in a global outage, it took with it a service that roughly four million developers build against through its application programming interface (API), a single point of failure sitting under countless other products.

What the report keeps circling back to is simpler: keep human judgment firmly in the loop. Organizations that minimize AI-related downtime, says Hanlin Fang, a vice president of product management at Splunk, are not the ones with the fanciest models but the ones that keep “humans in control,” with continuous monitoring and quick intervention when results start to drift. Closing the gap, he argues, takes policy, visibility and governance together, resting on a telemetry layer that shows why more data can leave a company less resilient.

What the Resilient Outfits Do Differently

Not every company is bleeding equally. The report flags a cohort it calls AI Workflow and Triage Experts, firms that pair automation with disciplined oversight, and their outcomes pull away from the pack. Last year 74% of them avoided having to publicly disclose a data breach, against 54% of everyone else, and they were nearly three times as likely to say they had never lost a customer to downtime, 42% versus 15%. They also recover faster when they are hit, the report finds, and pay tens of millions less in fines and missed-contract penalties after an incident.

What separates them is closer to operating habit than to budget. The practices the report keeps prescribing are unglamorous and familiar to anyone who has run production software:

  • Stage rollouts and test changes on a small slice of live traffic before turning them loose everywhere.
  • Keep a tested rollback ready for every model that carries decision-making authority.
  • Route high-stakes calls to a human for sign-off instead of letting an agent act unchecked.
  • Run a telemetry layer of logs, metrics and traces so AI actions can be watched as they happen.
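The habits above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than anything the report specifies: the `Action` type, its `risk` score, the `canary_verdict` and `execute` helpers and every threshold are hypothetical, and a real rollout would rely on a deployment platform instead of hand-rolled checks.

```python
import logging
from dataclasses import dataclass
from typing import Callable

log = logging.getLogger("agent_guardrail")  # hypothetical logger name


@dataclass
class Action:
    name: str
    risk: float  # hypothetical score from 0.0 (routine) to 1.0 (high stakes)


def canary_verdict(baseline_errors: int, baseline_total: int,
                   canary_errors: int, canary_total: int,
                   max_ratio: float = 1.5, floor: float = 0.001) -> str:
    """Promote a change only if the canary slice errs no more than allowed."""
    baseline_rate = baseline_errors / baseline_total
    canary_rate = canary_errors / canary_total
    # The floor keeps a near-zero baseline from making any single error fatal.
    allowed = max(baseline_rate * max_ratio, floor)
    verdict = "rollback" if canary_rate > allowed else "promote"
    log.info("canary rate %.4f, allowed %.4f: %s", canary_rate, allowed, verdict)
    return verdict


def execute(action: Action,
            run: Callable[[Action], None],
            approve: Callable[[Action], bool],
            risk_threshold: float = 0.7) -> str:
    """Run low-risk actions directly; send high-stakes ones to a human first."""
    if action.risk >= risk_threshold and not approve(action):
        log.warning("blocked %s (risk %.2f)", action.name, action.risk)
        return "blocked"
    run(action)
    log.info("executed %s (risk %.2f)", action.name, action.risk)
    return "executed"


# Usage with stand-in callbacks: the reviewer here rejects everything.
performed = []
print(canary_verdict(10, 1000, 1, 100))   # canary matches the baseline rate
print(canary_verdict(10, 1000, 5, 100))   # canary errs five times as often
print(execute(Action("restart cache", 0.2), performed.append, lambda a: False))
print(execute(Action("drop table", 0.9), performed.append, lambda a: False))
```

The point of the design is that every path writes a log line, so each decision leaves a trace whether the action ran or was stopped.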

The money is not the constraint. Companies already spend a median of $24.5 million a year on AI tools to prevent and respond to downtime, and the ones that come out ahead are not those that spent the most. “Resilience, governance, and observability are becoming the real differentiators,” Hathi says, a pointed reminder that every rival now reaches for the same models and the same cloud.

Until more than 38% of technology executives can consistently say what broke, the $600 billion bill has room left to climb.

Logan Pierce is a writer and web publisher with over seven years of experience covering consumer technology. He has published work on independent tech blogs and under freelance bylines, covering Android devices, privacy-focused software, and budget gadgets. Logan founded Oton Technology to publish clear, no-nonsense tech news and reviews based on real hands-on testing. He has personally tested and reviewed dozens of mid-range and budget Android phones, written extensively about app privacy, and built and managed multiple WordPress publications over the past decade. Logan holds a bachelor's degree in English and studied digital marketing at a certificate level.
