AI Agents Won’t Crash the Internet. They’ll Drift Until It Breaks.

AI agents are moving into production, but their failure mode is invisible to traditional monitoring. PagerDuty’s Jenn Tejada explains why.

Published

July 4, 2026

Logan Pierce

PagerDuty Executive Chair Jenn Tejada has watched three technology cycles from inside the operations business. She took PagerDuty public in 2019, ran the company through the cloud era, and on July 2 told Forbes contributor Martine Paris that AI is now moving from experimentation into production. Her warning rests on a specific technical concern: AI agents introduce a failure mode the industry’s existing monitoring stack was never built to see.

AI agents do not crash servers. They drift. A drifted model keeps answering, keeps transacting, keeps shipping output, but the output quietly degrades across thousands of small calls until something downstream fails and no one can point to a root cause. Tejada told Forbes the failure pattern looks nothing like a service outage, and the industry’s existing observability tools are not built to catch it. That gap, not the AI itself, is what SRE and ML platform teams will spend the next year trying to close, according to her July 2 interview on agent drift risk.

Why Drift Doesn’t Look Like a Crash

Tejada laid out the distinction in plain terms. “Software fails in a way you can see,” she told Forbes. A crashed service produces a hard signal: a 503, an error rate spike, a missing heartbeat.

“When AI drifts, it’s actually harder to see, and you don’t see until it’s executed that drift in a number of ways, and now it’s evolved into multiple failures,” Tejada said.

Drift produces no such signal. It is the same model returning the same status codes and shipping the same volume of traffic. The degradation shows up in the distribution of outputs, in confidence scores, in subtle pattern shifts that no service-level indicator is configured to surface. A drifted agent sits in the gray zone between working and broken, and the page never fires because, by every metric the operations team is watching, nothing has failed. Model drift of this kind is the failure mode Tejada argues the current stack cannot see.
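One way to make a shift in the distribution of outputs visible is to compare a recent window of confidence scores against a baseline window. The sketch below uses the Population Stability Index (PSI), a common distribution-shift measure; it is a stand-in chosen for illustration, not a method Tejada or PagerDuty describes. The function names, the ten-bucket layout, and the 0.25 alert threshold (a widely used rule of thumb) are all assumptions.

```python
import math
from typing import List, Sequence


def bucket_shares(scores: Sequence[float], bins: int = 10) -> List[float]:
    """Share of scores in each equal-width bucket over [0, 1] (assumed range)."""
    if not scores:
        raise ValueError("need at least one score")
    counts = [0] * bins
    for s in scores:
        # Clamp so that 1.0 (or a stray out-of-range value) lands in a valid bucket.
        idx = min(max(int(s * bins), 0), bins - 1)
        counts[idx] += 1
    # Floor each share at a tiny value so the logarithm below is always defined.
    return [max(c / len(scores), 1e-6) for c in counts]


def psi(baseline: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a baseline and a current sample."""
    expected = bucket_shares(baseline, bins)
    actual = bucket_shares(current, bins)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


def has_drifted(baseline: Sequence[float], current: Sequence[float],
                threshold: float = 0.25) -> bool:
    """Hypothetical alert rule: flag drift when PSI exceeds the threshold."""
    return psi(baseline, current) > threshold
```

In this sketch, a service whose status codes and traffic volume are unchanged would still be flagged if its confidence scores migrated from the top bucket toward the middle of the range, which is the kind of shift a conventional service-level indicator would not surface.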

The compounding pattern is what makes drift dangerous. One agent’s slightly wrong answer becomes another agent’s input, then a third agent’s assumption, then a multi-system failure with no clean starting point, and no alert triggers until something downstream is already broken. Catching it requires extending traditional incident-response tooling with agent-specific telemetry: decisions, tool calls, and confidence signals running alongside conventional infrastructure data. Tejada has argued for an independent layer that monitors how agents behave and gives humans a way to interrupt them when outputs look wrong, a theme she returned to in her 2026 impact letter on AI-first operations.
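To make the agent-specific telemetry concrete, here is a minimal sketch of what one record might carry and how a recorded parent link lets a responder walk a downstream failure back toward its origin. The `AgentEvent` fields, the `trace_upstream` and `first_low_confidence` helpers, and the 0.7 threshold are hypothetical, invented for illustration; they are not a PagerDuty schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AgentEvent:
    """One agent decision, with the signals named above (hypothetical schema)."""
    event_id: str
    agent: str
    decision: str
    confidence: float
    tool_calls: List[str] = field(default_factory=list)
    parent_id: Optional[str] = None  # event whose output this agent consumed


def trace_upstream(events: Dict[str, AgentEvent], event_id: str) -> List[AgentEvent]:
    """Follow parent links from a failing event back to the start of the chain."""
    chain: List[AgentEvent] = []
    seen = set()
    current = events.get(event_id)
    while current is not None and current.event_id not in seen:
        seen.add(current.event_id)  # guard against accidental cycles
        chain.append(current)
        current = events.get(current.parent_id) if current.parent_id else None
    return chain  # ordered downstream -> upstream


def first_low_confidence(chain: List[AgentEvent],
                         threshold: float = 0.7) -> Optional[AgentEvent]:
    """Return the most upstream event below the (assumed) confidence threshold."""
    for event in reversed(chain):
        if event.confidence < threshold:
            return event
    return None
```

Given a billing event that consumed an inventory event that consumed a pricing event, `trace_upstream` returns the three in that order, and `first_low_confidence` points at the pricing decision if that is where confidence first dipped. Without the `parent_id` link, the same failure has no recoverable starting point.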

The tools most teams run today do not produce those signals. PagerDuty’s own impact letter from this year frames the company as “the central nervous system for modern, AI-first operations.” The shift Tejada is naming in the Forbes interview is what that positioning now has to absorb: failures that do not announce themselves as failures, originating inside the AI layer rather than the infrastructure layer below it.

The October 2025 Outage as Preview

The pattern Tejada described has a recent template.

Outage duration: over 15 hours
Companies affected: more than 2,000
User reports: 8.1 million globally, including 1.9 million in the US, 1 million in the UK, and 418,000 in Australia
Root cause: latent race condition in DynamoDB’s automated DNS management system
Recovery: full restoration took until October 21, more than 24 hours after the initial DNS race condition

On October 20, 2025, a DNS race condition inside AWS’s DynamoDB automation in the US-EAST-1 region triggered a cascade across interconnected cloud services, according to the post-incident analysis of the AWS DNS race condition. The defect sat in a critical subsystem: two DNS enactor processes raced to apply overlapping plans, and the cleanup logic deleted the active plan. With no DNS record left to resolve, dependent services could not locate DynamoDB.

More than 2,000 companies were affected, with 8.1 million Downdetector reports filed globally, per the Guardian’s coverage of the outage. UK banks including Lloyds, Halifax, and Bank of Scotland went dark alongside consumer apps like Snapchat, Signal, Duolingo, Slack, and Ring. Amazon’s own retail site and its smart-home business were hit. Recovery extended past midnight UTC on October 21, more than a day after the initial DNS race condition, as backlogged workflows drained across EC2, Redshift, and Amazon Connect.

The AWS outage itself was, in every important sense, a textbook infrastructure failure: a software defect inside a critical subsystem, exposed by a load-balancer monitoring flaw, cascading through dependent services. Tejada’s worry is that the next cascade will not look like that. A drifting agent producing slightly skewed outputs would not have triggered AWS’s health dashboard on October 20. It would have looked like normal traffic for hours while each call quietly bent the next. By the time anything downstream failed, the root cause would sit several agents upstream, with no single telemetry signal pointing back to the original drift.

Smaller Teams, More Complex Systems

Tejada also pointed to a structural tension inside the AI buildout. Smaller “one and two pizza” engineering teams, an Amazon term for groups small enough to be fed with one or two pizzas, will become more common as AI lets a small headcount ship more. At the same time, the systems those teams ship are getting denser and more interconnected. Meta’s data center buildout, which Tejada cited in the interview, sits alongside the company’s America’s Workforce Academy as one example of the capital intensity behind the transition. The new CEO John DiLullo is being introduced to Nvidia’s engineering team as part of the same push, and an intern class at Nvidia reminded Tejada on a recent visit that the workforce is already turning over inside the buildout.

A two-pizza team can deploy an agent that touches half a dozen external services in an afternoon, and the blast radius of a single drifted decision now extends across more systems than any one human on the team can hold a full mental model of. The cloud made infrastructure reliable enough that small teams could ship without owning the data center. AI is doing the same thing for decisions, while the monitoring layer that would catch a drifting agent has not yet been built to match. The compounding pattern in production now resembles the silent-failure mode Tejada is putting on the industry’s radar.

Inside the $725 Billion Buildout

The financial backdrop is the hyperscaler capital expenditure cycle. Tejada cited a BNP Paribas estimate putting 2026 hyperscaler AI capital spending at $725 billion, roughly double the prior year’s figure. Forbes contributor Jon Markman traced the same aggregate in April, drawing on the four hyperscalers’ own 2026 capex guidance.

Markman’s company-level figures come to roughly $190 billion at Microsoft, $200 billion at Amazon, $185 billion at Alphabet, and $115 to $135 billion at Meta, or about $690 to $710 billion combined, slightly below the BNP Paribas aggregate. The money is going into data centers, networking silicon, GPUs, and the energy contracts to power them. The capital flowing in is for the AI itself; the corresponding spend on the layer that watches the AI has not yet matched. Each new dollar of capex expands the surface area on which an agent can drift.

More compute means more concurrent agent decisions, and more interconnected services mean more places for a small behavioral shift to compound. The economics of the buildout assume the underlying software works. The open operational question, on Tejada’s reading, is whether monitoring can detect a model whose output is technically correct but behaviorally wrong, and whether the SRE stack can do so before the error cascades.

The Observability Stack Reaches for Models

PagerDuty, the company Tejada still chairs, made its own bet on the problem in October 2025 when it launched what it called the industry’s first end-to-end AI agent suite. The release, announced from San Francisco, included four agents designed to extend its digital-operations platform into the new failure shape Tejada described. The suite shipped alongside more than 150 platform enhancements. Together they pushed PagerDuty’s product surface from infrastructure monitoring into agent and model monitoring. The four agents each target a different slice of the incident lifecycle.

PagerDuty SRE Agent: learns from related incidents, surfaces context, recommends and executes diagnostics and remediations, and generates self-updating runbooks
PagerDuty Scribe Agent: transcribes Zoom calls and chat during incidents into structured summaries and status updates in Slack or Microsoft Teams
PagerDuty Shift Agent: detects and resolves on-call scheduling conflicts automatically
PagerDuty Insights Agent: delivers context-aware answers and proactive recommendations from PagerDuty analytics

PagerDuty said early adopters resolved incidents up to 50% faster with the suite in use, per the company’s announcement of the four-agent AI suite launched in October 2025. The release also included general availability of a remote Model Context Protocol server, built on the open standard Anthropic introduced, and the company said more than 250 customers adopted the MCP integration within two months. RedMonk analyst James Governor called the release a platform-reliability story focused on developer experience and open standards. The direction is the same one Tejada articulated in the Forbes interview: extending monitoring from the infrastructure layer up into the model and agent layer, where the new failure modes live.

How the SRE Job Changes Next

The concrete requirement Tejada laid out, in her own framing, is that teams running agentic systems need an independent layer that monitors agent behavior continuously and gives humans a way to interrupt or pause work when outputs look wrong. That work falls to SRE and ML platform teams. It looks like extending existing incident-response tooling with agent-specific signals plus a human-in-the-loop override path.
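As a rough illustration of that requirement, the sketch below keeps a rolling window of one behavioral signal (confidence), pauses the agent when the window’s mean falls below a floor, and stays paused until a human resumes it. The `BehaviorMonitor` class, its method names, and every threshold are hypothetical choices for this example, not a description of any vendor’s product.

```python
from collections import deque
from statistics import mean
from typing import Deque, Optional


class BehaviorMonitor:
    """Sits outside the agent, watches a behavioral signal, and can halt work."""

    def __init__(self, window: int = 50, floor: float = 0.75, min_samples: int = 10):
        self._scores: Deque[float] = deque(maxlen=window)  # rolling window
        self.floor = floor
        self.min_samples = min_samples
        self.paused = False
        self.pause_reason: Optional[str] = None

    def observe(self, confidence: float) -> bool:
        """Record one signal; return True if the agent may keep working."""
        self._scores.append(confidence)
        enough = len(self._scores) >= self.min_samples
        if not self.paused and enough and mean(self._scores) < self.floor:
            self.pause("rolling mean confidence below floor")
        return not self.paused

    def pause(self, reason: str) -> None:
        """Override path: callable by the monitor itself or by a human operator."""
        self.paused = True
        self.pause_reason = reason

    def resume(self) -> None:
        """Only a human calls this; the window is cleared to start fresh."""
        self._scores.clear()
        self.paused = False
        self.pause_reason = None
```

The design choice worth noting is that the pause is sticky: a later run of healthy-looking scores does not lift it, so the decision to continue stays with a person rather than with the system being monitored.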

The adjacent work is already visible. Kyndryl’s 2026 People Readiness Report, drawing on 1,100 senior leaders, found only 23% said their workforce was fully ready for AI, down six points from a year earlier. The KPMG 2026 Global Tech Report, based on a survey of 2,500 tech executives across 27 countries, found 92% of them expect managing AI agents to be a core skill by 2031. AvePoint’s 2026 State of AI report, conducted with Osterman Research across 750 enterprise leaders, found 88.4% of organisations reported at least one AI agent-related security incident in the past 12 months. The gap between AI agent deployment and workforce readiness runs in parallel with the drift detection gap Tejada is naming.

The drift problem Tejada is putting forward will land, first and worst, on the SRE and platform teams tasked with detecting it before the next 15-hour cascade. The work is unglamorous: extend existing pipelines with agent-level signals, build the override path, and accept that the operational discipline for AI looks more like SRE than like ML model evaluation. The teams that build that capability will own the next layer of the AI stack, and the public postmortems of agent-driven incidents that begin appearing over the next 12 months will start to show who has.