AI

Microsoft ASSERT Moves AI Behavior Tests Into Release Workflows

Microsoft ASSERT converts plain-language AI behavior rules into executable tests, trace records and scorecards for teams shipping agents in production.

Published June 4, 2026

Logan Pierce

Microsoft ASSERT gives developers an open-source evaluation framework for turning plain-language artificial intelligence (AI) behavior rules into executable tests, scored results and trace records. The new tool, whose full name is Adaptive Spec-driven Scoring for Evaluation and Regression Testing, targets agents and applications that already have written policies, product requirements or launch criteria.

The launch puts evaluation closer to release engineering. Microsoft is asking teams to treat their own rules as test inputs, then rerun those tests as prompts, models, tools and retrieval sources change.

Plain English Moves Into the Test Suite

The Microsoft ASSERT launch post says the framework begins with written intent, then turns it into scenarios, datasets, metrics and scorecards. That intent can be a product requirement, a policy document, a system prompt, a launch checklist or a review note.

That is a familiar place for AI teams to get stuck. A product manager may know that a support agent can approve small refunds, escalate fraud flags and reject out-of-policy requests. A security lead may know that a document agent can summarize confidential material for executives and refuse to send it outside the company. The hard work starts when those rules need to become repeatable tests.

Microsoft names helpfulness, relevance, groundedness, toxicity and faithfulness as useful signals that still miss product-specific boundaries. ASSERT is built for plain-language specs that describe those boundaries. The framework turns them into an editable taxonomy, generates benign and adversarial cases, runs those cases against the target system and records the path the agent took.

That last part is aimed at agent failures that hide between the prompt and the final answer. Tool calls, retrieved context, routing behavior and intermediate actions can be captured, so a developer can inspect the step where the agent left the policy path.
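
A minimal sketch of that kind of inspection, assuming a trace is available as an ordered list of steps. The `TraceStep` shape, the tool names and the `first_policy_departure` helper are hypothetical illustrations, not ASSERT's trace schema or API.

```python
from dataclasses import dataclass


@dataclass
class TraceStep:
    """One recorded step in an agent run (hypothetical shape, not ASSERT's schema)."""
    kind: str        # e.g. "tool_call", "retrieval", "routing", "answer"
    name: str        # tool or component name
    detail: str = ""


def first_policy_departure(trace, allowed_tools):
    """Return (index, step) for the first tool call outside the allowed set, or None."""
    for index, step in enumerate(trace):
        if step.kind == "tool_call" and step.name not in allowed_tools:
            return index, step
    return None


# Illustrative trace: the agent retrieves context, looks up an order,
# then calls a tool the policy does not permit.
trace = [
    TraceStep("retrieval", "policy_docs"),
    TraceStep("tool_call", "lookup_order"),
    TraceStep("tool_call", "send_external_email", "recipient outside company"),
    TraceStep("answer", "final"),
]

departure = first_policy_departure(trace, allowed_tools={"lookup_order", "issue_refund"})
print(departure)
```

The point of the sketch is that the failing step is identified by position in the run, so a reviewer can look at what the agent saw and did immediately before it.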


What the Pipeline Produces

Microsoft describes ASSERT as a staged pipeline. The stages matter because the output is meant to be reviewed by people who own the policy, the product and the system.

Stage	Main Input	Output Developers Inspect
Systematize	A broad behavior such as tool-use governance or unsafe health guidance	A concept specification with patterns, definitions and edge cases
Taxonomize	The concept specification and policy stance	An editable taxonomy of permitted and prohibited behavior
Generate Test Set	Declared dimensions such as persona, task type, tool access or request class	Single-turn prompts and multi-turn scenarios
Run Inference	The model, agent or application workflow under test	Outputs plus traces that show tool calls and intermediate state
Score	The trace and the policy taxonomy	Labels, rationales, policy citations and failure patterns
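
To make the "declared dimensions" input concrete, here is a small sketch that expands a few dimensions into a grid of case descriptors. It only illustrates the combinatorial idea. The dimension values and the `expand_dimensions` helper are assumptions, and this is not a description of how ASSERT actually generates prompts or scenarios.

```python
from itertools import product

# Hypothetical dimension values chosen for illustration.
dimensions = {
    "persona": ["new customer", "frustrated customer"],
    "task_type": ["refund request", "account change"],
    "tool_access": ["read_only", "read_write"],
}


def expand_dimensions(dims):
    """Return one dict per combination of dimension values."""
    keys = list(dims)
    return [dict(zip(keys, combo)) for combo in product(*(dims[key] for key in keys))]


cases = expand_dimensions(dimensions)
print(len(cases))  # 2 * 2 * 2 = 8 combinations
print(cases[0])
```

Each descriptor could then seed a single-turn prompt or a multi-turn scenario. The grid grows multiplicatively, which is one reason a team would want to choose dimensions deliberately.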

The run directory produces local artifacts, including taxonomy.json, test_set.jsonl, inference_set.jsonl, scores.jsonl and metrics.json. JavaScript Object Notation (JSON) and JSON Lines files are plain enough to move into review, continuous integration jobs or a release archive.
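
Because the artifacts are plain JSON and JSON Lines, they can be read with the standard library alone. The sketch below writes a tiny stand-in `scores.jsonl` to a temporary directory and tallies its labels. The `case_id` and `label` fields are assumed for illustration, since the article does not describe the real schema.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def read_jsonl(path):
    """Parse a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Demo data only: these field names are assumptions, not ASSERT's schema.
run_dir = Path(tempfile.mkdtemp())
sample = [
    {"case_id": "c1", "label": "pass"},
    {"case_id": "c2", "label": "fail"},
    {"case_id": "c3", "label": "pass"},
]
(run_dir / "scores.jsonl").write_text(
    "\n".join(json.dumps(row) for row in sample) + "\n", encoding="utf-8"
)

label_counts = Counter(row["label"] for row in read_jsonl(run_dir / "scores.jsonl"))
print(dict(label_counts))
```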

The design pushes the policy statement into the scoring record. A failed case should point back to the taxonomy behavior or developer-provided decision that produced the verdict, rather than leaving a team with a red cell and no explanation.

Build Release Wraps Testing Around Controls

In the Microsoft Foundry Build announcement, ASSERT arrived beside Agent Control Specification (ACS, an open standard for runtime controls). Microsoft says ACS defines five validation checkpoints in an agent lifecycle: input, large language model (LLM), state, tool execution and output.
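
The five checkpoints can be pictured as named hook points where checks run. The following is a rough sketch under that assumption. The `Checkpoint` enum mirrors the names in the announcement, while `ControlRegistry` and `block_external_send` are hypothetical and do not reflect the ACS policy format or any ACS API.

```python
from enum import Enum


class Checkpoint(Enum):
    """The five lifecycle checkpoints named in the announcement."""
    INPUT = "input"
    LLM = "llm"
    STATE = "state"
    TOOL_EXECUTION = "tool_execution"
    OUTPUT = "output"


class ControlRegistry:
    """Hypothetical registry that runs checks at a given checkpoint."""

    def __init__(self):
        self._checks = {checkpoint: [] for checkpoint in Checkpoint}

    def register(self, checkpoint, check):
        self._checks[checkpoint].append(check)

    def validate(self, checkpoint, payload):
        """Return the reasons given by every check that rejects the payload."""
        reasons = []
        for check in self._checks[checkpoint]:
            reason = check(payload)
            if reason is not None:
                reasons.append(reason)
        return reasons


def block_external_send(payload):
    """Illustrative control: reject email tool calls addressed outside the company."""
    if payload.get("tool") == "send_email" and payload.get("external"):
        return "external send is not permitted"
    return None


registry = ControlRegistry()
registry.register(Checkpoint.TOOL_EXECUTION, block_external_send)
violations = registry.validate(
    Checkpoint.TOOL_EXECUTION, {"tool": "send_email", "external": True}
)
print(violations)
```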

The pairing makes the release broader than a test generator. ASSERT finds policy failures before or after deployment. ACS gives teams a way to place controls at the workflow points where those failures happen. Microsoft also tied the release to Foundry evaluators, tracing and production monitoring, aiming the pitch at teams already moving agents from demos into business systems.

Microsoft cites an audience of 6 million to 13 million generative AI developers for framework-agnostic agent tooling.
ASSERT is pitched for LangChain, CrewAI, LiteLLM, OpenAI and other stacks.
ACS uses policy YAML, the plain-text configuration format many engineering teams already keep in version control.

The same Build cycle also produced Microsoft’s in-house AI model push, a separate move that gives the company more control over the systems its developer tools will test and monitor.

Validation Data Shows Where Specs Do the Work

Microsoft’s validation claims are internal vendor data, but the figures are specific. In a coverage study across social scoring, sycophancy, task adherence, tool-use governance and unsafe health guidance, ASSERT was compared with an in-house baseline that started from the same written intent.

The company says ASSERT covered about 1.2 times as much of the intended behavior space, surfaced about 1.5 times as many cases worth inspection and produced more than four times stronger separation between stronger and weaker systems. It also had about half as many saturated cases, where every model behaved the same way. Microsoft treated the roughly two-times increase in distinct failure patterns as directional because failure-type labeling is harder to stabilize.
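
Saturation is simple to compute once per-case outcomes exist for each model. The sketch below assumes a mapping from case ID to each model's pass or fail result and uses the definition given above, where a case is saturated when every model behaves the same way. The data and the `saturated_fraction` helper are illustrative, not Microsoft's measurement code.

```python
def saturated_fraction(results):
    """Share of cases where all models share the same outcome.

    `results` maps a case id to {model name: passed?}.
    """
    if not results:
        return 0.0
    saturated = sum(
        1 for outcomes in results.values() if len(set(outcomes.values())) == 1
    )
    return saturated / len(results)


# Illustrative outcomes for two hypothetical models.
results = {
    "case-1": {"model_a": True, "model_b": True},    # saturated: both pass
    "case-2": {"model_a": True, "model_b": False},   # separates the models
    "case-3": {"model_a": False, "model_b": False},  # saturated: both fail
    "case-4": {"model_a": False, "model_b": True},   # separates the models
}

fraction = saturated_fraction(results)
print(fraction)  # 2 of 4 cases are saturated
```

A saturated case says nothing about which system is stronger, so a lower fraction means more of the test set is doing comparative work.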

A second validation pass looked at LLM judges against human review. Across more than 10 behavior concepts, Microsoft says judge agreement with human annotators was typically in the 80% to 90% range, while human inter-annotator agreement was around 90%. Subject-matter experts also reviewed 15 generated datasets for policy alignment, behavioral relevance and quality.
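
As a rough illustration of what an agreement figure means, the sketch below computes raw percent agreement between judge labels and human labels on the same cases. Microsoft does not say which agreement statistic it used, so raw agreement is an assumption here, and the labels are invented.

```python
def percent_agreement(judge_labels, human_labels):
    """Fraction of cases where the judge and the human assigned the same label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    if not judge_labels:
        raise ValueError("at least one label is required")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Invented labels for five cases; the two raters disagree on the third.
judge = ["violation", "ok", "ok", "violation", "ok"]
human = ["violation", "ok", "violation", "violation", "ok"]

agreement = percent_agreement(judge, human)
print(agreement)  # 4 of 5 labels match
```

Raw agreement does not correct for agreement by chance, which is one reason a chance-corrected statistic is sometimes preferred when labels are imbalanced.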

Those numbers leave the usual problem in place. Judge quality depends on the model used as judge, and verdicts on fine policy distinctions can shift when the judge model changes.

The GitHub Package Keeps the Evidence Local

The ASSERT public GitHub repository describes the project as local-first, framework-agnostic and trace-aware. The repository lists an MIT open-source license and support for Python 3.11, 3.12 and 3.13.

The package can test hosted models, callable wrappers and OpenTelemetry-traced agents. Its README says LiteLLM support reaches more than 100 model endpoints, including Azure, OpenAI, Vertex AI, Anthropic, Bedrock, Cohere and Hugging Face. It also describes OpenInference integrations for agent systems such as LangGraph, CrewAI, OpenAI Agents SDK, DSPy, LlamaIndex and AutoGen.

Microsoft says the project collects no telemetry by default. Runs write local artifacts under artifacts/results/, while optional trace capture depends on the user’s configuration and collector. That detail matters for companies testing agents against internal policies, private prompts and restricted data paths.

The same artifact habit is showing up in security and verification work outside Microsoft. Oton Technology recently covered Apple’s post-quantum code release on GitHub, where formal proof material traveled with source code so implementers could inspect the claim instead of taking it on trust.

Where the Tool Will Miss Production Failure

The method behind ASSERT comes from systematization work. The AI-assisted systematization paper by Dhruv Agarwal, Emily Sheng and co-authors argues that many generative AI evaluation targets are broad, contested concepts. The paper gives examples such as reasoning, fairness and creativity, then says underspecified concepts make measurement and interpretation unclear.

Microsoft carries that caveat into the product notes. ASSERT works best when the behavior definition is narrow and the relevant constraints are clearly specified. Vague policies produce vague scenarios. Synthetic interactions can miss failures that only appear in production. Model-based judges can be unreliable when a policy distinction is subtle or domain-specific.

Practical boundary: ASSERT can speed up evaluation when a team writes a narrow behavior, target context, tools, constraints and scoring dimensions. A broad instruction like ‘be safe’ still leaves too much room for the system to invent test categories.

The framework also stops short of compliance certification. Microsoft says specification-driven evaluation should sit with human review, telemetry and domain expertise. A regulated health, finance or workplace system still needs the review path its field already requires.

How a Team Can Try the Travel Planner Example

The ASSERT travel-planner walkthrough shows the tool in a concrete agent setting. The sample target is a multi-agent LangGraph travel planner with tools for flight search, hotel search, weather checks, travel advisories and budget validation.

Write a behavior specification describing quality failures and safety failures.
Add application context, including the target system, tools and evaluation dimensions.
Generate single-turn prompts and multi-turn scenarios from the taxonomy.
Run the cases against the model, retrieval-augmented generation application, prompt chain, multi-agent workflow or application programming interface (API).
Inspect transcripts, traces, verdicts and aggregate metrics in the viewer.

The project page shows a sample configuration that generated 12 behavior categories, created 480 test scenarios, ran all 480 against the travel planner and scored policy_violation plus overrefusal. Those names are deliberately operational. One score catches policy breaks. The other catches refusals of legitimate requests.
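
The two scores pull in opposite directions, which is easier to see when both are summarized side by side. This is a minimal sketch assuming each scored case carries boolean `policy_violation` and `overrefusal` fields. The record shape and the `rate` helper are assumptions, and the rows are invented rather than taken from the walkthrough.

```python
def rate(rows, field):
    """Share of rows where the boolean field is true."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row[field]) / len(rows)


# Invented scored cases; field names follow the two scores named above.
scored = [
    {"case_id": "a", "policy_violation": False, "overrefusal": False},
    {"case_id": "b", "policy_violation": True, "overrefusal": False},
    {"case_id": "c", "policy_violation": False, "overrefusal": True},
    {"case_id": "d", "policy_violation": False, "overrefusal": False},
]

summary = {name: rate(scored, name) for name in ("policy_violation", "overrefusal")}
print(summary)
```

An agent that refuses everything would drive the first rate to zero and the second toward one, so tracking both guards against fixing one failure by creating the other.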

For release teams, the usable output is the record that can be rerun after a model swap, prompt edit, tool change or retrieval update. The tool is available under the MIT license, with the public GitHub repository listing ASSERT v0.1.0 as its initial release.
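
One way to use a rerunnable record in a release workflow is to compare a candidate run's metrics with a stored baseline and flag anything that got worse. The sketch below assumes metrics are failure rates where lower is better. The `regressions` helper, the tolerance and the numbers are hypothetical, not part of ASSERT.

```python
def regressions(baseline, candidate, tolerance=0.01):
    """Return metrics whose rate rose by more than `tolerance` (lower is better)."""
    worse = {}
    for name, old in baseline.items():
        new = candidate.get(name)
        if new is not None and new - old > tolerance:
            worse[name] = (old, new)
    return worse


# Invented rates: the candidate model violates policy more often than the baseline.
baseline = {"policy_violation": 0.02, "overrefusal": 0.05}
candidate = {"policy_violation": 0.06, "overrefusal": 0.05}

failed = regressions(baseline, candidate)
print(failed)
```

A continuous integration job could exit with a nonzero status when `failed` is non-empty, which would block the model swap or prompt edit until someone reviews the affected cases.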
