Microsoft ASSERT Moves AI Behavior Tests Into Release Workflows
Microsoft ASSERT converts plain-language AI behavior rules into executable tests, trace records and scorecards for teams shipping agents in production.
Microsoft ASSERT gives developers an open-source evaluation framework for turning plain-language artificial intelligence (AI) behavior rules into executable tests, scored results and trace records. The new tool, whose full name is Adaptive Spec-driven Scoring for Evaluation and Regression Testing, targets agents and applications that already have written policies, product requirements or launch criteria.
The launch puts evaluation closer to release engineering. Microsoft is asking teams to treat their own rules as test inputs, then rerun those tests as prompts, models, tools and retrieval sources change.
Plain English Moves Into the Test Suite
The Microsoft ASSERT launch post says the framework begins with written intent, then turns it into scenarios, datasets, metrics and scorecards. That intent can be a product requirement, a policy document, a system prompt, a launch checklist or a review note.
That is a familiar place for AI teams to get stuck. A product manager may know that a support agent can approve small refunds, escalate fraud flags and reject out-of-policy requests. A security lead may know that a document agent can summarize confidential material for executives and refuse to send it outside the company. The hard work starts when those rules need to become repeatable tests.
Microsoft names helpfulness, relevance, groundedness, toxicity and faithfulness as useful signals that still miss product-specific boundaries. ASSERT is built for plain-language specs that describe those boundaries. The framework turns them into an editable taxonomy, generates benign and adversarial cases, runs those cases against the target system and records the path the agent took.
That last part is aimed at agent failures that hide between the prompt and the final answer. Tool calls, retrieved context, routing behavior and intermediate actions can be captured, so a developer can inspect the step where the agent left the policy path.
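As a rough illustration of that kind of inspection, the sketch below walks a recorded trace and returns the first tool call outside an allowed set. The trace shape, the field names (kind, name) and the tool names are all hypothetical. They stand in for whatever a real trace contains and are not ASSERT's actual schema.

```python
from typing import Optional

# Hypothetical trace: one dict per recorded step. The field names are
# assumptions for illustration, not ASSERT's actual trace schema.
trace = [
    {"kind": "llm", "name": "plan"},
    {"kind": "tool", "name": "lookup_order"},
    {"kind": "tool", "name": "send_external_email"},
    {"kind": "llm", "name": "final_answer"},
]

# Hypothetical policy: the only tools this agent may call.
ALLOWED_TOOLS = {"lookup_order", "issue_refund"}


def first_off_policy_step(steps: list[dict], allowed: set[str]) -> Optional[int]:
    """Return the index of the first tool call not in `allowed`, or None."""
    for index, step in enumerate(steps):
        if step["kind"] == "tool" and step["name"] not in allowed:
            return index
    return None


print(first_off_policy_step(trace, ALLOWED_TOOLS))  # prints 2
```

The final answer in this invented trace could look harmless. The violation only shows up at step 2, which is the kind of failure a trace-level check can catch and an answer-only check cannot.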
What the Pipeline Produces
Microsoft describes ASSERT as a staged pipeline. The stages matter because the output is meant to be reviewed by people who own the policy, the product and the system.
| Stage | Main Input | Output Developers Inspect |
|---|---|---|
| Systematize | A broad behavior such as tool-use governance or unsafe health guidance | A concept specification with patterns, definitions and edge cases |
| Taxonomize | The concept specification and policy stance | An editable taxonomy of permitted and prohibited behavior |
| Generate Test Set | Declared dimensions such as persona, task type, tool access or request class | Single-turn prompts and multi-turn scenarios |
| Run Inference | The model, agent or application workflow under test | Outputs plus traces that show tool calls and intermediate state |
| Score | The trace and the policy taxonomy | Labels, rationales, policy citations and failure patterns |
The run directory produces local artifacts, including taxonomy.json, test_set.jsonl, inference_set.jsonl, scores.jsonl and metrics.json. JavaScript Object Notation (JSON) and JSON Lines files are plain enough to move into review, continuous integration jobs or a release archive.
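Because JSON Lines is one JSON object per line, a review script needs only the standard library to read it. The sketch below is a minimal reader run against a stand-in file. The file name scores.jsonl comes from the artifact list, but the record fields (case_id, label) are invented for illustration and may not match what ASSERT writes.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path: Path) -> list[dict]:
    """Parse a JSON Lines file: one JSON object per non-empty line."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Stand-in for a real run directory. The record fields below are invented
# for this example and are not ASSERT's documented schema.
run_dir = Path(tempfile.mkdtemp())
(run_dir / "scores.jsonl").write_text(
    '{"case_id": "c1", "label": "pass"}\n{"case_id": "c2", "label": "fail"}\n',
    encoding="utf-8",
)

scores = read_jsonl(run_dir / "scores.jsonl")
failed = [row["case_id"] for row in scores if row["label"] == "fail"]
print(failed)  # prints ['c2']
```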
The design pushes the policy statement into the scoring record. A failed case should point back to the taxonomy behavior or developer-provided decision that produced the verdict, rather than leaving a team with a red cell and no explanation.
Build Release Wraps Testing Around Controls
In the Microsoft Foundry Build announcement, ASSERT arrived beside Agent Control Specification (ACS, an open standard for runtime controls). Microsoft says ACS defines five validation checkpoints in an agent lifecycle: input, large language model (LLM), state, tool execution and output.
The pairing makes the release broader than a test generator. ASSERT finds policy failures before or after deployment. ACS gives teams a way to place controls at the workflow points where those failures happen. Microsoft also tied the release to Foundry evaluators, tracing and production monitoring, which aims the pitch at teams already moving agents from demos into business systems.
- Microsoft cites an audience of 6 million to 13 million generative AI developers for framework-agnostic agent tooling.
- ASSERT is pitched for LangChain, CrewAI, LiteLLM, OpenAI and other stacks.
- ACS uses policy YAML, the plain-text configuration format many engineering teams already keep in version control.
The same Build cycle also produced Microsoft’s in-house AI model push, a separate move that gives the company more control over the systems its developer tools will test and monitor.
Validation Data Shows Where Specs Do the Work
Microsoft’s validation claims are internal vendor data, but the figures are specific. In a coverage study across social scoring, sycophancy, task adherence, tool-use governance and unsafe health guidance, ASSERT was compared with an in-house baseline that started from the same written intent.
The company says ASSERT covered about 1.2 times as much of the intended behavior space, surfaced about 1.5 times as many cases worth inspection and produced more than four times the separation between stronger and weaker systems. It also had about half as many saturated cases, meaning cases where every model behaved the same way. Microsoft treated the roughly twofold increase in distinct failure patterns as directional because failure-type labeling is harder to stabilize.
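Saturation is simple to check on any set of per-model verdicts. The sketch below flags cases where every system received the same label, using invented data. The dictionary layout and labels are assumptions, and this is one plausible reading of "saturated", not Microsoft's published method.

```python
# Hypothetical per-case verdicts for three systems; all values are invented.
verdicts = {
    "case-1": {"model_a": "pass", "model_b": "pass", "model_c": "pass"},
    "case-2": {"model_a": "pass", "model_b": "fail", "model_c": "pass"},
    "case-3": {"model_a": "fail", "model_b": "fail", "model_c": "fail"},
    "case-4": {"model_a": "fail", "model_b": "pass", "model_c": "pass"},
}


def saturated_cases(results: dict[str, dict[str, str]]) -> list[str]:
    """Return the cases where every system received the same label."""
    return [
        case
        for case, by_model in results.items()
        if len(set(by_model.values())) == 1
    ]


saturated = saturated_cases(verdicts)
print(saturated)                       # prints ['case-1', 'case-3']
print(len(saturated) / len(verdicts))  # prints 0.5
```

A saturated case cannot separate a stronger system from a weaker one, which is why a lower saturation rate makes a test set more useful for comparison.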
A second validation pass looked at LLM judges against human review. Across more than 10 behavior concepts, Microsoft says judge agreement with human annotators was typically in the 80% to 90% range, while human inter-annotator agreement was around 90%. Subject-matter experts also reviewed 15 generated datasets for policy alignment, behavioral relevance and quality.
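For readers who want to run the same kind of check on their own labels, the sketch below computes raw percent agreement between a judge and a human annotator. It also computes Cohen's kappa, a standard chance-corrected statistic that is added here for context. The article's source does not say which agreement measure Microsoft used, and the labels are invented.

```python
from collections import Counter


def percent_agreement(a: list[str], b: list[str]) -> float:
    """Fraction of items on which the two label lists match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Agreement corrected for chance; undefined if chance agreement is 1."""
    n = len(a)
    observed = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label at random.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented labels for ten cases.
judge = ["violation", "violation", "ok", "ok", "ok",
         "violation", "ok", "ok", "ok", "ok"]
human = ["violation", "violation", "ok", "ok", "violation",
         "violation", "ok", "ok", "ok", "ok"]

print(percent_agreement(judge, human))        # prints 0.9
print(round(cohens_kappa(judge, human), 3))   # prints 0.783
```

The gap between the two numbers is the point: a 90% raw match can correspond to a noticeably lower chance-corrected score when one label dominates.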
Those numbers leave the usual problem in place. Judge quality depends on the model used as judge, and small policy distinctions can move when the judge model changes.
The GitHub Package Keeps the Evidence Local
The ASSERT public GitHub repository describes the project as local-first, framework-agnostic and trace-aware. The repository lists an MIT open-source license and support for Python 3.11, 3.12 and 3.13.
The package can test hosted models, callable wrappers and OpenTelemetry-traced agents. Its README says LiteLLM support reaches more than 100 model endpoints, including Azure, OpenAI, Vertex AI, Anthropic, Bedrock, Cohere and Hugging Face. It also describes OpenInference integrations for agent systems such as LangGraph, CrewAI, OpenAI Agents SDK, DSPy, LlamaIndex and AutoGen.
Microsoft says the project collects no telemetry by default. Runs write local artifacts under artifacts/results/, while optional trace capture depends on the user’s configuration and collector. That detail matters for companies testing agents against internal policies, private prompts and restricted data paths.
The same artifact habit is showing up in security and verification work outside Microsoft. Oton Technology recently covered Apple’s post-quantum code release on GitHub, where formal proof material traveled with source code so implementers could inspect the claim instead of taking it on trust.
Where the Tool Will Miss Production Failure
The method behind ASSERT comes from systematization work. The AI-assisted systematization paper by Dhruv Agarwal, Emily Sheng and co-authors argues that many generative AI evaluation targets are broad, contested concepts. The paper gives examples such as reasoning, fairness and creativity, then says underspecified concepts make measurement and interpretation unclear.
Microsoft carries that caveat into the product notes. ASSERT works best when the behavior definition is narrow and the relevant constraints are clearly specified. Vague policies produce vague scenarios. Synthetic interactions can miss failures that only appear in production. Model-based judges can be unreliable when a policy distinction is subtle or domain-specific.
Practical boundary: ASSERT can speed up evaluation when a team specifies a narrow behavior along with the target context, tools, constraints and scoring dimensions. A broad instruction like ‘be safe’ still leaves too much room for the system to invent test categories.
The framework also stops short of compliance certification. Microsoft says specification-driven evaluation should sit with human review, telemetry and domain expertise. A regulated health, finance or workplace system still needs the review path its field already requires.
How a Team Can Try the Travel Planner Example
The ASSERT travel-planner walkthrough shows the tool in a concrete agent setting. The sample target is a multi-agent LangGraph travel planner with tools for flight search, hotel search, weather checks, travel advisories and budget validation.
- Write a behavior specification describing quality failures and safety failures.
- Add application context, including the target system, tools and evaluation dimensions.
- Generate single-turn prompts and multi-turn scenarios from the taxonomy.
- Run the cases against the model, retrieval-augmented generation application, prompt chain, multi-agent workflow or application programming interface (API).
- Inspect transcripts, traces, verdicts and aggregate metrics in the viewer.
The project page shows a sample configuration that generated 12 behavior categories, created 480 test scenarios, ran all 480 against the travel planner and scored two metrics: policy_violation and overrefusal. Those names are deliberately operational. One score catches policy breaks. The other catches refusals of legitimate requests.
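The two scores pull in opposite directions, so it helps to read them side by side. The sketch below computes both rates from a handful of invented records. The metric names come from the walkthrough, while the record layout and the boolean encoding are assumptions.

```python
# Hypothetical scored records. The two metric names come from the
# walkthrough; the layout and boolean encoding are assumptions.
scored = [
    {"id": "s1", "policy_violation": False, "overrefusal": False},
    {"id": "s2", "policy_violation": True, "overrefusal": False},
    {"id": "s3", "policy_violation": False, "overrefusal": True},
    {"id": "s4", "policy_violation": False, "overrefusal": False},
]


def rate(records: list[dict], metric: str) -> float:
    """Share of records flagged for `metric`."""
    return sum(1 for record in records if record[metric]) / len(records)


summary = {name: rate(scored, name) for name in ("policy_violation", "overrefusal")}
print(summary)  # prints {'policy_violation': 0.25, 'overrefusal': 0.25}
```

An agent tuned only to drive the first rate down can push the second one up by refusing more often, so a release decision usually needs both.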
For release teams, the usable output is the record that can be rerun after a model swap, prompt edit, tool change or retrieval update. The tool is available under the MIT license, with the public GitHub repository listing ASSERT v0.1.0 as its initial release.
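One way to use that rerunnable record is a regression gate between two runs. The sketch below compares a baseline and a candidate set of rates and reports any that got worse by more than a tolerance. The metric values, the tolerance and the "lower is better" convention are assumptions for illustration and do not reflect how ASSERT structures metrics.json.

```python
def regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,
) -> dict[str, tuple[float, float]]:
    """Return metrics (lower is better) that rose by more than `tolerance`."""
    return {
        name: (baseline[name], candidate[name])
        for name in baseline
        if candidate[name] - baseline[name] > tolerance
    }


# Invented rates from two hypothetical runs, before and after a model swap.
baseline = {"policy_violation": 0.02, "overrefusal": 0.05}
candidate = {"policy_violation": 0.06, "overrefusal": 0.05}

worse = regressions(baseline, candidate)
print(worse)  # prints {'policy_violation': (0.02, 0.06)}
```

In a continuous integration job, a non-empty result could fail the build, which turns the scorecard into a release check instead of a report.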
