Microsoft ASSERT Moves AI Behavior Tests Into Release Workflows

Microsoft ASSERT converts plain-language AI behavior rules into executable tests, trace records and scorecards for teams shipping agents in production.

Microsoft ASSERT gives developers an open-source evaluation framework for turning plain-language artificial intelligence (AI) behavior rules into executable tests, scored results and trace records. The new tool, whose full name is Adaptive Spec-driven Scoring for Evaluation and Regression Testing, targets agents and applications that already have written policies, product requirements or launch criteria.

The launch puts evaluation closer to release engineering. Microsoft is asking teams to treat their own rules as test inputs, then rerun those tests as prompts, models, tools and retrieval sources change.

Plain English Moves Into the Test Suite

The Microsoft ASSERT launch post says the framework begins with written intent, then turns it into scenarios, datasets, metrics and scorecards. That intent can be a product requirement, a policy document, a system prompt, a launch checklist or a review note.

Turning that intent into tests is a familiar place for AI teams to get stuck. A product manager may know that a support agent can approve small refunds, escalate fraud flags and reject out-of-policy requests. A security lead may know that a document agent can summarize confidential material for executives and must refuse to send it outside the company. The hard work starts when those rules need to become repeatable tests.

Microsoft names helpfulness, relevance, groundedness, toxicity and faithfulness as useful signals that still miss product-specific boundaries. ASSERT is built for plain-language specs that describe those boundaries. The framework turns them into an editable taxonomy, generates benign and adversarial cases, runs those cases against the target system and records the path the agent took.

That last part is aimed at agent failures that hide between the prompt and the final answer. Tool calls, retrieved context, routing behavior and intermediate actions can be captured, so a developer can inspect the step where the agent left the policy path.

What the Pipeline Produces

Microsoft describes ASSERT as a staged pipeline. The stages matter because the output is meant to be reviewed by people who own the policy, the product and the system.

Stage | Main Input | Output Developers Inspect
Systematize | A broad behavior such as tool-use governance or unsafe health guidance | A concept specification with patterns, definitions and edge cases
Taxonomize | The concept specification and policy stance | An editable taxonomy of permitted and prohibited behavior
Generate Test Set | Declared dimensions such as persona, task type, tool access or request class | Single-turn prompts and multi-turn scenarios
Run Inference | The model, agent or application workflow under test | Outputs plus traces that show tool calls and intermediate state
Score | The trace and the policy taxonomy | Labels, rationales, policy citations and failure patterns
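
The Generate Test Set stage starts from declared dimensions. As a rough illustration of how declared dimensions can be crossed into a grid of cases, here is a minimal Python sketch. The dimension values and the build_case_grid helper are assumptions made for this example and are not taken from ASSERT's API or configuration format.

```python
from itertools import product

# Hypothetical dimension values, invented for illustration.
# ASSERT's real configuration format may look different.
DIMENSIONS = {
    "persona": ["new customer", "frustrated customer"],
    "task_type": ["refund request", "account change"],
    "tool_access": ["read_only", "can_issue_refund"],
}


def build_case_grid(dimensions):
    """Return one dict per combination of dimension values."""
    names = list(dimensions)
    value_lists = [dimensions[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


grid = build_case_grid(DIMENSIONS)
print(len(grid))  # 2 * 2 * 2 = 8 combinations
print(grid[0])
```

A full cross product grows quickly, so a real generator would likely sample or prune combinations rather than enumerate all of them.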

Each run writes local artifacts to its run directory, including taxonomy.json, test_set.jsonl, inference_set.jsonl, scores.jsonl and metrics.json. Because JavaScript Object Notation (JSON) and JSON Lines are plain-text formats, the files move easily into review, continuous integration jobs or a release archive.

The design pushes the policy statement into the scoring record. A failed case should point back to the taxonomy behavior or developer-provided decision that produced the verdict, rather than leaving a team with a red cell and no explanation.
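
To show what tracing a failed case back to a taxonomy behavior might look like in practice, the sketch below groups failing cases by behavior from JSON Lines text. The field names (case_id, verdict, behavior_id, rationale) are assumptions for illustration; the actual scores.jsonl schema may differ.

```python
import json
from collections import defaultdict

# Hypothetical records standing in for a scores.jsonl file.
# The real schema is not documented in this article.
SAMPLE_SCORES_JSONL = """\
{"case_id": "c1", "verdict": "fail", "behavior_id": "refund.over_limit", "rationale": "Approved a refund above the limit."}
{"case_id": "c2", "verdict": "pass", "behavior_id": "refund.small", "rationale": "Approved within policy."}
{"case_id": "c3", "verdict": "fail", "behavior_id": "refund.over_limit", "rationale": "Skipped escalation."}
"""


def failures_by_behavior(jsonl_text):
    """Map each taxonomy behavior to the case IDs that failed under it."""
    grouped = defaultdict(list)
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if record["verdict"] == "fail":
            grouped[record["behavior_id"]].append(record["case_id"])
    return dict(grouped)


failures = failures_by_behavior(SAMPLE_SCORES_JSONL)
for behavior_id, case_ids in failures.items():
    print(behavior_id, case_ids)
```

Grouping by behavior rather than by case gives a policy owner one list per rule to review.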

Build Release Wraps Testing Around Controls

In the Microsoft Foundry Build announcement, ASSERT arrived beside Agent Control Specification (ACS, an open standard for runtime controls). Microsoft says ACS defines five validation checkpoints in an agent lifecycle: input, large language model (LLM), state, tool execution and output.

The pairing makes the release broader than a test generator. ASSERT finds policy failures before or after deployment. ACS gives teams a way to place controls at the workflow points where those failures happen. Microsoft also tied the release to Foundry evaluators, tracing and production monitoring, making the pitch to teams already moving agents from demos into business systems.

  • Microsoft cites an audience of 6 million to 13 million generative AI developers for framework-agnostic agent tooling.
  • ASSERT is pitched for LangChain, CrewAI, LiteLLM, OpenAI and other stacks.
  • ACS uses policy YAML, the plain-text configuration format many engineering teams already keep in version control.
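
The five ACS checkpoints can be pictured as named hook points where controls run. The Python sketch below is only a mental model: the Checkpoint enum mirrors the five names Microsoft lists, while the control function, the registry and the allow/deny return values are invented for illustration and do not reflect the ACS specification or its YAML schema.

```python
from enum import Enum


class Checkpoint(Enum):
    """The five lifecycle checkpoints named in the announcement."""
    INPUT = "input"
    LLM = "llm"
    STATE = "state"
    TOOL_EXECUTION = "tool_execution"
    OUTPUT = "output"


def block_external_email(payload):
    """Hypothetical control: deny email tool calls that leave the company domain."""
    is_email = payload.get("tool") == "send_email"
    is_internal = payload.get("to", "").endswith("@example.com")
    return "deny" if is_email and not is_internal else "allow"


# Hypothetical registry mapping checkpoints to the controls that run there.
CONTROLS = {Checkpoint.TOOL_EXECUTION: [block_external_email]}


def check(checkpoint, payload):
    """Run every control registered at a checkpoint; any deny wins."""
    for control in CONTROLS.get(checkpoint, []):
        if control(payload) == "deny":
            return "deny"
    return "allow"


decision = check(
    Checkpoint.TOOL_EXECUTION,
    {"tool": "send_email", "to": "someone@outside.test"},
)
print(decision)
```

In this picture, an ASSERT failure traced to a tool call would suggest a control at the tool execution checkpoint rather than at the output.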

The same Build cycle also produced Microsoft’s in-house AI model push, a separate move that gives the company more control over the systems its developer tools will test and monitor.

Validation Data Shows Where Specs Do the Work

Microsoft’s validation claims are internal vendor data, but the figures are specific. In a coverage study across social scoring, sycophancy, task adherence, tool-use governance and unsafe health guidance, ASSERT was compared with an in-house baseline that started from the same written intent.

The company says ASSERT covered about 1.2 times as much of the intended behavior space, surfaced about 1.5 times as many cases worth inspection and produced more than four times the separation between stronger and weaker systems. It also had about half as many saturated cases, where every model behaved the same way. Microsoft treated the roughly twofold increase in distinct failure patterns as directional because failure-type labeling is harder to stabilize.
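
The saturation idea is easy to compute once verdicts exist for several models. The sketch below counts a case as saturated when every model received the same verdict, which follows the definition above. The data layout and the saturation_rate helper are assumptions for illustration, not Microsoft's published method.

```python
def saturation_rate(verdicts_by_case):
    """Fraction of cases where every model received the same verdict."""
    if not verdicts_by_case:
        return 0.0
    saturated = sum(
        1
        for model_verdicts in verdicts_by_case.values()
        if len(set(model_verdicts.values())) == 1
    )
    return saturated / len(verdicts_by_case)


# Hypothetical verdicts for two models on four cases.
verdicts = {
    "case-1": {"model_a": "pass", "model_b": "pass"},  # saturated
    "case-2": {"model_a": "pass", "model_b": "fail"},
    "case-3": {"model_a": "fail", "model_b": "fail"},  # saturated
    "case-4": {"model_a": "fail", "model_b": "pass"},
}

rate = saturation_rate(verdicts)
print(rate)  # 2 of 4 cases are saturated, so 0.5
```

A saturated case cannot separate one system from another, which is why a lower rate makes a test set more useful for comparing models.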

A second validation pass looked at LLM judges against human review. Across more than 10 behavior concepts, Microsoft says judge agreement with human annotators was typically in the 80% to 90% range, while human inter-annotator agreement was around 90%. Subject-matter experts also reviewed 15 generated datasets for policy alignment, behavioral relevance and quality.
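
The article does not say which agreement statistic Microsoft used. As a simple reference point, the sketch below computes raw percent agreement between two label lists, which lets a team compare judge-to-human agreement against a human-to-human baseline. The labels are made-up sample data.

```python
def percent_agreement(labels_a, labels_b):
    """Share of items on which two annotators gave the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    if not labels_a:
        raise ValueError("label lists must not be empty")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical labels for ten cases from two humans and one LLM judge.
human_1 = ["violation", "ok", "ok", "violation", "ok", "ok", "ok", "violation", "ok", "ok"]
human_2 = ["violation", "ok", "ok", "violation", "ok", "ok", "ok", "ok", "ok", "ok"]
judge = ["violation", "ok", "violation", "violation", "ok", "ok", "ok", "ok", "ok", "ok"]

human_human = percent_agreement(human_1, human_2)
judge_human = percent_agreement(judge, human_1)
print(human_human, judge_human)  # 0.9 and 0.8 for this sample
```

Raw agreement does not correct for chance, so a team with heavily imbalanced labels may prefer a chance-corrected statistic such as Cohen's kappa.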

Those numbers leave the usual problem in place. Judge quality depends on the model used as judge, and small policy distinctions can move when the judge model changes.

The GitHub Package Keeps the Evidence Local

The ASSERT public GitHub repository describes the project as local-first, framework-agnostic and trace-aware. The repository lists an MIT open-source license and support for Python 3.11, 3.12 and 3.13.

The package can test hosted models, callable wrappers and OpenTelemetry-traced agents. Its README says LiteLLM support reaches more than 100 model endpoints, including Azure, OpenAI, Vertex AI, Anthropic, Bedrock, Cohere and Hugging Face. It also describes OpenInference integrations for agent systems such as LangGraph, CrewAI, OpenAI Agents SDK, DSPy, LlamaIndex and AutoGen.

Microsoft says the project collects no telemetry by default. Runs write local artifacts under artifacts/results/, while optional trace capture depends on the user’s configuration and collector. That detail matters for companies testing agents against internal policies, private prompts and restricted data paths.

The same artifact habit is showing up in security and verification work outside Microsoft. Oton Technology recently covered Apple’s post-quantum code release on GitHub, where formal proof material traveled with source code so implementers could inspect the claim instead of taking it on trust.

Where the Tool Will Miss Production Failure

The method behind ASSERT comes from systematization work. The AI-assisted systematization paper by Dhruv Agarwal, Emily Sheng and co-authors argues that many generative AI evaluation targets are broad, contested concepts. The paper gives examples such as reasoning, fairness and creativity, then says underspecified concepts make measurement and interpretation unclear.

Microsoft carries that caveat into the product notes. ASSERT works best when the behavior definition is narrow and the relevant constraints are clearly specified. Vague policies produce vague scenarios. Synthetic interactions can miss failures that only appear in production. Model-based judges can be unreliable when a policy distinction is subtle or domain-specific.

Practical boundary: ASSERT can speed up evaluation when a team specifies a narrow behavior along with the target context, tools, constraints and scoring dimensions. A broad instruction like ‘be safe’ still leaves too much room for the system to invent test categories.

The framework also stops short of compliance certification. Microsoft says specification-driven evaluation should sit with human review, telemetry and domain expertise. A regulated health, finance or workplace system still needs the review path its field already requires.

How a Team Can Try the Travel Planner Example

The ASSERT travel-planner walkthrough shows the tool in a concrete agent setting. The sample target is a multi-agent LangGraph travel planner with tools for flight search, hotel search, weather checks, travel advisories and budget validation.

  1. Write a behavior specification describing quality failures and safety failures.
  2. Add application context, including the target system, tools and evaluation dimensions.
  3. Generate single-turn prompts and multi-turn scenarios from the taxonomy.
  4. Run the cases against the model, retrieval-augmented generation application, prompt chain, multi-agent workflow or application programming interface (API).
  5. Inspect transcripts, traces, verdicts and aggregate metrics in the viewer.

The project page shows a sample configuration that generated 12 behavior categories, created 480 test scenarios, ran 480 scenarios against a travel planner and scored policy_violation plus overrefusal. Those names are deliberately operational. One score catches policy breaks. The other catches refusals of legitimate requests.
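
The two scores pull in opposite directions, so it helps to read them side by side. The sketch below computes both rates from per-case labels. The boolean fields per case and the metric_rates helper are assumptions for illustration; only the metric names policy_violation and overrefusal come from the project page.

```python
def metric_rates(scored_cases, metric_names=("policy_violation", "overrefusal")):
    """Return the fraction of cases flagged for each metric."""
    total = len(scored_cases)
    if total == 0:
        raise ValueError("no scored cases")
    return {
        name: sum(bool(case[name]) for case in scored_cases) / total
        for name in metric_names
    }


# Hypothetical scored cases; real score records may carry more fields.
cases = [
    {"case_id": "t1", "policy_violation": False, "overrefusal": False},
    {"case_id": "t2", "policy_violation": True, "overrefusal": False},
    {"case_id": "t3", "policy_violation": False, "overrefusal": True},
    {"case_id": "t4", "policy_violation": False, "overrefusal": False},
]

rates = metric_rates(cases)
print(rates)  # one of four cases is flagged for each metric
```

Tracking both rates together guards against a fix that lowers policy violations only by making the agent refuse more legitimate requests.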

For release teams, the usable output is the record that can be rerun after a model swap, prompt edit, tool change or retrieval update. The tool is available under the MIT license, with the public GitHub repository listing ASSERT v0.1.0 as its initial release.
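
A rerun only helps a release decision if someone compares it with the previous run. The sketch below flags metrics that got worse by more than a tolerance between a baseline and a candidate run. The metric values, the tolerance and the find_regressions helper are assumptions for illustration, and the sketch assumes lower values are better for every metric.

```python
def find_regressions(baseline, candidate, tolerance=0.01):
    """Return metrics whose value rose by more than the tolerance.

    Assumes lower is better for every metric, as with failure rates.
    """
    regressions = {}
    for name, base_value in baseline.items():
        new_value = candidate.get(name)
        if new_value is None:
            continue  # metric missing from the candidate run
        if new_value - base_value > tolerance:
            regressions[name] = (base_value, new_value)
    return regressions


# Hypothetical aggregate metrics from two runs, before and after a model swap.
baseline = {"policy_violation": 0.04, "overrefusal": 0.10}
candidate = {"policy_violation": 0.09, "overrefusal": 0.10}

regressions = find_regressions(baseline, candidate)
for name, (before, after) in regressions.items():
    print(f"{name}: {before} -> {after}")
```

In a continuous integration job, a non-empty result could fail the build until the change is reviewed.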

Logan Pierce is a writer and web publisher with over seven years of experience covering consumer technology. He has published work on independent tech blogs and under freelance bylines, covering Android devices, privacy-focused software and budget gadgets. Logan founded Oton Technology to publish clear, no-nonsense tech news and reviews based on real hands-on testing. He has personally tested and reviewed dozens of mid-range and budget Android phones, written extensively about app privacy, and built and managed multiple WordPress publications over the past decade. Logan holds a bachelor's degree in English and studied digital marketing at a certificate level.
