
Why Most Lab AI Pilots Fail to Reach Production

Most lab AI pilots never reach production. Two red flags that kill them, a working pipeline from Yahara Software and UW-Madison, and a five-step roadmap.

Published June 11, 2026

Logan Pierce

Most lab AI pilots never reach production, even when the model performs well in the demo. What fails is the data architecture that feeds the model and the cross-functional scaffolding that has to run it at scale, according to Smagala’s framework for legacy lab data.

Smagala’s framework treats the algorithm as the easy part. The harder work is standardizing the legacy LIMS and spreadsheet ecosystem underneath, getting IT, DevOps, and cybersecurity aligned before a line of pilot code is written, and convincing veteran scientists to act as the final editors of whatever the model produces. Get those wrong, and the most accurate neural network in the world still ships as a slide deck. The veterans, the IT teams, and the compliance officers decide whether the model ships, and most pilot programs never loop them in early enough.

Most Lab AI Pilots Never Reach Production

The narrative around laboratory AI is one of relentless acceleration. Boardroom pressure to automate discovery has pushed AI pilots into nearly every research-intensive organization. A single champion scientist, a clean curated dataset, and an API key are usually enough to generate results that draw applause in a steering committee meeting.

What happens next rarely makes it back to the steering committee. The vast majority of lab AI pilots fail to scale, according to Smagala. The pattern shows up in adjacent sectors too: the AI pilot problem in India’s GCC sector reflects the same gap between demo and production. Pilots consistently run aground in three areas: the data infrastructure underneath the pilot, the cross-functional teams required to run it, and the validation discipline needed to defend its outputs at production volume.

Two of those failure modes are predictable enough that Smagala says he can identify a doomed pilot on sight. Both live in the operating environment around the model. They are the two red flags that determine whether a pilot dies in the demo or earns production status.

Figure: Scaling lab AI pilots from spreadsheets to production systems

The Two Red Flags That Signal a Pilot Is Doomed

Smagala’s first red flag is the operational and infrastructure blind spot. A successful pilot is, by nature, small-scale. It often runs on personal access keys to an API or a hand-picked, static dataset. The pieces of infrastructure required for full-scale production, including authentication, logging, data lineage, and cross-team support, are usually absent, because the pilot was never built to test them.

Those infrastructure layers are not optional at production volume. They cannot be retrofitted cheaply once a model is in active use. The fix is to engage cross-functional stakeholders, including IT, DevOps, cybersecurity, and data engineers, before writing the first line of pilot code.
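To make two of those missing layers concrete, here is a minimal Python sketch of logging and data lineage wrapped around a model call. It is an illustration under stated assumptions, not part of Smagala’s framework: the names `run_with_lineage`, `toy_model`, and the record fields are hypothetical, and a production system would plug into the organization’s own authentication and log store.

```python
import hashlib
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("lab_ai.lineage")


def fingerprint(payload: bytes) -> str:
    """Return a SHA-256 hex digest identifying the exact bytes."""
    return hashlib.sha256(payload).hexdigest()


def run_with_lineage(model, model_version: str, user: str, raw: bytes) -> dict:
    """Run `model` on `raw` and return the result plus a lineage record."""
    result = model(raw)
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,  # hypothetical: would come from real authentication
        "model_version": model_version,
        "input_sha256": fingerprint(raw),
        "output_sha256": fingerprint(json.dumps(result, sort_keys=True).encode()),
    }
    # One structured log line per call, so an auditor can trace any output.
    log.info("lineage %s", json.dumps(record, sort_keys=True))
    return {"result": result, "lineage": record}


def toy_model(raw: bytes) -> dict:
    """Hypothetical stand-in for a real model: counts lines in a CSV export."""
    return {"rows": len(raw.decode().splitlines())}


out = run_with_lineage(toy_model, "0.1-pilot", "analyst01", b"a,b\n1,2\n3,4\n")
```

A pilot that runs on a personal API key never exercises any of this, which is why the gap tends to surface only at production volume.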

His second red flag is the happy path validation trap. Pilots establish value on one or two clean use cases, with noisy inputs, corrupted files, and edge cases never entering the picture.

Smagala warns that a pilot is doomed if it lacks a thorough plan for how its data will behave once the system scales up. The fix is to treat the pilot like a full scientific experiment, with explicit validation protocols, quality control checkpoints, and edge-case testing plans that run during the pilot. Verification has to be a continuous process that builds on every iteration. A pilot that demonstrates value on two clean use cases has shown nothing about how it will behave in production.
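One way to picture edge-case testing during the pilot is a small table of deliberately broken inputs run alongside the clean one. The sketch below assumes a hypothetical pilot step, `parse_assay_csv`, that reads a `signal` column from a CSV export. The column name and the test cases are invented for illustration.

```python
import csv
import io


def parse_assay_csv(text: str) -> list[float]:
    """Hypothetical pilot step: read a 'signal' column as floats."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames is None or "signal" not in reader.fieldnames:
        raise ValueError("missing 'signal' column")
    values = []
    for line_no, row in enumerate(reader, start=2):
        try:
            values.append(float(row["signal"]))
        except (TypeError, ValueError):
            raise ValueError(f"bad signal value on line {line_no}") from None
    if not values:
        raise ValueError("no data rows")
    return values


# Each case pairs an input with whether the parser should accept it.
CASES = {
    "happy_path": ("signal\n1.0\n2.5\n", True),
    "empty_file": ("", False),
    "wrong_header": ("intensity\n1.0\n", False),
    "corrupted_value": ("signal\n1.0\n#REF!\n", False),
    "header_only": ("signal\n", False),
}


def run_cases() -> dict[str, bool]:
    """Return, per case, whether the parser behaved as expected."""
    report = {}
    for name, (text, should_pass) in CASES.items():
        try:
            parse_assay_csv(text)
            passed = True
        except ValueError:
            passed = False
        report[name] = passed == should_pass
    return report


report = run_cases()
```

Only one of the five cases is the happy path. The other four are the inputs a two-use-case demo never sees.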

The Blueprint Must Come Before the Algorithm

Before any AI tool touches legacy data, laboratories have to define what finished data looks like. The temptation is to throw an LLM or a foundation model at a folder of unstructured LIMS exports and wait for magic. Smagala is blunt about what happens next. The model will only reflect whatever structure the inputs already have, and unstructured inputs produce unstructured outputs.

The work is unglamorous. Laboratories have to define standard taxonomies, data schemas, and data guidelines before any model touches the data. Without that conceptual map, training an AI to organize the data is impossible, and the model has nothing to organize against.
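A blueprint of this kind can be as small as a typed record plus a few controlled vocabularies. The following Python sketch is illustrative only: the field names, assay types, and units are hypothetical placeholders, and a real lab would define its own.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical controlled vocabularies; a real lab would define its own.
ASSAY_TYPES = {"ELISA", "qPCR", "FISH"}
UNITS = {"ng/mL", "Ct", "count"}


@dataclass(frozen=True)
class AssayRecord:
    """What a 'finished' record looks like under this example blueprint."""
    sample_id: str
    assay_type: str
    value: float
    unit: str
    run_date: date


def validate(record: AssayRecord) -> list[str]:
    """Return a list of blueprint violations; empty means the record conforms."""
    problems = []
    if not record.sample_id.strip():
        problems.append("sample_id is empty")
    if record.assay_type not in ASSAY_TYPES:
        problems.append(f"unknown assay_type: {record.assay_type!r}")
    if record.unit not in UNITS:
        problems.append(f"unknown unit: {record.unit!r}")
    if record.run_date > date.today():
        problems.append("run_date is in the future")
    return problems


good = AssayRecord("S-001", "qPCR", 24.3, "Ct", date(2024, 5, 1))
bad = AssayRecord("S-002", "pcr", 24.3, "cycles", date(2024, 5, 1))
```

Here `validate(good)` returns an empty list, while `validate(bad)` flags both the unknown assay type and the unknown unit. The point is that "clean" now has a definition a model, or a person, can be checked against.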

The data quality bar also depends on what the AI will actually do.

Approach	Data Quality Bar	Rationale
Machine learning (predictive analytics, automated assay design)	Exceptionally high	ML algorithms are highly sensitive to bias and noise, and need representative and unbiased datasets to build secure clinical algorithms
Generative AI (LLM or RAG queries of SOPs, historical reports)	Less rigid	GenAI can parse unstructured text more dynamically, but a foundational framework is still required to extract reliable value

‘As soon as you’ve got some basic foundation, you absolutely can use sort of the AI cleaning crew idea as a mechanism to start to quickly iterate through your data, get it into a more cleaned up state, and archive it into a little longer term repository for how you want to organize it.’

Smagala calls this approach the AI cleaning crew. The blueprint is the rulebook; the AI is the workforce that fills it in.
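The rulebook-and-workforce split can be sketched in a few lines of Python. In this illustration, fuzzy string matching from the standard library stands in for whatever model a lab actually uses, and the canonical field names are hypothetical. The shape is what matters: the blueprint supplies the targets, the tool proposes mappings, and anything it cannot place goes to a person.

```python
from difflib import get_close_matches

# Hypothetical canonical field names taken from the lab's blueprint.
CANONICAL_FIELDS = ["sample_id", "assay_type", "concentration", "run_date"]


def normalize(label: str) -> str:
    """Lowercase a legacy header and replace separators with underscores."""
    return label.strip().lower().replace(" ", "_").replace("-", "_")


def map_headers(legacy_headers, cutoff=0.8):
    """Map legacy headers to canonical fields; unmatched ones go to review."""
    mapped, needs_review = {}, []
    for header in legacy_headers:
        match = get_close_matches(
            normalize(header), CANONICAL_FIELDS, n=1, cutoff=cutoff
        )
        if match:
            mapped[header] = match[0]
        else:
            needs_review.append(header)
    return mapped, needs_review


# Typical spreadsheet drift: odd casing, separators, a typo, a junk column.
mapped, needs_review = map_headers(
    ["Sample ID", "Assay-Type", "Concentraton", "Notes2"]
)
```

With these inputs the first three headers map to `sample_id`, `assay_type`, and `concentration`, and `Notes2` lands in `needs_review`. Without `CANONICAL_FIELDS` there would be nothing to map to, which is the argument for doing the blueprint first.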

How Veteran Scientists Decide If AI Is Production-Ready

The hardest layer to build is trust, and the people who withhold it longest are usually the most expensive to lose. Veteran scientists are often the loudest skeptics of an AI rollout, because a neural network looks like a black box doing blind matrix multiplications with no understanding of biological or chemical context. Smagala agrees with the skeptics, at least in part.

‘A completely untuned model or a completely untuned AI may not really be respecting the biological nuances, and I think that a lot of the skeptical ‘I-want-it-to-work-a-certain-way’ crowd, has a point in being concerned about that.’

Smagala, in the same interview, framed the issue as one of trust the AI rollout has to earn. The path forward is to give the skeptics a formal seat at the table through a human-in-the-loop approach: hardcoded biological rules and thermodynamic limits in the preprocessing pipeline, a rigorous review loop where the veteran scientist acts as the primary editor, and workflow systems where AI handles the tedious work while the expert retains final approval. The EU AI Act codifies a similar posture for high-stakes applications, with mandated human oversight, transparent data logging, and rigorous validation across regulated industries.
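A rough Python sketch of that human-in-the-loop shape follows. Everything specific in it is an assumption for illustration: the limits in `LIMITS`, the status names, and the reviewer identifier are hypothetical, and the rules a real lab hardcodes would come from its own veteran scientists.

```python
from dataclasses import dataclass, field

# Hypothetical hard limits; a real lab's experts would define these.
LIMITS = {
    "temperature_k": (0.0, 1000.0),  # nothing below absolute zero
    "ph": (0.0, 14.0),
    "copy_number": (0.0, 50.0),
}


@dataclass
class Suggestion:
    """A value proposed by the model, awaiting rule checks and human review."""
    field_name: str
    value: float
    status: str = "pending"  # pending, rejected_by_rule, approved, rejected
    notes: list = field(default_factory=list)


def apply_rules(s: Suggestion) -> Suggestion:
    """Reject model output that violates a hard limit before a human sees it."""
    low, high = LIMITS[s.field_name]
    if not (low <= s.value <= high):
        s.status = "rejected_by_rule"
        s.notes.append(f"{s.value} outside [{low}, {high}]")
    return s


def review(s: Suggestion, reviewer: str, approve: bool) -> Suggestion:
    """Only a named human reviewer can move a pending suggestion to approved."""
    if s.status != "pending":
        raise ValueError(f"cannot review a suggestion with status {s.status!r}")
    s.status = "approved" if approve else "rejected"
    s.notes.append(f"reviewed by {reviewer}")
    return s


impossible = apply_rules(Suggestion("temperature_k", -5.0))
plausible = review(
    apply_rules(Suggestion("ph", 7.4)), reviewer="dr_example", approve=True
)
```

Nothing in this flow reaches `approved` without passing through `review`, so the expert keeps final approval while the rule check filters out physically impossible values first.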

The same logic drives a key choice in any production pipeline: where the AI stops and the deterministic math takes over. Smagala’s team at Yahara Software used that division of labor to build a working pipeline with human oncology researchers at the University of Wisconsin-Madison. The case shows what an AI-plus-deterministic-math pipeline looks like when it works in production.

A Hybrid Pipeline That Earned Production Status

In the collaboration between Yahara Software and human oncology researchers at the University of Wisconsin-Madison, the original task was evaluating fluorescence in situ hybridization (FISH) microscopy. Graduate students spent hours manually tracing cell boundaries and counting chromosomes, a process highly vulnerable to fatigue and subjective bias.

The team built a two-tiered hybrid pipeline. Tier 1 used MicroSAM (µSAM), a microscopy-specialized adaptation of Meta’s open-source Segment Anything Model, to identify cell boundaries, relying on the model’s strong general-purpose segmentation ability. Tier 2 stepped away from machine learning entirely, using a classic bright-spot detection algorithm to count distinct fluorescent signals within those boundaries. The MicroSAM preprint on microscopy segmentation describes the underlying model, and the micro-sam source code and napari plugins are open source.

The result is auditable in a way a pure ML pipeline usually is not. A developer or scientist can inspect the exact mathematical parameters of the spot-detection algorithm and reason about its output, with no model retraining required when results are challenged in a compliance review.
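To show why a deterministic second tier is easy to audit, here is a minimal pure-Python sketch of one classic approach to bright-spot counting: an intensity threshold followed by connected-component labeling. The article does not say which algorithm the team used, so treat this as a stand-in. The function name, the toy image, and the hand-written `cell_mask` (which plays the role of a Tier 1 segmentation output) are all hypothetical.

```python
def count_spots(image, mask, threshold, min_pixels=1):
    """Count 4-connected groups of pixels above `threshold` inside `mask`."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    spots = 0
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or not mask[r][c] or image[r][c] <= threshold:
                continue
            # Flood-fill one bright region and measure its size.
            stack, size = [(r, c)], 0
            seen[r][c] = True
            while stack:
                y, x = stack.pop()
                size += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and mask[ny][nx]
                            and image[ny][nx] > threshold):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            if size >= min_pixels:
                spots += 1
    return spots


# Toy 5x6 intensity grid: a 3-pixel spot, a 1-pixel spot, and one bright
# pixel in the last column that falls outside the cell.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 9, 9, 0, 0, 8],
    [0, 9, 0, 0, 0, 0],
    [0, 0, 0, 7, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
# Hypothetical cell mask: columns 0-4 are inside the cell, column 5 is not.
cell_mask = [[1, 1, 1, 1, 1, 0] for _ in range(5)]

spots_in_cell = count_spots(image, cell_mask, threshold=5)
```

The whole behavior is governed by two numbers, `threshold` and `min_pixels`. When a count is challenged, a reviewer can read those values and trace the result by hand, which is the auditability argument in miniature.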

‘It is returning what you expect at a higher correctness rate and a higher consistency rate than most humans can do it,’ Smagala says of validated pipelines. ‘And now you are ready to use that model in a production context. Until that point, your skeptic is right to be skeptical.’

The Regulatory Floor Labs Build On Top Of

Three regulatory frameworks set the floor that any lab AI integration has to clear. FDA’s 21 CFR Part 11 governs electronic records and signatures and requires data trustworthiness, security, and complete auditability. ISO/IEC 17025 governs the competence and quality standards of testing and calibration laboratories worldwide. The FDA’s AI/ML Software as a Medical Device (SaMD) Action Plan emphasizes representative and unbiased datasets for clinical algorithms. On top of those, the EU AI Act adds a risk-based compliance layer for high-stakes applications that mandates human oversight and transparent data logging.

None of these frameworks is a barrier to lab AI integration. They are the reason a scaled pipeline can be defended in an audit. Skipping them is what turns a validated pilot into an unreviewable black box, which is the exact failure mode veteran scientists already distrust.

A practical way to read the regulatory floor: compliance is the price of admission to a production environment. The deliverable is a pipeline that performs on real laboratory data, at production volume, under audit.

A Five-Step Roadmap to Production

Smagala’s framework collapses into a five-step roadmap that any lab AI program can execute. Each step has a clear focus area and a concrete deliverable. The order matters, because a pilot that skips step one or step four usually shows up in his list of doomed projects within a quarter.

Step	Focus Area	Actionable Deliverable
1	Data Architecture Blueprint	Establish a standardized taxonomy for laboratory data before purchasing AI tools
2	Targeted Data Cleanup	Use basic AI algorithms to clean, categorize, and archive legacy spreadsheets based on the established blueprint
3	Human-in-the-Loop Validation	Embed veteran scientists as key reviewers, building guidelines and constraints into the AI’s workflows
4	Cross-Functional Alignment	Pre-coordinate with IT, data engineering, and compliance departments during the pilot stage
5	Edge-Case Stress Testing	Run explicit experiments designed to stress-test the AI model and ensure resilience in a high-volume production environment

The roadmap also reads as a list of failure modes. Skipping step one produces a model trained on inconsistent inputs. Skipping step three produces an AI rollout the people closest to the science will not defend in front of a regulator. Skipping step four produces a pilot the operations team cannot run at production volume. Skipping step five produces a model that passes the steering committee review and fails the first week of production use.

Frequently Asked Questions

Why do most lab AI pilots fail to reach production?

Lab AI pilots usually perform well in a controlled demo with a single champion scientist and a clean dataset. They break when the production environment exposes gaps in data infrastructure, cross-functional support, and validation discipline. The model is rarely what kills them. The two red flags that signal a doomed pilot are the operational and infrastructure blind spot and the happy path validation trap, both of which live in the operating environment, not the algorithm.

What is the AI cleaning crew approach to legacy lab data?

The AI cleaning crew is Smagala’s term for using basic AI models to iteratively clean, categorize, and archive legacy LIMS and spreadsheet data into a more durable repository. The approach only works after a lab has defined a blueprint for what finished data should look like, including standard taxonomies, data schemas, and data guidelines. Without that conceptual map, the cleaning crew has no rulebook to follow.

What is human-in-the-loop validation in a laboratory AI rollout?

Human-in-the-loop validation places the veteran scientist at the center of the AI workflow. The model handles the tedious analytical work, and the expert retains final approval. The implementation includes hardcoded biological rules and thermodynamic limits in the preprocessing pipeline, a rigorous review loop where the scientist acts as the primary editor, and workflow systems that let the AI accelerate the repetitive elements of data analysis. The EU AI Act’s risk-based compliance tiers codify a similar posture for high-stakes applications, with mandated human oversight and transparent data logging.

How does the Yahara Software FISH pipeline use MicroSAM?

The pipeline, built by Yahara Software with human oncology researchers at the University of Wisconsin-Madison, evaluates fluorescence in situ hybridization (FISH) microscopy. Tier 1 uses MicroSAM (µSAM), a microscopy-specialized adaptation of Meta’s open-source Segment Anything Model, to identify cell boundaries without requiring the lab to hand-label a large proprietary training dataset. Tier 2 steps away from machine learning entirely, using a classical bright-spot detection algorithm to count fluorescent signals within those boundaries. The result is auditable in a way a pure ML pipeline usually is not, since a developer can inspect the exact mathematical parameters of the spot-detection algorithm in a compliance review.

What is the five-step roadmap for scaling lab AI?

The roadmap is: build a data architecture blueprint with a standardized taxonomy before purchasing any AI tool, use basic AI algorithms to clean and archive legacy data against that blueprint, embed veteran scientists as key reviewers in a human-in-the-loop workflow, pre-coordinate with IT, data engineering, and compliance teams during the pilot stage, and run explicit edge-case stress tests designed to prove production readiness. Each step pre-empts a failure mode Smagala has watched kill pilots, so skipping any one of them reintroduces that risk.