
Building AI-ready Cultures in Life Sciences R&D

Life sciences are over‑funding AI and under‑funding the data maturity required to scale it. This imbalance is turning a potential R&D accelerant into a high‑priced bottleneck that fails to deliver compounding returns across the portfolio.

R&D workflows in these organizations depend on fragmented data sources spanning molecular databases, clinical trial registries, internal protocols, and published literature. This fragmentation makes it difficult to reuse prior knowledge or evaluate results consistently across programs.

Persistent challenges around reproducibility continue to slow progress in biomedical research. The National Academies of Sciences, Engineering, and Medicine has documented how limited transparency, inconsistent validation practices, and weak reporting standards undermine confidence in scientific findings and slow scientific progress. Strengthening verification, data reuse, and workflow consistency remains a priority across the research lifecycle.

Clinical trial reporting patterns reinforce this challenge. A peer-reviewed observational study by researchers from Harvard, MIT, and Boston Children’s Hospital, published in Annals of Internal Medicine, found that publication rates for registered drug trials within 24 months varied widely by sponsor type, ranging from roughly one-third to just over half of completed studies. These delays reduce opportunities for organizations to learn systematically from prior trials and apply insights across development programs.

At the same time, the adoption of generative AI is accelerating rapidly across the workforce.

A nationally representative survey study conducted by economists from the Federal Reserve Bank of St. Louis, Harvard Kennedy School, and Vanderbilt University, using the Real-Time Population Survey to measure generative AI adoption across the U.S. workforce, found that as of August 2024, 39% of U.S. adults aged 18-64 had used generative AI. Among employed respondents, just over a quarter reported using generative AI at work, while about one-third reported using it outside of work.

As organizations attempt to move generative AI beyond experimentation, attention shifts from model capability to execution discipline.

Emerj Editorial Director Matthew DeMello recently hosted a conversation with Xiong Liu, Director of Data Science and AI at Novartis, to examine why generative AI remains difficult to scale inside life sciences R&D.

Across the episode, the central question was how data architecture, domain‑aware evaluation, and cross‑functional scientific alignment determine whether generative AI can deliver reliable, repeatable value across discovery and development workflows.

This article examines the operational disciplines that determine whether generative AI can scale beyond isolated pilots and deliver reliable value across the R&D lifecycle:

  • Foundation models as reusable scientific priors: Using domain-scale pretraining to extract value from limited indication data, then fine-tuning models to improve relevance and accuracy for specific R&D tasks.
  • Benchmarking to manage hallucinations and model selection: Establishing domain-aware evaluation metrics so generated outputs can be scored, compared, and validated before entering scientific workflows.
  • Cross-functional alignment as a scaling requirement: Aligning AI practitioners, domain scientists, and leadership around shared validation standards, data constraints, and deployment goals.

Listen to the full episode below:

Guest: Xiong Liu, Director of Data Science and AI, Novartis

Expertise: Foundation models, clinical trial natural language processing, molecular discovery, AI evaluation, and benchmarking

Brief recognition: Dr. Xiong Liu is Director of Data Science and AI at Novartis, where he leads AI initiatives across drug discovery and clinical development. Before Novartis, he spent seven years at Eli Lilly building enterprise‑level NLP and advanced analytics capabilities for R&D. Earlier in his career, he served as Principal Investigator on multiple multi‑million‑dollar, SBIR‑funded AI programs for U.S. federal agencies. He holds a Ph.D. from the University of Pittsburgh and completed postdoctoral training at the Johns Hopkins University School of Medicine.

Foundation Models As Reusable Scientific Priors

Liu frames foundation models as a shift in how life sciences teams approach data scarcity and reuse. Earlier machine learning workflows typically relied on labeled datasets built for narrowly defined tasks, often limited to a single therapeutic area. These approaches were constrained not only by computational limits but also by the pace and cost of experimental data generation.

He notes that foundation models change this dynamic by learning broad statistical structure from large collections of domain-relevant data. These sources include public molecular datasets, gene expression resources, and clinical trial documentation. The resulting models encode generalizable background information that can be reused across programs.

The approach Liu describes here does not eliminate the need for indication-specific data. Instead, it changes the role that data plays. Smaller, targeted datasets are used to fine-tune models so they align more closely with the biological questions and constraints of a specific program.

He highlights disease pathway analysis as a representative example. Traditionally, teams collected data specific to an indication, such as lung cancer, and trained models limited to that context. With a domain-trained foundation model, teams can begin from representations that already reflect gene interactions across multiple cell types.

Limited indication data can then be used to adapt the model toward the disease area of interest, allowing teams to extract useful signals even when program-specific datasets are relatively small.

The operational takeaway is a two-stage decision framework:

  • Adopt or develop foundation models trained on the widest defensible set of biomedical and chemical data.
  • Fine-tune those models using internal or indication-specific data to improve task relevance and predictive accuracy.

Liu notes for life sciences leaders that fine-tuning does not require retraining a model from scratch. Instead, existing model weights are adjusted using available data so outputs better reflect the organization’s scientific context. This approach also enables iterative improvement as new data becomes available, rather than requiring perfect datasets upfront.
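The two-stage pattern above can be illustrated with a deliberately minimal sketch. The linear model, feature vectors, and learning-rate values here are invented stand-ins, not Novartis methodology; the point is only that fine-tuning nudges existing weights with a small dataset rather than training from scratch:

```python
# Illustrative sketch only: the "model" is a toy linear scorer, and the
# pretrained weights and indication data below are invented placeholders.

def predict(weights, features):
    """Linear score: a stand-in for a far larger pretrained foundation model."""
    return sum(w * x for w, x in zip(weights, features))

def fine_tune(weights, data, lr=0.01, epochs=50):
    """Adjust pretrained weights toward a small labeled dataset (SGD on squared error)."""
    w = list(weights)
    for _ in range(epochs):
        for features, target in data:
            error = predict(w, features) - target
            w = [wi - lr * error * xi for wi, xi in zip(w, features)]
    return w

# "Pretrained" weights stand in for broad priors learned from domain-scale data.
pretrained = [0.5, -0.2, 0.1]

# Small indication-specific dataset (features and labels are hypothetical).
indication_data = [
    ([1.0, 0.0, 1.0], 1.0),
    ([0.0, 1.0, 0.0], 0.0),
    ([1.0, 1.0, 1.0], 1.0),
]

tuned = fine_tune(pretrained, indication_data)
```

The design choice mirrors Liu's point: the starting weights carry the reusable prior, and only the small adaptation step consumes scarce program-specific data, so the loop can rerun as new experimental data arrives.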

Benchmarking To Manage Hallucinations And Model Selection

While foundation models expand what teams can attempt, Liu cautions that hallucinations remain a persistent challenge, especially in biomedical applications where outputs are not immediately verifiable. A generated molecule, gene interaction, or pathway hypothesis cannot be validated as quickly as generated software code.

Confirmation often requires comparison with existing biological knowledge or follow-on experimentation. For this reason, Liu argues that trust must be operationalized through benchmarking rather than assumed.

He emphasizes two complementary practices:

First, organizations need knowledge-checking benchmarks grounded in established domain facts. These benchmarks provide a repeatable way to test whether a model produces outputs consistent with known biology. The goal is not to eliminate hallucinations entirely, but to understand when and how a model fails for a given task.

Second, benchmarking enables deliberate model selection. With many models and versions available, teams often default to whichever option is easiest to access. Liu recommends domain-aware evaluation metrics that allow teams to score models against specific tasks and select the most appropriate option:

“We have to define those knowledge-checking benchmarks and also have objective metrics to measure against those models. Hallucination is always there, so it is important to benchmark and select models accordingly.”

– Xiong Liu, Director of Data Science and AI at Novartis

From a governance perspective, his assertion implies that every proposed generative AI workflow should include an evaluation plan before deployment. If outputs cannot be measured against agreed benchmarks, the workflow is not ready to scale.

Cross-functional Alignment As A Scaling Requirement

Liu’s final insight focuses on organizational alignment. Scaling generative AI in life sciences requires sustained coordination between AI practitioners, domain scientists, and leadership teams. Each group operates with different priorities and time horizons, and misalignment often prevents pilot projects from becoming durable capabilities.

AI development teams may move quickly in building models and pipelines, but model capability alone does not guarantee adoption. Domain scientists must understand how outputs are generated and validated. Liu emphasizes that leaders must understand the data, resourcing, and risk implications of deployment because without shared checkpoints and communication structures, progress stalls.

He recommends designing operating models that explicitly connect these groups. Practical steps can include shared evaluation reviews, clear communication of model limitations, and alignment on data readiness constraints.

Liu also notes that model ambition can exceed data availability. In life sciences, experimental data generation may lag behind modeling goals. In these cases, sequencing matters. Organizations must invest in data quality and generation alongside model development.

An AI-ready enterprise culture, as Liu describes it, is not defined by enthusiasm for new tools. It is defined by the ability to coordinate expertise, enforce validation discipline, and integrate AI into scientific workflows in ways that scientists trust and leaders can govern.
