Artificial Intelligence at AstraZeneca

Matthew DeMello

Matthew is Senior Editor at Emerj, focused on enterprise AI use-cases and trends. He previously served as podcast producer with CrossBorder Solutions, a venture-backed AI-enabled tax solutions firm. Before that, Matthew served three years at the World Policy Institute as a news editor and podcast producer.


AstraZeneca is a global biopharmaceutical company that researches, develops, manufactures, and markets prescription drugs and vaccines. Its key therapeutic areas include oncology, cardiovascular, renal, metabolism, respiratory, and immunology. In 2022, the company reported revenue of $42.67 billion and a profit of $4.08 billion. The company has a significant global presence, employing around 89,900 people across more than 60 countries as of 2023.

The company has invested more than USD 250 million in AI research and in developing an antibody for cancer. It also claims to have data and AI embedded across its research and development.

In this article, we will examine two use cases showing how AI initiatives currently support AstraZeneca’s business goals:

  • Overcoming data integration challenges: Leveraging NLP to process and analyze a vast library of scientific literature and data sources, thus facilitating the integration of disjointed data and helping build scalable and performant data pipelines.
  • Streamlining machine learning model deployment: Using fully managed machine learning services to build, train, and deploy machine learning models efficiently. 

Overcoming Data Integration Challenges with AI

On average, it takes 10 to 15 years to research a drug and complete all three phases of clinical trials. Even then, roughly 90 percent of drug candidates fail to meet their intended goals. With such a large investment and such a low success rate, scientists and researchers need to assess data sets and innovate quickly.

At AstraZeneca, researchers felt unable to make decisions despite having all the information at their fingertips. They faced several challenges:

  • Disjointed Data Sources: Scientists struggled with data scattered across various internal and external sources, which made it difficult to access and integrate the information needed for drug discovery and clinical trials.
  • Infrastructure Complexity: The need for a flexible yet low-maintenance infrastructure was critical. Existing systems required constant upkeep, which diverted resources from the core scientific tasks.
  • Scaling Data Science Efforts: Existing tools and workflows, particularly open-source Python notebooks, couldn’t scale effectively to support the extensive data science efforts required.

Additionally, ingesting, parsing, and analyzing millions of data points from hundreds of sources, including scientific literature and public databases, posed a significant technical challenge. The data pipelines and machine learning models also had to handle a vast and growing volume of data without losing performance.

As a solution to these challenges, AstraZeneca adopted Databricks to leverage its fully managed platform with the aim of simplifying cluster management and maintaining the analytic resources at scale.

According to use case documentation from Databricks, AstraZeneca used the Databricks platform to build scalable and performant data pipelines, utilizing NLP to process and analyze a vast library of scientific literature and data sources.
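To make the idea of NLP over scientific literature more concrete, here is a minimal sketch of one common step in such a pipeline: counting how often a gene and a disease are mentioned in the same sentence across a set of abstracts. It is an illustration only and not AstraZeneca's or Databricks' implementation. The vocabularies, the sample abstracts, and the function names `extract_pairs` and `count_comentions` are assumptions made for this example, and a production pipeline would rely on curated ontologies and distributed processing rather than plain Python.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical vocabularies; a real pipeline would use curated ontologies.
GENES = {"EGFR", "KRAS", "TP53"}
DISEASES = {"lung cancer", "asthma"}

# Hypothetical sample abstracts standing in for a literature corpus.
ABSTRACTS = [
    "EGFR mutations are common in lung cancer. KRAS was not assessed.",
    "We studied KRAS and EGFR in lung cancer patients.",
    "TP53 expression was measured in asthma cohorts.",
]


def extract_pairs(abstract):
    """Return the (gene, disease) pairs that share a sentence."""
    pairs = set()
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", abstract):
        # Gene symbols are matched as runs of capitals and digits.
        tokens = set(re.findall(r"[A-Z0-9]{3,}", sentence))
        genes = GENES & tokens
        lowered = sentence.lower()
        diseases = {d for d in DISEASES if d in lowered}
        pairs.update(product(genes, diseases))
    return pairs


def count_comentions(abstracts):
    """Count, per pair, the number of abstracts containing a co-mention."""
    counts = Counter()
    for text in abstracts:
        counts.update(extract_pairs(text))
    return counts


counts = count_comentions(ABSTRACTS)
for pair, n in counts.most_common():
    print(pair, n)
```

Co-mention counts like these are a crude signal, but they show how unstructured text can be turned into structured records that downstream pipelines and models can join with other data sources.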

Screenshot from Databricks Video (Source: Databricks)

Below is a five-minute demo video of the Databricks Data Intelligence Platform:

This platform also empowered data scientists to:

  • Build and train models that provide ranking predictions and enhance decision-making capabilities
  • Construct a knowledge graph that powers a recommendation system, allowing scientists to generate novel target hypotheses for various diseases using all available data.
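The knowledge graph idea above can be sketched in a few lines. The example below is a toy under stated assumptions and not the system AstraZeneca built: the triples, the node names, and the function `rank_candidate_targets` are hypothetical, and the score is a simple Jaccard overlap between a candidate gene's neighbours and the neighbours of genes already linked to the disease. Real recommendation systems over knowledge graphs typically use learned embeddings or ranking models instead.

```python
from collections import defaultdict

# Hypothetical (subject, relation, object) triples for illustration only.
TRIPLES = [
    ("GENE_A", "associated_with", "disease_x"),
    ("GENE_A", "participates_in", "pathway_1"),
    ("GENE_B", "participates_in", "pathway_1"),
    ("GENE_B", "participates_in", "pathway_2"),
    ("GENE_C", "participates_in", "pathway_2"),
    ("GENE_D", "participates_in", "pathway_3"),
]


def build_graph(triples):
    """Build an undirected adjacency map from the triples."""
    graph = defaultdict(set)
    for subj, _relation, obj in triples:
        graph[subj].add(obj)
        graph[obj].add(subj)
    return graph


def rank_candidate_targets(graph, genes, disease):
    """Rank genes not yet linked to the disease by neighbourhood overlap."""
    known = graph[disease] & genes
    # Context: everything the known genes touch, apart from the disease itself.
    context = set()
    for gene in known:
        context |= graph[gene] - {disease}
    scores = {}
    for gene in genes - known:
        neighbours = graph[gene]
        union = neighbours | context
        scores[gene] = len(neighbours & context) / len(union) if union else 0.0
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


graph = build_graph(TRIPLES)
genes = {subj for subj, _relation, _obj in TRIPLES}
ranking = rank_candidate_targets(graph, genes, "disease_x")
for gene, score in ranking:
    print(gene, round(score, 2))
```

In this toy graph, the gene that shares a pathway with an already associated gene rises to the top, which is the intuition behind using graph structure to suggest new target hypotheses.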

The case study claims that after adopting the Databricks platform, AstraZeneca was able to:

  • Improve operational efficiency
  • Increase data science team productivity
  • Achieve faster time to insight

While Databricks did not share quantifiable results for its platform at AstraZeneca, it reported the following numbers for some of its other customers:

  • 14 databases replaced by one Delta Lake
  • Six seconds to perform a complex analytics task that previously took six hours
  • 20% faster performance after unifying data through the Databricks Data Intelligence Platform

Streamlining Machine Learning Model Deployment

While managing vast amounts of data, companies often miss the opportunity to draw meaningful insights from it. AstraZeneca was in a similar position. Additionally, its machine learning development process was heavily manual, which demanded more effort from data scientists. Moreover, the prior system was not extensible, flexible, or scalable enough to meet the needs of AstraZeneca’s commercial data analysis.

The company also faced the following challenges:

  • Inefficient Development Process: The company lacked an efficient process for creating and deploying machine learning models into production, which slowed down data analysis and insight generation.
  • Slow Insight Generation: The previous machine learning solution was too slow. It took over a month to set up an environment for data scientists, which delayed insight delivery.
  • Lack of Cohesion Among ML Tools: The existing technology stack had no cohesive way to integrate various ML tools, making it challenging to create a seamless environment for data scientists.

AstraZeneca needed an efficient development process to create and deploy machine learning (ML) models into production, enabling rapid data analysis at scale and generating business insights. Such a process would enhance research and development, accelerate the commercialization of new therapeutics, and ultimately speed up the delivery of life-changing medicines to patients.

To solve these issues for the researchers and the data science team, AstraZeneca adopted Amazon SageMaker with the aim of streamlining the preparation, building, training, and deployment of machine learning models.

Amazon SageMaker helps companies build, train, and deploy ML models using tools like notebooks, debuggers, profilers, pipelines, and MLOps. It supports governance requirements with simplified access control and transparency over ML projects.

Additionally, companies can access pre-trained models via SageMaker.

Screenshot from Amazon SageMaker (Source: AWS)

Below is an eight-minute demo video of Amazon SageMaker:

Here is a six-step workflow for using Amazon SageMaker, based on the above video:

  • Data Preparation: Loading and preparing data for model training using SageMaker Data Wrangler or custom scripts.
  • Model Training: Using SageMaker’s built-in algorithms or custom code to train machine learning models on the prepared data.
  • Model Tracking: Logging all steps of the model training workflow, creating an auditable trail to reproduce models and troubleshoot issues.
  • Model Registry: Centrally tracking different versions of trained models, their metadata, and performance metrics to select the right model for deployment.
  • Model Deployment: Deploying the selected model for inference using SageMaker’s built-in hosting capabilities or custom deployment options.
  • Model Monitoring: Monitoring the deployed model’s performance, data drift, bias, and other metrics using SageMaker Model Monitor.

Because SageMaker provides all of the above tools in one platform, it makes it easier for data scientists to access information.
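The six steps listed above can be sketched end to end in plain Python. This is a conceptual walk-through only: it does not use the SageMaker SDK, and every name in it (`prepare`, `train`, `ModelRegistry`, `drifted`, the sample `ROWS`) is a hypothetical stand-in chosen for illustration. The model is a one-variable least-squares fit, and the drift check simply compares the mean of live inputs against the training data.

```python
import statistics

# Hypothetical raw data; one row is incomplete and is dropped in step 1.
ROWS = [
    {"x": 1, "y": 2}, {"x": 2, "y": 4}, {"x": 3, "y": 6},
    {"x": None, "y": 1}, {"x": 4, "y": 8},
]


def prepare(rows):
    """Step 1, data preparation: drop incomplete rows, split inputs and labels."""
    clean = [r for r in rows if r["x"] is not None and r["y"] is not None]
    return [r["x"] for r in clean], [r["y"] for r in clean]


def train(xs, ys):
    """Step 2, model training: least-squares fit of y = slope * x + intercept."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def predict(model, x):
    return model["slope"] * x + model["intercept"]


def mean_absolute_error(model, xs, ys):
    return statistics.mean([abs(predict(model, x) - y) for x, y in zip(xs, ys)])


class ModelRegistry:
    """Step 4, model registry: versioned models with their metrics."""

    def __init__(self):
        self.versions = []

    def register(self, model, metrics):
        version = len(self.versions) + 1
        self.versions.append({"version": version, "model": model, "metrics": metrics})
        return version

    def best(self):
        return min(self.versions, key=lambda v: v["metrics"]["mae"])


def drifted(train_xs, live_xs, tolerance=2.0):
    """Step 6, monitoring: flag drift when the live mean shifts by more than
    `tolerance` training standard deviations."""
    shift = abs(statistics.mean(live_xs) - statistics.mean(train_xs))
    return shift > tolerance * statistics.stdev(train_xs)


run_log = []  # Step 3, model tracking: an auditable trail of each step.
xs, ys = prepare(ROWS)
run_log.append({"step": "prepare", "rows_kept": len(xs)})
model = train(xs, ys)
metrics = {"mae": mean_absolute_error(model, xs, ys)}
run_log.append({"step": "train", "model": model, "metrics": metrics})

registry = ModelRegistry()
registry.register(model, metrics)
deployed = registry.best()["model"]


def endpoint(x):
    """Step 5, model deployment: serve predictions from the selected version."""
    return predict(deployed, x)


print(endpoint(5))
print(drifted(xs, [10, 11, 12]))
```

A managed platform replaces each of these hand-written pieces with a hosted service, which is where the reduction in manual effort described in this section comes from.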

AWS claims that AstraZeneca saw the following business results after adopting this technology stack:

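  • Increased Speed to Insights: The time to generate insights decreased from over six months to less than 2.5 months, reported as a 150% improvement. Measured in elapsed time, that is a reduction of roughly 58 percent, or about 2.4 times faster.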
  • Improved Efficiency: Automating the ML development process within Amazon SageMaker Studio reduced the manual workload, allowing data scientists to focus on valuable tasks.
  • Scalability and Repeatability: The solution’s infrastructure as code made it simple to repeat and share across internal and external partners, enhancing collaboration and scalability.
