Explainable AI models are essential in pharmaceutical R&D because they provide transparency and understanding of how AI-driven predictions are made. In drug discovery and development, stakeholders, including researchers, regulatory bodies, and healthcare professionals, need to trust and understand AI models’ outputs to make informed decisions. Without explainability, AI models can be seen as “black boxes,” leading to skepticism and reluctance to adopt these technologies in critical decision-making processes.
This lack of transparency can hinder the approval of new drugs and slow down innovation, posing significant business problems by increasing the time and cost associated with bringing new treatments to market. As noted by many published sources, from Nationwide Children’s Hospital to Eli Lilly, it takes an average of ten years and hundreds of millions of dollars to get a new medication approved by the FDA.
Knowledge graphs can address these challenges by enhancing the explainability of AI models in pharmaceutical R&D. They provide a structured representation of complex biomedical data, linking entities such as genes, proteins, and diseases with transparent, interpretable relationships.
Emerj Senior Editor Matthew DeMello recently spoke with Dr. Xiong Liu from Novartis on the ‘AI in Business’ podcast to discuss the integration of AI, advanced data management techniques, and knowledge graphs in pharmaceutical R&D. They discussed challenges and strategies related to the technology stack, data quality, algorithm performance, and infrastructure requirements.
In the following analysis of their conversation, we examine two key insights:
- Driving model explainability by addressing bias: Enhancing model accuracy and user trust by ensuring diverse data representation to avoid bias and by identifying the specific features that contribute to each prediction, ultimately making models more explainable.
- Utilizing knowledge graphs for enhanced predictions: Capturing and querying relationships between entities like genes and diseases with knowledge graphs to improve prediction capabilities.
Listen to the full episode below:
Guest: Dr. Xiong Liu, Director of Data Science and AI at Novartis
Expertise: Technology innovation, partnerships, and strategies.
Brief Recognition: Dr. Xiong Liu has ten years of experience in pharma R&D (diabetes, neuroscience, immunology, and oncology) and 20 years of experience in data mining and machine learning. He has led data and AI programs to accelerate drug development, from target discovery to clinical trials and post-marketing research.
Driving Model Explainability by Addressing Bias
Xiong begins the podcast by discussing the integration of AI in pharmaceutical R&D, highlighting the following points:
- Technology Stack and Business Applications: The technology stack in pharmaceutical R&D includes data, algorithms, and platforms. Each layer presents unique challenges that must be addressed to support drug development stages, from early discovery to clinical trials and real-world evidence.
- Data Challenges: Data comes from various platforms, leading to inconsistencies (batch effects). Harmonizing these datasets and preprocessing them for machine learning is a significant challenge.
- Algorithm Performance: AI algorithms don’t always perform perfectly. Extensive training, testing, and evaluation are required to determine the best algorithms for specific use cases, ensuring practical, real-life applications.
- Infrastructure Requirements: Robust infrastructure is needed for data storage, GPU computing, model training, testing, evaluation, and deployment. Existing platforms in AI and health tech can help, but it’s essential to leverage their benefits while addressing specific challenges.
He discusses the complexities and challenges related to data in pharmaceutical R&D, focusing on five key issues:
- Data Quality and Preprocessing: Data quality can be inconsistent due to the various platforms and technologies used to generate it. An example is single-cell RNA sequencing, where different platforms create batch effects, necessitating quality control and preprocessing to remove artifacts before applying machine learning algorithms (see the sketch following this list).
- Data Sharing: Data sharing faces challenges but can be facilitated by principles like FAIR (Findable, Accessible, Interoperable, and Reusable). Implementing these principles on cloud computing platforms can enhance data sharing across organizations.
- Ethics in Data Handling: Ethical considerations include ensuring patient data is de-identified and anonymized before modeling. Standard algorithms are available to achieve this, and it’s crucial to adhere to ethical standards.
- Bias in Modeling: Models may exhibit bias if the data represents only some patient populations, leading to inaccurate predictions for underrepresented groups. Ensuring diverse data representation is essential to characterize features accurately across populations.
- Algorithm Accuracy and Explainability: Deep learning, particularly with transformer models, has improved precision and recall in predictions. However, these models often lack explainability, which is crucial for user trust and understanding. For example, in predicting patient outcomes like hospital readmission, it’s essential to identify the specific features (keywords, patterns) that contribute to the prediction.
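To make the batch-effect point above more concrete, below is a minimal sketch, not drawn from Novartis’s pipeline, of correcting a platform-driven shift by standardizing each gene within its batch. Production single-cell workflows would rely on dedicated methods (e.g., ComBat or Harmony) rather than this simple centering, and the data here is simulated purely for illustration:

```python
import numpy as np
import pandas as pd

# Toy expression matrix: rows are cells, columns are genes, plus a batch label.
# (Values are simulated; real single-cell matrices are far larger and sparser.)
rng = np.random.default_rng(0)
genes = [f"gene_{i}" for i in range(5)]
expr = pd.DataFrame(rng.lognormal(size=(200, 5)), columns=genes)
expr["batch"] = np.repeat(["platform_A", "platform_B"], 100)

# Simulate a batch effect: platform_B measurements are systematically inflated.
expr.loc[expr["batch"] == "platform_B", genes] *= 1.8

# Naive correction: center and scale each gene within its batch so downstream
# models learn the biology rather than the measurement platform.
corrected = expr.groupby("batch")[genes].transform(lambda g: (g - g.mean()) / g.std())

# After correction, per-batch means are ~0 and standard deviations ~1.
print(corrected.groupby(expr["batch"]).agg(["mean", "std"]).round(2))
```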
“So, for example, if we think about explainability when we make patient outcome predictions using the clinical notes in the EHR systems, we can say, ‘Okay, now the algorithms can accurately predict the outcomes.’ For example, hospital readmission. Now, what are the specifics? What are the features of the data? So then, we can find that specific keyword patterns are associated with the high occurrence of hospital readmission. For example, if there are serious symptoms, and if they do not have those preventive surgeries, then there’s a likelihood of readmission. So those kinds of explainability issues we have to address as well.”
– Dr. Xiong Liu, Director of Data Science and AI at Novartis
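Liu’s readmission example can be illustrated with an intentionally transparent model. The sketch below is hypothetical — the notes, labels, and features are invented and do not reflect any Novartis system — but it shows the kind of feature-level evidence he describes: fit a bag-of-words classifier on clinical-note snippets and rank the keywords whose weights push a prediction toward readmission.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical clinical-note snippets with a readmission label (1 = readmitted).
notes = [
    "severe shortness of breath, declined preventive surgery",
    "routine follow-up, symptoms resolved after preventive surgery",
    "serious chest pain, no preventive procedure performed",
    "mild symptoms, preventive surgery completed, stable",
]
readmitted = np.array([1, 0, 1, 0])

# A transparent bag-of-words model: every prediction can be traced back to
# specific keywords and keyword patterns in the notes.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(notes)
model = LogisticRegression().fit(X, readmitted)

# Rank terms by learned weight: positive weights push toward readmission.
terms = vectorizer.get_feature_names_out()
weights = model.coef_[0]
top = sorted(zip(terms, weights), key=lambda tw: tw[1], reverse=True)[:5]
for term, weight in top:
    print(f"{term:30s} {weight:+.3f}")
```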
The FAIR data principles are a set of guidelines aimed at making data easier to share and reuse in data science. FAIR stands for Findable, Accessible, Interoperable, and Reusable:
- Findable: Data needs sufficient metadata, a unique identifier, and must be indexed in a searchable resource.
- Accessible: Metadata and data should be machine-readable and stored in a trusted repository.
- Interoperable: Data should have a standard structure, with metadata using recognized terminologies.
- Reusable: Data must have clear usage licenses, provenance, and meet relevant community standards.
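As a rough illustration of what FAIR-aligned metadata might look like in practice, the record below uses invented field names and values; a real implementation would follow the metadata schema of the chosen repository and community standards.

```python
# Hypothetical metadata record for a shared dataset, illustrating fields that
# map onto the FAIR principles (all names and values are invented).
dataset_metadata = {
    # Findable: persistent identifier plus rich, searchable metadata.
    "identifier": "doi:10.xxxx/example-dataset",
    "title": "Single-cell RNA-seq of tumor biopsies (example)",
    "keywords": ["single-cell", "RNA-seq", "oncology"],
    # Accessible: where and how the data can be retrieved.
    "repository": "https://example-repository.org/datasets/1234",
    "access_protocol": "HTTPS",
    # Interoperable: standard formats and recognized vocabularies.
    "format": "AnnData (.h5ad)",
    "ontologies": ["NCBI Gene", "MONDO Disease Ontology"],
    # Reusable: license, provenance, and relevant community standards.
    "license": "CC-BY-4.0",
    "provenance": "Processed with pipeline vX.Y; batch-corrected per site",
    "standards": ["MINSEQE"],
}
```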
Utilizing Knowledge Graphs for Enhanced Predictions
Xiong explains how knowledge graphs are beneficial for organizing and utilizing data in pharmaceutical R&D. Firstly, knowledge graphs effectively capture relationships between data entities, such as gene regulatory networks, protein-protein interactions, and gene-disease-drug relationships. They enable users to query entities and relationships of interest quickly, supported by associated databases and query technologies.
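As a minimal sketch of that idea, the snippet below builds a tiny gene-disease-drug graph in memory and answers a simple relationship query. The entities and relations are illustrative examples, not drawn from any specific Novartis knowledge graph, and a production system would use a graph database rather than an in-memory structure.

```python
import networkx as nx

# Build a tiny knowledge graph with typed edges (illustrative relations only).
kg = nx.MultiDiGraph()
kg.add_edge("EGFR", "lung cancer", relation="associated_with")
kg.add_edge("ERBB2", "breast cancer", relation="associated_with")
kg.add_edge("erlotinib", "EGFR", relation="targets")
kg.add_edge("trastuzumab", "ERBB2", relation="targets")

# Query: which drugs target genes associated with a given disease?
def drugs_for_disease(graph, disease):
    genes = [g for g, d, attrs in graph.edges(data=True)
             if d == disease and attrs["relation"] == "associated_with"]
    return [drug for drug, gene, attrs in graph.edges(data=True)
            if gene in genes and attrs["relation"] == "targets"]

print(drugs_for_disease(kg, "breast cancer"))  # ['trastuzumab']
```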
Additionally, knowledge graphs enhance prediction capabilities through representation learning. In this process, different entities (e.g., genes, diseases, cells, patients) are represented in a hidden space. The method, similar to dimensionality reduction, captures the underlying information and creates embeddings: new data structures that significantly improve prediction capabilities.
These applications have practical benefits. For example, in gene function prediction, knowledge graphs can increase accuracy by learning embeddings from diverse data. Similarly, learning patient representations from knowledge graphs built from EHR data can enhance the prediction of patient outcomes, outlooks, or prognoses.
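To illustrate the representation-learning idea on the same kind of toy data, the sketch below embeds graph nodes by factorizing the adjacency matrix, a simple dimensionality-reduction stand-in for the richer knowledge graph embedding methods used in practice.

```python
import numpy as np
import networkx as nx
from sklearn.decomposition import TruncatedSVD

# Toy graph: nodes are genes, diseases, and drugs (illustrative edges only).
kg = nx.Graph()
kg.add_edges_from([
    ("EGFR", "lung cancer"), ("erlotinib", "EGFR"),
    ("ERBB2", "breast cancer"), ("trastuzumab", "ERBB2"),
    ("BRCA1", "breast cancer"),
])

# Embed nodes by factorizing the adjacency matrix; nodes with similar
# neighborhoods receive similar embedding vectors, which downstream models
# (e.g., gene function or patient outcome predictors) can consume as features.
nodes = list(kg.nodes())
adjacency = nx.to_numpy_array(kg, nodelist=nodes)
embeddings = TruncatedSVD(n_components=2, random_state=0).fit_transform(adjacency)

for node, vec in zip(nodes, embeddings):
    print(f"{node:15s} {np.round(vec, 2)}")
```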
Xiong discusses the collaboration between tech companies and pharmaceutical companies, emphasizing the role of leading cloud computing infrastructures like Azure and AWS. He notes that while these infrastructures are widely used, the challenge lies in translating them into decision-making and predictive power within the pharma and healthcare sectors. The process is influenced by business considerations, such as goals and costs, which shape the development of initial platforms, data models, and use cases.
Xiong highlights the importance of agility, which involves understanding customer and stakeholder requirements. Understanding these factors helps to determine the scope of data and the selection of algorithms. While many open-source algorithms are available, choosing suitable initial use cases is challenging. Once use cases are selected, it becomes clearer how to collect data, set up models, and determine the necessary platforms for running these models.
In early exploration within life sciences, it is common to start with smaller, manageable projects. For instance, high-performance computing (HPC) might be used initially, and depending on the needs, GPUs may or may not be required. By achieving initial results and communicating them to stakeholders, support and interest can be garnered, paving the way for scaling up computing power, use cases, and applications across different disease areas.
Xiong discusses the technical aspects of storing knowledge graphs, explaining various methods and tools. He begins by mentioning simple storage methods, such as using text or tabular formats. These methods involve capturing entities and their attributes in tables and mapping the relationships between entities. While the approach can be helpful for computational purposes, it is not ideal for efficient storage and retrieval.
To address these limitations, Xiong highlights advanced tools like Neo4j and MongoDB. Neo4j is an open-source tool that stores data in a graph format, allowing for complex queries and retrieval of subgraphs. MongoDB, a NoSQL database, addresses scalability and can be used in combination with other technologies to enhance data management.
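A minimal sketch of the graph-storage approach might look like the following, assuming a hypothetical local Neo4j instance with placeholder credentials. The Cypher statements store gene-disease relationships as typed nodes and edges and then retrieve a small subgraph.

```python
from neo4j import GraphDatabase  # official Neo4j Python driver

# Connection details are placeholders for a hypothetical local instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

gene_disease_pairs = [("EGFR", "lung cancer"), ("BRCA1", "breast cancer")]

with driver.session() as session:
    # Store each relationship as nodes and a typed edge; MERGE avoids duplicates.
    for gene, disease in gene_disease_pairs:
        session.run(
            "MERGE (g:Gene {symbol: $gene}) "
            "MERGE (d:Disease {name: $disease}) "
            "MERGE (g)-[:ASSOCIATED_WITH]->(d)",
            gene=gene, disease=disease,
        )

    # Retrieve a subgraph: every gene linked to a given disease.
    result = session.run(
        "MATCH (g:Gene)-[:ASSOCIATED_WITH]->(d:Disease {name: $disease}) "
        "RETURN g.symbol AS gene",
        disease="breast cancer",
    )
    print([record["gene"] for record in result])

driver.close()
```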
Xiong also notes that many vendors are working on customizing and combining different technologies to better manage large-scale healthcare data. These efforts aim to leverage the strengths of various tools to create more efficient and practical solutions for handling knowledge graphs.