Natural Language Processing in Pharma – Current Applications

Niccolo Mejia

Niccolo is a content writer and Junior Analyst at Emerj, developing both web content and helping with quantitative research. He holds a bachelor's degree in Writing, Literature, and Publishing from Emerson College.

Natural language processing (NLP) appears to see less use in pharma than AI approaches such as machine vision and predictive analytics, but the industry still has several applications for it. Pharma deals mostly with structured data, but in some business areas, unstructured data is the norm. In this article, we discuss how natural language processing can help pharmaceutical companies make sense of their unstructured data and use it to make decisions.

Pharmaceutical companies are likely to own typed, unstructured data in various digital formats that are useful in determining the eligibility of a patient to participate in a clinical trial. The most common formats are as follows:

  • Physician’s notes
  • Pathology reports
  • Operational notes
  • Electronic medical record (EMR) data

We detail the ways NLP could help research and development teams at pharmaceutical companies comb through clinical trial documents and electronic medical records, as well as speed up clinical trial turnaround time with better patient matching. We also show how mining unstructured data with NLP can assist pharmaceutical marketing teams in creating engaging campaigns. We explain each use case and explore it through the experiences of big pharmaceutical companies with AI vendors.

The possibilities for NLP in the pharmaceutical industry covered in this article are as follows:

  • Discovering New Drug Compounds
  • Matching Participants to Clinical Trials
  • Marketing Pharmaceuticals

Our overview of NLP software possibilities in the pharmaceutical industry begins with drug discovery:

NLP for Discovering New Drug Compounds

Drug discovery is a business area that many software vendors offer solutions for, but most of them claim to cover big data analytics or molecular imaging of drug compounds. NLP solutions, however, do not share many use cases with those other applications.

Instead, the technology is more suited to detect information within unstructured data that may facilitate the drug discovery process. This could include extracting information from previous research documents to find notes and results of past chemical experiments.

For example, a scientist at a pharmaceutical company may use an NLP tool to look up previously documented chemical reactions and learn that they do not need to conduct a given experiment because the result is already known. The scientist could then reevaluate further experiments with all relevant knowledge factored in.

Past company experiences with testing certain drugs or molecules are saved as lab note data or clinical trial data. These documents are typically written by a person using common language complete with pharmaceutical terminology. A developer would need to train a machine learning model on labeled versions of these documents so that it can “learn” to recognize the difference between separate fields on a single form.

We interviewed Amir Saffari, Senior Vice President of AI at BenevolentAI, about how AI will affect drug discovery in the near future. When asked about how his company finds so much information from large amounts of scientific literature, Saffari said,

In natural language processing, two streams of data can be utilized. One is structured data, where databases are created for specific use cases and curated by humans who scan literature, conduct experiments and populate these databases. You can bring all these together and extract similar kinds of information in a form that is solvable and digestible to humans. There are also machine learning algorithms on top of that data to model the entirety of those relationships of the networks that have been discovered in scanning the literature or data source. Those models can start generating hypotheses or making inferences that we can take to labs and test.

A single model could be trained on multiple types of forms as well as the type of information that comes from each field or form.

Adverse effects of drugs are usually reported in clinical trials and at regular physician visits. It follows that a machine learning model would need to be trained on clinical trial reports and EMR data in order to recognize information from them. Any adverse effects would likely be marked as such, with an indication as to which drug it is likely a side effect of.

Chemical and dosage information can be found with NLP software through a process called text mining. The database of research information is run through the NLP engine with certain topics, phrases, and words selected for it to search for. The software will attempt to find all data points relating to the user’s desired topic and present them as search results or possibly through an analytics dashboard.
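The search step described above can be sketched at the keyword level. This is a minimal illustration only, not how any vendor's engine works: it matches whole-word terms and returns the sentences that contain them, whereas a real NLP engine would also interpret meaning. The function name `mine_terms` and the sample lab notes are hypothetical and invented for this example.

```python
import re
from collections import defaultdict

def mine_terms(documents, terms):
    """Return, for each search term, the (doc_id, sentence) pairs that mention it."""
    hits = defaultdict(list)
    # One case-insensitive, whole-word pattern per search term.
    patterns = {
        term: re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in terms
    }
    for doc_id, text in documents.items():
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            for term, pattern in patterns.items():
                if pattern.search(sentence):
                    hits[term].append((doc_id, sentence.strip()))
    return dict(hits)

# Hypothetical lab notes, invented for illustration only.
notes = {
    "note-1": "Compound A was dosed at 5 mg. No reaction was observed.",
    "note-2": "Compound B reacted with the solvent. Dosage was not recorded.",
}
results = mine_terms(notes, ["compound a", "dosage"])
for term, matches in results.items():
    print(term, matches)
```

Grouping the matches by term mirrors the idea of presenting results per topic, which could then feed a results list or a dashboard.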

The short, 7-minute video below from AI vendor Linguamatics offers an introductory explanation of text mining and its capabilities. The sections of Linguamatics’ explanation are as follows:

  • 1:00: Search Engines and Keywords
  • 3:23: Interpreting the Meaning of Text
  • 5:10: Automatically Extract Relevant Facts
  • 5:33: Comparing the two Approaches

Linguamatics is one vendor that offers an NLP text mining solution, called I2E, which they claim can help pharmaceutical and life sciences companies search for information about different chemical compounds. This helps pharmaceutical scientists quickly retrieve facts about similar molecular compounds so they can know how chemicals may react before testing them together.

According to a case study by Linguamatics, the company was able to help Roche Pharma Research and Early Development speed up their drug discovery process. They did this by allowing medical chemists at Roche to search through “internal and external databases” for information regarding the relationships between chemical compounds and the disease they intend the drug to treat. The software allowed Roche scientists to input queries related to one, two, or all three of these categories to aggregate their relevant chemical information.

Linguamatics helped Roche develop their own AI platform called Artemis using I2E, which allowed the pharma company to search for chemical and pharmacological terms more easily. The solution purportedly saved Roche $10,000 per search based on a full-time equivalent cost of $200,000 per year.
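The two figures above imply a rough amount of staff time per search. The sketch below works through that arithmetic under two assumptions that the case study summary does not state: that the savings are staff time valued at the quoted full-time equivalent (FTE) cost, and that a working year has 52 weeks.

```python
# Figures quoted in the case study.
SAVINGS_PER_SEARCH_USD = 10_000
FTE_COST_PER_YEAR_USD = 200_000

# Assumption for illustration: a working year of 52 weeks.
WEEKS_PER_YEAR = 52

# Share of one FTE-year that the quoted savings would represent.
fte_years_per_search = SAVINGS_PER_SEARCH_USD / FTE_COST_PER_YEAR_USD
fte_weeks_per_search = fte_years_per_search * WEEKS_PER_YEAR

print(f"{fte_years_per_search:.0%} of an FTE-year, "
      f"about {fte_weeks_per_search:.1f} weeks per search")
```

Read this way, each search would replace about 5 percent of one person's year, or between two and three weeks of work.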

NLP for Matching Participants to Clinical Trials

Clinical trial matching is another use case for NLP in pharma. Finding the right patients for clinical trials from physician’s notes and past trials is time-consuming but could be automated with AI software trained on those types of documents. Common solutions from AI vendors are NLP programs that can discern the patients best suited for a given trial from expansive lists or databases of patient files.

In addition to how long it takes clinician teams to complete, matching patients to appropriate clinical trials poses a few key problems to pharmaceutical companies. Each one is listed below:

  • Detecting and recognizing International Classification of Diseases (ICD-10) codes for illnesses and injuries
  • Extracting important data points from various unstructured data sources
  • Utilizing patient data while maintaining their privacy and hiding protected information

ICD-10 codes are essential to determining a patient’s viability for a clinical trial. These codes standardize nearly every possible disease, illness or injury a patient may be suffering from or have suffered from in the past. A machine learning model for clinical trial matching would need to be trained to recognize the ICD-10 code or codes associated with a patient and determine whether they are closely related to the condition the drug being tested is intended to treat.

Recognizing information from these formats would require a machine learning model to be trained on pharmaceutical data found in the same types of documents. Developers would need to label each specific field in each type of document and run tens of thousands of reports through the model, in addition to running every ICD-10 code through as well.

This would allow the resulting software to be able to detect which fields hold which types of information, as well as the classification for any ailments that the documentation may refer to.
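As a much simpler stand-in for a trained model, the code-detection step can be illustrated with a pattern match. The sketch below assumes a simplified ICD-10 shape (a letter, a digit, an alphanumeric character, then an optional dot and one to four more characters); it checks format only and does not confirm that a code actually exists. The function names and the sample note are hypothetical.

```python
import re

# Simplified ICD-10 shape; an assumption for illustration, not a full validator.
ICD10_PATTERN = re.compile(r"\b[A-Z][0-9][0-9A-Z](?:\.[0-9A-Z]{1,4})?\b")

def extract_icd10(text):
    """Return every substring of the text that looks like an ICD-10 code."""
    return ICD10_PATTERN.findall(text)

def is_candidate(note, target_prefixes):
    """True if any code found in the note starts with one of the trial's target prefixes."""
    codes = extract_icd10(note)
    return any(code.startswith(prefix) for code in codes for prefix in target_prefixes)

# Hypothetical physician's note, invented for illustration only.
note = "Patient has a history of I50.9 and E11.9, seen for follow-up."
print(extract_icd10(note))
print(is_candidate(note, ["I50"]))
```

Matching on code prefixes is one plausible way to express "closely related" conditions, since ICD-10 groups related diagnoses under shared leading characters.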

Some patient data may be protected by law or agreement and cannot be divulged in a way that would link the information back to the individual. Not as many vendors offer solutions that can specifically handle cases like this, but some claim to be able to obfuscate the information in terms of privacy while still providing useful insight. This is likely possible with graphs or other visualizations that do not detail specific patients or physicians but contain relevant statistics.
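One way to picture statistics that do not detail specific patients is to report only aggregate counts and drop any group too small to hide an individual. The sketch below is an assumption about how such a summary might be built, not a description of any vendor's product; `aggregate_counts`, the `min_cell_size` threshold of 5, and the sample records are all hypothetical.

```python
from collections import Counter

def aggregate_counts(records, field, min_cell_size=5):
    """Count records per value of `field`, dropping groups smaller than `min_cell_size`."""
    counts = Counter(record[field] for record in records)
    # Small groups are suppressed because they could point back to individuals.
    return {value: n for value, n in counts.items() if n >= min_cell_size}

# Hypothetical patient records, invented for illustration only.
records = (
    [{"patient_id": f"p{i}", "diagnosis": "heart failure"} for i in range(6)]
    + [{"patient_id": "p6", "diagnosis": "rare condition"}]
)
summary = aggregate_counts(records, "diagnosis")
print(summary)
```

The output carries no patient identifiers, and the single-patient group is left out of the summary entirely.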

One vendor that claims to be able to keep protected patient data safe while matching patients to clinical trials is Deep 6 AI. The company offers NLP software that can purportedly detect specific traits associated with individual patients such as symptoms, past diagnoses, genomics, or test results. They claim the software can help clients find more patients for each clinical trial than they normally would and in a shorter amount of time.

Below is a graphic from Deep 6 AI’s website that details the proposed benefits of implementing their solution:

Deep 6 AI’s value proposition

In a video from Deep 6 AI, Clive Svendson, Director, Board of Governors Regenerative Medicine Institute at Cedars-Sinai, speaks about his organization’s improved efficiency after implementing the NLP software. With regard to Deep 6 AI’s solution, Svendson said,

Deep6’s ability to go into the electronic medical record, select that information almost instantaneously, and put together a list of one thousand patients and then drill it down to 30 or 40 for the trial is going to save clinicians here an enormous amount of time. That’s why it’s so exciting for us.

Deep 6 AI has published a case study in which they claim to have helped Cedars-Sinai Heart Institute improve the process of finding eligible patients for a study on a drug called Udenafil. The drug is a chemical inhibitor used for treating those born with a specific heart defect, and Cedars-Sinai had only found two patients for the study prior to working with Deep 6 AI.

According to the case study, Cedars-Sinai used Deep 6 AI’s “cohort builder tool,” or application for organizing and identifying patients from a database. The company was purportedly able to identify 19 patients and then validate 16 of them as eligible for the Udenafil trial. Deep 6 AI claims this process took less than one hour, though it is unclear if that is an accurate representation of the average time this would take a business leader in a similar field.

NLP for Marketing Pharmaceuticals

We have reported on AI applications for pharmaceutical marketing and sales in detail in the past, but our research has yielded markedly fewer NLP applications for this business area. That said, there are still many marketing and sales opportunities to be found within unstructured customer data, so NLP remains a possibility for solving marketing or sales problems for pharmaceutical products.

The most common business problems that an NLP software solution could solve for marketing or sales teams in pharma are:

  • Evaluating social media buzz surrounding the product and determining the sentiment of individual posts
  • Making use of customer or patient profile data to find opportunity value in individuals or demographics
  • Analyzing call center data from recordings of sales calls

Analyzing social media posts for their sentiment and association with a specific pharmaceutical product would require those training the machine learning model to label various words, phrases, and possibly internet slang as positive or negative when associated with the product. Some sentence fragments may also be labeled to allow for more specific interpretation of context.

This would allow the company to aggregate social media response to an advertisement or a product as individual data points that can be measured to evaluate the performance of a marketing campaign.
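The labeling and aggregation described above can be sketched with a tiny hand-labeled word list in place of a trained model. Everything here is hypothetical and invented for illustration: the word lists, the product name "DrugX", and the sample posts. A production system would learn from far larger labeled data and handle context, negation, and slang.

```python
import re

# Hypothetical hand-labeled words; a real model would learn these from labeled posts.
POSITIVE = {"great", "helped", "love"}
NEGATIVE = {"awful", "worse", "hate"}

def score_post(post):
    """Positive-word count minus negative-word count for one post."""
    words = re.findall(r"[a-z']+", post.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def summarize(posts, product):
    """Aggregate sentiment over the posts that mention the product."""
    relevant = [p for p in posts if product.lower() in p.lower()]
    scores = [score_post(p) for p in relevant]
    return {
        "mentions": len(relevant),
        "positive": sum(s > 0 for s in scores),
        "negative": sum(s < 0 for s in scores),
        "neutral": sum(s == 0 for s in scores),
    }

# Hypothetical posts, invented for illustration only.
posts = [
    "DrugX really helped, love it",
    "DrugX made me feel worse",
    "Nice weather today",
    "Started DrugX this week",
]
summary = summarize(posts, "DrugX")
print(summary)
```

Each post becomes one data point, and the per-category counts give a simple measure that can be tracked across a campaign.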

Customer and patient information can consist of data from a CRM, anonymised electronic medical record (EMR) data, or transactional data based on their previous purchases. NLP software could detect a patient or customer’s history with an advertisement, drug, or other product, and aggregate that with the experiences of others in their demographic. This can positively affect marketing campaigns by showing the client team which areas may need a better strategy in the future.

Sales call center data is likely stored as audio recordings of sales calls which can be processed and recognized by speech recognition software. This application of NLP sees a lot of use in the creation of EMRs and other digitally transcribed medical documents, but here it can be used to discern customer engagement and adherence of the sales rep to sales protocol.

AI speech recognition software needs to be trained on as many distinct voices as possible using various inflections and with various levels of background noise. This way the model can more easily discern the words spoken in the call should there be any interference while a sales representative is with a customer.

It is important to note that marketing and sales solutions for pharmaceuticals will require a more thorough data labelling and preparation process than with other industries. This is because machine learning models for pharma need to be trained on individual medical and pharmaceutical phrases and codes that do not come up for other industries.

We interviewed Gunjan Bhardwaj about the challenges life sciences companies face in terms of data and talent in the age of AI. In reference to his statement about AI insights from pharmaceutical data needing to be context aware, Bhardwaj said,

In life science and pharma, it’s not English that the data speaks. It’s not French, it’s not German, it’s not Spanish. It’s medical or pharma English. There is a language of the domain, what we call a metaontology. A metaontology that encompasses the ICDs of this world, the gene ontologies, the pathways, the biomarkers. All biomedical concepts and their relationships defined. Unless the system speaks the language of that domain, you cannot understand the context [in which] something is said.

Eularis is one such company that offers NLP software to the pharmaceutical industry for sales purposes. They list Merck, Novo Nordisk, and Shire Pharmaceuticals as past clients, and claim their software can also parse stores of big data. Their NLP solutions can purportedly handle various unstructured data in many formats.

The company claims their software can also analyze social media data, though whether this application uses sentiment analysis is not expressly stated.

Below is an image that highlights the espoused benefits of using Eularis’ software:

Eularis’ value proposition

According to a case study on Eularis’ website, the company helped a pharmaceutical company predict how much their brand was being shared around social media and the audience reaction to new advertisements. Eularis purportedly used sentiment analysis as part of their NLP solution to gauge positive or negative response to the client’s advertisements.

Eularis helped the company make use of their social media data to gauge the future of their market share and sales, and to identify which variables have the most impact on customers choosing their brand. They could purportedly calculate how much brand revenue is collected as a result of digital marketing.

The case study states that the client company was able to see how their social media marketing initiatives affected sales and brand reputation in each country where their products are sold. They also found more information on what influences customers to choose their brand and how they can change their marketing behavior to improve brand choice.

The client company could then justify increasing their digital marketing budget because they found improving on that front was a direct cause for improved brand results.


Header Image Credit: The Conversation


