In the era of big data, companies need help navigating through an overwhelming volume of unstructured data to uncover meaningful insights. The topic search process presents unique challenges in deciphering data signals and identifying critical information before problems escalate.
However, advanced language models and cutting-edge techniques, such as vector embeddings, revolutionize topic search, enabling organizations to unlock valuable knowledge and drive informed decision-making.
Emerj Technology Research Senior Editor Matthew DeMello recently spoke with Ben Webster, Modeling and Analytics Team Lead at NLP Logix, on the ‘AI in Business’ podcast to discuss topic search as a business capability and the value of detecting meaningful signals.
In the following analysis of their conversation, we examine two key insights:
- Extracting insights from unstructured data to understand the business context: Analyzing unstructured data by identifying relevant topics and training a model with client-specific examples, providing actionable insights from a small percentage of data.
- Leveraging large language models and connecting insights: Utilizing vector embeddings and large language models to extract topics, sentiments, and customer behavior to provide actionable insights to drive business growth.
Listen to the full episode below:
Guest: Ben Webster, Modeling and Analytics Team Lead, NLP Logix
Expertise: Advanced Analytics, Predictive Modeling, and Sentiment Analysis
Brief Recognition: Ben has spent the last ten years at NLP Logix, first as a data scientist from 2013 to 2021 before being promoted to his current position as Modeling and Analytics Team Lead. He earned his master’s in Mathematics and Statistics from the University of North Florida in 2016.
Extracting Insights from Unstructured Data to Understand Business Context
Ben begins by citing that 90% of the data a company captures is unstructured, including conversations through emails or inbound call centers. Though such unstructured data contains a lot of valuable information, there’s so much of it that a human can’t read through it all.
The real challenge then becomes to figure out what the data is hinting at, find the information and context necessary to the business, and identify problems before they become more significant.
As he puts it, “The problem faced is to get to the insight – before you have an angry customer or a broken system – that you trace back to discover within the text in your data.”
In this way, topic search becomes a way of viewing problems like Voice of the Customer and media surveillance that is agnostic of company boundaries and tuned only to the signals appearing in the data being collected.
Ben further states that recording and capturing information and conversations within a company is essential. Companies can look for information that takes the form of comments – emails or trouble tickets, for instance – which are easy to capture.
However, the most critical problems are often discussed offline, such as when C-suite members talk to each other or when someone sends a text message about an issue.
These are significant indicators in the conversation chain but can be challenging to capture because they don’t come through official channels. Therefore, companies must ensure that all the essential, rich data is captured and stored. There’s no business value in capturing 80% of company data but missing the most critical 12% because it was held on personal devices.
Ben shares that there is often confusion between the concept of “topic modeling” and “model clustering” when it comes to data science projects.
This confusion leads to miscommunication between a project’s business and technical sides, and the technical side may end up building something that doesn’t support the product.
Therefore, both sides must use the correct language and communicate clearly to avoid confusion and ensure that the final product meets the business needs.
Ben explains a specific approach NLP Logix uses: training a large unsupervised model on how the client’s data is used in specific contexts, which leads to a deeper understanding of how that particular client speaks and communicates. The methodology bears similarities to few-shot learning, a machine learning technique where the model is trained with only a few examples of a particular concept or topic.
In this case, the client provides three or four examples of what they consider timely, alert-worthy, or relevant topics. Seeding the model with these examples allows it to discover and understand the issues most important to the user. The goal is to quickly provide the user with helpful information rather than something that may be scientifically interesting but irrelevant to their needs.
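As a rough illustration of what such seeding might look like in practice, the sketch below scores incoming comments against a handful of client-supplied example phrases in a shared embedding space. It is not NLP Logix’s implementation; the library choice, model name, and example texts are all assumptions made for the sake of the sketch.

```python
# Illustrative sketch (not NLP Logix's actual pipeline): surface comments that
# are semantically close to a few client-provided "alert-worthy" examples.
from sentence_transformers import SentenceTransformer, util

# Hypothetical seed examples supplied by the client
seed_examples = [
    "Customer threatening to cancel their contract",
    "Billing system charged the wrong amount",
    "Support ticket has been open for weeks with no response",
]

# Hypothetical incoming comments from emails or trouble tickets
comments = [
    "I've been double-billed twice this quarter and nobody has fixed it.",
    "Thanks for the quick turnaround on my request!",
    "If this isn't resolved soon we will move to another vendor.",
]

# Any general-purpose sentence-embedding model could stand in here; this one is an assumption.
model = SentenceTransformer("all-MiniLM-L6-v2")
seed_vecs = model.encode(seed_examples, convert_to_tensor=True)
comment_vecs = model.encode(comments, convert_to_tensor=True)

# Score each comment by its best cosine similarity to any seed example.
scores = util.cos_sim(comment_vecs, seed_vecs).max(dim=1).values.tolist()
for comment, score in sorted(zip(comments, scores), key=lambda pair: -pair[1]):
    print(f"{score:.2f}  {comment}")
```

Comments scoring above some threshold would then be surfaced to the user first, in line with the goal of quickly bringing actionable material forward rather than analyzing everything.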
Given the relevance criteria, he feels that many clients mistakenly believe it’s necessary to partition all of their data. He clarifies that what is relevant to business leaders in the client’s business use case is often only a tiny percentage of the words. What becomes most interesting is the occurrence and co-occurrence of those critical word choices with other relevant themes or topics.
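A minimal way to picture that co-occurrence view – assuming simple keyword matching rather than whatever NLP Logix uses in production, and with invented terms and comments – is to count how often the client’s critical terms appear in the same comment as other topics:

```python
# Illustrative sketch: count how often client-critical terms co-occur with
# other topics inside the same comment. All terms and comments are assumptions.
from collections import Counter

critical_terms = {"refund", "outage", "cancellation"}
topic_terms = {"billing", "mobile app", "support", "delivery"}

comments = [
    "The outage last week broke the mobile app and support never called back.",
    "I want a refund because billing charged me twice.",
    "Delivery was late again, thinking about cancellation.",
]

co_occurrence = Counter()
for text in comments:
    lowered = text.lower()
    present_critical = {term for term in critical_terms if term in lowered}
    present_topics = {term for term in topic_terms if term in lowered}
    for crit in present_critical:
        for topic in present_topics:
            co_occurrence[(crit, topic)] += 1

for (crit, topic), count in co_occurrence.most_common():
    print(f"{crit!r} co-occurs with {topic!r}: {count}")
```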
Contrary to client expectations, the results do not include detailed analyses of sentence anatomy – that wouldn’t be relevant for the business leader, the customer, or the provider. Ben tells Emerj that, consequently, in topic search as a discipline, the fewer data points and contexts there are, the better:
“What I find that always becomes interesting to the customer is how they come in thinking that I’m going to help them completely understand the anatomy of every sentence. And what we come back with is something that brings insight from two percent because that’s all that they need. I don’t have enough time for a human to comb through this 100%. What we want to do is find out what you as a customer think is actionable and bring that right up to the front.”
– Modeling and Analytics Team Lead at NLP Logix, Ben Webster
Leveraging Large Language Models and Connecting Insights
Ben describes his experience utilizing vector embeddings for topic modeling over the past seven years. He mentions that large language models are essentially an improved form of embedding. Despite this improvement, Ben emphasizes that their belief in the end user’s expertise, and the importance of using the user’s initial input and examples to learn about the desired topics, remain the same.
What Ben says becomes easier with time and with the LLM is that the system begins providing more nuanced suggestions, which in turn leads users to ask more interesting questions. He appreciates that the LLM allows users to understand topics not just from one- or two-word phrases but by reading and considering entire sentences or combinations of sentences. Ben highlights that although the conversation becomes more in-depth and nuanced, the mechanics and approach remain consistent.
He explains the importance of integrating structured and unstructured data based on the typical forms they take in modern enterprises. Unstructured data contains valuable insights and is often provided to NLP Logix as a large corpus of text to analyze.
On the other hand, structured data consists of organized information – such as whether a customer quit, gave a bad review, made a purchase, etc. Ben emphasizes the need to marry these two data types because they complement each other.
He mentions a typical scenario where a company collects unstructured feedback along with an associated numerical score. Ben tells Emerj that no matter how often that arrangement is deployed, the enterprise often needs help understanding the long-term relationship between the score and customer behavior – for instance, they cannot determine whether delighted customers actually made purchases.
“So we extract the topics, we apply sentiment analysis to the mentions of the topics. Your happiest customer can complain about a topic – even your most upset customer can have positive things to say about a topic. When you combine that, the result is the outcome – which is the thing you cared about: whether or not they bought. And then their tendency to score you in the presence of that. That’s when you get the whole story.”
– Modeling and Analytics Team Lead at NLP Logix, Ben Webster
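The combination Ben describes – topic-level sentiment from the text joined against a structured outcome such as a purchase – can be pictured in a few lines of pandas. The column names, sentiment scores, and data below are invented for illustration and do not reflect NLP Logix’s pipeline.

```python
# Illustrative sketch: join per-topic sentiment extracted from comments with a
# structured outcome (did the customer buy?). All names and values are assumptions.
import pandas as pd

# Output of a hypothetical topic + sentiment extraction step, one row per topic mention
mentions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3],
    "topic":       ["billing", "support", "billing", "delivery", "billing"],
    "sentiment":   [-0.8, 0.4, 0.6, -0.2, -0.5],   # negative = complaint
})

# Structured outcomes, e.g. pulled from a CRM
outcomes = pd.DataFrame({
    "customer_id":  [1, 2, 3],
    "purchased":    [False, True, True],
    "survey_score": [9, 6, 8],
})

# Average sentiment per customer per topic, then attach the outcome
per_topic = (mentions.groupby(["customer_id", "topic"])["sentiment"]
                     .mean()
                     .reset_index())
combined = per_topic.merge(outcomes, on="customer_id")

# e.g. do customers who complain about billing still purchase and score us highly?
print(combined[combined["topic"] == "billing"])
```

Reading the joined table side by side is what lets an analyst see, for example, that a customer who complains about one topic may still buy and still rate the company well – the “whole story” Ben refers to.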
He further explains why he focuses explicitly on comments for analysis. He mentions that comments are polarized, meaning people express either a plea for help or dissatisfaction when discussing topics like call centers in emails. Such polarization makes it easy to understand the sentiment and emotions conveyed in the comments. However, if they were to analyze all text data, technical discussions may lack emotional expression or could be misinterpreted.
To address this, NLP Logix has a solution that uses a template-based approach similar to few-shot learning, leveraging AI capabilities in large language models and unsupervised learning.
The solution focuses explicitly on comments that relate to the user experience because understanding and improving the user experience is crucial for generating revenue.
While discussing the importance of connecting visibility to return on investment, Ben mentions that people are often concerned about factors related to brand loyalty. As an example, he highlights a situation in which clients consistently rate a company positively due to a long-standing relationship but still provide valuable feedback within the data, such as complaints or suggestions for improvement.
He emphasizes to Emerj the need to connect process improvements to potential ROI. By doing so, you can show how you monitor and identify issues early, leading to actionable changes that can positively impact the company’s financial performance.
Ben concludes by saying that simply presenting word clouds or superficial insights may initially impress clients, but it does not lead to a long-term product or sustainable relationship.