How to Determine the Data Needs of an AI Project or Initiative

Ayn de Jesus

Ayn serves as AI Analyst at Emerj, covering artificial intelligence use cases and trends across industries. She previously held various roles at Accenture.

Episode Summary: We receive a lot of interest from business leaders in the domain of data enrichment, and we’ve executed a few campaigns for these businesses. At the same time, our audience seems particularly interested in collecting data to train a bespoke machine learning algorithm for business, asking how to get started on data collection and where that data might come from.

This week on AI in Industry, we seek to answer those questions. We are joined by Daniela Braga, CEO and founder of DefinedCrowd, a data enrichment and crowdsourcing firm, who discusses how a business can determine what kind of data it needs for its AI initiative.

We hope the insights garnered from this interview will help business leaders get a better idea of how they could go about starting an AI initiative and seeing it through, from data collection or enhancement to solving the business problem.

Guest: Daniela Braga, founder and CEO — DefinedCrowd Corp.

Expertise: Speech recognition, natural language processing, machine learning, software engineering, program management, linguistics

Brief Recognition: Daniela Braga has 18 years of work experience in speech technologies, in both academia and industry. Prior to DefinedCrowd, she served at Microsoft, where she worked on speech technology. At Voicebox, she created the Data Science and Crowdsourcing team. She was also a lecturer at the University of A Coruña and a researcher at the University of Porto. She holds a PhD in speech technology from the University of A Coruña.

Interview Highlights

(04:05) How can executives get a grasp of the data requirements for a certain project?

Daniela Braga: Two types of clients come to us: the ones who have tons of data and don’t know what to do with it, and the ones who don’t have any data, need to start building some automation or efficiency into their systems, and need to start from scratch. We serve all stages of the AI lifecycle, from collecting raw data to start a model from scratch, to looking at the large amounts of raw data that clients already have and giving them guidance on how to structure that data to get the best results in the application.

We position ourselves within AI. A lot of people say they do AI but are really building rule-based models. For AI, you need machine learning-driven models and lots of structured data. We work along the lines of cognitive services: we collect and structure high-quality training data for voice applications, text, and computer vision.

Our clients usually come with situations like building a personal assistant, saying, “I want to go multilingual or into a different market, and I need 2,000 people speaking different dialects in my domain, in banking, in Germany or Vietnam. How do I get that data? I have a model working in English but don’t know how to get to the next level.” That is one example, and it happens a lot in the finance world.

We see a lot of models that work well for generic domains: playing music, getting information about the weather, or basic search. But when you get in front of a finance or insurance client, or even medicine, you need domain-specific data. Domain customization is all about domain-specific data. That’s also what we do here.

Another problem that clients bring is about entity recognition. For example, one of our clients uses the Stanford NLP (natural language processing) models for entity recognition in text, and the client says, “They don’t work well with my entities in finance. They don’t work in Japanese, because Stanford NLP does not support Japanese. How do I get that?”
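
To make the gap concrete: a general-purpose, pretrained entity recognizer tends to miss or mislabel domain-specific entities. Below is a minimal sketch using spaCy’s off-the-shelf English model; spaCy is our choice for illustration and is not a tool mentioned in the interview.

```python
# Minimal sketch: probing a generic pretrained NER model on finance text.
# spaCy's small English model stands in for any off-the-shelf recognizer;
# it is not the system discussed in the interview.
import spacy

nlp = spacy.load("en_core_web_sm")  # generic, non-domain-specific model

doc = nlp("The 5-year CDS spread on Deutsche Bank widened 12bps after the ECB decision.")
for ent in doc.ents:
    print(ent.text, ent.label_)

# Broad labels like ORG or CARDINAL may appear, but finance-specific entities
# ("CDS spread", "12bps") are typically missed or mislabeled, which is exactly
# why domain-specific annotated data is needed.
```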

So you need to collect data, and you need to annotate data. There is a lot involved in entity tagging; it is one of the most difficult tasks, especially in specialized domains. You sometimes need domain specialists to annotate those entities. And there is a whole methodology around measuring the quality of the people, because we combine people with machines to make our data processing more efficient and accurate.
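
Braga does not spell out how annotator quality is measured, but a standard approach is inter-annotator agreement. Here is a minimal sketch using Cohen’s kappa from scikit-learn; the library, the labels, and the annotators are illustrative assumptions, not DefinedCrowd’s actual process.

```python
# Minimal sketch: gauging annotator quality via inter-annotator agreement.
# Cohen's kappa is one common metric; the tags below are hypothetical.
from sklearn.metrics import cohen_kappa_score

# Entity tags from two annotators over the same ten tokens.
annotator_a = ["ORG", "O", "MONEY", "O", "ORG", "O", "O", "MONEY", "O", "ORG"]
annotator_b = ["ORG", "O", "MONEY", "O", "O",   "O", "O", "MONEY", "O", "ORG"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 is perfect agreement; 0 is chance level
```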

(08:15) When it comes to finding crowdsourcing expertise in, for example, insurance or banking, how is that pulled off?

DB: We have partnerships with universities, with recent graduates or university students who are learning a discipline in the relevant domain. Sometimes we also use the client’s internal crowd mixed with ours, which is very interesting. The whole point of making machines better is bringing together internal and external expertise.

We just worked with a big hospital group in Portugal on a prediction model for ICD-10, a type of labeling of medical reports that only specialized physicians do whenever a case requires a report. The client wants more automation around that, so they go to a catalog of conditions. It’s an NLP activity: they take the notes from the clinical reports, and the system recommends an ICD code.

This ICD coding is now automated, but in order to get there, we had both internal physicians and physicians from the outside help train the model to classify completely open-text notes full of doctors’ abbreviations.
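
At its core, the workflow Braga describes is multi-class text classification over clinical notes. The sketch below assumes a simple scikit-learn TF-IDF and logistic-regression pipeline with made-up notes and codes; the interview does not describe the hospital group’s actual model.

```python
# Minimal sketch: recommending ICD-10 codes from free-text clinical notes.
# Pipeline, notes, and codes are illustrative, not the deployed system.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny physician-annotated training set (hypothetical notes and codes).
notes = [
    "pt c/o chest pain radiating to left arm, elevated troponin",
    "productive cough x5 days, fever, rales on auscultation",
    "fasting glucose 190, polyuria, polydipsia",
    "crushing substernal chest pain, ST elevation on ECG",
]
codes = ["I21.9", "J18.9", "E11.9", "I21.9"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(notes, codes)

# The system recommends a code; a physician still reviews the suggestion.
print(model.predict(["sharp chest pain, troponin positive"])[0])
```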

(12:10) When it comes to helping executives think through their application, how do you help [people interested in AI for AI’s sake] get a better sense of what is required to get “X” results from a machine learning system?

DB: It’s normal for people to think that they have data and can do something with it. In fact, it can be a simple rule, with no machine learning needed, and they get disappointed. A lot of the time they have a big budget and don’t know what to do with it.

In our field, we are all about human-computer interaction, and that’s where you are dealing with multiple variables. We are really mimicking the human brain with multiple variables. It’s not just two or three variables that we are putting together, which could almost be a rule. We often have contextual information, and we have linguistic information divided into multiple levels, from sound on the acoustic side through the syntactic, pragmatic, and semantic levels.

We guide clients by pointing them toward whether it is an AI problem or not.

Most of the time, the best clients have a multilingual or multi-market problem to solve; that is a big thing for us, because we support 46 languages. The other important thing is putting people in the loop with machines that are already learning, and making sure you are not missing the false positives and false negatives in your model training that make the model blind and biased. It’s very important to have the right distribution of the data. All of that comes in a package with us. We don’t charge for that type of consulting; we do that all the time.
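
Braga’s warning about missed false positives and false negatives, and about data distribution, corresponds to routine evaluation checks. Here is a minimal sketch with scikit-learn (an assumed toolkit; the labels are made up) showing a confusion matrix and a class-balance check.

```python
# Minimal sketch: surfacing false positives/negatives and checking the label
# distribution, the two failure modes discussed above. Labels are hypothetical.
from collections import Counter
from sklearn.metrics import classification_report, confusion_matrix

y_true = ["intent_pay", "intent_pay", "other", "other", "other", "intent_pay"]
y_pred = ["intent_pay", "other", "other", "intent_pay", "other", "intent_pay"]

# Rows are true classes, columns predictions: the off-diagonal cells are the
# false negatives and false positives a human in the loop should inspect.
print(confusion_matrix(y_true, y_pred, labels=["intent_pay", "other"]))
print(classification_report(y_true, y_pred))

# A heavily skewed label distribution is an early warning of a biased model.
print(Counter(y_true))
```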

(15:47) What is one thing you wish executives knew so that, when they think about data collection needs, they would have a better grasp of what’s reasonable, what’s possible, and what’s required?

DB: I think it’s important that more people understand what high-quality data is. Everybody talks about quality, but most people don’t really care. A lot of offerings don’t measure quality. They have to understand that we have a certification for every data unit we produce or process. A lot of companies don’t do that, and a lot of executives go by pricing instead of quality….That can hurt an organization’s project lifecycle. I’ve been there. It just delays everything and turns out to be more expensive in the end. That is to me the biggest pain…to pay attention, to not be fooled by the words and the buzz, and to look at the certification.

(18:40) Are there types of business functions, industries, or sectors that you think will have the deepest and most robust needs for data collection and enrichment in the next 5 years? Are there commonalities among companies that are likely to become customers for this kind of service?

DB: We think of this cognitive world and human interaction as completely horizontal and industry-agnostic. The only differences are the domain specialization and the language reach. A lot of industries are collecting lots of data that is so sensitive it cannot leave the client’s premises, which makes processing at scale very difficult if you need people in the loop. But that’s why there are techniques for deploying on premises, which is where our roadmap is going as well, having the client’s own people help in the processing and in continuous model training and improvement.

The biggest challenge is that we and our competitors are only tackling the tip of the iceberg of the data that can be touched. Right now, with GDPR (General Data Protection Regulation) and all the leaks and scandals, a lot of data cannot even be touched. There are oceans and oceans of data that companies cannot process at scale. The big shift will come when we can process big data internally, on premises, and with a lot of technology.

Header Image Credit: feadas.lu
