Data Collection and Enhancement Strategies for AI Initiatives in Business

Dylan Azulay

Dylan is Senior Analyst of Financial Services at Emerj, conducting research on AI use-cases across banking, insurance, and wealth management.

There’s more to successful AI adoption than picking the right technology. Business leaders should be aware of the technical requirements of the initiative they’re undertaking, and few of those requirements are as important as data.

For this article, we spoke with Mark Brayan, CEO of Appen, a firm that offers crowdsourced training data for machine learning applications. We discuss how developing a sound data strategy is essential for using AI to solve business problems. Brayan also helped us detail how and when a business can make use of certain data collection and enrichment methods depending on its business goals.

Below, we outline how executives should think about their data needs and address the circumstances in which crowdsourcing might serve a business’s AI initiatives well.

Effective AI Mimics Human Function

“AI is a product that mimics human function,” according to Brayan. Artificial intelligence aims to recreate the processes of the human brain at a speed and scale that are, by definition, out of reach for our own organic brains.

When thinking about AI, Brayan suggests we think about how a human might learn. If that human is working at a call center, they might start their first few days on the job learning the schedule, the scripts, the company clients, and when to escalate the call.

After that, the call center employee starts to learn more about their role while they work. They’ll learn how to modulate their voice depending on the customer they have on the line, and they’ll learn the more nuanced instances in which an escalation might be necessary. They’ll learn when they’ve performed well or made a mistake based on feedback from their supervisor.

Similarly, Brayan suggests that AI will learn “on the job,” so to speak. If someone wants to make a phone call, they might tell their phone, “Call Dan.” The machine learning behind the voice recognizer in the phone would then pull up Dan’s phone number and wait for the caller to press the dial button.

If the caller presses the dial button, the AI-based voice recognizer would in effect learn that it pulled the correct phone number. If the caller does not press the dial button, the recognizer would then learn that it pulled the wrong phone number. In either case, this learning should inform the next time the recognizer pulls up a phone number for the caller.
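The feedback loop Brayan describes can be sketched in a few lines of Python. This is a minimal illustration and not how any real voice recognizer works: the `ContactPicker` class, the contact names, and the simple plus-or-minus-one scoring rule are all assumptions made for the example.

```python
from collections import defaultdict


class ContactPicker:
    """Toy recognizer that learns which contact a caller means (hypothetical)."""

    def __init__(self, contacts):
        # contacts: dict mapping contact name -> phone number
        self.contacts = contacts
        # score[(spoken, contact)] rises on a confirmed dial and falls otherwise
        self.score = defaultdict(float)

    def suggest(self, spoken):
        # Candidates are contacts whose name starts with the spoken word.
        spoken = spoken.lower()
        candidates = [n for n in self.contacts if n.lower().startswith(spoken)]
        if not candidates:
            return None
        # Highest score wins; ties fall back to alphabetical order.
        return max(sorted(candidates), key=lambda n: self.score[(spoken, n)])

    def feedback(self, spoken, suggested, dialed):
        # Pressing dial is treated as "correct", not pressing as "wrong".
        self.score[(spoken.lower(), suggested)] += 1.0 if dialed else -1.0


picker = ContactPicker({"Dan Brown": "555-0101", "Dana Lee": "555-0102"})

first = picker.suggest("Dan")                 # initial guess, no feedback yet
picker.feedback("Dan", first, dialed=False)   # caller did not press dial

second = picker.suggest("Dan")                # guess after the rejection
picker.feedback("Dan", second, dialed=True)   # caller pressed dial
```

After the rejected first guess, the picker switches to the other matching contact, and the confirmed dial reinforces that choice for the next call.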

That said, a call center employee might occasionally need to be briefed on new company clients, new products, or new scripts. This is not unlike machine learning models, many of which require frequent updates. In fact, McKinsey estimated that a quarter of AI models require weekly updates in order to remain effective for their purposes.

It follows then that in order to create the kind of AI that most resembles human function, a business might need to feed that AI the most human-like data during its development and on an ongoing basis.

Effective AI Targets a Specific Use Case

Aside from maintaining a certain quality standard for its data, a business may also need to make sure that the data it collects lines up with the data’s intended use. In other words, “What’s the problem we’re trying to solve?” Brayan asks.

Integrating AI solutions into a business is not as easy as tacking AI onto workflows in order to stay relevant in today’s landscape; in fact, we heavily advise against this. Businesses should consider AI only when they have a problem that traditional computing cannot solve and that humans could theoretically solve given infinite time.

“We have to collect data that fits the use case,” Brayan says. He elaborates:

The best AI is narrow AI. When you target a specific problem and get the right data for that problem, you’re going to get a better answer. If you try to build something real broad—if you want a speech recognizer to recognize any accent—you’re going to have to collect a lot of data. If you want to build [a speech recognizer] that recognizes just one person’s voice, that’s a lot easier to build because there’s less nuance for the recognizer to understand. That applies across AI in general. The narrower the use case, the more specific the data, the better the chance of an outcome that reflects what you’re trying to do.

If the most effective AI mimics human function and is targeted toward a specific use case, how can businesses collect and enhance the data needed to feed into it? Businesses may begin an AI initiative from different starting points; some may need to collect brand new data, and others may need to enhance the data they currently have. In either case, the quality of that data will likely play a large role in determining the success of a business’s AI initiative.

There are several ways a business might go about collecting quality data or enhancing its current data with quality information in order to build an AI that mimics human function. These may include:

  • Buying the Data
  • Crowdsourcing

Buying the Data

Some businesses may already have all the data they require to train an algorithm to achieve their ends. If they don’t, they might be able to outright purchase all of that data if they have the resources. In these scenarios, the business would not need to do anything more than let the machine learning algorithm loose on its database, assuming it’s clean. Unfortunately, most businesses won’t find themselves in these scenarios.

Third-party Data

In some cases, a business may find that it has some data, but not all of the data it may need to train its machine learning model on the problem it wishes to solve. To remedy this, it might look to enhance its data using data that’s already been collected by a third-party firm. The business could then purchase that data from the third-party firm, thus completing its database for the purposes of training a machine learning algorithm.

For example, an eCommerce company might be looking to send physical goods or coupons to customers for an ad campaign. It has data on past purchase history, and so it knows who might be interested in the coupon. What it doesn’t know is the current financial situation of these customers, which could indicate how likely they are to use the coupon and make the large purchase it discounts.

The eCommerce company might look for a way to add this data to the profiles of its existing customers. It might pay Equifax or a similar enterprise with access to data revealing people’s financial situations to provide that data, matching the eCommerce company’s customers with those in Equifax’s database based on customer credit cards. The eCommerce company would then have more data on its customers, and it would be able to determine which customers are worth sending the coupons to.

In this case, a business is making use of another company’s or institution’s database to feed their own databases and enrich their data. This data can then in turn be used to train a machine learning algorithm on a specific use case.
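The enrichment step above amounts to joining two record sets on a shared key. The sketch below assumes a hypothetical `card_id` matching key and made-up field names (`past_purchases`, `credit_band`); the actual matching process and data schema of any third-party provider are not described in this article.

```python
def enrich(customers, third_party, key="card_id"):
    """Left-join third-party attributes onto first-party customer records."""
    # Index the purchased data by the shared key for fast lookup.
    lookup = {row[key]: row for row in third_party}
    enriched = []
    for customer in customers:
        match = lookup.get(customer[key], {})
        extra = {k: v for k, v in match.items() if k != key}
        # Unmatched customers are kept and flagged rather than dropped.
        enriched.append({**customer, **extra, "matched": bool(match)})
    return enriched


# Hypothetical first-party purchase history.
customers = [
    {"card_id": "c1", "past_purchases": 12},
    {"card_id": "c2", "past_purchases": 3},
]
# Hypothetical purchased financial data covering only some customers.
third_party = [{"card_id": "c1", "credit_band": "high"}]

rows = enrich(customers, third_party)
# Pick coupon recipients using the newly added attribute.
targets = [r["card_id"] for r in rows if r.get("credit_band") == "high"]
```

Keeping unmatched customers in the output, with a `matched` flag, makes the coverage gap of the purchased data visible before any model is trained on it.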

This bodes well for eCommerce businesses, insurers, bankers, and creditors—sectors with an easier time digitizing than, say, car manufacturers and other businesses with more happening in the physical world. More complex use cases involving a blend of physical and digital or a large amount of nuance may not benefit from accessing pre-existing databases that don’t store data specific to those use cases.

Training a machine learning algorithm on a large pool of nondescript eCommerce data, for example, may not help a business selling pewter golfing memorabilia because chances are data relevant to that business isn’t available.


Crowdsourcing

“Where the crowd comes in is when the problems are really complex,” Brayan says. Crowdsourcing is one method of data collection and enhancement that involves paying a crowdsourcing company to gather people to enhance or collect the data pertinent to a business’ use case.

Enhancing the Data

For example, a security company that has thousands of hours of video footage depicting vehicles driving into a facility parking lot may have trained a machine learning model to differentiate between people and vehicles, but it may now need its model to differentiate between cars and trucks. It can’t simply reach out to a company with thousands of hours of driving footage in which cars and trucks are labeled differently because it’s unlikely to find that data for sale.

Instead, the company can pay a crowdsourcing firm to gather people to enrich their existing video data so that cars and trucks are labeled differently. These people could manually label cars and trucks as they watch the company’s footage, helping to train algorithms to make the same distinction.

The security company would then be left with labeled data that could be fed into their machine learning model, and the model behind the company’s cameras would then be able to differentiate between cars and trucks that pass by them.
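When several people label the same footage, their answers need to be combined into one label per frame. The article does not say how this is done; the sketch below assumes a simple majority vote with a hypothetical `min_agreement` threshold, and the frame IDs and labels are invented for illustration.

```python
from collections import Counter


def aggregate_labels(annotations, min_agreement=0.6):
    """Majority-vote each frame's labels; flag weak agreement for review.

    Assumes every frame has at least one label.
    """
    final, needs_review = {}, []
    for frame_id, labels in annotations.items():
        # Most frequent label and how many annotators chose it.
        label, votes = Counter(labels).most_common(1)[0]
        if votes / len(labels) >= min_agreement:
            final[frame_id] = label
        else:
            needs_review.append(frame_id)
    return final, needs_review


# Hypothetical labels from crowd workers watching the same frames.
annotations = {
    "frame_001": ["car", "car", "truck"],
    "frame_002": ["truck", "truck", "truck"],
    "frame_003": ["car", "truck"],
}
final, needs_review = aggregate_labels(annotations)
```

Frames where annotators split evenly are held back for review instead of being fed to the model with a label that may be wrong.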

Collecting the Data

A machine learning model trained on data from a variety of people from all over the world in a variety of situations could provide businesses greater access to international markets.

Brayan points out that if a self-driving car company already operates in Australia, the voice recognition software that works in its self-driving cars may not work the same for users in the UK. This is because the machine learning algorithms behind the voice recognizer would have been trained on Australian speech and Australian driving conditions. The system would have trouble with unfamiliar accents, dialects, and weather conditions.

For example, a customer in Australia could interact with an Australian insurance chatbot equipped with a voice recognizer by saying “I crashed my car.” The chatbot could be trained to understand this, but if a UK customer interacts with that chatbot and says, “I pranged my motor,” it might not be able to recognize that both customers are really saying the same thing.
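The effect of missing local training data can be shown with a deliberately simple intent matcher. This is a toy word-overlap classifier, not a real speech or language model; the intent names, the example phrases other than the two quoted above, and the small stopword list are assumptions for the sketch.

```python
# Common words ignored when comparing phrases (illustrative list only).
STOPWORDS = {"i", "my", "to", "the", "a"}


def tokens(text):
    """Lowercase a phrase and keep only its content words."""
    return set(text.lower().replace(".", "").split()) - STOPWORDS


def classify(utterance, examples):
    """Return the intent of the example sharing the most words, or None."""
    best_intent, best_overlap = None, 0
    for text, intent in examples:
        overlap = len(tokens(utterance) & tokens(text))
        if overlap > best_overlap:
            best_intent, best_overlap = intent, overlap
    return best_intent


# Hypothetical training phrases collected from Australian customers.
au_examples = [
    ("I crashed my car", "report_accident"),
    ("I want to renew my policy", "renew_policy"),
]
# Hypothetical phrases collected from UK customers.
uk_examples = [("My motor got pranged", "report_accident")]

au_only = classify("I pranged my motor", au_examples)
localized = classify("I pranged my motor", au_examples + uk_examples)
```

With only the Australian examples, the UK phrasing matches nothing; once locally collected phrases are added, the same sentence resolves to the accident-report intent.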

Brayan points out, “The success of the chatbot comes down to the ability to answer questions. The more natural it is, the more capable it is, the more the end user…gets value out of that channel.” In other words, tailoring the AI behind the chatbot to the users by training it on their speech patterns is more likely to create a chatbot that drives ROI for businesses. The ability to detect the tones and terms of different dialects gives a machine learning model the ability to respond to a wider range of customers.

Businesses that wish to have global reach may find they need to train their machine learning models on populations native to those areas. Brayan suggests that businesses looking to expand internationally are “going to have to localize that data not just from one language to the next, but make sure it fits the culture and so on.”

A business might benefit from taking accents, dialects, and cultures into account when training models to work for people all over the world. For a machine learning chatbot or voice recognizer to work for customers in other countries, people in those countries are going to have to train it.

Crowdsourcing companies can gather people from other countries to train machine learning models for international commercial use. A US-based self-driving car company looking to sell cars in Ireland may find that it needs to pay a crowdsourcing company to gather speech command data from people native to Ireland, training its models to understand the nuances and quirks of Irish English.

Planning a Data Strategy

Brayan relays that before considering AI solutions, a business should first ask if the problem they wish to solve is “something that can be done computationally or… something that needs the nuance of humans to work out?”

If a business answers the latter, Brayan suggests that it then look at the data it already has. Although some businesses may have swaths of data going back decades, they may not have the data they need to answer the question or solve the problem at hand.

If this is the case, the business may need to collect the relevant data or enhance the data it already has, and this is a potential opportunity for crowdsourcing. “The advantage of the crowd,” according to Brayan, “is you get human-derived or human-quality data.” This kind of data allows for the nuance of the human experience, providing a solid background for a machine learning model that intends to serve global markets.

Regardless of which methods of data collection and enhancement a business uses for its AI initiatives, it should only choose to leverage AI when it makes good business sense. Some businesses could purchase the data they need, whether a complete dataset or data to enhance what they already have, from third parties, and this would get them the relevant data in the shortest amount of time. However, that option might be inaccessible to a business due to its cost or due to the complexity of the problem it’s trying to solve.

On the other hand, crowdsourcing companies may in some cases be a more cost-effective alternative, but the data may take longer to collect than if a business were to purchase a dataset from a third party. Businesses should weigh this trade-off between cost and speed before investing in any one data collection or enhancement method.

About Appen

Appen develops high-quality, human-annotated datasets for machine learning and artificial intelligence. Appen brings over 20 years of experience capturing and enriching a wide variety of data types including speech, text, image, and video. Appen partners with technology, automotive and eCommerce companies, as well as governments, to help them develop, enhance, and use products that rely on natural languages and machine learning.


This article was sponsored by Appen, and was written, edited, and published in alignment with our transparent Emerj sponsored content guidelines. Learn more about reaching our AI-focused executive audience on our Emerj advertising page.
