Why Big Data is Not Necessarily the Best Data for Business – A Conversation with Slater Victoroff

Daniel Faggella

Daniel Faggella is Head of Research at Emerj. Called upon by the United Nations, World Bank, INTERPOL, and leading enterprises, Daniel is a globally sought-after expert on the competitive strategy implications of AI for business and government leaders.

Episode Summary: You’re a business, and you’ve collected data – now how do you make sense of it? Bring in ‘sentiment analysis’, a form of machine learning that determines whether text is positive or negative. Slater Victoroff’s company Indico provides algorithms that specialize in this task.

In this episode, we talk about the common misconceptions that businesses have about where ‘big data’ may be applicable, and the lessons Victoroff has learned by gaining more tangible insights from smaller sets of data for companies. He explains why big data is not necessarily better, and discusses the steps that companies should take early on to make sure they’re prepared when it’s time to apply machine learning to their processes.

Guest: Slater Victoroff

Expertise: Computer science/software

Recognition in Brief: Slater Victoroff is a graduate of the Franklin W. Olin College of Engineering. In 2015, Victoroff won Boston TechJam and mentored at over 10 hackathons, including Hack the North, Hack Princeton, and Hack MIT. When founding Indico in 2014, Slater raised over $3 million as part of Boston’s Techstars accelerator program.

Current Affiliations: CEO at Indico Data Solutions

Machine Learning – More Than Big Data?

Big data is a trend for good reason. Without lots of information, machine learning would not have gotten its second wind in the world of AI. But what sort and amount of data is necessary to glean useful business insights from machine learning in the first place?

“It’s where a lot of people fall down. The notion that it’s not big data is really what I want to emphasize here. It’s all about quality over quantity, and what I mean by that is the primary types of data that are most valuable to companies in the long run is rich media,” says Victoroff.

When people think about big data, Slater says, there’s a tendency to focus on collecting lots of data and fitting it into the right format – taking customer support logs and fitting that information into a spreadsheet, for example. But massive amounts of tightly organized data aren’t everything. “When you look at the largest amount of value possible it’s really coming from rich media, which is text, which is audio, which is video, and the more of that data you can keep with useful metadata associated, I think the more value you’re going to get,” says Slater.

Victoroff founded Indico, a machine learning company offering sentiment analysis application programming interfaces (APIs) that use algorithms to analyze a range of text. One of the most common blunders that Victoroff has seen businesses make is relying on clickstream data, which can run to terabytes; in Slater’s experience, the value garnered from this data is negligible.

“It’s a perfect example that people are optimizing for types of data which are as large as possible to give themselves effective bragging rights without focusing on what would be deeply valuable,” he says. Storing all of a user’s clickstreams throughout their lifecycle has far less value than focusing on the data available through an associated social media account, which has a fraction of the data (such as Facebook posts), but which Slater believes offers far more insight than any mass of text in a decontextualized spreadsheet.

Dissecting Rich Media with Machine Learning

How should businesses decide what to track in the first place? Victoroff believes it starts with going back to the basics.

“There’s a big assumption that people have that for some reason because you’re dealing with data science and machine learning, you need to approach the problem fundamentally differently than you would as a human being. People assume that computers can’t deal with the same kind of information that people can and assume they can’t step through the same processes, but that assumption means that you’re trying to do an apples and oranges comparison,” he says.

For example, many eCommerce businesses share the goal of maximizing product sales, and sending out accurate recommendations is a key facet of this business model. A human’s first goal is likely centered on better understanding the buying customers and the products in their marketplace. They might come up with the idea of gleaning valuable information from a customer’s reviews of previous products or from their browsing history. On the product side, a person might try to find out more about designs and assess the factors driving buying decisions.

There’s a logical progression to finding out this information, says Slater. In a perfect world, we can get social media login information for our users. A fashion company, for example, might check out a customer’s Instagram account, look at the items that they’ve liked (whether from their store or others), then select similar items from their store and use this information to drive recommendations.

“In my mind that’s a very intuitive, a very human way of thinking about it, and that’s exactly the way you should approach it with machine-learning perspective, but what I see typically is people faced with exactly that sort of problem (driving recommendations), instead of thinking about it like every other problem, say let’s instead find the biggest amount of data that we can,” says Victoroff.

A company might collect 500 gigabytes of clickstream data listing what products people have looked at, log it, and forget about it. By the time the data is pulled up, people are left with a mass of information that has been taken out of context and is essentially useless. Slater emphasizes that the assumption that computers can’t look at the same information as humans (such as social media posts) and draw similar conclusions without a ton of data is false. “Getting out of that mindset that computers somehow deal with these problems in a fundamentally different way than people do, I think that’s really the first thing to go.”

But Victoroff notes that while there are some eCommerce sites using this strategy, a lot of big eCommerce companies, including Amazon, are not (to the best of his knowledge) currently linking social media accounts to their buyer profiles and using machine learning to analyze people’s posts and comments. “If they talk frequently on Facebook about running and cooking, that’s a product recommendation right there on a fundamental level and potentially you’re allowed to show them things that then you wouldn’t have insight into otherwise,” says Victoroff.

If you understand that someone is a cooking fanatic, just a tiny granule of information you can get from their profile, then you can provide better recommendations. This approach speaks to the evolving marketing trend of helping people by recommending products or services that fit their needs, even if they’re not actively looking through search queries.
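As a rough illustration of the idea above, the sketch below matches interests mentioned in a customer's posts against product tags. The keyword lists, the catalog, and the function names are all hypothetical, and simple keyword matching stands in for the machine learning a real system would use.

```python
from collections import Counter

# Hypothetical interest keywords; a real system would learn these from data.
INTEREST_KEYWORDS = {
    "running": {"run", "running", "marathon", "jog"},
    "cooking": {"cook", "cooking", "recipe", "baking"},
}

# Hypothetical product catalog, each product tagged with one interest.
CATALOG = {
    "trail shoes": "running",
    "cast-iron skillet": "cooking",
    "desk lamp": "home office",
}

def infer_interests(posts, min_mentions=2):
    """Return interests mentioned in at least `min_mentions` posts."""
    counts = Counter()
    for post in posts:
        words = {w.strip(".,!?").lower() for w in post.split()}
        for interest, keywords in INTEREST_KEYWORDS.items():
            if words & keywords:
                counts[interest] += 1
    return {interest for interest, n in counts.items() if n >= min_mentions}

def recommend(posts):
    """Return catalog products whose tag matches an inferred interest."""
    interests = infer_interests(posts)
    return sorted(p for p, tag in CATALOG.items() if tag in interests)

posts = [
    "Finished my first marathon today!",
    "Trying a new recipe tonight.",
    "Morning run before work.",
    "Baking bread this weekend.",
]
print(recommend(posts))
```

Requiring at least two mentions is one simple way to reflect the "talk frequently" condition in the quote; the right cutoff would depend on the data.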

Potentials of Machine Learning in Business Applications

Victoroff and his team have seen positive outcomes by using machine learning in business with smaller and more immediate sets of data, including social media information. Indico has helped pick up on actionable and relevant trends in the area of customer support, for example. “We did some early experiments and found some really interesting results…a lot of people on the customer support side are interested in getting through tickets as quickly as possible…it’s a great problem, it’s something that absolutely needs to be solved,” says Slater. He notes that there are many companies that do this well, but his company decided to look at the level beyond, which was taking the customer support data and extracting real, actionable feedback on the product.

When the Indico algorithm analyzed just the text data in customer support records, it was able to recognize things like product features broken by a new release or users unhappy with a particular aspect of a product. Sentiment around features changes over time, and those mentioned frequently were correlated with a significant drop in satisfaction rate.
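One way to picture this kind of analysis is to average sentiment per feature for each release and flag features whose average falls sharply. The sketch below is only an illustration: the ticket data, the 0-to-1 sentiment scale, and the `min_drop` cutoff are assumptions, and the scores would in practice come from a sentiment model rather than being typed in by hand.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical support tickets: (release, feature mentioned, sentiment score 0..1).
tickets = [
    ("1.0", "export", 0.8),
    ("1.0", "export", 0.7),
    ("1.0", "search", 0.6),
    ("1.1", "export", 0.3),
    ("1.1", "export", 0.2),
    ("1.1", "search", 0.6),
]

def mean_sentiment_by_release(tickets):
    """Map (feature, release) to the mean sentiment of tickets mentioning it."""
    grouped = defaultdict(list)
    for release, feature, score in tickets:
        grouped[(feature, release)].append(score)
    return {key: mean(scores) for key, scores in grouped.items()}

def features_that_dropped(tickets, before, after, min_drop=0.2):
    """List features whose mean sentiment fell by at least `min_drop`."""
    means = mean_sentiment_by_release(tickets)
    features = {feature for feature, _ in means}
    flagged = []
    for feature in sorted(features):
        old = means.get((feature, before))
        new = means.get((feature, after))
        if old is not None and new is not None and old - new >= min_drop:
            flagged.append(feature)
    return flagged

print(features_that_dropped(tickets, "1.0", "1.1"))
```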

They were also able to recognize the things about which a specific customer might be upset, providing a kind of early intervention in some cases by picking up on text clues.

“If the last three interchanges have all been a little bit negative, that’s a strong signal of a customer potentially leaving the service…this is looking through data that’s already there, people already have access to but they largely ignore, because even though the first thing a person would do if they were trying to innovate on a product is look through a customer’s support record, there’s this assumption that computers can’t do the same job,” says Slater.
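The "last three interchanges" signal Slater describes can be sketched in a few lines. The sentiment scores here are assumed to come from some sentiment model that returns values between 0 (negative) and 1 (positive); the threshold and the function name are illustrative, not Indico's method.

```python
def at_risk(sentiment_scores, window=3, threshold=0.5):
    """Return True if the most recent `window` interactions all score below `threshold`."""
    if len(sentiment_scores) < window:
        return False  # not enough history to judge
    recent = sentiment_scores[-window:]
    return all(score < threshold for score in recent)

# Hypothetical sentiment history for one customer, oldest interaction first.
history = [0.8, 0.7, 0.45, 0.4, 0.35]
print(at_risk(history))
```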

There seems to be potentially immense value in a computer that can take customer intervention to the next level by picking up early on keywords and associated sentiments, correlated with customer satisfaction and a particular product change, that might only become clear to humans once a certain number of service tickets have stacked up.

What are the things that Victoroff would warn people against when leveraging machine learning for their business? One is the issue of collecting data without keeping useful associated metadata. An eCommerce site that wants to optimize for sales needs to keep its sales data, and a lot of people don’t realize this point, says Slater. “Whatever your highest level business objective is, you want to make sure that that’s somewhere in the data that you’re storing,” says Slater.

A company might want to do something with images and try to store all of them, but if it isn’t also collecting metadata that ties each image to traffic, sales, or whatever the high-level objective is, the collection is basically useless. Victoroff emphasizes that if you don’t store this information when it happens, it is often impossible to go back and recover it.
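A minimal sketch of this advice, assuming sales is the business objective: each stored image carries the outcome alongside it, and records without that outcome are filtered out before any modeling. The record fields, file paths, and function name are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ImageRecord:
    image_path: str
    product_id: str
    recorded_at: datetime
    page_views: Optional[int] = None
    units_sold: Optional[int] = None  # the business objective in this example

def usable_for_sales_model(record: ImageRecord) -> bool:
    """A record only helps optimize sales if the sales figure was stored with it."""
    return record.units_sold is not None

now = datetime.now(timezone.utc)
records = [
    ImageRecord("img/red-dress.jpg", "sku-1", now, page_views=1200, units_sold=37),
    ImageRecord("img/blue-coat.jpg", "sku-2", now),  # image kept, outcome lost
]
usable = [r.image_path for r in records if usable_for_sales_model(r)]
print(usable)
```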

A lot of people also assume they need a custom solution for their data.

“Pretty much nobody’s data is a snowflake, 9 times out of 10 you’d be just as well going with some off-the-shelf solution and the attempt to specifically customize something for your application will actually hurt you more than help you,” says Slater.

If a vendor has a solution ready to help optimize images for sales, for example, it’s probably a good fit for your business’s needs.

Victoroff even goes so far as to state that if a vendor is promoting a custom solution, it might be a way to conceal what’s called “overfitting”, i.e. masking the weakness or narrow problem domain of the technology with which they’re working. The amount of data that most businesses bring is small in the machine learning world, and while you could try to stretch your data set out, the delta that most businesses are going to get from a customized solution is negligible, especially as a first step.

Image credit: generalassemb.ly

 
