Have you ever read a blog post or a whitepaper and seen the terms “data science” or “predictive analytics” used in ways that aren’t quite right? As it turns out, terms like these are often used incorrectly, but by the end of this episode of the AI in Industry podcast, you’ll have greater clarity about five key terms in AI and data science that are sometimes overused in conversations about AI in the enterprise.
This week, we interview German Sanchis-Trilles. He holds a PhD in Computer Science from Valencia Polytech in Spain, and he focused his PhD on natural language processing. In addition, he is the founder of Sciling, a machine learning consultancy, and he is a research advisor at Emerj.
Sanchis-Trilles runs through five AI and data science terms that business leaders often struggle to speak about correctly in conversations with AI vendors and in-house data scientists. Learning the ins and outs of these terms will be critical in the coming years as AI and data science become integral parts of IT infrastructure in the enterprise; leaders will need to know how to “talk shop” with their data science hires or else they risk losing them to tech giants like Google and Facebook, where everyone knows how to speak with them effectively.
Subscribe to our AI in Industry Podcast with your favorite podcast service:
Expertise: machine learning, natural language processing, AI enterprise adoption
Brief Recognition: Sanchis-Trilles holds a PhD in Computer Science from Valencia Polytech in Spain, for which he studied natural language processing. He is the founder of Sciling, a machine learning consultancy, and a research advisor at Emerj.
1. Data Science
(02:30) How would you explain what is and isn’t “data science” to business leaders?
Sanchis-Trilles: “Data science” is perhaps the term that is the most difficult to frame because it’s a very broad term that has been discussed quite often lately, but in fact, data science refers to everything which has to do with data. So, it’s a very general term that basically…that feeds from different fields, such as statistics, data analysis, machine learning, and everything which is related to data.
[Related terms include] big data, data mining. They are terms that are very ambiguous, very generic, and that try to refer to everything which is done with data, but everything that is related to data would come into data science.
(04:00) Does that mean that someone who’s referring to a machine learning engineer or a data analyst can say that they are both doing data science? Is the person who’s doing the feature engineering, perhaps just a subject matter expert, also doing data science? Where do we draw that line if we can?
Sanchis-Trilles: Well, I would say they are doing some data science because feature engineering is definitely part of data science, but…the line would be in terms of, there is this subject matter expert who is using his own knowledge, his brain, to [participate in feature engineering], but he is not using the data [to do so]. It’s the data scientist who is trying to take this knowledge from the subject matter expert and apply it to the data that he is dealing with.
(05:15) So essentially, the “doing” of data science is in dealing with the data and code. Feature engineering could be part of the data science process, but we might not say a subject matter expert is “doing” data science because they aren’t dealing with the data.
Sanchis-Trilles: Yeah, I think that’s a good way to understand that. [Also,] when we work with big data, it’s important that it’s big. So if you work in Excel with a couple of tens of [data points], then that’s probably not big data, but you might still be doing some data science. Machine learning is the part of data science or of big data that tries to extract some patterns out of the data that we are using.
(07:00) Is manually feeding, training, iterating with, and tinkering with the algorithm machine learning whereas just working with the data in more of a hands-on way without touching the algorithm yet might just be doing data science? Is that a proper line to draw or would you draw it differently?
Sanchis-Trilles: I think that’s pretty accurate. I imagine doing data science by just drawing a couple of plots, looking at how the data is distributed, trying to fit [for a use-case such as] trying to explore which is the maximum amount of product someone has purchased…that would still be data science, but it’s definitely not machine learning.
(08:15) So, the training of the actual algorithm would be where we would get down to machine learning.
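To make the distinction concrete, here is a minimal sketch of the kind of exploration described above: answering “what is the maximum amount of product someone has purchased” and drawing a crude plot, with no model being trained. The purchase records and customer names are hypothetical, invented purely for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical purchase records: (customer, units purchased).
purchases = [("ana", 3), ("ben", 12), ("ana", 7), ("cho", 1), ("ben", 5)]

quantities = [units for _, units in purchases]

# Descriptive questions only: nothing is learned, the data is just summarized.
largest_purchase = max(quantities)
average_purchase = mean(quantities)

# Total units per customer, used for a crude text "plot".
totals = Counter()
for customer, units in purchases:
    totals[customer] += units


def text_histogram(counts):
    """Return one bar of '#' characters per customer, sorted by name."""
    return [f"{name:<4}{'#' * total}" for name, total in sorted(counts.items())]


for line in text_histogram(totals):
    print(line)
```

Everything here is data science in the broad sense used in the interview, but none of it is machine learning, because no algorithm is being fitted to the data.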
2. Predictive Analytics
(08:30) Where do you draw the line as to predictive analytics versus normal analytics?
Sanchis-Trilles: So, predictive analytics is indeed another term which is quite broad in terms of what it encompasses. I would say that predictive analytics is also a combination of different techniques and fields, but basically, the purpose is to attempt to predict some future event based on past historic events.
If we compare it with normal analytics, when I listen to the term analytics, the first thing that comes to my mind is Google Analytics. [In Google Analytics,] you look at the traffic of your website, perhaps the purchases of your eCommerce, looking at the data and trying to understand it. When you are doing predictive analytics, there is this predictive part in which you think of the data you have, you build something with that data, an algorithm or a model or whatever, and try to infer something that is going to happen in the future based on the past.
(10:40) If we’re going to draw a line between predictive analytics and analytics, the cleanest distinction for me is that normal analytics is just showing me my current data or my past data. Predictive analytics is telling me where things are headed, and my supposition is that this would involve the fact that all of that streaming data that we’re looking at is training some kind of an algorithm that’s helping to make those predictions to tell me whether my manufacturing machinery here is going to overheat or whether this particular set of transactions is likely to have a certain percentage of fraud within it, whatever the case may be.
Sanchis-Trilles: I think that’s a pretty good line you just drew there.
(12:15) I would suspect that predictive analytics necessitates using data to train an algorithm, and I would also suspect that data would ideally be real-time. Are those correct assumptions?
Sanchis-Trilles: I think they are with one subtlety again, which is, “What do we consider real time?” If we are dealing with events that happen once monthly, for instance, we are going to attempt to predict whether our customer is going to churn and the customer renewal is every year. So in this case, real-time means once a year.
(13:45) So we may have to cobble the data together in an imperfect way, but we do it once every month or something and we figure out who, coming up in the next six months, is more or less likely to churn and then maybe we can take actions, depending on that, and it can be perfectly valuable, perfectly predictive in terms of using an algorithm, but not real time.
(14:30) So predictive needs to mean it’s feeding a model. Traditionally with “regular” analytics, where we’re just pumping data into something visual or into a sheet somewhere, people do the predicting, right? They’re the ones that project where it’s going to go. The difference here is we have a machine that will actually continue to draw that chart or graph without human extrapolation. It’s an extension of intuition based on training an algorithm.
Sanchis-Trilles: Right and also the machine is typically able to leverage more data.
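The line drawn in this section can be sketched in a few lines of code. In the hedged example below, “normal” analytics simply reports the most recent figure, while the predictive part fits a model to past data and lets the machine, rather than a person, extend the chart. The monthly sales figures are hypothetical, and a straight-line least-squares fit is only one very simple stand-in for the models a real project might use.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


# Hypothetical history: units sold in months 1 through 5.
months = [1, 2, 3, 4, 5]
sales = [100, 110, 120, 130, 140]

# "Normal" analytics: report what has already happened.
latest = sales[-1]

# Predictive analytics: build a model from the past and ask about the future.
slope, intercept = fit_line(months, sales)


def predict(month):
    return slope * month + intercept


forecast = predict(6)
print(f"latest observed: {latest}, forecast for month 6: {forecast:.1f}")
```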
3. Deep Learning
Sanchis-Trilles: Deep learning is a subset of machine learning.
Machine learning deals basically with a bunch of models that try to model reality and try to do predictions or whatever, and deep learning is just one subset of those models in which we are trying to leverage neural networks, complex neural networks, as a model for modeling that reality.
A neural network is basically a model that was developed in the 1940s based on the way a neuron operates. When an impulse arrives through the channel, this neuron activates its output based on how strong the input is and a couple more variables.
The scientists back then tried to replicate this behavior and you have a bunch of artificial neurons that only get activated if the input or inputs have a certain level of strength. Then basically if you take that model of an artificial neuron and build it into a very complex set of neurons, you have a neural network. If you do that in a very complex way with…and you make it very sophisticated, you end up with a deep learning model. But in the end, it’s a bunch of very simple models and you have a bunch of matrix multiplications of different multiplications of weights that get estimated in a very efficient way.
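As a rough sketch of that description, the code below models an artificial neuron that only activates when its weighted input is strong enough, and then stacks a few of them into a tiny network. One loud assumption: the weights here are picked by hand so the network computes “exactly one input is on”; in a real network they would be estimated from data, which this example does not do.

```python
def neuron(inputs, weights, bias):
    """Activate (return 1) only if the weighted input is strong enough."""
    strength = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if strength > 0 else 0


def layer(inputs, weight_rows, biases):
    """Many neurons reading the same inputs: a matrix-vector product
    followed by a threshold on each row."""
    return [neuron(inputs, row, b) for row, b in zip(weight_rows, biases)]


def tiny_network(a, b):
    """Two layers of neurons with hand-picked (not learned) weights."""
    # Hidden layer: first neuron fires if either input is on,
    # second neuron fires only if both inputs are on.
    hidden = layer([a, b], [[1, 1], [1, 1]], [-0.5, -1.5])
    # Output neuron: fires if "either" is on but "both" is not.
    return neuron(hidden, [1, -1], -0.5)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", tiny_network(a, b))
```

A single neuron of this kind cannot compute this function on its own; it takes the extra layer. Deep learning models follow the same idea with far more neurons and layers.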
There is one threshold between machine learning and deep learning, which is the use of neural networks. Between neural networks and deep learning, it’s deep learning because also the different layers can be very sophisticated and very complex and you can have a model with one single layer that is considered deep learning.
(18:30) So for deep learning, I think most people tuned in, certainly myself included, presume that for the most part, deep learning is required when there is richer or more challenging data to work with, such as audio files and video and images. These are not the same as text. In text, basically all text could, so long as it’s not a picture of text, be boiled down to ones and zeroes that represent the letter “L,” the letter “T,” and it’s easy for any machine to drink those in and potentially manipulate them.
For images, we’re just looking at pixels and colors of individual pixels and that becomes so much more challenging that we need deep learning. When, generally, is the distinction drawn where we have to bring in the “big guns,” throw in more layers, leverage deep learning as opposed to a problem that a more subtle, a more modest neural network orchestration could potentially get the job done? How should business people think about when deep learning is needed?
Sanchis-Trilles: Well, you mentioned something which is actually important. In fact, when dealing with words, with texts and that kind of stuff, that’s one of the first fields where deep learning made an important breakthrough a couple of years ago. In fact, if you deal with words, the number of words in a language is not infinite, but it’s not countable either. It’s a subtle distinction, but you are not able to, say, draw up a whole list of words in a language. That’s not possible.
Where I want to go with this is that in fact language and images and these kinds of problems can actually be very similar in terms of how you deal with them. In the end, when you deal with language, you have an image of a sentence, which is a bunch of parades of different levels, which is pretty similar to an image.
If you want to use a deep learning approach or a rather simpler approach, it’s not necessarily in terms of how complex the problem is, which is obviously important because more complex approaches are able to deal with more complex problems, but if you only have two images for your training data, for creating your model, you might not want to use deep learning because a deep learning approach has a very large amount of weights, of parameters that we need to tune. If the amount of data you have is very small, then most likely, you will have an algorithm that will just memorize your data because you have more to tune than actual data.
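The point about having “more to tune than actual data” can be checked with simple arithmetic. The sketch below counts the weights and biases of a small fully connected network and compares that number with the size of the training set. The layer widths (a 28x28 input, two hidden layers, ten outputs) are hypothetical and chosen only for illustration; real deep learning models are usually much larger.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network with these layer widths."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


def more_to_tune_than_data(layer_sizes, n_training_examples):
    """True when the model has more tunable parameters than training examples."""
    return count_parameters(layer_sizes) > n_training_examples


# Hypothetical small image classifier.
sizes = [28 * 28, 128, 64, 10]
params = count_parameters(sizes)

print(f"parameters to tune: {params}")
print("two training images:", more_to_tune_than_data(sizes, 2))
print("one million training images:", more_to_tune_than_data(sizes, 1_000_000))
```

With only two training images, even this modest network has vastly more parameters than examples, which is the situation in which a model is likely to memorize its data rather than learn anything general.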
(21:30) So, is there a light distinction here that the more torrential and gargantuan our set of data, the potentially more required it’s going to be to leverage deep learning?
Sanchis-Trilles: I think that the complexity of the problem is a good way to draw a distinction because, for instance, if we are dealing with video data, but we just want to know if there is a red frame in the data, then we might not need deep learning. We might not need machine learning at all. We just need to analyze the frames of the data of the video.
So, there are video problems that can be pretty simple, but, for instance, if we want to create a deep fake video out of somebody else, then probably that’s quite a difficult problem because we need to estimate the pose, vary the way in which the image is being drawn, and so it’s a very complex problem that you need the big guns to deal with.
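The simple end of that spectrum, finding a red frame, needs nothing more than a loop over pixels. The sketch below assumes a video is already decoded into frames, each frame being rows of (red, green, blue) tuples with values from 0 to 255. The color thresholds and the 90% cutoff are arbitrary assumptions for illustration.

```python
def is_red_frame(frame, min_fraction=0.9):
    """Return True if at least min_fraction of the pixels are strongly red.

    frame is a list of rows, each row a list of (r, g, b) tuples in 0..255.
    """
    pixels = [pixel for row in frame for pixel in row]
    red_pixels = sum(1 for r, g, b in pixels if r > 200 and g < 60 and b < 60)
    return red_pixels / len(pixels) >= min_fraction


def find_red_frames(video):
    """Indices of the frames that are (almost) entirely red."""
    return [index for index, frame in enumerate(video) if is_red_frame(frame)]


# Hypothetical 2x2-pixel "video" with three frames.
RED, BLUE = (255, 0, 0), (0, 0, 255)
video = [
    [[BLUE, BLUE], [BLUE, BLUE]],
    [[RED, RED], [RED, RED]],
    [[RED, BLUE], [RED, BLUE]],
]

print(find_red_frames(video))
```

No training data and no model are involved, which is why this kind of problem does not call for machine learning at all.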
(23:30) Any other important considerations around deep learning? Maybe a use case that you think is critical to understand or another quick concept before we move on to the next topic?
Sanchis-Trilles: I think one important thing about deep learning is that we are hearing about it right now because it’s very useful right now. It has been around for quite some time, but a couple of years ago in about 2012, there was this perfect storm for deep learning to really go into the machine learning landscape as strong as it did because there were a couple of algorithmic advances in the deep learning community, but also because we had several important tool kits coming around such as TensorFlow pushed by Google.
We also had massive amounts of data stacking up in the last years, and also we started using the graphics processing units, which implied an important, a very important, computational breakthrough for these models to actually be able to come through. On my laptop with my regular CPU, a model can take several days to train, whereas on a GPU, on a graphics processing unit, a deep learning model can take minutes, so it’s several orders of magnitude sometimes, and that of course draws the line between what you can do and what you cannot do.
You can draw the line even higher. I mean some of the latest models by Google are taking, I think, a couple of hours on their machines. Their machines are TPUs, which are a further improved version of a GPU, and they had, I don’t know, something like 1,000 TPUs in half an hour, so basically that was an eternity on a regular computer, so that was unfeasible, so it also draws the line between what is feasible and what is not feasible.
4. Robotic Process Automation (RPA)
(26:15) What is RPA and what is it not?
Sanchis-Trilles: RPA basically deals with automating some tasks and of course that needs a bit more of definition. So if you have, for instance, a business process where you have one person sitting in front of the computer and copying some data from the CRM to the ERP, just copying that data, that’s something that RPA can automate, and you can basically save loads of time and loads of effort and loads of errors as well by automating that.
So basically, RPA is the automation, in an intelligent way, of some processes that are pretty mechanical from the start, and that’s where RPA actually starts to bring in artificial intelligence, in terms of learning automatically to do some tasks that the user was previously doing manually, that are very mechanical, and that are very automatable in the end.
(27:30) So it doesn’t necessarily have to be deep learning here. We’re just training a system to click this, move that, copy this, put it here and that doesn’t have to involve AI. It sounds as though maybe that’s what RPA is. Where is AI creeping its way in?
Sanchis-Trilles: I feel there are mainly two areas. One of them is automating some tasks which were difficult to automate previously. For instance, if we are trying to copy some data from an invoice into an ERP, and the invoice is a picture of an invoice, then we need there to put in place some optical character recognition technology in order to be able to do that.
The optical character recognition technology, or OCR, has been having an important breakthrough as well due to deep learning and is definitely not an easy problem even nowadays, so that’s one place where artificial intelligence, machine learning, and deep learning have a good space: in automating some tasks that are difficult per se for a computer to perform.
Another part is in actually learning from the user to perform those actions, so the user is clicking around on the screen and the system needs to be able to learn what the user is doing and of course that means “Open this window. If this window contains a red box, then do this. If it contains a green box, then do that.” Those kinds of actions that the system needs to learn are also being fed by artificial intelligence techniques and algorithms.
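The “red box, green box” idea can be illustrated with a toy example. In the sketch below, a log of what the user did in each situation is turned into rules by a simple majority vote, and those rules are then applied to a new screen. The log entries, action names, and the `box_color` field are all hypothetical, and majority voting is only a simplistic stand-in for the learning techniques a real RPA product might use.

```python
from collections import Counter, defaultdict


def learn_rules(observations):
    """Infer 'box color -> action' rules from logged user behavior.

    observations is a list of (box_color, action_taken) pairs; for each
    color the most frequently observed action becomes the rule.
    """
    votes = defaultdict(Counter)
    for box_color, action in observations:
        votes[box_color][action] += 1
    return {color: counts.most_common(1)[0][0] for color, counts in votes.items()}


def choose_action(window, rules, default="ask_the_user"):
    """Pick an action for a window described as a dict, e.g. {"box_color": "red"}."""
    return rules.get(window.get("box_color"), default)


# Hypothetical log of what a user did when each kind of box appeared.
user_log = [
    ("red", "escalate"),
    ("red", "escalate"),
    ("red", "approve"),  # an occasional inconsistency in the user's behavior
    ("green", "approve"),
    ("green", "approve"),
]

rules = learn_rules(user_log)
print(rules)
print(choose_action({"box_color": "red"}, rules))
print(choose_action({"box_color": "blue"}, rules))
```

Falling back to a default action for situations the system has never observed is a design choice made here for safety; the interview does not specify how unseen cases should be handled.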
(30:00) On the one hand, we have optical character recognition, which we could think of maybe as document digitization. Then the second thing is seeing what the user is doing, how they’re adapting, what processes they’re using to get this work done and have a machine that can learn what are the subtle rules that the human never told it. Are these the two areas where you see that blending of what was RPA with what is evolving into machine learning?
Sanchis-Trilles: Right. In the first case though what I meant is that machine learning can pick up specific tasks, not just OCR. You can also have, I don’t know, a user watching a video and if he sees an accident, a car accident, then the user needs to do something, to write it down with a timestamp or whatever. That’s something that can also potentially be automated by a deep learning system, or a machine learning system that looks at the video, detects if there’s a car crash and writes down the timestamp, so I just meant to say that there is the specific task that the system can automate, and then there is also the workflow that the system can learn to automate.
5. Natural Language Processing (NLP)
Sanchis-Trilles: Natural language processing is everything which is related to human language. If you have a system that needs to recognize what a human wrote, that’s NLP. If you have a system that tries to understand what a human said with his voice or with her voice, that’s NLP as well. If you want a system to speak and to do some speech synthesis, that’s NLP as well. If you want a system to understand the sentiment behind a tweet, that’s NLP as well, and if you want to classify an email between spam and no spam, that’s NLP as well. So everything which is related to having a machine understand something or do something with human language that would be the “definition” of NLP.
(34:00) Is the feeding of masses of data to train a system always part of NLP or can it sometimes not be?
Sanchis-Trilles: I think you’re right indeed. It’s pretty similar to the term AI. AI can encompass machine learning, but can encompass some old school handwritten rules, which often work quite well. I mean I don’t want to say that they are bad of course, and NLP is something very similar. Until very recently and even now, we have some very good machine translation systems which are based on rules, especially for languages that are very well resourced such as Spanish or English, or languages that are very similar.
Also, you can have some machine translation systems based on rules which work pretty decently, and then you have the other kind of systems which are based on large amounts of textual data that try to infer patterns from that data and try to learn to translate from the patterns it is inferring.
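The contrast between handwritten rules and systems that infer patterns from data can be sketched on a much smaller problem than machine translation: the spam or no spam task mentioned earlier. In the example below, the keyword list, the training messages, and the scoring scheme are all hypothetical; the scoring is a deliberately crude illustration, not a recommended spam filter.

```python
from collections import Counter

# Approach 1: "old school" handwritten rule.
SPAM_WORDS = {"winner", "free", "prize"}


def rule_based_is_spam(text):
    """Flag a message if it contains any word from the handwritten list."""
    return any(word in SPAM_WORDS for word in text.lower().split())


# Approach 2: infer the patterns from labeled examples.
def train_word_scores(examples):
    """Score each word: +1 per spam message containing it, -1 per non-spam one."""
    scores = Counter()
    for text, is_spam in examples:
        for word in set(text.lower().split()):
            scores[word] += 1 if is_spam else -1
    return scores


def learned_is_spam(text, scores):
    """Flag a message if its words lean toward spam overall."""
    return sum(scores[word] for word in text.lower().split()) > 0


# Hypothetical labeled messages (True means spam).
examples = [
    ("claim your free prize", True),
    ("free prize inside", True),
    ("meeting notes attached", False),
    ("lunch meeting tomorrow", False),
]
scores = train_word_scores(examples)

print(rule_based_is_spam("You are a WINNER"))
print(learned_is_spam("claim your prize now", scores))
print(learned_is_spam("meeting tomorrow", scores))
```

Both approaches count as NLP. The first encodes what a person already knows; the second only knows what its training data showed it, which is why the amount and quality of that data matter so much.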
(35:30) What are some business use cases that you’ve seen a lot of experience with that you think would be useful for people to know?
Sanchis-Trilles: Well, one of the most important and most widespread use cases is things like Alexa. Alexa is basically several NLP systems put together: a speech recognition system; you talk into it and it recognizes speech. Then, it does natural language understanding or intent parsing, which is basically understanding what the user wants from that already textual representation, and then it outputs voice, so it is doing again a speech synthesis. So you basically have three NLP systems at least together to build Alexa, which is right now a very important business use case.
You also have simpler use cases, for instance, chatbots, which today are being used quite often to automate some of the customer interaction tasks.
(37:30) Do you think NLP will be a little bit more one off in its value?
Sanchis-Trilles: I think we are looking at a transformation, but that being said, NLP is hard. Human language is hard, so even if you have two native English speakers speaking together, they might find that they don’t say things the same way and they might not understand each other at some points.
Human language is hard, so that means that the way forward in NLP is still very open. If we take up simpler problems, such as spam or no spam, those kinds of problems are relatively simple. If you try to think of a conversational interface, that starts to get harder, and you might have humans having a tough time doing some of those tasks as well, let alone a machine.
What I mean to say with this is that I do see a transformation ahead and I do think that conversational interfaces are going to have an important take, but there is still a lot of hype and these kinds of problems will not be solved in the shortest term, so you cannot expect to have a perfect conversational interface in six months.
Header Image Credit: Video Blocks