I hope that by the end of this episode of the AI in Industry podcast, you’ll not only be able to hire better data scientists who will be a fit for your business problems and build better data science teams, but also pick the AI applications and use cases that you should bring into your business versus those that you shouldn’t.
I spend a lot of time being the business voice who talks to the technical people and then bridges those ideas across to the business folks of the world who are interested in artificial intelligence. In this episode, I grab somebody from the other side of that divide: someone with a formal, master's degree-level focus on machine learning, who applies these technical skills in business and has been forced to learn how to speak business.
This episode, we interview Brooke Wenig, the machine learning practice lead at Databricks. Databricks was founded by the folks who created Apache Spark. Those of you who are technically savvy with AI will be familiar with Apache Spark, an open source framework for distributed computing and large-scale machine learning.
Through her work at Databricks, which is now close to 700 people, Wenig works with a lot of companies, helping implement AI applications in often large enterprise environments. Wenig speaks with us this week about what to look for in an actual data scientist and how to find data science folks with the right skills to communicate with business people, not just to work with models. What should people be capable of, and how should they be capable of thinking? Hopefully, some of you will have better interview questions by the end of this podcast.
In addition, we ask Brooke about the value of following the cutting-edge applications of AI and looking at what's working in industry. How does that help us make better decisions in our own businesses?
It's not just about opening our minds to more possibilities for AI; it's also about making better decisions on which AI projects to adopt. So not just seeing new shiny things, but actually making smarter calls. Brooke has a pretty educated perspective on this topic, having seen the inside of a lot of different businesses, and I thought she brought some good insights to bear.
This episode is brought to you by Databricks. Databricks is hosting its annual Spark+AI Summit in San Francisco from April 23 to 25, an event intended to bring technical folks and business leaders together around the cutting-edge applications of artificial intelligence today and how to bring them to life.
Subscribe to our AI in Industry Podcast with your favorite podcast service:
Expertise: Apache Spark, enterprise adoption of AI
Brief Recognition: Wenig earned her MS in Computer Science from UCLA. Prior to joining Databricks, she interned at Google, MyFitnessPal, and Splunk.
(03:00) How do you like to conceptualize and explain what a data scientist is when you’re speaking to the enterprise?
BW: A data scientist is someone who has business-level domain knowledge of what the data is, can build predictive models on it, and can communicate those models and their importance to business leaders to ultimately drive decisions for that company. It is very much a cross-functional, interdisciplinary role, involving skills in engineering, math, and statistics, as well as communication.
(03:30) What do you mean when you say communication?
BW: The first step is getting buy-in from the business leaders. For example, I know a lot of companies struggle to even get data science teams because they don't understand the importance of data science. Once they've got buy-in, now they need to build models that will actually drive business decisions. There's no point in me spending hours every day mucking around with scikit-learn and Keras to build models that nobody's actually going to use at the end of the day, so that's why communication is so important: so their work can be adopted by the rest of the company.
Another aspect of why communication is very important is to work with other teams and understand what the data even means. If a data scientist doesn’t understand their data, they’re not going to make a model that is best-fitted for that data.
The first question I’d ask before I even start looking at the data is, “what is the business problem that you’re trying to solve?” Is it demand forecasting; is it resource forecasting? Then when we talk about the problem we’re trying to solve, can machine learning even solve it, or could you simply solve that problem with better technology?
So as part of that, the very first thing I do when I start a machine learning problem is understand, "what is a baseline metric?" For example, if I always predict the average, what would my accuracy be in the case of fraud detection, or what would my precision and my recall be? So establishing what success looks like is the most important thing.
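Wenig's baseline idea can be sketched in plain Python. This is an illustrative toy example, not code from the episode: the labels and class balance are made up, and a naive "always predict the majority class" baseline stands in for "always predict the average."

```python
# Toy imbalanced "fraud" labels: 1 = fraud, 0 = legitimate.
# The class balance here (5% fraud) is invented for illustration.
y_true = [0] * 95 + [1] * 5

# The naive baseline: always predict the majority class (no fraud).
y_pred = [0] * len(y_true)

# Accuracy looks impressive on imbalanced data...
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# ...but precision and recall on the fraud class expose the problem.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
pred_pos = sum(p == 1 for p in y_pred)
actual_pos = sum(t == 1 for t in y_true)

precision = true_pos / pred_pos if pred_pos else 0.0
recall = true_pos / actual_pos if actual_pos else 0.0

print(accuracy, precision, recall)  # 0.95 0.0 0.0
```

A real model only counts as a success if it clearly beats numbers like these, which is why establishing the baseline before modeling matters.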
Then, after we establish success, we can start looking at the data and see what constraints are in play. For example, we never have more than 200 of [something]. Hey, what are these null values doing here? Sometimes they can be corrupted data, but sometimes they can be an indicator of the thing you're trying to predict at the end.
(10:00) Why do you think that we see so much of that hands-on educational stuff from vendor companies to enterprises?
BW: Yeah, I think it very much depends on the company. Databricks is built on a very technical concept, which is Apache Spark, and it isn't a technology that many people are familiar with when they're in school. For example, I went to UCLA, and I only had one class taught using Apache Spark, and that's because my advisor had helped create MLlib, which is the machine learning library for Spark, back when he was a postdoc at the AMPLab at UC Berkeley.
So with these new technologies, it's very hard to find people fresh out of school who have learned them, and once people are already in industry, they're a lot more hesitant to adopt new technologies they didn't learn in school. These technologies are unfamiliar, and they can have a very steep learning curve.
That’s why with these vendor companies, they have to come in and provide resources, not just for training, but also consulting and implementation, as well as educating the team members about what the technology is and when and where you should use it. Spark is fantastic, but it shouldn’t be used everywhere.
(11:30) Do you think that these very hands-on wings of vendor companies are going to shrink in the next five years as more of these companies are familiar with the lingo, have data scientists on board?
BW: I actually think it will increase rather than decrease. Throughout the US, there’s a huge shortage of qualified data scientists, so because of that and the fact that a lot of companies aren’t making the most out of data-driven decisions, those two paired together, I think, will drive a lot more need, actually, for help from vendor companies on “what is AI” and “what is ML.”
A lot of it is hype, and so understanding what can actually be done to solve the business problem versus what is hype, I think, is very important.
(14:30) Why is it important to know what's working now in industry? What do you see as the primary value there?
BW: I think it’s very important to stay on the cutting edge of what’s happening in industry, and to some extent, research. The reason why I say industry is because there are tons of new open source projects out there. You can find them on GitHub, but if they have very few active developers, then that project might not be maintained, and then you rely on something that’s no longer going to be actively developed.
So I think it's very important to stay current with what's happening in industry. If you're competing for the same space, for example, with online retail, and your model is 10% better, that could translate into millions or billions of additional dollars of revenue. So it's about understanding what is out there and understanding the trade-offs of, "what happens if I switch to using this new approach? Is that two years of dev time, or two weeks of dev time?" The investment effort, both in terms of human hours and in terms of cost, is very important.
At Databricks, we used an open source platform called Horovod, which Uber had open sourced, because Uber was using Horovod in production for distributed training of their deep learning models, and we wanted to use a framework that was already known and tested by one very reputable company.
The other thing that I would add to that is understanding the data that these different companies have applied those models to. For example, if you are trying to solve some NLP problem and your text is very different from what the model was trained on (for example, the model was trained on Twitter data, which has a character limit, I think, of 140 characters), you can't directly take that model and apply it to, say, your Amazon reviews dataset, which might have much longer reviews.
(18:00) My guess is, Brooke, if you want to understand that transferability, the people looking at these new algorithms should probably be either the unicorn who can do both, or a data scientist paired with a purely business-functional person, because it sounds like either one might be able to find a reason why this doesn't transfer to a company, right?
BW: Yeah, definitely. As a data scientist, I'm often paired up with a data engineer when we start a project with a customer. They'll help do the data preparation, I'll build the model, and then together we work on the deployment considerations, because as you said: do we have the compute resources needed? Does this need to work in a streaming environment, or with batch pre-processing?
Having all of the different personas in the room helps you make a much better decision than having a single person work on the project for six months, only to realize at the end that the solution isn't suitable given the deployment considerations.
This article was sponsored by Databricks, and was written, edited and published in alignment with our transparent Emerj sponsored content guidelines. Learn more about reaching our AI-focused executive audience on our Emerj advertising page.
Header Image Credit: Resolution Media