Episode Summary: What does it mean to tune an algorithm, how does it matter in a business context, and what are the approaches being developed today when it comes to tuning algorithms? This week’s guest helps us answer these questions and more. CEO and Co-Founder Scott Clark of SigOpt takes time to explain the dynamics of tuning machine learning algorithms, goes into some of the cutting-edge methods for getting tuning done, and shares advice on how businesses using machine learning algorithms can continue to refine and adjust their parameters in order to glean greater results.
Expertise: Optimal learning techniques
Brief Recognition: Scott Clark is co-founder and CEO of SigOpt, a SaaS startup for tuning complex systems and machine learning models. Before SigOpt, Clark worked on the Ad Targeting team at Yelp leading the charge on academic research and outreach with projects like the Yelp Dataset Challenge and open sourcing MOE. He holds a PhD in Applied Mathematics; an MS in Computer Science from Cornell University; and BS degrees in Mathematics, Physics, and Computational Physics from Oregon State University. Clark was chosen as one of Forbes’ 30 under 30 in 2016.
Current Affiliations: Cofounder and CEO of SigOpt
1 – Every problem is sufficiently unique.
Even one specific business use case might look different than another, and a model that might work well for one company might not perform well for another, even within the same industry. Clark recommends opening up to and exploring different algorithmic options for a given problem, which is now easier to do thanks to better integration of tools across open-source platforms.
The following is a condensed version of the full audio interview, which is available in the above links on Emerj’s SoundCloud and iTunes stations.
(2:00) How do you, in layman’s terms, define what tuning is in ML and get those with perhaps a business background to understand the concept?
Scott Clark: Every machine learning (ML) model is attempting to take a bunch of data about something…and learn some structure or some results about that data in order to make some decision or prediction…all of these ML algorithms have these tunable parameters that affect how they go about learning these structures or rules, called hyper-parameters or parameters of the model itself. It could be something like the number of trees within a decision tree-based algorithm, the number of layers in a deep learning system, but they tend to crop up in any machine learning algorithm, and they have a huge effect on how quickly and accurately these ML systems are able to extract that information and provide this model, which hopefully in turn provides business value.
(3:20) Maybe there’s a business problem where you could make an analogy as to what those parameters might be in a real-world example.
SC: They tend to crop up all over the place: the number of trees in a random forest, the kernel of a support vector machine, the learning rate of a deep learning system. But it could also be features and parameters in how you’re actually digesting the data into the ML algorithms; you could imagine in an algorithmic stock trading strategy, you may want to look at some trailing average of a specific equity and have that be fed into your deep learning system; should I look at the last 5 days, 7 days, 10 days? This is sort of a tunable knob that you have as you’re building up this larger architecture.
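The trailing-average window Clark describes can be treated as one more tunable setting on the data pipeline. Here is a minimal sketch, assuming a made-up price series and a hypothetical helper named trailing_average; only the window lengths of 5, 7, and 10 days come from the interview.

```python
def trailing_average(prices, window):
    """Mean of the last `window` values at each position that has a full window."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(prices[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(prices))
    ]


# Made-up closing prices, for illustration only.
prices = [10.0, 11.0, 12.0, 11.0, 13.0, 14.0, 13.0, 15.0, 16.0, 15.0, 17.0, 18.0]

# The window length is a setting on the feature pipeline, not on the model itself.
features_by_window = {w: trailing_average(prices, w) for w in (5, 7, 10)}

for window, feature in features_by_window.items():
    print(window, [round(v, 2) for v in feature])
```

Each window length produces a different input feature, so the choice of window would be tuned alongside the model's own hyper-parameters.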
(5:55) Is this border-line infinite in how many parameters there are to tweak?
SC: This is where domain expertise plays an incredible role; typically ML models have from 1 or 2 for something simple like a random forest, up to maybe a few dozen tunable hyper-parameters in a complex TensorFlow pipeline. That tends to be relatively well-bounded and defined by the individual algorithm that you’ve decided to apply to your problem, but these higher-level hyper-parameters can kind of grow arbitrarily large, but the domain expertise is usually including that rolling average of that equity for a specific reason…making that decision and doing that feature engineering is incredibly domain-specific, and for now humans tend to be the best at doing this type of problem. Actually finding the correct knobs and levers and settings for all of these individual parameters is something that computers and machines are better at. Humans are great at being creative, but they’re terrible at doing 20-dimensional optimization in their head.
(8:08) Is it normally people (or machines) making those modular tweaks…in most ML applications and models today?
SC: There’s a couple of different standard approaches that people tend to take. One is doing that very expensive manual fine tuning; once again it requires that expert to be in the loop…other approaches that people tend to take and are kind of advocating across TensorFlow or all the popular libraries is an exhaustive grid search, where you basically lay down a grid of possible configurations and try them all and then at the very end, see which one performed the best; this is extremely time-consuming and expensive…not necessarily the most efficient way to get to the peak…it turns out this subfield of operations research called optimal learning or Bayesian optimization…is a really efficient way to sample these time-consuming, expensive algorithms.
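The exhaustive grid search described above can be sketched in a few lines. This is an illustration under stated assumptions: validation_score is a hypothetical stand-in for training and evaluating a real model, and the parameter names and grid values are made up.

```python
import itertools


def validation_score(n_trees, learning_rate, max_depth):
    """Hypothetical stand-in for training a model and measuring its accuracy.

    In practice each call could take minutes or hours; here it is a made-up
    smooth function that peaks at n_trees=200, learning_rate=0.1, max_depth=6.
    """
    return (
        1.0
        - ((n_trees - 200) / 400) ** 2
        - (learning_rate - 0.1) ** 2
        - ((max_depth - 6) / 10) ** 2
    )


grid = {
    "n_trees": [50, 100, 200, 400],
    "learning_rate": [0.01, 0.1, 0.3],
    "max_depth": [2, 4, 6, 8],
}

# Every combination is evaluated, so the cost is the product of the list sizes.
names = list(grid)
configs = [dict(zip(names, values)) for values in itertools.product(*grid.values())]

best = max(configs, key=lambda cfg: validation_score(**cfg))
print(len(configs), "evaluations; best:", best)
```

Three parameters with a handful of values each already require 48 evaluations, and every added parameter multiplies that count, which is why the approach becomes expensive when a single evaluation is slow.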
(10:49) Go into a little bit of this optimization technology and where this is headed in making a difference.
SC: The whole concept behind optimal learning is making the most intelligent trade-off you can between exploration, which is learning more about how all these parameters interact…and exploitation, which is leveraging that historical information about how previous configurations have performed to really maximize that objective you’re shooting for; so, by trading off this exploration and exploitation, you’re able to extremely efficiently sample this otherwise intractably large space and find these global optima.
(12:29) With Yelp and advertising (for example), maybe you want to tinker with new ways to step up the advertising algorithms to garner higher revenue and deliver your results, but you don’t want to do so much of that experimenting that you lose track of your proven winners that are going to pay payroll.
SC: Definitely, and the idea behind this is these large ML pipelines tend to be very time-consuming and expensive when you evaluate different configurations. In that kind of classic multi-armed bandit analogy that you were drawing on before with the slot machines, you can just go up to a slot machine, pull the lever and you immediately see your output, but what if pulling that lever entailed spinning up a giant AWS cluster and then 24 hours later, you get back some accuracy metric; you would need to be very efficient in which levers you pull…
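The slot-machine analogy can be made concrete with a simple bandit strategy. The sketch below uses epsilon-greedy, which is a swapped-in textbook technique rather than anything the interview attributes to SigOpt; the payout probabilities, the pull budget, and the function name epsilon_greedy are all assumptions for illustration.

```python
import random


def epsilon_greedy(true_payouts, pulls, epsilon, seed=0):
    """Pull slot-machine arms, exploring with probability `epsilon`."""
    rng = random.Random(seed)
    counts = [0] * len(true_payouts)   # how often each arm was pulled
    means = [0.0] * len(true_payouts)  # running average reward per arm
    total = 0.0
    for _ in range(pulls):
        if rng.random() < epsilon:
            arm = rng.randrange(len(true_payouts))               # explore
        else:
            arm = max(range(len(means)), key=means.__getitem__)  # exploit
        reward = 1.0 if rng.random() < true_payouts[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]        # incremental mean
        total += reward
    return counts, means, total


# Made-up payout probabilities; the strategy never sees them directly.
counts, means, total = epsilon_greedy([0.2, 0.5, 0.8], pulls=2000, epsilon=0.1)
print("pulls per arm:", counts)
print("estimated payouts:", [round(m, 2) for m in means])
```

When each pull is as cheap as it is here, a crude strategy like this is affordable. When a pull means a day of cluster time, the budget shrinks from thousands of pulls to a few dozen, which motivates the model-based methods discussed next.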
(14:32) What are the fundamentals of this science of optimization…where are these new approaches taking us now?
SC: A lot of the methods applied revolve around this path of doing sequential model-based optimization. I’ll break that down: sequential in the sense that you’re leveraging historical information in order to make your next decision; model-based in the sense that you’re taking that historical information…and you’re trying to build up a model of how all these parameters interact and what response surface you can expect in un-sampled areas; leverage that model in order to try and find the points that have the highest probability of improvement…you sample those and then repeat the process, you update the model, which allows you to make a next decision, and you repeat. What we’re building at SigOpt is an ensemble of all these types of models, but in general the science allows you to take a specific model, like the Gaussian process, a specific action model like expected improvement, and go through this optimization loop to most efficiently find the best configuration to these complex systems.
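The loop Clark outlines (fit a model to past results, pick the most promising untried point, evaluate it, repeat) can be sketched for a single parameter. This is a toy illustration and not SigOpt's implementation: it assumes a zero-mean Gaussian process with a fixed squared-exponential kernel, uses expected improvement (usually called an acquisition function in the Bayesian optimization literature) to choose the next point, and stands in a made-up objective function for the expensive evaluation. All function names are hypothetical.

```python
import math


def rbf(a, b, length_scale=0.2):
    """Squared-exponential kernel; the length scale is an assumed constant."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def gp_posterior(xs, ys, x_new, jitter=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = [[rbf(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    mean = sum(k * w for k, w in zip(k_star, solve(K, ys)))
    var = rbf(x_new, x_new) - sum(k * v for k, v in zip(k_star, solve(K, k_star)))
    return mean, math.sqrt(max(var, 1e-12))


def expected_improvement(mean, std, best):
    """Expected amount by which a new sample would beat `best` (maximization)."""
    z = (mean - best) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best) * cdf + std * pdf


def objective(x):
    """Made-up stand-in for an expensive evaluation; peaks at x = 0.7."""
    return -((x - 0.7) ** 2)


candidates = [i / 100 for i in range(101)]
xs = [0.0, 0.5, 1.0]                      # initial configurations
ys = [objective(x) for x in xs]

for _ in range(10):                       # sequential: one evaluation per step
    best = max(ys)
    untried = [x for x in candidates if x not in xs]
    # Model-based: score each untried point using the GP fitted to the history.
    next_x = max(untried,
                 key=lambda x: expected_improvement(*gp_posterior(xs, ys, x), best))
    xs.append(next_x)
    ys.append(objective(next_x))

best_x = xs[ys.index(max(ys))]
print("best x found:", best_x, "after", len(xs), "evaluations")
```

The sketch refits the model from scratch for every candidate, which is wasteful but keeps the code short. The point is the shape of the loop: only 13 evaluations of the objective are spent, compared with 101 for trying every candidate.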
(18:40) Where do you often see the easiest swings of improving the tuning and improving the results…where do you often see that low-swinging opportunity here?
SC: I think there’s two areas where you can make immediate impact: one is by doing this hyper-parameter tuning or configuration tuning of these models, but beyond that, usually what we find when we work with various firms is that people have a specific tool that they really enjoy using, whether that’s something like logistic regression or a random forest, and a lot of these machine learning frameworks, something like scikit-learn or TensorFlow, provide the exact same interface between tools so that you can very readily try different models.
If the only thing in your tool kit is a hammer, then everything looks like a nail, but maybe the screwdriver might work better in certain applications. So one thing I would recommend people do is instead of just applying their favorite tool to every job…maybe try an ensemble approach or go ahead and try that gradient-boosted method…and you may see a significant uptick by just upgrading the sophistication of your models.
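The shared interface Clark mentions makes this kind of comparison short to write. Below is a minimal sketch using scikit-learn, assuming synthetic data from make_classification and default or near-default settings; the dataset size, the model choices, and the 5-fold cross-validation are illustrative assumptions, not recommendations from the interview.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Each estimator exposes the same fit/predict interface, so one loop covers all.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```

Which model comes out ahead depends on the data, which is the point of the first takeaway: the comparison is cheap enough that it is worth running before settling on a favorite tool.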