Amazon Machine Learning Exec: 3 Tips for Working with Sparse Models

Daniel Faggella

Daniel Faggella is Head of Research at Emerj. Called upon by the United Nations, World Bank, INTERPOL, and leading enterprises, Daniel is a globally sought-after expert on the competitive strategy implications of AI for business and government leaders.

Amazon Machine Learning Exec: 3 Tips for Working with Sparse Models

At the recent KDD2016 (knowledge discovery and data mining) conference in San Francisco, Managing Director at Amazon Development Center Germany GmbH and Director of Amazon Machine Learning Ralf Herbrich discussed three lessons that he’s learned while working with sparse machine learning models at scale.

Before joining Amazon, Herbirch was one of the inventors of TrueSkill ranking and matchmaking system in Xbox 360 Live (especially in the Halo video game series). He also co-invented the click-prediction technology used in Microsoft Bing’s online advertising system. In his years working with these algorithms at large scale, Herbrich shared three lessons that he believes are key to implementation in a product or service:

1 – Resource constraints matter. You have a finite amount of resources, and time is especially important. When creating an online service like Facebook (where Herbrich also worked for one year), he says, “you have to obey the maximum time, you cannot exceed when you compute…you typically have no more than 5 milliseconds, then results have to be delivered to the user,” meaning that these algorithms have under one second to generate a new view. Accuracy is important, but performance in time is make-or-break for consumer applications that must work immediately on all devices.

This means the algorithm will be constrained to a limited number, about 50 lookups per service, which can be balanced by minimizing the number of candidates you’re trying to score, says Herbrich. “You may have the most accurate algorithm, but it’s not practical if the web page cannot be run.”

Then there’s cost, which in economics 101 is considered as revenue minus cost equals profit. Continuing with the Facebook example, this translates into how much revenue you get from people paying for ads minus the cost for running Facebook and data centers. Facebook has close to 1B users visiting daily and 1500 entities that have to be computed in order to show 300 new stories for the average user. You can read our previous interview with Hussein Mehanna for more on Facebook’s machine learning efforts to personalize it’s massive user base).

This means there’s a 1.5 trillion score that needs to be computed on a single day, and if you use your entire revenue the amount that you’re allowed to spend in a single computation is 3 micro dollars. This is an important consideration when you’re thinking about using accurate algorithms but the cost is too high. If each Facebook interaction was involved directly in revenue generation, almost all costs would be offset; however, with so much activity and relatively little direct tie to Facebook’s revenue, algorithms have to be used to husband their compute resources carefully.

“Results constraints really do matter and drive the choice of algorithms…they are as important as success metrics that drive pure actives.”

2 – Scale inference algorithms. Staying with Facebook, there are 300 billion examples to train from in any single day, a task that is no longer physically viable and why it becomes necessary to resort to factor graphs. Using a single parameter for a conditional node, it’s possible to group factors that handle messages from data and those that handle messages to and from parameters and embody these in a computational service. The top groupings of factors are services that handle parameters, and anything that goes to and from changes the parameters; bottom factors handle all computation to summarize information in the data. You can then use this macro data to communicate with caches, which is useful for each data processing site. You need to know the rate of change of every parameter and revise your protocols if a data source can’t handle a particular change.

3 – Combining sparse and dense models. Sparse datasets are generated from areas involving human-computer interaction, including text and language processing and human-human-interaction (like on social networks). Nominal data is so prevalent that you usually have the user ID (category of product, hour of day, etc.) that you can use to describe and end up with the same number of values. While this is often not powerful enough to get a generalization, if you have the time of day you can bucket this data – also known as clustering  – to learn data boundaries and form segments.

If you have more than one dimension, you can do an anagram, basically pictures in a grid. You end up with a lot of squares in three dimensions, a popular approach because of its relative simplicity in description – you only need to name two categorical variables that you cross. You end up with a large number, which is a positive if you want granularity and a joint effect. On the downside, you can end up with generalizations in areas of space with not much coverage. One way to balance this outcome is to split a figure only once in the second dimension, using a decision tree based on dense data. This is a successful approach because it can be used to learn features. Each of these trees becomes a spare data generator, a path of observation.

“One of the powers of the sparse data models is that they are interpretable. If you’re making high tech decisions (in the medical, for example), then knowing what your model does is as important as having accurate data.”

— — —

Herbrich groomed the audience with a background on fundamental machine learning concepts before diving into his key takeaways. In the machine learning “flow”, physical sensory input (via keyboard, microphone, camera, etc.) is interpreted by code, which generates “three flavors” of data – raw data, unique identifiers, and remote communications.

Resulting data sets are categorized by “data field scales.” Most data analyzed today is nominal, in which numbers are used as labels (without quantitative value) for groups i.e. 1 for category A, 2 for category B, and so on. This data can then be put into graphical (probabilistic) models, specifically a factor graph, which illustrate the dependent structure (subject to change) of random variables.

Factor graphs are a popular representation for machine learning algorithms, and when the factor graph is a tree, the max- (or sum) product message passing (also called belief propagation) can produce optimal scheduling of message updates by passing messages about variables back and forth between its leaves and the root. The sum of its products becomes a product of sums of all messages from neighboring factors, providing streamlined performance insights. Herbrich described this approach as almost a blueprint for machine learning algorithms.

Subscribe