How Data Lakes Support ML in Industry – with Cloudera’s Amr Awadallah

February 26, 2017

Episode Summary: Applying machine learning (ML) in a business context takes a lot of data, and algorithms across the board perform better with more recent, rich, and relevant data. Today, there are companies whose entire business models are predicated on helping others make sense of and use this type of information, as more organizations look for the first place to apply ML. In this episode, we speak with the CTO and Co-Founder of one such company, Palo Alto-based Cloudera. Amr Awadallah, PhD, speaks with us this week about where he sees “data lakes” (or “data hubs”, Cloudera’s preferred term) and warehouses playing an important role in ML applications in business. Based on his experience helping a variety of companies in many countries set up data lakes, Awadallah distills these uses into three broad categories that apply across industries as companies look to apply ML to solve tough problems and ask more complex questions using unstructured data.

Expertise: Internet infrastructure/systems architecture, distributed systems, advanced analytics and data mining, search (as in web search), application layer protocol design

Brief Recognition: Before co-founding Cloudera in 2008, Amr Awadallah was an entrepreneur-in-residence at Accel Partners. Prior to joining Accel, he served as vice president of Product Intelligence Engineering at Yahoo!, and ran one of the very first organizations to use Hadoop for data analysis and business intelligence. Amr joined Yahoo after they acquired his first startup, VivaSmart, in July of 2000. Amr holds bachelor’s and master’s degrees in Electrical Engineering from Cairo University, Egypt, and a doctorate in Electrical Engineering from Stanford University.

Current Affiliations: CTO and Co-founder of Cloudera

Big Ideas:

1 – Data hubs enable building and scaling “segment-of-one” data models.

Personalized segmentation at scale helps correct the data world’s earlier challenge of false positives.

2 – Rich and relevant data inputs are more important than the complexity of the machine learning algorithm.

This type of big, meaningful data is helping change the way that businesses and industries operate, as well as consumer behavior.

Interview Highlights:

The following is a condensed version of the full audio interview, which is available in the above links on Emerj’s SoundCloud and iTunes stations.

(2:09) How do you folks draw a distinction between the definition of a data warehouse and this newer term that people may have heard some buzz around, the “data lake”?

Amr Awadallah: The industry overall does refer to it as data lake, we at Cloudera use the term data hub instead…but they are really the same thing…the main difference really is the enterprise warehouses were focused on SQL, SQL was their genesis, and the characteristics have two sides: one side is the data itself tends to be very structured…and the only way you can really ask questions is using the SQL language itself, so the data lake is essentially an evolution that became a natural progression of the fact that we have so many other types of data besides structured data.

…And they needed a place where they could keep all this data side by side…but then also ask the bigger questions out of that data; we wanted to be able to go beyond SQL for the problems and types of questions that cannot be answered by SQL.

(5:05) Walk us through some of those examples that may have been impossible in SQL but are possible now with these alternative ways to store information.

AA: First, I will say it’s not impossible to do in SQL, which has so many extensions…it’s just very hard, it’s not natural…if you want to do a predictive model…SQL is not really built for that…it becomes much more natural and easier to be using something like Spark or using MapReduce. So, for example, we have a large hospital that we work with and they built a system that can collect all of the signals from a (premature) baby…they analyzed all of the signals coming out of the baby; some of these signals are the temperature, the heart-rate, some of them are motion…and by analyzing these signals and using some nurse experts…they came up with a predictive model that can show, on a screen, words that explain what the baby is going through…so we’re able to put words in the baby’s mouth using the signals that are coming out of their body…

(8:35) Most people are familiar with the way the old databases are done; where are we tucking all of that information now that’s so hard to quantify in the same basic way as a transaction?

AA: To explain the question, I’d like to go to an analogy that I use to communicate the power of this new platform. I refer to this new platform, the data lake (the data hub, as we prefer to call it), as the “smartphone of big data”…the power of our platform is that it can take pictures, meaning it can do SQL, but it can do a lot more than that at the same time, which is the same with a smartphone. The power and convenience of a smartphone is…once I capture that data, meaning take a picture…I can now do many, many things with that picture; I can process the picture, I can email it, upload it to Instagram, etc., versus when you take a picture with a digital camera, that’s it.

…in other words, in this example of the neonatal intensive care, we still use SQL at the beginning to aggregate the different signals coming in…but then we can now shift and use Spark, a distributed processing [framework], to analyze these counts that came out of SQL and produce the predictive model that decides whether the baby is upset with the light or the temperature or wants to eat, so the power of this platform is really…that it can bring different types of computation together to solve bigger problems.
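The two-stage pattern Awadallah describes, SQL for aggregation followed by a non-SQL step for prediction, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the hospital's system: the readings, signal names, labels, and reference values are invented; the standard-library `sqlite3` module stands in for a distributed SQL engine; and a simple nearest-centroid rule stands in for whatever model was actually built in Spark.

```python
import sqlite3
from math import dist

# Hypothetical raw readings: (window_id, signal, value). Invented for illustration.
READINGS = [
    (1, "temp_c", 36.4), (1, "temp_c", 36.6),
    (1, "heart_bpm", 150.0), (1, "heart_bpm", 154.0),
    (2, "temp_c", 37.6), (2, "temp_c", 37.8),
    (2, "heart_bpm", 176.0), (2, "heart_bpm", 180.0),
]

# Hypothetical expert-labelled reference points: label -> (mean temp, mean heart rate).
CENTROIDS = {"calm": (36.5, 150.0), "too warm": (37.7, 175.0)}


def aggregate(readings):
    """Stage 1 (SQL): average each signal per time window."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE readings (window_id INTEGER, signal TEXT, value REAL)")
    con.executemany("INSERT INTO readings VALUES (?, ?, ?)", readings)
    rows = con.execute(
        """
        SELECT window_id,
               AVG(CASE WHEN signal = 'temp_c' THEN value END),
               AVG(CASE WHEN signal = 'heart_bpm' THEN value END)
        FROM readings
        GROUP BY window_id
        ORDER BY window_id
        """
    ).fetchall()
    con.close()
    return rows


def classify(features):
    """Stage 2 (non-SQL): pick the label whose reference point is closest.

    Features are left unscaled to keep the sketch short; a real model
    would normalize them so one signal does not dominate the distance.
    """
    return min(CENTROIDS, key=lambda label: dist(features, CENTROIDS[label]))


def label_windows(readings):
    """Run both stages and return {window_id: label}."""
    return {window: classify((temp, heart)) for window, temp, heart in aggregate(readings)}


if __name__ == "__main__":
    print(label_windows(READINGS))
```

The point of the split is that each stage uses the tool that fits it: the aggregation is a natural `GROUP BY`, while the labelling step is ordinary program logic that would be awkward to express in SQL.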

(11:55) What is the role of a data lake in those bigger ML applications, and are there any other examples that you can walk people through?

AA: This (Cloudera) is a very powerful platform that can be used for many things…across many different industries…that said, we worked with many companies over the last eight years, and three key themes have emerged…and these themes fall into these three categories: 1) The first one is what we refer to as customer 360, or how can we drive better customer insights, 2) The second one is the IoT [Internet of Things], which is how we can use our data from the real world to make our products and services much better at what they do, 3) And then the last key category…is how can we lower business risk, and lowering business risk includes a number of sub-use-cases like fraud detection, like cyber security, like risk modeling for economies….

(Tune in to hear Awadallah’s fascinating business use cases in depth, including finding anomalies in large volumes of data. MasterCard, for example, has one of the largest data lakes, according to Awadallah, and stores the location of customers who are using its mobile app; this information helps the giant payment solutions company predict more accurately, in real time, whether a transaction is real or fraudulent.)
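To make the location idea concrete, here is a minimal Python sketch of how a phone's last known position could feed a fraud signal. The interview does not describe MasterCard's actual method, so everything here is an assumption: the function names, the 50 km threshold, and the use of a single distance rule rather than a trained model are all hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_KM = 50.0  # hypothetical threshold, chosen only for illustration


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def looks_suspicious(phone_location, transaction_location, max_km=MAX_PLAUSIBLE_KM):
    """Flag a transaction that happens far from where the customer's phone was seen."""
    return haversine_km(*phone_location, *transaction_location) > max_km


if __name__ == "__main__":
    phone = (37.44, -122.14)       # roughly Palo Alto
    nearby_shop = (37.45, -122.15)
    far_away = (40.71, -74.01)     # roughly New York
    print(looks_suspicious(phone, nearby_shop))
    print(looks_suspicious(phone, far_away))
```

In practice a distance like this would more likely be one input feature among many to a model, not a hard rule, which fits the episode's point that richer inputs matter more than algorithmic complexity.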
