
Scaling AI with Storage Efficiency – Emerj AI Leader Insight

This article is sponsored by Pure Storage and was written, edited, and published in alignment with our Emerj sponsored content guidelines. Learn more about our thought leadership and content creation services on our Emerj Media Services page.

As enterprises race to implement AI, most hit a bottleneck that’s hiding in plain sight: inefficient storage infrastructure.

While large language models and powerful GPUs dominate headlines, the real engine behind scalable AI lies in how data is stored, moved, and managed across systems. Without a high-performance, compliant, and well-governed data pipeline, even the most advanced AI models stall out.

According to IDC, approximately 80% of worldwide data is projected to be unstructured by 2025, a share driven primarily by formats such as text, images, video, and audio. These are the data types most integral to training and operating AI models, especially voluminous large language models (LLMs) and generative AI systems.

Recent research from Ohio State University surveyed metrics for evaluating data readiness, noting that unstructured data requires extensive preprocessing to be suitable for AI training. Their study highlights the importance of transforming unstructured data into structured formats to improve the quality and effectiveness of AI models.

On a recent episode of the Emerj ‘AI in Business’ podcast, Editorial Director Matthew DeMello explores a three-part framework for driving enterprise AI infrastructure through data storage. Insights for this framework come from Shawn Rosemarin, VP of R&D in Customer Engineering at Pure Storage, who unpacks the core challenges slowing AI adoption and how enterprises can unlock efficiency at scale.

Their conversation highlights that, to break through these bottlenecks, organizations must rethink storage as a strategic asset, not just a backend utility. High-throughput, scalable architectures with metadata tagging, compliance safeguards, and proximity to compute environments are key.

The following article examines three steps of the data storage adoption framework Shawn shared with the Emerj executive podcast audience:

  • Step 1 – Audit and centralize data for AI readiness: Creating a unified, high-performance data foundation by inventorying assets, migrating to scalable storage, and connecting fragmented sources.
  • Step 2 – Build a compliant, metadata-driven pipeline: Enabling secure, scalable AI by enriching data with context, enforcing access controls, and planning for real-time governance.
  • Step 3 – Maximize infrastructure ROI with performance alignment: Driving business value from AI by eliminating GPU idle time, engineering for end-to-end workloads, and building infrastructure that delivers measurable outcomes.

Guest: Shawn Rosemarin, VP of R&D in Customer Engineering, Pure Storage

Expertise: Storage, Sales, Customer Engineering

Brief Recognition: Shawn has over 25 years of experience in the technology industry. He has held key leadership roles, including Worldwide Vice-President of Systems Engineering at Pure Storage and Senior Vice President and CTO at Hitachi Vantara. Earlier in his career, he was a Chief Technologist at VMware, where he developed a strong foundation in integrating technology with business needs. Shawn holds a Bachelor of Commerce in Management Information Systems from Queen’s University.

STEP 1: Auditing and Centralizing Data for AI Readiness

As the conversation turns from challenges to solutions, Shawn discusses the increasing complexity and cost of building AI infrastructure, particularly for enterprises. He starts by highlighting that one of the most expensive components in the setup is GPUs, which are critical for training and running AI models. 

These chips are advancing at a breakneck pace, much faster than what we saw during the PC era, even outpacing traditional growth patterns like Moore’s Law. The upside is that GPUs can now process data at remarkable speeds, but the downside is their high cost.

The problem arises when organizations invest in powerful GPUs but can’t supply them with data fast enough to keep them running at full capacity. If the storage systems (which feed data to the GPUs) are too slow or inefficient, the GPUs sit idle, wasting expensive resources. That’s why storage becomes a critical piece of the AI pipeline: the system must be fast and efficient enough to keep the GPUs fully utilized.
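To make the idle-GPU problem concrete, below is a minimal sketch (our illustration, not Pure Storage’s stack) using PyTorch. The dataset’s artificial read latency stands in for slow storage, and parallel loader workers are what keep the accelerator from waiting between batches; all names and numbers are placeholders.

```python
# Minimal sketch: if per-sample reads are slow, parallel loader workers are
# what keep the GPU from idling between batches.
import time
import torch
from torch.utils.data import DataLoader, Dataset

class SlowStorageDataset(Dataset):
    """Stand-in for a dataset whose samples live on storage; the sleep
    simulates per-file read latency."""
    def __init__(self, n_samples=512, read_latency_s=0.01):
        self.n_samples = n_samples
        self.read_latency_s = read_latency_s

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        time.sleep(self.read_latency_s)      # simulated storage read
        return torch.randn(3, 224, 224)      # simulated training sample

if __name__ == "__main__":
    loader = DataLoader(
        SlowStorageDataset(),
        batch_size=64,
        num_workers=8,     # parallel readers hide per-file latency
        pin_memory=True,   # faster host-to-GPU copies
    )
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    start = time.time()
    for batch in loader:
        batch = batch.to(device, non_blocking=True)
        # the forward/backward pass would run here; if the loader falls
        # behind, this loop stalls and the GPU sits idle
    print(f"epoch took {time.time() - start:.1f}s")
```

The same logic scales up: however fast the GPUs are, throughput is capped by how quickly storage can feed them.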

But it’s not just about speed and performance. Shawn emphasizes that governance and compliance are just as important. The data that fuels AI, what he calls “digital gold,” is spread across many different systems and environments. Some of that data may contain personally identifiable information (PII) or be subject to regulations like HIPAA or GDPR.

He offers an example of why organizations must ensure that any sensitive data used in AI models follows all legal and regulatory standards:

“I’ll use a real scenario: If I go train a bunch of data that includes Matt DeMello, and then all of a sudden, you say, ‘I have a right to be forgotten. Pull all of my data out of your systems.’

Well, now I have to look at all the training models and all the inference that would have been gained from that. I have to re-RAG – retrieval augmented generation – all those systems, having removed Matt’s PII. And I have to be able to do that in real-time to ensure that I don’t get hit with a fine down the road.”

– Shawn Rosemarin, VP of R&D in Customer Engineering at Pure Storage
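For illustration only, here is a stripped-down sketch of what honoring such a request can involve on the retrieval side: if every stored chunk carries metadata naming the people it mentions, a deletion request becomes a lookup-and-purge rather than a full rescan. The class and field names are hypothetical, and a production pipeline would also refresh embeddings, caches, and any models trained on the removed data.

```python
# Toy stand-in for the document store behind a RAG system, with subject tags
# that let a "right to be forgotten" request be resolved by metadata lookup.
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Chunk:
    chunk_id: str
    text: str
    subjects: set[str] = field(default_factory=set)  # people referenced in the chunk

class RetrievalIndex:
    def __init__(self) -> None:
        self.chunks: dict[str, Chunk] = {}

    def add(self, chunk: Chunk) -> None:
        self.chunks[chunk.chunk_id] = chunk

    def forget_subject(self, subject_id: str) -> list[str]:
        """Purge every chunk tagged with the subject and report what was
        removed, so downstream caches and retrained models can be refreshed."""
        purged = [cid for cid, c in self.chunks.items() if subject_id in c.subjects]
        for cid in purged:
            del self.chunks[cid]
        return purged

index = RetrievalIndex()
index.add(Chunk("c1", "Support ticket filed by Matt DeMello ...", {"matt_demello"}))
index.add(Chunk("c2", "Quarterly revenue summary ...", set()))

removed = index.forget_subject("matt_demello")
print(f"purged {len(removed)} chunk(s); retrieval can no longer surface that PII")
```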

STEP 2: Building a Compliant, Metadata-Driven Pipeline

Shawn adds that the hidden complexity behind building AI systems lies in how data is collected, connected, and made usable for training AI models, especially in large enterprises.

He begins by pointing out that organizations have spent decades collecting digital data spread across hundreds of systems. The challenge now isn’t just having the data; it’s streamlining how that data flows into AI systems. For example, to build a comprehensive digital profile of a customer, the system would need to gather data from multiple sources and accurately link it.

In the past, unique IDs like employee numbers made linking data straightforward. Today, many systems lack that consistency, so you might have to guess whether different pieces of data “look like” they belong to the same individual. If you guess wrong, you end up with inaccurate results.
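A small, hypothetical sketch of that guesswork: with no shared customer ID, records from two systems can only be linked by how similar their fields look, and the matching threshold is itself a guess. The sample records and cut-off below are invented for illustration.

```python
# Fuzzy record linkage without a shared ID: link records by string similarity,
# accepting that the threshold trades false merges against false splits.
from difflib import SequenceMatcher

crm_records = [
    {"name": "Matthew DeMello", "email": "m.demello@example.com"},
    {"name": "M. Demelo",       "email": "mdemelo@example.org"},
]
billing_records = [
    {"name": "Matt DeMello", "email": "m.demello@example.com"},
]

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.85  # arbitrary cut-off: too low merges strangers, too high splits one person

for b in billing_records:
    for c in crm_records:
        score = max(similarity(b["name"], c["name"]),
                    similarity(b["email"], c["email"]))
        if score >= THRESHOLD:
            print(f"linked {b['name']!r} <-> {c['name']!r} (score {score:.2f})")
```

An exact match on a field like email is a much stronger signal than name similarity; the point is that without a unique identifier, every link is probabilistic.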

Shawn points out that once the data is assembled, AI adoption leaders also need to ensure that any sensitive information, such as Social Security numbers or payment details, is properly flagged as confidential. That way, it only appears in AI responses for authorized individuals.

The degree of access control Shawn outlines is managed using metadata, which he defines simply through the example of photo files: just as you can see who took a photo and when by checking its metadata, enterprise systems attach metadata to every piece of data to describe who it is about, how private it is, and how it should be used.

Shawn also emphasizes that metadata makes it easier to search and manage data efficiently. Instead of scanning every file to find information about an individual or topic, the system checks a metadata table, sees which files are tagged with that person’s name, and retrieves only those. The approach enables faster, more accurate AI-driven responses, but it depends on having a solid metadata and data pipeline in place from the start.
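As a rough illustration (hypothetical file paths and tags, not a vendor schema), a metadata table like the one Shawn describes might pair each file with the subject it concerns and a sensitivity level, so a query consults the table instead of scanning contents and confidential records never reach unauthorized requesters.

```python
# Minimal metadata catalog: queries are answered from tags, and results are
# filtered by the requester's clearance before anything is retrieved.
from dataclasses import dataclass

@dataclass(frozen=True)
class FileMetadata:
    path: str
    subject: str          # who the data is about
    sensitivity: str      # "public", "internal", or "confidential"

CATALOG = [
    FileMetadata("/data/hr/reviews/2024/md.txt", "matt_demello", "confidential"),
    FileMetadata("/data/marketing/webinars.csv", "matt_demello", "internal"),
    FileMetadata("/data/press/releases.txt",     "company",      "public"),
]

def lookup(subject: str, requester_clearance: str) -> list[str]:
    """Return only the files the requester is allowed to see for a subject."""
    order = {"public": 0, "internal": 1, "confidential": 2}
    allowed = order[requester_clearance]
    return [m.path for m in CATALOG
            if m.subject == subject and order[m.sensitivity] <= allowed]

print(lookup("matt_demello", "internal"))
# ['/data/marketing/webinars.csv'] -- the confidential HR file is excluded
```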

Shawn closes by emphasizing that the entire behind-the-scenes operation of collecting, tagging, connecting, managing permissions, and feeding that data into GPUs for training is critical to the performance of AI systems. If the data pipeline isn’t fast or efficient enough, the GPUs don’t get fully used. 

That creates financial inefficiency and operational lag. He notes that many legacy storage systems were never designed to handle high levels of complexity or the massive number of small files used in AI workloads, which is forcing the industry to rethink its storage and data infrastructure from the ground up.

STEP 3: Maximizing Infrastructure ROI with Performance Alignment

Enterprises have spent years digitizing their data, but that alone isn’t enough. The real challenge and opportunity is figuring out what data is usable and how to prepare it for AI. Here’s how Shawn explains it:

  1. First, take stock of what data you actually have: Companies have spent decades moving from paper to digital, but now is the time to audit it. What do we have? Where does it live? And, more importantly, how much of it is actually usable? (A minimal audit sketch follows this list.)
  2. Most data isn’t AI-ready: Human-recorded data often lacks structure and context, making it difficult for AI to process. For example, a doctor’s handwritten notes make sense to another doctor, but an AI model won’t know what to do with them.
  3. The data needs to be connected: Once you know what you have, the next challenge is connecting it all. That requires heavy lifting: migrating massive datasets and unifying them in a central location.
  4. Centralizing matters because data gravity is real: “Data gravity” means data should live as close as possible to the systems that need it. The closer the data is to the AI infrastructure (especially GPUs), the faster you can move and use it.
  5. You need the right storage system to make it all work: Even once the data is centralized, it must be stored in a way that supports both training and inference — and it has to work at scale, long-term, without blowing up your budget.
  6. It is a business challenge, not a science project: Companies can’t treat AI like a lab experiment. The infrastructure needs to deliver tangible outcomes for customers, employees, or citizens, and the cost of using AI must be justified by the value it returns.
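To ground the first step, here is a minimal first-pass audit sketch (the mount point and the structured/unstructured split are placeholders, not a recommended tool): walk a storage location, tally bytes per file format, and estimate how much of the estate is unstructured before any migration planning begins.

```python
# First-pass data audit: how much data exists per format, and roughly how much
# of it is unstructured (a proxy for how much preprocessing lies ahead).
from collections import Counter
from pathlib import Path

STRUCTURED = {".csv", ".parquet", ".json", ".sql"}  # illustrative split

def audit(root: str) -> None:
    bytes_by_ext: Counter = Counter()
    for p in Path(root).rglob("*"):
        if p.is_file():
            bytes_by_ext[p.suffix.lower()] += p.stat().st_size

    total = sum(bytes_by_ext.values()) or 1
    unstructured = sum(n for ext, n in bytes_by_ext.items() if ext not in STRUCTURED)
    for ext, n in bytes_by_ext.most_common(10):
        print(f"{ext or '<none>':10} {n / 1e9:8.2f} GB")
    print(f"unstructured share: {unstructured / total:.0%}")

audit("/mnt/enterprise-data")  # placeholder mount point
```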
