
Scaling AI with Storage Efficiency – Emerj AI Leader Insight

This article is sponsored by Pure Storage and was written, edited, and published in alignment with our Emerj sponsored content guidelines. Learn more about our thought leadership and content creation services on our Emerj Media Services page.

As enterprises race to implement AI, most hit a bottleneck that’s hiding in plain sight: inefficient storage infrastructure.

While large language models and powerful GPUs dominate headlines, the real engine behind scalable AI lies in how data is stored, moved, and managed across systems. Without a high-performance, compliant, and well-governed data pipeline, even the most advanced AI models stall out.

According to IDC, approximately 80% of worldwide data is projected to be unstructured by 2025, a share driven primarily by formats such as text, images, video, and audio. These are the data types most integral to training and operating AI models, especially voluminous large language models (LLMs) and generative AI systems.

Recent research from Ohio State University surveyed metrics for evaluating data readiness, noting that unstructured data requires extensive preprocessing to be suitable for AI training. Their study highlights the importance of transforming unstructured data into structured formats to improve the quality and effectiveness of AI models.

On a recent episode of the Emerj ‘AI in Business’ podcast, Editorial Director Matthew DeMello explores a three-part framework for driving enterprise AI infrastructure through data storage. Insights for this framework come from Shawn Rosemarin, VP of R&D in Customer Engineering at Pure Storage, who unpacks the core challenges slowing AI adoption and how enterprises can unlock efficiency at scale.

Their conversation highlights that, to break through these bottlenecks, organizations must rethink storage as a strategic asset, not just a backend utility. High-throughput, scalable architectures with metadata tagging, compliance safeguards, and proximity to compute environments are key.

The following article examines three steps of the data storage adoption framework Shawn shared with the Emerj executive podcast audience:

  • Step 1 – Audit and centralize data for AI readiness: Creating a unified, high-performance data foundation by inventorying assets, migrating to scalable storage, and connecting fragmented sources.
  • Step 2 – Build a compliant, metadata-driven pipeline: Enabling secure, scalable AI by enriching data with context, enforcing access controls, and planning for real-time governance.
  • Step 3 – Maximize infrastructure ROI with performance alignment: Driving business value from AI by eliminating GPU idle time, engineering for end-to-end workloads, and building infrastructure that delivers measurable outcomes.

Guest: Shawn Rosemarin, VP of R&D in Customer Engineering, Pure Storage

Expertise: Storage, Sales, Customer Engineering

Brief Recognition: Shawn has over 25 years of experience in the technology industry. He has held key leadership roles, including Worldwide Vice-President of Systems Engineering at Pure Storage and Senior Vice President and CTO at Hitachi Vantara. Earlier in his career, he was a Chief Technologist at VMware, where he developed a strong foundation in integrating technology with business needs. Shawn holds a Bachelor of Commerce in Management Information Systems from Queen’s University.

STEP 1: Auditing and Centralizing Data for AI Readiness

As the conversation turns from challenges to solutions, Shawn discusses the increasing complexity and cost of building AI infrastructure, particularly for enterprises. He starts by highlighting that one of the most expensive components in the setup is GPUs, which are critical for training and running AI models. 

These chips are advancing at a breakneck pace, much faster than what we saw during the PC era, even outpacing traditional growth patterns like Moore’s Law. The upside is that GPUs can now process data at remarkable speeds, but the downside is their high cost.

The problem arises when organizations invest in powerful GPUs but can’t supply them with data fast enough to keep them running at full capacity. If the storage systems (which feed data to the GPUs) are too slow or inefficient, the GPUs sit idle, wasting expensive resources. That’s why storage becomes a critical piece of the AI pipeline: the system must be fast and efficient enough to keep the GPUs fully utilized.
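To make the idle-GPU problem concrete, below is a minimal sketch (our illustration, not Pure Storage’s stack) using PyTorch. The dataset’s artificial read latency stands in for slow storage, and parallel loader workers are what keep the accelerator from waiting between batches; all names and numbers are placeholders.

```python
# Minimal sketch: if per-sample reads are slow, parallel loader workers are
# what keep the GPU from idling between batches.
import time
import torch
from torch.utils.data import DataLoader, Dataset

class SlowStorageDataset(Dataset):
    """Stand-in for a dataset whose samples live on storage; the sleep
    simulates per-file read latency."""
    def __init__(self, n_samples=512, read_latency_s=0.01):
        self.n_samples = n_samples
        self.read_latency_s = read_latency_s

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        time.sleep(self.read_latency_s)      # simulated storage read
        return torch.randn(3, 224, 224)      # simulated training sample

if __name__ == "__main__":
    loader = DataLoader(
        SlowStorageDataset(),
        batch_size=64,
        num_workers=8,     # parallel readers hide per-file latency
        pin_memory=True,   # faster host-to-GPU copies
    )
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    start = time.time()
    for batch in loader:
        batch = batch.to(device, non_blocking=True)
        # the forward/backward pass would run here; if the loader falls
        # behind, this loop stalls and the GPU sits idle
    print(f"epoch took {time.time() - start:.1f}s")
```

The same logic scales up: however fast the GPUs are, throughput is capped by how quickly storage can feed them.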

But it’s not just about speed and performance. Shawn emphasizes that governance and compliance are just as important. The data that fuels AI, what he calls “digital gold,” is spread across many different systems and environments. Some of that data may contain personally identifiable information (PII) or be subject to regulations like HIPAA or GDPR.

He offers an example of why organizations must ensure that any sensitive data used in AI models follows all legal and regulatory standards:

“I’ll use a real scenario: If I go train a bunch of data that includes Matt DeMello, and then all of a sudden, you say, ‘I have a right to be forgotten. Pull all of my data out of your systems.’

Well, now I have to look at all the training models and all the inference that would have been gained from that. I have to re-RAG – retrieval augmented generation – all those systems, having removed Matt’s PII. And I have to be able to do that in real-time to ensure that I don’t get hit with a fine down the road.”

– Shawn Rosemarin, VP of R&D in Customer Engineering at Pure Storage
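For illustration only, here is a stripped-down sketch of what honoring such a request can involve on the retrieval side: if every stored chunk carries metadata naming the people it mentions, a deletion request becomes a lookup-and-purge rather than a full rescan. The class and field names are hypothetical, and a production pipeline would also refresh embeddings, caches, and any models trained on the removed data.

```python
# Toy stand-in for the document store behind a RAG system, with subject tags
# that let a "right to be forgotten" request be resolved by metadata lookup.
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Chunk:
    chunk_id: str
    text: str
    subjects: set[str] = field(default_factory=set)  # people referenced in the chunk

class RetrievalIndex:
    def __init__(self) -> None:
        self.chunks: dict[str, Chunk] = {}

    def add(self, chunk: Chunk) -> None:
        self.chunks[chunk.chunk_id] = chunk

    def forget_subject(self, subject_id: str) -> list[str]:
        """Purge every chunk tagged with the subject and report what was
        removed, so downstream caches and retrained models can be refreshed."""
        purged = [cid for cid, c in self.chunks.items() if subject_id in c.subjects]
        for cid in purged:
            del self.chunks[cid]
        return purged

index = RetrievalIndex()
index.add(Chunk("c1", "Support ticket filed by Matt DeMello ...", {"matt_demello"}))
index.add(Chunk("c2", "Quarterly revenue summary ...", set()))

removed = index.forget_subject("matt_demello")
print(f"purged {len(removed)} chunk(s); retrieval can no longer surface that PII")
```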

STEP 2: Building a Compliant, Metadata-Driven Pipeline

Shawn adds that the hidden complexity behind building AI systems lies in how data is collected, connected, and made usable for training AI models, especially in large enterprises.

He begins by pointing out that organizations have spent decades collecting digital data spread across hundreds of systems. The challenge now isn’t just having the data; it’s streamlining how that data flows into AI systems. For example, to build a comprehensive digital profile of a customer, the system would need to gather data from multiple sources and accurately link it.

In the past, unique IDs like employee numbers made linking data straightforward. Today, many systems lack that consistency, so you might have to guess whether different pieces of data “look like” they belong to the same individual. If you guess wrong, you end up with inaccurate results.
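A small, hypothetical sketch of that guesswork: with no shared customer ID, records from two systems can only be linked by how similar their fields look, and the matching threshold is itself a guess. The sample records and cut-off below are invented for illustration.

```python
# Fuzzy record linkage without a shared ID: link records by string similarity,
# accepting that the threshold trades false merges against false splits.
from difflib import SequenceMatcher

crm_records = [
    {"name": "Matthew DeMello", "email": "m.demello@example.com"},
    {"name": "M. Demelo",       "email": "mdemelo@example.org"},
]
billing_records = [
    {"name": "Matt DeMello", "email": "m.demello@example.com"},
]

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.85  # arbitrary cut-off: too low merges strangers, too high splits one person

for b in billing_records:
    for c in crm_records:
        score = max(similarity(b["name"], c["name"]),
                    similarity(b["email"], c["email"]))
        if score >= THRESHOLD:
            print(f"linked {b['name']!r} <-> {c['name']!r} (score {score:.2f})")
```

An exact match on a field like email is a much stronger signal than name similarity; the point is that without a unique identifier, every link is probabilistic.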

Shawn points out that once the data is assembled, AI adoption leaders also need to ensure that any sensitive information, such as Social Security numbers or payment details, is properly flagged as confidential. That way, it only appears in AI responses for authorized individuals.

The degree of access control Shawn outlines is managed using metadata, which he defines simply through the example of photo files: just as you can see who took a photo and when by checking its metadata, enterprise systems attach metadata to every piece of data to describe who it is about, how private it is, and how it should be used.

Shawn also emphasizes that metadata makes it easier to search and manage data efficiently. Instead of scanning every file to find information about an individual or topic, the system checks a metadata table, sees which files are tagged with that person’s name, and retrieves only those. The approach enables faster, more accurate AI-driven responses, but it depends on having a solid metadata and data pipeline in place from the start.
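As a rough illustration (hypothetical file paths and tags, not a vendor schema), a metadata table like the one Shawn describes might pair each file with the subject it concerns and a sensitivity level, so a query consults the table instead of scanning contents and confidential records never reach unauthorized requesters.

```python
# Minimal metadata catalog: queries are answered from tags, and results are
# filtered by the requester's clearance before anything is retrieved.
from dataclasses import dataclass

@dataclass(frozen=True)
class FileMetadata:
    path: str
    subject: str          # who the data is about
    sensitivity: str      # "public", "internal", or "confidential"

CATALOG = [
    FileMetadata("/data/hr/reviews/2024/md.txt", "matt_demello", "confidential"),
    FileMetadata("/data/marketing/webinars.csv", "matt_demello", "internal"),
    FileMetadata("/data/press/releases.txt",     "company",      "public"),
]

def lookup(subject: str, requester_clearance: str) -> list[str]:
    """Return only the files the requester is allowed to see for a subject."""
    order = {"public": 0, "internal": 1, "confidential": 2}
    allowed = order[requester_clearance]
    return [m.path for m in CATALOG
            if m.subject == subject and order[m.sensitivity] <= allowed]

print(lookup("matt_demello", "internal"))
# ['/data/marketing/webinars.csv'] -- the confidential HR file is excluded
```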

Shawn closes by emphasizing that the entire behind-the-scenes operation of collecting, tagging, connecting, managing permissions, and feeding that data into GPUs for training is critical to the performance of AI systems. If the data pipeline isn’t fast or efficient enough, the GPUs don’t get fully used. 

That creates financial inefficiency and operational lag. He notes that many legacy storage systems were never designed to handle high levels of complexity or the massive number of small files used in AI workloads, which is forcing the industry to rethink its storage and data infrastructure from the ground up.

STEP 3: Maximizing Infrastructure ROI with Performance Alignment

Enterprises have spent years digitizing their data, but that alone isn’t enough. The real challenge and opportunity is figuring out what data is usable and how to prepare it for AI. Here’s how Shawn explains it:

  1. First, take stock of what data you actually have: Companies have spent decades moving from paper to digital, but now is the time to audit it. What do we have? Where does it live? And, more importantly, how much of it is actually usable? (A minimal audit sketch follows this list.)
  2. Most data isn’t AI-ready: Human-recorded data often lacks structure and context, making it difficult for AI to process. For example, a doctor’s handwritten notes make sense to another doctor, but an AI model won’t know what to do with them.
  3. The data needs to be connected: Once you know what you have, the next challenge is connecting it all. That requires heavy lifting: migrating massive datasets and unifying them in a central location.
  4. Centralizing matters because data gravity is real: “Data gravity” means data should live as close as possible to the systems that need it. The closer the data is to the AI infrastructure (especially GPUs), the faster you can move and use it.
  5. You need the right storage system to make it all work: Even once the data is centralized, it must be stored in a way that supports both training and inference — and it has to work at scale, long-term, without blowing up your budget.
  6. It is a business challenge, not a science project: Companies can’t treat AI like a lab experiment. The infrastructure needs to deliver tangible outcomes for customers, employees, or citizens, and the cost of using AI must be justified by the value it returns.
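To ground the first step, here is a minimal first-pass audit sketch (the mount point and the structured/unstructured split are placeholders, not a recommended tool): walk a storage location, tally bytes per file format, and estimate how much of the estate is unstructured before any migration planning begins.

```python
# First-pass data audit: how much data exists per format, and roughly how much
# of it is unstructured (a proxy for how much preprocessing lies ahead).
from collections import Counter
from pathlib import Path

STRUCTURED = {".csv", ".parquet", ".json", ".sql"}  # illustrative split

def audit(root: str) -> None:
    bytes_by_ext: Counter = Counter()
    for p in Path(root).rglob("*"):
        if p.is_file():
            bytes_by_ext[p.suffix.lower()] += p.stat().st_size

    total = sum(bytes_by_ext.values()) or 1
    unstructured = sum(n for ext, n in bytes_by_ext.items() if ext not in STRUCTURED)
    for ext, n in bytes_by_ext.most_common(10):
        print(f"{ext or '<none>':10} {n / 1e9:8.2f} GB")
    print(f"unstructured share: {unstructured / total:.0%}")

audit("/mnt/enterprise-data")  # placeholder mount point
```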
