This interview analysis is sponsored by Deloitte and was written, edited, and published in alignment with our Emerj sponsored content guidelines. Learn more about our thought leadership and content creation services on our Emerj Media Services page.
Many enterprises discover that “AI readiness” on paper doesn’t always translate to value in production. Across the EU, only about 13.5% of firms report using AI (41% among large firms), highlighting a vast adoption gap even before scale challenges begin. Meanwhile, the same Eurostat business survey attributes slow diffusion to system-level barriers: weak digital readiness, unclear ROI, and governance challenges that can prevent promising pilots from graduating to platforms.
Still, reputable international institutions continue to emphasize that AI’s upside is real. Studies from the Financial Stability Board and the IMF point to meaningful productivity gains across sectors and broad macroeconomic impact, yet that value will accrue unevenly, favoring organizations that solve operational and governance bottlenecks first.
In short, execution quality, defined by systems, controls, and culture, is the decisive variable: access to models is becoming a commodity, while the ability to govern them remains scarce.
This article distills how enterprise leaders can move from scattered pilots to durable advantage, drawing on a conversation between Deborah Golden, U.S. Chief Innovation Officer at Deloitte, and Emerj CEO and Head of Research Daniel Faggella, featured on Emerj’s ‘AI in Business’ podcast.
In the process, we will examine two critical insights from their discussion for business leaders driving enterprise AI adoption across industries:
- A rotating leadership operating system enables AI scale: Best practices for shifting among protecting experiments and budgets, connecting AI’s uncertainty to business outcomes, and removing institutional barriers, so organizations can move from pilots to enterprise-wide transformation.
- Purpose-built sandboxes turn failure into learning: How to design sandboxes with defined hypotheses, guardrails, and success criteria that ensure trial-and-error experimentation becomes a structured tool for accelerating innovation.
Listen to the full episode below:
Guest: Deborah Golden, U.S. Chief Innovation Officer, Deloitte
Expertise: Enterprise Innovation, Security Leadership, Change Management, Cross-Industry Risk
Brief Recognition: Deborah Golden is the U.S. Chief Innovation Officer at Deloitte, leading enterprise-wide innovation strategy and transformation initiatives. Prior to her current role, she served as the U.S. Cyber and Strategic Risk leader, driving large-scale security and resilience programs across sectors. Deborah earned her Master’s degree in Information Technology from George Washington University and an undergraduate degree from Virginia Tech, and is widely recognized for her leadership in inclusive innovation, systems thinking, and cultural change.
A Rotating Leadership Operating System Enables AI Scale
Enterprises do not struggle to deploy AI because of a lack of clever models, but instead because the legacy operating model was never designed to support probabilistic, learning systems, Golden argues.
She frames this as a leadership problem before it is a technical one: a new operating system for leadership is emerging in which executives must oscillate among three functions (Shield, Translator, and Enabler) so the organization has permission, clarity, and momentum at the right moments.
The power of this framework is that it converts vague mandates (“support AI”) into specific executive behaviors that remove predictable blockers:
“AI isn’t a linear IT install; it’s a probabilistic system that collides with deterministic processes. Leaders have to wear three hats on purpose: Shield early-stage work so learning is safe, translate uncertainty into business outcomes the board understands, and then enable scale by clearing policy and process friction. If any one of those is missing, pilots don’t become platforms.”
– Deborah Golden, U.S. Chief Innovation Officer at Deloitte
She emphasizes to leaders that adopting the entire operating system (Shield, Translator, and Enabler) is necessary to succeed in AI infrastructure: like a three-legged stool, the structure collapses if any one leg is missing.
Be the Shield when exploration is fragile, Golden advises. Early-stage discovery and prototyping are where fragile ideas can fail under bureaucracy, optics, and fear; the Shield creates psychological and budgetary safety for compliant experiments, codifying that learning is a first-class outcome.
In practice, this means publishing a one-page Leadership Compact that provides cover for experiments that follow the playbook: documented hypotheses, guardrails, audit logs, and cost caps, Golden explains. It also means measuring:
- Time-to-yes (or, the number of days from an idea entering intake to the moment the experiment actually starts — the shorter, the better)
- Approval burden, or the number of days and sign-offs a compliant experiment requires, with a commitment to cut that burden each quarter
The Shield posture is particularly important in regulated domains, where organizations often prefer to avoid variance; the Shield reframes variance as bounded learning that can de-risk future rollouts, Golden argues.
Leaders should then switch to the Translator role as proposals compete for resources, Golden recommends. AI’s uncertainty is unnerving in boardrooms accustomed to deterministic ROI, and the Translator turns “AI can do a lot” into a crisp narrative that non-technical stakeholders can act on:
- What the business goal is
- What pathways to value there are
- What evidence gates are plausible
- What risks and limits are relevant
A practical instrument Golden recommends to business leaders is the Outcome Charter, attached to every use case, which lists:
- Three business KPIs (e.g., decision-cycle time, first-contact resolution, margin lift)
- Two hygiene KPIs (e.g., runtime cost per 1,000 requests, policy violations), plus acceptance criteria for each investment tranche
The Translator, Golden says, reports movement on business outcomes—not just model metrics—and turns experimental noise into executive sensemaking: what was learned, what stays, what’s killed, and what’s next. Tying updates to two or three of these outcome KPIs, she suggests, lets non-technical stakeholders judge progress and tranche funding with confidence.
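For teams that track Outcome Charters in a system rather than a slide deck, a lightweight data structure can make the acceptance criteria explicit. The sketch below is illustrative only, assuming Python and hypothetical field names and targets; it is not a Deloitte artifact or template.

```python
# Illustrative sketch only: one way a team might encode an Outcome Charter so
# each funding tranche has explicit acceptance criteria.
# The structure, names, and targets are hypothetical, not a Deloitte template.
from dataclasses import dataclass

@dataclass
class OutcomeCharter:
    use_case: str
    business_kpis: dict[str, str]       # KPI -> target (Golden suggests three)
    hygiene_kpis: dict[str, str]        # KPI -> threshold (Golden suggests two)
    tranche_acceptance: dict[str, str]  # tranche -> evidence required to fund it

charter = OutcomeCharter(
    use_case="Conversational dispute-resolution assistant",
    business_kpis={
        "decision-cycle time": "-15% vs. baseline",
        "first-contact resolution": "+5 points",
        "margin lift": "+1.5% on serviced disputes",
    },
    hygiene_kpis={
        "runtime cost per 1,000 requests": "<= $12",
        "policy violations": "0 unresolved per month",
    },
    tranche_acceptance={
        "discovery -> limited rollout": "success criteria met in sandbox",
        "limited rollout -> scale": "business KPIs trending, hygiene KPIs in bounds",
    },
)
print(f"{charter.use_case}: {len(charter.business_kpis)} business KPIs tracked")
```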
At launch and scale, leaders must become Enablers, Golden adds: clear policy and process friction, shorten approvals, update data-access rules, and push decision rights to the edge so fused teams can ship.
She also recommends replacing 90-day access SLAs with weekly gates, and running blameless postmortems so failures become fuel by asking questions like, “What did we learn? What will we change by Friday?”
Crucially, leaders oscillate through the Shield, Translator, and Enabler roles rather than settling into one, Golden notes. She advises declaring the current mix to the team, writing it down, and inspecting it in quarterly reviews so it stays aligned with the portfolio, avoiding the anti-pattern of “sponsoring” AI in words while leaving real blockers in place.
Purpose-Built Sandboxes Turn Failure into Learning
Golden notes that many enterprises celebrating “experimentation” often create sandbox environments that are indistinguishable from production, or that lack clear structure: no hypotheses, spend caps, lineage, or definition of completion.
Her counsel is to make sandboxes intentional and auditable, so that failure becomes intelligent — bounded, captured, and reusable. A well-designed sandbox is a bridge from idea to production, from risk to resilience, and from novelty to reusable capability, Golden argues.
Start by declaring intent, she advises the executive podcast audience. Golden distinguishes three legitimate intents, each with different artifacts and governance:
“A sandbox is only valuable if it’s designed on purpose. Declare the hypothesis, cap the spend, log every prompt and response, and define rollback before you start. Otherwise you’re not learning, you’re wandering. And wandering at scale looks like waste to the board and to regulators.”
– Deborah Golden, U.S. Chief Innovation Officer at Deloitte
1. R&D Iteration Sandbox
Purpose: Improve prompts, retrieval strategies, fine-tunes, guardrails, or agent flows through time-boxed tests.
The “core artifact” Golden refers to here is the principal document required for any experiment in the sandbox. For R&D Iteration, Golden emphasizes that the core artifact is the “Experiment Card.”
Put another way, an Experiment Card is a one-page description that makes the test auditable and finite. The R&D Iteration card must include the following (a minimal sketch follows the list):
- Hypothesis: What you expect to happen and why (or, the testable claim).
- Data scope: Exactly which data the experiment can touch, noting if any data is synthetic (fake but realistic) or anonymized (identifiers removed).
- Expected effect size: How big of an improvement you anticipate (e.g., “reduce handle time ~10%”).
- Success and stop criteria: Clear thresholds to declare the test a win (continue/scale) or to halt it (fail/iterate).
- Spend cap: A hard budget limit so costs can’t run away.
- Rollback conditions: Predefined triggers and steps to revert to the prior safe state if something goes wrong.
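To make the card concrete for technical teams, here is a minimal sketch of how an Experiment Card could be captured as a structured record so intake, audit, and spend checks can be automated. The schema, field names, and example values are assumptions for illustration; Golden describes the card as a one-page document, not a specific data format.

```python
# Illustrative sketch only: an R&D Experiment Card as a structured record.
# Field names and values are hypothetical, not a published Deloitte template.
from dataclasses import dataclass, field

@dataclass
class ExperimentCard:
    hypothesis: str                 # the testable claim
    data_scope: str                 # exactly which data the experiment may touch
    synthetic_or_anonymized: bool   # True if no raw production identifiers are used
    expected_effect: str            # e.g., "reduce handle time ~10%"
    success_criteria: str           # threshold to continue or scale
    stop_criteria: str              # threshold to halt or iterate
    spend_cap_usd: float            # hard budget limit
    rollback_conditions: list[str] = field(default_factory=list)
    spend_to_date_usd: float = 0.0

    def over_budget(self) -> bool:
        """Flag the experiment for review once the hard spend cap is reached."""
        return self.spend_to_date_usd >= self.spend_cap_usd

card = ExperimentCard(
    hypothesis="A retrieval-augmented prompt cuts average handle time by ~10%",
    data_scope="Masked dispute transcripts from Q1, synthetic customer PII only",
    synthetic_or_anonymized=True,
    expected_effect="reduce handle time ~10%",
    success_criteria=">= 8% handle-time reduction with no rise in escalations",
    stop_criteria="< 3% improvement after two sprint cycles",
    spend_cap_usd=8000.0,
    rollback_conditions=["guardrail violation logged", "PII beacon triggered"],
)
print(card.over_budget())  # False until spend_to_date_usd reaches the cap
```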
2. Risk/Resilience Hardening Sandbox
Purpose: Validate security, backup, and recovery controls and exercise incident runbooks in controlled conditions.
The core artifact is a Risk Exercise Plan mapped to specific threats (prompt injection signatures, data poisoning, PII leakage, jailbreak attempts).
Governance includes attack-surface monitoring, synthetic PII beacons, policy-violation alarms, and the capture of mitigation timelines. Done means: documented evidence of controls working (or gaps discovered), updates to the risk register, and a prioritized remediation plan — with owners and dates.
3. Super-User Training Sandbox
Purpose: Give power users a near-real environment to learn new workflow patterns before enterprise rollout.
The core artifact is a Learning Plan with objectives by role, sample data bundles, and assessment rubrics.
Governance includes role-based access, read-only protections where appropriate, content filters, and an audit export that can be shared with compliance and HR for certification. Done means: users pass competency checks, feedback is captured, and user guidance (playbooks, quick starts) is updated.
Golden also advocates applying a common set of design principles across all three intents. To do so, she suggests the following process (a sketch of a register entry follows the list):
- First, bound the problem and the costs: Every experiment carries a budget cap and telemetry for runtime cost (for example, cost per 1,000 calls), so finance sees experiments as managed investments, not opaque spend.
- Next, capture lineage by default: Prompts, responses, guardrail triggers, and data sources (including masking) are logged automatically to enable postmortems and reuse.
- Third, define reversibility: Clear rollback criteria and automated rollback scripts make forward motion safer — and faster — because teams know how to exit.
- Finally, make it inspectable: A shared Sandbox Register lists every active sandbox with intent, owner, status, spend to date, and links to Experiment Cards and artifacts, so compliance can click through and executives can see velocity and cost at a glance.
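As one illustration of the register and the runtime-cost telemetry described above, the sketch below shows a hypothetical Sandbox Register entry with a cost-per-1,000-calls calculation. The schema, helper, and numbers are assumptions for illustration, not a named tool or Deloitte asset.

```python
# Illustrative sketch only: a Sandbox Register kept as structured data so
# compliance can click through and leadership can see spend and status at a glance.
from dataclasses import dataclass

@dataclass
class SandboxEntry:
    name: str
    intent: str              # "R&D iteration" | "risk hardening" | "super-user training"
    owner: str
    status: str              # "active", "paused", "closed"
    spend_to_date_usd: float
    calls_to_date: int
    experiment_card_url: str

    def cost_per_1k_calls(self) -> float:
        """Runtime-cost telemetry: spend divided by thousands of calls served."""
        return self.spend_to_date_usd / max(self.calls_to_date, 1) * 1000

register = [
    SandboxEntry(
        name="dispute-assistant-rnd",
        intent="R&D iteration",
        owner="cx-automation-team",
        status="active",
        spend_to_date_usd=320.0,
        calls_to_date=40_000,
        experiment_card_url="https://example.internal/cards/dispute-assistant",
    ),
]
for entry in register:
    print(entry.name, entry.status, f"${entry.cost_per_1k_calls():.2f} per 1k calls")
```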
Once these frameworks are in place, Golden emphasizes that the shift from hesitation to discipline is the cultural transformation leaders must sponsor throughout the experimentation process.
Many organizations resist sandboxes due to a myth of “zero-failure,” budget optics (“idle spend”), or compliance anxiety; she sympathizes but argues that purpose-built design flips each concern:
- Intelligent failure is cheaper than production failure.
- Cost is capped and visible.
- Artifacts (Model Cards, Decision Records, Experiment Logs) make auditors and regulators more — not less — comfortable.
- Sandboxes sharpen evidence gates for tranche funding from discovery through limited rollout and on to scaling, so each gate requires artifacts showing what was learned, how risks were mitigated, and what the economics look like at the next step.
Golden cites the following cross-industry examples of her suggested frameworks in action:
- Banking: A conversational dispute-resolution assistant is tested in an R&D sandbox with synthetic customer dialogues and masked data. Spend capped at $8K for a two-week sprint. Success criteria: reduce average handling time by 12% on representative flows without raising escalations. Artifacts and results roll into a limited rollout for one card product; the Risk sandbox separately exercises prompt-injection defenses.
- Life Sciences: Document intelligence for pharmacovigilance runs in a hardening sandbox to validate redaction, lineage, and human-in-the-loop controls against simulated adverse event reports. Gaps discovered lead to policy updates before a regulator asks.
- Manufacturing: Maintenance planners in a Super-User Training sandbox learn a recommendations tool using historical work orders; competency checks are required before granting production access, reducing adoption friction and unforced errors.
Before closing, Golden emphasizes to the enterprise audience that operational cadence makes sandboxes work. She recommends the following regimen of team meetings to keep projects on pace:
- Weekly Experiment Stand-Ups, even as short as 10 minutes, to review hypotheses, scope creep, and spend.
- Monthly AI Ops Reviews to examine drift, cost SLOs, incidents, and blocker removal.
- Quarterly Portfolio Reviews to merge redundant efforts and scale the winners.
No matter what the regimen, Golden notes that all meetings should link directly to sandbox artifacts, so discussions are grounded in evidence, not anecdotes.
How this de-risks scale and accelerates value is straightforward, Golden highlights:
- With hypotheses and guardrails explicit, experiments avoid meandering.
- With lineage and logs, failures pay dividends as institutional memory.
- With spend caps and dashboards, finance sees managed risk.
- With artifacts, compliance sees demonstrable control.
- With a register and cadence, leadership sees portfolio movement instead of scattershot projects.
The end result matches her principle: sandboxes are not about avoiding failure — they are about making failure intelligent, so learning compounds and production can get safer, faster, and cheaper over time, Golden concludes.