Disaster Recovery Journal

Survey: When AI Factories Fail, 6 in 10 Enterprises Cannot Tell You Why

by Jon Seals | May 13, 2026

New Virtana Study Finds Enterprises Scaling AI Faster Than They Can Govern It

PALO ALTO, Calif. — Two-thirds of enterprises are running AI infrastructure without system-level visibility, creating a fragile foundation beneath rapidly expanding AI deployments. New research from Virtana found that as AI adoption accelerates, a new operational reality is emerging: innovation is outpacing control. 

The AI Factory Reality Check study, based on a survey of 788 US enterprise decision-makers, examines how AI factories operate under real conditions. More than half of respondents are already scaling AI across teams without addressing the system-level observability required to understand and control it. The study documents a widening disconnect between AI factory expansion and the operational foundation needed to sustain it.

“Modern enterprises, including banks, telcos, insurers and airlines, are increasingly dependent on AI-driven services. As a result, one of the greatest risks to the business is any disruption across these AI systems, where failures across applications or underlying infrastructure directly translate into business impact,” said Paul Appleby, CEO of Virtana. “AI systems function as interconnected systems, where infrastructure, data pipelines, token consumption, and model behavior continuously influence outcomes. Yet most organizations still monitor these elements in silos. Without system-wide understanding of these dependencies, they cannot explain how outcomes are produced, control cost, or determine whether those outcomes can be trusted.” 

Enterprise AI Has Scaled. Control Has Not. 

Enterprise AI has moved beyond pilots into at-scale operations. Fifty-four percent of organizations are already scaling AI across teams, while another 23% are managing production workloads alongside infrastructure expansion. At the largest enterprises, particularly those above $10 billion in revenue, this creates systems that are increasingly difficult to understand and control. 

As AI factories scale, system-level observability is not keeping pace. Organizations are expanding AI without the visibility required to understand performance, control cost, or manage risk across the full stack. Instead, critical investments in the operational foundation are being deferred:  

  • 56% of enterprises are deferring legacy infrastructure modernization
  • 54% are deprioritizing cost optimization initiatives 

At the same time, cost pressures are forcing enterprises to continuously reconfigure their AI systems, often without the visibility to understand the impact of those changes. Eighty percent of enterprises report that the cost of premium AI hardware is reshaping infrastructure decisions. In response: 

  • 60% are shifting workloads across hybrid environments  
  • 58% are accelerating consolidation to improve per-unit efficiency  

These are structural changes to live systems under load. Each shift alters dependencies, resource contention, and performance characteristics across the stack. 

“Without system-level observability, organizations cannot determine how these changes affect outcomes, cost, or reliability. As a result, they are continuously optimizing AI systems they do not fully understand, introducing risk with every change,” continued Appleby. 

Inside the AI Factory, Visibility Is the Missing Variable 

As AI factories scale, visibility is emerging as the missing variable in understanding and controlling system behavior. The research shows that as enterprises expand AI, disparities in system understanding and operational control are becoming more pronounced: 

  • 66% of enterprises are operating AI infrastructure without reliable performance baselines 
  • Only 34% describe AI workload performance as highly predictable  
  • That drops to 25% at organizations with more than 50,000 employees  

This lack of visibility extends into incident response: 

  • 59% cannot automatically identify root cause across infrastructure domains when an alert fires  
  • 25% still rely on manual investigations across disconnected consoles as their first response  

When AI systems break, they do not fail cleanly. System understanding degrades, forcing teams into reactive analysis while high-cost GPU capacity sits underutilized, issues compound, and outcomes can no longer be fully explained or controlled. 

“These are not abstract concerns,” continued Appleby. “As AI becomes core enterprise infrastructure, a clear divide is emerging between organizations that understand how their systems produce outcomes and those that cannot explain or control them. Without visibility across models, tokens, GPUs, and infrastructure, teams absorb hidden cost, performance gaps, and ungoverned risk. Those that understand their systems gain end-to-end visibility and control so they can optimize cost in real time, ensure reliable performance, and prove outcomes. The result is declining resilience, eroding trust, and constrained growth as AI becomes infrastructure that must be governed and optimized at scale.” 

ROI Visibility Is the Prerequisite Enterprises Cannot Defer

The study reveals a disconnect between how AI systems operate and how they are observed. A 17-point gap exists between Infra/SRE practitioners and executives on automated root cause analysis capabilities:

  • 69% of Infra/SRE teams report lacking automated cross-domain root cause analysis
  • 52% of executives report the same  

This gap reflects a broader breakdown in system-level observability, where critical signals remain fragmented across the stack: 

  • 57% cite cost and efficiency metrics as a top challenge  
  • 56% cite GPU utilization tracking  
  • 52% cite data pipeline visibility  

These challenges span business outcomes, AI infrastructure, and data dependencies, yet are still managed in isolation. 

GPU cost and utilization remain the most difficult operational challenge for 35% of enterprises, with impact varying by role:

  • 39% of executives experience it as financial accountability pressure  
  • 36% of architects cite integration complexity in distributed environments  
  • 22% of Infra/SRE teams face it as a scaling and reliability challenge  

This variation reflects how different parts of the organization see different fragments of the same system, without a unified view of cause and effect. 

Across all roles and revenue bands, enterprise priorities are consistent: 

  • 38% need unified visibility across AI and infrastructure layers  
  • 32% need AI-driven root cause analysis without manual correlation  

Together, these priorities point to a single requirement: system-aware observability that connects performance, cost, and outcomes across the full stack. Today, most enterprises are operating AI systems they cannot fully observe or explain. 

Virtana Expands AI Factory Observability with Dell Technologies Partnership 

Also announced today, Virtana is extending its Agentic AI-powered observability platform to support Dell’s AI Factory infrastructure, bringing system-aware intelligence across the full AI factory stack, from GPUs and infrastructure to models and AI workloads. Now organizations running Dell-based AI factories can apply continuous, cross-domain analysis across the entire execution system.  

Virtana’s autonomous agents correlate GPU utilization, token demand, model behavior, and underlying infrastructure performance in real time, delivering automated, evidence-backed root cause analysis where system complexity is highest. This support moves teams beyond siloed GPU monitoring and fragmented tooling. Instead of chasing signals, operators get clear answers tied to the actual system constraint driving latency, failures, or cost. 

Resources

  • Download the AI Factory Reality Check research report  
  • Learn more at virtana.com 
  • Learn more about Virtana AI Factory Observability 
  • Read the blog: AI Factories Are Breaking Traditional Infrastructure—Here’s How We’re Fixing It 
  • Follow Virtana on LinkedIn and X 

Research Methodology  

The AI Factory Reality Check is based on an independent survey of 788 US-based professionals at enterprise organizations actively running, piloting, or planning AI workloads in production, with decision-making or significant influence over IT infrastructure, AI strategy, or technology investment. Respondents include application, service, and AI engineering professionals (307), executive leadership (270), infrastructure, cloud, and reliability engineering teams (120), and architects and platform designers (91). Organizations range from under 1,000 to more than 50,000 employees, spanning revenue bands from under $500 million to more than $10 billion.
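As a quick sanity check on the sample composition described above, the following minimal sketch adds up the four respondent groups and works out each group's share of the 788 respondents. The counts come from the methodology paragraph; the shortened dictionary labels and the `share` helper are hypothetical names chosen for this illustration and are not part of the study.

```python
# Respondent counts as reported in the Research Methodology paragraph.
# The keys are shortened labels invented for this sketch.
respondents = {
    "app/service/AI engineering": 307,
    "executive leadership": 270,
    "infra/cloud/reliability": 120,
    "architects/platform designers": 91,
}

# The four groups should account for the full sample of 788.
total = sum(respondents.values())


def share(count: int, total: int) -> float:
    """Return a group's share of the sample as a percentage (one decimal)."""
    return round(100 * count / total, 1)


shares = {role: share(n, total) for role, n in respondents.items()}

print(f"total respondents: {total}")
for role, pct in shares.items():
    print(f"{role}: {pct}%")
```

The group sizes differ considerably, so role-level percentages elsewhere in the release rest on smaller bases than the headline figures. How the study maps these four groups onto labels such as "Infra/SRE" is not stated, so any such mapping is an assumption.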



