This interview analysis is sponsored by BigPanda and was written, edited, and published in alignment with our Emerj sponsored content guidelines. Learn more about our thought leadership and content creation services on our Emerj Media Services page.
Modern IT operations are inundated with alerts from various monitoring tools, leading to alert fatigue among IT professionals. The constant barrage of notifications can desensitize teams, causing them to overlook critical issues.
A study by researchers at CSIRO in Sydney, Australia, published in ACM Computing Surveys, highlights that alert fatigue is a significant challenge in Security Operations Centers (SOCs), often resulting in delayed responses to genuine threats. A Trend Micro survey cited in the same study found that 51% of SOC teams feel overwhelmed by alert volume, with analysts spending over 25% of their time handling false positives.
Additionally, research published in Computer Networks, a leading peer-reviewed journal by Elsevier known for rigorous research in networking and IT systems, indicates that excessive alerts, many of which are non-actionable, can overwhelm administrators, leading to decreased efficiency and increased risk of missing vital incidents.
According to the study, even a typical production system in a cloud monitoring environment can generate between 10,000 and 20,000 metrics and several hundred distinct alerts. Compounding the problem, alerts are often labeled only as “critical” or “warning,” regardless of their actual severity. Under such haphazard labeling, a trivial time-sync issue might carry the same urgency as a Denial of Service attack, making effective triage nearly impossible.
On a more positive note, research from the Distributed and Operating Systems Group at Technische Universität Berlin, Germany, highlights that AI for IT Operations (AIOps) is emerging as a solution to these challenges by leveraging machine learning and big data analytics to automate and enhance IT operations. AIOps platforms can process and analyze large volumes of data in real time, enabling proactive identification of issues, predictive maintenance, and automated incident responses.
Emerj Editorial Director Matthew DeMello recently sat down with Troy Felix, Regional Vice President of Sales Engineering at BigPanda, to discuss how AIOps transforms IT operations by making teams more efficient and proactive.
Their conversation highlights AIOps’ role in reducing alert noise, empowering staff, and improving IT efficiency and productivity. This article examines two critical insights for CX leaders from their conversation:
- Streamlining alert management with AI: Enabling IT teams to quickly analyze data, identify root causes, and automate remediation, reducing manual effort and improving efficiency.
- Improving IT efficiency with AIOps: Implementing AIOps to shift IT teams from reactive to proactive, reducing alert noise, minimizing tickets, and empowering entry-level staff to handle complex tasks.
Guest: Troy Felix, RVP of Sales Engineering, BigPanda
Expertise: Sales, Virtualization, Technical Support
Brief Recognition: Troy is a seasoned professional in sales engineering.
Before his role at BigPanda, Troy held several positions at LogicMonitor, including Senior Manager of Sales Engineering for Managed Service Providers and Manager of Sales Engineering for the West region. Troy earned his Bachelor of Science degree in Biology from UC Santa Barbara in 1999.
Streamlining Alert Management with AI
Troy opens the podcast by explaining the traditional challenge IT teams face in managing and processing vast amounts of alert data. The volume of this data is continually growing, and it is challenging to distinguish what is meaningful (“signal”) from what is irrelevant (“noise”). While observability tools are crucial for capturing the data, the next critical step is determining what is essential.
Traditionally, IT teams would spend hours identifying and correlating issues to pinpoint the root cause. AI, however, can analyze vast amounts of data, utilizing historical context, enriched data, and information from systems like the CMDB (Configuration Management Database) or change data. It can make intelligent predictions about the root cause of the issue and determine which specific team or individuals should address it.
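To make the idea of correlation and routing concrete, the following is a minimal sketch, not BigPanda's method. It assumes a hypothetical CMDB extract (`CMDB`), a made-up `Alert` record, a fixed five-minute grouping window, and a deliberately naive heuristic that routes an incident to the team owning the first host to alert. A production AIOps platform would rely on far richer signals, such as topology, change data, and learned patterns.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    host: str
    check: str
    timestamp: int  # seconds since epoch


# Hypothetical CMDB extract: host -> (service, owning team). Illustrative only.
CMDB = {
    "db-01": ("payments", "database-team"),
    "web-01": ("payments", "web-team"),
    "web-02": ("payments", "web-team"),
    "mail-01": ("email", "messaging-team"),
}


def correlate(alerts, window_seconds=300):
    """Group alerts that hit the same service within the same time window."""
    incidents = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        service, _team = CMDB.get(alert.host, ("unknown", "triage"))
        bucket = alert.timestamp // window_seconds
        incidents[(service, bucket)].append(alert)
    return dict(incidents)


def suggest_owner(incident_alerts):
    """Naive assumption: the earliest-alerting host points at the likely owner."""
    first = min(incident_alerts, key=lambda a: a.timestamp)
    return CMDB.get(first.host, ("unknown", "triage"))[1]


alerts = [
    Alert("db-01", "disk_latency", 1000),
    Alert("web-01", "http_5xx", 1030),
    Alert("web-02", "http_5xx", 1045),
    Alert("mail-01", "queue_depth", 1100),
]

incidents = correlate(alerts)
for (service, _bucket), grouped in incidents.items():
    print(service, len(grouped), "alerts ->", suggest_owner(grouped))
```

Here four raw alerts collapse into two incidents, and the payments incident is routed to the database team because the database host alerted first. The point is only to show how enrichment data turns a flat alert stream into something a specific team can act on.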
Troy tells the Emerj executive podcast audience that AI has eliminated the need for mass collaboration calls. Instead, it can identify the exact issue and the responsible party. In more advanced AI-driven operations, the platform can even automate remediation by invoking pre-established playbooks or workflows to fix the problem. Together, these capabilities have unlocked significant efficiencies for IT operations teams.
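The notion of invoking a pre-established playbook can be pictured as a simple lookup from a diagnosed root cause to a remediation routine. The sketch below is illustrative only: the playbook names, the root-cause labels, and the `remediate` function are hypothetical, and the routines return strings rather than touching real systems.

```python
def restart_service(target):
    # Placeholder for a real workflow step; returns a description only.
    return f"restarted service on {target}"


def clear_disk_space(target):
    # Placeholder for a real workflow step; returns a description only.
    return f"cleared temp files on {target}"


# Hypothetical mapping from diagnosed root cause to a pre-approved playbook.
PLAYBOOKS = {
    "service_down": restart_service,
    "disk_full": clear_disk_space,
}


def remediate(root_cause, target):
    """Run the matching playbook, or fall back to human escalation."""
    playbook = PLAYBOOKS.get(root_cause)
    if playbook is None:
        return f"no playbook for {root_cause}; escalate to a human"
    return playbook(target)


print(remediate("disk_full", "db-01"))
print(remediate("certificate_expired", "web-01"))
```

Keeping an explicit fallback for unknown causes reflects a common design choice: automation handles the well-understood cases, and anything else still reaches a person.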
Troy highlights that heavily regulated industries like financial services and banking see the most success with AIOps due to strict compliance, penalties for downtime, and tight SLAs. Healthcare organizations also benefit as AIOps ensures uninterrupted operation for life-saving applications like EMRs and EHRs.
Additionally, he mentions that managed service providers, who deliver IT services to numerous enterprises, are rapidly adopting AIOps to scale their operations efficiently and manage the complexities of providing services across a broad client base.
Improving IT Efficiency with AIOps
Troy further explains that implementing an AIOps platform isn’t just about installing the technology and then feeding it alerts. It’s a comprehensive approach that also involves changing how an organization works internally, especially within IT teams. It’s about anticipating alerts, responding effectively, and dealing with operational events in a new, more efficient way.
He emphasizes the importance of aligning people, processes, and technology. Organizations need to transform their approach to handling IT operations by establishing KPIs like ‘Mean Time to Repair’ and benchmarking them to track improvement. In this area, Troy notes that AI can drive down what he calls “noise”-related alerts.
He goes on to assert that AI can decrease alert noise by up to 95%, helping reduce the overall number of tickets submitted to IT service management systems like ServiceNow.
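For teams that want to benchmark these KPIs, the arithmetic is straightforward. The sketch below assumes Mean Time to Repair is measured as the average time from an incident opening to its resolution, and that noise reduction is the share of raw alerts that never become tickets. The timestamps and counts are invented purely to illustrate the calculation and are not results from BigPanda or the interview.

```python
from datetime import datetime


def mean_time_to_repair_minutes(incidents):
    """Average minutes between each incident's open and resolved times."""
    durations = [
        (resolved - opened).total_seconds() / 60
        for opened, resolved in incidents
    ]
    return sum(durations) / len(durations)


def noise_reduction_pct(raw_alerts, tickets_created):
    """Percentage of raw alerts that did not turn into tickets."""
    return 100 * (raw_alerts - tickets_created) / raw_alerts


# Illustrative data only: (opened, resolved) pairs for two incidents.
incident_log = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 45)),
    (datetime(2024, 1, 1, 13, 0), datetime(2024, 1, 1, 14, 15)),
]

mttr = mean_time_to_repair_minutes(incident_log)
reduction = noise_reduction_pct(raw_alerts=2000, tickets_created=100)
print(f"MTTR: {mttr:.1f} min, noise reduction: {reduction:.1f}%")
```

In this made-up example, 2,000 raw alerts that yield 100 tickets correspond to a 95% reduction, the upper bound Troy cites. Tracking both numbers before and after rollout gives a baseline against which improvement can be measured.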
He highlights a significant shift from a reactive, or “firefighting,” approach to IT problems to a proactive one. With AI, the system can detect patterns in event data and warn IT teams of potential issues before they escalate. For example, AI can predict when a series of events might lead to a significant issue, such as a critical application downtime, and alert the team to address it before it becomes a larger problem.
He also states that AIOps is a journey that begins with collecting all relevant data into the system, enabling AI to learn and make intelligent correlations. This data can include everything from configuration management databases to change data and even spreadsheets still used in some enterprises. The more data the AI has, the smarter and more effective it becomes at understanding complex IT environments and predicting issues before they occur.
In the end, Troy explains the broad benefits of implementing an AIOps platform, highlighting how it positively impacts every level of an organization.
- For CIOs: They benefit because the AIOps platform helps meet business objectives, such as cost reduction and improved uptime.
- For IT Leaders: Managers benefit from having high-performing teams that are not overworked, enabling them to do productive and impactful work without burning out.
- For Frontline Workers: Those in the trenches, handling triage and problem resolution, are more effective and impactful because they have more information at their disposal. AI enables entry-level staff (L1) to handle tasks that previously required higher-level expertise (L2, L3), providing detailed insights, root cause analysis, and specific guidance for remediation.
Troy also highlights the mental and operational benefits of AIOps, including reduced alert noise, which prevents unnecessary work and burnout. For example, false positives, or alerts that do not reflect real problems, can be highly demotivating, especially when they disrupt workers at inconvenient times. AI helps reduce false positives, making the workflow healthier and more productive:
“Perhaps most importantly, your senior architects and engineers are no longer stuck in constant firefighting mode. Instead, they’re free to focus on business-facing initiatives that drive real value and move the organization forward — wherever those priorities lie.”
— Troy Felix, RVP of Sales Engineering, BigPanda