AI Ops: Skynet in the Data Centre? - TechInvest Magazine Online

Written by Patrick Hubbard, Head Geek at SolarWinds | Mar 3, 2020 8:49:55 AM

Perhaps the single most common pop culture reference in articles about artificial intelligence is Terminator’s Skynet. And it makes a fantastic bogeyman: a neural-network, conscious, self-aware hive mind with a singular goal of purging the earth of humans. It has all the ingredients of a great antagonist, one that has generated, so far, nearly US$2B along the way. So why do so many articles about finally realizing the positive, long-promoted benefits of AI in operations reflexively invoke Skynet? AIOps is a new breed of IT infrastructure solutions designed to autonomously identify operational issues by analysing datasets far larger than humans can, and then make limited human-like decisions. Who wouldn’t want systems capable of diagnosing and fixing themselves?

Perhaps there aren’t many other options to reference. As with most IT buzzwords, the hype around AIOps far exceeds reality. IT managers shouldn’t expect a rise of the machines in the Ops theatre any time soon—and more’s the pity as applied AIOps could offer salvation to the many IT pros currently overwhelmed by complexity and systems noise. But even in its current state of hypeware, there’s value in the AIOps vision—particularly in how it can influence admins to take up new skills and adjust their priorities to be better prepared for the rapidly approaching future.

Cyberdyne Systems Sales Brochure

As an ideology, AIOps represents the answer to the fundamental issue of IT operations: how close to precognisant can IT be? Is it actually possible for machines to know, and alert on, what will go wrong before users notice? The millions of nuanced alerts system admins define, and then proceed to drown in, have represented our best attempt to do so for decades, but manual creation and observation of systems can go only so far. And worse, with systems growing increasingly complex by the minute, manual reassessment and updating of alerts just doesn’t go far enough, if it’s even possible with time-constrained, shrinking Ops teams.

Bringing AIOps online promises to largely eliminate traditional rule-based alerts altogether, replacing them with encyclopaedic all-system data monitoring and correlations beyond the scope of grep and Excel spreadsheets. It aims to automatically create and maintain the heuristics which drive truly meaningful actions. As it evolves, we can expect to see increasingly autonomous AI platforms which make manual alerts the exception, not the rule, and we’ll trust our management systems to do the “right” thing on their own, alerting IT managers only when action needs to be taken. Who wouldn’t want to spin up an instance of that?
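To make the contrast concrete, here is a minimal sketch of a hand-written threshold rule next to a check that derives its own baseline from recent history. It is an illustration only, not how any particular AIOps product works: the function names, the 90% threshold, the z-score limit of 3, and the sample CPU readings are all assumptions chosen for the example.

```python
from statistics import mean, stdev

def static_alert(value, threshold=90.0):
    """Traditional rule: fire only when a fixed, hand-picked limit is crossed."""
    return value > threshold

def baseline_alert(history, value, z_limit=3.0):
    """Learned rule: fire when a reading is far from what recent history suggests."""
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu  # perfectly flat history: any change is unusual
    return abs(value - mu) / sigma > z_limit

# Hypothetical CPU utilisation samples (percent) for a normally quiet host.
cpu_history = [22.0, 25.0, 24.0, 23.0, 26.0, 24.0, 25.0, 23.0]
reading = 60.0

print(static_alert(reading))                 # False: 60% is below the fixed 90% rule
print(baseline_alert(cpu_history, reading))  # True: 60% is far outside this host's norm
```

The fixed rule stays silent because nobody thought to write a 60% threshold for this host, while the baseline check notices that 60% is wildly out of character for it. Real platforms would use far richer models, but the shift from maintaining limits by hand to deriving them from data is the same idea.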

However, the truth is Skynet’s core menace is real, and organizations are understandably wary. IT Ops systems won’t (probably) destroy the earth, but their first applications, like security solutions, could have serious consequences if we really don’t understand them. For example, as network attacks become more AI-powered, we’ll increasingly have to turn over security configuration changes to machines that can keep up with attack barrages launched against us with the speed and breadth of AI.

At present, most of the “solutions” out there fall firmly in the category of vapourware, meaning they have no effect on the inundation of alerts SysAdmins face every day. But unlike other types of tech hype, the idea of AIOps itself is a great motivator for discussion and direction to IT pros today. If you know Skynet is coming, it’s handy to be ready for it in advance.

Your Biases. Give Them to Me. Now.

Foremost amongst those learning priorities: proficiency in data. As AI plays a growing role in Ops management, admins will need at least basic competencies in data science, programming (particularly languages like Python), and principles of machine learning. Some of those competencies will help them understand and fix AI systems. Others, particularly those in data science, will ensure they gather sufficient and relevant data for those AI systems to draw valid conclusions—and avoid biased or spurious predictions with the potential of spiralling out of control.

While there are many articles proclaiming machine learning’s breadth of analysis automatically prevents bias, research increasingly proves what we already know in Ops. If you put garbage data in, garbage decisions come out, and it’s possible to end up increasing bias in the overall decisions. The garbage fed into Skynet was the idea that humans were the enemy, and unchecked, genocidal bias was the result. AIOps won’t have access to the nuclear button (well, let’s hope it won’t), but it will have access to run our most critical systems. From network security to storage to application delivery to orchestration and cost management, we should expect a rush to connect these systems once it becomes affordable and common. If IT pros only have time to research one aspect of AI/ML, it should be selecting data to mitigate bias, which will have the greatest impact on overall service delivery quality.
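A small sketch shows how data selection alone can skew an automated decision. The same simple anomaly check is trained twice, once on overnight samples only and once on samples spanning the whole day. Everything here is hypothetical: the function name, the request-rate figures, and the z-score limit of 3 are assumptions made for illustration.

```python
from statistics import mean, stdev

def is_anomaly(training, value, z_limit=3.0):
    """Flag a value that sits more than z_limit standard deviations from the training mean."""
    mu, sigma = mean(training), stdev(training)
    return abs(value - mu) > z_limit * sigma

# Hypothetical requests-per-second samples.
night_only = [110, 120, 115, 105, 118, 112]                  # collected overnight only
full_day = night_only + [480, 510, 495, 520, 470, 505]       # overnight plus business hours

midday_load = 500  # an ordinary business-hours reading

print(is_anomaly(night_only, midday_load))  # True: the skewed sample treats normal traffic as an incident
print(is_anomaly(full_day, midday_load))    # False: representative data recognises it as routine
```

The algorithm is identical in both runs; only the data differs. A single mean and standard deviation is a crude baseline for traffic with two distinct modes, so this is not a recommended detector, but it shows why choosing representative data matters more than the cleverness of the model.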

This may sound like a significant time sink in learning and training for already time-pressed admins. However, IT pros won’t require data science certifications, developer-level skillsets, or other deep investments in these new disciplines. Vendor tools and services, like those offered by SolarWinds, will provide training to develop expertise tied to tools in a much more efficient form. And many cloud providers, like Azure, AWS, and GCP, already offer substantial free learning resources and certification, which every SysAdmin should tap into, no problemo.

Alongside these more technical disciplines, IT pros should consider how they’ll treat and integrate AI in their current roles. As AIOps goes from robot dream to reality, admins will find themselves collaborating with their AI “co-workers.” They’ll become part teacher and part therapist, learning to trust the seemingly counterintuitive recommendations AI may make and working out how they’ll support those recommendations in meetings with other lines of business.

Ironically, the areas in which AIOps may shine brightest—complex, variable ones like app management, where AI can draw associations humans might never see—are also those where we struggle the most to understand why the machine makes a recommendation. Rather than sending an engineer back in time, IT pros will need to adapt their proficiencies in data science and machine learning to coach their AI platforms. Over time, doing so will guide AIOps tools increasingly to more accurate and useful conclusions, with more organizational trust, by combining both algorithmic rigour and human expertise.

It’s Not Yet August 29, 2:14 a.m. ET

For now, IT pros have time to ramp up their abilities in data science and machine learning, putting those skills to the test. Between BI-powered reporting from existing data lakes and existing cloud platforms’ learning and certification sandboxes, both data understanding and the required data janitorial skills can be developed and tested. They can also apply these skills to how they currently manage and analyse system data, even if doing so doesn’t necessarily involve machine learning algorithms. In the process, they’ll gain a better appreciation for the advantages of expanding monitoring scope and define what they do want in an AIOps platform. Most will naturally seek user-friendly control, with the depth of capabilities to deal with complex app and infrastructure portfolios at scale, minus Judgement Day.

If history tells us anything, it’s that early on, only a few AIOps solutions will provide this happy medium; the majority will either offer incredibly technical capabilities for power users early, or simply repackage existing solutions with a liberal dose of AI jargon later. IT pros should as always be sceptical of any form of vapourware, and that’s where AIOps squarely falls for now for all but the most well-heeled organizations. But as the idea matures into real capabilities, admins who’ve invested in the right skills will be the ones to defer deployment of Skynet 1.0. Because sharp engineers always wait for at least version T-2000.
