The system administrator’s job in a typical business IT department has become exponentially more complex and difficult since the turn of the 21st century. When business networking started, admins were mostly concerned with serving a static company website, monitoring shared drives, providing limited internet access (with no business use case for streaming video or audio, many companies simply blocked it), and perhaps running a VoIP phone system. Each of these applications had a fairly standard set of monitoring and performance requirements. Ensuring that they worked properly was a straightforward problem.
With the adoption of cloud computing, virtual machines, the internet of things and live video streaming, network traffic and the task of monitoring and controlling resources have become increasingly complicated. Managing this complex new infrastructure often requires a great deal of time, attention and manpower.
But the majority of these tasks can be simplified, automated or properly organized by using an artificial intelligence (AI) application built around machine learning. Deploying an AI program that properly integrates with your network’s existing monitoring tools will help track system resource usage, keep components working within expected standards, detect cyberattacks, intrusions and disruptions, and monitor uptime. Perhaps most critically, organizing your monitoring tools this way can also help detect and stop “event storms” of alerts that are caused when a small root cause affects dozens or hundreds of devices.
As each new technology arrives on the scene it presents differing demands for performance, bandwidth, latency, availability and connectivity. The result is a mish-mash of priorities and a need to custom manage each protocol or even each device in order to deliver the necessary performance.
An example of the difficulties facing a network administrator is seen when comparing the needs of a large database with a group of internet of things (IoT) devices. The database will be expected to have relatively quick access times and maximize uptime during peak usage periods to serve web apps or other users.
This demands a solid connection and efficient network paths that must be constantly monitored. A drop in availability is important enough to trigger an alert that should be addressed and resolved as quickly as possible.
On the other hand, the IoT device doesn’t need to connect all the time. It may run a low-power WiFi adaptor on batteries or even be mobile, so it connects only periodically, may need network access only when it has data to stream, and tends to drop off the network the rest of the time.
Setting up one flavor of alerting for these two types of devices would result in either missing outages on the database or receiving constant messaging on the IoT device. It’s certainly possible to tweak alert settings on each device, but doing so would be time-consuming and very detailed repetitive work.
The AI machine-learning approach to this problem is to deploy an application that can adjust to changing conditions and develop a performance metric on its own. This metric will be automatically customized to the specific application and its needs. A large database may need a wider pipeline or more processing power at a given time of high demand. An AI program can measure and use predictive analytics to change that database’s resource allocation when needed.
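To make the per-device alerting idea concrete, a baseline can be learned from each device’s own history rather than configured by hand. The sketch below is purely illustrative (the function name, threshold and sample data are hypothetical, not from any specific product): it flags an availability reading only when it deviates sharply from that device’s learned norm, so a rock-steady database alerts on the smallest dip while a sporadically connecting IoT sensor stays quiet.

```python
from statistics import mean, stdev

def availability_alert(history, current, z_threshold=3.0):
    """Flag the current availability reading only if it is anomalous
    relative to this device's own learned baseline."""
    if len(history) < 10:
        return False  # not enough history to form a baseline yet
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        # perfectly steady device: any deviation at all is an anomaly
        return current != mu
    return abs(current - mu) / sigma > z_threshold

# A database with steady uptime alerts on the smallest dip, while a
# sporadically connecting IoT sensor does not alert when it drops off.
database_history = [0.999] * 30                      # always up
iot_history = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0]   # connects when it likes
```

The same alerting code then serves both device classes; the learned baseline, not the administrator, encodes what “normal” means for each one.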
Automatic Maintenance and Machine Healing
In order to serve its users successfully a network must be secure and have all its resources available when needed. This requires management of the performance of a given machine as well as ongoing monitoring of necessary software upgrades, patches and security configurations.
For some organizations, a brief maintenance window when the entire system is down is acceptable. However, this often requires an interruption of service during working hours or third-shift work on the part of the technical team. Other organizations simply can’t support regular maintenance windows, so they might be tempted to postpone maintenance or deploy overly redundant systems in order to ensure there is no loss in availability.
An AI machine-learning program can solve these issues in two ways. The first is to automate patching and updates in an intelligent manner. With its ability to measure and understand peak and nonpeak usage and use predictive analytics to create a customized maintenance window, the AI can optimize the update and patching schedule to avoid disruption of service. Then it can bring the device down, deploy the patch and bring it back up.
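As a rough illustration of that scheduling step (hypothetical names and data, not from any specific product), a maintenance window can be chosen by scanning a device’s historical per-hour load for the quietest contiguous stretch:

```python
def best_maintenance_window(hourly_load, window_hours=2):
    """Return the start hour of the quietest contiguous window,
    given a 24-entry list of average historical load per hour."""
    best_start, best_load = 0, float("inf")
    for start in range(24):
        # wrap around midnight so late-night windows are considered too
        total = sum(hourly_load[(start + h) % 24] for h in range(window_hours))
        if total < best_load:
            best_start, best_load = start, total
    return best_start
```

A predictive system would build `hourly_load` per device from measured history, so each device gets its own customized window instead of one organization-wide outage.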
Obviously, that’s also something a human could easily do. But the AI can schedule and perform these tasks in the background with no intervention required. And it can scale up to perform individually optimized scheduled maintenance on thousands of devices across a large network, coordinating resources to maximize system availability without regular attention from the IT department.
The second way an AI system can maximize availability is by closely monitoring the performance of every device throughout its normal operations. When specific services start to hang or take up more processing power than their typical profile, the AI can restart the services or perform a reboot as necessary. This can all happen automatically and heal the system before users notice any degradation. With a complete timeline of normal conditions, aberrant performance profiles stick out like a sore thumb to the AI, and it can proactively solve problems before they result in a real impact.
Because the AI system is able to learn its way around the network and record, measure and analyze normal activity on its own, it is perfectly suited to help monitor and notify against potential cyberattacks.
With its measure of normal network performance, bandwidth and availability in place, the AI could detect an unusual spike in requests or other data that represents a denial of service attack, an attempt to brute-force passwords or other types of intrusions. However, rather than simply alert the system administrator any time there is an uptick in usage, the AI is also able to compare what it’s seeing with other measured spikes that are caused by authorized activity. This leads to more reliable and intelligent alerting.
Additionally, the AI is able to learn the functions and normal applications, processes and services that run on each device. In the event that malware is installed inside your network, the AI will recognize the process, compare it to expected norms and block any dangerous activity. When it detects something new on a known device, it has a baseline of comparison from that device’s history and the collective history of other similar devices. The machine-learning engine’s ability to learn and understand about new software and new patches lets it recognize a new process and compare it to available information – even including by searching the internet for the name of the executable.
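One simple way to express that baseline comparison (hypothetical names and categories, purely illustrative) is to check a newly observed process first against the device’s own history and then against the collective history of similar devices in the fleet:

```python
def classify_process(proc_name, device_baseline, fleet_baseline):
    """Classify a newly observed process against the device's own
    history and the collective history of other similar devices."""
    if proc_name in device_baseline:
        return "known"               # part of this device's normal profile
    if proc_name in fleet_baseline:
        return "new-but-expected"    # seen on peers, e.g. a rolling patch
    return "investigate"             # unknown everywhere: flag for review
```

Only the last category needs deeper analysis, such as blocking the process or looking up the executable name, which keeps the volume of genuine investigations small.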
An AI system can also measure and categorize your normal network traffic, using predictive analytics to profile the typical userbase and resource utilization. A cyberattack may look like normal external network activity in some ways, but could have a few specific characteristics used to help mask its source or hide its intent. Machine learning is very useful for finding these very small differences between normal activity and activity that’s not normally seen on your network.
Intelligent Resource Allocation
At one time, most network traffic was essentially treated the same and most devices were set up with a specific set of resources that would support their expected maximum capacity. Now, every application and use case has different needs, but we have much more control over network resources and allocations.
A web server for a retail sales site might need higher bandwidth during peak shopping times but almost none in the middle of the night, while a streaming video service would experience peak usage during prime evening hours and much less in the morning. These factors can be measured, predicted and modified as necessary on the fly with the use of an AI machine learning application.
The AI might also be aware of a regularly scheduled backup task. This would result in the transmission of large amounts of data across the network. Properly tuning and allocating resources would provide the backup task with enough power and bandwidth to run efficiently while other low-activity devices have their available resources tweaked down. Of course, this is all done automatically without intervention from the IT staff. And due to the scheduled and expected nature of the backup task, the drop in resources available to other devices does not trigger an alert as it might in the event of a cyberattack.
In the case of another application such as streaming video, the AI program can deliver a low-latency connection with higher bandwidth as necessary. In this case, the AI has already preprofiled the video streaming device or application and adjusts parameters and resources on the fly. This provides a seamless performance profile for resource-intensive tasks such as a live video chat by intelligently repurposing the same system resources that were previously dedicated to the backup task.
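A minimal sketch of this kind of demand-driven allocation (hypothetical names and numbers; a real system would enforce the result at the network layer via QoS) splits a fixed pool of bandwidth among services in proportion to each service’s predicted demand for the current hour:

```python
def allocate_bandwidth(profiles, total_mbps, hour):
    """Split a fixed bandwidth pool among services in proportion to
    each service's predicted demand at the given hour.
    profiles: dict of service name -> 24-entry predicted-demand curve."""
    demands = {name: curve[hour] for name, curve in profiles.items()}
    total_demand = sum(demands.values()) or 1  # avoid division by zero
    return {name: total_mbps * d / total_demand
            for name, d in demands.items()}
```

With predicted-demand curves built from measured history, the nightly backup automatically receives the lion’s share overnight and the video service receives it during prime evening hours, with no manual reconfiguration.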
Predicting and Stopping “Event Storms”
A correctly configured network monitoring tool will be programmed to create alerts and escalate when critical systems lose connectivity or malfunction in some other way. The inherent problem is not only frequent false alarms, but a flood of nearly identical alerts created when a real emergency happens.
If, for example, a network switch goes down, the monitoring tool will immediately trigger an alert for every critical device that is on that switch. That’s potentially hundreds of alerts for a single failure. Immediately after that, other devices realize they can’t talk to the devices on the switch and they too generate alerts.
The resulting “event storm” leads to hundreds or even thousands of support tickets that may need to be manually examined, verified and closed. The available people to review the alerts will be so bogged down with critical notifications that they won’t be able to quickly diagnose the original problem and they could also miss a new critical issue if an unrelated crash occurs elsewhere on the network.
An AI machine-learning engine is able to solve this because it has a complete picture of the network and how devices are structured and interact. It can intercept hundreds or thousands of tickets at a time and organize them according to commonalities. Receiving 200 alerts from machines connected to Switch X and 3,000 alerts from other machines that can’t talk to those 200 machines results in the obvious conclusion that Switch X is down. The AI computes this and generates a single alert telling the IT staff that Switch X needs immediate attention. When Switch X is back up and the ticket resolved, the AI closes the thousands of tickets generated previously.
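The correlation logic behind that single alert can be sketched as follows (hypothetical names and threshold, purely illustrative): group device-down alerts by the upstream switch they share, collapse large groups into one root-cause ticket, and pass isolated alerts straight through for individual attention.

```python
from collections import defaultdict

def correlate_alerts(alerts, topology, storm_threshold=3):
    """Group device-down alerts by their shared upstream switch.
    alerts: list of device names that raised an alert.
    topology: dict mapping device name -> upstream switch name.
    Returns (root_causes, passthrough): one (switch, alert_count) ticket
    per suspected switch failure, plus the alerts left untouched."""
    by_switch = defaultdict(list)
    for device in alerts:
        by_switch[topology.get(device, "unknown")].append(device)
    root_causes, passthrough = [], []
    for switch, devices in by_switch.items():
        if len(devices) >= storm_threshold:
            root_causes.append((switch, len(devices)))  # one ticket, not N
        else:
            passthrough.extend(devices)
    return root_causes, passthrough
```

Isolated alerts, such as a single device that can’t reach one machine on a different switch, fall below the threshold and reach the IT staff individually instead of being buried in the storm.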
By conducting automated root cause analysis, the AI has saved time, resources and money. Not to mention a lot of headaches. The true power of this is illustrated when another almost identical alert comes in and a device reports that it can’t talk to a single machine on Switch Y. The AI automatically filters that ticket and passes it directly on to the IT staff because it’s not related to the Switch X incident. In an event storm of alerts and tickets, a human might not notice the difference between the Switch X and Switch Y problems and could close the alert without properly resolving it.
Moving to AI Network Management
The true added value of an AI network management tool is that it provides an intelligent way to automate tasks that are only increasing in frequency and complexity. As network devices become more complex and develop more specialized needs, managing them efficiently with limited manpower has become harder and harder. Using an AI application to help manage a network offers a powerful integration between existing monitoring tools, basic automation, and machine learning.
Enterprise Integration’s Digital Robotics Engine (DRE) is an advanced tool that brings these functions together. This synthesis is further empowered by the use of a system-agnostic tool that can connect with an open API architecture and interpret data and metrics from nearly any source. DRE provides the flexibility and power to manage networks of any size and integrate with existing monitoring tools and alerting systems. Access our data sheet to find out more about DRE’s abilities.