What is IT Event Correlation?

By: opservices on 08.10.2013

Event correlation is a technique for processing and relating seemingly unconnected events in order to identify the root cause of a problem in a computing environment.

In complex IT environments, and especially in networked ones, thousands or even millions of events are generated in a short time, ranging from informational to critical. When these events are sent to a management platform, it is critical to handle them appropriately in order to speed up the identification of an ongoing problem.

 

IT Event Correlation

 
To achieve effective performance in a telecommunications network, it is essential that all of its resources are properly managed and that there is integration between the functional areas (fault, performance, configuration, security and accounting) and between the different management levels (network element management, network management, service management and business management).

Isolating the root cause of a failure is critical in this type of environment. When a central network node fails, hundreds of events are generated by the equipment connected to it, causing an “event storm” that makes it difficult for network operators to tell whether the root cause lies in the central node or in the equipment connected to it directly or indirectly. A good network support analyst can identify the root cause of the failure by drawing on knowledge of the topology and the technology. However, this knowledge is generally difficult and costly to acquire. Event correlation technology aims to transfer this knowledge, often scattered across the organization, to a platform that automates and records the relationships between events.

 

10 reasons to invest in event correlation technology

01. Determine in real time the root cause of failures and operational problems that may affect the company’s business;

02. Identify and quantify the impact of failures, including on business services;

03. The technology enables companies to take a proactive rather than reactive stance;

04. It makes it possible to distinguish routine alerts from those that may actually have a significant impact on the business;

05. It enables the graphical creation of rules, in a drag-and-drop interface, over events and the conditions for generating alarms;

06. Reduced operational costs: automating the creation and execution of workflows greatly reduces the number of alerts that operators must handle, making the management of IT environment processes more economical;

07. It transfers analysts’ knowledge, often scattered across the organization, to a platform that automates and records the relationships between events;

08. It integrates service impact management with automatic event processing in order to build a service model that maps IT components to business services;

09. Events are processed and correlated in memory and only afterwards stored, rather than being read back from a database;

10. It manages dynamic infrastructures for businesses of any size and enables the detection and isolation of, and proactive response to, IT infrastructure problems before they affect customers and enterprise services.

Several types of correlation can be identified according to the operations performed on the available alarms. The most important of these operations are described below [Jakobson and Weissman, 1995]:

 
Compression
Compression consists of detecting, among the events received within a given time window, multiple occurrences of the same event and replacing them with a single event, possibly indicating how many times the event occurred during the window.
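
As a rough illustration of compression, the sketch below collapses repeated events inside one time window into a single record with an occurrence count. The tuple layout (timestamp, source, kind), the function name `compress` and the sample events are assumptions made for this example, not part of any particular platform.

```python
from collections import Counter

def compress(events, window_start, window_end):
    """Collapse repeated events inside [window_start, window_end) into one
    (source, kind, count) record per distinct event."""
    counts = Counter(
        (source, kind)
        for timestamp, source, kind in events
        if window_start <= timestamp < window_end
    )
    return [(source, kind, n) for (source, kind), n in counts.items()]

# Hypothetical sample: three identical Link Down events and one fan alarm.
events = [
    (0, "router-1", "LINK_DOWN"),
    (5, "router-1", "LINK_DOWN"),
    (9, "router-1", "LINK_DOWN"),
    (12, "switch-7", "FAN_FAILURE"),
]
compressed = compress(events, window_start=0, window_end=60)
```

Here the three Link Down events from router-1 become a single record carrying the count 3, while the fan alarm passes through with a count of 1.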

 
Selective suppression
Selective suppression is the temporary inhibition of alarms related to a given event, according to criteria that the correlation system evaluates continuously and that relate to the dynamic context of the network management process. The suppression criteria are generally associated with the presence of other events, the temporal relationship between alarms, or priorities established by network administrators.
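
One way to picture suppression driven by the presence of other events is the sketch below. The rule table `SUPPRESSED_BY`, the alarm names and the function `visible_alarms` are hypothetical; the point is only that an alarm is hidden while its blocking alarm is active and reappears once that alarm clears.

```python
# Hypothetical rule table: an alarm kind is inhibited while any of the
# listed kinds is active on the same source.
SUPPRESSED_BY = {
    "HIGH_LATENCY": {"LINK_DOWN"},
    "PACKET_LOSS": {"LINK_DOWN"},
}

def visible_alarms(active):
    """Return the active (source, kind) alarms that are not inhibited."""
    active = set(active)
    shown = []
    for source, kind in sorted(active):
        blockers = SUPPRESSED_BY.get(kind, set())
        if any((source, blocker) in active for blocker in blockers):
            continue  # inhibited only while the blocking alarm is active
        shown.append((source, kind))
    return shown

during_outage = visible_alarms(
    [("wan-1", "LINK_DOWN"), ("wan-1", "HIGH_LATENCY"), ("wan-2", "PACKET_LOSS")]
)
after_recovery = visible_alarms(
    [("wan-1", "HIGH_LATENCY"), ("wan-2", "PACKET_LOSS")]
)
```

Because the criteria are re-evaluated on every call, the latency alarm on wan-1 is hidden during the outage and becomes visible again once Link Down is no longer active.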

 
Filtering
Filtering consists of suppressing an event depending on the values of a set of previously established parameters. In a strict sense, filtering takes into account only the parameters of the event being filtered. In a broader sense, it may consider any other criteria; in that case, the concept expands and may encompass other operations such as compression and suppression.
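
Filtering in the strict sense can be sketched as a predicate over each event's own parameters. The field names (`source`, `kind`, `severity`), the severity scale and the thresholds below are assumptions chosen for illustration.

```python
def strict_filter(events, min_severity, ignored_sources=frozenset()):
    """Keep only events whose own parameters pass the preset criteria."""
    return [
        event for event in events
        if event["severity"] >= min_severity
        and event["source"] not in ignored_sources
    ]

# Hypothetical events; severity is assumed to run from 1 (info) to 5 (critical).
events = [
    {"source": "db-1", "kind": "DISK_USAGE", "severity": 2},
    {"source": "lab-9", "kind": "CPU_LOAD", "severity": 5},
    {"source": "web-3", "kind": "HTTP_5XX", "severity": 4},
]
kept = strict_filter(events, min_severity=3, ignored_sources={"lab-9"})
```

Each decision looks at one event in isolation, which is what distinguishes strict filtering from the context-dependent operations described in this section.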

 
Counting
Counting consists of generating a new alarm whenever the number of occurrences of a particular type of event exceeds a predetermined threshold.
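
A minimal sketch of counting, assuming events are given as (source, kind) pairs and that the new alarm is named by appending a suffix to the original kind (both are conventions invented for this example):

```python
from collections import Counter

def count_alarms(events, threshold):
    """Emit a new alarm for every (source, kind) seen more than `threshold` times."""
    counts = Counter(events)
    return [
        (source, f"{kind}_THRESHOLD_EXCEEDED", n)
        for (source, kind), n in counts.items()
        if n > threshold
    ]

# Hypothetical sample: six failed logins on fw-1, two on fw-2.
events = [("fw-1", "LOGIN_FAILED")] * 6 + [("fw-2", "LOGIN_FAILED")] * 2
new_alarms = count_alarms(events, threshold=5)
```

Unlike compression, nothing is emitted for fw-2 here: a new alarm appears only once the threshold is exceeded.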

 
Escalation
Escalation is the operation in which, depending on the operational context, an event is suppressed and replaced by another event in which some parameter (severity, for example) has a higher value. The operational context includes, among other elements, the presence of other events, the temporal relationship between these events, the number of occurrences of an event during a specific time window and the priorities set by network administrators.
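
The sketch below illustrates escalation using the number of occurrences as the operational context. The `Alarm` class, the 1 to 5 severity scale and the default limits are assumptions for this example only.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Alarm:
    source: str
    kind: str
    severity: int  # hypothetical scale: 1 (info) to 5 (critical)

def escalate(alarm, occurrences, max_occurrences=3, max_severity=5):
    """Replace the alarm with a higher-severity copy if it recurred too often."""
    if occurrences > max_occurrences and alarm.severity < max_severity:
        return replace(alarm, severity=alarm.severity + 1)
    return alarm

warning = Alarm("core-sw", "CRC_ERRORS", severity=3)
escalated = escalate(warning, occurrences=5)   # recurred too often
unchanged = escalate(warning, occurrences=2)   # still within tolerance
```

The original alarm object is left untouched and a new one takes its place, which mirrors the "suppressed and replaced" wording of the definition.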

 
Generalization
Generalization consists of replacing an event, depending on the operational context, with the event corresponding to its superclass [Bapat 1994 apud Meira 1997]. Two main types of generalization can be identified: generalization by simplification of conditions and instance-based generalization [Holland et al., 1986 apud Meira, 1997]. In the first case, one or more of the conditions required to identify the lower-class event are ignored, so that it can be replaced by an event of a higher class in the class diagram. In the second case, a new event is generated by combining the information from two or more received events.
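
Replacing an event by its superclass can be sketched with a simple lookup over a class hierarchy. The hierarchy in `SUPERCLASS` and the event names are hypothetical, and the sketch covers only the superclass replacement itself, not the two generalization strategies named above.

```python
# Hypothetical event class hierarchy: child class -> superclass.
SUPERCLASS = {
    "FIBER_LINK_DOWN": "LINK_DOWN",
    "COPPER_LINK_DOWN": "LINK_DOWN",
    "LINK_DOWN": "CONNECTIVITY_FAULT",
}

def generalize(kind, levels=1):
    """Walk up the hierarchy by `levels` steps, stopping at the root."""
    for _ in range(levels):
        if kind not in SUPERCLASS:
            break  # already at the most general class
        kind = SUPERCLASS[kind]
    return kind

one_step = generalize("FIBER_LINK_DOWN")
two_steps = generalize("FIBER_LINK_DOWN", levels=2)
```

After one step, fiber and copper link failures both read as a plain Link Down, so a rule written for the general class applies to either.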

 
Specialization
Specialization is the inverse of generalization: it replaces an event with another corresponding to a subclass [Bapat 1994 apud Meira 1997]. This operation, based on deductive reasoning, adds no new information beyond what was already implicitly available in the original events and the configuration database, but it is useful for making evident the consequences that an event in one management layer may have in the upper management layers.

 
Temporal Relationship
Temporal relationship is the operation in which the correlation criterion depends on the order in which, or the time at which, events are generated or received. Several temporal relationships can be defined using concepts such as: “after”, “following”, “before”, “precedes”, “while”, “begins”, “ends”, “coincides with” and “overlaps”.
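
To make a few of these relationships concrete, the sketch below classifies two events by their (start, end) intervals. The exact meaning given to each label is an assumption (one plausible reading of the terms listed above), and the function name `relation` is invented for this example.

```python
def relation(a, b):
    """Classify how interval a = (start, end) relates in time to interval b."""
    a_start, a_end = a
    b_start, b_end = b
    if a_end < b_start:
        return "precedes"        # a finishes before b starts
    if b_end < a_start:
        return "after"           # a starts after b has finished
    if (a_start, a_end) == (b_start, b_end):
        return "coincides with"  # same start and same end
    if a_start == b_start:
        return "begins"          # both start together
    if a_end == b_end:
        return "ends"            # both end together
    if b_start < a_start and a_end < b_end:
        return "while"           # a happens entirely during b
    return "overlaps"            # any other partial overlap

outage = (100, 400)
relations = [
    relation((0, 50), outage),
    relation((150, 200), outage),
    relation((350, 500), outage),
]
```

A correlation rule could then fire only for alarms that occur "while" an outage is in progress, for instance.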

 

Integration
Integration is the generation of a new event once the events received are verified to satisfy a complex correlation pattern. The integration operation can also take into account the results of other correlations and of tests performed on the network.

Let us look at some examples. With an event correlation platform, you can create a rule that identifies the alarms related to a failure in the central network node and suppresses all alarms generated by the network nodes below it until the central node is repaired or restored. Another example is the handling of alarms (SNMP traps, for example) such as Link Down and Link Up within a time window of less than 30 seconds. In other words, if the management platform receives an alarm that a communication link is down (Link Down) but the link returns to normal (Link Up) in less than 30 seconds, a brief failure probably occurred, and no incident record should be opened because it was only a short intermittent failure.
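
Both rules can be sketched in a few lines. The topology map, node names and function names below are hypothetical; only the 30-second window comes from the text.

```python
def descendants(topology, root):
    """Return every node below `root` in a parent -> children map."""
    found, stack = set(), list(topology.get(root, []))
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(topology.get(node, []))
    return found

def suppress_below(alarms, topology, failed_node):
    """Hide alarms from nodes that sit below the failed node."""
    hidden = descendants(topology, failed_node)
    return [(node, kind) for node, kind in alarms if node not in hidden]

def needs_incident(down_at, up_at, window=30):
    """Link Down needs an incident unless Link Up follows in under `window` seconds."""
    return up_at is None or up_at - down_at >= window

# Hypothetical topology: core feeds two distribution nodes, one feeds an access node.
topology = {"core": ["dist-1", "dist-2"], "dist-1": ["access-1"]}
alarms = [
    ("core", "NODE_DOWN"),
    ("dist-1", "UNREACHABLE"),
    ("access-1", "UNREACHABLE"),
    ("edge-9", "FAN_FAILURE"),
]
remaining = suppress_below(alarms, topology, failed_node="core")
```

With the core node down, only the core alarm and the unrelated fan alarm remain visible; a link that recovers after 12 seconds (`needs_incident(100, 112)`) does not open an incident.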

Taking this concept a little further, we could extend the latter rule and create a new one stating that if the link receives this pair of events (Link Down / Link Up) more than X times within a period Y, an incident should be created stating that the signal quality is fluctuating and that the cause should be investigated, since this may indicate a potential problem that could lead to a service interruption.
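
The "more than X times within Y" rule maps naturally onto a sliding window. In the sketch below, `max_flaps` plays the role of X and `period` the role of Y; the class name and the sample timestamps (in seconds) are assumptions.

```python
from collections import deque

class FlapDetector:
    """Open an incident when more than `max_flaps` Link Down / Link Up pairs
    occur within `period` seconds."""

    def __init__(self, max_flaps, period):
        self.max_flaps = max_flaps
        self.period = period
        self.flaps = deque()  # timestamps of completed Down/Up pairs

    def record_flap(self, timestamp):
        """Register one pair; return True if an incident should be opened."""
        self.flaps.append(timestamp)
        # Drop pairs that have fallen out of the sliding window.
        while self.flaps and timestamp - self.flaps[0] > self.period:
            self.flaps.popleft()
        return len(self.flaps) > self.max_flaps

detector = FlapDetector(max_flaps=3, period=600)
results = [detector.record_flap(t) for t in (0, 100, 200, 300, 1000)]
```

The fourth flap inside ten minutes triggers the incident; the flap at t=1000 arrives after the earlier ones have aged out of the window, so the count starts over.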

Beyond network events, you can extend the use of event correlation to IT events in a very broad sense, especially in information security. For example, the failure of a virtual machine (VM) in the cloud may be related to a longer response time of an application in an online store. Likewise, a significant increase in Internet link utilization (over 95%) during a period that normally has low usage, combined with an event indicating a high number of hits to your website (or to a particular application), may be related to a hacker attack and should generate an incident for the information security department.
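
The security example combines three independent observations into one decision, which could be sketched as follows. Only the 95% utilization figure comes from the text; the off-peak hours, the hit threshold and the function name are assumptions for illustration.

```python
def security_incident(link_utilization, hits_per_minute, hour,
                      off_peak=range(0, 6), hits_threshold=10_000):
    """Flag a possible attack: heavy traffic plus many hits during quiet hours."""
    return (
        hour in off_peak                      # assumed low-usage period
        and link_utilization > 0.95           # over 95% link utilization
        and hits_per_minute > hits_threshold  # assumed "high number of hits"
    )

night_spike = security_incident(0.97, hits_per_minute=25_000, hour=3)
daytime_peak = security_incident(0.97, hits_per_minute=25_000, hour=14)
```

The same traffic level is treated as suspicious at 3 a.m. but not at 2 p.m., because the rule depends on the combination of events rather than on any single one.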

The examples are numerous, and these “rules” are surely scattered in the minds of the professionals in your organization. An event correlation platform is responsible for formalizing these rules so that they can be properly automated and recorded. Both OpMon and other open-source management platforms, such as Nagios and Zabbix, natively offer ways to create event correlations, although creating and maintaining them is manual and quite difficult, since they rely on code and complex configuration. Starting with version 5.0 of the OpMon platform, OpServices will provide an additional event correlation module called EventGuard (EG).

EventGuard is a graphical event correlation platform tightly integrated with OpMon. EG enables the graphical creation of rules, in a drag-and-drop interface, over events and the conditions for generating alarms. EventGuard can process thousands of events simultaneously with high performance, since proper handling of events is critical to the effectiveness of the established rules. Using the concept of an upside-down database, events are not read from a database; instead, they are processed and correlated in memory and stored only afterwards.

 
Did you like this content? If your company’s IT team has event correlation needs, request a proposal.
