Softpanorama
(slightly skeptical) Open Source Software Educational Society

May the source be with you, but remember the KISS principle ;-)


Event Correlation Technologies


Event processing flow includes several stages. Among them

Event correlation is one of the most important parts of the event processing flow. Proper event correlation and filtering is critical to ensuring service quality and the ability to respond rapidly to exceptional situations. The key to this is having experts encode their knowledge about the relationship between event patterns and the actions to take. Unfortunately, doing so is time-consuming and knowledge-intensive.
Simple approaches often lead to information overload: the system "cries wolf" far too often, and as a result even useful alerts get ignored because of the noise level. Correlation of events, while not a panacea, can substantially reduce the load on the human operator and thus improve the chances that a relevant alert will be noticed and acted on in due time.

According to Marcus Ranum,  "Correlation is something everyone wants, but nobody even knows what it is. It's like liberty or free beer -- everyone thinks it's a great idea and we should all have it, but there's no road map for getting from here to there."  Still, a variety of technologies and operations are associated with event correlation:

Stateful correlation is essentially pattern recognition applied to a narrow domain: the process of identifying patterns of events, often across multiple systems or components, that might signify hardware or software problems, attacks, intrusions, misuse or failure of components. It can also be implemented as a specialized database with SQL as a query and peephole manipulation engine. The most typical operations include but are not limited to

Event correlation is often associated with root cause analysis: the process of determining the root cause of one or more events. For example, a failure on the network usually generates multiple alerts, but only one of them can be considered the root cause. This is because a failure condition on one device may render other devices inaccessible. Polling agents are unable to access the device that has the failure condition. In addition, polling agents are also unable to access the other devices rendered inaccessible by the error on the original device. The events generated to indicate that all of these devices are inaccessible are essentially spam. All we need is the root cause event.

The most typical event stream that serves as a playground for event correlation is Unix (or other OS) system logs. Log analysis is probably the major application domain of event correlation. For a basic introduction to the concepts of log analysis, see the Guide to Computer Security Log Management (PDF). Unix logs provide rich information about the state of the system, which permits building sophisticated correlation schemes. Essentially each log entry can be translated into an event, although many can be discarded as non-essential.

With log-based events as a constituent part of the event stream, the number of events in a typical large corporate IT infrastructure, or just its Unix part, can be quite large. This means that raw events typically pass through a special preprocessing phase, often called normalization, which somewhat trims the number of events for subsequent processing. This is most typical when events are extracted from syslog, but it is useful for other high-volume event streams as well.

Normalization eliminates minor, non-essential variations and converts all events into a standard format, or at least a format more suitable for further processing. During this procedure each event is also assigned a unique (often numeric) ID. In some ways it is similar to the rewriting of the envelope in email systems like Sendmail.
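The normalization step can be sketched roughly as follows. This is only an illustration, not a description of any particular product: the syslog-style line layout, the field names, the `<N>` placeholder and the CRC-based numeric ID are all assumptions made for the example.

```python
import re
import zlib

# Assumed (hypothetical) layout: "Mon DD HH:MM:SS host program[pid]: message"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\s"
    r"(?P<prog>[^\[:\s]+)(?:\[\d+\])?:\s(?P<msg>.*)$"
)

def normalize(line):
    """Convert a raw log line into a standard event dict, or None if unparseable."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    # Replace decimal and hex numbers so that minor variations collapse.
    template = re.sub(r"\b(0x[0-9a-fA-F]+|\d+)\b", "<N>", m.group("msg"))
    # Derive a stable numeric ID from the program name and message template.
    event_id = zlib.crc32(f"{m.group('prog')}|{template}".encode())
    return {
        "id": event_id,
        "time": m.group("ts"),
        "host": m.group("host").lower(),
        "program": m.group("prog"),
        "template": template,
        "raw": m.group("msg"),
    }
```

Two "Failed password" lines that differ only in source address and port end up with the same ID, which is what later stages (duplicate removal, compression) rely on.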

Pre-filtering vs. "deep" correlation

It does not make any sense to perform correlation in a single step. It is more productive to use a separate stage for each event stream, usually called "pre-filtering" (or surface correlation), as opposed to "deep" correlation:

The difference between "pre-filtering" and "deep" correlation means that advertisements for a particular correlation engine based on claims that it can process a tremendous number of events per second (Micromuse used to boast about "thousands of events per second") are pretty stupid and tell you something about the quality of the architecture.

With a 10K event cache, IBM TEC 3.8 (and by extension 3.9) can process around 50 events per second using a reasonably well-split set of rules. Assuming a newer 3.2 GHz dual-core Intel CPU, Linux and DB2, this might get closer to 100, which is adequate for most purposes. In a way, any speed above 100 events per second probably does not improve the quality of a "deep" correlation engine.

Typical operations implemented by correlation engines

Event correlation  encompasses a variety of technologies and operations. Among them we can distinguish the following (overlapping) methods:

Filtering

Filtering of events is close to spam filtering. It can be viewed as pre-processing technology for the event stream.  It has several forms

 

Compression

This is a generalization of duplicate removal: it creates a single event from similar but not identical events (for example, events that differ in just one parameter). It can dramatically lower the load on the event correlator. Database-based techniques work well for this category.

Duplicates removal

This is the simplest example of compression, but with a unique twist: we replace a specified number of similar events with one, but add to the event a new field called counter, which is incremented each time an identical event arrives. In a way it is both compression and simple generalization.

Despite being very simple to implement it is very useful and should be deployed on low-level correlation stages (pre-filtering) as it can reduce the load on the main correlation engine. 

For example, 1,000 "SMTP message cannot be delivered" events become a single event that says "message routing failed 1,000 times." This can be due, for example, to a spam attack or to a problem with the SMTP gateway, but the generalized event is definitely more useful than the individual events.
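A minimal sketch of duplicate removal with a counter field might look like this. The event layout (dicts with `host` and `message` fields) and the function name are assumptions for illustration only.

```python
def deduplicate(events, key=("host", "message")):
    """Collapse events that agree on the `key` fields into one event with a counter."""
    merged = {}  # dicts preserve insertion order, so first-seen order is kept
    for ev in events:
        k = tuple(ev[f] for f in key)
        if k in merged:
            merged[k]["counter"] += 1
        else:
            merged[k] = {**ev, "counter": 1}
    return list(merged.values())
```

Passing a shorter `key` (for example only `("message",)`) ignores the remaining fields, which turns the same function into a simple form of compression for events that differ in just one parameter.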

A more complex variant of duplicate removal can be called automodification; we will discuss it in the next classification entry.

Aggregation

Creates a new, more generic event from several dissimilar "low-level" events (for similar events the appropriate operation is compression, see above). For example, a port scanning event is typically the result of generalizing probes on several ports that fit a certain time and/or host distribution pattern. One possible approach is syntax-based methods. A composite event is often called a ticket; it grows dynamically, incorporating new events that fall under the ticket mask (for example, all events registered for a particular serviceable component). For example, in the case of networking events one typical aggregation point is the device: if two interfaces on a device fail, all corresponding events are aggregated into the device ticket.

Generalization
 

Generalization is a more advanced version of aggregation and involves some hierarchy of events. For example, if events about both HTTP and FTP connectivity failures arrive, then a reasonable generalization would be connectivity/TCP_stack.
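One way to picture generalization is as a search for the nearest common ancestor in an event-class hierarchy. The hierarchy below is hypothetical (only `connectivity/TCP_stack` comes from the text above), and the sketch assumes the hierarchy is a tree.

```python
# Hypothetical event-class hierarchy: child class -> parent class.
PARENT = {
    "http_failure": "connectivity/TCP_stack",
    "ftp_failure": "connectivity/TCP_stack",
    "connectivity/TCP_stack": "host_problem",
    "disk_full": "host_problem",
}

def ancestors(cls):
    """Return the class itself followed by its parents, most specific first."""
    chain = [cls]
    while cls in PARENT:
        cls = PARENT[cls]
        chain.append(cls)
    return chain

def generalize(classes):
    """Return the most specific class covering every given class, or None."""
    classes = list(classes)
    if not classes:
        return None
    common = ancestors(classes[0])
    for cls in classes[1:]:
        seen = set(ancestors(cls))
        common = [c for c in common if c in seen]
    return common[0] if common else None
```

With this table, HTTP and FTP failures generalize to `connectivity/TCP_stack`, while an HTTP failure together with a full disk generalizes only to the broader `host_problem`.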

Throttling

Throttling is a variant of filtering in which events are reported only after they occur a certain number of times, or only if the condition does not disappear within a certain interval (the "calm-down period"). For example, if a ping fails it is usually wise to wait a certain interval and repeat the ping before "crying wolf". With a calm-down period, the event is reported only if no new event that contradicts it arrives within the specified period. For example, if ping responses disappear and do not reappear within 10 seconds, the lost connectivity can be reported.
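Both flavors of throttling can be sketched in a few lines. The class and function names, the use of plain numeric timestamps in seconds, and the representation of ping results as `(timestamp, ok)` pairs are assumptions made for this example.

```python
from collections import defaultdict, deque

class Throttle:
    """Report a key only once it has occurred `count` times within `window` seconds."""

    def __init__(self, count, window):
        self.count = count
        self.window = window
        self.seen = defaultdict(deque)  # key -> recent timestamps

    def offer(self, key, ts):
        times = self.seen[key]
        times.append(ts)
        # Forget occurrences that fell out of the time window.
        while times and ts - times[0] > self.window:
            times.popleft()
        if len(times) >= self.count:
            times.clear()  # start counting afresh after reporting
            return True
        return False

def confirmed_outages(pings, calm_down=10):
    """Calm-down variant: `pings` is a time-ordered list of (timestamp, ok).

    Returns the timestamps at which an outage had lasted at least `calm_down`
    seconds without a contradicting successful ping.
    """
    reported, down_since, announced = [], None, False
    for ts, ok in pings:
        if ok:
            down_since, announced = None, False  # contradicting event: reset
        else:
            if down_since is None:
                down_since = ts
            if not announced and ts - down_since >= calm_down:
                reported.append(ts)
                announced = True
    return reported
```

In the calm-down variant a single failed ping followed by a success produces no report at all, which is exactly the "do not cry wolf" behavior described above.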

Escalation

Sometimes multiple events reflect a worsening error condition for a system or a resource. For example, the initial report about a file system that is filling up can be "file system is greater than 90% full", followed by a second, more severe event when it is greater than 95% full, and a critical event when it is greater than 98% full. In this case, the event processor does not need to report the file system event multiple times. It can merely increase the severity of the initial event to indicate that the problem has become more critical and needs to be responded to more quickly.
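The file system example can be sketched as follows. The thresholds come from the text; the severity names, the dictionary of open events and the function names are assumptions for illustration.

```python
SEVERITY = ["warning", "severe", "critical"]  # ordered from least to most severe
THRESHOLDS = [(98, "critical"), (95, "severe"), (90, "warning")]

def severity_for(pct_full):
    """Map a fill percentage to a severity name, or None if below all thresholds."""
    for limit, name in THRESHOLDS:
        if pct_full > limit:
            return name
    return None

def escalate(open_events, resource, pct_full):
    """Raise the severity of the single open event for `resource` if needed."""
    sev = severity_for(pct_full)
    ev = open_events.get(resource)
    if sev is None:
        return ev  # nothing to report; leave any open event untouched
    if ev is None:
        ev = open_events[resource] = {"resource": resource, "severity": sev}
    elif SEVERITY.index(sev) > SEVERITY.index(ev["severity"]):
        ev["severity"] = sev  # escalate in place instead of adding a new event
    return ev
```

This sketch only ever raises severity; whether and when an event should be downgraded or closed is a policy decision that is left out here.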

Self-censure

This is a form of filtering. If a newly arriving event finds that an event which is a generalization of it is already present in the event queue, then the new event is "merged" into that event or ticket and just affects the parameters of the generalized event (for example, the number of repetitions of a particular sub-event).

One of the most typical examples of self-censure is blocking messages during server shutdown. In this case the shutdown event automatically "consumes" all incoming events.
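The shutdown example might be sketched like this. The event fields (`host`, `type`), the `absorbed` counter and the idea that a `startup` event ends the shutdown state are assumptions introduced for the example.

```python
def self_censure(events):
    """Let an open shutdown event for a host absorb that host's later events."""
    shutdown = {}  # host -> currently open shutdown event
    output = []
    for ev in events:
        host = ev["host"]
        if ev["type"] == "shutdown":
            shutdown[host] = {**ev, "absorbed": 0}
            output.append(shutdown[host])
        elif ev["type"] == "startup":
            shutdown.pop(host, None)  # host is back: stop absorbing
            output.append(ev)
        elif host in shutdown:
            shutdown[host]["absorbed"] += 1  # merged into the shutdown event
        else:
            output.append(ev)
    return output
```

Events from other hosts pass through untouched, while everything a host emits between its shutdown and startup only increments the counter on the shutdown event.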

Time-linking

This method can be helpful if one event is always followed by several others or if a sequence of events suggests a particular repeating scenario. There is a special discipline called temporal logic that helps in thinking about such sequences using special diagrams. Time-linking is often combined with suppression: for example, any event during a maintenance window can be assigned a very low priority or filtered out completely.

Typical examples of time-based relationships can include the following:

See also Interval Temporal Logic

Topology based correlation

In the case of networking events the most common correlation method is the use of topology. Topology-based correlation presupposes the existence of some kind of network diagram from which one can infer how two devices are connected.

For example, topology-based correlation makes it possible to suppress the events that occur when elements downstream from a known problem are unreachable.
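Downstream suppression can be sketched with a very small topology table. The device names are made up, and the sketch assumes the topology is a tree in which each device has at most one upstream neighbor on the path to the monitoring station.

```python
# Hypothetical topology: device -> upstream device (toward the monitoring station).
UPSTREAM = {
    "switch1": "router1",
    "web01": "switch1",
    "web02": "switch1",
    "db01": "router1",
}

def root_causes(unreachable):
    """Keep only unreachable devices with no unreachable device upstream of them."""
    down = set(unreachable)
    roots = set()
    for dev in down:
        parent = UPSTREAM.get(dev)
        blocked = False
        while parent is not None:
            if parent in down:
                blocked = True  # an upstream failure already explains this one
                break
            parent = UPSTREAM.get(parent)
        if not blocked:
            roots.add(dev)
    return roots
```

If `switch1`, `web01` and `web02` are all reported unreachable, only `switch1` survives as a candidate root cause; the two server events are treated as consequences and suppressed.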

Overview of correlation methods

While general pattern recognition strategies and expert system engines probably work, there are several specialized (and faster/simpler) approaches to event correlation:

  1. Predicate-based (Prolog). Used in Tivoli TEC up to version 3.9. While it has potential due to its ability to establish "child-parent" relationships (a strong point of Prolog), it proved too complex to use: it makes simple things complex and complex things beyond the reach of normal administrators. Because of this complexity, such engines have had very mixed success. The approach is legitimate, and Prolog has some advantages as a language for describing complex event hierarchies, but the complexity is a definite drawback that essentially prevents productive use of such engines. Will probably be abandoned by IBM in a couple of years, after integrating the Micromuse engine into TEC.
     

  2. Syntax-parsing based. For example regex-based like in SEC
     

  3. Ad-hoc rule-based systems. Ad-hoc systems are usually very primitive and conceptually similar to firewalls. XML is often used for expressing rules. They can be used as a first stage of event processing before a more complex correlation engine (IBM's Tivoli gateway State Correlation Engine belongs to this type).
     

  4. SQL-style operations on dynamic (usually memory-based) window of events (Micromuse engine is the most prominent example here)
     

  5. Statistical anomaly based techniques. Statistical correlation uses special statistical algorithms to determine deviations from normal event levels and other routine activities (for example, a deviation of the frequency of an event by two standard deviations from the running average for the last 200 minutes, a day or a week).
     
  6. Detecting threats using statistical correlation. This is essentially a threshold-based approach, and it does not depend directly on the usage of complex statistical metrics, although they can help. The advantage of this approach is that it does not require any pre-existing knowledge of the event to be detected. The event just needs to be abnormal in some statistical metric. Statistical methods may, however, be used to determine the thresholds beyond which events become abnormal. The simplest example of such a metric is the standard deviation: three deviations are usually enough to consider a normally distributed event abnormal. Such thresholds may also be configured based on the experience of monitoring the environment.
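The statistical approach in the last two items can be sketched as a running-average detector. The class name, the per-interval event counts as input, and the `min_samples` warm-up parameter are assumptions for illustration; the window of 200 samples and the three-sigma threshold echo the figures mentioned above.

```python
from collections import deque
from statistics import mean, pstdev

class RateAnomalyDetector:
    """Flag event counts that deviate from the running average by more than k sigma."""

    def __init__(self, window=200, k=3.0, min_samples=10):
        self.history = deque(maxlen=window)  # most recent per-interval counts
        self.k = k
        self.min_samples = min_samples

    def observe(self, count):
        """Record one per-interval event count; return True if it looks abnormal."""
        abnormal = False
        if len(self.history) >= self.min_samples:
            mu = mean(self.history)
            sigma = pstdev(self.history)
            abnormal = sigma > 0 and abs(count - mu) > self.k * sigma
        self.history.append(count)
        return abnormal
```

Note that this sketch adds abnormal samples to the history as well, so a long burst gradually becomes the new "normal"; excluding flagged samples from the baseline is an equally defensible choice.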

All of those techniques can be used in combination. For example, SQL-style operations make compression (including duplicate removal) a trivial operation, but they have problems with generalization of events. Syntax parsing methods are very powerful for generalization but not so much for time-linking.

Perl is a great tool for experimentation with event correlation technologies and it is very easy to imitate any of those approaches in Perl.  In a sense Perl can be viewed as an ultimate rules correlation engine.

Dr. Nikolai Bezroukov


Notes:
  • Those pages are written by people for whom English is not a native language. Some amount of grammar and spelling errors should be expected.
  • This is a Spartan WHYFF (We Help You For Free) site. It cannot replace the best teachers and the best books.
  • The site contains some obsolete pages as it develops like a living tree... Some links on older pages are broken. Please try to use Google, Open directory, etc. to find a replacement link (see HOWTO search the WEB for details). We would appreciate it if you could mail us a correct link.


Old News ;-)

[Aug 4, 2008] Create a Correlation Engine for the Log and Trace Analyzer

The Log and Trace Analyzer (LTA) included in the IBM Autonomic Computing Toolkit is used for importing different logs generated by various products and transforming the log entries into the Common Base Event (CBE) format. The infrastructure for the LTA has been open sourced as part of the Eclipse Hyades project (see Resources for more information). The LTA can also import symptom databases. Log files can be analyzed and correlated against the symptom databases to find a solution for the problem. LTA is used primarily for problem determination because finding the cause of a problem becomes more difficult as the number of products, and the number of servers they run on, increases. The log file from a single product cannot always help in determining the solution for the overall system problem.

To help you understand the importance of correlation, consider the IBM WebSphere Application Server and an IBM DB2 database. These two products can work together as the application server to host the components and the database to store the data, respectively. If an error occurs in the database and, as a result, the application server stops, it is impossible to track down the source of the problem by looking only at the application server logs. The errors recorded in the application server logs might not be descriptive enough to indicate the details of the problem with the database. In this case, you also need to look at the logs generated by the database. You need to correlate the logs of the application server and the database so that the corresponding problem records from both of the logs can be identified. Although the CBE time stamp is precise up to the microsecond, watching the logs of the individual products and determining the problem by looking only at the time stamps becomes complex. Keep in mind that logs might be generated from different time zones, and the clocks on the systems running the application server and the database cannot always be synchronized to milliseconds.

Correlation in the Log and Trace Analyzer is finding the relation between the distributed log records and learning the influence of one log record on another. The log records can be from the same log file or from different log files; this relation between the log records can be based on the different properties or combination of the properties of the CBE. A correlation engine is an Eclipse plug-in of the LTA that shows the correlation between the log records visually in a UML sequence diagram.

This article describes the procedure for building a correlation engine for the LTA. This example correlation engine extends the default time correlation engine already available with the LTA. The existing default time correlation engine correlates log records by exactly matching the time stamp of the CBE events. However, there could be a time delay in milliseconds between the records of two products even though both the products are running on the same system. This correlation engine ignores the milliseconds while correlating the logs of the IBM WebSphere Application Server activity log and the IBM DB2 diagnostic log.

Resources

[Aug 4, 2008] Achieving complex event processing with Active Correlation Technology

Today's diverse interconnected e-business components typically come with a lot of event information generated by touchpoints through log files or event emitters. Correlating event information to derive symptoms, or higher level business conclusions, is fundamental to identifying critical situations that need to be corrected. This article describes the IBM Active Correlation Technology (ACT), which provides built-in patterns that support event correlation and complex event processing.

ACT is a technology that is in the works at IBM. You will see it showing up in our products in the future. At this point, however, ACT is not available to be embedded into your own applications. However, if you understand the benefits that this new technology provides, you'll be better able to understand the direction in which autonomic computing technology is headed. Read this article for a sneak peek at what types of functions you'll be seeing in the future. As always, we like hearing what you think; chime in with your thoughts on the autonomic computing discussion forum in the Resources section of the article.

The article provides a brief overview of ACT, which is a set of modular event correlation components that deliver complex event processing functions, such as:

ACT includes support for events that conform to the Common Base Event specification and other messaging formats. ACT is a technology that is being embedded in different IBM products and offerings.

Benefits

Any customer with a data center, trying to manage a complex IT infrastructure, can benefit from a solution or product that embeds ACT. By using ACT to detect symptoms, customers can:

Resources

[Apr 18, 2008] Application-layer anomaly and misuse ... - Google Patents

[Apr 18, 2008] Method and apparatus for identifying ... - Google Patents

Method and apparatus for identifying problems in computer networks

US Patent Issued on April 11, 2006

Application No. 10108962, filed on 2002-03-28

Current US Class

714/57, Error forwarding and presentation (e.g., operator console, error display); 714/47, Performance monitoring for fault avoidance

Abstract

A network appliance for monitoring, diagnosing and documenting problems among a plurality of devices and processes (objects) coupled to a computer network utilizes periodic polling and collection of object-generated trap data to monitor the status of objects on the computer network. The status of a multitude of objects is maintained in memory utilizing virtual state machines which contain a small amount of persistent data but which are modeled after one of a plurality of finite state machines. The memory further maintains dependency data related to each object which identifies parent/child relationships with other objects at the same or different layers of the OSI network protocol model. A decision engine verifies through on-demand polling that a device is down. A root cause analysis module utilizes status and dependency data to locate the highest object in the parent/child relationship tree that is affected to determine the root cause of a problem. Once a problem has been verified, a "case" is opened and notification alerts may be sent out to one or more devices. A user interface allows all objects within the network to be displayed with their respective status and their respective parent/child dependency objects in various formats.

Claims

What is claimed is:

1. In a computer system having a processor, memory and a network interface, an apparatus for monitoring a plurality of device or process objects operatively coupled to the computer system over a computer network, the apparatus comprising:

(a) a performance poller for sending performance queries to the plurality of monitored objects and for receiving responses therefrom;

(b) a status poller for sending fault queries to the plurality of monitored objects and for receiving responses thereto;

(c) a fault trapper for receiving fault traps generated by the monitored objects;

(e) a database for storing data relating to the monitored objects and the status thereof, wherein the database stores a plurality of virtual state-machines relating to the monitored objects; and
(f) a case management module for receiving case management requests from the decision engine.
2. The apparatus of claim 1 wherein (f) comprises:
(f1) means for presenting data relating to the monitored objects and status thereof.
3. The apparatus of claim 1 wherein the performance poller is further configured to receive performance data requests from a requestor external to the apparatus and for generating a response to the performance data requests.
4. The apparatus of claim 1 wherein the performance poller receives management data from external sources.
5. The apparatus of claim 1 wherein the status poller is further configured to receive fault data requests from a requester external to the apparatus and for generating a response to the fault data requests.
6. The apparatus of claim 1 wherein fault trapper receives management data from external sources.
7. The apparatus of claim 1 wherein the case management module is further configured to receive case management requests from a requester external to the apparatus and for generating a response to the case management requests.
8. The apparatus of claim 1 wherein (d) further comprises:
(d1) a decision processor responsive to the decision requests and configured to send a object query to the database and for a receiving a object response from the database.
9. The apparatus of claim 1 wherein (d) further comprises:
(d2) a case generator responsive to generation requests from the decision processor and configured to generate case management requests to the case management module.
10. The apparatus of claim 1 wherein (f) further comprises:
(f1) a case management module responsive to the case management requests and configured to send a case management request query to the database and for a receiving a case management request response from the database.
11. The apparatus of claim 10 wherein (f) further comprises:
(f2) an escalation engine configured to send an escalation query to the database and for a receiving an escalation response therefrom.
12. The apparatus of claim 11 wherein (f) further comprises:
(f3) a notification engine responsive to the notification requests from the case management module and the escalation engine and configured to send a notification query to the database and for a receiving a notification response from the database and further configured to generate notifications to a presentation device external to the apparatus.
13. The apparatus of claim 1 further comprising:
(g) an on demand status poller for sending queries to monitored objects identified by the decision engine and for receiving responses thereto.
14. In a computer system having a processor, memory and a network interface, an apparatus for monitoring a plurality of device or process objects operatively coupled to the computer system over a computer network, the apparatus comprising:
(a) a poller for sending queries to the plurality of monitored objects and for receiving responses therefrom;
(b) a trap receiver for receiving traps generated by the monitored objects;
(c) a decision engine responsive to decision requests from any of the trap receiver and poller indicating that one of the plurality of monitored objects has abnormal status, the decision engine further configured to send a verification query to said one of the plurality of monitored objects identified in the decision request and for receiving a response to the verification query from said one of the plurality of monitored objects confirming or denying abnormal status thereof;
(d) a memory for storing data relating to status of the monitored object, wherein the memory stores a plurality of virtual state-machines relating to the monitored objects; and
(e) a case management module for receiving requests from the decision engine to open a case related to a monitored object and for presenting data relating to the case.
15. The apparatus of claim 14 wherein (a) further comprises:
(a1) a status poller for sending queries to the plurality of monitored objects and for receiving responses thereto.
16. The apparatus of claim 14 wherein (a) further comprises:
(a1) a performance poller for sending performance queries to the plurality of monitored objects and for receiving responses thereto.
17. The apparatus of claim 14 wherein (a) further comprises:
(a1) an on demand status poller for sending queries to monitored objects identified by the decision engine and for receiving responses thereto.
18. In a computer system having a processor, memory and a network interface, an apparatus for monitoring a plurality of device or process objects operatively coupled to the computer system over a computer network, the apparatus comprising:
(a) means for monitoring the status of the plurality of monitored objects over the computer network;
(b) means, coupled to the means for monitoring, for receiving data indicating that the status of a monitored object, and, if the data indicating that the status of a monitored object is not normal, for sending a verification request to the monitored object requesting verification of abnormal status and for receiving from the monitored object data confirming or denying abnormal status thereof;
(c) a memory for storing data relating to the status of the monitored objects wherein the memory stores a plurality of virtual state-machines relating to the monitored objects; and
(d) means, coupled to the memory, for presenting data relating to the monitored objects.
19. In an apparatus operatively coupled over a computer network to a plurality of device or process objects, a method comprising:
(a) monitoring the status of the plurality of monitored objects;
(b) receiving data indicating the status of a monitored object;
(c) storing data relating to the status of the monitored objects in memory;
(d) if the data indicating the status of a monitored object is not normal, sending a verification request to the monitored object requesting verification of abnormal status and receiving from the monitored object data confirming or denying abnormal status thereof;
(e) initializing a case relating to a monitored object having a verified status other than normal; and
(f) maintaining in memory a list of all monitored objects, wherein selected of the plurality of monitored objects have parent/child dependency relations.
20. The method of claim 19 further comprising:
(g) alerting a device external to the apparatus that the status of the monitored object has been verified as not normal.
21. The method of claim 20 wherein (f1) comprises:
(g1) alerting an external device operatively coupled to the apparatus over a packet switched network that the status of the monitored object has been verified as not normal.
22. The method of claim 20 wherein (f1) comprises:
(g1) alerting an external device operatively coupled to the apparatus over a circuit switched network that the status of the monitored object has been verified as not normal.
23. The method of claim 19 further comprising:
(g) providing a device external to the apparatus with access to the data relating to the status of the monitored objects in memory.
24. The method of claim 19 wherein (f) further comprises:
(f1) maintaining in memory data identifying the parent/child dependency relations among a plurality of monitored objects.
25. The method of claim 24 wherein the data identifying the parent/child dependency relationship among a plurality of monitored objects is defined in memory with one or more Boolean expressions.
26. The method of claim 24 wherein (d) comprises:
(d1) identifying the highest parent object in the parent/child dependency relation that has a status other than normal.
27. The computer program product for use with an computer system operatively coupled over a computer network to a plurality of device or process objects, the computer program product comprising a computer useable medium having embodied therein program code comprising:
(a) program code for monitoring the status of the plurality of monitored objects;
(b) program code for receiving data indicating the status of a monitored object;
(c) program code for storing data relating to the status of the monitored objects in memory;
(d) program code for sending a verification request to the monitored object requesting verification of abnormal status and for receiving from the monitored object data confirming or denying abnormal status thereof, if the data indicating the status of a monitored object is not normal;
(e) program code for initializing a case relating to a monitored object having a verified status other than normal; and
(f) program code for maintaining in memory a list of all monitored objects, wherein selected of the plurality of monitored objects have parent/child dependency relations.
28. The computer program product of claim 27 further comprising:
(g) program code for alerting a device external to the apparatus that the status of the monitored object has been verified as not normal.
29. The computer program product of claim 28 wherein (f1) comprises:
(g1) program code for alerting an external device operatively coupled to the apparatus over a packet switched network that the status of the monitored object has been verified as not normal.
30. The computer program product of claim 28 wherein (f1) comprises:
(g1) program code for alerting an external device operatively coupled to the apparatus over a circuit switched network that the status of the monitored object has been verified as not normal.
31. The computer program product of claim 27 further comprising:
(g) program code for providing a device external to the apparatus with access to the data relating to the status of the monitored objects in memory.
32. The computer program product of claim 27 wherein (f) further comprises:
(f1) program code for maintaining in memory data identifying the parent/child dependency relations among a plurality of monitored objects.
33. The computer program product of claim 32 wherein the data identifying the parent/child dependency relationship among a plurality of monitored objects defines one or more Boolean relationships.
34. The computer program product of claim 32 wherein (d) comprises:
(d1) program code for identifying if any monitored object in the parent/child dependency relation has a status other than normal; and
(d2) program code for determining that the status of a monitored object is normal if the status of all parent monitored objects in a parent/child dependency chain is other than normal.

Description

FIELD OF THE INVENTION

This invention relates generally to computer networks and more specifically, to an apparatus and methods for identifying, diagnosing, and documenting problems in computer networks including networking devices, servers, appliances, and network services collectively known as objects.

BACKGROUND OF THE INVENTION

Much prior art has focused on identifying network and/or system fault conditions. Additionally, prior art has used topological network maps and diagnostic tools to display network fault conditions. Such tools have been designed to allow less skilled network administrators to conduct support from a network or system management station. Occasionally, network and/or system management systems interface with an exterior system for the documentation of problems and resolutions. Integration is often problematic requiring extensive manipulation and correlation of alarm conditions prior to problem and problem resolution documentation.

Such a traditional approach is inefficient on several levels. The traditional model assumes an administrator is available to actively monitor the network or system management station. In an environment where adequately trained human resources are unavailable, an administrator dedicated to monitoring the network management system is a luxury many technical staffs do not have. A successful system must therefore identify a fault condition and have an established methodology of contacting the appropriate personnel when a fault condition exists.

The current paradigm for network and system management systems is to represent fault information via a topological map. Typically a change in color (or other visual cue) represents a change in the condition of the network or system. This method, as currently applied, is appropriate when a single layer of the Open Systems Interconnect (OSI) logical hierarchical architecture model can represent the fault condition. For example, a fault condition associated with layer two devices can be adequately represented by a layer two topological map. However, to maintain the current paradigm of representing fault condition topologically, a topology map should present a view of the network consistent with complex multi-layer dependencies. Topological representations of large networks are also problematic. A large network is either squeezed onto a single screen or the operator must zoom in and out of the network to change the view. This common approach ignores known relationships between up and downstream objects in favor of a percentage view of the network, e.g. 100% equals the entire network, 50% equals one-half the network.

Further, adequate documentation and description of a problem or fault condition and its corresponding resolution is essential but difficult to achieve within the confines of current network or system management systems. Typically the problem description and problem resolution are documented external to the network or system management system. As a result of using an external system to document problems and their resolution, a dichotomy is created between the machine events in the network management system and the external system which records human intervention. Furthermore, the network management system will typically generate multiple events for a single object, and this association is often lost when translated to an external system. Reconciling the machine view of the network management system with that of the external system documenting the problem description/problem resolution is quite often difficult and unsuccessful.

Current network management tools depend upon the discovery of network/system devices associated with the network, typically through discovery of devices at layer two of the OSI model. Thereafter the network is actively rediscovered using the tool to maintain a current view of the network or system.

A need exists for a technique to automate the process by which network or system faults are translated into an event requiring human action.

A need exists for a technique to discover and document the current state of the network based on known network/system objects and to detect deviations from the known state of the network and report such discovered deviations as faults.

SUMMARY OF THE INVENTION

The invention discloses a network management appliance and methods for identifying, diagnosing, and documenting problems in computer networks using the appliance. The devices and processes available on a network, as well as groupings of the same, are collectively referred to hereafter as "objects". Accordingly, a monitored or managed object may be physical device(s), process(es) or logical associations of the same. According to one aspect of the invention, the network appliance comprises one or more polling modules, a decision engine, a database and a case management module. The network appliance monitors objects throughout the network and communicates their status and/or problems to any number of receiving devices including worldwide web processes, e-mail processes, other computers, PSTN or IP based telephones or pagers.

A Status Poller periodically polls one or more monitored network objects and receives fault responses thereto. A Trap Receiver receives device generated fault messages. Both the Trap Receiver and Status Poller generate and transmit decision requests to the decision engine. The decision engine verifies through on-demand polling that a device is down. [What is the novelty of this? -- NNB]

A root cause analysis module utilizes status and dependency data to locate the highest object in the parent/child relationship tree that is affected to determine the root cause of a problem. Once a problem has been verified, a "case" is opened and notification alerts may be sent out to one or more devices. The decision engine interacts with the database and the case management module to monitor the status of problems or "cases" which have been opened. The case management module interacts with the various notification devices to provide the status updates and to provide responses to queries.
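The root cause search described above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: it assumes each object has at most one parent, and the function name `find_root_causes`, the object names, and the status strings are all hypothetical.

```python
from typing import Dict, List, Optional


def find_root_causes(parents: Dict[str, Optional[str]],
                     status: Dict[str, str]) -> List[str]:
    """Return the non-normal objects that have no non-normal ancestor.

    Assumption: a single-parent tree; `parents` maps object -> parent (or None).
    """
    roots = []
    for obj, state in status.items():
        if state == "normal":
            continue
        # Walk up the parent chain looking for a failed upstream object.
        ancestor = parents.get(obj)
        failed_upstream = False
        while ancestor is not None:
            if status.get(ancestor, "normal") != "normal":
                failed_upstream = True
                break
            ancestor = parents.get(ancestor)
        if not failed_upstream:
            roots.append(obj)
    return sorted(roots)


# Hypothetical topology: two servers behind switch-a, which sits behind core-router.
parents = {"core-router": None, "switch-a": "core-router",
           "server-1": "switch-a", "server-2": "switch-a",
           "switch-b": "core-router"}
status = {"core-router": "normal", "switch-a": "critical",
          "server-1": "critical", "server-2": "critical",
          "switch-b": "normal"}
print(find_root_causes(parents, status))  # ['switch-a']
```

Only the highest affected object is reported, so one case would be opened for switch-a rather than three separate cases.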

The status of a monitored object is maintained in memory using a virtual state machine. The virtual state machines are based on one or a plurality of different finite state machine models. The decision engine receives input data, typically event messages, and updates the virtual state machines accordingly. The inventive network appliance records thousands of network states and simultaneously executes thousands of state machines while maintaining a historical record of all states and state machines.
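A per-object virtual state machine with a retained history might look something like the sketch below. The states, events, and transition table are assumptions chosen for illustration; the patent text does not enumerate them.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical transition table: (current state, event) -> next state.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("normal", "fault_reported"): "verifying",
    ("verifying", "verified_down"): "critical",
    ("verifying", "verified_up"): "normal",
    ("critical", "recovered"): "normal",
}


@dataclass
class VirtualStateMachine:
    """Tracks one monitored object's status and keeps every transition."""
    name: str
    state: str = "normal"
    history: List[Tuple[str, str]] = field(default_factory=list)

    def handle(self, event: str) -> str:
        # Unknown (state, event) pairs leave the state unchanged.
        next_state = TRANSITIONS.get((self.state, event), self.state)
        self.history.append((event, next_state))
        self.state = next_state
        return self.state


# A decision engine could keep one machine per monitored object in memory.
machines = {name: VirtualStateMachine(name) for name in ("router-1", "server-1")}
machines["router-1"].handle("fault_reported")
machines["router-1"].handle("verified_down")
```

Keeping the machines in a dictionary keyed by object name is one simple way to run many of them side by side; the history list stands in for the historical record the text mentions.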

According to a first aspect of the invention, in a computer system having a processor, memory and a network interface, an apparatus for monitoring a plurality of device or process objects operatively coupled to the computer system over a computer network, the apparatus comprises: (a) a performance poller for sending performance queries to the plurality of monitored objects and for receiving responses therefrom; (b) a status poller for sending fault queries to the plurality of monitored objects and for receiving responses thereto; (c) a fault trapper for receiving fault traps generated by the monitored objects; (d) a decision engine responsive to decision requests from any of the fault trapper, status poller and performance poller, the decision engine further configured to send a verification query to one of the plurality of monitored objects identified in the decision request and for receiving a response to the verification query; (e) a database for storing data relating to the monitored objects and the status thereof; and (f) a case management module for receiving case management requests from the decision engine.

According to a second aspect of the invention, in a computer system having a processor, memory and a network interface, an apparatus for monitoring a plurality of device or process objects operatively coupled to the computer system over a computer network, the apparatus comprises: (a) a poller for sending queries to the plurality of monitored objects and for receiving responses therefrom; (b) a trap receiver for receiving traps generated by the monitored objects; (c) a decision engine responsive to decision requests from any of the trap receiver and poller, the decision engine further configured to send a verification query to one of the plurality of monitored objects identified in the decision request and for receiving a response to the verification query; (d) a memory for storing data relating to status of the monitored object; and (e) a case management module for receiving requests from the decision engine to open a case related to a monitored object and for presenting data relating to the case.

According to a third aspect of the invention, in a computer system having a processor, memory and a network interface, an apparatus for monitoring a plurality of device or process objects operatively coupled to the computer system over a computer network, the apparatus comprises: (a) means for monitoring the status of the plurality of monitored objects over the computer network; (b) means, coupled to the means for monitoring, for receiving data indicating the status of a monitored object, and, if the data indicates that the status of a monitored object is not normal, for verifying that the status of the monitored object is not normal; (c) a memory for storing data relating to the status of the monitored object; and (d) means, coupled to the memory, for presenting data relating to the monitored objects.

According to a fourth aspect of the invention, in an apparatus operatively coupled over a computer network to a plurality of device or process objects, a computer program product and method comprises: (a) monitoring the status of the plurality of monitored objects; (b) receiving data indicating the status of a monitored object; (c) storing data relating to the status of the monitored objects in memory; (d) if the data indicating the status of a monitored object is not normal, verifying that the status of the monitored object is not normal; and (e) initializing a case relating to a monitored object having a verified status other than normal.

BRIEF DESCRIPTION OF THE DRAWINGS

The above and further advantages of the invention may be better understood by referring to the following description in conjunction with the accompanying drawings in which:

DETAILED DESCRIPTION

FIG. 1 illustrates the system architecture for a computer system 100, such as a Dell Dimension 8200, commercially available from Dell Computer, Dallas Tex., on which the invention can be implemented. The exemplary computer system of FIG. 1 is for descriptive purposes only. Although the description below may refer to terms commonly used in describing particular computer systems, the description and concepts equally apply to other systems, including systems having architectures dissimilar to FIG. 1.

The computer system 100 includes a central processing unit (CPU) 105, which may include a conventional microprocessor, a random access memory (RAM) 110 for temporary storage of information, and a read only memory (ROM) 115 for permanent storage of information. A memory controller 120 is provided for controlling system RAM 110. A bus controller 125 is provided for controlling bus 130, and an interrupt controller 135 is used for receiving and processing various interrupt signals from the other system components. Mass storage may be provided by diskette 142, CD ROM 147 or hard drive 152. Data and software may be exchanged with computer system 100 via removable media such as diskette 142 and CD ROM 147. Diskette 142 is insertable into diskette drive 141 which is, in turn, connected to bus 130 by a controller 140. Similarly, CD ROM 147 is insertable into CD ROM drive 146 which is connected to bus 130 by controller 145. Hard disk 152 is part of a fixed disk drive 151 which is connected to bus 130 by controller 150.

User input to computer system 100 may be provided by a number of devices. For example, a keyboard 156 and mouse 157 are connected to bus 130 by controller 155. An audio transducer 196, which may act as both a microphone and a speaker, is connected to bus 130 by audio controller 197, as illustrated. It will be obvious to those reasonably skilled in the art that other input devices such as a pen and/or tablet and a microphone for voice input may be connected to computer system 100 through bus 130 and an appropriate controller/software. DMA controller 160 is provided for performing direct memory access to system RAM 110. A visual display is generated by video controller 165 which controls video display 170. Computer system 100 also includes a network adapter 190 which allows the system to be interconnected to a local area network (LAN) or a wide area network (WAN), schematically illustrated by bus 191 and network 195.

Computer systems 100-102 are generally controlled and coordinated by operating system software. The operating system controls allocation of system resources and performs tasks such as process scheduling, memory management, and networking and I/O services, among other things. In particular, an operating system resident in system memory and running on CPU 105 coordinates the operation of the other elements of computer system 100. The present invention may be implemented with any number of commercially available operating systems including UNIX, Windows NT, Windows 2000, Windows XP, Linux, Solaris, etc. One or more applications 220 such as the inventive network management application may execute under control of the operating system 210. If operating system 210 is a true multitasking operating system, multiple applications may execute simultaneously.

In the illustrative embodiment, the present invention may be implemented using object-oriented technology and an operating system which supports execution of object-oriented programs. For example, the inventive system may be implemented using a combination of languages such as C, C++, Perl, PHP, Java, HTML, etc., as well as other object-oriented standards.

In the illustrative embodiment, the elements of the system are implemented in the C++ programming language using object-oriented programming techniques. C++ is a compiled language, that is, programs are written in a human-readable script and this script is then provided to another program called a compiler which generates a machine-readable numeric code that can be loaded into, and directly executed by, a computer. As described below, the C++ language has certain characteristics which allow a software developer to easily use programs written by others while still providing a great deal of control over the reuse of programs to prevent their destruction or improper use. The C++ language is well-known and many articles and texts are available which describe the language in detail. In addition, C++ compilers are commercially available from several vendors including Borland International, Inc. and Microsoft Corporation. Accordingly, for reasons of clarity, the details of the C++ language and the operation of the C++ compiler will not be discussed further in detail herein. The program code used to implement the present invention may also be written in scripting languages such as Perl, JavaScript, or non-compiled PHP. If required, the non-compiled PHP can be converted to machine readable format.

Network Communication Environment

FIG. 2 illustrates a telecommunications environment in which the invention may be practiced, such environment being for exemplary purposes only and not to be considered limiting. Network 200 of FIG. 2 illustrates a hybrid telecommunication environment including both a traditional public switched telephone network and a packet-switched data network, such as the Internet and Intranet networks, and apparatus bridging between the two. The elements illustrated in FIG. 2 are to facilitate an understanding of the invention. Not every element illustrated in FIG. 2 or described herein is necessary for the implementation or the operation of the invention.

Specifically, a packet-switched data network 202 comprises a network appliance 300, a plurality of processes 302-306, plurality of monitored devices 314a-n, external databases 310a-n, external services 312 represented by their respective TCP port, and a global network topology 220, illustrated conceptually as a cloud. One or more of the elements coupled to global network topology 220 may be connected directly through a dedicated connection, such as a T1, T2, or T3 connection or through an Internet Service Provider (ISP), such as America On Line, Microsoft Network, Compuserve, etc.

A gateway 225 connects packet-switched data network 202 to circuit switched communications network 204 which includes a central office 210 and one or more traditional telephone terminating apparatus 308a-n. Circuit switched communications network 204 may also include, although not shown, a traditional PSTN toll network with all of the physical elements including PBXs, routers, trunk lines, fiber optic cables, other central offices etc. Terminating apparatus 308a-n may be implemented with either a digital or analog telephone or any other apparatus capable of receiving a call such as modems, facsimile machines, cellular telephones, etc., such apparatus being referred to collectively hereinafter as a terminating apparatus, whether the network actually terminates. Further, the PSTN network may be implemented as either an integrated services digital network (ISDN) or a plain old telephone service (POTS) network.

Each network consists of infrastructure including devices, systems, services and applications. Manageable network components utilize management mechanisms that follow either standard or proprietary protocols. Appliance 300 supports multiple interfaces to manageable devices from various points within its architecture, providing the flexibility to monitor both types of network components.

Components that can be managed using standard or public protocols (including items such as routers, switches, servers, applications, wireless devices, IP telephony processes, etc.) are designed under the premise that such components would reside in networks where a network management system is deployed. Such devices typically contain a MIB (Management Information Base), which is a database of network management information that is used and maintained by a common network management protocol such as SNMP (Simple Network Management Protocol). The value of a MIB object can be retrieved using SNMP commands from the network management system. Appliance 300 monitors the raw status events from such infrastructure directly using various standard protocol queries through a Status Poller 330 and a Trap Receiver 332, as explained hereinafter.

Network components that were not designed with network management applications in mind may have internal diagnostics capabilities that make it possible to generate an alarm or other data log. This data may be available via an interface and/or format that is proprietary in nature. Such systems may also have the ability to generate log files in text format, and make them available through supported interfaces such as e-mail. If event processing capability is needed, appliance 300 can monitor such network components through custom status plug-in modules.

Network Appliance Overview

In the illustrative embodiment, except for specific interface hardware, network appliance 300, referred to hereafter simply as "appliance 300", may be implemented as part of an all software application which executes on a computer architecture similar to that described with reference to FIG. 1. As illustrated in FIGS. 3-5, appliance 300 can communicate either directly or remotely with any number of devices or processes, including worldwide web processes 302, a Personal Digital Assistant 304, an e-mail reader process 306, a telephone 308, e.g., either a traditional PSTN telephone or an IP-enabled telephony process 311, and/or a pager apparatus 310. In addition, appliance 300 can communicate either directly or remotely with any number of external management applications 312 and monitored devices 314. Such communications may occur utilizing the network environment illustrated in FIG. 2 or other respective communication channels as required by the receiving device or process.

Appliance 300 monitors network objects, locates the source of problems, and facilitates diagnostics and repair of network infrastructure across the core, edge and access portions of the network. In the illustrative embodiment, appliance 300 comprises a status monitoring module 318, a performance monitoring module 316, a decision engine 324, a case management module 326 and database 348. The implementations of these modules as well as their interactions with each other and with external devices are described hereafter in greater detail.

The present invention uses a priori knowledge of devices to be managed. For example, a list of objects to be monitored may be obtained from a Domain Name Server. The desired objects are imported into the appliance 300. The relationships between imported objects may be entered manually or detected via an existing automated process application. In accordance with the paradigm of the invention, any deviation from the imported network configuration is considered a fault condition requiring a modification of the source data. In this manner the network management appliance 300 remains in synchronization with the source data used to establish the network configuration.

Status Monitoring Module

A Status Monitoring Module 318 comprises a collection of processes that perform the activities required to dynamically maintain the network service level, including the ability to quickly identify problems and areas of service degradation. Specifically, Status Monitoring Module 318 comprises Status Poller Module 330, On-Demand Status Poller 335, Status Plug-Ins 391, Bulk Plug-In Poller 392, Bulk UDP Poller 394, Bulk ifOperStatus Poller 396, Bulk TCP Poller 398, Bulk ICMP Poller 397, Trap Receiver 332, Status View Maintenance Module 385, and Status Maps and Tables Module 387.

Polling and trapping are the two primary methods used by appliance 300 to acquire data about a network's status and health. Polling is the act of asking questions of the monitored objects, i.e., systems, services and applications, and receiving an answer to those questions. The response may include a normal status indication, a warning that indicates the possibility of a problem existing or about to occur, or a critical indication that elements of the network are down and not accessible. The context of the response determines whether further appliance 300 action is necessary. Trapping is the act of listening for a message (or trap) sent by the monitored object to appliance 300. These trap messages contain information regarding the object, its health, and the reason for the trap being sent.

A plurality of plug-ins and pollers provide the comprehensive interface for appliance 300 to query managed objects in a network infrastructure. Such queries result in appliance 300 obtaining raw status data from each network object, which is the first step to determining network status and health. The various plug-ins and pollers operate in parallel, providing a continuous and effective network monitoring mechanism. Pollers may utilize common protocols such as ICMP (Ping), SNMP Get, Telnet, SMTP, FTP, DNS, POP3, HTTP, HTTPS, NNTP, etc. As a network grows in size and complexity, the intelligent application of polling and trapping significantly enhances system scalability and the accuracy of not only event detection, but also event suppression in situations where case generation is not warranted.

Status Poller

Fault detection capability in appliance 300 is performed by Status Poller 330 and various poller modules, working to effectively monitor the status of a network. Status Poller 330 controls the activities of the various plug-ins and pollers in obtaining status information from managed devices, systems, and applications on the network. FIG. 6 illustrates the status flow between network appliance 300 and external network elements. Status Poller 330 periodically polls one or more monitored devices 314A-N. Status Poller 330 generates a fault poll query to a monitored device 314 and receives, in return, a fault poll response. The fault poll queries may be in the form of any of an ICMP Echo, SNMP Get, TCP Connect or UDP Query. The fault poll response may be in the form of any of an ICMP Echo Reply, SNMP Response, TCP Ack or UDP Response. Status Poller 330 may also receive a fault data request in URL form from web process 302. In response, Status Poller 330 generates and transmits fault data in HTML format to web process 302. Status Poller 330 generates decision requests for decision engine 334 in the form of messages. In addition, Status Poller 330 receives external data from an external management application 312. Trap Receiver 332 receives device generated fault messages from monitored devices 314. Both Trap Receiver 332 and Status Poller 330 generate decision requests for decision engine 334 in the form of messages.

Status Poller 330 determines the needed poll types, segregates managed objects accordingly, and batch polls objects where possible. A Scheduler 373 triggers the Status Poller 330 to request polling at routine intervals. During each polling cycle, each monitored object is polled once. If any objects test critical, all remaining normal objects are immediately polled again. A Dependency Checker module which is part of the Root Cause Analysis Module determines which objects have changed status from the last time the Status Poller 330 was run, and determines, using the current state objects and the parent/child relation data, which objects are "dependency down" based on their reliance on an upstream object that has failed. This process repeats until there are no new critical tests found. Once the polling cycle is stable, a "snapshot" of the network is saved as the status of the network until the next polling cycle is complete. The network status information obtained is written into database 352 for use by other processes, such as the Decision Engine 334 when further analysis is required.
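The polling cycle just described (poll everything once, re-poll the remaining normal objects whenever new critical results appear, then mark objects behind a failed parent as "dependency down") could be sketched as below. This is an illustration under stated assumptions: `poll` is a caller-supplied hypothetical callable returning True when an object answers, each object has at most one parent, and `max_rounds` is an added safety bound not mentioned in the text.

```python
from typing import Callable, Dict, Iterable, Optional, Set


def has_critical_ancestor(obj: str, parents: Dict[str, Optional[str]],
                          critical: Set[str]) -> bool:
    """True if any upstream object in the parent chain is critical."""
    ancestor = parents.get(obj)
    while ancestor is not None:
        if ancestor in critical:
            return True
        ancestor = parents.get(ancestor)
    return False


def run_polling_cycle(objects: Iterable[str],
                      parents: Dict[str, Optional[str]],
                      poll: Callable[[str], bool],
                      max_rounds: int = 10) -> Dict[str, str]:
    """Poll until no new critical results appear, then return a snapshot."""
    objects = list(objects)
    critical: Set[str] = set()
    to_poll = objects
    for _ in range(max_rounds):
        newly_critical = {obj for obj in to_poll if not poll(obj)}
        if not newly_critical:
            break  # the cycle is stable
        critical |= newly_critical
        # Re-poll only the objects that still look normal.
        to_poll = [obj for obj in objects if obj not in critical]

    snapshot = {}
    for obj in objects:
        if obj not in critical:
            snapshot[obj] = "normal"
        elif has_critical_ancestor(obj, parents, critical):
            snapshot[obj] = "dependency down"
        else:
            snapshot[obj] = "critical"
    return snapshot
```

The returned dictionary plays the role of the "snapshot" that is kept until the next cycle completes; a real poller would also persist it to the database.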

Polling a network for status information is an effective method of data gathering and provides a very accurate picture of the network at the precise time of the poll; however, it can only show the state of the network for that moment of time. Network health is not static. A monitored object can develop problems just after it has been polled and reflected a positive operational result. Moreover, this changed status will not be known until the device is queried during the next polling cycle. For this reason appliance 300 also incorporates the use of the Trap Receiver 332 to provide near real-time network status details.

Trap Receiver

A trap is a message sent by an SNMP agent to appliance 300 to indicate the occurrence of a significant event. An event may be a defined condition, such as a link failure, device or application failure, power failure, or a threshold that has been reached. Trapping provides a major incremental benefit over the use of polling alone to monitor a network. The data is not subject to an extended polling cycle and is as real-time as possible. Traps provide information on only the object that sent the trap, and do not provide a complete view of network health. Appliance 300 receives the trap message via Trap Receiver 332 immediately following the event occurrence. Trap Receiver 332 sends the details to Status View Maintenance Module 385, which requests the Status Poller 330 to query the network to validate the event and locate the root cause of the problem. Confirmed problems are passed to Case Management Module 326 to alert network management personnel.

The On-Demand Status Poller 335 provides status information to Decision Engine 334 during the verification stage. Unlike the Status Poller 330, On-Demand Status Poller 335 only polls the objects requested by the Decision Engine 334. Since this is usually a small subset of objects, the status can typically be found more quickly. The responses from these polls are fed back to the Decision Engine 334 for further processing and validation.
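The verification stage can be pictured as a small gate between a reported fault and case creation. The following is a rough sketch only; `on_demand_poll` is a hypothetical callable that returns True when the object answers the verification poll, and the return strings are invented labels.

```python
from typing import Callable, Set


def handle_decision_request(obj: str,
                            reported_status: str,
                            on_demand_poll: Callable[[str], bool],
                            open_cases: Set[str]) -> str:
    """Verify a reported fault with a targeted poll before opening a case."""
    if reported_status == "normal":
        return "ignored"
    # Poll only the object named in the request, not the whole network.
    if on_demand_poll(obj):
        return "suppressed"  # the fault could not be confirmed
    if obj in open_cases:
        return "already open"
    open_cases.add(obj)
    return "case opened"


cases: Set[str] = set()
result = handle_decision_request("router-1", "critical", lambda o: False, cases)
```

Because only the requested object is polled, a transient trap or a single lost poll reply does not by itself produce a case.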

The Status View Maintenance Module 385 provides a gateway function between the Status Poller 330 and the Root Cause Analysis and Decision Engine Modules. The Status View Maintenance Module 385 controls the method by which network status information is created, maintained, and used. It serves as the primary interface for the depiction of network status details in the Status Maps and Status Table 387. Detailed object status information is presented through four (4) statuses: raw, dependency, decision, and case.

The Status Maps and Tables Module 387 is used to generate representations of complex relationships between network devices, systems, services and applications. Status Maps and Tables Module 387 works in conjunction with web server application 381 using known techniques and the HTML language to provide a web accessible user interface to the data contained in database 352. A Status Map depicts the precise view of managed objects and processes as defined during the implementation process. The Status Map provides a fast and concise picture of current network issues, providing the ability to determine the specific source of network failure, blockage or other interference. Users can zoom to the relevant network view, and launch an object-specific Tools View that assists in the diagnostics and troubleshooting process and may include links to third party management tools, such as Cisco Resource Manager Essentials (RME), etc.

A Status Table enables a tabular view of managed network infrastructure. All managed network components 314 can be displayed individually, or assembled under categories according to device type, location, or their relationship to the monitoring of Groups of objects representing complete processes or other logical associations. As described in the User Interface section hereafter, a series of unique status icons clearly depict the operational state of each object, with the option to include more comprehensive status views including greater details on the various process elements for managed objects.

Status Plug-Ins/Bulk Pollers

As will be understood by those skilled in the arts, a plug-in, as used herein, is a file containing data used to alter, enhance, or extend the operation of a parent application program. Plug-ins facilitate flexibility, scalability, and modularity by taking the input from a proprietary product and interfacing it with the intended application program. Plug-in modules typically interface with Application Program Interfaces (API) in an existing program and prevent an application publisher from having to build different versions of a program or include numerous interface modules in the program. In the present invention plug-ins are used to interface the status poller 335 with monitored objects 314.

The operation of plug-ins and bulk pollers is conducted at routine intervals by the Status Poller Module 330, and, on an as-needed basis, by the request of the On-Demand Status Poller Module 335. In the illustrative embodiment, the primary status plug-ins and pollers include Status Plug-ins 391, Bulk Plug-In Poller 392, Bulk UDP Poller 394, Bulk ifOperStatus Poller 396, Bulk TCP Poller 398 and Bulk ICMP Poller 397.

Status Plug-Ins 391 conduct specific, individual object tests. Bulk Plug-In Poller 392 makes it possible to conduct multiple simultaneous tests of plug-in objects. Unlike many network management systems that rely solely on individual object tests, the Bulk Plug-In Poller 392 enables a level of monitoring efficiency that allows appliance 300 to effectively scale to address larger network environments, including monitoring via SNMP (Simple Network Management Protocol). Used almost exclusively in TCP/IP networks, SNMP provides a means to monitor and control network devices, and to manage configurations, statistics collection, performance, and security.

Bulk UDP Poller 394 is optimized to poll for events relating to UDP (User Datagram Protocol) ports only. UDP is the connectionless transport layer protocol in the TCP/IP protocol stack. UDP is a simple protocol that exchanges datagrams without acknowledgments or guaranteed delivery, requiring that error processing and retransmission be handled by other protocols. Bulk UDP Poller 394 permits multiple UDP polls to be launched within the managed network.

Bulk ifOperStatus Poller 396 monitors network infrastructure for the operational status of interfaces. Such status provides information that indicates whether a managed interface is operational or non-operational.

Bulk TCP Poller 398 polls for events relating to TCP (Transmission Control Protocol) ports only. Part of the TCP/IP protocol stack, this connection-oriented transport layer protocol provides for full-duplex data transmission. Bulk TCP Poller 398 permits multiple TCP polls to be launched within the managed network.

Bulk ICMP Poller 397 performs several ICMP (ping) tests in parallel. Bulk ICMP Poller 397 can initiate several hundred tests without waiting for any current tests to complete. Each test consists of an ICMP echo-request packet sent to an address. When an ICMP echo-reply returns, the raw status is deemed normal. Any other response or no answer within a set time generates a new echo-request. If an ICMP echo-reply is not received after a set number of attempts, the raw status is deemed critical. The time between requests (per packet and per address), the maximum number of requests per address, and the amount of time to wait for a reply are tunable by the network administrator using appliance 300.
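
The retry rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the appliance's implementation: the function name raw_status, the injected send_echo probe and the parameter names are all hypothetical, and a real bulk poller would send actual ICMP packets and keep hundreds of addresses in flight at once rather than testing them one after another.

```python
from typing import Callable


def raw_status(address: str,
               send_echo: Callable[[str, float], bool],
               max_requests: int = 3,
               reply_timeout: float = 1.0) -> str:
    """Return 'normal' on the first echo-reply, 'critical' once attempts run out.

    send_echo is a hypothetical probe: it sends one echo-request to `address`,
    waits up to `reply_timeout` seconds and returns True only for an echo-reply.
    """
    for _ in range(max_requests):
        if send_echo(address, reply_timeout):
            return "normal"
        # Any other response, or no answer in time, triggers a new echo-request.
    return "critical"


# Simulated probe for illustration: the host answers only the third request.
attempts = {"count": 0}


def flaky_probe(address: str, timeout: float) -> bool:
    attempts["count"] += 1
    return attempts["count"] >= 3


status_up = raw_status("192.0.2.10", flaky_probe, max_requests=3)
status_down = raw_status("192.0.2.11", lambda addr, timeout: False, max_requests=3)
```

Passing the probe in as a callable keeps the retry policy separate from the packet handling, which is also what makes the tunable parameters (attempt count, reply timeout) easy to see.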

Performance Monitoring Module

The primary component of performance monitoring module 316 is performance poller 322. Performance poller 322 is the main device by which appliance 300 interacts with monitored device(s) 314a-n and is responsible for periodically monitoring such devices and reporting performance statistics thereon. Performance poller 322 is operatively coupled to application(s) 312, monitored device(s) 314, decision engine 334 and web process(es) 302. FIG. 10 illustrates the communication flow between the performance poller 322 and decision engine 334, as well as external elements. Performance poller 322 polls monitored device(s) 314a-n periodically for performance statistics. Specifically, performance poller 322 queries each device 314 with an SNMP Get call in accordance with the SNMP standard. In response, the monitored device 314 provides a performance poll response to performance poller 322 in the form of an SNMP Response call, also in accordance with the SNMP standard. Based on the results of the performance poll response, performance poller 322 generates and transmits decision requests to decision engine 334 in the form of messages. Such decision requests may be generated when i) a specific performance condition occurs, ii) no response is received within a predefined threshold, or iii) other criteria are satisfied. Decision engine 334 is described in greater detail hereinafter. In addition, one or more external management applications 312 provide external management data to performance poller 322 in the form of messages.

In the illustrative embodiment, performance poller 322 may have an object-oriented implementation. Performance poller 322 receives external data from applications 312 through message methods. Such external applications may include Firewalls, Intrusion Detection Systems (IDS), Vulnerability Assessment tools, etc. Poller 322 receives performance data requests from web process 302 via Uniform Resource Locator (URL) methods. In response, poller 322 generates performance data for web process 302 in the form of an HTML method. In addition, poller 322 receives performance poll response data from a monitored device 314 in the form of an SNMP response method. As output, poller 322 generates a performance poll query to a monitored device 314 in the form of an SNMP Get method. Performance poller 322 generates decision requests to decision engine 334, in the form of a message.

Performance Poller 322 obtains performance data from network devices and applications, creating a comprehensive database of historical information from which performance graphs are generated through the user interface of appliance 300, as described hereafter. Such graphics provide network management personnel with a tool to proactively monitor and analyze the performance and utilization trends of various devices and applications throughout the network. In addition, the graphs can be used for diagnostics and troubleshooting purposes when network issues do occur.

A series of device-specific Performance Plug-Ins 321 serve as the interface between the Performance Poller 322 and managed network objects. The performance criteria monitored for each component begin with a best-practices approach to network management. This approach defines what elements within a given device or application will be monitored to provide for the best appraisal of performance status. The managed elements for each device or application type are flexible, allowing for the creation of a management environment that reflects the significance and criticality of key infrastructure. For instance, should there be an emphasis to more closely monitor the network backbone or key business applications such as Microsoft Exchange, a greater focus can be placed on management of this infrastructure by increasing the performance criteria that are monitored. Likewise, less critical infrastructure can be effectively monitored using a smaller subset of key performance criteria, while not increasing the management complexity caused by showing numerous graphs that are not needed.

Once the performance management criteria are established, the Performance Plug-Ins are configured for each managed device and application. Performance elements monitored may include, but are not limited to, such attributes as CPU utilization, bandwidth, hard disk space, memory utilization, or temperature. Appliance 300 continuously queries managed or monitored objects 314 at configured intervals of time, and the information received is stored as numeric values in a database.

Event Processing

The appliance 300 architecture comprises sophisticated event processing capability that provides for intelligent analysis of raw network event data. Instead of accumulating simple status detail and reporting all network devices that are impacted, appliance 300 attempts to establish the precise cause of a network problem, delivering the type and level of detail that network management personnel require to quickly identify and correct network issues. The primary components of event processing capability in appliance 300 are the Root Cause Analysis Module 383 and the Decision Engine 334.

Root Cause Analysis

When a change in network status is observed that may indicate an outage or other issue, the Status Poller 330 presents the event to the Root Cause Analysis module 383 for further evaluation. During the course of a network problem or outage, this may consist of tens or even hundreds of status change event messages. These numerous events may be the result of a single or perhaps a few problems within the network.

The Root Cause Analysis Module 383 works directly with the Decision Engine 334 during the event evaluation process. Appliance 300 first validates the existence of an event and then identifies the root cause responsible for that event. This process entails an evaluation of the parent/child relationships of the monitored object within the network. The parent/child relationships are established during the implementation process of appliance 300, where discovery and other means are used to identify the managed network topology. A parent object is a device or service that must be functional for a child device or service to function. A child object is a device or service that has a dependency on a parent device or service to be functional. Within a network environment a child object can have multiple parent objects, and a parent object can have multiple child objects. In addition, the parent and child objects to a node or monitored object may be located at the same or different layers of the OSI network protocol model across the computer network. Because of this, a Dependency Checker function within Root Cause Analysis Module 383 performs a logical test on every object associated with a monitored object in question to isolate the source of the problem. When appliance 300 locates the highest object in the parent/child relationship tree that is affected by the event, it has found the root cause of the problem.
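
The idea of locating the highest affected object in the parent/child tree can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the Dependency Checker itself: the function name root_causes, the dictionary layout and the sample topology are hypothetical, and the sketch treats any down object with at least one down parent as a symptom, ignoring redundant paths.

```python
from typing import Dict, List, Set


def root_causes(parents: Dict[str, List[str]], down: Set[str]) -> Set[str]:
    """Return the down objects that have no down parent.

    `parents` maps each object to the objects it depends on (an object may have
    several parents). An object whose parent is also down is treated as a
    symptom, so only the highest affected objects are reported.
    """
    return {obj for obj in down
            if not any(parent in down for parent in parents.get(obj, []))}


# Hypothetical topology: two servers behind a switch, which sits behind a router.
topology = {
    "router": [],
    "switch": ["router"],
    "web-server": ["switch"],
    "mail-server": ["switch"],
}

causes = root_causes(topology, down={"switch", "web-server", "mail-server"})
single = root_causes(topology, down={"web-server"})
```

Three status-change events collapse into one reported cause here, which is the noise reduction the text attributes to root cause analysis.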

Case Management System

The Case Management system 336 is an integral component of appliance 300 and provides service management functionality. Whereas the Decision Engine 334 works behind the scenes to identify and validate faults, Case Management system 336 is the interface and tool used to manage information associated with the state of the network. Case Management system 336 provides a process tool for managing and delegating workflow as it relates to network problems and activities. The Case Management generates service cases (or trouble tickets) for presentation and delivery to network management personnel.

Case management system 336 comprises a CMS application module 350, a database 352, a notification engine 356 and an escalation engine 354, as illustrated. CMS application module 350 comprises one or more applications and performs the CMS functionality, as explained hereinafter. CMS applications 350 receive CMS requests, in the form of URL identifiers, from decision engine 334. In response, CMS applications 350 generate and transmit notification requests to notification engine 356, in the form of messages. CMS applications 350 generate and transmit CMS data to a worldwide web process 302 in the form of HTML data. Database 352 receives CMS queries from CMS applications 350 in the form of messages and generates in response thereto a CMS response in the form of a message, as well. In addition, database 352 receives notification queries from notification client 364, in the form of messages, and generates, in response thereto, notification responses to notification client 364 in the form of messages as well.

Case Management system 336 accommodates Auto cases and Manual cases. Cases passed to the Case Management System from the Decision Engine Module appear as AutoCases. These system-generated cases are associated with a network problem. Appliance 300 has determined that the node referenced in the case is a device responsible for a network problem, based on the findings of Root Cause Analysis and the Decision Engine 334. The Auto Case is automatically assigned an initial priority level that serves until the case is reviewed and the priority is modified to reflect the significance of the problem relative to the network impact and other existing cases being handled.

Cases entered into Case Management system 336 by the network manager or network management personnel are called Manual Cases. This supports the generation, distribution, and tracking of network work orders, or can aid in efforts such as project management. Using a web browser, personnel can obtain the case data from either on-site or remote locations, and access a set of device-specific tools for diagnostics and troubleshooting. Unlike other general-purpose trouble ticketing systems, the case management capabilities of appliance 300 are specifically optimized and oriented to the requirements of network management personnel. This is reinforced in both the types and level of information presented, as well as the case flow process that reflects the specific path to network issue resolution. Opening a case that has been generated shows the comprehensive status detail such as the impacted network node, priority, case status, description, and related case history. The network manager or other personnel can evaluate the case and take the action that is appropriate. This may include assigning the case to a network engineer for follow-up, or deleting the case if a device has returned to fully operational status.

The main Case Management screen of the user interface provides a portal through web server application 381 from which all current case activity can be viewed, including critical cases, current priority status, and all historical cases associated to the specific object. Case data is retained in appliance 300 to serve as a valuable knowledge-base of past activity and the corrective actions taken. This database is searchable by several parameters, including the ability to access all cases that have pertained to a particular device. A complete set of options is available to amend or supplement a case including: changing case priority; setting the case status; assigning or re-assigning the case to specific personnel; correlating the case to a specific vendor case or support tracking number, and updating or adding information to provide further direction on actions to be taken or to supplement the case history.

Escalation engine 354 tracks escalations and requests notifications as needed. Escalation engine 354 generates and transmits escalation queries to database 352 in the form of messages and receives, in response thereto, escalation responses in the form of messages. In addition, escalation engine 354 generates and transmits notification requests, in the form of messages, to notification server 360 of notification engine 356. Automated policy-based and roles-based case escalation processes ensure that case escalations are initiated according to defined rules and parameters. Cases not responded to within pre-established time periods automatically follow the escalation process to alert management and other networking personnel of the open issue.

Notification Engine

When a new auto case or manual case is generated or updated, appliance 300 initiates a notification process to alert applicable network personnel of the new case. This function is provided through Notification Engine 356. Appliance 300 utilizes a configurable notification methodology that can map closely to an organization's specific needs and requirements. Appliance 300 incorporates rules- and policy-based case notification by individual, role, or Group, and includes additional customizability based on notification type and calendar. Supported notification mechanisms include various terminal types supporting the receipt of standard protocol text messaging or e-mail, including personal computer, text pager, wireless Personal Digital Assistant (PDA), and mobile phones with messaging capability. The e-mail or text message may contain the important details regarding the case, per the notification content format established in system configuration.

As illustrated in FIG. 9, notification engine 356 comprises notification server 360, database 352, notification client 364, paging client 366, paging server 367, Interactive Voice Response (IVR) server 368 and SMTP mail module 369. Notification engine 356 generates notifications via e-mail and pager as necessary. Notification server 360 accepts notification requests, determines notification methods, and stores notifications in database 352. As stated previously, notification server 360 receives notification requests from CMS applications 350. Notification server 360 generates and transmits Point Of Contact (POC) queries in the form of messages to database 352 and receives, in response thereto, POC responses, also in the form of messages. Notification client 364 generates notifications using appropriate methods. Notification client 364 generates and transmits notification queries, in the form of messages, to database 352 and receives in response thereto notification responses, also in the form of messages. In addition, notification client 364 generates and transmits page requests in the form of messages to paging client 366. Notification client 364 further generates, in the form of messages, IVR requests to IVR server 368 and e-mail messages to SMTP mail module 369. Paging client 366 receives page requests from notification client 364 and forwards the page requests on to paging server 367. Paging server 367 generates pager notifications, in the form of messages, to a pager device 310. Paging server 367 accesses a TAP terminal via a modem or uses the Internet to forward the pager notification. IVR server 368 receives IVR requests and calls phone 308 via an IVR notification in the form of a telephone call which may be either packet-switched or circuit-switched, depending on the nature of the terminating apparatus and the intervening network architecture. SMTP mail module 369 processes notifications via e-mail and acts as a transport for paging notifications. SMTP mail module 369 generates messages in the form of e-mail notifications to e-mail process 306 and PDA notifications to personal digital assistant device 304.

Decision Engine

Decision Engine 334 is an extensible and scalable system for maintaining programmable Finite State Machines created within the application's structure. Decision Engine 334 is the portion of system architecture that maintains the intelligence necessary to receive events from various supporting modules, for the purpose of verifying, validating and filtering event data. Decision Engine 334 is the component responsible for reporting only actual confirmed events, while suppressing events that cannot be validated following the comprehensive analysis process.

Referring to FIG. 7, decision engine 334 comprises, in the illustrative embodiment, a queue manager 340, decision processor 344, case generator 346, database 352 and one or more plug-in modules 342. As illustrated, decision engine 334 receives decision requests from any of Performance poller 322, Status Poller 330 or Trap Receiver 332, in the form of messages. A queue manager 340 manages the incoming decision requests in a queue structure and forwards the requests to decision processor 344 in the form of messages. Decision processor 344 verifies the validity of any alarms and thresholds and forwards a generation request to case generator 346 in the form of a message. Case generator 346, in turn, compiles cases for verification and database information and generates a CMS request which is forwarded to case management system 336, described in greater detail hereinafter.

In addition, decision processor 344 generates and transmits device queries in the form of messages to database 352. In response, database 352 generates a device response in the form of a message back to decision processor 344. Similarly, decision processor 344 generates and transmits verification queries in the form of messages to plug-in module 342. In response, module 342 generates a verification response in the form of a message back to decision processor 344. Plug-in module 342 generates and transmits verification queries in the form of messages to a monitored device 314. In response, monitored device 314 generates a verification response in the form of a message back to plug-in module 342.

Decision engine 334 may be implemented in the C programming language for the Linux operating system, or with other languages and/or operating systems. Decision engine 334 primarily functions to accept messages, check for problem(s) identified in the message, and attempt to correct the problem. If the problem cannot be corrected, the decision engine 334 opens a "case". In the illustrative embodiment, decision engine 334 may be implemented as a state-machine created within a database structure that accepts messages generated by events such as traps and changes state with messages. If the decision engine reaches certain states, it opens a case. The main process within the decision engine state-machine polls a message queue and performs the state transitions and the tasks associated with the transitions. Events in the form of decision requests are processed by the decision engine/virtual state-machine. The decision module/virtual state-machine processes the request and initiates a verification query. The verification response to the verification query is processed by the decision module/virtual state-machine. Based on the configuration of the decision module/state-machine, the decision module/state-machine initiates a case management module case request. Events are polls, traps, and threshold violations generated by the status poller, fault trapper, and performance poller respectively. As shown in FIG. 11, decision engine 334 comprises several continuously running processes or modules including populate module 380, command module 382, decision module 384, variable module 386, on demand status poller module 388, and timer module 390, described in greater detail hereinafter. These processes may launch new processes when required. In the illustrative embodiment, these processes share 415 database tables in database 352 as a means for communication by accessing and manipulating the values within the database. In FIGS. 4-6 and 10, the functions of Decision Engine 334 are performed by command module 382, decision module 384, variable module 386, on demand status poller module 388, and timer module 390, described in greater detail hereinafter. In FIG. 7, the functions of Decision Processor 344 are performed by decision module 384, variable module 386, on demand status poller module 388, and timer module 390. The functions of Case Generator 346 are performed by command module 382.
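
A message-driven state machine of the kind described above might look roughly like the following Python sketch. Everything in it is an assumption made for illustration: the state names other than "ground", the message names, the transition table and the function run are hypothetical, and the sketch keeps its queue and states in memory whereas the text describes database tables shared between processes.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

# (current state, message) -> (next state, task to perform); all names hypothetical.
TRANSITIONS: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("ground", "status_critical"): ("verifying", "start_verification"),
    ("verifying", "verify_ok"): ("ground", "none"),
    ("verifying", "verify_failed"): ("case_open", "open_case"),
    ("case_open", "status_normal"): ("ground", "none"),
}


def run(queue: Deque[Tuple[str, str]]) -> Tuple[Dict[str, str], List[str]]:
    """Drain the message queue; return final per-object states and opened cases."""
    states: Dict[str, str] = {}
    cases: List[str] = []
    while queue:
        obj, message = queue.popleft()
        current = states.get(obj, "ground")  # every object starts in "ground"
        # Messages with no matching transition leave the state unchanged.
        next_state, task = TRANSITIONS.get((current, message), (current, "none"))
        states[obj] = next_state
        if task == "open_case":
            cases.append(obj)
    return states, cases


messages = deque([
    ("router-1", "status_critical"),
    ("switch-7", "status_critical"),
    ("router-1", "verify_failed"),  # problem confirmed: a case is opened
    ("switch-7", "verify_ok"),      # false alarm: back to ground, no case
])
final_states, opened = run(messages)
```

Only the verified failure produces a case, which mirrors the stated goal of reporting confirmed events while suppressing those that cannot be validated.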

Populate Module

The populate module 380 creates and initializes the state machine(s) to the "ground" state for each managed object 314 whenever a user commits changes to their list of managed objects. In the illustrative embodiment, unless purposefully overridden, the populate module 380 will not overwrite the current machine state for a managed object. Otherwise, notifications could be missed. Also, the deletion of an object upon a commit results in the deletion of all state machines, timers, and variables associated with the object to prevent unused records and clutter in database 352.

Command Module

The command module 382 retrieves records from the command table, performs the task defined in a database record, and, based on the result returned by the command, places a message in the message queue, i.e. the Message Table. In the illustrative embodiment, a command can be any executable program, script or utility that can be run using the system( ) library function.

In the illustrative embodiment, the command module 382 may be implemented in the C programming language as a function of a Decision Engine object and perform the functions described in the pseudo code algorithm set forth below, in which any characters following the "#" symbol on the same line are comments:

while TRUE # loop forever
    retrieve the record that has been sitting in the commands queue table for the longest period of time
    use the system command (or some other as yet to be determined method) to execute the command found in the action field of the current record. The argument list for action will be built using the values found in the host, poll, instance, and argument fields of the current record.
    upon completion of the command, if the message found in the message field is not blank, put the message into the message queue
#end loop forever
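
The pseudo code above can be turned into a runnable Python sketch. It is an illustration only: the names CommandRecord, process_commands and run_with_subprocess are hypothetical, in-memory deques stand in for the command and message tables, the loop ends when the queue is empty instead of running forever, and subprocess.run is used where the original mentions the system( ) library function.

```python
import shlex
import subprocess
from collections import deque
from typing import Callable, Deque, List, NamedTuple


class CommandRecord(NamedTuple):
    action: str        # executable to run
    host: str
    poll: str
    instance: str
    argument: str
    message: str = ""  # message to queue on completion; blank means none


def run_with_subprocess(argv: List[str]) -> int:
    """Default executor: run the command and return its exit status."""
    return subprocess.run(argv, check=False).returncode


def process_commands(commands: Deque[CommandRecord],
                     message_queue: Deque[str],
                     execute: Callable[[List[str]], int] = run_with_subprocess) -> None:
    """Run queued commands oldest-first, queueing each non-blank message."""
    while commands:                  # the original loops forever
        record = commands.popleft()  # the oldest record sits at the left
        argv = shlex.split(record.action) + [
            record.host, record.poll, record.instance, record.argument]
        execute(argv)
        if record.message:
            message_queue.append(record.message)


# Demonstration with a fake executor so that nothing is actually run.
executed: List[List[str]] = []


def fake_execute(argv: List[str]) -> int:
    executed.append(argv)
    return 0


pending = deque([
    CommandRecord("ping_check", "10.0.0.1", "icmp", "0", "-c1", "ping_done"),
    CommandRecord("disk_check", "10.0.0.2", "disk", "1", "/var"),
])
out_messages: Deque[str] = deque()
process_commands(pending, out_messages, execute=fake_execute)
```

Passing the argument list as separate items, rather than building one shell string, sidesteps quoting problems when host or argument values contain spaces.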

Decision Module

The decision module 384 retrieves messages from the message queue, determines which state machine the message is intended for, changes the state of the machine based on the content of the message, and "farms out" to the other modules the tasks associated with the state change. In the illustrative embodiment, a task has associated therewith a number of optional components including a type, action, arguments, condition and output message. A brief description of each task component is shown below:

type: identifies which module, i.e., command, variable, timer, or on demand status poller, is to perform the task. The action of some types of tasks may be handled by the decision module and not sent to another module. For example, a message with the type "say" is just a request to put a new message into the message queue. The decision module handles such tasks.

action: the specific action the module is to take. For example, increment a counter or start a timer.

arguments: any arguments required to complete the action.

condition: if present, identifies a condition that must be met before the associated message can be put into the message queue. A condition may consist of a comparison between the value of a variable stored in the variables table and a constant value or the value of another variable that evaluates as either true or false. An example condition would be "count>5", which means that the value of the value field in the variables table record where the value of the varName field is 'count' for the current object should be greater than five for a message to be put into the queue. Condition expressions may be of the form: <VAR_NAME COMPARISON_OPERATOR VALUE>[[AND|OR] [VAR_NAME COMPARISON_OPERATOR VALUE]] . . . By adhering to this format, the code that parses the condition expression will not have to be changed if the condition expression changes. Also, such format allows for arbitrarily complex condition expressions.

output message: the message to be put into the message queue upon completion of the task. The output message can be blank, indicating that there is no message to put into the message queue on completion of the task.

Since messages are deleted as they are taken or "popped" from the message queue, the messages may be logged to the log table in database 352 to provide a permanent record of message traffic.
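
A condition expression of the form described above could be evaluated along the following lines. This Python sketch rests on several assumptions that the text does not settle: the set of comparison operators, numeric-only values, and strict left-to-right evaluation of AND/OR without precedence. The names evaluate, OPS and COMPARISON are hypothetical, and a plain dictionary stands in for the variables table.

```python
import operator
import re
from typing import Dict, Optional, Union

Number = Union[int, float]

OPS = {
    "==": operator.eq, "!=": operator.ne,
    ">=": operator.ge, "<=": operator.le,
    ">": operator.gt, "<": operator.lt,
}
# One comparison: VAR_NAME COMPARISON_OPERATOR VALUE (a number or a variable name).
COMPARISON = re.compile(r"\s*(\w+)\s*(==|!=|>=|<=|>|<)\s*([\w.]+)\s*")


def _value(token: str, variables: Dict[str, Number]) -> Number:
    """Resolve a token to a variable's value, else parse it as a numeric constant."""
    if token in variables:
        return variables[token]
    return float(token)


def evaluate(condition: str, variables: Dict[str, Number]) -> bool:
    """Evaluate comparisons joined by AND/OR, strictly left to right."""
    parts = re.split(r"\s+(AND|OR)\s+", condition.strip())
    result: Optional[bool] = None
    joiner = ""
    for part in parts:
        if part in ("AND", "OR"):
            joiner = part
            continue
        match = COMPARISON.fullmatch(part)
        if match is None:
            raise ValueError(f"bad comparison: {part!r}")
        name, op, raw = match.groups()
        outcome = OPS[op](variables[name], _value(raw, variables))
        if result is None:
            result = outcome
        elif joiner == "AND":
            result = result and outcome
        else:
            result = result or outcome
    return bool(result)


simple = evaluate("count>5", {"count": 6})
combined = evaluate("count>5 AND retries<=3", {"count": 6, "retries": 4})
```

Because each comparison is parsed by the same pattern, adding another AND/OR clause to a condition needs no change to the parser, which is the benefit the text claims for the fixed format.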

In order to provide additional flexibility to the arguments field of the active_timers, command_queue, and variable_queue tables, the arguments field in the transition_functions and state_f