SLA Monitoring Approaches

objectiser Apr 13, 2012 7:16 AM

With the release of Switchyard 0.4, it is now possible to obtain service-related metrics: the number of times a service was invoked, the number of faults that occurred, and the minimum, maximum and average response times. In a subsequent release, it will also be possible to obtain this information at the granularity of the individual operations of a service, to determine which operations are not performing as expected.

As well as recording this information for each service, the information is also accumulated for the references used by each service. This enables an administrator to localise where any performance issue may be.

As part of the BAM work being undertaken, we need to be able to monitor SLAs against the service metrics. SLAs will generally want to ensure that the response time remains within expected limits, although it is possible that an organisation may also wish to ensure that a particular service is only invoked a certain number of times within a defined period. To allow a certain level of flexibility, CEP rules will be used both to encode the SLAs and to perform the monitoring.

This document examines three approaches that can be used to collect the service metric information from Switchyard and apply it to the CEP rules:

a) Periodically inspect Switchyard service metrics

Switchyard currently observes the 'completed exchange events' generated by the execution of services in Switchyard, and then aggregates the relevant information within an internal service metric model. Unless explicitly reset, the metrics are simply updated with each successive event.

When an administrator is investigating a performance issue, this level of information is useful, as it outlines the performance over a long period of time. The administrator can examine the service and its references to understand how long each component takes to execute.

However, SLAs tend to be based on performance over a particular time period, e.g. 1 second. Therefore, to obtain meaningful information, the SLA monitor would need to take successive readings of the accumulated metrics and use the change in 'average' response time, together with the change in invocation count, to infer the average response time within that period. As time goes on, this approach will become less accurate: the more exchanges that have been accumulated, the less each new exchange moves the overall average, so the change being measured becomes very small.
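The inference described above can be sketched as follows. This is only an illustration: the `Snapshot` type and `interval_average` function are hypothetical names, not part of Switchyard, and the assumption that the average might be reported in whole milliseconds is made purely to show how the accuracy can degrade.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    """One reading of the accumulated metrics (hypothetical shape)."""
    count: int          # invocations since the metrics were last reset
    average_ms: float   # average response time over all of those invocations


def interval_average(earlier: Snapshot, later: Snapshot) -> Optional[float]:
    """Infer the average response time of the exchanges between two readings.

    Returns None if no exchange completed between the readings.
    """
    delta_count = later.count - earlier.count
    if delta_count <= 0:
        return None
    # count * average recovers the total response time at each reading.
    total_earlier = earlier.count * earlier.average_ms
    total_later = later.count * later.average_ms
    return (total_later - total_earlier) / delta_count


# Early in the life of the service: ten slow exchanges are clearly visible.
early = interval_average(Snapshot(1000, 50.0), Snapshot(1010, 51.0))

# Much later: the same ten slow exchanges move the overall average by about
# 0.001 ms. If the average were only available in whole milliseconds (an
# assumption), the reading would still be 50 and the slow period is hidden.
late = interval_average(Snapshot(1_000_000, 50.0), Snapshot(1_000_010, 50.0))
```

In the first case the inferred value is 151 ms even though the overall average only moved from 50 to 51. In the second case the inferred value is 50 ms, so an SLA rule evaluating it would see nothing unusual.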

So the accumulated service metrics are not ideally suited to the task of SLA monitoring, although they are very useful as a means for an administrator to manually examine the performance of their services. Therefore, once an SLA violation has been reported, this information can be used to help localise where the problem actually is.

b) Event based pre-processing

As initially mentioned, the service metrics are accumulated based on events received from the Switchyard runtime indicating when an exchange has been completed.

Therefore a second approach could be similar to the current metric collection approach, except that the metrics would be reset at regular intervals, so that the metrics evaluated by the CEP rules accurately reflect the activity that occurred within that period. A single 'event' would then be presented to the CEP rules, reflecting the aggregated information over the defined time period.
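As a rough sketch of such a pre-processor, the following aggregates completed exchanges into fixed periods and produces one summary per period. The class and field names are hypothetical, and for simplicity a period is only closed when the first event of a later period arrives; a real implementation would more likely use a timer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PeriodSummary:
    """The single 'event' handed to the CEP rules for one period."""
    service: str
    period_start_ms: int
    count: int
    faults: int
    min_ms: int
    max_ms: int
    average_ms: float


class PeriodAggregator:
    def __init__(self, service: str, period_ms: int) -> None:
        self.service = service
        self.period_ms = period_ms
        self._period_start: Optional[int] = None
        self._durations: list = []
        self._faults = 0

    def on_exchange_completed(self, timestamp_ms: int, duration_ms: int,
                              fault: bool = False) -> Optional[PeriodSummary]:
        """Record one exchange; return the previous period's summary if it just closed."""
        summary = None
        if self._period_start is None:
            self._period_start = timestamp_ms - timestamp_ms % self.period_ms
        elif timestamp_ms >= self._period_start + self.period_ms:
            summary = self._summarise()
            # Reset so the next summary reflects only the new period.
            self._durations = []
            self._faults = 0
            self._period_start = timestamp_ms - timestamp_ms % self.period_ms
        self._durations.append(duration_ms)
        if fault:
            self._faults += 1
        return summary

    def _summarise(self) -> PeriodSummary:
        d = self._durations  # never empty: a period only opens on an event
        return PeriodSummary(self.service, self._period_start, len(d),
                             self._faults, min(d), max(d), sum(d) / len(d))


agg = PeriodAggregator("OrderService", period_ms=1000)
first = agg.on_exchange_completed(100, 20)
second = agg.on_exchange_completed(900, 40, fault=True)
closed = agg.on_exchange_completed(1200, 10)  # first event of the next period
```

The rule that receives `closed` only has to compare fields such as `average_ms` or `count` against a threshold, which is why temporal reasoning in the rule itself is less likely to be needed.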

The disadvantage of this approach is that the collection period for all services would probably need to be the same, and it would be very difficult to support SLA monitoring where the period may change at different times (not sure if realistic, but just highlighting possible limitations).

The advantage is that using this type of pre-processor would simplify the nature of the rules. It is less likely that temporal rules would be required.

c) Activity based

As mentioned in previous documents discussing governance plans, as well as the BAM design, there are two phases to the BAM work: the first is to implement SLA monitoring, and the second is full-blown activity monitoring.

The BAM design note discussed using two separate "Event Processor Networks", one for metrics and the other for activities, on the basis that they were collected as distinct information.

However another way to view the SLA monitoring capability is just as a subset of the general activity monitoring. Therefore, as the underlying information for the metrics is just events, these could be converted into activity events that are recorded using the more general activity monitoring framework. In this situation, the SLA monitoring would simply be performed as one part of the 'activities' Event Processor Network, along with any other processing that a customer would want to do.

The potential disadvantage is that it puts more work onto the CEP engine, as it will be responsible for accumulating all of the temporal information from the individual events in order to identify any violations.

The advantage of this approach is that it is a consistent use of the BAM infrastructure, rather than a separate approach just for SLA monitoring. It also means that the nature of the 'SLAs' is completely defined within the CEP rules, so (for example) if the time period of interest changes under certain circumstances, this can be described in the rules. It also means that the solution is not specific to the collection of stats from Switchyard, as it can be applied to any service that reports activity information to BAM.
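To give a feel for the temporal work that moves into the rules with this approach, here is a plain Python stand-in for a sliding-window rule over individual activity events. It is not CEP rule syntax, the names are hypothetical, and it assumes events arrive in timestamp order.

```python
from collections import deque


class SlidingWindowSla:
    """Flags a violation when the average response time over the most
    recent window_ms exceeds max_average_ms."""

    def __init__(self, window_ms: int, max_average_ms: float) -> None:
        self.window_ms = window_ms
        self.max_average_ms = max_average_ms
        self._events = deque()  # (timestamp_ms, duration_ms), oldest first
        self._total = 0

    def on_activity(self, timestamp_ms: int, duration_ms: int) -> bool:
        """Record one activity event; return True if the SLA is now violated."""
        self._events.append((timestamp_ms, duration_ms))
        self._total += duration_ms
        # Drop events that have fallen out of the window ending at this event.
        # The event just added is always inside the window, so the deque
        # cannot become empty here (given window_ms > 0).
        while self._events[0][0] <= timestamp_ms - self.window_ms:
            _, old_duration = self._events.popleft()
            self._total -= old_duration
        return self._total / len(self._events) > self.max_average_ms


# Each rule carries its own window, so periods can differ per service.
rule = SlidingWindowSla(window_ms=1000, max_average_ms=100)
results = [
    rule.on_activity(0, 50),      # average 50 over the window
    rule.on_activity(500, 200),   # average 125 over the window
    rule.on_activity(1600, 60),   # earlier events have expired, average 60
]
```

Because the window belongs to the rule rather than to the collector, changing the period of interest only means changing the rule, at the cost of the engine holding the individual events for the length of the window.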

Proposed approach:

Start working on (c) to evaluate the approach. If it looks unsuitable, then switch to (b).