SLA - Service Level Agreement

A guarantee of reliability or an unnecessary expense?
 

Not all companies decide to sign an SLA when deploying an IT system. Managers argue, for example, that they need to protect the budget from unnecessary spending. However, it is precisely the lack of an SLA that generates unforeseen and often much higher costs, caused by the loss of operational continuity and the absence of agreed procedures for resolving problems.
The losses are not only financial. Some of them, such as the loss of credibility with customers, are difficult to make up for.

1. What is the SLA
In business practice, the subject of the Service Level Agreement (SLA) is the provision of services aimed at ensuring the continuity of essential business processes. These processes are carried out mostly by the IT infrastructure, which consists of applications and devices.

The services rendered under the concluded SLA include:

  • Providing the ability to report incidents (including errors) through a report registration system
  • The guaranteed availability of specialists who are ready and able to solve the reported incidents within the specified timeframe
  • Ensuring the availability of technical support via telephone (hotline) for applications, including consultations on the operation of the system
  • System monitoring (measuring availability and performance, which makes it possible to detect performance losses before users notice them)

The parameters of an SLA usually define the required response times (the maximum time after which specialists begin working on the incident), the incident resolution times (e.g. the time to restore the availability of the application), and the contractual penalties for failure to comply with these parameters.

2. Benefits

Maintaining the process continuity
When the system fails, i.e. there is an incident concerning an error or a problem with the system performance, the provider will restore the continuity of the application operations within the time stipulated in the SLA. In practice, two times are distinguished:

  • Response time – the time from reporting the error until the confirmation of the registration in the reporting system
  • Repair time – the time from reporting the error until the restoration of the continuity of the process, excluding any time during which the report is suspended, e.g. while the Service Desk waits for a user’s response to a question.
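The two times defined above can be computed from the timestamps stored in a report tracking system. The following is a minimal sketch, assuming that each report records when it was submitted, confirmed, and resolved, and that suspensions are stored as non-overlapping (start, end) pairs. The function name `repair_time` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def repair_time(reported, restored, suspensions):
    """Time from report to restoration, minus the suspended intervals.

    suspensions: list of (start, end) datetime pairs, assumed not to overlap.
    """
    total = restored - reported
    for start, end in suspensions:
        # Count only the part of a suspension that lies inside the incident.
        start = max(start, reported)
        end = min(end, restored)
        if end > start:
            total -= end - start
    return total


# Hypothetical incident: reported at 9:00, confirmed at 9:20, fixed at 12:00.
reported = datetime(2024, 3, 4, 9, 0)
confirmed = datetime(2024, 3, 4, 9, 20)
restored = datetime(2024, 3, 4, 12, 0)

# The Service Desk waited 45 minutes for the user's answer.
pause_start = datetime(2024, 3, 4, 10, 0)
suspensions = [(pause_start, pause_start + timedelta(minutes=45))]

response = confirmed - reported
repair = repair_time(reported, restored, suspensions)
print(response)  # 0:20:00
print(repair)    # 2:15:00
```

The sketch measures plain clock time. An actual SLA may count only the agreed support business hours, which would require an additional calendar calculation.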

The provider maintains current knowledge about the system, updates the documentation, and creates testing mechanisms.

One can often hear the opinion that you do not have to buy an SLA since the application was deployed X years ago and worked reliably during that time. Some managers want to save on fixed fees, and thus spend the money saved on the development of the system or any other purpose.

This decision can have dramatic consequences, since factors not directly related to the application code can interfere with the reliable operation of the system and lead to a failure even when the application itself contains no errors. For example, an ordering system has worked flawlessly for 2 years, but the application becomes unstable when the number of simultaneous users increases significantly due to, e.g., the growth of the company or marketing activities. Another example is a system failure caused by an update of the server software or another infrastructure element with which the system turns out to be incompatible. After the warranty expires, the provider no longer has to maintain the necessary, current knowledge about the system, which extends the repair time and increases costs. In an extreme case, the provider may no longer maintain competence in the specific technology and may be unable to undertake a modification.

An SLA obliges the provider to retain knowledge about the deployed system, its structure and functions.

This way you can be sure that the provider will be able to respond efficiently to various errors, and that a support team has been allocated to the project and will have the time to deal with a failure quickly. To achieve this, the provider has to continuously update the system documentation, ensure adequate competence in the team, etc. With a view to long-term cooperation, providers are often also keen to create additional mechanisms for testing and for warning about threats to the stability of the application. These aspects are important not only for the proper implementation of the SLA, but also help to reduce the cost of system development and to improve its quality.

3. Important SLA parameters

The following are among the many questions about the parameters that you need to answer before proceeding to negotiate an SLA:

  • Which processes are crucial for our company and at what times do we have to ensure the continuity of their operation?
  • What types of errors do we distinguish and how do we describe them?
  • What support business hours and what response and repair times do we expect?
  • How do we measure the effectiveness of the service?
  • What are the communication procedures and escalation paths?

In practice, the answers to these questions are put in a table, such as the following.

Example 1:

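Error type                             | Response time | Repair time | Business hours
---------------------------------------+---------------+-------------+------------------------------
Critical error (e.g. service is        | 0.5 h         | 1 h         | 9-17, 365 days
offline, one cannot place orders)      |               |             |
Non-critical error                     | 2 h           | 4 h         | 9-17, Mon-Fri on working days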

Remember to specify in the SLA parameters the days and hours during which support is provided. These should reflect the system usage statistics and minimize potential losses in case of system unavailability. At the same time, keep in mind that the more stringent the requirements, the higher the SLA price. Therefore, you should choose the SLA level as a compromise between the cost of the SLA and the actual business requirements.
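The parameters from Example 1 can also be written down as data, which makes it easy to check a handled incident against them. The following is a minimal sketch. The class `SlaTarget`, the dictionary keys, and the function `breaches` are illustrative names, and the measured times are assumed to have already been counted within the agreed business hours.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SlaTarget:
    response: timedelta  # maximum response time
    repair: timedelta    # maximum repair time


# Values copied from Example 1; the keys are illustrative.
TARGETS = {
    "critical": SlaTarget(timedelta(minutes=30), timedelta(hours=1)),
    "non_critical": SlaTarget(timedelta(hours=2), timedelta(hours=4)),
}


def breaches(error_type, response, repair):
    """Return the names of the SLA parameters that were exceeded."""
    target = TARGETS[error_type]
    exceeded = []
    if response > target.response:
        exceeded.append("response")
    if repair > target.repair:
        exceeded.append("repair")
    return exceeded


# A critical error answered after 20 minutes but repaired after 90 minutes.
print(breaches("critical", timedelta(minutes=20), timedelta(minutes=90)))
# ['repair']
```

A list of exceeded parameters, rather than a single yes/no answer, is convenient when contractual penalties differ for response and repair times.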

Workaround

You should be aware of the difference between a workaround and the total elimination of the cause of the error. In short, a workaround restores the business process but does not necessarily eliminate the cause of the error. From a business perspective, keeping the system in operation is essential, so you need to negotiate the shortest possible time in which the provider delivers at least a workaround. The total elimination of the cause may involve preparing, testing, and implementing a new system version, which takes longer than the few hours in which a workaround can be provided. Depending on the SLA, the time stipulated for a workaround may therefore differ from the time stipulated for the total elimination of the cause of the error. Implementation of new versions should be based on separate orders, which the SLA does not cover (otherwise the cost of the SLA will be incomparably higher).

Measurement of the application availability

The level of application availability, expressed as a percentage and calculated for specific periods (usually months), is a very important parameter. It is beneficial if the provider actively monitors the application availability using a specialized system, such as NAGIOS, and shares the results. When problems with availability are detected, systems such as NAGIOS can send text and e-mail messages to designated persons. Availability measurement systems store all information relating to availability and make it possible to generate scheduled reports.

Monitoring is carried out by automatic processes running on a server, which check the application availability at specified intervals. Make sure that the monitoring runs on a server in another, independent data center, so that you can be sure that it has Internet access and operates properly. Frequently, the monitoring mechanisms themselves are also checked. Systems such as NAGIOS can also check the response times to individual requests, e.g. the time after which the new order page is displayed. More about NAGIOS in our other paper: Measurement of the application availability. Monitoring provides an objective way to assess and measure the effectiveness of the SLA service.
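As an illustration of how the percentage is derived, the sketch below computes availability from a list of periodic check results. It is a simplified model and not the way NAGIOS itself calculates its reports. The function name, the 5-minute interval, and the sample data are assumptions.

```python
def availability_percent(check_results):
    """Share of successful checks in the period, as a percentage.

    check_results: list of booleans, True when the application responded.
    """
    if not check_results:
        raise ValueError("no checks recorded for the period")
    return 100.0 * sum(check_results) / len(check_results)


# Assumption: one check every 5 minutes over a 30-day month = 8640 checks.
checks = [True] * 8640

# Simulate a one-hour outage: 12 consecutive failed checks.
for i in range(100, 112):
    checks[i] = False

print(round(availability_percent(checks), 3))  # 99.861
```

The shorter the check interval, the more accurate the result, because an outage that falls entirely between two checks is not recorded.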

Communication and support lines

The description of the communication between the service provider and the client is a key element of any SLA. Specify who can report problems and how. Usually, two support lines are distinguished, whereby:

  • The first support line - solves the problems of direct system users. These are mostly minor problems suitable for a helpdesk. The most common forms of communication are phone, chat, and e-mail.
  • The second support line - handles the problems and incidents that the first support line cannot solve. In this situation, the first-line support employees hand the issue over to the second line responsible for the given area (e.g. to the software provider).

In both cases, the use of a report tracking system is highly advisable in order to measure the indicators of the actual SLA level. If the support is provided by an external company, a report tracking system is necessary to safeguard the interests of both parties.

Escalation

In SLAs, escalation is most often a communication procedure used when problems arise in the course of handling an incident (e.g. expected or actual delays in providing a solution, or the management of critical incidents). Escalation procedures often involve bringing senior staff into the communication in order to solve the problem.

The most common problems when negotiating SLAs:

  • Businesses expect levels of availability and system support that are out of proportion to their real needs. Sometimes the business requirements stated during SLA negotiations for non-critical IT systems specify a system availability level of 99.999% per month and full support 24 hours a day, 365 days a year, with a repair time of 1 hour for any error. Indeed, there are systems that require the highest levels of availability and uptime (e.g. a core banking system or mobile phone services). However, many business solutions are used during certain hours of the day and their use outside of these hours is minimal. Hence the lack of availability outside of working hours does not mean significant losses. It is difficult to justify the highest, 24-hour support levels in such situations.
  • Businesses expect quotes for SLA support agreements without specifying the support parameters. The price of an SLA support service depends directly on these parameters, so the business owner of the system has to specify at least the approximate expectations regarding the support level. If this is a problem, you can ask the provider for a quote for several variants of the support level.
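To see how demanding a given availability level is, it helps to convert the percentage into the downtime it allows. The following is a minimal sketch, assuming a 30-day month. The function name is illustrative.

```python
def allowed_downtime_minutes(availability_percent, days=30):
    """Maximum downtime, in minutes, that still meets the availability level."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - availability_percent) / 100.0


for level in (99.0, 99.9, 99.999):
    print(level, round(allowed_downtime_minutes(level), 2))
# 99.0 432.0
# 99.9 43.2
# 99.999 0.43
```

An availability of 99.999% per month leaves less than half a minute of downtime, which shows why such a requirement is hard to justify for a non-critical system.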

4. Conclusions

When planning the project budget, you should set aside resources for the system support service, as this is actually the only effective way to ensure the availability of the system at the required level. The savings made by giving up support are only apparent, and in a worst-case scenario they can result in a loss of profits and customer trust. It is also worth mentioning that there are many systematic collections of best practices related to support services, such as ITIL. They should be known and used by both the provider and the recipient of the service.