Problem Description
On 08.03.2024, the Cloud environment experienced a service disruption starting at approximately 10:50 AM UTC+1.
A small load surge hit the Identity Server database and caused an outage, which made Protel Air PMS (Front Office) inaccessible.
After identifying the root cause, we implemented a workaround as an immediate remedy. All systems were fully operational again at approximately 12:40 PM UTC+1.
To prevent such an incident from recurring, we have thoroughly analyzed what happened and taken corrective measures. Our assessment of the fault, including future preventive actions, can be found below.
Affected systems
All protel services on the Cloud environment
Impact
Login to the Cloud environment was no longer possible. Login attempts returned a generic HTTP error message.
Root Cause
Around 10:50 AM UTC+1, a small surge of requests arrived at the Identity Server for unknown reasons.
One of the first steps to resolve the issue was to increase the database performance from XL to 2XL in order to restore login as quickly as possible.
A closer examination was then required to find the root cause of the exceptionally high load on the database of the external Identity Management system. It showed that requests to the Identity systems had increased steadily over the last two years. These requests are served partly from a database table that accumulates more and more data over time. The surge, combined with the oversized table, pushed the database load to a breaking point at which request processing slowed down dramatically. Shortly afterwards, new requests were rejected and error messages were returned.
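The way a small surge can tip an already loaded database over the edge can be illustrated with a textbook single-server queue (M/M/1). This is a simplified sketch, not the analysis performed for this incident: the function name and all numbers are illustrative assumptions, and a real database does not behave exactly like this model. It only shows why response times rise sharply as utilization approaches 100%, for example when per-request service time grows with table size.

```python
def mean_response_time(arrival_rate: float, service_time: float) -> float:
    """Mean time a request spends in an M/M/1 queue (waiting plus service).

    arrival_rate: requests per second (illustrative value, not measured)
    service_time: seconds of database work per request (illustrative)
    Returns infinity when the server is saturated (utilization >= 1).
    """
    utilization = arrival_rate * service_time
    if utilization >= 1.0:
        return float("inf")
    return service_time / (1.0 - utilization)


if __name__ == "__main__":
    # Hypothetical scenario: a growing table has raised the service time
    # to 10 ms, and the request rate then rises in small steps.
    for rate in (50, 80, 90, 95, 99):
        latency_ms = mean_response_time(rate, 0.010) * 1000
        print(f"{rate} req/s -> {latency_ms:.0f} ms mean response time")
```

In this model, the same absolute increase in request rate has a far larger effect near saturation than at moderate load, which is consistent with a small surge causing a disproportionate slowdown.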
Mitigation / Preventive Actions
Plans have been developed to improve the performance of the Identity and Access Management system in the short, medium, and long term.
● The short-term solution was to increase the database performance from XL to 2XL.
● As the medium-term solution, we will reduce the amount of data in the specific table that slowed down performance.
● The long-term solution is to replace our current Identity & Access Management system with its newly developed successor. This replacement will also include a switch to a different external Identity Management system that is better suited to high-performance use.
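The medium-term measure, reducing the data in the oversized table, is commonly carried out in small batches so that the cleanup itself does not add a heavy load to the database. The following is a minimal sketch of that general technique using Python's built-in sqlite3 module. The table name `auth_records`, the column `created_at`, and the batch size are hypothetical; the actual schema, database engine, and retention rule are not described in this report.

```python
import sqlite3


def prune_old_rows(conn: sqlite3.Connection, cutoff: str, batch_size: int = 1000) -> int:
    """Delete rows older than `cutoff` in small batches.

    `auth_records` and `created_at` are hypothetical names used for
    illustration only. `cutoff` is an ISO 8601 timestamp string, so a
    plain string comparison orders the values correctly.
    Returns the total number of rows deleted.
    """
    total_deleted = 0
    while True:
        cursor = conn.execute(
            "DELETE FROM auth_records WHERE rowid IN ("
            "SELECT rowid FROM auth_records WHERE created_at < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        # Commit after each batch to keep transactions short.
        conn.commit()
        if cursor.rowcount == 0:
            break
        total_deleted += cursor.rowcount
    return total_deleted
```

Short transactions keep locks brief, so regular login requests can continue to be served between batches. A production cleanup would typically also pause between batches and run outside peak hours.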