protel Cloud Service Degraded Performance
Incident Report for protel Cloud Center
Postmortem

German Version see below

Problem Description

On Wednesday, 6 October 2021, some protel AIR customers experienced intermittent connection disruptions, starting at approx. 3:15PM UTC+2:00.

A high memory load in the Redis server process caused out-of-memory reboots. The resulting service interruptions in the server architecture caused intermittent connection issues for all users.

After we identified the root cause and disabled the new protel.I/O Redis feature, the system was operational again at approx. 11:30PM UTC+2:00.

To prevent such an incident from recurring, we have thoroughly analysed what happened and taken corrective measures. Our assessment of the fault, including future preventive actions, can be found below.

Affected systems

protel AIR and connected interfaces

Impact

Login to the applications was intermittently not possible.

Root Cause

Our investigation showed that the Redis server restarted repeatedly because of excessive memory usage, which in turn resulted in a high DB load. The restarts occurred at approximately the following times: 3:10PM, 3:50PM, 5:15PM, 6:20PM, 7:10PM, 8:00PM, 10:30PM and 11:15PM.
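To make the restart pattern easier to read, the gaps between the approximate timestamps listed above can be worked out as follows. This is only an illustrative calculation on the times given in this report; the helper names are hypothetical and not part of any protel tooling.

```python
# Approximate Redis restart times as listed in this report (local time).
RESTARTS = ["3:10PM", "3:50PM", "5:15PM", "6:20PM",
            "7:10PM", "8:00PM", "10:30PM", "11:15PM"]


def to_minutes(clock):
    """Convert a 12-hour time such as '3:10PM' into minutes since midnight."""
    suffix = clock[-2:].upper()
    hour, minute = map(int, clock[:-2].split(":"))
    hour = hour % 12 + (12 if suffix == "PM" else 0)
    return hour * 60 + minute


def gaps_in_minutes(times):
    """Return the gap in minutes between each pair of consecutive times."""
    minutes = [to_minutes(t) for t in times]
    return [later - earlier for earlier, later in zip(minutes, minutes[1:])]


gaps = gaps_in_minutes(RESTARTS)
print(gaps)       # [40, 85, 65, 50, 50, 150, 45]
print(sum(gaps))  # 485 minutes between the first and the last listed restart
```

On these figures the restarts were between 40 and 150 minutes apart, roughly 69 minutes on average.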

We identified that the new protel.I/O Redis feature, created to reduce the runtimes of the I/O Messaging (specifically: Folio transmission), also caused a high network load on both the I/O tomcat machines and the Redis server.

At approx. 11:10PM, we disabled the new protel.I/O Redis feature and confirmed that the system was running stably. The system was fully operational again at approx. 11:30PM UTC+2:00.

Resolution

The new protel.I/O Redis feature was disabled.
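The report does not describe how the feature was switched off. As a general illustration only, a feature of this kind can be placed behind a runtime switch so that it can be disabled without a redeployment. In the sketch below, the variable name `IO_REDIS_FEATURE` and both transmit paths are hypothetical assumptions, not protel's actual implementation.

```python
import os


def redis_feature_enabled(env=os.environ):
    """Hypothetical switch: the feature is on unless explicitly turned off."""
    value = env.get("IO_REDIS_FEATURE", "on").strip().lower()
    return value not in {"off", "0", "false"}


def transmit_folio(folio, send_via_redis, send_direct, env=os.environ):
    """Use the Redis-backed path only while the switch is on."""
    if redis_feature_enabled(env):
        return send_via_redis(folio)
    # Fallback: the path that was in place before the feature was introduced.
    return send_direct(folio)


# Usage with stand-in transport functions (assumptions for illustration).
via_redis = lambda folio: ("redis", folio)
direct = lambda folio: ("direct", folio)

print(transmit_folio("folio-1", via_redis, direct, env={"IO_REDIS_FEATURE": "off"}))
print(transmit_folio("folio-1", via_redis, direct, env={}))
```

With the switch set to `off`, the first call takes the direct path; with no setting, the second call takes the Redis path.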

Mitigation / Preventive Actions

The dedicated Redis instance for the protel.I/O servers was refactored.
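The report does not state whether additional monitoring was introduced. As a sketch of one possible complement to a dedicated instance, memory headroom can be checked from the text returned by the Redis `INFO memory` command, which reports `used_memory` and `maxmemory` as `key:value` lines. The 85% threshold and the sample values below are assumptions chosen for illustration.

```python
def parse_info(text):
    """Parse the key:value lines of a Redis INFO reply into a dict of strings."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as "# Memory"
        key, _, value = line.partition(":")
        info[key] = value
    return info


def memory_usage_ratio(info):
    """Return used_memory / maxmemory, or None if no limit is set (maxmemory 0)."""
    used = int(info["used_memory"])
    limit = int(info.get("maxmemory", "0"))
    return used / limit if limit > 0 else None


def needs_attention(info, threshold=0.85):
    """True when memory usage has reached the (assumed) alert threshold."""
    ratio = memory_usage_ratio(info)
    return ratio is not None and ratio >= threshold


# Made-up sample reply, not data from the incident.
SAMPLE = "# Memory\r\nused_memory:900\r\nmaxmemory:1000\r\n"
print(needs_attention(parse_info(SAMPLE)))
```

A check like this only helps when `maxmemory` is configured; with no limit set, the ratio is undefined and the sketch returns `None`.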

DISCLAIMER: THIS DOCUMENT HAS BEEN COMPILED FOR INFORMATION PURPOSES ONLY BY PROTEL HOTELSOFTWARE GMBH (PROTEL) TO THE BEST OF ITS KNOWLEDGE AND BELIEF BASED ON INFORMATION CURRENTLY AVAILABLE AND AT HAND. HOWEVER, PROTEL DOES NOT GUARANTEE THAT THE INFORMATION IS CORRECT, COMPLETE, UP-TO-DATE AND/OR IN THE CORRECT ORDER. PROTEL RESERVES THE RIGHT TO MAKE CHANGES AND/OR ADDITIONS WITHOUT PRIOR NOTICE. PROTEL MAKES NO EXPRESS OR IMPLIED WARRANTY (INCLUDING BUT NOT LIMITED TO ANY WARRANTY OR MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE OR USE, ETC.) WITH RESPECT TO THIS INFORMATION. INFORMATION FROM PROTEL IS PROVIDED TO USERS "AS IS". PROTEL SHALL NOT BE LIABLE TO USERS OR ANY OTHER PERSON FOR ANY INTERRUPTIONS, INACCURACIES, ERRORS OR OMISSIONS ETC. IN PROTEL'S INFORMATION, REGARDLESS OF THE CAUSE, FOR ANY RESULTING DAMAGES (INCLUDING, WITHOUT LIMITATION, DIRECT OR INDIRECT, CONSEQUENTIAL DAMAGES, ETC.). IN ALL OTHER RESPECTS, PROTEL'S GENERAL TERMS AND CONDITIONS, WHICH CAN BE DOWNLOADED FROM THE PROTEL WEBSITE AT HTTP://WWW.PROTEL.NET/DE/AGB/, SHALL APPLY.

Vorfall

Am Mittwoch, dem 6. Oktober 2021, verzeichneten einige protel Air-Kunden ab ca. 15.15 Uhr MESZ zeitweise Verbindungsunterbrechungen.

Aufgrund einer hohen Speicherauslastung des Redis-Serverprozesses wurden out-of-memory Reboots durchgeführt, die für Unterbrechungen der Verbindung in der Server-Architektur und zeitweise Verbindungsprobleme für alle Nutzer sorgten.

Nachdem wir die Ursache gefunden und das neue protel.I/O-Redis Feature gegen 23.10 Uhr MESZ deaktiviert hatten, kam es zu keinen weiteren Unterbrechungen. 

Um solche Vorfälle in Zukunft zu vermeiden, haben wir intensive Analysen durchgeführt und Korrekturmaßnahmen ergriffen. Unsere Bewertung des Fehlers, einschließlich künftiger Präventivmaßnahmen, finden Sie weiter unten.

Betroffene Systeme

protel Air und einige verwandte Systeme

Auswirkungen

Die Anmeldung bei den Anwendungen war zeitweise nicht möglich.

Ursache

Durch unsere Nachforschungen haben wir festgestellt, dass die Neustarts des Redis-Servers durch eine übermäßige Nutzung der Speicherkapazität ausgelöst wurden, was zu einer hohen DB-Last zu den folgenden Zeitpunkten führte (geschätzt): 15.10 Uhr, 15.50 Uhr, 17.15 Uhr, 18.20 Uhr, 19.10 Uhr, 20.00 Uhr, 22.30 Uhr und 23.15 Uhr.

Wir haben herausgefunden, dass das neue protel.I/O-Redis Feature, welches eingeführt wurde, um die Laufzeiten der protel.I/O-Messages (besonders für die Übertragung von Rechnungen) zu reduzieren, gleichzeitig für eine hohe Netzwerklast sowohl auf Seiten der protel.I/O-Tomcat-Maschinen als auch auf Seiten des Redis-Servers sorgte.

Gegen 23.10 Uhr deaktivierten wir das neue protel.I/O-Redis Feature und konnten feststellen, dass das System wieder stabil lief. Ab ca. 23.30 Uhr MESZ kam es zu keinen weiteren Verbindungsabbrüchen.

Lösung

Deaktivieren des neuen I/O-Redis Features

Risikominderung / Präventivmaßnahmen

Die spezielle Redis-Instanz für den protel.I/O-Server wurde überarbeitet.

Posted Oct 13, 2021 - 08:48 CEST

Resolved
This incident has been resolved.
Posted Oct 07, 2021 - 08:38 CEST
Update
We are continuing to monitor for any further issues.
Posted Oct 07, 2021 - 08:08 CEST
Update
We continue to monitor all systems closely and will respond immediately to even the slightest outage.
Some customers may currently be experiencing degraded system performance.
We would like to apologise for any inconvenience caused and thank you for your understanding.
Posted Oct 06, 2021 - 23:06 CEST
Update
We have identified the cause, but short interruptions may still occur. We will continue to monitor all systems closely and respond immediately to even the slightest outage. We would like to apologise for any inconvenience caused and thank you for your understanding.
Posted Oct 06, 2021 - 19:27 CEST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Oct 06, 2021 - 16:46 CEST
Update
We are continuing to work on a fix for this issue.
Posted Oct 06, 2021 - 16:06 CEST
Update
We are continuing to work on a fix for this issue.
Posted Oct 06, 2021 - 16:05 CEST
Identified
We are aware that some customers are currently experiencing degraded system performance. We are working with high priority to get things back to normal. View the current status and impacted services at https://cloudstatus.protel.net.
Posted Oct 06, 2021 - 16:04 CEST
This incident affected: protel Cloud Solutions | Europe, North America (protel Air, protel IDS Interface, protel Air System Data, Identity and Access Management (IAM)).