protel Cloud Service Interruption
Incident Report for protel Cloud Center
Postmortem

Problem Description and Impact

For a period of approximately 48 hours between Thursday, January 14th and Saturday, January 16th, 2021, impacted customers could not access protel Cloud Center and some related systems.

In order to prevent such an incident from reoccurring, we have analysed what happened and taken corrective measures. Our current assessment of the fault, including future preventative actions, can be found below.

Affected systems

protel Cloud Center and some related systems hosted by Amazon Web Services (AWS) in Ireland.

Root Cause

A single hotel attempted to send 160,000 emails simultaneously, with their own address in CC. The hotel's email provider blocked the incoming messages, which resulted in the sender's address being blacklisted. Every failed email was logged to the related database, with the full mail attached to the error entry. As a result, the database cluster became overloaded and unresponsive, and subsequently failed.
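To make the failure mode concrete, the sketch below shows the kind of bounded error-log write that prevents this pattern: only a truncated excerpt of the failed mail is stored, and repeated failures for the same message stop being logged after a few attempts. This is a minimal illustration using pymongo under our own assumptions; the connection string, collection, field names and limits are all hypothetical, not protel's actual code.

```python
from pymongo import MongoClient

MAX_LOGGED_BODY = 4096   # hypothetical cap on the stored excerpt, in bytes
MAX_LOGGED_ATTEMPTS = 3  # hypothetical cap on logged failures per message

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
errors = client["mailer"]["send_errors"]           # hypothetical database/collection

def log_send_failure(message_id: str, recipient: str, raw_mail: bytes, attempt: int) -> None:
    """Record a failed delivery without attaching the full mail to the error entry."""
    if attempt > MAX_LOGGED_ATTEMPTS:
        # Repeated failures for the same message no longer grow the database.
        return
    errors.insert_one({
        "message_id": message_id,
        "recipient": recipient,
        "attempt": attempt,
        # Store a truncated excerpt instead of the complete mail.
        "body_excerpt": raw_mail[:MAX_LOGGED_BODY].decode("utf-8", errors="replace"),
        "original_size_bytes": len(raw_mail),
    })
```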

A critical incident was declared and all available technical and support resources were allocated. During the course of the incident, 23 separate customer status updates were provided via the protel status page.

The first attempt to correct the situation was to re-initialise all APIs used by protel Cloud Center and protel Messenger in order to get the impacted processes working again. However, remote initialisation failed, as the instances were not responding to scripted remote commands. Direct access to each instance was necessary to execute a re-initialisation.

The restart of the database cluster was attempted while it was still under heavy load from all cloud systems. As a result, each server instance within the cluster had to be mounted separately in repair mode before it could rejoin the cluster. Due to the enormous amount of data to be processed, this turned out to be an extremely lengthy process.
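Later updates in this incident identify the cluster as Mongo databases. Assuming a MongoDB replica set, the per-instance repair described above typically looks like the sketch below: each member is stopped, repaired in standalone mode, and then restarted as a replica set member so it can rejoin the cluster. The data path, replica set name and log path are placeholders, and the real procedure will have involved additional operational steps.

```python
import subprocess

DBPATH = "/data/db"                       # placeholder data directory
REPL_SET = "rs0"                          # placeholder replica set name
LOGPATH = "/var/log/mongodb/mongod.log"   # placeholder log file

# Step 1: with the mongod process stopped, rebuild the data files in
# standalone repair mode. mongod --repair exits when the repair finishes.
subprocess.run(["mongod", "--dbpath", DBPATH, "--repair"], check=True)

# Step 2: restart the instance as a replica set member so it can rejoin
# the cluster and resynchronise with the other members.
subprocess.run(
    ["mongod", "--dbpath", DBPATH, "--replSet", REPL_SET,
     "--fork", "--logpath", LOGPATH],
    check=True,
)
```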

Resolution

In order to return the system to service, the following steps were performed:

  • The blacklist entries were limited in length, and the number of logged attempts was capped, to avoid database entries that are too large.

  • Additionally, changes were made so that queued emails containing a blacklisted email address are no longer processed (see the sketch after this list).
  • Restarted the database processes:

    • Repair and restart database 1
    • Copy database 1 to database 2
    • Copy database 1 to database 3
  • Once the first database had been repaired and restarted, non-essential log entries were deleted in order to reduce the enormous volume of data.

  • A new database cluster was set up in the PCI environment and a snapshot from the production database cluster was imported. This made it possible to run the following services independently of the Cloud API: EFT, Credit Card Manager (without the PCI dialogue) and CD-Proxy (pull, not push).

  • Once the second database was available, we began restarting the Cloud APIs and connected systems one by one, to keep the load on the databases as low as possible during the restart.
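As referenced in the second step above, the queue-side change can be as simple as dropping any message that involves a blacklisted address before a send is ever attempted, so such messages can no longer generate failure entries. A minimal sketch; the queue format and names are hypothetical:

```python
def filter_queue(queue: list[dict], blacklist: set[str]) -> list[dict]:
    """Remove queued mails whose sender or any recipient is blacklisted."""
    deliverable = []
    for message in queue:
        addresses = {message["from"], *message["to"], *message.get("cc", [])}
        if addresses & blacklist:
            continue  # known-bad address: skip instead of attempting delivery
        deliverable.append(message)
    return deliverable

# Hypothetical usage: the hotel's own address is blacklisted, so the mail
# (which carries that address in CC) is dropped from the queue.
queue = [{"from": "hotel@example.com",
          "to": ["guest@example.org"],
          "cc": ["hotel@example.com"]}]
print(filter_queue(queue, blacklist={"hotel@example.com"}))  # -> []
```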

Mitigation / Preventive Actions

  • protel Messenger will be updated to limit the number of mails that can be sent out in a single Messenger execution (a minimal sketch of such a cap follows this list).
  • Decoupling cloud-based applications, such as the EFT services, from the Cloud API will keep the number of impacted systems and services to a minimum should one component fail in the future.
  • Our DevOps team will simulate the incident in order to develop further strategies to avoid full outages and minimise the delay involved in returning systems to service.
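As a rough illustration of the first of these measures, a hard cap can be enforced before a send loop starts, rejecting oversized campaigns outright instead of flooding the mail queue. The threshold and names below are placeholders under our own assumptions, not protel's actual implementation:

```python
MAX_MAILS_PER_EXECUTION = 1000  # placeholder threshold, not protel's chosen limit

class SendLimitExceeded(Exception):
    """Raised when a single Messenger execution requests too many mails."""

def send_campaign(messages: list[dict], send_one) -> int:
    """Send a batch of mails, refusing the batch entirely if it exceeds the cap."""
    if len(messages) > MAX_MAILS_PER_EXECUTION:
        raise SendLimitExceeded(
            f"{len(messages)} mails requested; limit is {MAX_MAILS_PER_EXECUTION}"
        )
    sent = 0
    for message in messages:
        send_one(message)  # caller-supplied delivery function
        sent += 1
    return sent
```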

--end--

Disclaimer: This document has been compiled for information purposes only by protel hotelsoftware GmbH (protel) to the best of its knowledge and belief, based on information currently available and at hand. However, protel does not guarantee that the information is correct, complete, up-to-date and/or in the correct order. protel reserves the right to make changes and/or additions without prior notice. protel makes no express or implied warranty (including but not limited to any warranty of merchantability or fitness for a particular purpose or use, etc.) with respect to this information. Information from protel is provided to users "as is". protel shall not be liable to users or any other person for any interruptions, inaccuracies, errors or omissions etc. in protel's information, regardless of the cause, or for any resulting damages (including, without limitation, direct or indirect, consequential damages, etc.). In all other respects, protel's General Terms and Conditions, which can be downloaded from the protel website at http://www.protel.net/tc, shall apply.
Posted Jan 22, 2021 - 14:28 CET

Resolved
All systems are now fully operational again and have been operating normally since the issue was resolved. We do appreciate the significance of this incident and thank you for your patience during the time it took us to restore service to the impacted components.

As mentioned previously, a detailed root cause analysis (RCA) will be published via this channel during the course of this week. This will include a more detailed technical description of what caused the problem, a summary of the steps that were taken to return impacted systems back to service and the preventative action that we will take to mitigate against this problem arising again in the future.
Posted Jan 18, 2021 - 17:46 CET
Update
All Best Western Interfaces are up and running again according to the standard protocol.
Posted Jan 16, 2021 - 18:39 CET
Update
Update 4.30 pm

protel Air PMS/Front Office continues to work normally (note that Front Office was not impacted by this issue). The only impediment at this time is that new objects (users, hotels) cannot be created.

*protel Cloud Center (with the exception of Messenger), IDS, EFT, WBE, Voyager, and 3rd-party interfaces (including Kiosk, Ariane, etc.) have now returned to service.*

For protel Messenger, event-based emails are being sent by the system, but the confirmation is currently not shown. Some customer-created messages are not yet fully working. We continue to work to bring these back to service.

The Best Western IFC is being brought back online in batches. Once the BW interfaces are working for all hotels, we will amend the status here to operational later this evening.

We continue to work on this issue to fix the remaining items and update this status page as these are completed.

As previously mentioned, we will automatically provide a full and detailed root cause analysis (RCA) via this channel once the incident has been resolved and we have had the chance to perform a complete analysis of the issue and determine the steps to be taken to ensure that such a case can not reoccur in future.
Posted Jan 16, 2021 - 16:46 CET
Update
Two of the three databases are now fully up and running; the third is in the process of starting. With the two DBs that are now online, we have initiated the process of restarting selected services and extensively testing the quality of the data and responses. As the QA process finishes on each individual item, we will inform you via this channel as each component is returned to service.
Posted Jan 16, 2021 - 15:21 CET
Update
IDS Services are now partially restored for PULL instances.
As further services come back online during the course of the afternoon, we will provide further information here.
Posted Jan 16, 2021 - 13:22 CET
Update
EFT services are now restored and fully functional. PCI is back online for IDS and protel i/o connections.
As further services come back online, we will provide further information here.
Posted Jan 16, 2021 - 11:54 CET
Update
Update 10am CET
During the course of the night and morning, the main system database has now been fully repaired and restored to service by our systems engineering team. This will enable us to start bringing the first services back online during the course of this morning. In parallel, we are already in the process of cloning the restored database to create a cluster in order to bring full stability back to the system and to enable us to return all remaining components back to service. Information will follow in this channel as services start to come back online during the course of the morning and afternoon. The time that this will take depends on the speed of the cloning exercise and the number of suspended messages from all systems that need to be processed. All available system (processing) resources within the data centre have been deployed to ensure that any running processes complete as quickly as possible.
Posted Jan 16, 2021 - 10:12 CET
Update
Our development and engineering teams are continuing to work around the clock on returning the impacted components to service. The database rebuild continues to run and once completed, our operations teams will be in a position to restart the impacted services. We do expect that it will still take a number of hours until all work has completed and all services are fully operational. We will continue to keep you informed about the process and the overall status via this channel. Thank you.
Posted Jan 16, 2021 - 03:02 CET
Update
Update 23.00 CET
The rebuild of the database is continuing and our engineering teams are doing all that they can and using all resources at their disposal to ensure that this process completes as soon as possible. We will continue to work on this throughout the night and until the issue is brought to resolution. Any changes to the status, further information or updates regarding timelines will be posted here as soon as they become available. Thank you for your continued patience.
Posted Jan 15, 2021 - 23:07 CET
Update
The teams are continuing to work on the issue. Progress is being made with the database rebuild and resync.
Posted Jan 15, 2021 - 21:10 CET
Update
Update 17.30 CET.

At this point, we would like to provide you with more detailed information concerning this ongoing incident, based on the information currently available to us.

Firstly, we would again like to reassure you that we understand the urgency and importance of the issue and that all available resources have been working on this issue since the outset. We will continue to work on this around the clock until full service is restored.

The problem that we are facing is database-related. The complexity of this incident is compounded by the fact that several interrelated components are impacted. We have had to perform a time-consuming repair of the underlying Mongo databases and reinstall these within the AWS environment. A very substantial amount of data has had to be restored into the recreated databases, and each individual component or service must be reconnected to the database layer. Because the Mongo databases are mirrored services, they need to be brought back into sync with one another, which is also very time-consuming. Once they are up and running, individual components need to be restarted, and we need to be mindful that they cannot be "switched on" simultaneously. The switch must be completed in a controlled manner - one by one - so that the volume of data traffic being processed does not cause further issues or exacerbate the existing problem.

Because of this ongoing task's complexity, we can not yet provide you with guidance as to when the system will return to normal, in part or as a whole. We remain optimistic that the work should complete over the next few hours and are doing our very best to meet this target.

We will continue to keep you updated via this channel whenever further information comes to light.

We will automatically provide a full and detailed root cause analysis (RCA) via this channel once the incident has been resolved and we have had the chance to perform a complete analysis of the issue and determine the steps to be taken to ensure that such a case can not reoccur in future.

Thank you.
Posted Jan 15, 2021 - 17:30 CET
Update
Update 3.45 pm CET. Work to restore services is still ongoing.

Any changes to the status, further information or updates regarding timelines will be posted here as soon as they become available - next update at 5.30 pm today. We thank you for your continued patience while we work to resolve this problem.
Posted Jan 15, 2021 - 15:53 CET
Update
Update 1 pm CET. The database repair continues to progress, but unfortunately we still cannot provide guidance as to how long this will take to complete. EFT and CD-Proxy have been separated from the main fault, and we hope to be able to provide guidance on how long it will take to bring these services back online by the next update at the latest. Any changes to the status, further information or updates regarding timelines will be posted here as soon as they become available - next update at 3 pm today. We thank you for your continued patience while we work to resolve this problem.
Posted Jan 15, 2021 - 13:13 CET
Update
Update 11.00 CET. The database repair is progressing, but we cannot yet say how long it will take to complete. A separate team continues to focus on EFT and booking-related services and is testing a newly staged system. Any changes to the status, further information or updates regarding timelines will be posted here as soon as they become available - next update at 1 pm today. We thank you for your continued patience while we work to resolve this problem.
Posted Jan 15, 2021 - 11:05 CET
Update
We are continuing to work on a fix for this issue.
Posted Jan 15, 2021 - 10:01 CET
Update
Update 09.30 CET. While the cause of the problem is known, resolving the issue is proving to be time-consuming. Our engineering teams and database specialists are working to resolve the problem as soon as possible, with priority being placed on the EFT and booking-related services. Any changes to the status, further information or updates regarding timelines will be posted here as soon as they become available, by 11 am CET this morning at the latest. We thank you for your continued patience while we work to resolve this problem.
Posted Jan 15, 2021 - 09:38 CET
Update
As previously advised, our development teams have identified the cause of the problem and continue to take all appropriate steps to resolve the issue. The team have been working around the clock in order to bring impacted components back to service. We will provide guidance on the time still required to return the system to service as soon as it is available. This page will be updated with further information by 09.30 CET at the latest. View the current status and impacted services via https://cloudstatus.protel.net.
Posted Jan 15, 2021 - 08:01 CET
Identified
We have identified the cause of the problem and are taking the appropriate steps to resolve the issue. A further status message will follow when we have information about the remaining interruption time or when service has been resumed. View the current status and impacted services via https://cloudstatus.protel.net.
Posted Jan 14, 2021 - 15:08 CET
Investigating
Unfortunately, access to some of our cloud systems is still interrupted. *protel Air Front Office continues to operate as normal*, but other ancillary services are currently not reachable. Note that this is unrelated to the Adobe Flash end-of-life announcement. We continue to investigate and are working to resolve the issue as soon as possible. You will always find the current status on https://cloudstatus.protel.net.
Posted Jan 14, 2021 - 12:38 CET
Update
Access to all systems with the exception of protel Messenger has been restored. We expect Messenger to be back online shortly - a further message will follow as soon as this is the case.

We will continue to closely monitor all systems and would like to apologise for any inconvenience caused and thank you for your understanding.
Posted Jan 14, 2021 - 11:27 CET
Monitoring
We have identified the cause and solved this issue. Access to all systems with the exception of protel Messenger and IDS has been restored. We expect IDS and Messenger to be back online shortly - a further message will follow as soon as this is the case.

We will continue to closely monitor all systems and would like to apologise for any inconvenience caused and thank you for your understanding.
Posted Jan 14, 2021 - 10:35 CET
Investigating
Unfortunately, access to some of our cloud systems is still interrupted. We continue to investigate and are working to resolve the issue as soon as possible. You will always find the current status on https://cloudstatus.protel.net.
Posted Jan 14, 2021 - 08:35 CET
Identified
The issue has been identified and a fix is being implemented.
Posted Jan 14, 2021 - 01:27 CET
Investigating
We are aware that some customers are currently experiencing degraded system performance. We're working with priority to get things back to normal. View the current status and impacted services via https://cloudstatus.protel.net.
Posted Jan 13, 2021 - 20:49 CET
This incident affected: protel Cloud Solutions | Europe, North America (protel Air, protel WBE - Web Booking Engine, protel IDS Interface, protel BWI Interface (Best Western International Interface), Credit Card Interface, Other interfaces to local systems, protel Cloud Center Services, PCI-DSS Environment).