For a period of approximately 48 hours between Thursday, January 14th and Saturday, January 16th 2021, impacted customers could not access protel Cloud Center and some related systems.
To prevent such an incident from recurring, we have analyzed what happened and the corrective measures that were taken. Our current assessment of the fault, including future preventive actions, can be found below.
protel Cloud Center and some related systems are hosted by Amazon Web Services (AWS) in Ireland.
An individual hotel tried to send 160,000 emails simultaneously, with its own address in CC. The hotel's email provider blocked the incoming messages, which resulted in the email address being blacklisted. Every failed email is logged in the related database, with the full mail attached to the error entry. As a result, the database cluster became overloaded and unresponsive, and subsequently failed.
A critical incident was declared and all available technical and support resources were allocated. During the course of the incident, 23 separate customer status updates were provided via the protel status page.
The first attempt to correct the situation was to re-initialize all APIs used by protel Cloud Center and protel Messenger to get the impacted processes working again. However, remote re-initialization failed because the instances did not respond to scripted remote commands. Direct access to each instance was necessary to execute a re-initialization.
A restart of the database cluster was attempted while it was still under heavy load from all cloud systems. As a result, each server instance within the cluster had to be mounted separately in repair mode before it could rejoin the cluster. Due to the enormous amount of data to be processed, this turned out to be an extremely lengthy process.
In order to return the system to service, the following steps were performed:
The length of blacklist entries and the number of logged attempts were limited to prevent oversized database entries.
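The limits described above can be illustrated with a minimal sketch. It is not the actual protel implementation: the class `FailureLog`, the constants `MAX_ENTRY_CHARS` and `MAX_LOGGED_ATTEMPTS`, and their values are all assumptions chosen for illustration, and an in-memory dictionary stands in for the database table.

```python
from collections import defaultdict

# Both limits are illustrative assumptions, not the values used in production.
MAX_ENTRY_CHARS = 500       # cap on the error text stored per entry
MAX_LOGGED_ATTEMPTS = 10    # cap on stored entries per email address


class FailureLog:
    """In-memory stand-in for the error table (hypothetical)."""

    def __init__(self):
        self.entries = defaultdict(list)     # address -> stored error texts
        self.suppressed = defaultdict(int)   # address -> failures counted but not stored

    def record(self, address, error_text):
        """Store a truncated entry, or only count the failure once the cap is reached."""
        if len(self.entries[address]) >= MAX_LOGGED_ATTEMPTS:
            self.suppressed[address] += 1
            return False
        # Keep a short summary rather than attaching the full mail.
        self.entries[address].append(error_text[:MAX_ENTRY_CHARS])
        return True
```

With limits of this kind, a burst of failures for a single address adds only a bounded amount of data: at most `MAX_LOGGED_ATTEMPTS` entries of at most `MAX_ENTRY_CHARS` characters each, plus one counter.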
The databases were then restarted as follows:
Once the first database had been repaired and restarted, the non-essential log entries were deleted in order to reduce the enormous amount of data.
A new database cluster was set up in the PCI environment and a snapshot from the production database cluster was imported. This made it possible to run the following services independently of the Cloud API: EFT, Credit Card Manager (without PCI dialogue) and CD-Proxy (pull, not push).
Once the second database was available, we began restarting the Cloud APIs and connected systems one by one to keep the load on the databases as low as possible during the restart.
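The one-by-one restart can be sketched as follows. This is an illustration under stated assumptions, not the procedure that was actually scripted: the function `restart_one_by_one`, its parameters, and the `start` and `is_healthy` callbacks are hypothetical placeholders for whatever tooling starts a service and checks that it is ready.

```python
import time


def restart_one_by_one(services, start, is_healthy,
                       poll_seconds=5.0, max_polls=60, sleep=time.sleep):
    """Start services sequentially, waiting for each to report healthy before the next."""
    started = []
    for name in services:
        start(name)
        # Poll until the service is healthy so that only one service
        # at a time puts start-up load on the databases.
        for _ in range(max_polls):
            if is_healthy(name):
                break
            sleep(poll_seconds)
        else:
            # Stop the rollout instead of piling more load onto a struggling system.
            raise RuntimeError(f"{name} did not become healthy; started so far: {started}")
        started.append(name)
    return started
```

Stopping at the first service that fails its health check is a deliberate choice in this sketch: it avoids repeating the situation described earlier, in which a restart was attempted while the cluster was still under heavy load.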