On Friday, 22nd September 2023, protel Cloud experienced a service disruption, starting at approximately 07:00 AM UTC.
After high load was noticed on the Front Office backend servers, the environment was found to show high CPU usage and to be failing to establish connections to the global caching database servers. The automatic health checks failed, which led to frequent restarts of all backend server instances and caused long wait times and frequent error messages. After the monitoring capabilities were enhanced and the connection pooling library was downgraded, the environment returned to a stable state.
The development team is still investigating the root cause of the library malfunction in order to implement a workable solution. The downgrade ensured that all systems were fully operational again at approximately 03:00 PM UTC.
To prevent such an incident from recurring, the development team is performing a thorough analysis of what happened so that corrective measures can be taken. Our assessment of the fault, including future preventive actions, can be found below.
protel Cloud Front Office
Because connections to the global caching database failed, many client requests timed out. As a result, users were frequently presented with 504 error messages and were unable to log in to the Front Office.
The connection pooling library was updated during the last release on Wednesday, 20th September, and is not completely backward compatible with the current Front Office infrastructure.
Unfortunately, the error only occurred in an exceptional, previously unknown use case that was not covered by testing in advance. With the knowledge now obtained, the next version will undergo an intensive testing phase to prevent this error from happening again.
In addition, the development team will prepare an emergency response plan for this kind of error to ensure a fast rollback of the affected systems.