Problem Description
On Wednesday 15th June 2022, protel AIR customers were unable to access the NG environment and connected applications, starting at approx. 12:20 UTC+2:00.
The applications were reachable again at approx. 14:55 UTC+2:00.
To prevent such an incident from recurring, we have thoroughly analysed what happened and which corrective measures were taken. Our assessment of the fault, including future preventive actions, can be found below.
Affected Systems
IAM and connected applications (for example pAir, dSignature, SMP)
Impact
Login to the applications was intermittently not possible.
Root Cause
In preparation for the upcoming update of the Identity Server, WSO2, from version 5.08 to 5.11 on our production / live environment (scheduled for the 6th of July), two new servers had to be created and configured.
When the software update was deployed, the current production servers were incorrectly chosen as the deployment target instead of the newly created servers.
As a result, logging in to IAM was no longer possible because the Identity Server could not be reached, which affected all connected applications.
To undo the incorrect update of the production / live servers, the following steps had to be taken:
The misconfiguration of a DevOps script had to be corrected.
Resources of the Identity Server were no longer linked and had to be restored.
Each service had to be restored on a step-by-step basis.
Unfortunately, all of these actions had to be performed manually, which prolonged the downtime.
At approx. 14:55 UTC+2:00, the incorrect update of the live Identity Servers had been fully reverted; as a result, all systems were restored and the applications were accessible again.
After completion, the functionality of the Identity Server was closely monitored, and smaller follow-on issues, such as the loss of some user permissions, were resolved. For certain applications, a restart was required because new certificates had to be configured.
Mitigation / Preventive Actions
Under normal deployment or update circumstances, a roll-back strategy is implemented. As this issue occurred unexpectedly, no roll-back strategy was in place. The following actions will be taken to reduce the chance of recurrence: