Security Bulletin: Everbridge Response to Amazon AWS Service Degradation of 2021-December-22

Amazon AWS experienced a degradation of service on 2021-December-22 that affected some of the services our clients use when accessing their Everbridge service from the *.everbridge.net domain.  Below is the Everbridge response to this critical event. 
 
Our commitment to our clients is 99.99% uptime for our core services.  We actively monitor all of our critical services to ensure we meet this commitment, and maintaining it is an ongoing effort.  We design our applications to fail gracefully, and our monitoring framework includes early-warning alerts.  
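For reference, the downtime budget implied by a 99.99% uptime commitment can be calculated directly.  The short Python illustration below is included only to show the arithmetic; it is not part of any Everbridge tooling.

    # Downtime budget implied by a 99.99% uptime commitment (illustration only).
    MINUTES_PER_YEAR = 365 * 24 * 60                    # 525,600 minutes
    allowed_downtime = MINUTES_PER_YEAR * (1 - 0.9999)  # minutes per year outside the commitment
    print(f"99.99% uptime allows about {allowed_downtime:.1f} minutes of downtime per year")
    # Prints roughly 52.6 minutes per year, or about 4.4 minutes per month.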
  
In compliance with FedRAMP requirements, Everbridge performs an annual contingency plan failover exercise to demonstrate the resilience of our service.  We performed the most recent exercise in August 2021 with successful results within the scope of the exercise requirements; by nature, such exercises can only test for failure modes we already know about.  Everbridge uses Amazon AWS as our cloud provider for all core capabilities, and the annual failover exercise did not cover the failure mode behind the recent Amazon AWS service degradation.  According to the details of the RCA from Amazon, the AWS service design relies on an internal network to which no client has access. 
  
AWS has had three (3) service issues in the last seven (7) years that affected Everbridge to different degrees, and we have learned more from each one.  It is unreasonable to expect any software to be immune to issues or failures; even cloud providers are not.  According to the RCA provided by Amazon, the root cause of the AWS outage was a faulty scaling event on AWS's internal backbone network devices.  This caused the devices to become oversaturated, which led to dropped requests on both the internal backbone network and the network customers use.  In turn, this produced communication errors and latency between availability zones (AZs).  AWS is taking several actions to address this issue, but AWS does not currently have any recommended actions for customers.  
  
We continuously evaluate how we use our cloud environment for our client-facing products and services.  These environments are secure, flexible, scalable, and highly available, and we continuously modernize our product infrastructure and scalability to support growing usage from our client base.

During Amazon's service issue, the loss of communication between AWS services caused failures between the EC2 (Compute) service and the Elastic Block Storage (EBS) service, and the connection between them was intermittent from the start of the event.  The heartbeat checks whose timeouts trigger failover consistently passed on the EC2 service, even though that service could not communicate with the storage layer.  As a result, connections to our primary database node in the affected AZ experienced failures, errors, and latency in the responses returned to our application layer, which in turn degraded our Everbridge services.

We are taking action to improve our architecture and to strengthen our highly available, fault-tolerant design so that Everbridge services and applications are better protected from future underlying cloud provider disruptions.  Our plan includes implementing a more robust architecture in the data layer, which will help mitigate single-AZ failures within an AWS region.  This data architecture will enhance our fault tolerance within a single AWS region and will not require our database to fail over to a secondary replica set in different AZs.  There is no firm end date for this work, since the team will learn more as they implement specific changes as part of our ongoing effort to maintain our commitment to 99.99% uptime.  
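To illustrate the failure mode described above, the sketch below shows a "deep" health check that probes the data path directly, so that failover can be triggered when the storage layer becomes unreachable even while a shallow instance heartbeat continues to pass.  This is a minimal, hypothetical Python sketch; the host name, port, and thresholds are placeholders and do not describe Everbridge's actual implementation.

    import socket
    import time

    # Hypothetical values for illustration only.
    DB_HOST = "primary-db.internal.example"   # placeholder data-layer hostname
    DB_PORT = 5432                            # placeholder database port
    PROBE_TIMEOUT_S = 2                       # seconds to wait for each probe
    CONSECUTIVE_FAILURES = 3                  # probes that must fail before acting

    def storage_reachable(host=DB_HOST, port=DB_PORT, timeout=PROBE_TIMEOUT_S):
        """Return True if a TCP connection to the data layer succeeds within the timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def monitor(trigger_failover):
        """Invoke trigger_failover() once the data layer misses several probes in a row.

        A shallow instance heartbeat can keep passing while the storage layer is
        unreachable; probing the data path directly is what lets this check notice
        that condition.  Runs indefinitely.
        """
        misses = 0
        while True:
            if storage_reachable():
                misses = 0
            else:
                misses += 1
                if misses >= CONSECUTIVE_FAILURES:
                    trigger_failover()
                    misses = 0
            time.sleep(PROBE_TIMEOUT_S)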

Contact your Everbridge Account Manager or Everbridge Technical Support if you have questions about the above.