Incident summary/analysis/recovery status of outage on March 10

  • Saturday, 20th March, 2021
  • 12:11pm

As most of you are aware by now, our services hosted in OVH's Strasbourg DC went offline due to an uncontained fire in the SBG2 building.

Timeline of events:

  • Major uncontained fire at OVH Strasbourg DC on 10 March 2021, 0130 UTC.
  • This event took down 9 mail servers; Storage-node 1, which served as the primary storage node for another 33 mail servers; and 5 other servers that were part of redundant infrastructure, which itself remained unaffected.
  • A new primary storage node, Storage-node 3, was provisioned at the Roubaix DC, and all 33 mail servers that lost connection to their primary storage node were back online by 10 March 2021, 0930 UTC.
  • The 9 mail servers that were hosted in SBG were back online by 10 March 2021, 1500 UTC, with input from affected customers to re-install each server using the console installation wizard.
  • No loss of mail occurred: the redundant backup MX queued mail for the affected servers while they were offline and held it for later delivery, and email data was restored to Storage-node 3 from secure, offsite backups by 13 March 2021.

Difficulties faced:

  • Due to the unplanned switch from the old mail system (OMS) to the new mail system (NMS), we were unable to automatically restore admin panel (Mail portal) settings, which included domains, mailboxes, aliases, etc., from database backups without requiring customer input. Once customers provided their list of domains, we were able to provide the list of mailboxes for all customers and import these for some customers, but due to operational bottlenecks and issues we were unable to perform the import task for all customers.
  • Due to non-standardized backups for OMS services, we were unable to immediately provide a list of aliases or other data, such as address books and calendars hosted in SOGo.
  • Due to non-standardized backups for OMS services, the databases containing the above data could not be recovered for some services.

Steps taken to improve service:

  • Backups have been standardized in NMS services, covering email data, address books, calendars, databases, etc. (all offsite). This would allow us to easily restore servers to their original state after a major outage like this one without requiring customer input.
  • We will proceed with a planned migration of all existing OMS services to NMS later next year.

Recovery status:

  • Email data: Recovered
  • Domains: Recovered / Recoverable
  • Mailboxes: Recovered / Recoverable

Recovery status for aliases, address books, and calendars is listed below for each server:

  • Mail9 (basic plan): Recoverable
  • Mail10 (basic plan): Non-recoverable
  • Dedicated services: please contact support for recovery status.

 
