- The longest Atlassian outage of all time has left hundreds of companies without access to JIRA, Confluence, and OpsGenie.
- Atlassian stayed silent for most of the outage and acknowledged it publicly only on day 9.
- Customers received templated emails and no answers to their questions.
- The cause of the outage was a script that was supposed to delete a plugin's customer data, but instead deleted all data belonging to every customer who used the plugin.
- Atlassian can restore all data to a checkpoint in a matter of hours, but doing so would cause unaffected customers to lose all data committed since that point.
- Atlassian is now restoring customers in batches of up to 60 tenants at a time, a process that takes between 4 and 5 elapsed days.
- Atlassian customers experienced a major outage, with zero access to their products and data.
- Customers had difficulty reporting the issue to Atlassian because their domain had been deleted.
- The biggest complaint from customers was the poor communication from Atlassian.
- The impact of the outage was large, with many companies not having backups of critical documents on Confluence or JIRA.
- Customers are eligible for a 50% discount on their next monthly bill.
- The biggest impact of this outage is not in lost revenue, but reputational damage.
- Competitors are sure to win from this fumble, and will reference this Atlassian outage in their sales pitches for years to come.
- Lessons for incident handling include having a runbook for disaster recovery, communicating directly and transparently, speaking the customer’s language, and avoiding radio silence.
- Lessons for avoiding such an incident include having a rollback plan for all migrations and deprecations, and doing dry runs of them first.
- Atlassian failed to follow its own guidelines for incident handling, and executives took no public ownership of the outage until day 9.
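
The advice about rollback plans and dry runs can be illustrated with a minimal Python sketch. It is not Atlassian's actual tooling: `Store`, `delete_records`, and `rollback` are hypothetical names invented for this example, and the in-memory dictionaries stand in for a real data store. The idea is that a deletion script defaults to a dry run that only reports its targets, and that a real run soft-deletes records so they can be restored.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Hypothetical in-memory stand-in for a real data store."""
    records: dict = field(default_factory=dict)  # id -> payload
    trash: dict = field(default_factory=dict)    # soft-deleted records


def delete_records(store, ids, dry_run=True):
    """Soft-delete the given ids and return the ids that were targeted.

    With dry_run=True (the default) nothing is modified; the caller only
    gets the list of ids that a real run would remove.
    """
    targets = [i for i in ids if i in store.records]
    if dry_run:
        return targets
    for i in targets:
        # Move the record to the trash instead of destroying it,
        # so the operation can be rolled back.
        store.trash[i] = store.records.pop(i)
    return targets


def rollback(store, ids):
    """Restore previously soft-deleted records from the trash."""
    for i in ids:
        if i in store.trash:
            store.records[i] = store.trash.pop(i)
```

In this sketch an operator would review the list returned by the dry run before repeating the call with `dry_run=False`, and could call `rollback` with the same ids if the wrong records turned out to be targeted.
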
Published April 13, 2022
Visit The Pragmatic Engineer to read Gergely Orosz’s original post