6
lungdart
33d

We had a production outage directly caused by our team not following a change procedure correctly. Now we're under a microscope and in a "get well" program.

They took over the daily standup for this high-priority program and are organizing efforts in Confluence instead of Jira.

Now we have a Confluence doc of what everyone is working on, with someone changing the text status in a table by hand every morning, along with the comments in a notes section...

Comments
  • 1
    I bet it wasn't "directly" caused by not following procedure. Probably not even indirectly. What was it?
  • 1
    I would be so ashamed 😁
  • 0
    @donkulator It was a bad change that had rollback steps. When it went bad, they ad-hoc'd a fix instead of rolling back. The fix didn't resolve the outage, but it did stop the alerts.

    The service was down for 24 hours, and we only caught it because the customer kept escalating the complaint.