ctnqhk
2y

Stakeholder: Users are unable to buy tickets on the website. IT says Azure’s health check is showing an unhealthy status.

[It’s Sunday. Web Engineering is not on call so no one sees this right away.]

Stakeholder: IT restarted the Azure website twice, but users still can’t place orders.

Me: There was never an issue with the Azure site. That health check is inaccurate. A rewrite rule redirects the Azure-supplied domain to our custom domain. The Azure health check doesn’t like that, so it returns an unhealthy status. The real problem is the ticketing server that the website has to communicate with. It’s overwhelmed and can’t handle more requests. IT should have checked the ticketing server’s logs. This has happened before, and it has never been an Azure issue. It’s a ticketing server issue.

Stakeholder and IT: Oops 😅

---

JFC. Stop trying to make this web engineering’s problem. Stop trying to make it look like engineering dropped the ball. The ticketing server has had this issue multiple times, and it’s maintained by a different team. The website’s symptoms are always the same, and there are steps you need to take before you decide to restart the website, because a restart makes the website show a blue screen of death that says “503 Service Unavailable” for a few minutes. And we have a switch to shut off all transactions. Why do you not want to use it when it’s clear the website can’t process transactions???
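
For illustration, here is a rough sketch in Python of the check order described above. Every name in it (`Symptoms`, `Action`, `next_step`, the field names) is hypothetical, and the real steps would depend on the actual setup. It assumes someone has already confirmed that orders are failing. The Azure health check is deliberately left out as an input, since it reports unhealthy because of the redirect either way.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    CHECK_TICKETING_LOGS = "check the ticketing server's logs first"
    DISABLE_TRANSACTIONS = "flip the transaction switch and page the ticketing team"
    RESTART_WEBSITE = "restart the website (expect a few minutes of 503)"
    ESCALATE = "escalate to web engineering"


@dataclass
class Symptoms:
    # Hypothetical inputs, assuming orders are already known to be failing.
    ticketing_logs_checked: bool
    ticketing_server_overloaded: bool
    site_serves_pages: bool  # checked on the custom domain, not the Azure one


def next_step(s: Symptoms) -> Action:
    # Nothing else happens until someone has looked at the ticketing server.
    if not s.ticketing_logs_checked:
        return Action.CHECK_TICKETING_LOGS
    # An overloaded ticketing server is not fixed by restarting the website.
    if s.ticketing_server_overloaded:
        return Action.DISABLE_TRANSACTIONS
    # Restart only when the ticketing server is fine and the site itself is down.
    if not s.site_serves_pages:
        return Action.RESTART_WEBSITE
    return Action.ESCALATE


if __name__ == "__main__":
    sunday = Symptoms(
        ticketing_logs_checked=True,
        ticketing_server_overloaded=True,
        site_serves_pages=True,
    )
    print(next_step(sunday).value)
```

The point of the ordering is that restarting the website is the last resort, not the first reflex.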

Comments
  • 5
    You lost me at Jakarta Fried Chicken.

    Other than that really good rant. Sorry to hear you have to work with these headless chickens. Time to fry them some more.
  • 2
    Lying health checkers are the worst.

    At that point you might as well honestly just kill the health check entirely.

    A health check is subconsciously seen as one of the most obvious things you check, and something you should be able to trust.

    So I can totally see why IT forgot that there was this huge caveat that it's not trustworthy.

    When fires are burning you lose 50% of your intelligence - and if someone new is working in IT, of course they'll go for the most obvious thing, "health check says it's dead" - delving deeper than that is much more complex.
  • 2
    @jiraTicket The messed-up part is IT made a script to notify them of a ticketing server failure, but the script had an error in it and didn’t fire. It was a new script. So even though the website had all the symptoms of a ticketing server failure, they didn’t bother to check that server’s logs before hitting restart on the website. I decided to make a decision tree so stakeholders and staff don’t panic like this again.