Search - "logs of death"
So we have an API that my team is supposed to send messages to in a fire-and-forget kind of style.
We are dependent on it. If it fails, there is some annoying manual labor involved in cleaning that mess up. (If it even can be cleaned up, as sometimes it is also time-sensitive.)
Yet once in a while, that endpoint just crashes and lets the request vanish. No response, no error, nothing, it is just gone.
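For context, the sending side is nothing exotic; roughly this shape (endpoint URL and payload are placeholders, not our real setup), with the one safeguard that makes cleanup possible at all, a local record of failures:

    import logging
    import requests

    ENDPOINT = "https://messages-api.internal/messages"  # hypothetical URL

    def send_message(payload: dict) -> None:
        """Fire and forget, but keep a local breadcrumb when delivery fails."""
        try:
            resp = requests.post(ENDPOINT, json=payload, timeout=5)
            resp.raise_for_status()
        except requests.RequestException as exc:
            # Without this record, a swallowed message is simply gone.
            logging.error("message not delivered, manual cleanup needed: %s", exc)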
Digging through that API's log files, nothing pops up. But then I notice the size of the log files: about 30 GB of good old plain-text logs.
It turns out that the API has taken the LOG EVERYTHING approach so much to heart that it logs itself to the point of its own death.
Is circular logging such bleeding-edge technology? It's not like there are external solutions for it, like Loggly or Kibana. But oh, one might have to pay for those. Just dump it to the disk :/
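For the record, size-capped rotation is one stdlib import away. A minimal sketch in Python (file name and size limits invented, the API in question isn't mine to show):

    import logging
    from logging.handlers import RotatingFileHandler

    # Cap disk usage at ~60 MB: six files of 10 MB each, oldest overwritten.
    handler = RotatingFileHandler("api.log", maxBytes=10 * 1024 * 1024, backupCount=5)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

    logger = logging.getLogger("api")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)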
This is again a combination of developers thinking "I don't need to care about space! It's cheap!" and managers thinking "100 GB should be enough for that server cluster. Let's restrict its HDD to 100 GB and save some money!"
And then, here I stand, trying to keep my sanity :/
Stakeholder: Users are unable to buy tickets on the website. IT says Azure’s health check is showing an unhealthy status.
[It’s Sunday. Web Engineering is not on call so no one sees this right away.]
Stakeholder: IT restarted the Azure website twice, but users still can’t place orders.
Me: There was never an issue with the Azure site. That health check is inaccurate: there is a rewrite rule that sends the Azure-supplied domain to our custom domain, and the Azure health check doesn’t like that, so it returns an unhealthy status. The real problem is the ticketing server that the website has to communicate with. The ticketing server is overwhelmed and can’t handle more requests. IT should have checked the ticketing server’s logs. This has happened before, and it’s never been an Azure issue. It’s a ticketing server issue.
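The usual fix, for what it’s worth, is to exempt the probe path from the domain redirect so the health check gets a plain 200 instead of a 301. Our actual rule lives in the site’s rewrite config; here is the same idea as a WSGI middleware sketch (host and probe path are made up):

    CANONICAL_HOST = "tickets.example.com"  # hypothetical custom domain
    HEALTH_PATH = "/healthz"                # hypothetical probe path

    def canonical_redirect(app):
        """Redirect the Azure-supplied domain to the custom one, but let probes through."""
        def middleware(environ, start_response):
            host = environ.get("HTTP_HOST", "")
            path = environ.get("PATH_INFO", "")
            if host != CANONICAL_HOST and path != HEALTH_PATH:
                start_response("301 Moved Permanently",
                               [("Location", f"https://{CANONICAL_HOST}{path}")])
                return [b""]
            return app(environ, start_response)
        return middleware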
Stakeholder and IT: Oops 😅
---
JFC. Stop trying to make this web engineering’s problem. Stop trying to make it look like engineering dropped the ball. The ticketing server has experienced this issue multiple times, and it is maintained by a different team. The website’s symptoms are always the same, and there are steps you need to take before you decide to restart the website, which causes it to show a blue screen of death saying "503 Service Unavailable" for a few minutes. And we have a switch to shut off all transactions. Why do you not want to use it when it’s clear the website can’t process transactions???
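And the switch really is that boring: one flag check in front of checkout, flipped via an app setting, no restart and no 503 blue screen involved. A sketch with invented names:

    import os

    class ServiceUnavailable(Exception):
        """Rendered as a friendly 'sales paused' page, not a 503 blue screen."""

    def sales_enabled() -> bool:
        # Toggled via an app setting; takes effect without restarting the site.
        return os.environ.get("TICKET_SALES_ENABLED", "true").lower() == "true"

    def start_checkout(order) -> None:
        if not sales_enabled():
            raise ServiceUnavailable("Ticket sales are temporarily paused.")
        # ...hand the order off to the ticketing server...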