--- GitHub 24-hour outage post mortem ---
As many of you will remember, GitHub fell over earlier this month and cracked its head on the countertop on the way down. For more or less a full 24 hours the repo-wrangling behemoth served users inconsistent data, slow response times and failing requests during common actions such as reporting issues and questioning your career choice in code reviews.
It's been revealed in a post-mortem of the incident (link at the end of the article) that DB replication was the root cause of the chaos, after a failing 100G network link was replaced during routine maintenance. I don't pretend to be a rockstar-ninja-wizard DBA, but after speaking with colleagues who went a shade whiter when the term "replication" was used, it's clear how hard it is to predict where a design decision will bite back and leave you untangling the web of lies and misinformation reported by the databases for weeks, if not months, after everything's gone a tad sideways.
When the link was yanked out of the east coast DC undergoing maintenance, GitHub's "Orchestrator" software did exactly what it was meant to do: it hit the "ohshi" button and failed over to another DC that wasn't reporting any issues. The hitch in the master plan was that when connectivity came back up at the east coast DC, Orchestrator was unable to fail back, because each cluster now contained data the other didn't have.
At this point it's reasonable to assume that pants were turning funny colours. Monitoring systems across the board started squealing, firing off messages to engineers demanding they rouse themselves from the land of nod and snap back to a reality that was a bit more "on fire" than usual. A quick call to Orchestrator's API returned a result set that only contained database servers from the west coast; none of the east coast servers had responded.
Come 11pm UTC (about 10 minutes after the initial pant re-colouring), engineers realised they were well and truly backed into a corner: the site was flipped into "Yellow" status and internal mechanisms for deployments were locked out. Five minutes later an Incident Coordinator was dragged from their lair by the status change and almost immediately flipped the site into "Red" status, a move I can only hope was accompanied by all the lights going red and klaxons sounding.
Even more engineers were roused from their slumber to help with the recovery effort. By this point hair was turning grey in real time: the fail-over DB cluster had been processing user data for nearly 40 minutes, and every second that passed made the inevitable untangling process exponentially more difficult. Not long after this, GitHub made the call to pause webhooks and GitHub Pages builds in an attempt to prevent further data loss, causing disruption to those of us using GitHub as a way of kicking off our deployment processes (myself included; I had to SSH in and run a git pull myself like some kind of savage).
Glossing over several more "and then things were still broken" sections of the post mortem: clever engineers with their heads screwed on the right way successfully executed what I can only imagine was a large, complex and risky plan to untangle the mess and restore functionality. GitHub was picked up off the kitchen floor and promptly placed in a comfy chair with a sweet tea to recover. The enormous backlog of webhooks and Pages builds was caught up with and everything was more or less back to normal.
It goes to show that even the best-laid plan rarely survives first contact with the enemy; in this case, a failing 100G network link somewhere inside an east coast data center.
Link to the post mortem: https://blog.github.com/2018-10-30-...
-
Not laughing.
Not cursing.
Both for interviewing and being interviewed.
Some interviews could have been taken straight from a Mexican telenovela...
"Yeah, I worked for a year in the Walmart IT administration."
"Ok, what did you do?"
"Oh I had the high responsibility of taking care of swapping printer cartridges, programming the registers, stuff like that..."
"You apply for a senior database management role, you're aware of that?"
"Yeah. I took a bootcamp for 3 months in the evening after work. I'm up for the job and expect a payment of <lol, even having a stroke while writing a payment check that number will never happen>".
I made that up - but we had these cases... The story is just rewritten and mixed up for obvious reasons.
The same thing can happen when I'm the one being interviewed, by the way.
IMHO an interview is not only for the company, but for me as the prospective employee, too. I don't sugar-coat it. I want to know what type of shit I'm getting into and how deep I'll be drowning in it.
Some "types" of interviewers react kinda funny when I start roasting them with questions...
For example, the authoritarian type usually reacts with disrespect. A "how dare you piss on my front lawn" kind of reaction. Which makes it hard not to laugh, because who wants to work for someone who throws a temper tantrum during an interview? Even harder when the same guy promised you heaven before (the flowery kind of bullshit: everything's peaceful and fine, the team's great, and the leadership is just wonderful...).
Even worse is the patsy.
When you're sitting in an interview and the only answers you get are:
- Sorry, I don't know.
- I'm not allowed to ....
- Not in my area of expertise....
All just nice ways of saying: I'll say nothing, because then I'd have to take some responsibility.
:)
The most Mexican-telenovela stuff when I'm the one being interviewed, though, is when I manage to divide a team of interviewers and it starts turning into "Judge Judy" or some similar freaked-out courtroom show...
A: "No, our team doesn't work that way".
B: "But you will in the short future, WE committed to it".
C: "Not that I'm aware of".
And I, an obvious sinner who enjoys entertainment and schadenfreude, just keep adding kerosene to the fire.
"So, it seems like the team of A has its own rules which do not apply to B and C, do they also have greater funding?".
Oh, it's just fun to stir up a good bloodbath.
-
Summary: Burnout, and everything's broken.
I don't feel like doing a damn thing today. I look at the code and cringe. I look at Slack and think "ugh. i can't." Mental capitals are even too much work.
(I've started reading "Zen and the Art of Motorcycle Maintenance" to try and combat burnout. I'll write a rant/story about it here if I find it helpful. but all I want to do today is drink tea and read.)
But onto the story:
Heroku is deprecating support for, and will automatically upgrade, any old versions of Postgres running on its platform after August something (like five days from now).
I performed the upgrade to PG10 on Sunday (and late into the night), provisioning a new follower, blah blah blah.
However, the version of Rails we're using (4.2.x) doesn't support PG10 sequences, so I manually added in support via a monkeypatch. I did this on our QA servers first, obviously, and everything worked as expected. After half a day of no issues, I did the same on production, and again: everything worked as expected.
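(For the curious: the actual patch isn't shown here, but the gist of the problem is that PostgreSQL 10 moved a sequence's metadata (increment_by, min_value, max_value, ...) out of the sequence relation itself and into the new pg_sequence catalog, so older adapter code that runs `SELECT increment_by FROM some_sequence` falls over. A rough sketch of the kind of initializer monkeypatch involved, where `sequence_metadata` is a made-up stand-in for whichever Rails 4.2 adapter internal actually broke:)

```ruby
# config/initializers/pg10_sequences.rb -- hypothetical sketch, not the actual patch.
# PG10 moved increment_by/min_value/max_value out of the sequence relation and into
# the pg_sequence catalog, so "SELECT increment_by FROM some_sequence" now fails.
# `sequence_metadata` is an invented name standing in for the adapter internal that broke.
require "active_record/connection_adapters/postgresql_adapter"

module PG10SequenceSupport
  def sequence_metadata(sequence_name)
    if postgresql_version >= 100_000 # PostgreSQL 10.0 and up
      select_one(<<-SQL)
        SELECT seqincrement AS increment_by, seqmin AS min_value, seqmax AS max_value
        FROM pg_sequence
        WHERE seqrelid = #{quote(sequence_name)}::regclass
      SQL
    else
      # Pre-PG10: the sequence relation itself still carries the metadata columns.
      select_one("SELECT increment_by, min_value, max_value FROM #{quote_table_name(sequence_name)}")
    end
  end
end

ActiveRecord::ConnectionAdapters::PostgreSQLAdapter.prepend(PG10SequenceSupport)
```

Dropped into config/initializers/, something like this gets picked up at boot; the point is only the version check and the pg_sequence query, not the exact method being overridden.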
But today? I keep hearing about new things that are broken. One specific type of alert doesn't work for one specific person (wat). Can't send [redacted] at all. Can't update merchants! Yet there are magically no errors logged.
That last one (well, two) is just great; let me explain: when there's an error concerning merchants, the error gets caught, isn't logged or recorded anywhere (so it just disappears), and the rescue block returns a JSON response instead and happily exits. This is for an internal admin tool, so returning a user-friendly error is kinda stupid anyway, but masking what actually happened? Fuck that dev with an obelisk made from spikes and solidified pain. That JSON response is also lovely: it's a 200 OK returning {status: 1, data: "[generic message containing incorrect IT jargon]"}. It doesn't even say "error" anywhere. Bloody everything about this pattern is absolutely wrong. Even the friggin' text.
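To spell out the shape of it (a reconstruction, not the real code; the controller and params below are invented for illustration):

```ruby
# Hypothetical reconstruction of the anti-pattern described above, for an internal
# admin tool. Every name here is made up; only the pattern matches the rant.
class MerchantsController < ApplicationController
  def update
    merchant = Merchant.find(params[:id])
    merchant.update!(merchant_params)
    render json: { status: 0, data: "merchant updated" }
  rescue StandardError
    # The exception is discarded here: nothing logged, nothing reported,
    # and the client still gets a 200 OK with a body that never says "error".
    render json: { status: 1, data: "generic message with incorrect IT jargon" }
  end

  private

  def merchant_params
    params.require(:merchant).permit(:name, :status)
  end
end
```

The saner shape is the opposite on every axis: log or report the exception (Rails.logger.error, an error tracker, anything), return a non-2xx status such as :unprocessable_entity so callers and monitoring can actually see the failure, and say "error" in the body.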
Fucking hell. I want to pipe the entire codebase into shred and walk out the door.
But I digress. So many things are broken, my motivation is waning to a sliver, and I have a conference call today where I'll undoubtedly be asked why everything is smoking and/or on fire, and my huge and overly productive week last week will ofc mean nothing by contrast.
Ugh.
`shred ~/dev/work -zfu -n 32 & ./brew tea --hot && wine ~/takeabreak.exe`
-
"Alright everyone, we can't keep this up. Every day our builds are breaking because of test failures."
"We could just be more diligent devs and actually write/update tests based on new behavior we introduce to the system?"
"What? No! We're just going to get rid of all tests!"
a few days later
"Guys!!! Everything's on fire now! How didn't we catch these huge breaking changes!"
https://media3.giphy.com/media/...
-
!rant, more of an incredulous/cruelly amused "you had ONE job..."
so: biggest IT/PC/electronics store in my (and the neighboring) country. their webpage, of course with the option to buy online, because of course.
the big green "Buy" button does nothing. doesn't work. doesn't react. I keep clicking it multiple times, shorter, longer, etc, because maybe their JS scripts are just shit so they slow.
nope.
okay. open devtools, JS console.
hover over the button: "Error: isMobile is not a function".
click the button: "Error: isMobile is not a function"
WAT.
search for isMobile in the script.
173 occurrences.
fuck this.
console: isMobile = function(){return false;}
because I'm not on my phone.
click the "Buy" button.
works flawlessly.
...HOW?
THE WHOLE PAGE IS AN ESHOP YOU COMIC RELIEF INCOMPETENTS! =D
173 uses of a non-existent function blocking a business-critical feature, THE ONLY CORE FEATURE FOR WHICH YOUR SITE EVEN EXISTS, and NOBODY, not the dev who fucked it up, NOT EVEN QA, noticed it??? =D =D
if I was the boss of the devs, or even boss of the whole company...
git blame
...and then I'd go up the whole chain, from the dev who caused the bug through all of the QA people who "tested" that version before deploy, and I would personally, on the spot, fire each and every single one of them.
mainly because of who knows how much money this stupid thing (not even a proper bug) lost them.
but secondarily, because clearly none of those people give a single shit (n)or have an idea how to do their jobs.
=D =D
yeah but I was a good guy, filed a bug report in the "Complaints" section of their Contact form.
it goes to some call-center-like peon, so it starts with the sentence "forward this straight to your site's dev people to file as a bug, thank you".
but... HOW.... =D
HOW can you let something like this through? =D
the bottleneck of your whole user interaction, which forms the first of the three steps OF THE MAIN AND MOST IMPORTANT FUNCTION of your whole business... =D
...I...
...does not compute =D
...BUT THEY'RE USING ANGULAR, SO THEY'RE ALL MODERN AND HIGH-TECH AND EVERYTHING'S FINE!!! =D =D