--- GitHub 24-hour outage post mortem ---
As many of you will remember, Github fell over earlier this month and cracked its head on the counter top on the way down. For more or less a full 24 hours the repo-wrangling behemoth was presenting inconsistent data to users, with slow response times and failing requests during common user actions such as reporting issues and questioning your career choice in code reviews.
It's been revealed in a post-mortem of the incident (link at the end of the article) that DB replication was the root cause of the chaos, after a failing 100G network link was replaced during routine maintenance. I don't pretend to be a rockstar-ninja-wizard DBA, but after speaking with colleagues who went a shade whiter when the term "replication" was used, it's clear how hard it is to predict where a design decision will bite back and leave you untangling the web of lies and misinformation reported by the databases for weeks, if not months, after everything's gone a tad sideways.
When the link was yanked out of the east coast DC undergoing maintenance, Github's "Orchestrator" software did exactly what it was meant to do: it hit the "ohshi" button and failed over to another DC that wasn't reporting any issues. The hitch in the master plan was that when connectivity came back up at the east coast DC, Orchestrator was unable to fail back to it, because each cluster now contained data the other didn't have.
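To make the "each cluster containing data the other didn't have" problem concrete, here's a toy Python sketch. This is not Orchestrator's actual logic and the transaction ids are made up; it just mimics the GTID-style check a replication tool performs: failing back is only safe when one side's write history is a superset of the other's.

```python
# Toy illustration of why the fail-back was impossible (not real Orchestrator code).
# Real MySQL tooling reasons about GTID sets; these fake transaction ids
# just show what "each cluster has data the other doesn't" means.

def can_fail_back(old_primary_txns: set[str], new_primary_txns: set[str]) -> bool:
    """Failing back is only safe if the new primary already has every write
    the old primary has (i.e. neither side has a divergent history)."""
    return old_primary_txns <= new_primary_txns

# Writes applied before the link died, plus what each side accepted afterwards.
east_coast = {"tx1", "tx2", "tx3", "tx4"}         # tx4 landed just before the failover
west_coast = {"tx1", "tx2", "tx3", "tx5", "tx6"}  # never saw tx4, kept accepting writes

print(can_fail_back(east_coast, west_coast))  # False -> hours of manual untangling
```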
At this point it's reasonable to assume that pants were turning funny colours. Monitoring systems across the board started squealing, firing off messages to engineers demanding they rouse from the land of nod and snap back to a reality that was a bit more "on fire" than usual. A quick call to Orchestrator's API returned a result set that only contained database servers from the west coast; none of the east coast servers had responded.
Come 11pm UTC (about 10 minutes after the initial pant re-colouring), engineers realised they were well and truly backed into a corner. The site was flipped into "Yellow" status and internal mechanisms for deployments were locked out. Five minutes later an Incident Co-ordinator was dragged from their lair by the status change and almost immediately flipped the site into "Red" status, a move I can only hope was accompanied by all the lights going red and klaxons sounding.
Even more engineers were roused from their slumber to help with the recovery effort. By this point hair was turning grey in real time: the fail-over DB cluster had been processing user data for nearly 40 minutes, and every second that passed made the inevitable untangling process exponentially more difficult. Not long after this, Github made the call to pause webhooks and Github Pages builds in an attempt to prevent further data loss, causing disruption to those of us using Github as a way of kicking off our deployment processes (myself included; I had to SSH in and run a git pull like some kind of savage).
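For context on the "Github as a deployment trigger" bit, plenty of setups amount to something like the minimal sketch below: a push-webhook listener that just runs a git pull when GitHub calls it. The path, port and missing signature check are illustrative assumptions, not anyone's real config. With webhook delivery paused, nothing ever hits the endpoint, hence the manual SSH-and-pull.

```python
# Minimal sketch of a push-webhook deploy hook (illustrative paths/ports only).
# When GitHub pauses webhook delivery, this never fires and you pull by hand.
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

REPO_DIR = "/var/www/my-site"  # hypothetical deployment checkout

class DeployHook(BaseHTTPRequestHandler):
    def do_POST(self):
        # A real setup should verify the X-Hub-Signature-256 header here.
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        subprocess.run(["git", "-C", REPO_DIR, "pull", "--ff-only"], check=False)
        self.send_response(204)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), DeployHook).serve_forever()
```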
Glossing over several more "and then things were still broken" sections of the post mortem: clever engineers with their heads screwed on the right way successfully executed what I can only imagine was a large, complex and risky plan to untangle the mess and restore functionality. Github was picked up off the kitchen floor and promptly placed in a comfy chair with a sweet tea to recover. The enormous backlog of webhooks and Pages builds was caught up with, and everything was more or less back to normal.
It goes to show that even the best-laid plan rarely survives first contact with the enemy - in this case a failing 100G network link somewhere inside an east coast data center.
Link to the post mortem: https://blog.github.com/2018-10-30-...
-
Boss fast-tracked a release for a customer. No testing. No checking the commits or the state of things. An undocumented requirement brings down everything and everyone is running around with their hair on fire. Sigh.
-
In reply to this:
https://devrant.com/rants/260590/...
As a senior dev with over 13 years' experience, I will break this down point by point in the most realistic way, so you don't get into trouble for following boring paternal internet advice.
1) False. Being a go-getter, proactive and eager to learn is a good thing in most places.
This doesn't mean being an entitled asshole, but standing up for yourself (don't get put down and used to do shit for others, or it will become the routine) and showing good learning and exploration skills will definitely put you in a good light.
2) False. Two things to check:
a) whether the guy over you is an entitled asshole who thinks you're going to steal his job and will try to sabotage you or not answer, acting annoyed, or whether he's a cool guy.
Choose your questions wisely and put them all together. Don't be that guy that fires off questions piecemeal, one every 2 minutes.
Put them together and try to work out the obvious stuff and whatever can be done through Google or ChatGPT by yourself. Then collect the hard ones for the experienced guy and ask them all at once. He's been put over you to help you.
3) Idiotic. NO.
Working code = good code. It's always been like this.
If you follow this idiotic advice you will annoy everyone.
The thing about renaming variables and crap is called a standard. Most companies will have a document with one if there is a need to follow it.
What remains are common programming conventions that everyone mostly follows.
Otherwise you'll end up going crazy over all the rules and small conventions and will start writing messy hot spaghetti code filled with syntactic sugar that no one likes, yourself included.
4) LMAO.
This almost never happens in real life (seniors sending their code to juniors).
But it does happen the other way around (junior code gets reviewed).
He must either be a crap programmer or have stopped learning years ago(?)
5) This is absolutely true.
Programming is not a forgiving job if you're not honest.
Covering up a mess in programming is mostly impossible, especially since git and all that stuff with your name on it came along.
Be honest, admit your faults, ask if not sure.
Code is code: if it's wrong it won't magically work, and sooner or later it will fire back.
6) Somewhat true, but it all depends on the deadline you're given and the complexity of the logic to be implemented.
If it's very complex you usually have to divide and conquer.
7) LMAO, this one might be true for multi-billion-dollar companies with thousands of employees.
Normal companies rarely do that because it's a waste of time. They pass knowledge on by word of mouth or with concise documentation that later gets explained to the devs by seniors or TLs.
Try following this as a junior and:
1) you will have written shit docs and wasted time
2) you will come up to the devs at the deadline with half the code done and they'll be saying wtf, who told you to do that
8) See? What an oxymoron ahahah
Look at this guy's point 3, then re-read this.
This alone should prove to you that I'm right about everything else.
9) Half true.
Watch your ass. You need to understand what you're getting yourself into.
If it's some unknown deep-sea shit with no documentation whatsoever, you will end up with a sore ass, pulling your hair out while hunting for crumbs of code that make that unknown thing work.
Believe me, not him.
I have been there. To give one example, I worked on a high-level project using powerful RFID reading antennas to do large warehouse inventory at high speed (instead of counting manually or scanning pieces, they put RFID tags inside the boxes and pass a scanner between the shelves, reading the whole inventory).
I had to deal with the entire RFID protocol, the math behind radio waves (yes, knowing it lets you configure the antennas more efficiently and avoid conflicts), learn a whole new vendor SDK I've never used again (useless knowledge = time wasted and no resume-worthy material for your next job), and so on.
It was a grueling, hair-pulling, horrible experience that brought me nothing in return except the skill of accepting and embracing the pain of such experiences.
And I can go on with other stories. Horror Stories.
If it's something that is doable but complex, hard or just interesting, go for it. Especially if the tech involved is something marketable.
10) Yes, and you can't stop learning, especially now that AI will start to cover more and more of our work.