--- GitHub 24-hour outage post mortem ---
As many of you will remember, GitHub fell over earlier this month and cracked its head on the countertop on the way down. For more or less a full 24 hours the repo-wrangling behemoth presented inconsistent data to users, with slow response times and failing requests during common user actions such as reporting issues and questioning your career choices in code reviews.
It's been revealed in a post-mortem of the incident (link at the end of the article) that DB replication was the root cause of the chaos, after a failing 100G network link was replaced during routine maintenance. I don't pretend to be a rockstar-ninja-wizard DBA, but after speaking with colleagues who went a shade whiter whenever the term "replication" came up, it's clear how hard it is to predict where a design decision will bite back and leave you untangling the web of lies and misinformation reported by the databases for weeks, if not months, after everything's gone a tad sideways.
When the link was yanked out of the east coast DC undergoing maintenance, GitHub's "Orchestrator" software did exactly what it was meant to do: it hit the "ohshi" button and failed over to another DC that wasn't reporting any issues. The hitch in the master plan was that when connectivity came back up at the east coast DC, Orchestrator was unable to (un)fail-over back to it, because each cluster now contained data the other didn't have.
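To make that "each cluster has data the other doesn't" problem concrete, here's a minimal sketch in Python. It is not GitHub's or Orchestrator's actual logic; the cluster names and the toy write-set model (standing in for something like MySQL GTID sets) are purely illustrative.

```python
# Toy model of the fail-back check: each cluster tracks the set of
# transactions it has applied. Failing back to the original primary is
# only safe if it already has every write the stand-in primary took.
def can_fail_back(original_primary: set, current_primary: set) -> bool:
    return current_primary <= original_primary

# East coast applied tx3 right before the link died and it never
# replicated west; the west coast then served traffic and applied
# tx4/tx5 that the east coast never saw.
east = {"tx1", "tx2", "tx3"}
west = {"tx1", "tx2", "tx4", "tx5"}

print(can_fail_back(east, west))  # False -> neither side can simply become
                                  # primary again without losing someone's writes
```

Once both sides hold writes the other lacks, the options are reconciling the histories by hand or declaring one side's writes lost, which is a big part of why the untangling took so long.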
At this point it's reasonable to assume that pants were turning funny colours. Monitoring systems across the board started squealing, firing off messages demanding engineers rouse from the land of nod and snap back to a reality that was a bit more "on fire" than usual. A quick call to Orchestrator's API returned a result set containing only database servers from the west coast; none of the east coast servers had responded.
Come 11pm UTC (about 10 minutes after the initial pant re-colouring), engineers realised they were well and truly backed into a corner; the site was flipped into "Yellow" status and internal deployment mechanisms were locked out. Five minutes later an Incident Coordinator was dragged from their lair by the status change and almost immediately flipped the site into "Red" status, a move I can only hope was accompanied by all the lights going red and klaxons sounding.
Even more engineers were roused from their slumber to help with the recovery effort. By this point hair was turning grey in real time: the fail-over DB cluster had been processing user data for nearly 40 minutes, and every second that passed made the inevitable untangling exponentially more difficult. Not long after, GitHub made the call to pause webhooks and GitHub Pages builds in an attempt to prevent further data loss, causing disruption to those of us using GitHub to kick off our deployment processes (myself included; I had to SSH in and run a git pull like some kind of savage).
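For anyone wondering what that webhook-driven deploy pattern looks like, here's a rough sketch, not my actual setup: a tiny listener that runs git pull whenever GitHub posts to it. The port, path and repo location are made up, and a real version would verify the webhook signature.

```python
# Bare-bones "push to GitHub, server pulls itself" deploy hook.
# When webhook delivery is paused (as it was during the outage),
# nothing hits this endpoint and you're back to SSH and a manual pull.
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

REPO_PATH = "/srv/myapp"  # hypothetical checkout to update on each push

class DeployHook(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path == "/deploy":
            subprocess.run(["git", "-C", REPO_PATH, "pull", "--ff-only"], check=False)
            self.send_response(200)
        else:
            self.send_response(404)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), DeployHook).serve_forever()
```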
Glossing over several more "and then things were still broken" sections of the post-mortem, clever engineers with their heads screwed on the right way successfully executed what I can only imagine was a large, complex and risky plan to untangle the mess and restore functionality. GitHub was picked up off the kitchen floor and promptly placed in a comfy chair with a sweet tea to recover. The enormous backlog of webhooks and Pages builds was caught up with and everything was more or less back to normal.
It goes to show that even the best-laid plan rarely survives first contact with the enemy: in this case, a failing 100G network link somewhere inside an east coast data center.
Link to the post mortem: https://blog.github.com/2018-10-30-...
-
It began when I was tasked with creating a better and more engaging experience for our new Facebook page. This was in Facebook's early days, so there were not really any "best practices". We were making it up as we went along. I decided one way would be to game-ify things, since gaming, at the time, was a Big Deal on Facebook and people were starting to use it to build customer funnels.
Grasping for low-hanging fruit, I decided a Tetris variant around our topic would be fun. I had to hire a dev because at the time I was a static HTML web developer just getting into social media management. I knew nothing about game development or how to use Facebook's API for such things.
Long story short, we got about $10,000 (FB app devs came at a premium then) into the project when I came across a very recent article about the history of Tetris games. It said that even though Tetris had once been considered, for all intents and purposes, to be public domain due to it being created by a Russian coder during the Cold War, it had just been acquired by an IP protection entity that was charging royalties for any variant of Tetris created from a specific date onward and paying the original developer. So, even though I thought I had been thorough in my initial permissions checking, it turned out we were gonna be in deep doo-doo with licensing fees and restrictions if we released this game to the public.
I had to call my boss and admit my error. She was FURIOUS and really gave me an ass-chewing over it. I then had to call the marketing person whose budget I'd been slaving away at wasting. She was a bit more forgiving (her budget was in the millions). Then I had to call the corporate legal department and explain what was going on. They told me to immediately pay any outstanding hours, then fire the dev but not before getting him to send me all code and assets, deleting his copy, and then, upon my receipt of those assets, deleting MY copy so that nothing of it ever existed. And I was supposed to say _nothing_ to the dev about why he was being let go, so that there would be no "trail" leading back to this fiasco. (The dev hounded me for weeks asking what he'd done wrong. It killed me that I was bound and gagged by corporate legal and couldn't tell him.)
I was in so much trouble. I was literally in tears over it. I'd never wasted that much money in my life. That incident pretty much sealed my fate as far as any trust my bosses ever put in me again (not much at all). I was a bit of a pariah in a lot of ways for the next 5 years, even though I had come onto the team as a young social media rockstar.
After that, and a couple of other bad scenarios that were less my fault and more due to a completely dysfunctional management and reporting structure, they eventually "transferred" me to another team. Which was really just a way of getting rid of me by sending me to a department that was already starting to outsource overseas and lay people off. It was less messy that way. I was in the first set of layoffs.
Since then, I've had a BIG fear of EVER joining a large corporation EVER again. I prefer to work for small businesses now, even if I get paid less. Much less stressful from an office politics and impact of mistakes standpoint.
-
I was working on email reporting to business customers, and in the test phase I was mass-sending email to my own account. However, it suddenly stopped arriving, and it took me a few minutes to realize I had commented out the hardcoded line with my email address, so the reports were going out to the real customers instead. I had to write to each customer and apologize for the spam after my error. I also had to get our email server whitelisted with a few of them after the incident.
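The lesson I took away: route test mail through an explicit override instead of a hardcoded line someone can comment out. A minimal sketch of that pattern, with entirely made-up names and an assumed local SMTP relay, nothing from the actual reporting system:

```python
# If REPORT_EMAIL_OVERRIDE is set (as it should be everywhere except
# production), every report is forced to that address, so a forgotten
# setting spams the developer rather than the customer list.
import os
import smtplib
from email.message import EmailMessage

TEST_OVERRIDE = os.environ.get("REPORT_EMAIL_OVERRIDE")  # e.g. dev@example.com

def send_report(customer_address: str, body: str) -> None:
    msg = EmailMessage()
    msg["From"] = "reports@example.com"
    msg["To"] = TEST_OVERRIDE or customer_address
    msg["Subject"] = "Business report"
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:  # assumed local relay
        smtp.send_message(msg)
```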
-
Ok so.
You know you have to deal with annoying things when you take on a guard duty role, and yes, we signed up for it because of the moolah.
However, you also want to do this with a reliable and robust monitoring and alerting system that you can depend on! And no, I am not going to advertise a product for this... What I will tell you is which one to avoid.
Meet Quest "Foglight"... It does EVERYTHING! It monitors, it alerts, it does trend watching, it does fancy-shmancy graphics, it does reporting, it is very extensible... WOW, right? Right?
Well, if you were stuck somewhere in 2005-2010, maybe... But this fucklight falls short on EVERYTHING.
Today, I got called up at 3:30 in the morning (I am typing this right after the incident) because this shit of a system does "High Availability" by basically letting the FMS servers suck each other's jaggons and hoping one of them somehow responds. It's a sort of keepalived thing, but built on proprietary Java tech...
Oh, yes, it's written in Java and... yes... Java 6.
This means that, effectively, we are running RHEL 5 machines (yes, RHEL 5!!!), because anything more modern in place? Nope.
I have no idea anymore what I am ranting about. I'm tired, I'm tired of this shit, I'm tired of getting called up just because some dude has been cussing up a sales representative, sucking each other's jaggons, and pushing a shit solution on the federal government for almost a decade now.
Fuck Foglight
Fuck Quest Software, because did you really think you would get enterprise-level support for an enterprise product which you paid enterprise euros for? You are so naive, how cute...
And consequently: Fuck Dell, and good job, Dell... for purchasing Quest Software, messing around with it, and then dumping it back on the market... Srsly Dell, you were like me when I had this hot-ass chick as a girlfriend who later turned out to be too crazy to justifiably tolerate compared to her hotness. Dump it like it's Trump.
Oh, and, wow! Foglight graced us with a successful startup after... what... 6 restarts? In 2 hours... With 12 CPUs and 128 GB of RAM and... oh, fuck this, you don't deserve such resources.