Do all the things like ++ or -- rants, post your own rants, comment on others' rants and build your customized dev avatarSign Up
Get a devDuck
Rubber duck debugging has never been so cute! Get your favorite coding language devDuckBuy Now
Search - "high availability"
--- GitHub 24-hour outage post mortem ---
As many of you will remember; Github fell over earlier this month and cracked its head on the counter top on the way down. For more or less a full 24 hours the repo-wrangling behemoth had inconsistent data being presented to users, slow response times and failing requests during common user actions such as reporting issues and questioning your career choice in code reviews.
It's been revealed in a post-mortem of the incident (link at the end of the article) that DB replication was the root cause of the chaos after a failing 100G network link was being replaced during routine maintenance. I don't pretend to be a rockstar-ninja-wizard DBA but after speaking with colleagues who went a shade whiter when the term "replication" was used - It's hard to predict where a design decision will bite back and leave you untanging the web of lies and misinformation reported by the databases for weeks if not months after everything's gone a tad sideways.
When the link was yanked out of the east coast DC undergoing maintenance - Github's "Orchestrator" software did exactly what it was meant to do; It hit the "ohshi" button and failed over to another DC that wasn't reporting any issues. The hitch in the master plan was that when connectivity came back up at the east coast DC, Orchestrator was unable to (un)fail-over back to the east coast DC due to each cluster containing data the other didn't have.
At this point it's reasonable to assume that pants were turning funny colours - Monitoring systems across the board started squealing, firing off messages to engineers demanding they rouse from the land of nod and snap back to reality, that was a bit more "on-fire" than usual. A quick call to Orchestrator's API returned a result set that only contained database servers from the west coast - none of the east coast servers had responded.
Come 11pm UTC (about 10 minutes after the initial pant re-colouring) engineers realised they were well and truly backed into a corner, the site was flipped into "Yellow" status and internal mechanisms for deployments were locked out. 5 minutes later an Incident Co-ordinator was dragged from their lair by the status change and almost immediately flipped the site into "Red" status, a move i can only hope was accompanied by all the lights going red and klaxons sounding.
Even more engineers were roused from their slumber to help with the recovery effort, By this point hair was turning grey in real time - The fail-over DB cluster had been processing user data for nearly 40 minutes, every second that passed made the inevitable untangling process exponentially more difficult. Not long after this Github made the call to pause webhooks and Github Pages builds in an attempt to prevent further data loss, causing disruption to those of us using Github as a way of kicking off our deployment processes (myself included, I had to SSH in and run a git pull myself like some kind of savage).
Glossing over several more "And then things were still broken" sections of the post mortem; Clever engineers with their heads screwed on the right way successfully executed what i can only imagine was a large, complex and risky plan to untangle the mess and restore functionality. Github was picked up off the kitchen floor and promptly placed in a comfy chair with a sweet tea to recover. The enormous backlog of webhooks and Pages builds was caught up with and everything was more or less back to normal.
It goes to show that even the best laid plan rarely survives first contact with the enemy, In this case a failing 100G network link somewhere inside an east coast data center.
Link to the post mortem: https://blog.github.com/2018-10-30-...10
My dream is to build a shopping cart for web stores that doesn't fucking suck.
Seriously Bigcommerce, Shopify, Magneto, etc. All of you can eat bag of dicks and burn in hell for ever.
I don't care what languages you fancy, all of their stacks are a pile of shit, monkey patched together with popsicle sticks and duct tape and it all falls apart with high concurrency.
All their greasy haired sales teams will throw all manners of horse shit at the poor bastards who are trying to run a business so they can pad their commission checks... "High availability", "scalable", "reliable", "Increased conversation rate"... Lying dick fucks, all of them! I am calling them the fuck out on that snake oil they're all peddling.
The only thing worse than their shit APIs is the shit documentation and the shit support that accompanies them.
Support of these platforms are pretty much all the same, sure mayhaps one has 24*7 phone support and another closes at 9 or some shit like that, either way the only people they put on the phone are monkeys that will freeze up and say "I'm not a developer so I can't help you"... Guess what, "Eric"! I didn't ask if you're a fucking dev! I'm calling because one of your devs fucked up and I need you to tell him to unfuck it so I can get the fuck on with my day!
Their app/plugin market places are shameful to say the least. The overall quality of software is somewhat dire and it's mostly dominated by oversees developers who speak English about as well as the language they're developing with (not very well usually).
I could go on until I hit the character limit but I'm gonna end it here by saying, all shopping carts suck and they should burn for eternity in the depths of hell so that a savior can free all developers from this agonizing torment.9
Most satisfying bug I've fixed?
Fixed a n+1 issue with a web service retrieving price information. I initially wrote the service, but it was taken over by a couple of 'world class' monday-morning-quarterbacks.
The "Worst code I've ever seen" ... "I can't believe this crap compiles" types that never met anyone else's code that was any good.
After a few months (yes months) and heavy refactoring, the service still returned price information for a product. Pass the service a list of product numbers, service returns the price, availability, etc, that was it.
After a very proud and boisterous deployment, over the next couple of days the service seemed to get slower and slower. DBAs started to complain that the service was causing unusually high wait times, locks, and CPU spikes causing problems for other applications. The usual finger pointing began which ended up with "If PaperTrail had written the service 'correctly' the first time, we wouldn't be in this mess."
Only mattered that I initially wrote the service and no one seemed to care about the two geniuses that took months changing the code.
The dev manager was able to justify a complete re-write of the service using 'proper development methodologies' including budgeting devs, DBAs, server resources, etc..etc. with a projected year+ completion date.
My 'BS Meter' goes off, so I open up the code, maybe 5 minutes...tada...found it. The corresponding stored procedure accepts a list of product numbers and a price type (1=Retail, 2=Dealer, and so on). If you pass 0, the stored procedure returns all the prices.
Code basically looked like this..
public List<Prices> GetPrices(List<Product> products, int priceTypeId)
foreach (var item in products)
List<int> productIdsParameter = new List<int>();
List<Price> prices = dataProvider.GetPrices(productIdsParameter, 0);
foreach (var price in prices)
if (price.PriceTypeID == priceTypeId)
prices = dataProvider.GetPrices(productIdsParameter, price.PriceTypeID);
* Omitting the other 'WTF?' code to handle the zero price type
I removed the double stored procedure call, updated the method signature to only accept the list of product numbers (which it was before the 'major refactor'), deployed the service to dev (the issue was reproducible in our dev environment) and had the DBA monitor.
The two devs and the manager are grumbling and mocking the changes (they never looked, they assumed I wrote some threading monstrosity) then the DBA walks up..
DBA: "We're good. You hit the database pretty hard and the CPU never moved. Execution plans, locks, all good to go."
<dba starts to walk away>
DevMgr: "No fucking way! Putting that code in a thread wouldn't have fix it"
Me: "Um, I didn't use threads"
Dev1: "You had to. There was no way you made that code run faster without threads"
Dev2: "It runs fine in dev, but there is no way that level of threading will work in production with thousands of requests. I've got unit tests that prove our design is perfect."
Me: "I looked at what the code was doing and removed what it shouldn't be doing. That's it."
DBA: "If the database is happy with the changes, I'm happy. Good job. Get that service deployed tomorrow and lets move on"
Me: "You'll remove the recommendation for a complete re-write of the service?"
DevMgr: "Hell no! The re-write moves forward. This, whatever you did, changes nothing."
DBA: "Hell yes it does!! I've got too much on my plate already to play babysitter with you assholes. I'm done and no one on my team will waste any more time on this. Am I clear?"
Seeing the dev manager face turn red and the other two devs look completely dumbfounded was the most satisfying bug I've fixed.5
Titled my presentation "High Availability Setup", after a moment of thought, I changed it to "High Availability Architecture".
There, I will sound a bit more intelligent when I read it out loud on Monday. 😎😂3
Just now I was reading on https://pve.proxmox.com/wiki/... about high availability. Now my Proxmox VE is just a tower (which happens to have ECC memory) that's stored in my storage room (and which is mostly used for experimental and home server purposes). But my mail servers.. those have been made with high availability in mind. Most importantly, I've made their services entirely redundant (but within the same datacenter). And when they have updates, I apply updates to one, reboot, see if it didn't break something and then do the same to the other server after the first one came up again. So no downtime whatsoever.
If memory serves me right, I think that I've been able to maintain these servers for the last year without any downtime at all (I reboot them every month to apply new kernels but they haven't both been simultaneously down at any moment). Does that make them High Availability? My interventions regarding their availability have been rather trivial. Is it really that hard..?4
Monday morning, we were told by our teacher that we had one week to create a clustered system with virtual machines , handled with 2 hypervisors, and the whole thing must come with high availability
These are the kind of stuff that make me doubt about becoming devops later, 3 days in and I'm only starting to get what we're doing, but I'm such a massive dead weight for the rest of my group 😵😵
I had a discussion with my colleagues about my bachelor thesis.
Together we created within the last 18 month a REST-API where we use LDAP/LMDB as database (tree structured storage). Of course our data is relational and of course we have a high redundancy there. It's a 170 call API and I highly doubt that it's actually conforming REST.
Ensuring DB integrity is done in the backend and coding style there is "If we change it at one place, let's make sure to also change it everywhere else", so you get a good impression how much of spaghetti code we have there.
Now I proposed to code a solution in my bachelor thesis where we use a relational database (we even have an administrated Oracle DB with high availability) and have a write-only layer to also store the data in LDAP but my colleagues said that "it would add too much complexity to the system".
Instead I should write the relational layer myself and fetch the data somehow from the existing LDAP tree.
What the actual fuck, spaghetti code is what makes the system really unnecessarily complex so that no one will understand that code in 2 years.
Congratulations, you just created legacy code that went into production in 2018 while not accepting the opportunity to let that legacy code get eliminated.
Now good luck with running and maintaining that system and it's inconsistencies.1
Got my ActiveMQ-Zookeeper Replicated LevelDB setup finished! All provisioned with Ansible. so happy :) needed to share. anyone else like setting up high availability stuff?9
-i won't follow logging practices
-i won't follow secure coding
-i won't leverage profiling n monitoring tools
-i won't reuse best practices
-i won't listen to thought leaders
-i will outsource writing UT
-i will outsource code quality checks
-i will outsource all testing
-i will ignore n overide CTO team
But I still want high stability, security n 4 9s availability. Just want it done. My team is best. Am a fast-track leadership program leader who never has or ever needs to cod. I just know ...
People I have to deal with every sprint. Site reliability is not easy ...
Teaching good code makes great products to morons, toughest ...
"Beginners mind needed"2
I've been working for over a year now in this remote job as a sysadmin for a local client. I personally find this job quite intimidating at first with all of the infrastructure and all of its many microservices running in high availability set up. I enjoyed learning everything about them and why it's been set up this way, which gives me ideas if I were to build my own app (not competing with my current employer, of course).
But now I don't feel comfortable managing this beast in its many environments.
From time to time, I would hear from my old colleagues at my old sucky company for help in their work and that they know I'm an expert in. I help and it makes me feel good.
Now I'm at a career dilemma. I don't want to lose my current job because I feel "uncomfortable" with managing and administrating the tech holding the whole infrastructure. And I don't wanna go back to my old job with the sucky pay and the feel of being unchallenged. And if I try to find another job, I might be as lucky as I do now, especially good difficult it is for me to find a remote job to begin with.
Objectively, I just need to clear off my debts (at this rate, in 4 years), and have a side income to support my family. But I don't think I can follow through on that plan. Should I look for a new job or do better with the current job that I have now?3
What is the best alternative to cronjobs, guaranteeing high availability and jobs not being duplicated?6