--- GitHub 24-hour outage post mortem ---
As many of you will remember, Github fell over earlier this month and cracked its head on the countertop on the way down. For more or less a full 24 hours the repo-wrangling behemoth presented inconsistent data to users, responded slowly and failed requests during common user actions such as reporting issues and questioning your career choice in code reviews.
A post-mortem of the incident (link at the end of the article) revealed that DB replication was the root cause of the chaos, triggered when a failing 100G network link was replaced during routine maintenance. I don't pretend to be a rockstar-ninja-wizard DBA, but after speaking with colleagues who went a shade whiter when the term "replication" was used, it's clear that it's hard to predict where a design decision will bite back and leave you untangling the web of lies and misinformation reported by the databases for weeks, if not months, after everything's gone a tad sideways.
When the link was yanked out of the east coast DC undergoing maintenance, Github's "Orchestrator" software did exactly what it was meant to do: it hit the "ohshi" button and failed over to another DC that wasn't reporting any issues. The hitch in the master plan was that when connectivity came back up at the east coast DC, Orchestrator was unable to (un)fail-over back, because each cluster now contained data the other didn't have.
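For the DBA-curious, "data the other didn't have" is exactly the kind of thing you'd spot by comparing executed GTID sets on both sides - a minimal sketch, assuming MySQL with GTIDs enabled (hostnames here are made up, not Github's):

# grab the executed GTID set from each "master"
east=$(mysql -h db-east.example -N -e "SELECT @@GLOBAL.gtid_executed;")
west=$(mysql -h db-west.example -N -e "SELECT @@GLOBAL.gtid_executed;")

# GTID_SUBSET(a, b) returns 1 when every transaction in a is also in b
mysql -N -e "SELECT GTID_SUBSET('$east', '$west'), GTID_SUBSET('$west', '$east');"
# 1 1 = clusters identical; 1 0 or 0 1 = one side is simply behind;
# 0 0 = both sides have writes the other lacks - hello, split brain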
At this point it's reasonable to assume that pants were turning funny colours - monitoring systems across the board started squealing, firing off messages to engineers demanding they rouse from the land of nod and snap back to a reality that was a bit more "on-fire" than usual. A quick call to Orchestrator's API returned a result set that only contained database servers from the west coast - none of the east coast servers had responded.
Come 11pm UTC (about 10 minutes after the initial pant re-colouring), engineers realised they were well and truly backed into a corner; the site was flipped into "Yellow" status and internal mechanisms for deployments were locked out. Five minutes later an Incident Co-ordinator was dragged from their lair by the status change and almost immediately flipped the site into "Red" status, a move I can only hope was accompanied by all the lights going red and klaxons sounding.
Even more engineers were roused from their slumber to help with the recovery effort. By this point hair was turning grey in real time - the fail-over DB cluster had been processing user data for nearly 40 minutes, and every second that passed made the inevitable untangling exponentially more difficult. Not long after this, Github made the call to pause webhooks and Github Pages builds in an attempt to prevent further data loss, causing disruption to those of us using Github as a way of kicking off our deployment processes (myself included - I had to SSH in and run a git pull myself like some kind of savage).
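(For reference, the savage workaround is roughly what the webhook would have fired off anyway - host and path here are hypothetical:)

# do the webhook's job by hand while webhooks/Pages builds are paused
ssh deploy@prod-web-01 'cd /var/www/app && git pull --ff-only'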
Glossing over several more "And then things were still broken" sections of the post mortem: clever engineers with their heads screwed on the right way successfully executed what I can only imagine was a large, complex and risky plan to untangle the mess and restore functionality. Github was picked up off the kitchen floor and promptly placed in a comfy chair with a sweet tea to recover. The enormous backlog of webhooks and Pages builds was caught up with, and everything was more or less back to normal.
It goes to show that even the best laid plan rarely survives first contact with the enemy - in this case, a failing 100G network link somewhere inside an east coast data center.
Link to the post mortem: https://blog.github.com/2018-10-30-...
-
One of our newly-joined junior sysadmins left a pre-production server SSH session open. Being the responsible senior (pun intended), and wanting to teach them the value of securing production (or near-production, for that matter) systems, I typed sudo rm --recursive --no-preserve-root --force / into the terminal session (I didn't hit the Enter / Return key) and left it there. The person took longer to return and the screen went to sleep. I went back to my desk and took a backup image of the machine, just in case the unexpected happened.
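(A backup image like that doesn't need anything fancy - a minimal sketch with made-up host and device names; on a VM you'd just snapshot instead:)

# stream a raw image of the disk to my workstation before the "lesson"
# (imaging a live disk isn't perfectly consistent, but fine for a throwaway pre-prod box)
ssh root@preprod-01 "dd if=/dev/sda bs=4M status=progress" > preprod-01.img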
On returning from wherever they had gone, the person hit Enter / Return to wake the system (they didn't even have a password-on-wake policy set up on the machine). The SSH session was still there, so the machine accepted the command and started working. The person didn't even look at the session and just navigated away elsewhere (probably to get back to the script they were working on).
Five minutes pass by, and I get the first monitoring alert saying the server is not responding. I hoped that this person would be responsible enough to check the monitoring alerts, since they had an SSH session on the machine.
Seven minutes: other services that depend on the machine start complaining that the instance is unreachable.
I assign the monitoring alert to the person of the day. They come running to me saying that they can't reach the instance even though it's listed in the inventory. I ask them to show me the specific terminal that ran the rm -rf command. They get the beautiful realization of the day. They freak the hell out, to the point that they ask me, "Am I fired?". I reply, "You should probably ask your manager".
Lesson learnt the hard way. I gave them a good understanding of what happened and explained the implications - what would have happened had this exact scenario played out outside the office, handing access to an outsider. I explained why people in _our_ domain should care about security above all else.
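Part of the fix for unattended sessions is making sure they can't live forever. A sketch of two standard knobs (values are examples):

# auto-logout idle shells after 5 minutes, system-wide
echo 'readonly TMOUT=300' | sudo tee /etc/profile.d/idle-timeout.sh

# and in /etc/ssh/sshd_config, have sshd drop unresponsive clients:
#   ClientAliveInterval 300
#   ClientAliveCountMax 0
sudo systemctl reload sshd   # service name varies by distro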
There was a good 30+ minutes of downtime on the instance before I admitted that I had a backup and restored it (after the whole lecture). It wasn't critical, since the environment was not user-facing and didn't hold any critical data.
Since then we've been at this together - warning engineers when they leave their machines open, and running security lectures / sessions / workshops for new recruits (anyone who joins engineering).
-
Newer dev here. Just recently started in a position as a developer. I'm tasked with consolidating our monitoring systems into one cohesive display. After lumping together all the indexes and helping build a custom API, I'm now working on the front end. Front end is easy, I've done it before - should be no problem. I was wrong. I spent a whole day fiddling with a hand-rolled dynamic table in React and the CSS to format it. Today I stumbled upon the react-table component and got the results I was looking for in less than 2 hours. I'm convinced this was a lesson better learned early on.
-
A loooong time ago...
I started my first serious job as a developer. I was young and enthusiastic, and a bit of a greenhorn. First time working in a business, on a team full of experienced, full-blown ultra-seniors who were waiting to teach me everything about software engineering.
Kind of.
Besides the one senior, who was also the team lead, there were two other devs. One of them was very experienced and a pretty nice guy; I could ask him anything at any time and he would sit down with me and give me advice. I learned a lot from him.
Fast forward three months (yes, three months).
I was no longer a complete greenhorn, and people started giving me serious tasks. I had some experience with deployments and the like from my previous job as a sysadmin, so I was soon known as the "deployment guy" - setting up deployments for our projects the right way, monitoring them and executing them. But, as it should be in every good team, we had to share our knowledge, so that one of us could be on vacation or something and another colleague would be able to do the task as well.
So now we come to the other teammate. The one I haven't talked about till now. And that for a reason.
He was very nice too and had a couple of years as a dev on his CV, but...yeah...like...
When I switched some production systems to Linux, he had to learn something about Linux. Every time he encountered an error message he turned around and asked me how to fix it. Even. For. The. Simplest. Error. He. Could. Google. Up.
I mean, okay, when one's new to a system it's not that easy, but when you get an error message which prints out THE SOLUTION FOR THE ERROR and he still asks me how to fix it... excuse me?
This happened over 30 times.
A. Week.
Later on I had to introduce him to the deployment workflow for a project, so he could eventually deploy the staging and production environments by himself.
I introduced him. Not for 10 minutes. I explained the whole workflow and the main techniques and tools used, for like two hours. Every now and then I stopped and asked him if he had any questions. He hadn't! Wonderful!
Haha. Oh no.
So he had to do his first production deployment. I sat by his side to monitor everything. He did well. One or two questions but he did well.
The same when he did his second prod deploy. Everything's fine.
And then. It. Frikkin. Begins.
I was working on the project, did some changes to the code. Okay, deploy it to dev, time for testing.
Hm.
Error checking out git. Okay, awkward. Got to investigate...
Some files on the dev server had been changed. Strange - the repo was all up to date. But these changes seemed newer, because they fixed at least one bug I was working on.
This doubles the strangeness.
I went over to my colleague's desk.
I asked him about any recent changes to the codebase.
"Yeah, there was a bug you were working on right? But the ticket was open like two days so I thought I'll fix it"
What the heck, dude, this bug was not critical at all, and I had other tasks which were more important. Okay, but what about the changed files?
"Oh yeah, I could not remember the exact deployment steps (hint from the author: I wrote them down into our internal Wiki, he wrote them done by hisself when introducing him and after all it's two frikkin commands), so I uploaded them via FTP"
"Uhm... that's not how we do it buddy. We have to follow the procedure to avoid..."
"The boss said it was fine so I uploaded the changes directly to the production servers. It's so much easier via FTP and not this deployment crap, sorry to say that"
You. Did. What?
I could not resist and asked the boss about this. But it had no effect at all - the colleague's father was a long-time best-buddy-schmuddy-friend of the boss.
So in the end I sat there reverting, committing and deploying.
Yep
It's soooo much harder, this deployment crap.
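For the record, the cleanup looks something like this (paths, remote and script names here are hypothetical):

# see what the FTP cowboy actually changed on the server (dry run, checksums)
rsync -n -avc deploy@prod:/var/www/app/ ./app/

# re-apply his fix through the repo like a civilised person
git add -p
git commit -m "Re-apply hotfix through the actual deployment workflow"
git push origin master

# and the deployment crap itself - the two frikkin commands from the wiki
./deploy.sh staging
./deploy.sh production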
Years later, a long time after I quit the job and moved to another company, I got to know that the colleague is now responsible for technical project management.
Hm.
Project Management.
Karma's a bitch, right? -
"you've worked with nagios before haven't you? Can you give a presentation on it" 'sure' in the meeting: so tell us about opennms5
-
So here I work with this colleague who, at first, had a reasonable résumé. Whatever.
Time goes by and he is just doing tickets, clicking left and right - the usual grind of a shitty monitoring system, which I am working intensely on deprecating. Anyhoo.
The last few days it has become apparent that his résumé was basically a hot-air cake and that he knows basically nothing intrinsically.
As I have stated before in previous rants, "everyone was a noob once"... But this guy...
He wants to do "something with Ansible"... "Ok what do you want to do?" , I asked (and I regret to have asked).
He basically wants to write new files on the targets. Easy enough; I show him how he could do it with a playbook, an inventory and a role, just to demonstrate the entire chain.
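Roughly that chain from the shell side - a sketch with made-up inventory, group and playbook names, including the two flags that matter for the next part of this story:

# dry-run the playbook against ONE host group first
ansible-playbook -i inventory/hosts.ini write-files.yml \
  --limit app_staging --check --diff

# only when the diff looks sane, run it for real - still limited
ansible-playbook -i inventory/hosts.ini write-files.yml --limit app_staging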
This guy changes everything up, thereby breaking the host group assignment, and launches it on ALL machines...
Luckily it's a harmless file, so we dodged a bullet there.
But the real WTF is that he did it with the root account for our systems, without understanding the difference between "authentication" and "authorization"...
I am now explaining to him what the difference is and how he can check it. I give him the commands literally! (sudo -l -U <user>)
He manages to fucking open up each sudoers file in vim, mistypes something - or whatever he did in an attempt to leave vim... breaks sudo...
Now he tries to spin it in such a way that I steered him into breaking things.
"Dude you just fucking failed a copy/paste and you did absolutely fuckall without understanding what you are doing, then splurge out accusations because you did it wrong!"
FML
-
Looking to sharpen my skills and pursue a SysAdmin/DevOps career, I'm browsing online job offers to get the big picture of the required skills, and I say FUCK. It would take me a lifetime.
Azure, AWS, Google Cloud Platform.
Config management / CD tools: Ansible, Chef or Puppet
Scripting ninja with Python/Node and Shell/PowerShell.
Linux & Windows administration
Mongo, MySQL and their relatives.
Networking, troubleshooting failures in distributed systems
Familiarity with different stacks. Fuck. (Apache, nginx, etc...)
Monitoring infrastructure (Nagios, Datadog...)
CI tools: Jenkins, Maven, etc.
DB versioning: Liquibase, Flyway, etc.
FUCK FUCK FUCK.
Are they looking for Voltron? FUCK YOU FROM THE DEEPEST LEVEL OF MY DEEP FUCK.
-
Continuing to learn the k8s ecosystem, aiming to reach an acceptable level.
Trying Helm, Argo CD, and eventually even a non-managed k8s setup.
Going through books to pick up the theory of being an SRE.
And about data-intensive apps.
Learning and trying Kafka.
Learning and trying FastAPI, and diving into the async Python ecosystem in general.
Learning Go.
A few more books to improve code quality and composition.
Getting more practice with monitoring and logging systems by applying them to k8s.
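A first lap around the Helm part of that list could look like this - a sketch with a throwaway chart and made-up names:

# install something disposable and poke at the moving parts
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install demo bitnami/nginx --namespace demo --create-namespace
kubectl get pods -n demo -w        # watch the rollout happen

# then practice the day-2 verbs
helm upgrade demo bitnami/nginx --set replicaCount=3
helm rollback demo 1
-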
MQTT - all I used to know about this was its name, until a few months back a client sent us some requirements which included MQTT. I opened its specification and I was fucking shocked! I had been implementing an almost identical protocol in most of my applications (the ones needing subscription-based services) for the last 3 years. I have developed IoT apps, remote monitoring systems and HMI systems using the same fucking protocol! I had even implemented the same thing over HTTP using long polling a few years back!!
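(If you want to poke at MQTT without writing a line of code, the mosquitto CLI clients demonstrate the whole subscribe/publish dance - broker host and topics below are made up:)

# terminal 1: subscribe with a single-level wildcard, print topic + payload
mosquitto_sub -h broker.example -t 'plant/+/temperature' -v

# terminal 2: publish a reading; the subscriber sees it instantly
mosquitto_pub -h broker.example -t 'plant/rig1/temperature' -m '22.5'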
Now I feel like open sourcing my protocol. But I don't know where to start. Any help please?
-
So I'm building this environmental monitoring system for one of the labs, to monitor temperature and humidity. The "software" that comes as part of the package with these sensors is really just a website you host yourself if you don't choose the cloud option. No big deal really (see my previous rant about getting Windows Server through SSC); I set up IIS, get the "software" registered and a couple of sensors running. Looks good.

However, I don't like the error messages that pop up because the site is unsecured. I do some reading and find out that most browsers will give you a warning if you're not using HTTPS, even if it's for internal use only. OK, well, how hard can it be to implement encryption? Turns out it's not that hard, and you can do it for free with letsencrypt and other places. I like free. Now I have to use SSH to get into the server and run an ACME client. Hey, OpenSSH is part of Windows now, cool. I download an ACME client, SSH into the server and... nope, doesn't work. Oh right, I'm behind a corporate firewall and a bunch of other shit I can't control.

Why is it so damn arduous to set up this goddamn internal website, when the problems aren't even the site? Now I'm playing with AWS, spinning up an instance to try and get an SSL certificate, just so I don't have to tell people it's OK to trust this site and ignore the big angry warning.
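(For anyone stuck behind the same kind of firewall: Let's Encrypt's DNS-01 challenge needs no inbound connection at all - a sketch with a made-up hostname, assuming the internal box sits under a domain you publicly own:)

# prove ownership via a DNS TXT record instead of an inbound HTTP request
certbot certonly --manual --preferred-challenges dns -d lab-monitor.example.com
# certbot prints the TXT record to create; add it at your DNS host, confirm, done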
Best part is, other similar internal sites don't use SSL and all have big messages about someone stealing your soul if you go there - and these are commercial systems that run all the HVAC for all the campuses across Canada.
I need more Tylenol. -
9 Ways to Improve Your Website in 2020
Online customers are very picky these days; plenty of quality sites and services have spoiled them. Without leaving their homes, they can carefully probe your company and only then decide whether to deal with you or not. The first thing customers will look at is your website, so everything should be ideal there.
Not everyone succeeds in doing things perfectly well on the first try. For websites, this is particularly true. Besides, it is never too late to improve something and make it even better.
In this article, you will find the best recommendations on how to get a great website and win the hearts of online visitors.
Take care of security
It is unacceptable if customers who are looking for information or a product on your site find themselves infected with malware. Take measures to protect your site and visitors from new viruses, data breaches, and spam.
Take care of the SSL certificate. It should be monitored and renewed when necessary.
Be sure to install all security updates for your CMS. A lot of sites get hacked through vulnerable plugins. Try to reduce their number and update regularly too.
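Monitoring the certificate can be as small as a cron-able one-liner - a sketch with a placeholder domain:

# print the certificate's expiry date for a quick check or a cron alert
echo | openssl s_client -servername example.com -connect example.com:443 2>/dev/null \
  | openssl x509 -noout -enddate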
Ride it quick
Webpage loading speed is what the visitor will notice right from the start. The war for milliseconds has begun. Speeding up a site is not so difficult. The first thing you can do is apply the old, proven trick of image compression. If that is not enough, work on caching or simplify your JavaScript and CSS. Using a CDN is another good option.
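Image compression alone can be scripted in a few lines - a sketch using common open-source tools (assuming they are installed; paths are examples):

# recompress existing assets in place
jpegoptim --max=85 --strip-all img/*.jpg
optipng -o2 img/*.png

# and generate a lighter WebP variant for modern browsers
cwebp -q 80 img/hero.png -o img/hero.webp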
Choose a quality hosting provider
In many respects, both the security and the speed of the website depend on your hosting provider. Take your time selecting one - other users share their experience with different providers on numerous discussion boards.
Content is king
Content is everything for the site: its blood, heart, brain, and soul. It should be useful, interesting and concise. Selling copy is fine, but do not chase clicks alone. An interesting article or a useful instruction will increase customer loyalty, even if that content contains no call to action.
Communication
Broadcasting should not be one-way. Make a convenient feedback form where your visitors do not have to fill out a million fields before sending a message. Do not forget about the phone, and even better, add online chat with a chatbot and/or live support reps.
Refrain from unpleasant surprises
Please mind: self-starting videos, especially with sound, may irritate a lot of visitors and increase the bounce rate. The same is true of popups and sliders.
Next, do not be afraid of white space. Often site owners are literally obsessed with the desire to fill all the free space on the page with menus, banners and other stuff. Experiments with colors and fonts are rarely justified. Successful designs are usually brilliantly simple: white background + black text.
Mobile first
With such a dynamic pace of life, it is important to always keep up with trends, and the future belongs to mobile devices. We have already passed that line: mobile devices generate more traffic than desktop computers. This tendency will only increase, so adapt the layout with the mobile-first and progressive enhancement concepts in mind.
Site navigation
Your visitors should be your priority. Build navigation from human-oriented terms and concepts instead of search-engine-oriented phrases.
Do not let your visitors get stuck on your site. Always provide access to other pages, and be sure to mention which particular page will be opened, so that visitors understand exactly where they are going and why.
Technical audit
The site can be compared to a house - you always need to monitor the performance of all its systems, and there is always something to fix or improve. Therefore, a technical audit of any project should be carried out regularly. It is always better if you are the first to notice a problem, not your visitors or the search engines.
As part of the audit, an analysis is carried out on such items as:
● Checking robots.txt / sitemap.xml files
● Checking duplicates and technical pages
● Checking the use of canonical URLs
● Monitoring 404 error page and redirects
There are many tools that help you monitor your website performance and run regular audits.
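Even without dedicated tools, a few curl probes cover the basics from the list above - a sketch with a placeholder domain:

# robots.txt and sitemap.xml should return 200; a garbage URL must be a real 404
for path in robots.txt sitemap.xml no-such-page-12345; do
  printf '%-22s -> ' "$path"
  curl -s -o /dev/null -w '%{http_code}\n' "https://example.com/$path"
done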
Conclusion
I hope these tips will help your site become even better. If you have questions or want to share useful lifehacks, feel free to comment below.