Search - "db-logs"
-
Hey, Root? How do you test your slow query ticket, again? I didn't bother reading the giant green "Testing notes:" box on the ticket. Yeah, could you explain it while I don't bother to listen and talk over you? Thanks.
And later:
Hey Root. I'm the DBA. Could you explain exactly what you're doing in this ticket, because I can't understand it. What are these new columns? Where is the new query? What are you doing? And why? Oh, the ticket? Yeah, I didn't bother to read it. There was too much text filled with things like implementation details, query optimization findings, overall benchmarking results, the purpose of the new columns, and I just couldn't care enough to read any of that. Yeah, I also don't know how to find the query it's running now. Yep, I have complete access to the console and DB and query log. Still can't figure it out.
And later:
Hey Root. We pulled your urgent fix ticket from the release. You know, the one that SysOps and Data and even execs have been demanding? The one you finished three months ago? Yep, the problem is still taking down production every week or so, but we just can't verify that your fix is good enough. Even though the changes are pretty minimal, you've said it's 8x faster, and you provided benchmark findings, we just ... don't know how to get the query it's running out of the code. Or how to check the query logs to find it. So... we just don't know if it's good enough.
Also, we goofed up when deploying and the testing database is gone, so now we can't test it since there are no records. Never mind that you provided snippets to remedy exactly this scenario in the ticket description you wrote three months ago.
And later:
Hey Root: Why did you take so long on this ticket? It has sat for so long now that someone else filed a ticket for it, with investigation findings. You know it's bringing down production, and it's kind of urgent. Maybe you should have prioritized it more, or written up better notes. You really need to communicate better. This is why we can't trust you to get things out.
*twitchy smile*

tags: rant, useless people, you suck because we are incompetent, what's a query log?, it's all your fault, this is super urgent, let's defer it, ticket notes too long; didn't read -
So, some time ago, I was working for a complete puckered anus of a cosmetics company on their ecommerce product. Won't name names, but they're shitty and known for MLM. If you're clever, go you ;)
Anyways, over the course of years they brought in a competent firm to implement their service layer. I'd even worked with them in the past, and it was designed to handle a frankly ridiculous-scale load. After they got the 1.0 released, the manager was replaced with some absolutely talentless, chauvinist cuntrag from a phone company that is well known for having 99% Indian devs and for not being able to hear you now. He of course brought in his number two, worked on making life miserable and running everyone on the team off; inside of a year the entire team was ex-said-phone-company.
Watching the decay of this product was a sheer joy. They cratered the database numerous times during peak-load periods, caused $20M in redis-cluster cost overrun, ended up submitting hundreds of erroneous and duplicate orders, and mailed almost $40K worth of product to a random guy in Outer Mongolia who is, we can only hope, now enjoying his new life as an Instagram influencer. They even terminally broke the automatic metadata, and hired THIRTY PEOPLE to sit there and do nothing but edit swagger. And it was still both wrong and unusable.
Over the course of two years, I ended up rewriting large portions of their infra surrounding the centralized service cancer to do things like, "implement security," as well as cut memory usage and runtimes down by quite literally 100x in the worst cases.
It was during this time I discovered a rather critical flaw. This is the story of what, how and how can you fucking even be that stupid. The issue relates to users and their reports and their ability to order.
I first found this issue looking at some erroneous data for a low value order and went, "There's no fucking way, they're fucking stupid, but this is borderline criminal." It was easy to miss, but someone in a top down reporting chain had submitted an order for someone else in a different org. Shouldn't be possible, but here was that order staring me in the face.
So I set to work seeing if we'd pwned ourselves as an org. I spend a few hours poring over logs from the log service and dynatrace trying to recreate what happened. I first tested to see if I could get a user, not something that was usually done because auth identity was pervasive. I discover that user ids are INCREMENTAL int values, the same ids used in the database, when requesting from the API, so naturally I have a full list of users and their titles and relative positions, as well as reports and descendants, in about 10 minutes.
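For a sense of how little effort that takes, here's a hypothetical sketch (the endpoint, field names and token are all invented; the real point is that sequential int ids make the entire user table walkable with one loop):

```python
# Hypothetical enumeration sketch - endpoint and fields are made up.
import requests

BASE = "https://api.example.com"  # placeholder for the real API host
session = requests.Session()
session.headers["Authorization"] = "Bearer <any-low-privilege-token>"

org_chart = []
for user_id in range(1, 5000):          # ids are just 1, 2, 3, ...
    resp = session.get(f"{BASE}/users/{user_id}")
    if resp.ok:
        u = resp.json()
        org_chart.append((user_id, u.get("title"), u.get("managerId")))
# ~10 minutes later: every user, title and reporting line in one list.
```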
I try the happy path of setting values for random, known payment methods and org structures similar to the impossible order, and submitting as a normal user, no dice. Several more tries and I'm confident this isn't the vector.
Exhausting that option, I look at the protocol for a type of order in the system that allowed higher-level people to impersonate people below them and use their own payment info for descendant reports' orders. I see that all of the data for this transaction is stored in a cookie. A few tests later, I discover the UI has no forgery checks, hashing, etc., and just fucking trusts whatever is present in that cookie.
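For contrast, the check that should have been there is a handful of lines. A minimal sketch of HMAC-signing the cookie payload so the client can't quietly edit who it's impersonating (stdlib only; the names are mine, not theirs):

```python
import hashlib, hmac, json

SECRET = b"server-side-secret"  # hypothetical key, never sent to the client

def sign_cookie(payload: dict) -> str:
    body = json.dumps(payload, sort_keys=True)
    mac = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}|{mac}"

def verify_cookie(cookie: str):
    body, _, mac = cookie.rpartition("|")
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    # tampered payload -> MAC mismatch -> request rejected
    return json.loads(body) if hmac.compare_digest(mac, expected) else None
```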
An hour of tweaking later, I'm impersonating a director as a bottom-rung employee. Score. So I fill a cart with a bunch of test items and proceed to checkout. There, in all their glory, are the director's payment options. I select one and am presented with:
"please reenter card number to validate."
Bupkiss. Dead end.
OR SO YOU WOULD THINK.
One unimportant detail I noticed during my log investigations, and the shit-slinging GUI monkeys who butchered the system didn't, was that on a failed attempt to submit payment to the DB, the logs were filled with messages like:
"Failed to submit order for [userid] with credit card id [id], number [FULL CREDIT CARD NUMBER]"
One submit click later and the user's credit card number drops into lnav like a gacha prize. I dutifully rerun the checkout and got an email send notification in the logs for successful transfer to fulfillment. Order placed. Some continued experimentation later and the truth is evident:
With an authenticated user of any privilege, you could place any order, as anyone, using anyone's payment methods, and have it sent anywhere.
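(And for the record, scrubbing card numbers out of log lines is trivial. A sketch with plain Python logging; the regex is a crude card-number shape for illustration, not PCI advice:)

```python
import logging, re

PAN = re.compile(r"\b(\d{6})\d{6,9}(\d{4})\b")  # crude 16-19 digit shape

class MaskCardNumbers(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        # rewrite the message in place before it reaches any handler
        record.msg = PAN.sub(r"\1******\2", str(record.msg))
        return True

log = logging.getLogger("orders")
log.addFilter(MaskCardNumbers())
# "... number 4111111111111111" now logs as "... number 411111******1111"
```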
So naturally, I pack up the crucifixion-worthy body of evidence and walk it into the IT director's office. I show him the defect, and he turns sheet fucking white. He knows there's no recovering from it, and there's no way his shitstick service team can handle fixing it. Somewhere in his tiny little grinchly manager's heart he knew they'd caused it, and he was to blame for being a shit captain to the SS Failboat. He replies quietly, "You will never speak of this to anyone; fix this discreetly." Straight up Hitler's bunker meme rage. -
I swear to god, I'm going to track down the dipshit who just made my day hilariously painful.
So here I am, finishing up this project that's been going on for what feels like an eternity, when I get an email "why doesn't order X show up in this other system?".
I mean, it's a common thing, they can take 15 minutes to push across, so the usual quick glance, and what do you know, it's just sitting there as if it's waiting to be pushed through. Then an hour later... it's still there, so I start digging. Maybe a data issue? Nope, looks all good: customer details, payment details, products...
just another order, jump on the logs and all looks fi......... wait.... why does this postcode have 3 digits and not 4 , Australia has 4 digit postal codes fyi, looks at order again, 3 digits, look at log, 3....hold on why's it only 3 digits, checks code, handled as string... ok..... where the fuck would it drop a digit.... frontend requires 4 digits, validation requires 4 digits... how the fuck did you get 3 digits in... I can't see anything anywhere that logically makes sense for this🤔
Drops address into google and it's a postcode starting with 0.
Jumps on the DB and the fucker is an int in the postcode table. For all of you playing at home, 0123 <> 123.
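For everyone who has never been bitten by this, the whole bug and the fix fit in a few lines (Python used for illustration; the real code was elsewhere, and the column should simply have been text):

```python
# The bug: an int column cannot represent a leading zero.
assert int("0123") == 123            # "0123" -> 123, one digit silently gone

# The fix (sketch): store postcodes as strings, validate the shape instead.
import re

def valid_au_postcode(code: str) -> bool:
    return re.fullmatch(r"\d{4}", code) is not None

assert valid_au_postcode("0800")     # Darwin: the leading zero survives
assert not valid_au_postcode("800")  # what the int column turned it into
```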
I don't know if I should feel bad, or impressed: it's been 7 years since this table was created, and it took 7 years before someone living in one of those leading-zero parts of the country placed an order.
QA didn't spot this years ago,
No one tested this exact scenario,
The damn thing isn't even documented as a required delivery area, but here we are!
Kudos good sir, you broke it! 🤜 🤛
You sir may get your order now!

tags: rant, cover every possibility, always suspect the unexpected, my problem now!, not my fault 😅, data, how dafuq was that even missed -
One Thursday noon,
Operations manager: (looking at his phone) What the..... something is wrong, I'm getting a bunch of emails about orders getting confirmed.
Colleague dev: (checks the main inbox that receives copies of all email sent/received) Holy shit, all of our clients are getting confirmation emails for orders that were already cancelled/incomplete.
Me: immediately contacting Bluehost support, asking them to take the server down just so we can stop it; 600+ emails were already sent and people kept getting them.
*calls head of IT* explaining the situation because he's not in the office atm.
CEO: wtf is happening with my business, is it a hacker?
*So we have an intrusion: somebody messed with the site using a script or something.*
All of us (devs) sit on the code looking for vulnerabilities, trying to figure out how somebody was able to do that.
*After an hour*
So we have gone through almost each function written in the code which could possibly cause that, but we're unable to find anything which could break it.
Head asks ops: when did you actually start getting them?
Ops: right after 12 pm.
*another hour passes*
Head: (checking the logs) so right after the last commit, the site got updated too?. And.... and..... wtf, what the hell, who wrote this shit in the last commit?
*This fucking query is missing the damn WHERE clause* 🤬
Me: me 😰
*long pause, everyone looking at me and I couldn't look at anyone*
The shame... how could I have done that?
Head: so it's you, not an intruder 😡
Investigating further: what the holy mother of #_/&;=568, why doesn't the cronjob check how old the order is?! Why why why.
(So basically this is what happened: because of that query, all cancelled/incomplete orders got updated, damage done already. The cronjob made it worse by running on all of them and sending clients emails, and that function updated some other values too. In short, the whole DB was fucked up.)
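For anyone who hasn't lived this: the whole difference between the commit that nuked the DB and the one that should have shipped is one clause, plus an age guard in the cronjob. A sketch, with table and column names invented:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, status TEXT, created_at TEXT)")

# What shipped (sketch): no WHERE, so EVERY order gets confirmed.
db.execute("UPDATE orders SET status = 'confirmed'")

# What was meant: scope the update to the one order being processed.
order_id = 42  # hypothetical
db.execute("UPDATE orders SET status = 'confirmed' WHERE id = ?", (order_id,))

# And the cronjob's missing guard: never touch orders older than a day.
recent = db.execute(
    "SELECT id FROM orders WHERE status = 'confirmed' "
    "AND created_at >= datetime('now', '-1 day')"
).fetchall()
```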
and now they know who did it as well.
*Head, after some time cooling down, asked me for a solution to the mess I created*
Me: I took a backup just a couple of days ago; I can restore that with a script and do the recent 2 days manually. (The operations manager was already calling people and apologising on our behalf.)
Head: okay do it now.
Me: *in panic* wrote a script to restore the records (checking what I wrote 100000000 times now), ran... tested... all working... restored the data.
After that I wrote an apology email, because the staff had to work a lot and everything became so hectic just because of me.
*At the end of the day the CEO, head and staff accepted the apology and asked me to be careful next time. It actually taught me a lesson, and I always, always try to be more careful now, especially with queries. People are really good here, so that's how it goes* 🙂 -
The website for our biggest client went down and the server went haywire. For this client we don't provide any infrastructure, though, so we called their IT partner to start figuring this out.
They started blaming us, asking us if we had upgraded the website or changed any PHP settings, which all got a firm no from us. So they told us they had competent people working on the matter.
TL;DR: their people aren't competent and I ended up fixing the issue.
Hours go by, nothing happens. The client calls us and we call the IT partner: nothing, they don't understand anything. They told us they can't find any logs etc.
So we set up a conference call with our CXO, me, another dev and a few people from the IT partner.
At this point I’m just asking them if they’ve looked at this and this, no good answer, I fetch a long ethernet cable from my desk, pull it to the CXO’s office and hook up my laptop to start looking into things myself.
IT partner still can’t find anything wrong. I tail the httpd error log and see thousands upon thousands of warning messages about mysql being loaded twice, but that’s not the issue here.
Check top and see there are 257 instances of httpd, of which 256 were spawned by httpd; mysql is using 600% CPU, and whenever I try to connect to mysql through the CLI it throws a "too many connections" error.
I heard the IT partner talking about a DDoS attack, so I asked them to pull it off the public network and only give us access through our VPN. They do that, reboot the server, same problems.
Finally we get the IT partner to roll back the VM to earlier last night. Everything works great; 30 min later, it crashes again. At this point I'm getting tired and frustrated. This isn't my job; I thought they had competent people working on this.
I noticed that the db had a few corrupted tables, and asked the IT partner to get a DBA to look at it. To no avail.
5 o'clock is here, and we decide to give the VM rollback another try, but first we go home, get some dinner and resume at 6pm. I had told them I wanted to be in on this call, and said let me try this time.
They spend ages doing the rollback, and then for some reason they have to reconfigure the network and shit. Once it booted, I told their tech to stop mysqld and httpd immediately and prevent them from starting at boot.
I can now look at the logs that led to this issue. I noticed our debug flag was on and had generated a 30GB log file. Tail it and see it's what I'd expect: warnings upon warnings. And all the other logs for mysql and apache are huge too, so the drive is full. Just gotta delete them.
I quietly start apache and mysql, see the website is working fine, shut it down and take a copy of the /var/lib/mysql and /etc directories, just to have backups.
Starting to connect a few dots, but I wasn't exactly sure I was right. Had the full drive caused mysql to corrupt itself? Only one way to find out: start apache and mysql back up, and just wait and see. Meanwhile I fixed mysql being loaded twice; some genius had put load mysql.so at both the top and bottom of php.ini.
While waiting on the server to crash again, I’m talking to the it support guy, who told me they haven’t updated anything on the server except security patches now and then, and they didn’t have anyone familiar with this setup. No shit, it’s running php 5.3 -.-
Website up and running 1.5 hours later, mission accomplished. -
TLDR - you shouldn't expect common sense from idiots who have access to databases.
I joined a startup recently. I know startups are not known for their stable architecture, but this was next level stuff.
There is one prod mongodb server.
The db has 300 collections.
200 of those 300 collections are backups/test collections.
25 collections are used to store LOGS!! They decided to store millions of logs in a nosql db because setting up a mysql server requires effort, why do that when you've already set up mongodb. Lol 😂
Each field is indexed separately in the log.
One collection is 2 TB and has more than 1 billion records.
Out of the 1 billion records, 1 million are actually needed; the rest are obsolete. And each field has an index. Apparently the asshole DBA never knew there are such things as capped collections or partial indexes.
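(Both fixes are a few lines in pymongo, for the record. A sketch with invented db/collection names; capped collections and TTL indexes are alternatives, you'd pick one:)

```python
from pymongo import MongoClient

db = MongoClient()["app"]  # hypothetical connection

# Capped collection: fixed size on disk, oldest log entries get overwritten.
db.create_collection("logs_capped", capped=True, size=50 * 1024**3)  # 50 GB cap

# Or a TTL index: documents expire instead of piling up for years.
db.logs.create_index("created_at", expireAfterSeconds=30 * 24 * 3600)

# Partial index: index only the ~1M live records, not the billion dead ones.
db.logs.create_index("order_id", partialFilterExpression={"active": True})
```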
I've been trying to get approval to clean up the db for 3 months, but fucking bureaucracy. Extremely high server costs, plus every week the db goes down because some idiot runs a query on this mammoth collection. There's one single set of credentials for everything: everyone from applications to interns uses the same creds.
And the asshole DBA left, leaving me in charge of handling this shit now. I am trying to fix it but am stuck getting approval from business management. Devs like these make me sad: zero respect for their work, and no ability to listen to people trying to improve the system.
Going to leave this place really soon. No point in working somewhere where you are expected to show up for 8 hours, irrespective of whether you even switch on your laptop.
Wish me luck folks. -
I really, really, fucking goddamn REALLY need to move a legacy project off the graveyard server and get it into git, and then build a dev environment for it, so I can stop making incredibly volatile changes directly on PROD (backend, frontend and DB all at once, then testing it while it's live and being used). But fuck me if I can be bothered digging through a 10GB code base and attempting to make it work in a multi-environment setup when it's going to be a long trip down the error logs until it works again 😱🔫
-
Today I realized that a program I deployed had been running without a DB for almost a week. I thought it was populated during the deploy.
Fortunately, nobody cared and nobody was exposed to any error logs. -
And this, ladies and gentlemen, is why you need properly tested backups!
TL;DR: a user was blocked on an old gitlab instance, which cascade-deleted every project that user was set as owner of.
So, at my customer, colleague "j" reviews gitlab users and groups, and notices a user who left the organisation.
"j" : ill block this user
> "j" blocks user
> minutes pass away, working, minding our own business
> a wild team devops leader "k" appears
k: where are all the git projects?
> waitwut?.jpg
> k: yeah, all git projects the user was owner of are deleted
> j.feeling.despair() ; me.feeling.despair();
> checks logs on server, notices it cascade-deleted all projects belonging to that user
> let me google that log line
> it's a bug report filed 3(!) years ago
> gitlab hasn't been updated in 3 years
> gitlab system owner is not present, backup contact doesn't know shit about it
> I investigate further: no daily backup cron tasks, no backup has been made whatsoever.
> only 'backups' are on file system level, trying to restore those
> gitlab requires restore of postgres db
> backup does not contain postgres since the backup product does not support that (wtf???)
> fubar.scene
> filesystem restore finished...
> backup product did not back up all files from the git tree, like none of the refs were stored, since the product cannot handle such filenames .. Git repos completely broken
Fuck my life -
Did a bunch more cowboy coding today as I call it (coding in vi on production). Gather 'round kiddies, uncle Logan's got a story fer ya…
First things first, disclaimer: I'm no sysadmin. I respect sysadmins and the work they do, but I'm the first to admit my strengths definitely lie more in writing programs rather than running servers.
Anyhow, I recently inherited someone else's codebase (the story of my profession career, but I digress) and let me tell you this thing has amateur hour written all over it. It's written in PHP and JavaScript by a self-taught programmer who apparently discovered procedural programming and decided there was nothing left to learn and stopped there (no disrespect to self-taught programmers).
I could rant for days about the various problems this codebase has, but today I have a very specific story to tell. A story about errors and logs.
And it all started when I noticed the disk space on our server was gradually decreasing.
So today I logged onto our API server (Ubuntu running Apache/PHP) and did a df -h to check the disk space, and was surprised to see that it had noticeably decreased since the last time I'd checked when everything was running smoothly. But seeing as this server does not store any persistent customer data (we have a separate db server) and purely hosts the stateless API, it should NOT be consuming disk space over time at all.
The only thing I could think of was the logs, but the logs were very quiet, just the odd benign message that was fully expected. Just to be sure I did an ls -Sh to check the size of the logs, and while some of them were a little big, nothing over a few megs. Nothing to account for gigabytes of disk space gradually disappearing.
What could it be? I wondered.
cd ../..
du . | sort --sort=numeric
What's this? 2671132 K in some log folder buried in the api source code? I cd into it and it turns out there are separate PHP log files in there, split up by customer, so that each customer of ours (we have 120) has their own respective error log! (Why??)
Armed with this newfound piece of (still rather unbelievable) evidence I perform a mad scramble to search the codebase for where this extra logging is happening and sure enough I find a custom PHP error handler that is capturing (most) errors and redirecting them to these individualized log files.
Conveniently enough, not ALL errors were being absorbed though, so I still knew the main error_log was working (and any time I explicitly error_logged it would go there, so I was none the wiser that this other error-catching was even happening).
Needless to say I removed the code as quickly as I found it, tail -f'd the error_log and to my dismay it was being absolutely flooded with syntax errors, runtime PHP exceptions, warnings galore, and all sorts of other things.
My jaw almost hit the floor. I've been with this company for 6 months and had no idea these errors were even happening!
The sad thing was how easy all the errors ended up being to fix. Most of them were "undefined index" errors that could have been completely avoided with a simple isset() check, but instead ended up throwing an exception, nullifying any code that came after it.
Anyway kids, the moral of the story is don't split up your log files. It makes absolutely no sense and can end up obscuring easily fixable bugs for half a year or more!
Happy coding. -
OMFG I don't even know where to start..
Probably should start with last week (as this is the first time I had to deal with this problem directly)..
Also please note that all packages, procedure/function names, tables etc have fictional names, so every similarity between this story and reality is just a coincidence!!
Here it goes..
Last week we implemented a new feature for the customer on production; everything was working fine. After a day or two, the customer notices the audit logs are not complete, aka missing user_id or have the wrong user_id inserted.
Hm.. ok.. I check logs (disk + database).. WTF, parameters are being sent in as they should, meaning they are there, so no idea what is with the missing ids.
OK, logs look fine, but I notice user_id have some weird values (I already memorized most frequent users and their ids). So I go check what is happening in the code, as the procedures/functions are called ok.
Wow, boy was I surprised.. many many times..
In the code, we actually check for the user in this app's db or, in case of using SSO (which we were), in the main db schema..
The user gets returned & logged ok, but that is it: used only for authentication. When sending stuff to the db to log, an old user id is used, meaning that ofc the userid was missing or wrong.
Anyhow, I fix that crap, take care of some other audit logs, so that proper user id was sent in. Test locally, cool. Works. Update customer's test servers. Works. Cool..
I still notice something off.. even though I fixed the audit_dbtable_2, audit_dbtable_1 still doesn't show proper user ids.. This was last week. I left it as is, as I had more urgent tasks waiting for me..
Anyhow, now it came the time for this fuckup to be fixed. Ok, I think to myself I can do this with a bit more hacking, but it leaves the original database and all other apps as is, so they won't break.
I create another pck for the api alone, copy the calls, add user_id as a param, and from then on I call the other standard functions as usual, just without the user_id I am now explicitly sending with every call.
Ok this might work.
I prepare package, add user_id param to the calls.. great, time to test this code and my knowledge..
I made changes for the api to include the current user id (+ log it in the disk logs + audit_dbtable_1), test it, and check the db..
Disk logs fine, debugging fine (user_id has proper value) but audit_dbtable_1 still userid = 0.
WTF?! I go check the code for where I forgot to include the user id.. noup, it's all there. OK, I go check the logging, maybe I fucked up some parameters on the db level. Nope, the user is there in the friggin description ON THE SAME FUCKING TABLE!!
Just not in the column user_id...
WTF..Ok, cig break to let me think..
I come back and check the original auditing procedure on the db.. It is usually used/called with null as the user id. OK, I have replaced those with actual user ids I sent in the procedures/functions. Recheck every call!! TWICE!! Great.. no fuckups. Let's test it again!
OFC nothing changes, value in the db is still 0. WTF?! HOW!?
So I open the auditing pck to look at the insides of that bloody procedure.. WHAT THE ACTUAL FUCK?!
Instead of logging the p_user_sth_sth that is sent to that procedure, it just inserts the variable declared in the main package..
WHAT THE ACTUAL FUCK?! Did the 'new guy' make changes to this because he couldn't figure out what was wrong?! Nope, not him. I asked the CEO if he knows anything.. Noup.. I checked all the customers' dbs (different customers).. ALL HAD THIS HARDCODED IN!!! FROM THE FREAKING YEAR 2016!!! O.o
Looks like at the begining, someone tried to implement this, but gave up mid implementation.. Decided it is enough to log current user id into BLABLA variable on some pck..
Which might have been ok 10+ years ago, but not today, not when you use connection pooling.. FFS!!
So yeah, I found easter eggs from years ago.. Almost went crazy when trying to figure out where I fucked this up. It was such a plan, simple, straight-forward solution to auditing..
If only the original procedure was working as it should.. bloddy hell!!8 -
Never mess with a motivated developer. I will make your life difficult in return.
Me: we need server logs and stats daily for analysis
DBA: to get those, you need to open a ticket
Me: can't you just give me SFTP access and permissions to query the stats from the DB?
DBA: No.
*OK.... 🤔🤔🤔*
*Writes an Excel Template file that I basically just need to copy and paste from to create a ticket*
This process should not take me more than 2mins 👍😁😋🙂😙😙😙😙😙😙😙😙
For them.... 😈😈😈😈😈😈😈😈😈😈😈 -
Working in a bank, using MIcrosoft platform:
To open my email, I need to enter my password and sms OTP.
To open my email using phone, I need to enter my password and sms OTP.
To open Teams, I need to enter my password and sms OTP.
To open Teams using phone, I need to enter my password and sms OTP.
To access Microsoft Azure, I need to enter my password and sms OTP.
To git pull/push, I need to enter my sms OTP.
To check UAT logs, I need to enter my sms OTP.
To get access to UAT DB, I need to connect to VPN, which then asks for OTP.
Did I also mention that I need to do these OTPs every single fucking day?
#OTPDrivenDevelopment -
ALWAYS read warnings guys.
Story time !
A client of ours has a synchronization app (we wrote it) between their in-house DB and our app. (No, no APIs on their end. It's a scheduled task.)
Because we didn't want to ask them for logs every single time, the app writes logs to disk (normal) and to Application Insights in Azure.
When needed, I can go into the portal and get all the logs for the last execution in a nice CSV file.
Well, recently we added more logs (some problems were impossible to track otherwise).
So the client calls us: "problem with XXX"
Me : Goes to Azure, does the same manipulation as always. Dismiss a smaaaaaalish warning without reading. Study logs. Conclusion: “The XXX is not even in the logs, check your DB”.
Little I knew, the warning was telling me “Results are truncated at 10.000 lines”.
So client was right, I was wrong and I needed to develop a small app to get logs with more than 10.000 lines. (It’s per execution. Every 3 hours) -
Not the worst, but probably the only one I can sort of explain & not get into trouble for NDA breach..
Umm.. here it goes.. a wrong id was returned from a db procedure; he tried to do something on the db with that id and got an exception that the id doesn't exist. Instead of checking why the procedure returns a nonexistent id, he just wrapped everything in a try/catch without any logging.. & of course, didn't tell anyone about this.. o.0
I know, I know, code review could have prevented this, but holy fuck..
Guy's CV had more experience than I have now, so at the time I didn't think I'd have to check every line of code he wrote, especially not for shit like this. -
Sometimes I don't know if my co-worker is that stupid or...
Well, he came to me with a strange problem with mongoose.
I looked at the error message. And guess what: the database was not reachable. Asked him, did you check the mongodb service? No. Of course the service was not running. Told him to restart it. Then he restarted Robo 3T, not the service itself. Major facepalm. He then asked me if I knew why his service was not running. Do I look like some kind of wizard? Told him to check the logs. Long story short, his drive had run out of space.... -
So, we (I'm the backend guy, working with a UI dev) are building this product portfolio management tool for our client, and they have a set of 250 users. The team has two points of contact for the 250 users, who maintain the master data and help users with data quality, tool guidance, reporting and other stuff. So one day one of these two support users comes to me and says: Hey, I'm not able to add new transactions coz a customer is missing.
We have the provision to create / maintain customers.
I check the production DB, application code, try creating the customer and then the transaction, everything works perfectly fine.
I ask the user for a screen sharing session, the user starts reproducing the error like this :
We have a 3 system landscape - Dev / Test and Prod
U : Logs into the test system url, creates the customer.
U : Points out the toast saying customer creation is successful.
U : opens a new tab, opens the production system, tries creating the transaction, searches for the customer and says "see!! can't find the customer here! the master data management apps never work!!"
FML. -
Yesterday was a horrible day...
First of all, as we are short a few devs, I was assigned production bugs... A few applications from the mobile app were getting fucked up: all fields in the db were empty, no customer name, email, mobile number, etc.
I started investigating: took a dump from the db, analyzed the created_at timestamps. Installed the app, tried to reproduce the bug, everything worked. Tried API calls from Postman, again everything worked. There were no error emails either.
So I asked for the server access logs; devops took 4 hrs just to give me the log. Went through 4 million lines and found 500 errors on the mobile APIs. Went to the file: no error handling in place.
So I have a bug to fix that occurs in 1 of every 100 cases, no stack trace, no idea what is failing. Fuck my job.
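The first fix is always the same: make the failures leave evidence behind. A hedged sketch of a catch-all wrapper for a generic Python handler (framework details and names are assumptions):

```python
import functools, logging

log = logging.getLogger("api")

def log_failures(handler):
    """Wrap an endpoint so the 1-in-100 failure leaves a stack trace."""
    @functools.wraps(handler)
    def wrapped(*args, **kwargs):
        try:
            return handler(*args, **kwargs)
        except Exception:
            log.exception("unhandled error in %s args=%r", handler.__name__, args)
            raise  # still fail the request, but now with evidence in the logs
    return wrapped

@log_failures
def create_application(payload):  # hypothetical handler
    ...
```
-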
So I made an update to my React Native app. I changed the UI of a couple of screens, added a few animations here and there, refactored how my GraphQL resolvers work in the backend (no breaking changes), changed how data gets loaded into the database, etc.
It worked in dev, so I figured hey, let's deploy it. Today is (was, because it's now 3am, but more on that later) a national holiday, so no one goes to work, so no one will use my app, so I have an entire day to deploy.
I started at 15:00 (because I woke up at 13:00 lol). I tested the update once again in dev and proceeded to deploy it to prod. I merged the backend to master, built docker images, did migrations on the db, restarted docker-compose with the new images. And now for the app. I run ./gradlew assembleRelease and it starts complaining that react-native-gesture-handler is not installed. Ugh, rm -rf node_modules && yarn install. It worked. But now gradlew crashes and the logs don't tell me anything. Google tells me to change a bunch of gradle settings, but none of them work. Fast forward 5h: it's around 20:00 and I've isolated the issue to, again, react-native-gesture-handler. They updated from 2.2.4 to 2.3.0, which didn't fucking compile. 2 more hours passed (now 22:00) and I got v2.3.1 working, which fixed the problem in 2.3.0 but made my app crash on startup. YOUR FUCKING LIBRARY GETS 250K WEEKLY DOWNLOADS AND YOU DON'T EVEN BOTHER CHECKING IF IT COMPILES IN PROD ON ANDROID?! WHAT THE FUCK, software-mansion?
After I solved that, my app didn't crash. Now it threw a "Type error: Network Request Failed" every time I fetched my legacy REST API (older parts use REST and newer parts use GraphQL; I'll refactor that in the next update). I'll spare you the debugging hell I went through, but another 5h passed. It's 3am. My config had a misspelled url for prod but a good one for dev... I hate myself, and even more so react-native-gesture-handler. -
Worst week ever.
Servers are on fire. Response times are out of control.
Some SIMPLE SQL queries (literally select * from whatever where Id = id) time out at 30 seconds.
No idea what's going on (and I have full logs of all API calls and all DB queries). No way to figure out how to correlate this data.
Ok, I added 1000$/month on Azure and the problem is "masked", but not resolved.
I have dumps, I have logs, I have everything; why the fuck can't I find the 1 or 2 APIs causing this?!!!
Now I feel better. -
Here, code reviews are not happening 🙁☹️
When an actual error makes it to prod, we dig into the logs and figure out the reason.
The project is not stable and still in the development phase, so requirements keep pouring in every month, each with deadlines.
Deadlines are mostly 1 to 2 weeks.
Sr. devs mostly merge PRs without reviewing them.
I've had lots of opportunities though, thanks to the various requirements: I learnt AWS DynamoDB, S3, and a few things about EC2.
But the coding standards that need to be learned? I think I'm lacking there, because my code never gets reviewed.
And it's not only about coding: we have to create a ticket in Jira for our task, which is decided in the scrum and needs to be assigned to ourselves.
In the name of scrum, there are 1-2 hour meetings where they start brainstorming about new requirements and how we are going to implement them.
What should I do to make my code cleaner and more professional? -
Ok so, first technical blog post/rant, cuz I just reduced a lot of debt... Prolly gonna put this in an email to my boss (he says process improvement is now a priority, but there are some problems, as listed below):
So last week, I spent a lot of time investigating db logs manually to figure out a prod issue: tiring, time consuming, and not very effective.
This week I built an app. It took a few days, but having the time to design it correctly, it turned out very powerful.
So in order to really do process improvement, you need: dedicated time, a problem-solving mindset (the right people), and an understanding of what the problem is and why, so you can build a good solution (time and people). -
So.. I'm migrating a physical server to a virtual (Hyper-V) one.
The physical server is running Windows Server 2008 R2 with IIS6 and Windows SQL Server 2012.
I've set up a VM with Windows Server 2016, IIS10 and Windows SQL Server 2017.
I'm testing with just moving 1 db at a time (we have about 20, 1 per client running this software and a few others) and I've already imported all of the IIS sites.
So the database import and IIS import went smoothly and was surprisingly without hassle but now I'm trying to run the website that I imported the database for and it is throwing 503 Errors at me.
I've been trying to find out the cause but for some reason IIS isn't making any logs.
It's not any 64/32bit system problems (they're both x64) and I can't seem to find anything wrong with the imported config.
Anyone got any ideas?14 -
Note to self: when you extend a functionality by rewriting a relevant part, remember to mark the old code as deprecated or delete it.
AKA "why the hell I´m not seeing anything in logs/db that reflects my changes" T_T -
I feel like a fucking god now!
We run a webshop and we are in contract with the national post office. Every time there is an update to their program I fear ahead of time what will be fucked up again.
After today's update we weren't able to open any shipment list; we just saw a mile-long error message. After customer care couldn't figure out the problem, and the suggested solution might take up to 2 days (and is basically only a new customer file), I fired up my good old sqlite-viewer friend to check if I'd get lucky...
Guess what! That shit is using unsecured sqlite dbs, so I had no problem examining and even rewriting the values. So, checking the logs and scraping the DB, I found the problem.
Apparently some asshole thought that deleting a service but keeping all of its references scattered around other tables is a good fucking idea. And take that, customer care: the new customer file won't fix shit, because the problem is in the global DB. I swear I am getting more familiar with that piece of garbage than the ones who made it.
On top of that, customer care told us that if we can't manage to send the shipment list with the program, we are not eligible for our contractual prices.
It is not enough that I had to fix their fucking shit program; they also "would like to charge us" because their program isn't working. What a fucking great service. (At least the lady on the telephone was friendly.) -
Sooooo I came in to work yesterday and the first thing I see is that our client can't log on to the cms I set up for her a month ago. I go log in with my admin credentials and check the audit logs.
It says the last person to access it was me, the date and time exactly when we first deployed it to production.
One month ago.
I fired a calm email to our project managers (who've yet to even read the client complaint!) to check with ops if the cms production database had been touched by the ops team responsible for the sql servers. Because it was definitely not a code issue, and the audit logs never lie.
Later in the day, the audit log updated itself with additional entries - apparently someone in ops had the foresight to back up the database - but it was still missing a good couple weeks of content, meaning the backup db was not recent.
Fucking idiots. -
"Oh, sorry I didn't write you back! I checked 3 hours ago, and we only add the data once in our database before sending the notification to your endpoint, so everything is fine! Check if you run the same functio twice, it's an easy mistake!"
You. Fucking. Moron. You send the data 2 or 3 times (at random) every fucking time. I have nginx logs showing that, and I've fucking shown them to you TWICE. I don't fucking care if your DB is fine, check how many fucking times you POST the damn data. We're already 2 days behind schedule because you can't be arsed to check your own damn code. Ffs. How can you even be a senior developer?!
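While he's "checking", the receiving side can at least defend itself. A dedupe sketch keyed on a payload fingerprint (all names hypothetical; a sender-provided event id would be even better):

```python
import hashlib, json, sqlite3

db = sqlite3.connect("webhooks.db")
db.execute("CREATE TABLE IF NOT EXISTS seen (fingerprint TEXT PRIMARY KEY)")

def process(payload: dict):
    ...  # hypothetical downstream handling

def handle_notification(payload: dict):
    fp = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    try:
        db.execute("INSERT INTO seen (fingerprint) VALUES (?)", (fp,))
        db.commit()
    except sqlite3.IntegrityError:
        return  # duplicate POST number 2 (or 3) - drop it on the floor
    process(payload)
```
-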
The bullshittery continues. This time around everyone is absolutely innocent; clamav is the root cause. For once it's not an incompetent idiot but a piece of software. IDK if that makes me happy or upset.
So our email server, which I configured and took care of, died. RIP. Damn, better put it back together ASAP. So I'm under pressure, still pissed at everything I ranted about before (actually my last 2 rants were throttled; all of this happened in the past 60 minutes, but devrant rate limiting), and I start auditing logs. As you can imagine, we kinda need it NOW, and it's the second time in the last month that clamav is pulling stunts and the MTA (properly) refuses to work without antivirus. So, pressurized, I look at the logs to see what the fuck went wrong.
clamav deamonize() failed - cannot allocate memory
Hmm. Interesting, but sounds like bullshit. I know the server is quite micro because they wanted to save on costs as much as possible, but it has well over half a gig of free ram just before it crashes (like 800MB) with that message. Is it allocating almost a gig in one call or what? Looked carefully at trusty htop while it was starting, and indeed, suddenly it just dies with quite a bit of ram free, almost as much as it already weighs. And I remember booting it up when I was configuring it, and it had a fair bit of headroom.
Google, help me friend... Okay, great, so apparently at some point clamav loads the virus DB into ram (dafuq?) and then forks, which causes a momentary spike of 2x the ram usage, and then immediately frees it up.
Great, that sounds like a great design decision... At least now I know I can just slap on a SWAP file, restart it and call it a day.
It worked, swap file is almost empty (used 15megs, 900 megs free ram, whatever).
That leaves me wondering: who figured it was fine to load the DB into ram? It pretty much means clamav will eat a little more ram with each virus db update, and that millisecond "double ram" spike will confuse innocent people who just wanted to run clamav; it worked for the last *long period of time* and now crashes without warning, without any changes to the configuration.
Maybe there is a logical explanation; I want to know it. -
Worked with two different customers
(customer 1 is up to date because of active development and customer 2 got his update long ago)
Changed something for customer 1 and accidentally pulled customer 2's. 49 changesets (probably needs a db update). Rolled back, and now keeping an eye on the error logs.
New Project
M: Hey, check these two processes. Both took different paths for the same input. Here are the logs. Both are the same though.
Me: Ok... do we have a debugger?
M: No this product doesn't have a debugger
Me: Any unit tests i should know of?
M: We don't do unit testing. Everything is done in Integration Testing.
Me: Ok... so how can I check the db for this?
M: You can't, the access is restricted. You'll have to raise a ticket with the other team for the sql output you need.
Me: Ok. So I hope you have the schema at least.
M: Yes we have the schema. But there was some issue last week so the values might not be there in the correct column. They may or may not be present where they are supposed to be.
Wtf am I supposed to do... fucking play football with the other team over the ticketing system 😐 -
TL;DR I have to bump a Redis cluster from t3.medium to m6g.large just to get enough network bandwidth even though I have no need of the extra memory.
Debugged an interesting issue today.
I am adding Elasticache to a project to reduce strain on the single node postgres DB.
Deployed a Redis replication group with 2 shards, with multi-AZ replication for resilience.
Everything was going well. We aren't caching that much atm, so it was barely using 100MB of memory.
Suddenly, when our US region comes online, latency skyrockets and the logs are full of Jedis timeout errors.
Still no issue with memory or node CPU.
The cause? Arbitrary network bandwidth throttling by AWS. The app currently processes about 3,000 requests per second, so we were exceeding Amazon's random-ass allowances, which aren't documented anywhere. -
- Create a Cron to delete users with course status 'enrolled/incomplete' for several months based on existing courseUpdateLogs
- Run Cron manually 1st time
- Find out courseUpdateLogs were very incorrect
- Find out a LOT of valid users got deleted and can't access their worksite
- F***
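(In hindsight, one dry-run flag would have made this a non-event. A sketch with an invented schema:)

```python
import sqlite3

def purge_stale_users(db: sqlite3.Connection, dry_run: bool = True):
    rows = db.execute(
        "SELECT user_id FROM course_update_logs "
        "WHERE status IN ('enrolled', 'incomplete') "
        "AND updated_at < datetime('now', '-6 months')"
    ).fetchall()
    if dry_run:
        print(f"would delete {len(rows)} users, first few: {rows[:20]}")
        return  # eyeball this list *before* flipping dry_run off
    db.executemany("DELETE FROM users WHERE id = ?", rows)
    db.commit()
```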
American clients spent a good 4-5 hrs at night deploying and starting the application. Late at night in the US (morning in the UK), they then called us for help. After 5 minutes of looking at logs I said: "it would help if you had the DB servers on, as the logs were clearly saying". What knobs!!!
-
I recoded a REST endpoint that transfers large amounts of data from our db using a streaming response so it doesn't crash the server...
Pretty easy... It mostly just needed someone who knew wtf it was, or had a bit of curiosity and asked questions... rather than just keeping on doing what everyone else is doing...
Who hasn't seen logs updating in near real time in TeamCity, Jenkins... for the last 5yrs+... No one else ever wondered how it's done?
So yes, solving a production issue with old technology and being called a genius... I guess it's pretty satisfying?
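For the curious, it's the same trick those CI log views use: emit rows as they're produced instead of building the whole payload in memory first. A minimal Flask-flavoured sketch (the framework and query are assumptions, not what we actually run):

```python
import json, sqlite3
from flask import Flask, Response

app = Flask(__name__)

@app.route("/export")
def export():
    def generate():
        db = sqlite3.connect("app.db")  # hypothetical source
        # one JSON line per row; the full result set never sits in memory
        for row in db.execute("SELECT id, payload FROM events"):
            yield json.dumps({"id": row[0], "payload": row[1]}) + "\n"
        db.close()
    return Response(generate(), mimetype="application/x-ndjson")
```
-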
Any way to run some Python code on Android without losing my mind!? It's just code that queries a simple API once a minute and logs the result into a sqlite db. I want to use my phone as a substitute for a server instance, cos I don't have the monies to buy one.. 😬.. SL4A ain't working.. P4A webview isn't working for some reason.
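(FWIW the script itself is tiny. A sketch of the loop with a placeholder URL; under something like Termux this runs on Android as-is, though I can't vouch for SL4A/P4A:)

```python
import sqlite3, time, urllib.request

db = sqlite3.connect("readings.db")
db.execute("CREATE TABLE IF NOT EXISTS readings (ts REAL, body TEXT)")

while True:
    try:
        body = urllib.request.urlopen("https://api.example.com/data").read()
        db.execute("INSERT INTO readings VALUES (?, ?)",
                   (time.time(), body.decode()))
        db.commit()
    except Exception as e:
        print("poll failed:", e)  # keep looping; mobile networks are flaky
    time.sleep(60)
```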
-
Over 3 months, I wrestled and toiled with learning how rsyslog works (sending to the log server, which passes things on to AlienVault OSSIM), where I had to build a plugin that I thought could be done with the built-in plugin builder but ended up building from scratch, and had to learn regex (surprisingly fun, thanks to amazing online resources), test, build, and restart rsyslog, ossim-agent, ossim-server and ossim-db, just to get the application log showing up in the BROWSER!
I like OSSIM, but what's killing me the most is rsyslog. I still can't quite grasp how to get custom logs of any kind into a log server. I don't think I'll remember any of this by tomorrow, but welp.
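For the app side of "custom logs into a log server", Python's stdlib can hand lines straight to rsyslog over UDP 514. A sketch (the host and facility are assumptions, and rsyslog still needs a matching input/ruleset on its end):

```python
import logging
from logging.handlers import SysLogHandler

handler = SysLogHandler(
    address=("logserver.example.com", 514),  # hypothetical rsyslog host
    facility=SysLogHandler.LOG_LOCAL3,       # pick a facility rsyslog routes on
)
handler.setFormatter(logging.Formatter("myapp: %(levelname)s %(message)s"))

log = logging.getLogger("myapp")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("user 42 logged in")  # shows up at the log server as a syslog line
```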