Do all the things like ++ or -- rants, post your own rants, comment on others' rants and build your customized dev avatarSign Up
Get a devDuck
Rubber duck debugging has never been so cute! Get your favorite coding language devDuckBuy Now
Search - "cover every possibility"
I swear to god, I'm going to track down the dipshit who just made my day hilariously painful.
So here I am, finishing up this project that's been going on for what feels like an eternity, when I get an email "why doesn't order X show up in this other system?".
I mean, it's a common thing they can take 15 minutes to push across, so the usual quick glance and what do you know, it's just sitting there as if it's waiting to be pushed through, than an hour later... it's still there, so I start digging, maybe a data issue, nope looks all good, customer details, payment details, products...
just another order, jump on the logs and all looks fi......... wait.... why does this postcode have 3 digits and not 4 , Australia has 4 digit postal codes fyi, looks at order again, 3 digits, look at log, 3....hold on why's it only 3 digits, checks code, handled as string... ok..... where the fuck would it drop a digit.... frontend requires 4 digits, validation requires 4 digits... how the fuck did you get 3 digits in... I can't see anything anywhere that logically makes sense for this🤔
Drops address into google and it's a postcode starting with 0.
Jumps on DB and the fucker is an int in the postcode table. For all you playing at home 0123 <> 123
I don't know if I should feel bad, or impressed, it's been 7 years since this table was created, and 7 years before someone managed to live in one of these parts of the country with a leading 0.
QA didn't spot this years ago,
No one tested this exact scenario,
The damn thing isn't even documented as a required delivery area, but here we are!
Kudos good sir, you broke it! 🤜 🤛
You sir may get your order now!10
I hate the elasticsearch backup api.
From beginning to end it's an painful experience.
I try to explain it, but I don't think I will be able to cover it all.
The core concept is:
- repository (storage for snapshots)
- snapshots (actual backup)
The first design flaw is that every backup in an repository is incremental. ES creates an incremental filesystem tree.
Some reasons why this is a bad idea:
- deletion of (older) backups is slow, as newer backups need to be checked for integrity
- you simply have to trust ES that it does the right thing (given the bugs it has... It seems like a very bad idea TM)
- you have no possibility of verification of snapshots
Workaround... Create many repositories as each new repository forces an full backup.........
The second thing: ES scales. Many nodes / es instances form a cluster.
Usually backup APIs incorporate these in their design. ES does not.
If an index spans 12 nodes and u use an network storage, yes: a maximum of 12 nodes will open an eg NFS connection and start backuping.
It might sound not so bad with 12 nodes and one index...
But it get's pretty bad with 100s of indexes and several dozen nodes...
And there is no real limiting in ES. You can plug a few holes, but all in all, when you don't plan carefully your backups, you'll get a pretty f*cked up network congestion.
So traffic shaping must be manually added. Yay...
The last thing is the API itself.
It's a... very fragile thing.
Especially in older ES releases, the documentation is like handing you a flex instead of toilet paper for a wipe.
Documentation != API != Reality.
Especially the fault handling left me more than once speechless...
gives you a state PARTIAL
gives you a state SUCCESS
Why? The first one is blocking and refers to the backup status itself. The second one shouldn't be blocking and refers to the backup operation.
And yes. The backup operation state is SUCCESS, while the backup state might be PARTIAL (hence no full backup was made, there were errors).
So we have now an additional API that we query that then wraps the API of elasticsearch. With all these shiny scary workarounds like polling, since some APIs are blocking which might lead to a gateway timeout...
Gateway timeout? Yes. Since some operations can run a LONG (multiple hours) time and you don't want to have a ton of open connections hogging resources... You let the loadbalancer kill it. Most operations simply run in ES in the background, while the connection was killed.
So much joy and fun, isn't it?
Now add the latest SMR scandal and a few faulty (as in SMR instead of CMD) hdds in a hundred terabyte ZFS pool and you'll get my frustration level.
PS: The cluster has several dozen terabyte and a lot od nodes. If you have good advice, you're welcome - but please think carefully about this fact.
I might have accidentially vaporized people sending me links with solutions that don't work on large scale TM.2