Comments
-
cprn 1761 5y
I like the irony of this rant but man, I'd love for it to be way more technical. I mean we're all tech savvy people here capable of understanding even the most in-depth nuances (granted, when simplified and described in abstracts). The interesting part here is what was the "genius idea" exactly, how did it lead to loss of data and why did you need that data in the first place.
-
stacked 2669 5y
@cprn well, for simplicity let's say that we have a system that ingests a big stream of data and tries to find outliers in it. Those outliers are put in an Elasticsearch stack for later investigation (it's actually more complex than that).
Our genius ex-manager got a fabulous idea for assigning a "score" to outliers so that you could sort them by how "interesting" they are.
A team member went on to implement his idea and wrote some components to add this extra "score" field to each Elasticsearch record. After the initial implementation, it turned out that his idea was total shit, based on magical pseudo-scientific formulas that only flat-earthers could believe in.
So, what did our genius ex-manager do after failing so miserably? Did he admit his mistake? Did he move on to something else? Did he take a vacation? None of this.
-
stacked 2669 5y
He evolved the model, adding more "score" fields that are mixed and grouped together in a second phase. Today, we have about 30 "score" fields and 5 "overall score" fields (not kidding).
Now you might be wondering: how did this lead to loss of data? Very simple: the formulas are so poorly engineered and the implementation was so rushed that some of those scores often end up being NaN. More precisely, 25% of our records contain at least one NaN somewhere.
Apparently, Elasticsearch doesn't like the way we serialize our NaNs, and simply rejects those records. 25% of the outliers we found over the last month were never stored.
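For readers wondering how a NaN can sink a whole record, here is a minimal sketch of that kind of failure. It is not the actual pipeline: the language (Python), the field names `score_a` and `score_b`, and the `sanitize` helper are all made up for illustration. The only solid facts it relies on are that strict JSON has no NaN literal, and that Python's `json.dumps` emits a bare `NaN` token by default unless `allow_nan=False` is passed.

```python
import json
import math


def sanitize(record):
    """Return a copy of the record with non-finite floats replaced by None.

    Hypothetical helper: None serializes to JSON null, which a strict
    JSON parser accepts, whereas a bare NaN token is not valid JSON.
    """
    clean = {}
    for key, value in record.items():
        if isinstance(value, float) and not math.isfinite(value):
            clean[key] = None
        else:
            clean[key] = value
    return clean


# Hypothetical outlier record where one score formula produced NaN.
record = {"id": "outlier-1", "score_a": 0.73, "score_b": float("nan")}

# By default Python writes a bare NaN token, which strict parsers reject.
default_output = json.dumps(record)

# With allow_nan=False the problem surfaces at serialization time instead.
try:
    json.dumps(record, allow_nan=False)
    strict_ok = True
except ValueError:
    strict_ok = False

# After sanitizing, the record serializes to valid JSON with a null score.
safe_output = json.dumps(sanitize(record), allow_nan=False)
print(default_output)
print(strict_ok)
print(safe_output)
```

Failing loudly with `allow_nan=False`, or mapping NaN to null before indexing, at least makes the bad scores visible instead of letting a quarter of the records silently disappear.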
Turns out that some of the most interesting outliers produced by my new outlier detector were among the lost ones. I spent a day trying to figure out why my records were not showing up before discovering the root cause. Also, I wanted to share my results in a few days, but now I cannot anymore. I have to wait another month in order to have enough data.
-
cprn 1761 5y
Okay, this made my day now. :D I can feel proper frustration!
Also, reminds me of that time our admin installed Elasticsearch for the first time ever and didn't know it tries to "guess" the index data type by default... It took him 3 days to figure out he was missing some of the logs, and only after seeing that we prefixed every string in the log message with `ffua-` on the application layer. It's an abbreviation for Franky Fucked Up Again. We still have that prefix there, I think.
-
matste 641 5y
@stacked Great rant, but you could have simply said „he decided to store data in Elasticsearch”.
Friends don’t let friends persist data in ES.
I had a manager who was a completely incompetent idiot (besides being a fucking backstabber). He left the company ~3 weeks ago, yet I believe it will take 5 years to get rid of his legacy.
Today I discovered that one of his "genius ideas" led to the loss of months of data. This is already bad, but it's even more upsetting given that the records that have been lost are exactly the ones I needed to prove the validity of my project.
That fucking man keeps fucking with me even when he's not here, YOU DAMN ASSHOLE!!