22
Karunamon
309d

Once upon a time, one or two jobs ago, a really awesome engineer specced out a distributed search application in response to a business need. This company was managed pretty oldschool and required a ton of paperwork and approvals.

The engineer spent many weeks running tests and optimizing the hell out of this app cluster. It flew, and he had the data to prove it could handle production workloads (think hundreds of terabytes of data being processed every single day)

Part of the way he achieved this was having RAID0 on all of the servers to maximize I/O throughput. He didn't care much about data loss, since the application itself was fault tolerant on a much more granular level.

Management, hearing about this, absolutely flipped their shit and demanded RAID6 instead. This despite the conclusive data that the engineer had that proved RAID6 couldn't keep up.

He more or less got told to STFU.

Even this despite the fact that a RAID restripe would actually take many times longer than rebuilding the failed node from scratch (a process that took about 30 minutes by hand, and could probably be automated to be done in less than five), causing a longer exposure to actual data loss throughout the length of the days-long array rebuild time.

The ill-thought-out requirement added about 50% to the cost of the project (*many* more hard drives now required), beyond the original budget, and the subsequent bureaucratic wrangling resulted in a late product launch.

6 months or so later, after real customers were using this product, the app was buckling under around half of its expected workload. A friend of the engineer suggested to management to try RAID0. Sure enough, that resolved the I/O bottleneck.

This rage-inducing story has a happy ending, though! Said engineer left the company not long after this incident, citing it as a reason for his departure. He was immediately hired by another company, making integer multiples of his prior salary.

The product the company botched the launch of by ignoring his spec? It died a few months later. Maybe the poor customer experience was to blame? Maybe the late launch? Maybe it was another reason entirely.

Either way, millions of dollars of hardware now sat fallow. This was a black eye on the company all the way up to the C-level.

tl;dr: Listen to your engineers. You hired them for their expertise.

Comments
Add Comment