13

Issue in production. Multibillion-dollar enterprise. Complex landscape. We sort of make things.

Turns out there is a single point of failure at a specific integration point. Kind of a lot stopped. When I reached out to the people who know anything about it and suggested that maybe we should make a slight change in how we do things, they just brushed it off. Like it was nothing… 😬

No data was lost, but everything was delayed for many hours. The _truth_ varied between different parts of the ecosystem, which could have led to wrong or suboptimal decisions.

When I asked why this LOS was not detected, they told me they have no means of detecting it. 😬

I’m like, yeah, it’s 2023, we’re going to land on Mars and you can bet your ass we can detect it and you are just LAZY DEVELOPERS!

Anyway, I escalated (nicely) and they are now implementing a (more) resilient system, and we’re helping the team detect THEIR LOS in minutes instead of downstream services noticing it hours later (they are bad too, but it’s not their fault!)

Stay safe!

Comments
  • 7
    Hope you are getting a raise off it...

    Because cynical me thinks this will end up as a medal on some HR/C-suite/middle-management asshole.

    And yes, fixing shit *is* your job, but taking credit for the fix *is* your right.
  • 2
    Haha we sort of make things
    Haha
  • 0
    Define los... again..
    Your fancy modern-day acronyms do not compute.

    So, where integration is concerned, you mean where one software system is drawing data from and placing data into the backend of another, or going through an API, or something else?

    Also, how were you detecting the error?
  • 3
    The sad truth is that often, everyone is aware of the deficiencies, most people want to remediate them, but only a few push for it, and they're usually the ones who can't really allocate the resources to do so because they either don't have the authority, or the business side doesn't see the value and insists new features are the priority instead... well, at least until the ship damn near sinks, then it's "why the hell didn't you technologists fix this before now?!" It's maddening.
  • 1
    @fzammetti so true. When will they start to listen to/read what we are telling them needs to be fixed/optimized/updated for good, instead of only giving the green light when the damage is too big... those are their priorities...
  • 2
    I'm at a loss here, what is LOS?
  • 3
    @hjk101

    I read it as loss of service.
  • 1
    @AvatarOfKaine Loss of Signal. It was (is?) NASA lingo when telemetry was not received. Usually behind the moon or in re-entry.
  • 0
    @AvatarOfKaine And yes, a dependent system. Just a simple copy of data. 🤷🏼‍♂️
  • 0
    @CoreFusionX Nah, and I don’t expect it either. I would believe this is expected of me (us). 🤷🏼‍♂️
  • 0
    @AvatarOfKaine Oh, yes. The lack of data was not detected early but late. Very late. Too late. That’s what we fixed.
  • 1
    @hjk101 Loss of Signal. I believe it is an old NASA thing. No data.
  • 0
    @CoreFusionX Loss of Service would imply a fault. In this case no errors at all. Just bad monitoring (no monitoring?).
  • 2
    @sideshowbob76

    My bad.

    Still, observability is an important aspect, especially in mission critical projects.

    Which is why the one who identifies the faults and provides the solutions should be rewarded.

    Sure, it may be expected of you, but then again it was expected of your predecessor, and he clearly didn't do such a good job...
  • 0
    @CoreFusionX No worries! 😀

    Well, I think that there are many of us who do much more than is expected from us.

    Monitoring is usually one of the things that gets forgotten when creating a thingy. I would encourage everyone to define a minimal level of monitoring that is required before putting anything in production (or UAT for that matter). There should be a milestone for it in every project. I would go as far as to have a dedicated team doing this that everyone can utilize. If you don’t have either, start setting up a road-map of where you want to be a year from now.
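For a simple data copy like the one in this thread, a "minimal level of monitoring" can be as small as a staleness check: remember when data last arrived and raise the alarm once the silence lasts longer than expected. The sketch below is only an illustration under assumptions. The function name `signal_lost`, the 15-minute threshold, and the timestamps are hypothetical and not taken from the system discussed here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def signal_lost(last_received: Optional[datetime], now: datetime,
                max_silence: timedelta) -> bool:
    """Return True when no data has arrived within max_silence.

    A feed that has never delivered anything (None) counts as lost.
    """
    if last_received is None:
        return True
    return now - last_received > max_silence


# Hypothetical example: the feed normally delivers every few minutes,
# so 15 minutes of silence is treated as a loss of signal.
now = datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc)
last_ok = now - timedelta(minutes=5)
last_stale = now - timedelta(hours=3)

print(signal_lost(last_ok, now, timedelta(minutes=15)))     # False
print(signal_lost(last_stale, now, timedelta(minutes=15)))  # True
```

The point of checking for silence rather than for errors is that a feed which simply stops sending raises no error anywhere, so only the absence of data can be detected.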