Production goes down because there's a memory leak due to scale.

When you say it in one sentence, it sounds too easy. As developers, we know how it really goes. It starts with an alert ping, then one server instance goes down, then the next. First you start debugging your own code, then the application servers, then the web servers, and by that point you're already on your toes. Then you realize that the application and the application servers have been gradually losing memory over a long period. If it's an application that doesn't get redeployed very often, the complexity grows even faster. No anomaly / change detection monitor can catch a gradual loss of memory spread over months.

  • 0
    Of course it can be detected; you just need to define good rules and have a good monitoring system :)
  • 0
    @linuxxx I had over 18 months of data and I still couldn't detect a gradual memory change of 13 KB every 4 days.

    I had to analyze each incoming data sample to figure that out. Even when I widened the scope of my monitor to the entire dataset, I couldn't see the decrease happening. It was mostly a straight line with maybe a 5-10 degree slope. There's no way my monitors could detect that.

    Or maybe I'm just a bad operations guy who didn't know how to set up his monitors right, like you said.
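
    For what it's worth, a drift this slow can in principle be recovered from long-horizon data with a plain least-squares slope fit rather than threshold-style anomaly rules. Here's a minimal sketch using the numbers from the comment above (13 KB every 4 days, roughly 18 months of samples); the daily sampling rate, 500 KB noise level, and 4 GB baseline are made-up assumptions for illustration, not the commenter's actual setup.

    ```python
    # Sketch: recovering a tiny memory drift with an ordinary least-squares fit.
    # Assumed (hypothetical) setup: daily samples over ~18 months, a 4 GB
    # baseline, and 500 KB of random noise that dwarfs the per-sample drift.
    import random

    random.seed(42)

    DAYS = 18 * 30                  # ~18 months of daily samples
    DRIFT_PER_DAY = 13 / 4          # KB lost per day (13 KB every 4 days)
    NOISE_KB = 500                  # day-to-day fluctuation, in KB

    baseline = 4 * 1024 * 1024      # 4 GB of free memory, in KB
    samples = [baseline - DRIFT_PER_DAY * d + random.gauss(0, NOISE_KB)
               for d in range(DAYS)]

    # Ordinary least-squares slope: covariance(x, y) / variance(x),
    # with x = day index 0..DAYS-1 and y = the memory samples.
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((d - mean_x) * (y - mean_y) for d, y in enumerate(samples))
    var = sum((d - mean_x) ** 2 for d in range(n))
    slope = cov / var               # KB per day; negative means a leak

    print(f"fitted drift: {slope:.2f} KB/day (true: {-DRIFT_PER_DAY:.2f})")
    ```

    The point isn't that any off-the-shelf monitor does this; it's that the noise averages out over hundreds of samples, so the fitted slope lands close to the true -3.25 KB/day even though no single sample shows it. A monitor that only compares recent windows against recent baselines never sees it.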