4

Wanted to try out a new alert based on a new Prometheus metric we added. To trigger it, we killed the dev-stage DB of the service. The alert never fired. The reason: once the DB is down, the metrics endpoint suddenly takes exactly 60s to respond, and the Prometheus scrape timeout is 20s, so every scrape times out and the metric never arrives.
And to top it off, this happens in every service we developed (that has a DB).
Well, at least the new alerting already helped find a bug.
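
For anyone hitting the same thing, here is a rough sketch of one way to keep the metrics endpoint fast when the DB is gone: run the DB check with its own short timeout and report a "DB down" gauge instead of waiting. Everything named here (`render_metrics`, `db_check`, `service_db_up`, the 2s budget) is made up for illustration and is not from our actual services; the only real constraint is staying well under the 20s scrape timeout.

```python
import concurrent.futures

# Hypothetical budget: keep it far below the 20s Prometheus scrape timeout.
DB_CHECK_TIMEOUT_S = 2.0

# Small shared pool so a hanging DB call cannot block the request thread.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=2)


def render_metrics(db_check, timeout_s=DB_CHECK_TIMEOUT_S):
    """Return Prometheus text-format output, waiting at most timeout_s for the DB.

    db_check is any callable that talks to the DB and raises on failure
    (a hypothetical stand-in for whatever the service really does).
    """
    future = _executor.submit(db_check)
    try:
        future.result(timeout=timeout_s)
        db_up = 1
    except Exception:
        # Covers both a timeout and a check that raised an error.
        db_up = 0
    return f"service_db_up {db_up}\n"
```

With this shape the scrape still succeeds while the DB is dead, so an alert on `service_db_up == 0` can fire. One caveat: the timed-out call keeps running in its worker thread until the driver gives up, so the driver's own connect timeout should be lowered as well. Independently of this, Prometheus sets its built-in `up` series to 0 for a target whose scrape fails, so alerting on that would also have caught the 60s hang.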

Comments
  • 2
    At least you didn't find out in prod :)
  • 0
    Poopsie...

    Lil hint: Always try to provide one metric (or response field) containing the datetime/timestamp when the metric was created...

    I was once fucked by caching :(

    Schrödinger's metrics aren't funny...
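
The timestamp hint from the comment above could look roughly like this: next to the real values, expose one extra gauge holding the Unix time at which they were produced, so a cached or frozen response can be told apart from a fresh one. The function names and the metric name `service_metrics_generated_timestamp_seconds` are assumptions picked for this sketch, not an established convention of any library.

```python
import time


def render_with_timestamp(values, now=None):
    """Render gauges in Prometheus text format plus a 'generated at' gauge.

    values maps metric names to numbers; now can be injected for testing.
    """
    generated_at = time.time() if now is None else now
    lines = [f"{name} {value}" for name, value in values.items()]
    # Hypothetical metric name: Unix time when these values were produced.
    lines.append(f"service_metrics_generated_timestamp_seconds {generated_at}")
    return "\n".join(lines) + "\n"


def is_stale(generated_at, max_age_s, now=None):
    """Return True if the values are older than max_age_s seconds."""
    now = time.time() if now is None else now
    return now - generated_at > max_age_s
```

On the Prometheus side the same staleness check can be written as an alert expression along the lines of `time() - service_metrics_generated_timestamp_seconds > 60`, with the threshold chosen to fit the scrape interval.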