Wanted to add alerting for systemd services in Prometheus today, which spontaneously turned out to be a huge pain in the lower human backend.

For some reason, on Ubuntu 16.04 systemd lists services without unit files for software that isn't even installed on the damn server (in this case mysql-server; only mysql-common and mysql-client are installed) and reports them as "not-found" and "inactive". The prometheus node exporter that we use has a little bug in the systemd collector that makes sure the states of *all* services are collected - even those without a unit file.
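For reference, the systemd collector exposes one `node_systemd_unit_state` series per unit/state combination, so a phantom "not-found" unit shows up something like this (unit name and values illustrative):

```text
node_systemd_unit_state{name="mysql.service",state="active"} 0
node_systemd_unit_state{name="mysql.service",state="activating"} 0
node_systemd_unit_state{name="mysql.service",state="deactivating"} 0
node_systemd_unit_state{name="mysql.service",state="failed"} 0
node_systemd_unit_state{name="mysql.service",state="inactive"} 1
```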

so those metrics are pulled by prometheus, and now I have to deal with those faulty metrics in the condition logic of the alert, because I'm trying to trigger it on any service that is listed with state "active" = 0 or "failed" = 1.
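The current alert condition, sketched as a hypothetical Prometheus rule (alert name and `for` duration are assumptions, not the real config):

```yaml
groups:
  - name: systemd
    rules:
      - alert: SystemdServiceDown
        # fires for any unit that isn't active or has failed -
        # including phantom "not-found" units, which also have active == 0
        expr: |
          node_systemd_unit_state{state="active"} == 0
            or node_systemd_unit_state{state="failed"} == 1
        for: 5m
```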

now guess what happens. right! If the unit file doesn't exist, the service in question is marked as "inactive", which is just another possible state of the metric in the node exporter. the problem is that a value of 1 for state "inactive" means that "active" has the value 0 (not even wrong), so the alert is triggered.

so systemd fucks up somehow, the node exporter collector fucks up because systemd fucked up and I have to unfuck this with some crazy horse shit logic. w.t.f. to that.

the only good news is that it works like a charm on Ubuntu 18.04, as far as I can tell.

while writing this little rant, I thought of a solution.
I could try to change the alert condition to state "active" = 0 AND "failed" = 1… but that will wait till tomorrow.
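That idea as a PromQL sketch - `ignoring(state)` joins the two series per unit, so the phantom units (which have failed == 0) no longer match (assumed expression, untested at midnight just like the rant says):

```yaml
# only fire when the unit both isn't active AND is marked failed
expr: |
  node_systemd_unit_state{state="active"} == 0
    and ignoring(state) node_systemd_unit_state{state="failed"} == 1
```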

one does not simply patch monitoring conditions at midnight..

  • This is the reason I started containerizing these workloads.
  • I blame Ubuntu for that tho.
  • @theKarlisK Not having an FS representation of an object on linux is reason enough to blame them.
  • little update on this one: turns out I'm not the first one to trigger that trap. the bugfix comes with release 0.17.0 of the node exporter. yes, we are running an older version - just don't ask why.

    so the plan is to find some suitable (stage) nodes where we can make a test run of 0.17.0, just in case there are some breaking changes.