devRant - A fun community for developers to connect over code, tech & life as a programmer

Search - "node-exporter"

8

IntrusionCM

15177

3y

One of these days....

Where you want to do a tiny task

....

And suddenly an explosion nukes every service, related service and dependent service.

Chain reaction. Yaaaayyy........

(An ancient Prometheus node led to a snapshot error, the snapshot error made the migration tool unhappy, and an unhappy migration tool meant that my task failed. Updating Prometheus meant checking every target, exporter and so on...
Fuckity fuck, it's gangbang time.)

rant look-it's dead

1
3

rootofskynet

410

4y

Wanted to add alerting for systemd services in Prometheus today, which spontaneously turned out to be a huge pain in the lower human backend.

For some reason, systemd on Ubuntu 16.04 adds services without unit files for software that isn't even installed on the damn server (in this case for mysql-server, while mysql-common and mysql-client are installed) and lists them as "not-found" and "inactive". The Prometheus node exporter that we use has a little bug in the systemd collector: it collects the states of *all* services, even those without a unit file.

so those metrics are pulled by Prometheus and now I have to deal with those faulty metrics in the condition logic of the alert, because I'm trying to trigger it on any service whose state is "active" = 0 or "failed" = 1.

now guess. right! If the unit file doesn't exist, the systemd service in question is marked as "inactive", which is another possible state of the metric in the node exporter. the problem is that the value 1 for state "inactive" means that "active" has the value 0 (not even wrong), so the alert is triggered.

so systemd fucks up somehow, the node exporter collector fucks up because systemd fucked up and I have to unfuck this with some crazy horse shit logic. w.t.f. to that.

the only good news is that it works like a charm on Ubuntu 18.04, as far as I can tell.

while writing this little rant, I thought of a solution.
I could try to change the alert condition to state "active" = 0 AND "failed" = 1.. but that will wait till tomorrow.
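A rough sketch of why the OR condition misfires and what the AND condition would change. It assumes the systemd collector reports one series per state for each unit, with exactly one of them set to 1; the unit names and helper functions are made up for illustration.

```python
# States the collector is assumed to report per unit; exactly one is 1.
STATES = ("active", "activating", "deactivating", "inactive", "failed")


def unit_state_series(current_state):
    """Return {state: 0 or 1} for a unit that is currently in current_state."""
    return {state: int(state == current_state) for state in STATES}


def naive_alert(series):
    # Original condition: "active" = 0 OR "failed" = 1
    return series["active"] == 0 or series["failed"] == 1


def strict_alert(series):
    # Proposed condition: "active" = 0 AND "failed" = 1
    return series["active"] == 0 and series["failed"] == 1


# Hypothetical units and their current states.
units = {
    "nginx.service": "active",    # healthy
    "mysql.service": "inactive",  # ghost unit without a unit file
    "cron.service": "failed",     # actually broken
}

naive_firing = sorted(
    name for name, state in units.items() if naive_alert(unit_state_series(state))
)
strict_firing = sorted(
    name for name, state in units.items() if strict_alert(unit_state_series(state))
)

print(naive_firing)   # ['cron.service', 'mysql.service']
print(strict_firing)  # ['cron.service']
```

Under that one-state-at-a-time assumption, "failed" = 1 already implies "active" = 0, so the AND condition boils down to alerting on the failed state alone. In PromQL that would be something along the lines of `node_systemd_unit_state{state="failed"} == 1`, provided the exporter in use exposes the metric under that name. The trade-off is that a service which was merely stopped, without failing, would no longer trigger the alert.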

one does not simply patch monitoring conditions at midnight..

rant

3
3

rootofskynet

410

3y

Storytime.

Our Prometheus node, one of our oldest systems (somehow fits the Titan reference..), is about to be relieved of its duties after several years of loyal service to the crew.

We decided to put another Prometheus node in the ring that will run simultaneously with the old one, so that the new one can start to collect the metrics that we need for alerting (some historic metrics are needed too..). sort of a Prometheus cluster, without the cluster fun and with 2 different Prometheus versions.

The problems with this? Well it's not the new node or the latest shit versions of Prometheus per se.

1: The node exporter.
those dudes decided to make some breaking changes in a minor update, so you need some magic bullshittery before the latest Prometheus can make something out of the old metrics provided by the old node exporters.

2: The related Puppet code.
The node definitions for Prometheus were built via exported resources on the target nodes.
The code worked like a charm with only one Prometheus node, but try that with two instances in the same way.

Still WIP, but some targets are already included in the new Prometheus instance.
alerting works so far.

Can't wait to close this ticket for good..

rant storytime

devRant © 2021 Hexical Labs LLC