About: People sometimes ask me why I need so long to finish work for the day. Frankly folks, our guild does shit you can't even imagine, so pls stop asking.
Skills: Puppet, Python, PyQt5, Redis, Prometheus, Ubuntu (14|16|18|20.04), CentOS 7, some Bash magic too. 🥴
Location: Mars, Edge of Valles Marineris
Joined devRant on 3/4/2019
The Prometheus tales
Part IV - A new FUBAR.
A new and very fascinating problem emerged a few days after feeding some node definitions to the new titan instance.
It's a storage fuck-up. A major one.
If I'm informed correctly, the latest Prometheus should have the same (or even better) log compression algorithms for metrics as the old one - because these fuckers are so damn good at what they are doing: compress some fucking logs.
The new instance is aggregating metrics as planned. Grafana works like a fucking charm.
Nevertheless, for very fascinating but unknown reasons, the new instance creates 50GB of metrics in under 4 fucking hours.
Am I missing something here? Some magic parameter that has to be passed to the titan to enable the hardcore compress-them-fuckers feature?
Debugging session is tomorrow.
To be continued.
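For a sense of scale on the 50GB in under 4 hours from Part IV, here is a back-of-the-envelope sketch. Only the 50GB, the 4 hours and the 600+ targets come from the rants; the 2 bytes per sample (the upper end of the 1-2 bytes per sample that the Prometheus storage docs quote) and the 15 second scrape interval are assumptions, so treat the result as a rough sanity check and nothing more.

```python
# Back-of-the-envelope check. Everything except the 50 GB, the 4 h and the
# 600 targets is an assumption, not a measured value.
GIB = 1024 ** 3

written_bytes = 50 * GIB     # "50GB" from the rant, read as GiB
window_seconds = 4 * 3600    # "under 4 hours", taken as exactly 4 h
targets = 600                # "600+" nodes from Part III
bytes_per_sample = 2.0       # assumed: compressed size of one sample
scrape_interval = 15         # assumed: scrape interval in seconds


def implied_rates(written, window, per_sample):
    """Return (bytes per second, samples per second) implied by the disk growth."""
    bytes_per_second = written / window
    samples_per_second = bytes_per_second / per_sample
    return bytes_per_second, samples_per_second


bps, sps = implied_rates(written_bytes, window_seconds, bytes_per_sample)

# How many time series each target would have to expose to explain the growth,
# if every byte on disk were a properly compressed sample.
series_per_target = sps * scrape_interval / targets

print(f"{bps / 1024 ** 2:.1f} MiB/s written")
print(f"{sps:,.0f} samples/s implied")
print(f"{series_per_target:,.0f} series per target implied")
```

If the implied series count per target comes out far above what a node exporter plausibly exposes, the growth is probably not compressed samples alone, which would point the debugging session at something other than the compression itself.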
!dev and on behalf of some non-IT-related members of my family.
how hard is it to create some ms teams accounts for students? (cloud, there is no on-prem, i presume)
the school in question has roughly 300 students (well.. in germany..).
with a proper degree of automation, this can be solved, or am I wrong about this one?
the student in question, my cousin's wee one, received login credentials that just don't fkn work.
the first remote class session is planned for tomorrow morning @ 0900.
my guess would be that the admin(-team; i hope..) will have some fun tomorrow morning, because the wee one surely isn't the only one whose fkn credentials do not work.
debugging escalated hard. started with Neos, went over Apache and nginx. no more problems there after a clean db import. spent the whole day on this and ended up with the "result" that varnish, this fkn (most of the time helpful) bastard, is the problem. didn't get any results after that. meh.
Storytime - The Prometheus tales - Part III (I think..).
Updated the node definitions on the old node today, just to keep it up to date. nothing fancy.
I went to the new node and checked the setup again. I already had roughly 120 node definitions onboard for testing purposes.
so all firewalls should have been configured the right way, so that the wee one might finally celebrate the marriage with the rest of the gang.. and then went with "puppet YOLO" on the new node. added every fkn node definition to the new setup.
every node turned out to be just fine.
except for 137 little InstanceDown alerts (out of 600+).
it's a good thing that the little fella can send mails to me, myself and I only for the time being.
so debugging. again. but at least it's not a problem related to prometheus itself, because the connections end with a timeout on the related nodes. should be more like a firewall fubar.
we will see.
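The timeout-means-firewall reasoning from Part III can be sketched as a small probe: a connection that times out usually means packets get dropped on the way (firewall), while a refused connection means the host answered but nothing listens on that port. The helper name `probe` is made up for this sketch, and which port the exporters listen on is an assumption (9100 is the node exporter default).

```python
import socket


def probe(host, port, timeout=3.0):
    """Classify a TCP connection attempt as 'open', 'refused' or 'timeout'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all: typical for a firewall silently dropping packets.
        return "timeout"
    except ConnectionRefusedError:
        # Host reachable, but no service is listening on that port.
        return "refused"
    except OSError as exc:
        # DNS failure, unreachable network and similar.
        return f"error: {exc}"


# Self-contained demo against a throwaway local listener.
with socket.socket() as listener:
    listener.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    listener.listen(1)
    demo_port = listener.getsockname()[1]
    print(probe("127.0.0.1", demo_port))
```

Looping `probe(node, 9100)` over the nodes behind the 137 alerts would show quickly whether they all fall into the timeout bucket.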
Our Prometheus node, one of our oldest systems (somehow fits the Titan reference..), is about to be relieved of its duties after several years of loyal service to the crew.
We decided to put another Prometheus node in the ring that will run simultaneously with the old one, so that the new one can start to collect the metrics that we need for alerting (some historic metrics are needed too..). sort of a Prometheus cluster, without the cluster fun and with 2 different Prometheus versions.
The problems with this? Well, it's not the new node or the latest shit versions of Prometheus per se.
1: The node exporter.
those dudes decided to make some breaking changes in a minor update, so you need some magic bullshittery before the latest Prometheus can make something out of the old metrics provided by the old node exporters.
2: The related puppet code.
The node definitions for Prometheus were built via exported resources on the target nodes.
The code worked like a charm with only one Prometheus node, but try that with two instances in the same way.
Still WIP, but some targets are already included in the new Prometheus instance.
alerting works so far.
Can't wait to close this ticket for good..
once upon a time, there was a dream: we need to test the vagrant setups for our Devs, so that they can run these against the production environment of puppet without problems.
in the year of 2016, the once lone ranger - our team lead - created the ticket. don't. even. ask.
the idea was to build these vagrant setups via bamboo, log the results and fix the setups afterwards.
after weeks of brain fuckery (aka daily business), home office madness, beer, java specs, more beer and many failed builds, I made it.
bamboo now builds the fuckers via a dedicated agent and I closed the ticket today \o/.
Wanted to add alerting for systemd services in Prometheus today, which spontaneously turned out to be a huge pain in the lower human backend.
For some reason, on Ubuntu 16.04 systemd adds services without unit files for software that isn't even installed on the damn server (in this case for mysql-server; only mysql-common and mysql-client are installed) and lists them as "not-found" and "inactive". The prometheus node exporter that we use has a little bug in the systemd collector that makes sure that the states of *all* services are collected - even those without a unit file.
so those metrics are pulled by prometheus and now I have to deal with those faulty metrics in the condition logic of the alert, because I'm trying to trigger that one on a service which is listed with state "active" = 0 or "failed" = 1.
now guess. right! If the unit file doesn't exist, the systemd service in question is marked as "inactive", which is another possible state of the metrics in the node exporter. the problem is that the value 1 for state "inactive" means that "active" has the value 0 (not even wrong) and the alert is triggered.
so systemd fucks up somehow, the node exporter collector fucks up because systemd fucked up and I have to unfuck this with some crazy horse shit logic. w.t.f. to that.
the only good news is that it works like a charm on Ubuntu 18.04, as far as I can tell.
while writing this little rant, I thought of a solution.
I could try to change the alert condition to state "active" = 0 AND "failed" = 1.. but that will wait till tomorrow.
one does not simply patch monitoring conditions at midnight..
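To see why the OR condition fires on the phantom unit and what the AND variant would change, the logic can be played through in a few lines. This is a toy model, not the real alert rule: it assumes the exporter emits one sample per possible state with value 1 for the current state and 0 for the rest, as described above, and the unit names are only illustrative.

```python
# Assumed set of states exposed per unit; 1 marks the current state.
STATES = ("activating", "active", "deactivating", "failed", "inactive")


def unit_state_samples(current):
    """Mimic the one-sample-per-state shape of the systemd unit state metrics."""
    return {state: int(state == current) for state in STATES}


def alert_or(samples):
    # Original condition from the rant: "active" = 0 or "failed" = 1
    return samples["active"] == 0 or samples["failed"] == 1


def alert_and(samples):
    # Proposed condition from the rant: "active" = 0 AND "failed" = 1
    return samples["active"] == 0 and samples["failed"] == 1


# Illustrative units; mysql stands in for the "not-found" unit without a unit file.
units = {
    "nginx.service": "active",
    "varnish.service": "failed",
    "mysql.service": "inactive",
}

for name, current in units.items():
    samples = unit_state_samples(current)
    print(f"{name:16} OR={alert_or(samples)!s:5} AND={alert_and(samples)}")
```

In this model exactly one state is 1 at a time, so the AND condition boils down to "failed" = 1. The phantom unit stops alerting, but so does any real service that is merely stopped ("inactive") without having failed, which may or may not be what the alert should cover.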
you want to build a database dump with bamboo.
the job works, everything is green AF - but there are no build artifacts. you check the buildconfig 5 times and then you realize, there are blanks after the copy pattern of the frak'n build artifact.
dafuq is this..?
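Why trailing blanks kill a copy pattern can be shown with any glob matcher: the blanks become part of the pattern, so only file names that end in blanks would match. The sketch below uses Python's `fnmatch` as a stand-in; bamboo's own pattern matching is not reproduced here, and the artifact name and pattern are made up.

```python
from fnmatch import fnmatchcase

artifact = "dump.sql.gz"             # hypothetical build artifact
pattern_with_blanks = "*.sql.gz  "   # copy pattern with two trailing blanks

# repr() makes the otherwise invisible blanks show up.
print(repr(pattern_with_blanks))

# The blanks are matched literally, so the artifact is not picked up.
print(fnmatchcase(artifact, pattern_with_blanks))

# Stripping the whitespace makes the very same pattern match.
print(fnmatchcase(artifact, pattern_with_blanks.strip()))
```

Printing the configured pattern with `repr()` (or any other quoting) is a cheap way to spot this kind of thing before checking the buildconfig a sixth time.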