29

> Monitoring: Load Average of 57!! ALERT!!!!
> me: What? That's not possible?
> *Monitoring froze 14 hours ago*
> *sshs into server*
> *see attached image*

The issue was ~1200 df processes started by our monitoring system, none of which finished because the external cluster mounted on that server had died a few minutes earlier. Just re-mounting the cluster fixed it, but it was still a funny sight!
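
For anyone who wants to spot this kind of pile-up on their own box, here is a minimal Python sketch that counts processes in D state (uninterruptible sleep) per command name. It assumes input lines of the form "<state> <command>", which is what `ps -eo state=,comm=` prints on a procps-based Linux system. The function name `count_uninterruptible` and the sample data are made up for illustration; this is not the monitoring setup from the post.

```python
from collections import Counter


def count_uninterruptible(ps_output):
    """Count processes in D state, grouped by command name.

    Expects one process per line as "<state> <command>", for example
    the output of `ps -eo state=,comm=` (an assumption about the input).
    """
    counts = Counter()
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        state, command = parts
        if state.startswith("D"):
            counts[command.strip()] += 1
    return counts


# Made-up sample, not taken from the server in the post.
sample = """\
S sshd
D df
D df
R ps
D df
"""

if __name__ == "__main__":
    for command, n in count_uninterruptible(sample).most_common():
        print(command, n)
```

To run it against a live system, the sample string could be replaced with the text returned by `subprocess.run(["ps", "-eo", "state=,comm="], capture_output=True, text=True).stdout`.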

Comments
  • 2
    df in D state... Damn that's dangerous..
  • 8
    That uptime tho
  • 4
    @shaji reboot? What's a reboot?
  • 0
    Can someone please explain or link to a description of how to read load data on Linux? I mean, ok, it's load 57 or 1197, but what's the scale here and how is it calculated?
  • 2
    Load average, or CPU queue. It shows how much work the whole CPU pool has. A load avg of 1 means 1 CPU is completely busy with tasks. If you have only 1 processing domain [a core, a thread], this means your machine is completely loaded with tasks and new tasks will line up in a queue. Now if you have, say, 4 domains on the server, a load avg of 1 means the machine is nearly idling. A load avg of 4 would mean all 4 domains are busy with tasks.

    Processes in D state are locking domains with I/O, by default for 120 seconds if I recall it right. While a domain is hogged, other processes enqueue their tasks for it. The queue gets longer and longer. Logging in will most likely take longer as the queue grows [unless you're lucky and sshd has high priority in the scheduler]
  • 1
    @netikras luckily the ssh process had a high priority and I almost instantly logged into the server, but yeah, it was still scary stuff
  • 1
    D processes are evil, because they are usually unkillable. Suppose you run a reboot command on such a server and disconnect, waiting for the machine to become pingable again. If you do not watch the ping output carefully you are doomed to an unexpected reboot. The server will not be rebooted until that process leaves D state, which could happen... Any time at all. From my xp this mostly happens right in the middle of a business day :)
  • 0
    I wish I'd see the same uptime for Windows Server, not sure how long they can stay up before they start going crazy, I never dare do that to my Windows machine lol
  • 1
    @netikras I was able to killall without issues as root, which I am REALLY glad about
  • 2
    @gitpush I think one of the longest uptimes I've seen on our servers was 1200 or so days and still counting
  • 0
    How many cores?
  • 1
    @netikras Simple to understand and straight to the important points. Thank you!
  • 1
    @linuxxx I think that one had 2 cores but I can check again tomorrow if you really want to know
  • 0
    @ThatPerlDeb Oh okay, not needed, but if you had 1000 cores this wouldn't be THAT high :I

    Holy shit though O_o
  • 0
    @ThatPerlDeb tbh that should be the case for a server, I mean everything should be planned and implemented correctly. Servers are only meant to go down for upgrades, but it makes me wonder: throughout those 900 days, not a single update required a reboot?
  • 2
    @gitpush no updates, no restarts.. who needs security updates anyway?!
    There's a lot of other things done wrong security-wise but oh well
  • 0
    @ThatPerlDeb seriously? And that's ok for management? :S
  • 3
    @linuxxx Just checked, server has 4 cores, 4 GB of RAM with 2 GB

    @gitpush We don't really have a "management", we just have 3 guys doing financial stuff and 5 other guys (1 boss, 4 employees) doing code stuff.

    Don't get me started on how we're not using public keys to authenticate but rather use the same root password for every damn server or how people have password files in their home directory practically open to everyone in the intranet (only there, not to the internetz) but believe "it's okay" and all the other stuff that would make anyone with a sane mind go "WTF?"
  • 0
    @ThatPerlDeb and how are you surviving that :O
  • 2
    @gitpush by accepting the fact that I'll only have to endure one more year until my apprenticeship is finally over and I can leave this shithole for good.

    I could attempt to switch workplaces via my apprenticeship parliament (idk what the correct English term is) but I don't think "incorrect security measures" is a valid reason to change companies. Also I don't think it's worth the hassle, because if shit hits the fan I won't be to blame anyway unless I fucked up, which I'm carefully trying not to.
  • 0
    @ThatPerlDeb aah in that case then ya stay there, it isn't worth the trouble moving on to a different place
  • 4
    @gitpush I just found this gem! :D
  • 1
    @ThatPerlDeb 😨😨😨😨😨
    Trying to break the world record of 20 years? Not sure if it was 20, but I recall it was a really old server that had to be put to sleep
  • 0
    Ha, I’ve had servers with 25,000+ load. Usually some bad programmer’s fault.
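
To put numbers on the "how do I read load" question from the comments, here is a rough Python sketch that parses the text of `/proc/loadavg` and divides the load by the core count, following the rule of thumb above (load equal to the number of cores means all cores are busy). One detail worth knowing: on Linux the load average counts tasks in uninterruptible sleep (D state) as well as runnable ones, which is why a pile of hung df processes can raise it without using any CPU. The function names and the sample line are illustrative assumptions, not output from the server in the post.

```python
def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages as floats.

    `text` is the content of /proc/loadavg, whose first three fields
    are the load averages.
    """
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def load_per_core(load, cores):
    """Normalize a load average by the number of cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load / cores


# Made-up sample line in /proc/loadavg format.
sample = "57.00 48.12 30.50 3/1400 12345"

if __name__ == "__main__":
    one, five, fifteen = parse_loadavg(sample)
    # A value around 1.0 means roughly one task per core.
    print(load_per_core(one, 4))
```

On a real Linux machine the sample could be swapped for `open("/proc/loadavg").read()` and the core count for `os.cpu_count()`.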