# PROD

* 10 app instances running
* 1 instance starts burning up 100% CPU
* we ask for a Thread Dump (stack traces)
* we get a TD taken after they manually restarted the instance
* they: "Please investigate. We need this fixed ASAP"
* .....

EVERY FUCKING TIME!!! Not once in recent years have they managed to take a TD correctly. What kind of a retarded monkey do you have to be for this to not sink in for YEARS!

Who tf put those idiot monkeys there in the first place...

Comments
  • 7
    LOL.

    The problem with 100% CPU use is that it makes the machine unresponsive, so taking a thread dump is next to impossible....

    I had a similar experience with my work VM this week. I "done something stupid": executed a complex mock cleanup, and CPU use jumped to 100%. The SSH session stopped responding, I couldn't stop the processing, nothing.
    Had to wait until the operation was done to investigate. No usable info by then - no longer 100% CPU use....

    In your case I would implement something that takes an automatic thread dump when CPU use goes over 95%, plus an automatic upload to the Elastic monitoring stack when the machine is restarted. That way PROD is not involved, and you get the required info.
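    Something like that could be a small in-process watchdog. A minimal sketch in Java, assuming a HotSpot-style JVM where com.sun.management.OperatingSystemMXBean is available (the class name TdWatchdog, the 95% threshold, and the 5-second poll are illustrative; the upload-to-Elastic part is left out):

    ```java
    import com.sun.management.OperatingSystemMXBean;
    import java.io.PrintWriter;
    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    // Sketch of an in-process watchdog: poll our own CPU load and write a
    // thread dump to a file once it crosses a threshold, so nobody in PROD
    // has to do anything while the instance is on fire.
    public class TdWatchdog {
        private static final double THRESHOLD = 0.95; // 95% of all cores

        public static void start() {
            OperatingSystemMXBean os = (OperatingSystemMXBean)
                    ManagementFactory.getOperatingSystemMXBean();
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            Thread watchdog = new Thread(() -> {
                while (true) {
                    if (os.getProcessCpuLoad() > THRESHOLD) {
                        try (PrintWriter out = new PrintWriter(
                                "td-" + System.currentTimeMillis() + ".txt")) {
                            for (ThreadInfo ti : threads.dumpAllThreads(true, true)) {
                                // Caveat: ThreadInfo.toString() truncates deep stacks;
                                // a real tool would format ti.getStackTrace() itself.
                                out.println(ti);
                            }
                        } catch (Exception ignored) { }
                    }
                    try { Thread.sleep(5_000); } catch (InterruptedException e) { return; }
                }
            }, "td-watchdog");
            // Max priority so the watchdog still gets scheduled under load.
            watchdog.setPriority(Thread.MAX_PRIORITY);
            watchdog.setDaemon(true);
            watchdog.start();
        }
    }
    ```

    Call TdWatchdog.start() from the app's bootstrap and the dumps already exist before anyone gets the chance to restart anything.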
  • 3
    @magicMirror 100% CPU usage should not make the server unresponsive. The OS task scheduler is responsible for making sure of that (unless you are not working with a time-sharing system).

    I can think of 2 exceptions: fork bombs and processes stuck in D state (heavy, long I/O, or memory at 100%).

    Anyway, that's not the case for us. 100% CPU is not a blocker for taking a TD.
    They even asked us to make tools to make taking a TD easier. We did (a sketch of the idea follows this comment). And still they fail to understand that they need to visit a doctor while they are sick, not after they've gotten better!

    Even if I automated the TD process, I am almost certain they will find a way to fail at that too. IDK how, don't ask me. Ask them.
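    For the record, such a tool can be little more than a wrapper around jcmd, which ships with the JDK. A hedged sketch (TdTool is a made-up name; it assumes jcmd is on the PATH of the box running the app):

    ```java
    import java.io.File;
    import java.io.IOException;

    // Made-up "one-button" TD tool: shells out to jcmd (ships with the JDK)
    // and redirects the dump to a timestamped file. Usage: java TdTool <pid>
    public class TdTool {
        public static void main(String[] args) throws IOException, InterruptedException {
            String pid = args[0];
            File out = new File("td-" + pid + "-" + System.currentTimeMillis() + ".txt");
            // "jcmd <pid> Thread.print" asks the target JVM for a full thread dump.
            // A pegged CPU is not a blocker: the target only needs a time slice
            // to service the attach request, which a time-sharing scheduler grants.
            new ProcessBuilder("jcmd", pid, "Thread.print")
                    .redirectErrorStream(true)
                    .redirectOutput(out)
                    .start()
                    .waitFor();
            System.out.println("Thread dump written to " + out);
        }
    }
    ```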
  • 3
    CPU limits could help mitigate the problem of CPU spikes, as should process niceness (emergency response can be done at niceness -20, i.e. the highest regular scheduling priority). You can do CPU limits through anything that supports cgroups (e.g. LXC), or directly with cgroups (see the sketch at the end of this comment), and you can change niceness by logging in on another SSH session and running "renice -n -20 $$".

    That being said, processes going haywire and starving the whole box of all its resources sure are a pain in the ass to deal with...
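    A hedged sketch of that cgroup route (assumes cgroup v2 mounted at /sys/fs/cgroup with the cpu controller enabled for the parent group, and root privileges; the td-cap group name and the half-a-core cap are made up):

    ```java
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Illustrative cgroup v2 sketch: cap a process at half a core so a
    // runaway loop can't starve the whole box. Usage: java CpuCap <pid>
    public class CpuCap {
        public static void main(String[] args) throws IOException {
            String pid = args[0];
            // mkdir inside cgroupfs creates a new child cgroup
            Path cg = Path.of("/sys/fs/cgroup/td-cap");
            Files.createDirectories(cg);
            // "quota period" in microseconds: 50ms of CPU per 100ms window = 0.5 core
            Files.writeString(cg.resolve("cpu.max"), "50000 100000");
            // Writing the PID into cgroup.procs applies the cap immediately
            Files.writeString(cg.resolve("cgroup.procs"), pid);
        }
    }
    ```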
  • 4
    @Condor I agree. However, setting limits won't do me any good in containers - I need the app to be performant, not just spinning in some endless while(true) loop :) It's not the container I want to save. It's the app. And I need its diagnostic info (i.e. a TD) while the problem is observed. NOT after.
  • 4
    Clear case of a race condition:
    You have to kick them in the arse right after you tell them to take the dump, to get the task scheduling the machine restart preempted by the pain-response task. After that, the next task to get a time slice is likely the task processing your actual request. So if one time slice is enough, you will get the thread dump scheduled just before the reboot.

    If they are slow, you have to time multiple kicks right at the continuations of the reboot-scheduling task, to keep it interrupted...
    Might need to experiment with boot weight, acceleration, and interval duration though.
  • 0
    @netikras a browser can make a computer unresponsive without even reaching 100% CPU use.

    100% CPU use would mean any new work only gets scheduled and worked on when there is CPU time to do those two things, and if there is a process constantly feeding the CPU enough work to stay at 100%, then nothing new gets worked on and every active process gets less time too, effectively freezing everything.

    And since the workload will not go away, because you have a process generating work constantly, your system will never unfreeze.

    So basically, I don't believe you.
  • 0
    @mundo03 you are contradicting yourself.

    100% CPU usage DOES NOT make a machine unresponsive if you are working with a time-sharing system [which you are]. A task scheduler accepts new tasks, runs each of them for a few milliseconds, and switches to another task. The only 3 ways I can think of that could make a system unresponsive are:
    - new tasks are being spawned so fast that they overwhelm the scheduler, and the scheduler's own bookkeeping itself consumes lots of CPU
    - processes are freezing the CPU, which is often the case with hardware-related operations [i/o, radio, ...], where synchronous syscalls take far longer than they should
    - processes are saturating user session resources [i.e. limits]

    Newly spawned tasks are added to the task list and the scheduler works on all of them little by little. Even if you have thousands of tasks, all of them will be run concurrently and in parallel [multi-CPU] nonetheless.

    What you prolly were referring to is the CPU run queue, but it's not the reason a system becomes unresponsive - it's a metric that might suggest the system is sluggish.

    How do you freeze a PC with a browser? Let me guess - by performing excessive graphical operations [i.e. hardware-related tasks]? Because a simple while(1) won't cut it (see the sketch below).
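    That claim is easy to check for yourself. A toy sketch in Java (the class name BusyBox is made up): peg every core with spin loops and watch a normal thread keep getting time slices anyway.

    ```java
    // Toy demo: saturate all cores at ~100% CPU, yet the main thread still
    // runs once a second, because the time-sharing scheduler keeps rotating.
    public class BusyBox {
        public static void main(String[] args) throws InterruptedException {
            int cores = Runtime.getRuntime().availableProcessors();
            for (int i = 0; i < cores; i++) {
                Thread spinner = new Thread(() -> { while (true); }, "spinner-" + i);
                spinner.setDaemon(true); // die with the JVM
                spinner.start();
            }
            // CPU is now pegged, but this loop still prints on schedule:
            while (true) {
                System.out.println("still responsive at " + System.currentTimeMillis());
                Thread.sleep(1_000);
            }
        }
    }
    ```

    Run it and the fans will scream, but the prints keep coming - saturation alone doesn't freeze a time-sharing scheduler.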
  • 0
    @netikras
    That does make sense indeed.
    Thank you sir, I believe you now.