
FUUCCKKKK!! I need to hit smth. Or rant..

So that flaky ec2 issue.. These ec2s act as a shared environment for multiple apps. Our app is one of them. I have no access to those ec2s at all.

What I have access to is my app and some monitoring. Now the app randomly starts lagging while nearly idling. At the same random times monitoring stops completely and doesn't come back up. This happens to random app instances at random times.

Reached out to infra support, managed to get attention from the big boys [mgmt]. Today we got the fix deployed. I test it out -- problem persists.

I find this behaviour somewhat familiar. Managed to get some server stats from infra folks. Apparently cpu% is high as well as load avg [cpu queue]. Bingo! I know how to fix it!

So I write a long comment w/ all the commands and all the 'if that, do this'. Send it to one of the infra technicians

and I get a reply: 'we will apply cpu usage limitations to fix the issue'

wait... Cpu% limitations will do nothing but highlight the underlying problem...

'no, instances have high cpu utilisation which is causing those lags. We will limit cpu resources and it will be fixed'

oh ffs... Cpu utilization and cpu queue are VERY different things.. I tried explaining that to them like 7-9 times. And all I get is:

'yes, cpu utilization is the problem. We will limit it and solve the problem'
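For anyone who wants the difference spelled out: on Linux, load avg counts tasks that are runnable plus tasks stuck in uninterruptible sleep (state D, usually waiting on disk), while cpu% only says how busy the cores are. A rough sketch of how one could compare the two, assuming a Linux box with `/proc`; the helper names `load_per_core` and `count_r_and_d` are made up for illustration and are not what I sent to infra:

```shell
# Sketch only: helper names are hypothetical, made up for illustration.

# load_per_core LOAD CORES
# Prints the load average divided by the core count. A value that stays
# above 1.00 suggests tasks are queueing or blocked, whatever cpu% says.
load_per_core() {
  awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# count_r_and_d
# Reads one process state per line on stdin (e.g. from `ps -eo state=`)
# and counts runnable (R) vs uninterruptible-sleep (D) tasks.
count_r_and_d() {
  awk '{ s = substr($1, 1, 1); if (s == "R") r++; else if (s == "D") d++ }
       END { printf "R=%d D=%d\n", r, d }'
}
```

On the affected host this would be run as `load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"` and `ps -eo state= | count_r_and_d`. A pile of D tasks next to a high load avg would point at I/O rather than at something a cpu cap can fix.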

I would surely escalate all of this through higher channels if only I could get my hands on those ec2s and get proof. But that is not happening and I'm forced to sit back and watch them break things even worse until they are out of options and mark my query as 'wont fix'....

Fuck that's frustrating....

*thinking to myself* so I've read about that new vulnerability 2 days ago that allows one to escape from docker container to the host... What if <...>

Comments
  • 3
    That's the dumbest "solution" for a problem i've seen in a long time. Get a water pistol and piss in the tank and assault those lazy motherfuckers from infra with your piss.
  • 1
    @heyheni I wish I could... they are 7 timezones away from me :/

    but I could piss into that pistol and send it to them via snail mail. Maybe w/ a note 'sir, aim it at yourself and pull the trigger'. That would work I think
  • 1
    @irene
    1. while :; do ps aux | awk '{if ($8 ~ /D/){print $0}}' ; sleep 1 ; done
    2. strace -T -cvf -s4096 -p <pid>
    3. Find lagging syscalls
    4. lsof -p <pid>
    5. Find those resources
    6. Get them provisioned for higher iops
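Step 3 can be partly automated. A minimal sketch, assuming strace was run with `-T` and without `-c` (with `-c` strace prints a summary table instead of per-call lines, so there are no per-call timings to filter). The function name `slow_syscalls` and the 0.1 s default threshold are made up for illustration:

```shell
# Sketch only: slow_syscalls is a hypothetical helper name.

# slow_syscalls [MIN_SECONDS]
# Reads `strace -T` output on stdin and prints only the calls whose
# elapsed time (the trailing <seconds> field) is at least MIN_SECONDS.
slow_syscalls() {
  awk -v min="${1:-0.1}" '
    match($0, /<[0-9]+\.[0-9]+>$/) {
      # Strip the surrounding angle brackets to get the duration.
      t = substr($0, RSTART + 1, RLENGTH - 2)
      if (t + 0 >= min) print
    }'
}
```

Something like `strace -f -T -s4096 -p <pid> 2>&1 | slow_syscalls 0.5` would then show only calls that took half a second or more. The file descriptors in those lines are what to look up in the lsof output.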
  • 1
    Lol why can't you access your EC2s?! - oh, because they are shared with other app teams, I guess... By the time I got to the bottom I forgot you said that. 😁