Afrographics7225dWhat is RCA?
netikras2298324dGather all server and network metrics and, if applicable, SAN metrics for that period, plus 1 hour before and after. And I mean all: everything you can get, from context-switch rate to swapping, I/O, CPU% by category, PID rotations and more.
Get all the app metrics you can get for that time as well.
Get syslog and authlogs from those servers, dump dmesg.
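A rough sketch of the "that period with 1 hour before and after" part, in Python. It assumes each log line starts with an ISO 8601 timestamp followed by a space, which is only an assumption for illustration (real syslog and dmesg formats differ and would need their own parsing). The names `padded_window` and `lines_in_window` are made up for this example.

```python
from datetime import datetime, timedelta

# Padding taken from the advice above: 1 hour before and after the incident.
PADDING = timedelta(hours=1)


def padded_window(start, end, padding=PADDING):
    """Return the incident window widened by `padding` on both sides."""
    return start - padding, end + padding


def lines_in_window(lines, start, end):
    """Keep log lines whose leading timestamp falls inside the padded window.

    Assumption: every relevant line looks like '<ISO timestamp> <message>'.
    Lines without a parseable timestamp are skipped.
    """
    lo, hi = padded_window(start, end)
    kept = []
    for line in lines:
        stamp, _, _rest = line.partition(" ")
        try:
            ts = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # not a timestamped line in the assumed format
        if lo <= ts <= hi:
            kept.append(line)
    return kept
```

Running every source (syslog, authlogs, app logs) through the same window makes them easier to line up side by side later.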
Now analyze all that and look for correlations. You'll most likely see some pattern. Try to correlate with authlogs too.
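One way to start on "find a correlation" is to rank every metric by how strongly it moves together with the symptom you care about. The sketch below uses plain Pearson correlation, which is my choice for illustration and not the only option. It assumes all series were sampled on the same time buckets, and the metric names in the usage note are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def rank_by_correlation(target, metrics):
    """Sort metrics by absolute correlation with `target`, strongest first.

    `metrics` maps a metric name to its series; `target` is the symptom
    series (e.g. error rate or response time per time bucket).
    """
    scored = [(name, pearson(series, target)) for name, series in metrics.items()]
    return sorted(scored, key=lambda pair: abs(pair[1]), reverse=True)
```

For example, `rank_by_correlation(error_rate, {"ctx_switches": [...], "swap": [...]})` returns the metrics most worth a closer look first. A high score is only a hint about where to dig, not proof of the cause.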
There's so much you can do, but without the context I can't say anything more. You probably don't need all of this, just some parts. But again, without the context...