How do you deal with situations where you have no clue what happened in production?

We have a Spark job that suddenly stopped executing midway, with no error in the logs. After some time it started working again, and the master and all the workers looked fine at that point.
Now the client wants an RCA for it. 😟

  • 0
    What is RCA?
  • 0
    @Afrographics root cause analysis
  • 1
    Gather all server and network metrics and, if applicable, SAN metrics for that period, plus 1 hour before and after. And I mean all of it, everything you can get: context-switch rate, swapping, I/O, CPU% by category, PID rotation, and more.

    Get all the app metrics you can get for that time as well.

    Get syslog and the auth logs from those servers, and dump dmesg.

    Now analyze all of that and look for correlations. You'll most likely see some pattern. Try to correlate with the auth logs too.

    There's so much you can do, but without the context I can't say anything more. You probably don't need all of this, just some parts. But again, without the context...
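
The collect-then-correlate approach described above could be sketched roughly like this in Python. This is only an illustration under assumptions: the function names (`padded_window`, `merge_timeline`), the timestamps, and the sample messages are all made up, and it assumes you have already parsed each source (syslog, auth log, metrics) into `(timestamp, message)` pairs.

```python
from datetime import datetime, timedelta

def padded_window(start, end, pad=timedelta(hours=1)):
    """Return the incident window widened by `pad` on each side."""
    return start - pad, end + pad

def merge_timeline(sources, start, end):
    """Merge (timestamp, message) events from several sources into one
    timeline sorted by time, keeping only events inside [start, end]."""
    timeline = []
    for name, events in sources.items():
        for ts, msg in events:
            if start <= ts <= end:
                timeline.append((ts, name, msg))
    timeline.sort(key=lambda event: event[0])
    return timeline

# Hypothetical stall period and sample events, for illustration only.
stall_start = datetime(2024, 1, 1, 12, 0)
stall_end = datetime(2024, 1, 1, 12, 30)
sources = {
    "syslog": [
        (datetime(2024, 1, 1, 9, 0), "example syslog entry, well before the stall"),
        (datetime(2024, 1, 1, 11, 58), "example syslog entry near the stall"),
    ],
    "authlog": [
        (datetime(2024, 1, 1, 11, 55), "example login entry near the stall"),
    ],
    "metrics": [
        (datetime(2024, 1, 1, 11, 59), "example swap-rate sample near the stall"),
        (datetime(2024, 1, 1, 14, 0), "example sample, well after the stall"),
    ],
}

win_start, win_end = padded_window(stall_start, stall_end)
timeline = merge_timeline(sources, win_start, win_end)
for ts, source, msg in timeline:
    print(ts.isoformat(), source, msg)
```

Putting everything on one time axis makes it easier to spot what happened just before the job stalled, for example a login or a swap spike a few minutes earlier. Real data would need per-source timestamp parsing and clock-skew checks between hosts, which this sketch leaves out.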