unskilleddev

5y

How do you deal with situations where you have no clue what happened in production?

We have a Spark job, and suddenly its execution stopped midway without any error in the logs. After some time it started working again, and the master and all the workers looked fine at that point.
Now the client wants an RCA for that. 😟

rant

scala

spark

yarn

Ranter

Comments

0

Afrographics

330

5y

What is RCA?
0

unskilleddev

128

5y

@Afrographics Root cause analysis.
1

netikras

34576

5y

Gather all server and network metrics (and SAN metrics, if applicable) for that period, plus 1 hour before and after. And I mean all of them, everything you can get: from context-switch rate to swapping, I/O, CPU% by category, PID rotations and more.
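To make the "1 hour before and after" part concrete, here's a rough sketch of how the window could be computed and used to trim whatever metric samples you manage to export. The function names and the `(timestamp, value)` sample shape are my own assumptions, not something from a specific monitoring tool.

```python
from datetime import datetime, timedelta


def investigation_window(stall_start, stall_end, padding=timedelta(hours=1)):
    """Return (start, end) of the period to pull metrics for.

    `padding` is added on both sides of the stall (1 hour by default).
    """
    if stall_end < stall_start:
        raise ValueError("stall_end is before stall_start")
    return stall_start - padding, stall_end + padding


def samples_in_window(samples, window):
    """Keep only (timestamp, value) samples that fall inside the window."""
    start, end = window
    return [(ts, value) for ts, value in samples if start <= ts <= end]


if __name__ == "__main__":
    # Hypothetical stall times, purely for illustration.
    window = investigation_window(
        datetime(2021, 3, 3, 14, 0), datetime(2021, 3, 3, 14, 30)
    )
    print(window[0].isoformat(), "to", window[1].isoformat())
```

Use the same window for every source (host metrics, app metrics, logs) so everything lines up later.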

Get all the app metrics you can get for that time as well.

Get syslog and auth logs from those servers, and dump dmesg.
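Once you have the log files, you'll want only the lines from the window you care about. A minimal sketch, assuming the traditional syslog timestamp at the start of each line (`Mon dd HH:MM:SS`, no year, English month names). Your servers may well log in a different format, so treat the parsing as an assumption and adjust it. The helper names are made up for this example.

```python
from datetime import datetime


def parse_syslog_time(line, year):
    """Parse a leading 'Mon dd HH:MM:SS' stamp; return None if the line has none.

    Traditional syslog lines carry no year, so the caller has to supply it.
    """
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def lines_in_window(lines, start, end, year):
    """Keep the log lines whose timestamp falls between start and end (inclusive)."""
    kept = []
    for line in lines:
        ts = parse_syslog_time(line, year)
        if ts is not None and start <= ts <= end:
            kept.append(line)
    return kept
```

Lines without a parsable timestamp (continuation lines, for example) are dropped here. If you'd rather keep them attached to the previous line, that's a small change.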

Now analyze all that and look for a correlation. You'll most likely see some pattern. Try to correlate with the auth logs too.
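One crude way to start on the correlation: flag the samples in each metric that sit far from that metric's mean, then look for timestamps where several metrics spike together. This is just a sketch of that idea, not a proper anomaly detector. The metric names, the 2-sigma threshold and the data layout (a dict of `(timestamp, value)` lists on a shared sampling grid) are all assumptions.

```python
from statistics import mean, pstdev


def outlier_times(series, threshold=2.0):
    """Return timestamps whose value is more than `threshold` std devs from the mean."""
    values = [value for _, value in series]
    if len(values) < 2:
        return set()
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return set()  # flat series, nothing stands out
    return {ts for ts, value in series if abs(value - mu) / sigma > threshold}


def co_occurring_outliers(metrics, min_metrics=2, threshold=2.0):
    """Map each timestamp to the metrics that spike there, if enough of them do.

    `metrics` is a dict of metric name -> list of (timestamp, value).
    """
    hits = {}
    for name, series in metrics.items():
        for ts in outlier_times(series, threshold):
            hits.setdefault(ts, []).append(name)
    return {
        ts: sorted(names) for ts, names in hits.items() if len(names) >= min_metrics
    }
```

Timestamps that come back from `co_occurring_outliers` are the ones worth checking against the logs first. A spike that shows up in only one metric may still matter, so lower `min_metrics` if nothing turns up.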

There's so much you can do, but without the context I can't say anything more. You prolly don't need all of this, just some parts. But again, without the context...

devRant © 2021 Hexical Labs LLC