Recently we noticed a part of our web application wasn't working. After some hours of looking into it (it's an old, convoluted application), it became clear another part of the application timed out trying to get a connection from the db connection pool.

We called the DB admins, and they responded, "Oh yeah, looks like the DB CPUs are at 100% load. I'll do something about it." A short while later everything was working. So now I think our hours of looking into it, and a lot of people not being able to work, could have been avoided if the DB admins had some form of alerting. But we could also improve our own monitoring: had we tracked the calls made to our DB, we would have seen the problem much sooner.

Question: Do you think I should call the DB guys, telling them they need alerting, or should I add tracing/monitoring around our DB calls, or both? Do you think I should consider any additional actions I haven't thought of?
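
For the tracing option, here is a rough sketch of what I have in mind. It assumes Python, and the `traced_db_call` helper, the `call_log` list, and the one-second threshold are all made up for illustration; none of it comes from our actual application. The idea is to time every DB call and log a warning when one is slow or fails, for example when getting a connection from the pool times out.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("db_trace")

# Hypothetical in-memory record of calls; a real setup would
# send these to whatever metrics or alerting system is in use.
call_log = []


@contextmanager
def traced_db_call(name, slow_threshold_s=1.0):
    """Time a block of DB work and record whether it was slow or failed."""
    start = time.monotonic()
    error = None
    try:
        yield
    except Exception as exc:
        # Remember the failure for logging, then let it propagate unchanged.
        error = exc
        raise
    finally:
        elapsed = time.monotonic() - start
        slow = elapsed >= slow_threshold_s
        call_log.append(
            {
                "name": name,
                "seconds": elapsed,
                "slow": slow,
                "failed": error is not None,
            }
        )
        if error is not None:
            logger.error("DB call %s failed after %.3fs: %r", name, elapsed, error)
        elif slow:
            logger.warning("DB call %s took %.3fs", name, elapsed)
```

Wrapping the code that gets a connection from the pool in `with traced_db_call("pool_checkout"):` would have put the timeouts into our logs right away, instead of us digging for hours.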

  • 2
    The tone is muy importante when you call the admins.

    You must implement alerting
    - wrong

    You need alerting
    - better

    We realized we need to monitor if a db connection times out to prevent this... Could you implement alerting and notify our team if the server misbehaves?
    - best imho

    That way you shift the responsibility to your own team and there is no blame game.

    It's just a few words more, but the difference in meaning is extreme.
  • 0
    @IntrusionCM I read some things about non-confrontational speech, I'll try to put it to good use when I contact them 🙂

    I hope there will be slow change at least. It's just that I feel, generally, people here aren't used to making sure incidents don't happen again. They are in long calls involving many people when something happens, and that is seen as "doing the dirty work". Improving stuff is often seen as "useless stuff that doesn't pay off" somehow. At least that's the vibe I got from talking to some.

    But now I tracked how many hours were spent by how many people and how many people were affected, so I have some hard facts to back me up.
  • 0
    This reminds me I should add a redirect to an error page for when the DB is unreachable, instead of letting it fail with "error in '/'".

    Wonder if I'll remember that in two weeks when I get back :-p
  • 1