8

Worst week ever.

Servers are on fire. Respoinse times out of control

Some SIMPLE SQL queries (literaly select * from whatever where Id = id) timouts at 30 seconds.

No idea what's goining on (And I have full logs of all api calls and all DB queries). No way to find how to corelate this data.

Ok, I added 1000$/month on Azure and the problem is "masked", but not resolved.

I have dumps, I have logs I have everything, why the fuck I can't find the 1 or 2 APIs causing that ?!!!

Now I feel better.

Comments
  • 1
    What does Azure support say?

    Maybe a restart would work? Not enough IOPS? Is it a self-hosted DB service or Azure Managed? Simple id-based queries going slow indicate a bigger problem than an app performance issue
  • 2
    Instead of trying to do a wild goose chase...

    Slow down. Drink a cup of your fav beverage.

    Gather only the facts.

    What time ranges, what data is available for that time range, what do the usual tools (tuning advisor / query analyser / explain statements) say, what does monitoring ( IO / hard drive usage / server variable state ) say .

    Try to narrow down the time range as much as possible .

    When you've gathered the data, drink another beverage and don't fall into the tarpit of anticipation. Just look at the facts.

    Databases are - thanks to their complexity - especially good liers.

    The select statement can be e.g. a temp table issue as it gathers too many rows and the temp tables are written to HDD due to misconfigured configuration...

    Or it could be just missing quotes (string comparison, conversion of full table to another type for matching) and so on.

    If you anticipate you most likely not see the massive fat stinky finger the database is pointing at ya.
  • 0
    @asgs it's not on Azure.

    I see in logs a picks on 5000 (!!) sql queries by second, trying to colerate it with API calls, but there are like 1000 API calls in this range.

    So now I'm looking on queries and trying to find where in the code they are made.

    Azure support answered in 4 hours and told me to make dumps (Already did Duh).
  • 0
    @IntrusionCM Ok to be way more precise, the problems are visible when LogIO is toping at 100% in database. While CPU is around 5%.

    My guess we have some APIs which do WAY to much inserts/updatdes.

    I'm searching by time range, but so far, I identified 3 potential sources and still not 100% sur.

    I need another beer ;p
  • 0
    @IntrusionCM LogIO is thwe white line on this graph. Once it's at 100% for more than 5 seconds, every quries fail, api fail, user complain.

    Edit : this is over 7 last days

    Edit 2 : And that is a Database which cost around 1500$ / month
  • 1
    @NoToJavaScript hehe.

    Log IO i as in IO related to logging only?
  • 0
    @IntrusionCM So it’s way more complex when you use Database as a service. It’s important,. It’s not a sql VM, it’s really Azure sql as a service.
    If you look there :
    https://stackoverflow.com/questions...
    You’ll get that response:
    “The metrics represents the writes to the transaction log of the database. The transaction log tracks the changes to the data and hence this for the most part is a function of the amount of data you insert or update. As you said auditing has no impact on the log rate in SQL Database.”

    But what I got in testing : LogIO it’s the thingy which make restore in time possible (Right now, we are running on 7 days of retore in time (https://docs.microsoft.com/en-us/...) possible with a precision of a … millisecond)
    So if there too much writes/u[dates it “clogs” logio.
  • 0
    @IntrusionCM At least it is what I understood. And there are several other posts saying that logio scales badly with vCpus. Right Now I went nuclear option and put 6 VcPU premium lol.
    But I just need to find this offending API !!! I’m sure I can fix that in 2-3 hours once I found it !
  • 1
    @NoToJavaScript ah.

    In a nutshell (don't know MSSQL) it's the binary log for replication like WAL replica Postgres / MySQL binary log.

    And yes, as it tracks changes to allow just in point recovery every DML / data manipulation language statement must be logged, hence the correlation (SELECT excluded ;) )

    Most likely inserts going crazy as u said, possible - if red is for reading statements only, there might be some sort of lock going as lot of data is written and suddenly red goes spiky boo.
  • 0
    @IntrusionCM
    Reads only fails when this logio is maxed out. It is crazy lol.
    I can have a whole day where everything is just “blazing fast” and a day where the whole system is sooo slow.
    I know it’s a scaling problem which I ignored for the past 2 years and took a lot of technical debt.
    Now it’s the time: the debt is due.
    It’s ok, we can afford (for now) extra 1000$. But I still need to fix the problem because it can onl;y get worse from now. As we are closing more and more clients.
Add Comment