31
irene
363d

Testing demands a “bug” be fixed. It isn’t a bug. It is a limit: when the number of records updated in a single request is large enough, it overloads the RAM on the pod and the request fails. I say, “That isn’t a bug. It fits within the engineering spec, is known and accepted by the PO, and the service sending requests never has a case for that scale. We can make an improvement ticket and let the PO prioritize the work.”

Testing says, “IF IT BREAKS IT BUG. END STORY”

Your hubcaps stay on your car at 100km/h? Have you tried them at 500km/h? Did something else fail before you got to 500km/h? Operating specs are not bugs.

Comments
  • 17
    What, you don’t test the car at supersonic speeds? How would you know it handles crossing the sound barrier then :P
  • 2
    They are right though... If a user can kill an app, the app is easily DoS'able w/o even a distributed attack.
  • 4
    @netikras As stated, it is a known and accepted limitation, which is good development practice: if you do not need Facebook scale, do not waste time and resources building it just in case.
  • 5
    @netikras It isn’t user triggered. It is a route to support another microservice. The only time that the route is triggered manually is when QA is doing their thing. It is gated using an auth system and there is no public access.

    The other server would have to have a single grouping of a lot of db records. The largest real world grouping is 800 with the company’s largest customer. That represents 16 years of records for them. The breaking point number is 15000.

    So yes the other server can crash one pod with a bad request.
  • 0
    Also, we are planning to move to gRPC for system-to-system things in the next year. Complete replacement in the next two years is certain.
  • 2
    @irene it's good practice to treat all applications as public facing (zero trust, no more walled gardens). It's unlikely someone will hack in to DoS you, it's just more likely a sales guy will sell it to a third party and demand it be hosted publicly ASAP.

    Fire a rate limiting check in there and return the appropriate error.
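    A minimal sketch of such a check in plain Python (the bucket parameters and the handle_bulk_update handler are invented for illustration, not the OP's actual route): refuse excess traffic with a deliberate 429-style error before any expensive work starts.

        import time

        class TokenBucket:
            """Allows `rate` requests per second with bursts up to `capacity`."""

            def __init__(self, rate: float, capacity: int):
                self.rate = rate
                self.capacity = capacity
                self.tokens = float(capacity)
                self.last = time.monotonic()

            def allow(self) -> bool:
                now = time.monotonic()
                # Refill tokens for the elapsed time, capped at capacity.
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return True
                return False

        bucket = TokenBucket(rate=5, capacity=10)

        def handle_bulk_update(records):
            # Hypothetical handler; the real payload shape isn't given in the thread.
            if not bucket.allow():
                return 429, {"error": "too many requests, retry later"}
            return 200, {"updated": len(records)}

    The same guard could just as well live in an API gateway or sidecar; the point is that the failure is a deliberate error rather than an OOM kill.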
  • 5
    I would argue on a different basis.

    It is not a bug, which would imply an error.

    It is mostly an architectural problem, leading to excessive resource usage, ending in - worst case - consecutive restarts and thus disruption of service.

    Currently, no consumer has enough records to trigger the necessary threshold of 15_000 records.

    A bug usually means that the software produces incorrect results, terminates unexpectedly, etc.

    An architectural problem could be seen as a special form of bug, though there is one distinction that should and must be made: as long as a certain threshold isn't reached / a certain input isn't made, the software works as expected.

    This distinction is important.

    Because otherwise all software would be a bug. All software has limits.
    You cannot scale limitlessly. Science exists. Physical limits exist. Software can only be designed, to the best of our knowledge, within a specific resource set.

    The N + 1 database problem is another nice example of this. It is an architecture problem; it will explode when a certain threshold is reached, but until that threshold is reached, it will work as intended (see the sketch at the end of this comment).

    Why this long explanation?

    Because QA needs to be careful of their classification.

    A bug demands fixing, no questions asked / discussion necessary.

    An architecture problem, though, should be put in the backlog, then handed over to DevOps / IT to add explicit alerting when a value just "under" the threshold is reached, so one can fix the problem in time or be notified if it occurs unexpectedly.

    If time is found, it will be fixed. It just isn't high priority, given that QA should have found out the necessary threshold - and as long as the threshold isn't reachable in a "short time"… the architecture problem can be analysed and fixed thoroughly and with the necessary diligence.

    Nothing is more shitty than treating architecture problems as high priority, fixing them half-assed and creating a bigger mess.
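    To make the N + 1 example above concrete, here is a small sketch using Python's built-in sqlite3 (the customers/orders tables are invented): both versions return the same data, but the first issues one query per customer and only starts to hurt once the customer count gets large.

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
            CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
            CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
            INSERT INTO customers VALUES (1, 'a'), (2, 'b'), (3, 'c');
            INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 20.0), (3, 2, 5.0);
        """)

        # N + 1: one query for the customers, then one more query per customer.
        # Fine with 3 customers, a problem at 15_000.
        for cid, name in conn.execute("SELECT id, name FROM customers").fetchall():
            orders = conn.execute(
                "SELECT id, total FROM orders WHERE customer_id = ?", (cid,)
            ).fetchall()

        # Same data in two queries total, regardless of how many customers exist.
        orders_by_customer = {}
        for cid, oid, total in conn.execute("SELECT customer_id, id, total FROM orders"):
            orders_by_customer.setdefault(cid, []).append((oid, total))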
  • 1
    @Voxera @irene

    As an SRE, DevOps and performance engineer, I refuse to sign off on such cases as an "accepted limitation" or "good development practice". It's cases like this that keep me up at night with annoying calls: "netikras, I know it's 3:15 am, but please join the call, our service has crashed".

    If there's a known way to crash a service doing its BAU, it's not a good practice nor something to be accepted - it's a bug or a tech debt.

    Issues like this can usually be mitigated by batching and limits (see the sketch at the end of this comment).

    In a few previous projects we had some OLAP components biting off too much and going OOM. We agreed with the PO, PM, architect, ops and whatnot that there was a design issue, we explained how the devs could fix it, and somehow it also became an "accepted limitation". Everyone was ignoring us. At least until it crashed. And every time it did, we got an escalation, and every time that happened, we referred the client to the same analysis/recommendations. Several times a week...
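    A rough sketch of the "batching and limits" idea in Python (the names, the 5_000 cap and the 500 batch size are made up, not taken from the project discussed here): cap what one request may ask for, and process accepted work in bounded chunks so memory use stays flat.

        from typing import Iterable, Iterator, List

        MAX_RECORDS_PER_REQUEST = 5_000   # hard limit; anything above is rejected
        BATCH_SIZE = 500                  # records held in memory at any one time

        def chunked(records: Iterable[dict], size: int) -> Iterator[List[dict]]:
            batch: List[dict] = []
            for record in records:
                batch.append(record)
                if len(batch) == size:
                    yield batch
                    batch = []
            if batch:
                yield batch

        def apply_batch(batch: List[dict]) -> None:
            # Placeholder for the real per-batch write (e.g. one bulk UPDATE).
            pass

        def bulk_update(records: List[dict]) -> dict:
            if len(records) > MAX_RECORDS_PER_REQUEST:
                # Graceful refusal instead of an OOM-killed pod.
                return {"status": 413,
                        "error": f"max {MAX_RECORDS_PER_REQUEST} records per request"}
            for batch in chunked(records, BATCH_SIZE):
                apply_batch(batch)
            return {"status": 200, "updated": len(records)}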
  • 0
    @netikras that sounds like an organizational problem.

    An acceptable limitation is when you can be sure it will not be a problem, and when proven wrong you allocate resources to fix it.

    Almost every system can be DDoSed; is that a bug? And the OP never specified if this was external, but since they mention “the service” I assume they have a known consumer, in which case I would accept it: as long as there is enough margin, we would not spend excessive time preparing for situations that should never ever happen.

    If it's external, sure, then there are other considerations and I might have a different opinion; external users are more of a risk.

    Another thing I also weigh in is the consequence of failure: will it recover by itself, or will it die and require manual intervention?
  • 0
    @Voxera If I read the OP's message right, "the service" is what causes the outage. But I failed to notice whether the pod only serves "the service", or other clients too. Because "the service" can cause a 503 for everybody if they kill that pod.

    Let's agree to disagree. I'm not a perfectionist, but I always favour a stable system. It can be slower, it can have fewer features, but damn it must have an impressive uptime! (read: reachable by clients any time they need it). And implementing safeguards for stability really isn't that much of a hassle. FFS, a stupid apache or nginx proxy can limit payload size, if that's what's causing the OOM!

    Payload too large when expanded during processing? Recurse and distribute that bugger horizontally (+perf gain!) - 1 day of work.
  • 0
    Operational limitations should be technically enforced limits with graceful errors as well. Your car doesn't have to work at 500km/h because it's not possible to accelerate it to 500km/h. If it were possible and it didn't refuse to accelerate further than the operational limit or at least require explicit driver action to exceed it, no amount of operational constraints would save you from the responsibility.
  • 1
    @netikras I interpreted it as: the request fails if the payload is too big, and the reason it was not deemed a bug was that the calling service would never have such big payloads.

    And if that is the extent of the problem, then in my opinion it's an acceptable limitation.

    I would prefer a good error message, but I was unsure if that actually was in place as the limitation, or if the tester asked for the endpoint to always work for any size, like using internal buffering or so.

    Either way, if the only consequence of a too-big payload is that the call fails and the limit is known to the caller, then yes, I think it's a reasonable, acceptable limit.

    But I agree that both opinions can be argued for, so let's save that for if we ever end up working on the same project :P
  • 0
    @Voxera if the failure is isolated to the offending request [i.e. other clients/requests are not affected during the failure, i.e. the app doesn't crash, it only barfs out some error] - then I completely agree with you.

    I was under the impression that we're talking about an OOMK-induced SIGTERM.
  • 1
    @netikras ah, yes in that case I would at least require a set limit that throws an exception.

    Still, if it's internal, I would accept that as not being a blocker for release if there is any time criticality involved.

    If it's external, then I would agree that it's a blocking bug.
  • 1
    @netikras me too.

    Though Kubernetes should induce a restart (if the policy is configured), it's nevertheless downtime.

    Worse when it goes into a "OOMK, restart, OOMK" loop.
  • 1
    Program the endpoint to return an appropriate status code (413 Request Entity Too Large) and, if appropriate, update the UI to react to status code 413 with a useful message.

    That way you guard users from trying to use the service out of spec and stop your test team from crawling up your arse.
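    A minimal sketch of that with Flask (the route and the 5_000 limit are invented for illustration): reject out-of-spec payloads with a 413 instead of letting the pod run out of memory.

        from flask import Flask, jsonify, request

        app = Flask(__name__)
        MAX_RECORDS = 5_000  # the documented operating limit for callers

        @app.route("/records/bulk-update", methods=["POST"])
        def bulk_update():
            records = request.get_json(force=True) or []
            if len(records) > MAX_RECORDS:
                return jsonify(error=f"at most {MAX_RECORDS} records per request"), 413
            # ... perform the actual update here ...
            return jsonify(updated=len(records)), 200

    Flask can also enforce a raw body-size cap via app.config["MAX_CONTENT_LENGTH"], which produces a 413 before the payload is even parsed.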