What is the plan to anticipate and mitigate future problems like those seen with S230518h (DB being overwhelmed by internal LVK or external public page loads)?
First, thanks @alexander.pace for the quick id and fix of this problem. (If there is a postmortem, or relevant tickets, or even LIGO Chat URL of the debugging to link to, please add them to this ticket here for context.)
What is the plan to anticipate and mitigate future problems like those seen with S230518h (GraceDB being overwhelmed by internal LVK or external public page loads)?
This plan should probably include:
- Load specification & testing
  - Define a spec / requirements for how many simultaneous public & private requests (and of what kinds) a production GraceDB instance should be able to serve within a given latency.
  - As part of the CI process, run synthetic load tests with simulated public & private users up to the spec and verify it passes (see the load-test sketch after this list).
  - Outside of CI, run synthetic load tests with simulated public & private users beyond the spec to understand where it fails and why (see the stepped-ramp sketch below).
  - Do a cost/benefit analysis of preemptive fixes for the bottlenecks identified in the beyond-spec load testing, so we can decide whether to fix them now or wait until it's known to be necessary.
- Document an emergency procedure for an overwhelmed DB that someone other than Alex can execute.
  - How to identify when the DB is overwhelmed (vs. other problems: AWS down, network down, auth down, etc.); see the triage sketch below.
  - How to temporarily turn off public access, turn it back on again, and decide how/when to do either.
  - Can/should we give privileged access to certain MM partners in such an emergency, so they can follow up?
  - Anything else that can temporarily speed things up in a pinch.
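To make the load-testing items above concrete, here is a minimal sketch of what the simulated public & private users could look like, using Locust purely as an example tool (not a decision). The endpoint paths, task weights, and auth header are illustrative placeholders, not a statement of actual GraceDB routes or credentials:

```python
# locustfile.py; sketch only, paths/weights/auth are placeholders
from locust import HttpUser, task, between


class PublicUser(HttpUser):
    """Unauthenticated visitor browsing public superevent pages."""
    wait_time = between(1, 5)  # seconds between requests per simulated user

    @task(3)
    def list_superevents(self):
        # Placeholder for a public listing page/endpoint.
        self.client.get("/api/superevents/")

    @task(1)
    def view_one_superevent(self):
        # Placeholder for a single public superevent page.
        self.client.get("/superevents/S230518h/view/")


class PrivateUser(HttpUser):
    """Authenticated LVK user hitting the API."""
    wait_time = between(1, 3)

    def on_start(self):
        # Placeholder auth; a real test would use whatever GraceDB accepts
        # (X.509, SciTokens, basic auth for a test account, ...).
        self.client.headers.update({"Authorization": "Bearer <test-token>"})

    @task
    def query_recent_events(self):
        # Placeholder authenticated query.
        self.client.get("/api/events/?count=20")
```

In CI this could run headless against a test deployment (never production), e.g. `locust -f locustfile.py --headless --host https://gracedb-playground.ligo.org --users 200 --spawn-rate 10 --run-time 10m --csv loadtest`, with the host, user count, duration, and pass/fail thresholds taken from whatever spec we agree on; the CSV output (or Locust's exit code) gives CI something to assert against.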
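For the beyond-spec testing, a stepped ramp makes it easier to read off the point where latency or error rates start to degrade. A sketch using Locust's LoadTestShape, with step sizes that are purely illustrative (this class would live in the same locustfile as the user classes above and takes the place of the --users/--spawn-rate options):

```python
# Stepped ramp for beyond-spec testing; step sizes are illustrative only.
from locust import LoadTestShape


class StepRampShape(LoadTestShape):
    """Add users in fixed steps until max_users, so the breaking point
    shows up as a clear knee in the latency / error-rate curves."""

    step_users = 50       # users added per step (placeholder)
    step_duration = 300   # seconds spent at each step (placeholder)
    max_users = 1000      # stop the test after this many users (placeholder)

    def tick(self):
        run_time = self.get_run_time()
        current_step = int(run_time // self.step_duration) + 1
        users = current_step * self.step_users
        if users > self.max_users:
            return None  # returning None stops the test
        # (target user count, spawn rate in users/second)
        return (users, self.step_users)
```

The knee in the resulting latency/error curves, together with server-side metrics (DB connections, slow-query logs, worker saturation), is what would feed the cost/benefit discussion of preemptive fixes.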
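On the emergency-procedure side, the "is it actually the DB?" question could start from a triage probe that anyone on call can run. Everything below (endpoints, thresholds, interpretation rules) is a placeholder to be replaced with whatever the ops team actually trusts; the idea is just to compare a cheap request against a DB-heavy one:

```python
#!/usr/bin/env python3
"""Triage sketch: is GraceDB reachable but slow (suggesting DB overload),
or failing outright (suggesting network / hosting / auth problems)?
Endpoints and thresholds are placeholders, not an agreed procedure."""
import time

import requests

BASE = "https://gracedb.ligo.org"
ENDPOINTS = {
    # Cheap to serve; little or no database work (placeholder choice).
    "api_root": "/api/",
    # Requires real database queries (placeholder choice).
    "superevent_list": "/api/superevents/?count=5",
}
SLOW_THRESHOLD = 5.0  # seconds; placeholder value


def probe(path):
    """Return (HTTP status, latency in s) or (None, exception) for one GET."""
    start = time.monotonic()
    try:
        resp = requests.get(BASE + path, timeout=30)
        return resp.status_code, time.monotonic() - start
    except requests.RequestException as exc:
        return None, exc


def main():
    results = {name: probe(path) for name, path in ENDPOINTS.items()}
    for name, (status, detail) in results.items():
        print(f"{name}: status={status}, latency_or_error={detail}")

    root_status, _ = results["api_root"]
    list_status, list_latency = results["superevent_list"]
    if root_status is None or list_status is None:
        print("Requests fail outright -> suspect network/hosting/auth, not DB load.")
    elif list_latency > SLOW_THRESHOLD:
        print("Reachable but DB-backed pages are slow -> suspect DB overload.")
    else:
        print("Latencies look nominal.")


if __name__ == "__main__":
    main()
```

This only covers the "identify" step; the "temporarily turn off public access" step is deployment-specific (a web-server or Django-level switch) and should be written down by whoever knows the actual GraceDB deployment.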
I would expect this to be part of Phase II of the O4 LLAI review. (If there is a milestone or tag you'd like to use for such LLAI review tasks, please add it to this ticket so we can find them all easily in the future.)