Study: increase overall request throughput.
Problem
The last 16 months of improving the REST API performance via the MDC on `gracedb-playground` have been fruitful, but `gracedb-playground`'s use case differs from the production server in a few critical ways. `gracedb-playground` is constantly getting hammered by API requests at a rate that isn't seen in production, while it sees far fewer visitors on the web than the production server does. And production sees a mix of public and LVK users, whereas public users on `gracedb-playground` are nearly non-existent.
That last difference became rapidly apparent when the database was overloaded, but it's since been addressed and fixed. The issue of people visiting the site and complaining about slow load times remains, though.
I'm going to approach this by throwing more gunicorn workers at the problem in an attempt to increase overall throughput. Why not just throw a bigger computer at it? Below is the last 24 hours of CPU usage on production GraceDB, including a spike when S230601bf dropped:
Even in periods of peak activity we're not getting anywhere near saturating the CPU, which makes me think any perceived slowness is due to a limited number of workers that are IO-bound and not serving new requests.
Method
I'm going to focus on the gunicorn sync worker for this. Despite the claim that "each connection is closed after response has been sent", I've previously shown (to myself, at least) that while the connection to the worker is closed, the outside connection is maintained from user → apache → gunicorn server. The byproduct of this is that there isn't a TLS renegotiation after each request, which is what really murders latency. I've also noticed that the gthread worker can provide somewhat more overall throughput at lower memory use, but at the expense of memory leaks and "stuck" threads (issues).
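For reference, here's roughly what the two worker setups look like on the command line. This is just a sketch to fix terminology: the WSGI module path, bind address, and worker/thread counts are placeholders, not the actual service configuration.

```bash
# Sketch only: the WSGI path, socket, and counts below are placeholders.

# sync workers: one request at a time per worker process
gunicorn config.wsgi:application \
    --bind unix:/run/gunicorn.sock \
    --worker-class sync \
    --workers 5 \
    --timeout 30

# gthread workers: fewer processes, several threads per process
gunicorn config.wsgi:application \
    --bind unix:/run/gunicorn.sock \
    --worker-class gthread \
    --workers 3 \
    --threads 4 \
    --timeout 30
```

With gthread, the number of requests handled simultaneously is workers × threads, which is why it tends to buy more throughput for less memory than the equivalent number of sync workers.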
The systematic testing will take place on gracedb-dev1 (a dual-core, 4GB RAM VM), then be translated to playground (which has the same specs as production), and then moved to production. Note that playground/production are running on c4.xlarge nodes (4 cores, 8GB RAM), so I hope I can just... double the results from dev1 and it will scale. We'll see.
Testing will take place with the `siege` command. Sample output is below:
```
$ siege https://gracedb-dev1.ligo.org/superevents/S230525c/view/ -c 1 -t 30S
...
...
Lifting the server siege...
Transactions:              222 hits
Availability:              100.00 %
Elapsed time:              29.72 secs
Data transferred:          6.72 MB
Response time:             0.13 secs
Transaction rate:          7.47 trans/sec
Throughput:                0.23 MB/sec
Concurrency:               0.99
Successful transactions:   222
Failed transactions:       0
Longest transaction:       3.87
Shortest transaction:      0.06
```
I think I should focus on the Transaction rate (trans/sec) and on the Concurrency (specified as 1 in this example, realized as 0.99), or rather on a 'Concurrency Efficiency' (realized concurrency / specified concurrency). What I think will happen: if you run a siege over and over, increasing the specified concurrency each time, the transaction rate will eventually plateau, and past that point the efficiency will keep decreasing as requests queue up behind the limited number of workers.
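Something like the loop below is what I have in mind for that sweep. It's just a sketch: the URL, concurrency values, and duration are placeholders, and it only greps the two interesting numbers out of the summary format shown above.

```bash
# Rough concurrency sweep; URL and durations are placeholders.
URL="https://gracedb-dev1.ligo.org/superevents/S230525c/view/"

for c in 1 2 4 8 16 32 64; do
    echo "=== specified concurrency: $c ==="
    # merge stdout and stderr so the summary is captured either way,
    # then pull out the two lines we care about
    siege "$URL" -c "$c" -t 30S 2>&1 \
        | grep -E 'Transaction rate|Concurrency'
done
```

Concurrency efficiency for each run is then just the realized Concurrency value divided by the specified `$c`.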
I'm also going to look at three scenarios to see if I can find an optimal setup for each (or all); rough example commands are sketched after the list:
- Siege of easy requests (loading one public superevent page)
- Siege of DB-heavy requests (loading all public superevents, from the API)
- Siege of easy requests WHILE under load (loading a superevent page, while I'm bombarding the server with GETs to both use connections and use the CPU).
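Roughly, the three scenarios would look something like this; the URLs are examples, and the API endpoint path in particular is an assumption for the sketch rather than something pinned down here.

```bash
# 1. Easy requests: a single public superevent page
siege "https://gracedb-dev1.ligo.org/superevents/S230525c/view/" -c 8 -t 30S

# 2. DB-heavy requests: the public superevent list via the REST API
#    (endpoint path is an assumption for this sketch)
siege "https://gracedb-dev1.ligo.org/api/superevents/" -c 8 -t 30S

# 3. Easy requests while under load: keep a background siege of GETs
#    running, then measure the same page siege as scenario 1
siege "https://gracedb-dev1.ligo.org/api/superevents/" -c 16 -t 10M &
BG_PID=$!
siege "https://gracedb-dev1.ligo.org/superevents/S230525c/view/" -c 8 -t 30S
kill "$BG_PID"
```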
Other notes:
- Gunicorn recommends 2*N_{cpu} + 1 workers, which is what we've been using. I think baselining the number of workers against the number of CPUs is smart, so try [0, 1, 2, 4, 8, ...]*N_{cpu} + 1 workers and see how it scales (on dev1, where N_{cpu} = 2, that's 1, 3, 5, 9, 17, ... workers); see the sketch after this list.
- It's also important to establish a baseline with the current setup, so we can tell whether there's actually a performance increase.
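A rough sketch of how that worker sweep could be driven on dev1 is below. `set_gunicorn_workers` is a hypothetical stand-in for however the worker count actually gets changed in the deployment (editing the service's gunicorn config and restarting it); the URL and siege settings are placeholders too.

```bash
# Worker-count sweep on dev1 (N_cpu = 2). set_gunicorn_workers is a
# hypothetical helper standing in for "set the gunicorn worker count
# in the service config and restart the service".
NCPU=2
URL="https://gracedb-dev1.ligo.org/superevents/S230525c/view/"

for mult in 0 1 2 4 8; do
    workers=$(( mult * NCPU + 1 ))
    echo "=== $workers sync workers ==="
    set_gunicorn_workers "$workers"
    sleep 10   # give the new workers time to boot before measuring
    siege "$URL" -c 8 -t 30S 2>&1 | grep -E 'Transaction rate|Concurrency'
done
```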
I'll update this ticket with results as I get them.