Draft: ongoing study of gunicorn worker types
In theory, async workers are more performant for I/O-bound activities. For instance, an async worker should be able to accept new requests while its existing requests are waiting on database queries, lvalert sends, or file uploads.
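For reference, switching worker types is a small change in a gunicorn config file. The sketch below uses real gunicorn setting names, but every value is a placeholder I picked for illustration, not what GraceDB actually runs:

```python
# gunicorn.conf.py -- minimal sketch; all values are assumptions, not GraceDB's config.

# "sync" is gunicorn's default worker class. "gevent" and "eventlet" are the
# async alternatives discussed here (each needs its own package installed).
worker_class = "gevent"

# Number of worker processes (placeholder value).
workers = 4

# Max simultaneous clients per worker; only applies to async worker classes.
worker_connections = 1000

# Seconds of silence before a worker is killed and restarted (placeholder value).
timeout = 30
```

With the sync worker class each worker handles one request at a time, so worker_connections has no effect there.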
In a previous study I did in 2019 (no source or documentation given, come at me), the async worker types (gevent, eventlet) weren't compatible with the multiprocessing module, which is used to send lvalerts. Why is that module used to send lvalerts? GraceDB spawns a twisted reactor that takes a message, connects over TCP to port 8000, and pushes the message to the lvalert-overseer. As I discovered (and presumably branson knew back in 2016), you can't just restart a twisted reactor once it has stopped (twisted raises ReactorNotRestartable), but you also don't want to maintain an active connection because it would block other threads/workers from connecting. That's the entire point of each worker using a TCP connection reactor: it's essentially a queue that allows the workers to connect, send to the overseer, then disconnect. Currently, an entirely new process is forked with multiprocessing, then the process is killed when it gets the rdict=success signal from the overseer.
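The connect, send, disconnect pattern described above can be sketched with the standard library alone. This is not the GraceDB code: it swaps the twisted reactor for a plain blocking socket, and the function names, the ACK token format, and the timeout are all assumptions made for illustration.

```python
import multiprocessing
import socket

# Placeholder acknowledgement; the real overseer reply format is an assumption.
ACK = b"rdict=success"


def send_and_disconnect(host, port, payload, timeout=5.0):
    """Connect, send one message, wait for the ack, then close the socket."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)
        conn.shutdown(socket.SHUT_WR)  # tell the peer the message is complete
        reply = conn.recv(1024)
    if reply != ACK:
        # An uncaught exception makes the child exit with a non-zero code.
        raise RuntimeError(f"unexpected reply from overseer: {reply!r}")


def spawn_sender(host, port, payload):
    """Run the send in a short-lived forked child and return the Process.

    The "fork" start method is Unix-only, which matches gunicorn itself.
    The caller is responsible for join()ing the returned process.
    """
    ctx = multiprocessing.get_context("fork")
    proc = ctx.Process(target=send_and_disconnect, args=(host, port, payload))
    proc.start()
    return proc
```

A caller would do something like proc = spawn_sender("localhost", 8000, message), then proc.join() and check proc.exitcode. Because each send lives in its own process, nothing has to be restarted between sends, which is the same property the per-alert reactor relies on.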
For ha-ha's I tried gevent again, and miraculously it worked, but unreliably. It served requests just fine, but when I tried sending lvalerts it failed with [Errno 11] Resource temporarily unavailable, and didn't recover until I rebooted the VM. That's a problem.
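Errno 11 on Linux is EAGAIN. I don't know which call raised it here, so the sketch below only shows how the failure could be recognized and labeled in logs. The helper names are made up for illustration.

```python
import errno


def is_eagain(exc):
    """True if exc is an OSError carrying EAGAIN (errno 11 on Linux)."""
    return isinstance(exc, OSError) and exc.errno == errno.EAGAIN


def describe_oserror(exc):
    """Build a log-friendly label: symbolic name, number, and message."""
    name = errno.errorcode.get(exc.errno, "UNKNOWN")
    return f"{name} ({exc.errno}): {exc.strerror}"
```

Wrapping the alert-sending call in try/except OSError and logging describe_oserror(exc) would at least record whether every failure is the same EAGAIN.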
Other notes about this:
- The latest version of gunicorn (20.1.0) straight-up fails to start, without logging, when deployed in the cloud (I saw it on gracedb-dev). I don't know why or how to get around it.
- I made a modification to the lvalert/igwn-alert overseer that allows either a single string input for node_name (which is the default), or a list of topic strings to send messages to. This was an attempt to reduce the number of forks that are spawned at alert time. It works fine either way, but gracedb is pinned to a previous version just for safety's sake. My thinking was that if the worker could just spawn a single reactor and then die, it would save time and resources. I haven't tested this yet.
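The string-or-list handling for node_name could look roughly like this. It is a sketch of the idea only: normalize_topics is a hypothetical helper, not the function in the overseer.

```python
def normalize_topics(node_name):
    """Return a list of topic strings from either a single string or an iterable.

    A lone string is wrapped in a list so callers can always loop over the
    result and send every topic from one process instead of forking per topic.
    """
    if isinstance(node_name, str):
        return [node_name]
    return list(node_name)
```

The isinstance check has to come first, since a string is itself iterable and list("abc") would split it into characters.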