Traefik returns 502 when Gunicorn restarts
Here's the scenario: GraceDB instantly (with no 30-second timeout) returns a 502 proxy error to the client, then the client code retries and everything works.
Further investigation shows that there are no errors in the gracedb (django/gunicorn) logs, but there is one in the webgateway/traefik logs, e.g.:
# grep -n '" 502 ' *.log
gracedb-swarm-production-us-west-2c-docker-mgr-01.log:49112:Aug 1 13:45:34 gracedb-swarm-production-us-west-2c-docker-mgr-01 gracedb_docker_webgateway_webgateway.3.91ly1u1voskgcnyo78u2t0xq1: 131.215.113.150 - - [01/Aug/2023:13:45:34 +0000] "POST /api/events/ HTTP/1.1" 502 11 "-" "-" 613542 "gracedb@docker" "http://10.0.1.56:80" 1ms
At the same time, there's a block in gracedb's logs like:
Aug 1 13:45:34 gracedb-swarm-production-us-west-2c-docker-mgr-01 gracedb_docker_gracedb_gracedb.1.we7ztc1fre4kje94k14lz3cy4: GUNICORN | [2023-08-01 13:45:34 +0000] [3589] [INFO] Autorestarting worker after current request.
Aug 1 13:45:34 gracedb-swarm-production-us-west-2c-docker-mgr-01 gracedb_docker_gracedb_gracedb.1.we7ztc1fre4kje94k14lz3cy4: GUNICORN | [2023-08-01 13:45:34 +0000] [3589] [INFO] Worker exiting (pid: 3589)
Aug 1 13:45:35 gracedb-swarm-production-us-west-2c-docker-mgr-01 gracedb_docker_gracedb_gracedb.1.we7ztc1fre4kje94k14lz3cy4: GUNICORN | [2023-08-01 13:45:35 +0000] [11186] [INFO] Booting worker with pid: 11186
Aug 1 13:45:35 gracedb-swarm-production-us-west-2c-docker-mgr-01 gracedb_docker_gracedb_gracedb.1.we7ztc1fre4kje94k14lz3cy4: GUNICORN | [2023-08-01 13:45:35 +0000] [11186] [INFO] Worker spawned (pid: 11186)
Automatic restarting is controlled here and here, and it is used to avoid possible memory leaks. As far as I can tell, this restart/502 hasn't actually affected low-latency operations, as gwcelery has retried and succeeded each time. pycbclive did ping about a 502 twice (2023-07-24 15:21:20 UTC on playground and 2023-07-29 10:48:19 on prod), but as far as I can tell the request was subsequently retried by the client code and succeeded each time.
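For reference, this restart behavior corresponds to Gunicorn's max_requests / max_requests_jitter settings. A minimal sketch of what such a config looks like (the numbers below are illustrative, not our actual values):

# gunicorn.conf.py -- illustrative values only, not GraceDB's actual settings
# Restart each worker after it has handled roughly this many requests,
# to guard against slow memory leaks.
max_requests = 1000
# Add a per-worker random offset in [0, jitter] so workers don't all restart at once.
max_requests_jitter = 50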
Possible solutions: do nothing, since clients are retrying and succeeding. We could also try increasing the maximum number of requests and the jitter, to see if we can space the restarts out and make the 502s less frequent.
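Since the client-side mitigation is essentially "retry on 502", here is a minimal sketch of what that looks like with requests/urllib3 (illustrative only; gwcelery and pycbclive have their own retry machinery, the URL below is a placeholder, and allowed_methods requires urllib3 >= 1.26):

# Sketch of a client session that retries on the 502s traefik emits during worker restarts.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(
    total=3,                          # retry up to 3 times
    backoff_factor=0.5,               # exponential backoff between attempts
    status_forcelist=[502],           # retry on the proxy error seen from traefik
    allowed_methods=["GET", "POST"],  # POST is not retried by default
)
session.mount("https://", HTTPAdapter(max_retries=retries))

# e.g. session.post("https://gracedb.example.org/api/events/", files=..., data=...)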