
Celery bug where worker stops consuming tasks sometimes when redis reconnects

I think we're being affected by this celery bug: https://github.com/celery/celery/discussions/7276

Evidence: We stopped producing superevents on 2022-11-11 and didn't resume until the pipeline was re-deployed. The last superevent on the 11th was created at 08:56:47 (S221111da). However, g-events created after that time were still being annotated as normal (see e.g. G766495). Also, the gwcelery-worker logs showed superevent handler task calls still being added to the queue:

[2022-11-11 01:23:09,855: INFO/MainProcess/MainThread] Task gwcelery.tasks.superevents.handle[bb773101-f730-4ead-8607-579ac6d8ca11] received

Redis reconnected to the superevent worker at 08:36:31, which is 20 minutes before the superevents stopped appearing, so it's possible this is a different bug. However, I think it's likely the bug I linked to, since the symptoms are identical apart from this 20-minute gap.
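
For reference, the gap between the two timestamps quoted in this issue can be checked with a few lines of Python. This is only a sketch: it assumes both times are on 2022-11-11 and in the same timezone, which the issue does not state explicitly, and the variable names are made up for illustration.

```python
from datetime import datetime

# Timestamps quoted in this issue. Assumption: same date and same timezone.
FMT = "%Y-%m-%d %H:%M:%S"
redis_reconnect = datetime.strptime("2022-11-11 08:36:31", FMT)
last_superevent = datetime.strptime("2022-11-11 08:56:47", FMT)

# Time between the Redis reconnect and the last superevent being created.
gap = last_superevent - redis_reconnect
print(gap)  # 0:20:16
```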

In the short term, I think we need to prioritize #443 (closed)

Edited by Cody Messick