Main worker gets stuck in infinite loop when PE fails to find data
GWCelery is getting stuck in an infinite loop when it tries to perform PE for events whose online data can't be found. The loop consists of trying and failing to query data for PE jobs, and generating the SNRSummary.png and nEvtSummary.png files. For example, G783810 has 21572 copies of SNRSummary.png and 21569 copies of nEvtSummary.png.
This means that we can't recover when GWCelery goes down long enough that the low-latency data is no longer available, which makes fixing this an extremely high priority.
The behavior that I saw in our monitoring tools is that gwcelery.tasks.inference.query_data keeps failing with errors like this:
Traceback (most recent call last):
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/celery/app/autoretry.py", line 34, in run
    return task._orig_run(*args, **kwargs)
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/gwcelery/tasks/inference.py", line 80, in query_data
    raise NotEnoughData
gwcelery.tasks.inference.NotEnoughData

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/celery/app/trace.py", line 451, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/sentry_sdk/integrations/celery.py", line 207, in _inner
    reraise(*exc_info)
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/sentry_sdk/_compat.py", line 57, in reraise
    raise value
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/sentry_sdk/integrations/celery.py", line 202, in _inner
    return f(*args, **kwargs)
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/celery/app/trace.py", line 734, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/celery/app/autoretry.py", line 54, in run
    ret = task.retry(exc=exc, **retry_kwargs)
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/celery/app/task.py", line 738, in retry
    raise ret
celery.exceptions.Retry: Retry in 447s: NotEnoughData()
And we keep uploading the files I pointed out above.
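
To make the failure mode concrete, here is a minimal sketch in plain Python of the retry pattern visible in the traceback. It is not the GWCelery code and does not use Celery: `NotEnoughData` is a local stand-in for `gwcelery.tasks.inference.NotEnoughData`, and `query_data_stub`, `run_with_retry_cap`, and `count_attempts` are hypothetical names made up for illustration. It assumes the data never becomes available, and only shows that a cap on retries turns the endless reschedule into a bounded number of attempts followed by a hard failure. Whether the real task has such a cap, and where the summary plots get generated relative to the retry, is not something this sketch claims to know.

```python
class NotEnoughData(Exception):
    """Local stand-in for gwcelery.tasks.inference.NotEnoughData."""


def query_data_stub():
    # Hypothetical stand-in for the data query: the low-latency data has
    # aged out, so every attempt fails in the same way.
    raise NotEnoughData


def run_with_retry_cap(task, max_retries):
    """Call task(), retrying on NotEnoughData at most max_retries times.

    Returns (result, attempts) on success. Once the cap is exceeded the
    exception is re-raised, so the caller can give up instead of
    rescheduling forever.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except NotEnoughData:
            if attempts > max_retries:
                raise


def count_attempts(max_retries):
    """Count how often an always-failing task runs under the given cap."""
    calls = []

    def task():
        calls.append(1)
        return query_data_stub()

    try:
        run_with_retry_cap(task, max_retries)
    except NotEnoughData:
        pass  # the cap was hit; a real worker would report failure here
    return len(calls)
```

With a cap of `max_retries`, an always-failing task runs once plus `max_retries` retries and then stops, which bounds any per-attempt side effects (such as repeated uploads) as well.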
Edited by Cody Messick