detchar.check_vectors killing workers by exiting with SIGABRT
I have seen a number of detchar.check_vectors tasks fail in the past few hours with this traceback:
Traceback (most recent call last):
  File "/home/emfollow/.local/lib/python3.6/site-packages/billiard/pool.py", line 1223, in mark_as_worker_lost
    human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 6 (SIGABRT).
This is not terribly informative; all it tells us is that the worker killed itself by raising SIGABRT. But it's a clue that some C code is calling abort(). The worker log has a lot more detail. I'm pasting the first few (and probably the most relevant) lines:
XLAL Error - XLALResizeUINT4Sequence (Sequence_source.c:147): Internal function call failed
XLAL Error - XLALResizeUINT4TimeSeries (TimeSeries_source.c:99): Internal function call failed
*** Error in `/opt/rh/rh-python36/root/usr/bin/python3': double free or corruption (fasttop): 0x0000000004609760 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81429)[0x2b9892fa4429]
/home/emfollow/.local/lib/python3.6/site-packages/lal/../lalsuite.libs/liblal-97033ae5.so.14.0.0(XLALDestroyUINT4Sequence+0x12)[0x2b98f1f5e9b2]
/home/emfollow/.local/lib/python3.6/site-packages/lal/../lalsuite.libs/liblal-97033ae5.so.14.0.0(XLALDestroyUINT4TimeSeries+0x12)[0x2b98f1f62952]
/home/emfollow/.local/lib/python3.6/site-packages/lalframe/../lalsuite.libs/liblalframe-96d3c2d5.so.10.0.2(XLALFrStreamReadUINT4TimeSeries+0x1ed)[0x2b99013649ad]
/home/emfollow/.local/lib/python3.6/site-packages/lalframe/_lalframe.cpython-36m-x86_64-linux-gnu.so(+0x41a01)[0x2b99010e6a01]
...
I am also attaching the full backtrace (backtrace.txt).
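To check the "signal 6" reading independently of Celery, here is a minimal sketch (not taken from our code base) that runs a child Python process, has it call os.abort() as a stand-in for C code calling abort(), and inspects how it died. The helper name run_aborting_child is made up for illustration. It also enables the standard-library faulthandler module in the child, which prints the Python-level stack when SIGABRT arrives; turning that on in the workers might tell us which Python call was active when the abort happened, though I have not tried it there.

```python
import signal
import subprocess
import sys

# Child program: enable faulthandler, then abort the way C code calling
# abort() would. os.abort() is only a stand-in for the real failure.
CHILD = "import faulthandler, os; faulthandler.enable(); os.abort()"


def run_aborting_child():
    """Run the child and return (number of the signal that killed it, its stderr)."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # spelled this way to stay Python 3.6 compatible
    )
    # On POSIX, a negative return code -N means the child was killed by signal N.
    return -proc.returncode, proc.stderr


if __name__ == "__main__":
    signum, stderr = run_aborting_child()
    print("child killed by", signal.Signals(signum).name)
    print(stderr)
```

The stderr text from faulthandler begins with "Fatal Python error: Aborted", followed by the Python stack of the thread that was running at the time.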
Luckily, Celery is smart enough to restart the worker right away. And I think that any tasks that were executing at that time will be retried automatically. But we should get to the bottom of this. @duncanmmacleod, have you seen these errors before?