Main worker going down after several days: errors about clock drift, removed Redis keys, then worker stops
@alexander.pace noticed that GWCelery is down on emfollow-playground again. I have captured the relevant part of the log here.
I logged in and found that although the gwcelery-worker process was shown by Condor and top to still be running, the worker itself had tried to cleanly shut itself down. Leading up to that there were some warnings about clock drift and some errors about keys having been removed from Redis.
[2021-12-04 11:41:51,603: WARNING/MainProcess/MainThread] Substantial drift from gwcelery-openmp-worker-2@node2146 may mean clocks are out of sync. Current drift is
2188 seconds. [orig: 2021-12-04 11:41:51.603016 recv: 2021-12-04 11:05:23.378921]
[2021-12-04 11:41:51,604: WARNING/MainProcess/MainThread] Substantial drift from gwcelery-openmp-worker-10@node2154 may mean clocks are out of sync. Current drift is
2188 seconds. [orig: 2021-12-04 11:41:51.604485 recv: 2021-12-04 11:05:23.937268]
[2021-12-04 11:41:51,605: WARNING/MainProcess/MainThread] Substantial drift from gwcelery-openmp-worker-13@node2135 may mean clocks are out of sync. Current drift is
2188 seconds. [orig: 2021-12-04 11:41:51.605108 recv: 2021-12-04 11:05:23.938005]
[2021-12-04 11:41:51,605: WARNING/MainProcess/MainThread] Substantial drift from gwcelery-openmp-worker-9@node2180 may mean clocks are out of sync. Current drift is
2188 seconds. [orig: 2021-12-04 11:41:51.605699 recv: 2021-12-04 11:05:23.938303]
[2021-12-04 11:41:51,606: WARNING/MainProcess/MainThread] Substantial drift from gwcelery-openmp-worker-7@node2162 may mean clocks are out of sync. Current drift is
2188 seconds. [orig: 2021-12-04 11:41:51.606303 recv: 2021-12-04 11:05:23.938248]
[2021-12-04 11:41:51,606: WARNING/MainProcess/MainThread] Substantial drift from gwcelery-openmp-worker-3@node2189 may mean clocks are out of sync. Current drift is
2188 seconds. [orig: 2021-12-04 11:41:51.606909 recv: 2021-12-04 11:05:23.938863]
[2021-12-04 11:41:51,607: WARNING/MainProcess/MainThread] Substantial drift from gwcelery-openmp-worker-15@node2176 may mean clocks are out of sync. Current drift is
2188 seconds. [orig: 2021-12-04 11:41:51.607492 recv: 2021-12-04 11:05:23.939806]
[2021-12-04 11:41:51,608: WARNING/MainProcess/MainThread] Substantial drift from gwcelery-openmp-worker-1@node2159 may mean clocks are out of sync. Current drift is
2188 seconds. [orig: 2021-12-04 11:41:51.608175 recv: 2021-12-04 11:05:23.940026]
[2021-12-04 11:41:51,608: WARNING/MainProcess/MainThread] Substantial drift from gwcelery-superevent-worker@emfollow-playground may mean clocks are out of sync. Current drift is
2188 seconds. [orig: 2021-12-04 11:41:51.608778 recv: 2021-12-04 11:05:23.940311]
[2021-12-04 11:41:51,609: WARNING/MainProcess/MainThread] Substantial drift from gwcelery-exttrig-worker@emfollow-playground may mean clocks are out of sync. Current drift is
2188 seconds. [orig: 2021-12-04 11:41:51.609437 recv: 2021-12-04 11:05:23.940421]
[2021-12-04 11:41:51,610: WARNING/MainProcess/MainThread] Substantial drift from gwcelery-openmp-worker-11@node2151 may mean clocks are out of sync. Current drift is
2188 seconds. [orig: 2021-12-04 11:41:51.610045 recv: 2021-12-04 11:05:23.940394]
[2021-12-04 11:41:51,610: WARNING/MainProcess/MainThread] Substantial drift from gwcelery-voevent-worker@emfollow-playground may mean clocks are out of sync. Current drift is
2188 seconds. [orig: 2021-12-04 11:41:51.610713 recv: 2021-12-04 11:05:23.941106]
[2021-12-04 11:41:51,611: WARNING/MainProcess/MainThread] Substantial drift from gwcelery-openmp-worker-0@node2150 may mean clocks are out of sync. Current drift is
2187 seconds. [orig: 2021-12-04 11:41:51.611269 recv: 2021-12-04 11:05:24.037255]
[2021-12-04 11:41:51,611: WARNING/MainProcess/MainThread] Substantial drift from gwcelery-openmp-worker-5@node2152 may mean clocks are out of sync. Current drift is
2187 seconds. [orig: 2021-12-04 11:41:51.611937 recv: 2021-12-04 11:05:24.038415]
[2021-12-04 11:41:51,612: WARNING/MainProcess/MainThread] Substantial drift from gwcelery-openmp-worker-14@node2170 may mean clocks are out of sync. Current drift is
2187 seconds. [orig: 2021-12-04 11:41:51.612592 recv: 2021-12-04 11:05:24.167836]
[2021-12-04 11:41:51,613: WARNING/MainProcess/MainThread] Substantial drift from gwcelery-openmp-worker-12@node2142 may mean clocks are out of sync. Current drift is
2187 seconds. [orig: 2021-12-04 11:41:51.613232 recv: 2021-12-04 11:05:24.716904]
[2021-12-04 11:41:51,613: WARNING/MainProcess/MainThread] Substantial drift from gwcelery-openmp-worker-4@node2163 may mean clocks are out of sync. Current drift is
2187 seconds. [orig: 2021-12-04 11:41:51.613910 recv: 2021-12-04 11:05:24.717904]
[2021-12-04 11:41:51,681: ERROR/MainProcess/MainThread] Control command error: OperationalError("\nCannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists.\nProbably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.\n")
Traceback (most recent call last):
File "/home/emfollow-playground/.local/lib/python3.8/site-packages/kombu/connection.py", line 448, in _reraise_as_library_errors
yield
File "/home/emfollow-playground/.local/lib/python3.8/site-packages/kombu/connection.py", line 525, in _ensured
return fun(*args, **kwargs)
File "/home/emfollow-playground/.local/lib/python3.8/site-packages/kombu/messaging.py", line 199, in _publish
return channel.basic_publish(
File "/home/emfollow-playground/.local/lib/python3.8/site-packages/kombu/transport/virtual/base.py", line 601, in basic_publish
return self.typeof(exchange).deliver(
File "/home/emfollow-playground/.local/lib/python3.8/site-packages/kombu/transport/virtual/exchange.py", line 69, in deliver
for queue in _lookup(exchange, routing_key):
File "/home/emfollow-playground/.local/lib/python3.8/site-packages/kombu/transport/virtual/base.py", line 711, in _lookup
self.get_table(exchange),
File "/home/emfollow-playground/.local/lib/python3.8/site-packages/kombu/transport/redis.py", line 884, in get_table
raise InconsistencyError(NO_ROUTE_ERROR.format(exchange, key))
kombu.exceptions.InconsistencyError:
Cannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists.
Probably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/emfollow-playground/.local/lib/python3.8/site-packages/celery/worker/pidbox.py", line 44, in on_message
self.node.handle_message(body, message)
File "/home/emfollow-playground/.local/lib/python3.8/site-packages/kombu/pidbox.py", line 143, in handle_message
return self.dispatch(**body)
File "/home/emfollow-playground/.local/lib/python3.8/site-packages/kombu/pidbox.py", line 110, in dispatch
self.reply({self.hostname: reply},
File "/home/emfollow-playground/.local/lib/python3.8/site-packages/kombu/pidbox.py", line 147, in reply
self.mailbox._publish_reply(data, exchange, routing_key, ticket,
File "/home/emfollow-playground/.local/lib/python3.8/site-packages/kombu/pidbox.py", line 278, in _publish_reply
producer.publish(
File "/home/emfollow-playground/.local/lib/python3.8/site-packages/kombu/messaging.py", line 177, in publish
return _publish(
File "/home/emfollow-playground/.local/lib/python3.8/site-packages/kombu/connection.py", line 558, in _ensured
errback and errback(exc, 0)
File "/cvmfs/oasis.opensciencegrid.org/ligo/sw/conda/envs/igwn-py38-20211019/lib/python3.8/contextlib.py", line 131, in __exit__
self.gen.throw(type, value, traceback)
File "/home/emfollow-playground/.local/lib/python3.8/site-packages/kombu/connection.py", line 452, in _reraise_as_library_errors
raise ConnectionError(str(exc)) from exc
kombu.exceptions.OperationalError:
Cannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists.
Probably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.
[2021-12-04 11:41:51,726: INFO/MainProcess/MainThread] Task gwcelery.tasks.orchestrator.handle_superevent[ee82fd97-e0ae-4e39-844c-37c438e5e97d] received
[2021-12-04 11:41:51,727: INFO/MainProcess/MainThread] Scaling up 1 processes.
[2021-12-04 11:41:52,099: INFO/ForkPoolWorker-5510/MainThread] Task gwcelery.tasks.external_triggers.handle_grb_lvalert[149ef5eb-801d-4451-b95a-4c82b71cb3ad] succeeded in 0.0008274707943201065s: None
[2021-12-04 11:41:52,094: CRITICAL/MainProcess/MainThread] Unrecoverable error: RuntimeError('pubsub connection not set: did you forget to call subscribe() or psubscribe()?')
Traceback (most recent call last):
File "/home/emfollow-playground/.local/lib/python3.8/site-packages/celery/worker/worker.py", line 203, in start
self.blueprint.start(self)
File "/home/emfollow-playground/.local/lib/python3.8/site-packages/celery/bootsteps.py", line 116, in start
step.start(parent)
File "/home/emfollow-playground/.local/lib/python3.8/site-packages/celery/bootsteps.py", line 365, in start
return self.obj.start()
File "/home/emfollow-playground/.local/lib/python3.8/site-packages/celery/worker/consumer/consumer.py", line 326, in start
blueprint.start(self)
File "/home/emfollow-playground/.local/lib/python3.8/site-packages/celery/bootsteps.py", line 116, in start
step.start(parent)
File "/home/emfollow-playground/.local/lib/python3.8/site-packages/celery/worker/consumer/consumer.py", line 618, in start
c.loop(*c.loop_args())
File "/home/emfollow-playground/.local/lib/python3.8/site-packages/celery/worker/loops.py", line 81, in asynloop
next(loop)
File "/home/emfollow-playground/.local/lib/python3.8/site-packages/kombu/asynchronous/hub.py", line 361, in create_loop
cb(*cbargs)
File "/home/emfollow-playground/.local/lib/python3.8/site-packages/kombu/transport/redis.py", line 1128, in on_readable
self.cycle.on_readable(fileno)
File "/home/emfollow-playground/.local/lib/python3.8/site-packages/kombu/transport/redis.py", line 398, in on_readable
chan.handlers[type]()
File "/home/emfollow-playground/.local/lib/python3.8/site-packages/kombu/transport/redis.py", line 733, in _receive
ret.append(self._receive_one(c))
File "/home/emfollow-playground/.local/lib/python3.8/site-packages/kombu/transport/redis.py", line 743, in _receive_one
response = c.parse_response()
File "/home/emfollow-playground/.local/lib/python3.8/site-packages/redis/client.py", line 3497, in parse_response
raise RuntimeError(
RuntimeError: pubsub connection not set: did you forget to call subscribe() or psubscribe()?
[2021-12-04 11:41:52,116: WARNING/MainProcess/MainThread] No hostname was supplied. Reverting to default 'localhost'
[2021-12-04 11:41:52,124: INFO/MainProcess/MainThread] Stopping email client
[2021-12-04 11:41:53,033: INFO/MainProcess/MainThread] Stopping LVAlert client
[2021-12-04 11:41:53,034: INFO/MainProcess/MainThread] Waiting for </stream:stream> from server
[2021-12-04 11:41:56,645: INFO/ForkPoolWorker-5507/MainThread] Task gwcelery.tasks.external_triggers.handle_snews_lvalert[e66fd348-9947-4a95-bec8-41a2059fc7ad] succeeded in 4.547337795142084s: None