Exception in IGWNReceiverThread
We've been seeing issues where the IGWN-Alert listener shuts down and does not start back up, resulting in GWCelery silently no longer listening to IGWN-Alerts. We initially found behavior like this when investigating #424 (closed), which contains a history of the errors we've received. We believe the very first error in that issue is related to a problem in !896 (merged), but we've since reverted that change and are now seeing the error I've pasted below.
I can't think of anything else that's changed on our end other than !896 (merged), and with these errors showing up after completely reverting it, I wonder if it's something either on IGWN-Alert's side or if it's even further upstream than that and an issue on SCiMMA's broker's side.
Tagging @alexander.pace about this to help us check on the IGWN-Alert side, and I'm emailing Chris Weaver from SCiMMA with a link to this issue.
Digging through the logs, I see this error has popped up at the following times. Note that if we missed seeing this error several times until we were notified that superevents stopped showing up, so if this is based on some reoccurring event on SCiMMA's side then these times may only be a subset of the times that event occurred. Once this error pops up, we have to re-deploy to get the listener working again.
EDIT: I'm going to update this list of times as we see the error, with latest times on top. I've written a monitoring script so we should catch these much quicker now.
2022-08-28 13:36:38,090
2022-08-25 21:06:35,867
2022-08-25 14:07:50,659
2022-08-25 12:30:09,052
2022-08-25 12:01:28,639
2022-08-24 09:51:39,240
2022-08-24 06:54:50,012
2022-08-22 18:06:53,846
2022-08-17 10:24:10,838
In case it's useful for the SCiMMA folks, the instance of our listener that's failing is listening to the following topics.
gracedb-playground.burst_olib
gracedb-playground.burst_cwb
gracedb-playground.mdc_superevent
gracedb-playground.superevent
gracedb-playground.burst_mly
gracedb-playground.external_agile
gracedb-playground.cbc_pycbc
gracedb-playground.cbc_mbtaonline
gracedb-playground.external_snews
gracedb-playground.cbc_gstlal
gracedb-playground.external_integral
gracedb-playground.external_swift
gracedb-playground.external_fermi
gracedb-playground.cbc_mbta
gracedb-playground.cbc_spiir
And here is the error we're seeing.
[2022-08-26 14:06:18,352: WARNING/MainProcess/IGWNReceiverThread] Exception in thread
[2022-08-26 14:06:18,354: WARNING/MainProcess/IGWNReceiverThread] IGWNReceiverThread
[2022-08-26 14:06:18,354: WARNING/MainProcess/IGWNReceiverThread] :
[2022-08-26 14:06:18,355: WARNING/MainProcess/IGWNReceiverThread] Traceback (most recent call last):
[2022-08-26 14:06:18,355: WARNING/MainProcess/IGWNReceiverThread] File "/home/emfollow-playground/.local/lib/python3.9/site-packages/igwn_alert/client.py", line 226, in listen
[2022-08-26 14:06:18,356: WARNING/MainProcess/IGWNReceiverThread]
[2022-08-26 14:06:18,357: WARNING/MainProcess/IGWNReceiverThread] for payload, metadata in s.read(
[2022-08-26 14:06:18,357: WARNING/MainProcess/IGWNReceiverThread] File "/home/emfollow-playground/.local/lib/python3.9/site-packages/hop/io.py", line 340, in read
[2022-08-26 14:06:18,358: WARNING/MainProcess/IGWNReceiverThread]
[2022-08-26 14:06:18,358: WARNING/MainProcess/IGWNReceiverThread] for message in self._consumer.stream(autocommit=autocommit, **kwargs):
[2022-08-26 14:06:18,358: WARNING/MainProcess/IGWNReceiverThread] File "/cvmfs/oasis.opensciencegrid.org/ligo/sw/conda/envs/igwn-py39-20220317/lib/python3.9/site-packages/adc/consumer.py", line 120, in _stream_forever
[2022-08-26 14:06:18,359: WARNING/MainProcess/IGWNReceiverThread]
[2022-08-26 14:06:18,359: WARNING/MainProcess/IGWNReceiverThread] messages = self._consumer.consume(batch_size, batch_timeout.total_seconds())
[2022-08-26 14:06:18,360: WARNING/MainProcess/IGWNReceiverThread] File "/cvmfs/oasis.opensciencegrid.org/ligo/sw/conda/envs/igwn-py39-20220317/lib/python3.9/site-packages/adc/errors.py", line 22, in log_client_errors
[2022-08-26 14:06:18,360: WARNING/MainProcess/IGWNReceiverThread]
[2022-08-26 14:06:18,361: WARNING/MainProcess/IGWNReceiverThread] raise(KafkaException.from_kafka_error(kafka_error))
[2022-08-26 14:06:18,361: WARNING/MainProcess/IGWNReceiverThread] adc.errors
[2022-08-26 14:06:18,361: WARNING/MainProcess/IGWNReceiverThread] .
[2022-08-26 14:06:18,361: WARNING/MainProcess/IGWNReceiverThread] KafkaException
[2022-08-26 14:06:18,362: WARNING/MainProcess/IGWNReceiverThread] :
[2022-08-26 14:06:18,362: WARNING/MainProcess/IGWNReceiverThread] Error communicating with Kafka: code=_TIMED_OUT GroupCoordinator: kb-1.prod.hop.scimma.org:9092: 10 request(s) timed out: disconnect (after 1660526ms in state UP)
[2022-08-26 14:06:18,362: WARNING/MainProcess/IGWNReceiverThread]
During handling of the above exception, another exception occurred:
[2022-08-26 14:06:18,363: WARNING/MainProcess/IGWNReceiverThread] Traceback (most recent call last):
[2022-08-26 14:06:18,363: WARNING/MainProcess/IGWNReceiverThread] File "/cvmfs/oasis.opensciencegrid.org/ligo/sw/conda/envs/igwn-py39-20220317/lib/python3.9/threading.py", line 973, in _bootstrap_inner
[2022-08-26 14:06:18,364: WARNING/MainProcess/IGWNReceiverThread]
[2022-08-26 14:06:18,364: WARNING/MainProcess/IGWNReceiverThread] self.run()
[2022-08-26 14:06:18,364: WARNING/MainProcess/IGWNReceiverThread] File "/home/emfollow-playground/.local/lib/python3.9/site-packages/sentry_sdk/integrations/threading.py", line 69, in run
[2022-08-26 14:06:18,365: WARNING/MainProcess/IGWNReceiverThread]
[2022-08-26 14:06:18,365: WARNING/MainProcess/IGWNReceiverThread] reraise(*_capture_exception())
[2022-08-26 14:06:18,365: WARNING/MainProcess/IGWNReceiverThread] File "/home/emfollow-playground/.local/lib/python3.9/site-packages/sentry_sdk/_compat.py", line 54, in reraise
[2022-08-26 14:06:18,366: WARNING/MainProcess/IGWNReceiverThread]
[2022-08-26 14:06:18,366: WARNING/MainProcess/IGWNReceiverThread] raise value
[2022-08-26 14:06:18,367: WARNING/MainProcess/IGWNReceiverThread] File "/home/emfollow-playground/.local/lib/python3.9/site-packages/sentry_sdk/integrations/threading.py", line 67, in run
[2022-08-26 14:06:18,367: WARNING/MainProcess/IGWNReceiverThread]
[2022-08-26 14:06:18,367: WARNING/MainProcess/IGWNReceiverThread] return old_run_func(self, *a, **kw)
[2022-08-26 14:06:18,368: WARNING/MainProcess/IGWNReceiverThread] File "/cvmfs/oasis.opensciencegrid.org/ligo/sw/conda/envs/igwn-py39-20220317/lib/python3.9/threading.py", line 910, in run
[2022-08-26 14:06:18,368: WARNING/MainProcess/IGWNReceiverThread]
[2022-08-26 14:06:18,369: WARNING/MainProcess/IGWNReceiverThread] self._target(*self._args, **self._kwargs)
[2022-08-26 14:06:18,369: WARNING/MainProcess/IGWNReceiverThread] File "/home/emfollow-playground/.local/lib/python3.9/site-packages/igwn_alert/client.py", line 243, in listen
[2022-08-26 14:06:18,370: WARNING/MainProcess/IGWNReceiverThread]
[2022-08-26 14:06:18,370: WARNING/MainProcess/IGWNReceiverThread] callback(topic=metadata.topic.split('.')[1],
[2022-08-26 14:06:18,370: WARNING/MainProcess/IGWNReceiverThread] File "/home/emfollow-playground/.local/lib/python3.9/site-packages/hop/io.py", line 396, in __exit__
[2022-08-26 14:06:18,371: WARNING/MainProcess/IGWNReceiverThread]
[2022-08-26 14:06:18,371: WARNING/MainProcess/IGWNReceiverThread] self.close()
[2022-08-26 14:06:18,372: WARNING/MainProcess/IGWNReceiverThread] File "/home/emfollow-playground/.local/lib/python3.9/site-packages/hop/io.py", line 374, in close
[2022-08-26 14:06:18,372: WARNING/MainProcess/IGWNReceiverThread]
[2022-08-26 14:06:18,372: WARNING/MainProcess/IGWNReceiverThread] self._consumer.close()
[2022-08-26 14:06:18,373: WARNING/MainProcess/IGWNReceiverThread] File "/cvmfs/oasis.opensciencegrid.org/ligo/sw/conda/envs/igwn-py39-20220317/lib/python3.9/site-packages/adc/consumer.py", line 179, in close
[2022-08-26 14:06:18,373: WARNING/MainProcess/IGWNReceiverThread]
[2022-08-26 14:06:18,374: WARNING/MainProcess/IGWNReceiverThread] self._consumer.close()
[2022-08-26 14:06:18,374: WARNING/MainProcess/IGWNReceiverThread] File "/cvmfs/oasis.opensciencegrid.org/ligo/sw/conda/envs/igwn-py39-20220317/lib/python3.9/site-packages/adc/errors.py", line 22, in log_client_errors
[2022-08-26 14:06:18,375: WARNING/MainProcess/IGWNReceiverThread]
[2022-08-26 14:06:18,375: WARNING/MainProcess/IGWNReceiverThread] raise(KafkaException.from_kafka_error(kafka_error))
[2022-08-26 14:06:18,375: WARNING/MainProcess/IGWNReceiverThread] adc.errors
[2022-08-26 14:06:18,375: WARNING/MainProcess/IGWNReceiverThread] .
[2022-08-26 14:06:18,376: WARNING/MainProcess/IGWNReceiverThread] KafkaException
[2022-08-26 14:06:18,376: WARNING/MainProcess/IGWNReceiverThread] :
[2022-08-26 14:06:18,376: WARNING/MainProcess/IGWNReceiverThread] Error communicating with Kafka: code=_TIMED_OUT GroupCoordinator: kb-1.prod.hop.scimma.org:9092: 146 request(s) timed out: disconnect (after 60055ms in state UP)
[2022-08-26 14:06:21,126: INFO/ForkPoolWorker-43/MainThread] Task gwcelery.tasks.condor.submit[1d96c7b4-7b0c-411c-a7f6-011d8183bbfb] retry: Retry in 4s: JobRunning({'ExecuteHost': '<10.14.150.10:9618?addrs=10.14.150.10-9618&alias=emfollow-playground.ligo.caltech.edu&noUDP&sock=schedd_2392_4aa0>', 'MyType': 'ExecuteEvent', 'EventTypeNumber': 1, 'Subproc': 0, 'EventTime': '2022-08-26T14:06:14.020', 'Cluster': 23563, 'Proc': 0})