Regression: igwn-alert bootstep does not terminate, blocks shutdown of application
The igwn-alert bootstep never terminates, which blocks the application from shutting down. The bootstep stop() method must trigger some sentinel that tells the client's run loop to stop.
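A minimal sketch of the sentinel idea, using only the standard library. The names `SentinelListener` and `ReceiverStep` are hypothetical stand-ins, not the igwn_alert or Celery APIs; the sketch only illustrates a stop() that sets a flag which the run loop re-checks after every bounded wait.

```python
import queue
import threading


class SentinelListener:
    """Hypothetical stand-in for the alert client (not the igwn_alert API)."""

    def __init__(self, poll_timeout=0.1):
        self._stop_requested = threading.Event()  # the sentinel
        self._messages = queue.Queue()
        self._poll_timeout = poll_timeout
        self.received = []

    def publish(self, message):
        self._messages.put(message)

    def listen(self):
        # Wait for messages with a timeout so the sentinel is re-checked
        # regularly even when no messages arrive.
        while not self._stop_requested.is_set():
            try:
                message = self._messages.get(timeout=self._poll_timeout)
            except queue.Empty:
                continue
            self.received.append(message)

    def stop(self):
        self._stop_requested.set()


class ReceiverStep:
    """Hypothetical bootstep-like wrapper exposing start() and stop()."""

    def __init__(self):
        self.listener = SentinelListener()
        self.thread = None

    def start(self):
        self.thread = threading.Thread(
            target=self.listener.listen, name="ReceiverThread")
        self.thread.start()

    def stop(self, join_timeout=5.0):
        # Trigger the sentinel, then wait for the run loop to exit.
        self.listener.stop()
        self.thread.join(join_timeout)
        return not self.thread.is_alive()
```

With this shape, stop() returns True once the listener thread has exited, so shutdown is not blocked as long as every wait inside the loop is bounded.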
So... calling disconnect should terminate the listen loop. Is there anything in the log files that leads you to conclude that it is not being called? Also, the last commit on this is yours.
Leo P. Singer changed title from igwn-alert bootstep does not terminate, blocks shutdown of application to Regression: igwn-alert bootstep does not terminate, blocks shutdown of application
Given that this problem is more involved than we initially thought, should we move this to the O4 milestone? Reminder that the review readiness milestone is due Sep 1 (Thursday).
The tl;dr of the situation so far is that setting until_eos = True not only causes the problem that Deep mentioned above, but its intended behavior is also not what we want. Its intended behavior is to shut the listener down after it processes the current batch of messages: the listener turns on, immediately consumes whatever is waiting in the Kafka queue, and then shuts down. If we find evidence that setting until_eos = True does not result in this behavior, we should let the hop folks know that there's a bug.
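To make the described until_eos = True behavior concrete, here is a rough model of "consume what is already waiting, then shut down". It uses a standard-library queue in place of Kafka, and `drain_until_eos` is a hypothetical name, not a hop function.

```python
import queue


def drain_until_eos(pending):
    """Handle everything already queued, then return instead of waiting."""
    handled = []
    while True:
        try:
            handled.append(pending.get_nowait())
        except queue.Empty:
            # Nothing left in the current batch: stop listening.
            return handled
```

Under this model, an alert that arrives after the drain finishes is never seen by that listener, which is why this behavior is not a substitute for a long-running listener that can be stopped on request.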
The underlying problem, as I understand it from talking to @patrick.godwin, is that adc-streaming's _stream_forever function (link) never returns, meaning that the while self.running line (link) only ever gets evaluated once. Until this is fixed we'll need a workaround.
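The control flow described here can be modeled in a few lines. This is a simplified sketch, not adc-streaming's actual code: `StuckStream`, `StoppableStream`, `_keep_running`, and `consume` are hypothetical, and an endless counter stands in for Kafka polling. It shows that an outer `while self.running` check is reached only once when the inner generator never returns, and that checking the flag inside the inner loop lets the stream end.

```python
import itertools


class StuckStream:
    """Outer loop checks the flag, but the inner generator never returns."""

    def __init__(self):
        self.running = True
        self.flag_checks = 0  # how many times the outer condition was evaluated

    def _keep_running(self):
        self.flag_checks += 1
        return self.running

    def _stream_forever(self):
        for n in itertools.count():  # stands in for endless polling
            yield n

    def stream(self):
        while self._keep_running():
            yield from self._stream_forever()


class StoppableStream(StuckStream):
    """Same outer loop, but the inner generator also honors the flag."""

    def _stream_forever(self):
        for n in itertools.count():
            if not self.running:
                return  # hand control back to the outer loop
            yield n


def consume(stream, stop_after, limit):
    """Read up to `limit` items, clearing the flag after `stop_after` items."""
    seen = []
    for n in itertools.islice(stream.stream(), limit):
        seen.append(n)
        if len(seen) == stop_after:
            stream.running = False
    return seen
```

In this model, `consume(StuckStream(), 2, 10)` keeps receiving items until the limit even though the flag was cleared, while `consume(StoppableStream(), 2, 10)` stops right after the flag is cleared.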
Got another of these messages after reverting all recent changes to the IGWN-Alert bootstep, returning it to its original state before !896 (merged). I'm going to make a new issue since this doesn't seem to be connected to the issue of gracefully shutting down.
[2022-08-26 14:06:18,352: WARNING/MainProcess/IGWNReceiverThread] Exception in thread IGWNReceiverThread:
Traceback (most recent call last):
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/igwn_alert/client.py", line 226, in listen
    for payload, metadata in s.read(
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/hop/io.py", line 340, in read
    for message in self._consumer.stream(autocommit=autocommit, **kwargs):
  File "/cvmfs/oasis.opensciencegrid.org/ligo/sw/conda/envs/igwn-py39-20220317/lib/python3.9/site-packages/adc/consumer.py", line 120, in _stream_forever
    messages = self._consumer.consume(batch_size, batch_timeout.total_seconds())
  File "/cvmfs/oasis.opensciencegrid.org/ligo/sw/conda/envs/igwn-py39-20220317/lib/python3.9/site-packages/adc/errors.py", line 22, in log_client_errors
    raise(KafkaException.from_kafka_error(kafka_error))
adc.errors.KafkaException: Error communicating with Kafka: code=_TIMED_OUT GroupCoordinator: kb-1.prod.hop.scimma.org:9092: 10 request(s) timed out: disconnect (after 1660526ms in state UP)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/cvmfs/oasis.opensciencegrid.org/ligo/sw/conda/envs/igwn-py39-20220317/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/sentry_sdk/integrations/threading.py", line 69, in run
    reraise(*_capture_exception())
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/sentry_sdk/_compat.py", line 54, in reraise
    raise value
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/sentry_sdk/integrations/threading.py", line 67, in run
    return old_run_func(self, *a, **kw)
  File "/cvmfs/oasis.opensciencegrid.org/ligo/sw/conda/envs/igwn-py39-20220317/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/igwn_alert/client.py", line 243, in listen
    callback(topic=metadata.topic.split('.')[1],
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/hop/io.py", line 396, in __exit__
    self.close()
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/hop/io.py", line 374, in close
    self._consumer.close()
  File "/cvmfs/oasis.opensciencegrid.org/ligo/sw/conda/envs/igwn-py39-20220317/lib/python3.9/site-packages/adc/consumer.py", line 179, in close
    self._consumer.close()
  File "/cvmfs/oasis.opensciencegrid.org/ligo/sw/conda/envs/igwn-py39-20220317/lib/python3.9/site-packages/adc/errors.py", line 22, in log_client_errors
    raise(KafkaException.from_kafka_error(kafka_error))
adc.errors.KafkaException: Error communicating with Kafka: code=_TIMED_OUT GroupCoordinator: kb-1.prod.hop.scimma.org:9092: 146 request(s) timed out: disconnect (after 60055ms in state UP)
[2022-08-26 14:06:21,126: INFO/ForkPoolWorker-43/MainThread] Task gwcelery.tasks.condor.submit[1d96c7b4-7b0c-411c-a7f6-011d8183bbfb] retry: Retry in 4s: JobRunning({'ExecuteHost': '<10.14.150.10:9618?addrs=10.14.150.10-9618&alias=emfollow-playground.ligo.caltech.edu&noUDP&sock=schedd_2392_4aa0>', 'MyType': 'ExecuteEvent', 'EventTypeNumber': 1, 'Subproc': 0, 'EventTime': '2022-08-26T14:06:14.020', 'Cluster': 23563, 'Proc': 0})
BTW, I wanted to add that the pull request to support graceful termination when running in a thread has been merged (https://github.com/astronomy-commons/adc-streaming/pull/58). We need to wait until there is a new release for adc-streaming. Change to O4 milestone sounds good.