After the v2.1.5 release we have a Kafka consumer that is no longer throwing up obvious errors and should be working. However, there have been no successful examples of an ingested events, even with multiple instances of GraceDB and topics (where Fermi publishes numerous events on their production topic per week):
We can see some examples of successful events on the dev server for reference (where these were created internally by gwcelery using O3 replay code, not ingesting from external topics):
To help debug, I found a few suppressed errors in the worker logs:
test:
[2023-08-02 17:06:02,653: ERROR/MainProcess/swift_KafkaConsumerThread] internal kafka error: KafkaError{code=_TIMED_OUT,val=-185,str="GroupCoordinator: kafka3.gcn.nasa.gov:9092: 2 request(s) timed out: disconnect (after 609872ms in state UP)"}[2023-08-02 17:06:02,680: ERROR/MainProcess/swift_KafkaConsumerThread] internal kafka error: KafkaError{code=_TRANSPORT,val=-195,str="sasl_ssl://kafka3.gcn.nasa.gov:9092/3: Connection setup timed out in state CONNECT (after 30032ms in state CONNECT)"}[2023-08-02 17:06:02,721: ERROR/MainProcess/swift_KafkaConsumerThread] internal kafka error: KafkaError{code=_TRANSPORT,val=-195,str="sasl_ssl://kafka.gcn.nasa.gov:9092/bootstrap: Disconnected (after 10820537ms in state UP)"}[2023-08-02 17:06:32,684: ERROR/MainProcess/swift_KafkaConsumerThread] internal kafka error: KafkaError{code=_TRANSPORT,val=-195,str="GroupCoordinator: kafka3.gcn.nasa.gov:9092: Connection setup timed out in state CONNECT (after 30031ms in state CONNECT)"}[2023-08-02 17:06:32,708: ERROR/MainProcess/swift_KafkaConsumerThread] internal kafka error: KafkaError{code=_TRANSPORT,val=-195,str="sasl_ssl://kafka3.gcn.nasa.gov:9092/3: Connection setup timed out in state CONNECT (after 30028ms in state CONNECT, 1 identical error(s) suppressed)"}[2023-08-02 17:06:32,751: ERROR/MainProcess/swift_KafkaConsumerThread] internal kafka error: KafkaError{code=_TRANSPORT,val=-195,str="sasl_ssl://kafka2.gcn.nasa.gov:9092/2: Connection setup timed out in state CONNECT (after 30030ms in state CONNECT)"}[2023-08-02 17:06:32,754: ERROR/MainProcess/swift_KafkaConsumerThread] internal kafka error: KafkaError{code=_TRANSPORT,val=-195,str="sasl_ssl://kafka.gcn.nasa.gov:9092/bootstrap: Connection setup timed out in state CONNECT (after 30033ms in state CONNECT)"}[2023-08-02 17:07:02,717: ERROR/MainProcess/swift_KafkaConsumerThread] internal kafka error: KafkaError{code=_TRANSPORT,val=-195,str="GroupCoordinator: kafka3.gcn.nasa.gov:9092: Connection setup timed out in state CONNECT (after 30033ms in state CONNECT, 1 identical error(s) suppressed)"}
playground:
[2023-08-04 17:40:01,281: ERROR/MainProcess/swift_KafkaConsumerThread] internal kafka error: KafkaError{code=_TRANSPORT,val=-195,str="sasl_ssl://kafka3.gcn.nasa.gov:9092/3: Connection setup timed out in state CONNECT (after 30031ms in state CONNECT, 1 identical error(s) suppressed)"}[2023-08-04 17:40:02,093: ERROR/MainProcess/swift_KafkaConsumerThread] internal kafka error: KafkaError{code=_TRANSPORT,val=-195,str="GroupCoordinator: kafka3.gcn.nasa.gov:9092: Connection setup timed out in state CONNECT (after 30048ms in state CONNECT, 1 identical error(s) suppressed)"}
Are there similar errors coming from fermi_KafkaConsumerThread? These seem like temporary connection errors, and only affecting the swift topic listener for whatever reason. I don't see any messages on the Swift topics on GCN, so those shouldn't have affected anything? Don't have access to the private Fermi topic.
Some of these are the same errors as in #457 (closed)#458 (closed) . Those discussion imply that they are temporary log messages and the consumer should recover
Thanks for pointing that out @albertcheng.zhang. I guess I'm still concerned because we still haven't successfully ingested an event on any GraceDB instance and that makes me worried something is still wrong with our consumer.
I tested out the listener on my laptop with a dummy handler task and was able to ingest events fine from a GCN Kafka topic that I have access to. Are there log messages like
[...: INFO/MainProcess/fermi_KafkaConsumerThread] calling handlers [<@task: gwcelery.tasks.external_triggers.handle_targeted_kafka_alert of gwcelery at ...>] for key 'fermi'
@albertcheng.zhang the log messages look like the following in the production gwcelery-kafka-consumer.py log:
[2023-08-09 10:46:58,446: INFO/MainProcess/MainThread] mingle: searching for neighbors[2023-08-09 10:46:58,449: WARNING/MainProcess/MainThread] No hostname was supplied. Reverting to default 'localhost'[2023-08-09 10:46:59,455: INFO/MainProcess/MainThread] mingle: all alone[2023-08-09 10:46:59,473: INFO/MainProcess/MainThread] Starting Kafka consumer, topics: kafka://kafka.test.gcn.nasa.gov/fermi.gbm.targeted.private.igwn kafka://kafka.gcn.nasa.gov/gcn.notices.swift.bat.guano[2023-08-09 10:46:59,476: INFO/MainProcess/MainThread] group ID not specified, generating a random group ID: 3rl5lao6s8s0732dfpmpalqavo-47FMXTKO2S[2023-08-09 10:46:59,477: INFO/MainProcess/MainThread] connecting to kafka://kafka.test.gcn.nasa.gov[2023-08-09 10:47:00,009: INFO/MainProcess/MainThread] subscribing to topics: ['fermi.gbm.targeted.private.igwn'][2023-08-09 10:47:00,663: INFO/MainProcess/MainThread] group ID not specified, generating a random group ID: 4pi5pg81ij238ttb1a3bd6ccuh-8VKP23Y5NG[2023-08-09 10:47:00,663: INFO/MainProcess/MainThread] connecting to kafka://kafka.gcn.nasa.gov[2023-08-09 10:47:01,019: INFO/MainProcess/MainThread] subscribing to topics: ['gcn.notices.swift.bat.guano'][2023-08-09 10:47:01,734: INFO/MainProcess/fermi_KafkaConsumerThread] processing messages from stream[2023-08-09 10:47:01,734: INFO/MainProcess/swift_KafkaConsumerThread] processing messages from stream[2023-08-09 10:47:01,734: INFO/MainProcess/MainThread] gwcelery-kafka-consumer-worker@emfollow-test.ligo.caltech.edu ready.[2023-08-09 10:47:03,303: INFO/MainProcess/MainThread] Events of group {task} enabled by remote.
Are we sure that messages have actually been sent over the fermi topic since the listener was implemented? The gcn.notices.swift.bat.guano topic is completely empty based on hop subscribe -s EARLIEST <kafka url>
I don't think anything has been sent over either tbh in quite some time. Many events do get pushed to the production Fermi topic and I think that would be more helpful as a testing tool for playground and test. I've also noticed some credential issues with those two instances that I should look into.
@brandon.piotrzkowski in the logs you posted above, the listener subscribes to kafka://kafka.test.gcn.nasa.gov instead of kafka://kafka.gcn.nasa.gov, so it would explain things if that broker doesn't have any messages. Has !1275 (merged) made it to deployment yet? If it has, the production instance isn't getting the right URL...
Yes production is listening to the correct (non test) topics:
[2023-08-09 08:26:51,248: INFO/MainProcess/MainThread] mingle: searching for neighbors[2023-08-09 08:26:51,256: WARNING/MainProcess/MainThread] No hostname was supplied. Reverting to default 'localhost'[2023-08-09 08:26:52,261: INFO/MainProcess/MainThread] mingle: all alone[2023-08-09 08:26:52,281: INFO/MainProcess/MainThread] Starting Kafka consumer, topics: kafka://kafka.gcn.nasa.gov/fermi.gbm.targeted.private.igwn kafka://kafka.gcn.nasa.gov/gcn.notices.swift.bat.guano[2023-08-09 08:26:52,283: INFO/MainProcess/MainThread] group ID not specified, generating a random group ID: h208u3m5eimufpih1o8fi25rd-EI9D1ZCJD7[2023-08-09 08:26:52,284: INFO/MainProcess/MainThread] connecting to kafka://kafka.gcn.nasa.gov[2023-08-09 08:26:52,709: INFO/MainProcess/MainThread] subscribing to topics: ['fermi.gbm.targeted.private.igwn'][2023-08-09 08:26:53,350: INFO/MainProcess/MainThread] group ID not specified, generating a random group ID: lnrsrnki3012536p12077bluc-1WMELT51CB[2023-08-09 08:26:53,352: INFO/MainProcess/MainThread] connecting to kafka://kafka.gcn.nasa.gov[2023-08-09 08:26:53,739: INFO/MainProcess/MainThread] subscribing to topics: ['gcn.notices.swift.bat.guano'][2023-08-09 08:26:54,458: INFO/MainProcess/fermi_KafkaConsumerThread] processing messages from stream[2023-08-09 08:26:54,459: INFO/MainProcess/swift_KafkaConsumerThread] processing messages from stream
I think I'm leaning towards the consumer working (at least up to the point of the last confirmed error) rather than something being catastrophically wrong. Again, I'd prefer us to listen to the production instance of Fermi due to the constant testing it provides.