Look into robust restarting
I noticed some instances of the igwn_alert_overseer
hanging in AWS with the following error:
Jan 6 16:33:33 gracedb-swarm-test-us-west-2c-docker-mgr-01 gracedb_docker_gracedb_gracedb.1.1sa688fm1dirtua9dsp3tvp08: %5|1641486813.878|REQTMOUT|rdkafka#producer-81| [thrd:sasl_ssl://kb-1.prod.hop.scimma.org:9092/1]: sasl_ssl://kb-1.prod.hop.scimma.org:9092/1: Timed out ProduceRequest in flight (after 10236ms, timeout #0)
Jan 6 16:33:33 gracedb-swarm-test-us-west-2c-docker-mgr-01 gracedb_docker_gracedb_gracedb.1.1sa688fm1dirtua9dsp3tvp08: %4|1641486813.878|REQTMOUT|rdkafka#producer-81| [thrd:sasl_ssl://kb-1.prod.hop.scimma.org:9092/1]: sasl_ssl://kb-1.prod.hop.scimma.org:9092/1: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
Jan 6 16:33:33 gracedb-swarm-test-us-west-2c-docker-mgr-01 gracedb_docker_gracedb_gracedb.1.1sa688fm1dirtua9dsp3tvp08: %3|1641486813.878|FAIL|rdkafka#producer-81| [thrd:sasl_ssl://kb-1.prod.hop.scimma.org:9092/1]: sasl_ssl://kb-1.prod.hop.scimma.org:9092/1: 1 request(s) timed out: disconnect (after 1717039944ms in state UP)
Make the overseer more robust against these types of hangs, either through changes to the client code or to the overseer itself.
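
One possible direction, as a rough sketch rather than a proposal for the actual overseer code: a heartbeat watchdog that restarts the worker when it stops making progress, with exponential backoff between restarts. Everything named here (`Watchdog`, `backoff_delay`, `supervise`, the timeout values) is hypothetical and not taken from igwn_alert_overseer or the client. It assumes the worker can be wrapped in an object exposing `is_alive()` and `terminate()` (as `multiprocessing.Process` does) and that something calls `beat()` after each successful send.

```python
import time


class Watchdog:
    """Tracks the last sign of progress and reports when it is overdue."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_beat = clock()

    def beat(self):
        # Call after every successful unit of work (e.g. a delivered alert).
        self._last_beat = self._clock()

    def is_stalled(self):
        return self._clock() - self._last_beat > self.timeout_s


def backoff_delay(attempt, base_s=1.0, cap_s=60.0):
    """Exponential backoff: base_s, 2*base_s, 4*base_s, ... capped at cap_s."""
    return min(cap_s, base_s * (2 ** attempt))


def supervise(start_worker, watchdog, max_restarts=5, poll_s=1.0,
              sleep=time.sleep):
    """Restart the worker whenever it dies or stops sending heartbeats.

    start_worker() is a hypothetical factory returning an object with
    is_alive() and terminate(). Returns the number of restarts performed
    once max_restarts is exhausted; runs indefinitely while healthy.
    """
    restarts = 0
    worker = start_worker()
    watchdog.beat()
    while True:
        sleep(poll_s)
        if worker.is_alive() and not watchdog.is_stalled():
            continue  # healthy, keep polling
        worker.terminate()  # hung or dead: make sure it is gone
        if restarts >= max_restarts:
            return restarts  # give up and let the caller escalate
        sleep(backoff_delay(restarts))
        restarts += 1
        worker = start_worker()
        watchdog.beat()
```

If the worker runs in a separate process, the heartbeat would need to cross the process boundary (a shared value, a pipe, or a timestamp file); that part is left out here. Giving up after `max_restarts` lets an outer layer, such as the container orchestrator, take over instead of looping forever.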