GWCelery is reporting periodic SSLEOFErrors that have occurred when trying to interact with the GraceDB API using gracedb-client. A Sentry page which is tracking these issues is here.
The error looks like
SSLEOFErrorEOF occurred in violation of protocol (_ssl.c:777)
All three of the cases where the error occurred when trying to get the root API resource were actually attempting to do a POST or PUT request to create a label or VOEvent. But there is some functionality built into the client which does a "priming request" to GET the root API resource whenever we are doing a POST or PUT (see here). This functionality is tied to some SSL handshake issues which I think Branson or Brian saw in the past and I experienced again recently when migrating the API to use only a single endpoint. This "priming request" passes a Connection: Keep-Alive header - not sure if that is somehow related.
Copying over some info from bugs.ligo.org that was submitted by Keith Thorne:
Both LIGO Control Rooms run scripts to query the GraceDB database for alerts. We updated them in December, 2018 for O3 and they worked fine.
However now they work for a few minutes and then die with a RunTime error in rest.py
On the SL7 machine at LLO, the error log is
ext-alert 1237219963] INFO: Iteration complete[ext-alert 1237219963] DEBUG: caget L1:FEC-117_TIME_DIAG = 1237219963.0[ext-alert 1237219963] DEBUG: Submitting query: 'External 1237183963 .. 1237219963'[ext-alert 1237219963] DEBUG: Submitting query: 'label: L1OPS 1237183963 .. 1237219963'[ext-alert 1237219966] DEBUG: Retrieved 2 events from gracedb[ext-alert 1237219966] DEBUG: caput L1:CAL-INJ_EXTTRIG_ALERT_QUERY_TIME = 1237219963.0[ext-alert 1237219966] DEBUG: No new events detected[ext-alert 1237219966] DEBUG: Waiting...[ext-alert 1237219973] INFO: Iteration complete[ext-alert 1237219973] DEBUG: caget L1:FEC-117_TIME_DIAG = 1237219973.0[ext-alert 1237219973] DEBUG: Submitting query: 'External 1237183973 .. 1237219973'[ext-alert 1237219975] DEBUG: Submitting query: 'label: L1OPS 1237183973 .. 1237219973'DEPRECATED: this attribute has been moved to '_service_url'Traceback (most recent call last):File "/opt/rtcds/userapps/release/cal/common/scripts/ext_alert.py", line 498, in <module>far=args.far, test=args.test))File "/opt/rtcds/userapps/release/cal/common/scripts/ext_alert.py", line 136, in query_gracedb'label: %sOPS %d .. %d' % (ifo, start, end))File "/opt/rtcds/userapps/release/cal/common/scripts/ext_alert.py", line 162, in log_queryreturn list(connection.events(query))File "/usr/lib/python2.7/site-packages/ligo/gracedb/rest.py", line 690, in eventsresponse = self.get(uri).json()File "/usr/lib/python2.7/site-packages/ligo/gracedb/rest.py", line 280, in getreturn self.request("GET", url, headers=headers)File "/usr/lib/python2.7/site-packages/ligo/gracedb/rest.py", line 521, in requestreturn GsiRest.request(self, method, *args, **kwargs)File "/usr/lib/python2.7/site-packages/ligo/gracedb/rest.py", line 260, in requestself.make_request(conn, method, url, body, headers or {})File "/usr/lib/python2.7/site-packages/ligo/gracedb/rest.py", line 227, in make_requestself.output_and_die(msg)File "/usr/lib/python2.7/site-packages/ligo/gracedb/rest.py", line 328, in output_and_dieraise RuntimeError(msg)RuntimeError: ERRORProblem establishing secure connection: EOF occurred in violation of protocol (_ssl.c:579)
It also does not work at LHO, with likely has Debian8/9
We had been at python2-ligo-gracedb 2.0.0. We are now trying python2-ligo-gracedb 2.2.0
This is quite serious and prevent our operators from getting alerts !!
Script attached: ext_alert.py
After discussing with Tom, we think this might be due to Traefik passing a packet to a gunicorn socket that is not closed, but closes before the packet gets there. We are increasing the gunicorn keep-alive timeout to be longer than that of Traefik (Traefik is 90s, hard-coded; gunicorn is now set to 100 s) to see if that resolves the issue. See 7d3138f8.