Commit 8f3ea65f authored by Jameson Rollins

update README to add a common issues section

parent f9cfd7b7
@@ -124,6 +124,118 @@ in new condor executions:
```shell
$ locklost online restart
```
## Common issues
There are a couple of issues that crop up periodically, usually due to
problems with the site LDAS clusters where the jobs run.
### Online analysis restarting due to NDS problems
One of the most common problems is the online analysis falling over
because it can't connect to the site NDS server, for example:
```
2021-05-12_17:18:31 2021-05-12 17:18:31,317 NDS connect: 10.21.2.4:31200
2021-05-12_17:18:31 Error in write(): Connection refused
2021-05-12_17:18:31 Error in write(): Connection refused
2021-05-12_17:18:31 Traceback (most recent call last):
2021-05-12_17:18:31 File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
2021-05-12_17:18:31 "__main__", mod_spec)
2021-05-12_17:18:31 File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
2021-05-12_17:18:31 exec(code, run_globals)
2021-05-12_17:18:31 File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.3-py3.6.egg/locklost/online.py", line 109, in <module>
2021-05-12_17:18:31 stat_file=stat_file,
2021-05-12_17:18:31 File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.3-py3.6.egg/locklost/search.py", line 72, in search_iterate
2021-05-12_17:18:31 for bufs in data.nds_iterate([channel], start_end=segment):
2021-05-12_17:18:31 File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.3-py3.6.egg/locklost/data.py", line 48, in nds_iterate
2021-05-12_17:18:31 with closing(nds_connection()) as conn:
2021-05-12_17:18:31 File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.3-py3.6.egg/locklost/data.py", line 28, in nds_connection
2021-05-12_17:18:31 conn = nds2.connection(HOST, PORT)
2021-05-12_17:18:31 File "/usr/lib64/python3.6/site-packages/nds2.py", line 3172, in __init__
2021-05-12_17:18:31 _nds2.connection_swiginit(self, _nds2.new_connection(*args))
2021-05-12_17:18:31 RuntimeError: Failed to establish a connection[INFO: Error occurred trying to write to socket]
```
This is always because the NDS server itself has died and needs to be
restarted/reset (frequently due to Tuesday maintenance).
Unfortunately, the site admins aren't necessarily aware of this issue
and need to be poked about it.
Once the NDS server is back, the job should pick up again on its own.
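If you want to check for yourself whether the server is accepting
connections again, a simple TCP probe against the host and port shown
in the log is usually enough (a sketch, assuming a `nc`/netcat that
supports `-z` is available; the address below is just the example
from the log above, so substitute your site's NDS server):
```shell
$ nc -z -w 5 10.21.2.4 31200 && echo "NDS port open" || echo "NDS port still unreachable"
```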
If the notifications are too much, you can stop the online analysis
until the server is back up:
```shell
$ locklost online stop
```
Don't forget to restart once things are fully recovered:
```shell
$ locklost online start
```
### Analyze jobs failing because of cluster data problems
Another common failure mode is follow-up analysis jobs failing due to
data access problems in the cluster. These often occur during Tuesday
maintenance, but sometimes show up mysteriously at other times as
well. These kinds of failures are indicated by exceptions like the
following in the event analyze log:
```
2021-05-07 16:02:51,722 [analyze.analyze_event] exception in discover_data:
Traceback (most recent call last):
File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.0-py3.6.egg/locklost/analyze.py", line 56, in analyze_event
func(event)
File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.0-py3.6.egg/locklost/plugins/discover.py", line 67, in discover_data
raise RuntimeError("data discovery timeout reached, data not found")
RuntimeError: data discovery timeout reached, data not found
```
or:
```
2021-05-07 16:02:51,750 [analyze.analyze_event] exception in find_previous_state:
Traceback (most recent call last):
File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.0-py3.6.egg/locklost/analyze.py", line 56, in analyze_event
func(event)
File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.0-py3.6.egg/locklost/plugins/history.py", line 26, in find_previous_state
gbuf = data.fetch(channels, segment)[0]
File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.0-py3.6.egg/locklost/data.py", line 172, in fetch
bufs = func(channels, start, stop)
File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.0-py3.6.egg/locklost/data.py", line 150, in frame_fetch_gwpy
data = gwpy.timeseries.TimeSeriesDict.find(channels, start, stop, frametype=config.IFO+'_R')
File "/usr/lib/python3.6/site-packages/gwpy/timeseries/core.py", line 1291, in find
on_gaps="error" if pad is None else "warn",
File "/usr/lib/python3.6/site-packages/gwpy/io/datafind.py", line 335, in wrapped
return func(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/gwpy/io/datafind.py", line 642, in find_urls
on_gaps=on_gaps)
File "/usr/lib/python3.6/site-packages/gwdatafind/http.py", line 433, in find_urls
raise RuntimeError(msg)
RuntimeError: Missing segments:
[1304463181 ... 1304463182)
```
This problem sometimes just corrects itself, but often needs admin
poking as well.
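To check whether the missing data have shown up yet, you can query
datafind directly for the span reported in the error. Below is a
sketch using the `gw_data_find` command line tool; the observatory
and frame type are just examples (the analysis reads `<IFO>_R`
frames), so substitute the values for your site:
```shell
$ gw_data_find --observatory H --type H1_R \
    --gps-start-time 1304463181 --gps-end-time 1304463183 \
    --url-type file
```
If this returns frame URLs covering the full span, the data are back
and the failed events can be re-analyzed (see back-filling below).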
## Back-filling events
After things have recovered from any of the issues mentioned above,
you'll want to back-fill any missed events. The best way to do that
is to run a condor `search` for missed events (e.g. from "4 weeks
ago" until "now"):
```shell
$ locklost search --condor '4 weeks ago' now
```
followed by a condor `analyze` to analyze any newly found events:
```shell
$ locklost analyze --condor '4 weeks ago' now
```
The `analyze` command will analyze not only missed events but also
"failed" events.
Make sure to wait until the event search job has finished before you
run the event analyze job, so you don't miss analyzing any newly
found events.
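A simple way to sequence the two jobs is to watch the condor queue
and only launch the analyze job once the search jobs have drained
(a sketch; the exact job names shown in the queue depend on the
deployment):
```shell
# check the queue periodically; once the locklost search jobs are gone...
$ condor_q
# ...it's safe to launch the follow-up analysis
$ locklost analyze --condor '4 weeks ago' now
```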
# Developing and contributing