Commit 8f3ea65f authored by Jameson Rollins

update README to add a common issues section

parent f9cfd7b7
@@ -124,6 +124,118 @@ in new condor executions:
```shell
$ locklost online restart
```
## Common issues
There are a couple of issues that crop up periodically, usually due to
problems with the site LDAS clusters where the jobs run.
### Online analysis restarting due to NDS problems
One of the most common problems is the online analysis falling over
because it can't connect to the site NDS server, for example:
```
2021-05-12_17:18:31 2021-05-12 17:18:31,317 NDS connect: 10.21.2.4:31200
2021-05-12_17:18:31 Error in write(): Connection refused
2021-05-12_17:18:31 Error in write(): Connection refused
2021-05-12_17:18:31 Traceback (most recent call last):
2021-05-12_17:18:31 File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
2021-05-12_17:18:31 "__main__", mod_spec)
2021-05-12_17:18:31 File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
2021-05-12_17:18:31 exec(code, run_globals)
2021-05-12_17:18:31 File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.3-py3.6.egg/locklost/online.py", line 109, in <module>
2021-05-12_17:18:31 stat_file=stat_file,
2021-05-12_17:18:31 File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.3-py3.6.egg/locklost/search.py", line 72, in search_iterate
2021-05-12_17:18:31 for bufs in data.nds_iterate([channel], start_end=segment):
2021-05-12_17:18:31 File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.3-py3.6.egg/locklost/data.py", line 48, in nds_iterate
2021-05-12_17:18:31 with closing(nds_connection()) as conn:
2021-05-12_17:18:31 File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.3-py3.6.egg/locklost/data.py", line 28, in nds_connection
2021-05-12_17:18:31 conn = nds2.connection(HOST, PORT)
2021-05-12_17:18:31 File "/usr/lib64/python3.6/site-packages/nds2.py", line 3172, in __init__
2021-05-12_17:18:31 _nds2.connection_swiginit(self, _nds2.new_connection(*args))
2021-05-12_17:18:31 RuntimeError: Failed to establish a connection[INFO: Error occurred trying to write to socket]
```
This is always because the NDS server itself has died and needs to be
restarted/reset (frequently due to Tuesday maintenance).
Unfortunately, the site admins aren't necessarily aware of this issue
and need to be poked about it.
Once the NDS server is back, the job should pick up again on its own.
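If you want to check for yourself whether the server is accepting
connections again, a simple TCP probe against the host and port shown
in the log is usually enough (a sketch, assuming a `nc`/netcat that
supports `-z` is available; the address below is just the example
from the log above, so substitute your site's NDS server):
```shell
$ nc -z -w 5 10.21.2.4 31200 && echo "NDS port open" || echo "NDS port still unreachable"
```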
If the notifications are too much, you can stop the online analysis
until the server is back up:
```shell
$ locklost online stop
```
Don't forget to restart once things are fully recovered:
```shell
$ locklost online start
```
### Analyze jobs failing because of cluster data problems
Another common failure mode is follow-up analysis jobs failing due to
data access problems in the cluster. These often occur during Tuesday
maintenance, but sometimes show up mysteriously at other times as
well. These kinds of failures are indicated by exceptions like the
following in the event analyze log:
```
2021-05-07 16:02:51,722 [analyze.analyze_event] exception in discover_data:
Traceback (most recent call last):
File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.0-py3.6.egg/locklost/analyze.py", line 56, in analyze_event
func(event)
File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.0-py3.6.egg/locklost/plugins/discover.py", line 67, in discover_data
raise RuntimeError("data discovery timeout reached, data not found")
RuntimeError: data discovery timeout reached, data not found
```
or:
```
2021-05-07 16:02:51,750 [analyze.analyze_event] exception in find_previous_state:
Traceback (most recent call last):
File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.0-py3.6.egg/locklost/analyze.py", line 56, in analyze_event
func(event)
File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.0-py3.6.egg/locklost/plugins/history.py", line 26, in find_previous_state
gbuf = data.fetch(channels, segment)[0]
File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.0-py3.6.egg/locklost/data.py", line 172, in fetch
bufs = func(channels, start, stop)
File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.0-py3.6.egg/locklost/data.py", line 150, in frame_fetch_gwpy
data = gwpy.timeseries.TimeSeriesDict.find(channels, start, stop, frametype=config.IFO+'_R')
File "/usr/lib/python3.6/site-packages/gwpy/timeseries/core.py", line 1291, in find
on_gaps="error" if pad is None else "warn",
File "/usr/lib/python3.6/site-packages/gwpy/io/datafind.py", line 335, in wrapped
return func(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/gwpy/io/datafind.py", line 642, in find_urls
on_gaps=on_gaps)
File "/usr/lib/python3.6/site-packages/gwdatafind/http.py", line 433, in find_urls
raise RuntimeError(msg)
RuntimeError: Missing segments:
[1304463181 ... 1304463182)
```
This problem sometimes just corrects itself, but often needs admin
poking as well.
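To check whether the missing data have shown up yet, you can query
datafind directly for the span reported in the error. Below is a
sketch using the `gw_data_find` command line tool; the observatory
and frame type are just examples (the analysis reads `<IFO>_R`
frames), so substitute the values for your site:
```shell
$ gw_data_find --observatory H --type H1_R \
    --gps-start-time 1304463181 --gps-end-time 1304463183 \
    --url-type file
```
If this returns frame URLs covering the full span, the data are back
and the failed events can be re-analyzed (see back-filling below).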
## Back-filling events
After things have recovered from any of the issues mentioned above,
you'll want to back-fill any missed events. The best way to do that
is to run a condor `search` for missed events (e.g. from "4 weeks
ago" until "now"):
```shell
$ locklost search --condor '4 weeks ago' now
```
followed by a condor `analyze` to analyze any newly found events:
```shell
$ locklost analyze --condor '4 weeks ago' now
```
The `analyze` command will analyze not only missed events but also
"failed" events.
Make sure to wait until the event search job has finished before you
run the event analyze job, so you don't miss analyzing any newly
found events.
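A simple way to sequence the two jobs is to watch the condor queue
and only launch the analyze job once the search jobs have drained
(a sketch; the exact job names shown in the queue depend on the
deployment):
```shell
# check the queue periodically; once the locklost search jobs are gone...
$ condor_q
# ...it's safe to launch the follow-up analysis
$ locklost analyze --condor '4 weeks ago' now
```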
# Developing and contributing