Commit a5297c2a authored by Yannick Lecoeuche's avatar Yannick Lecoeuche

Merge branch 'yannick.lecoeuche/locklost-refine_fix' into 'refine_fix'

Merging to pull master changes and change name of refine.py gps variable

See merge request !2
parents 76f3215d 8bef086d
......@@ -124,6 +124,118 @@ in new condor executions:
$ locklost online restart
```
## common issues
There are a couple of issues that crop up periodically, usually due to
problems with the site LDAS clusters where the jobs run.
### online analysis restarting due to NDS problems
One of the most common problems is that the online analysis is falling
over because it can't connect to the site NDS server, for example:
```
2021-05-12_17:18:31 2021-05-12 17:18:31,317 NDS connect: 10.21.2.4:31200
2021-05-12_17:18:31 Error in write(): Connection refused
2021-05-12_17:18:31 Error in write(): Connection refused
2021-05-12_17:18:31 Traceback (most recent call last):
2021-05-12_17:18:31 File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
2021-05-12_17:18:31 "__main__", mod_spec)
2021-05-12_17:18:31 File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
2021-05-12_17:18:31 exec(code, run_globals)
2021-05-12_17:18:31 File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.3-py3.6.egg/locklost/online.py", line 109, in <module>
2021-05-12_17:18:31 stat_file=stat_file,
2021-05-12_17:18:31 File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.3-py3.6.egg/locklost/search.py", line 72, in search_iterate
2021-05-12_17:18:31 for bufs in data.nds_iterate([channel], start_end=segment):
2021-05-12_17:18:31 File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.3-py3.6.egg/locklost/data.py", line 48, in nds_iterate
2021-05-12_17:18:31 with closing(nds_connection()) as conn:
2021-05-12_17:18:31 File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.3-py3.6.egg/locklost/data.py", line 28, in nds_connection
2021-05-12_17:18:31 conn = nds2.connection(HOST, PORT)
2021-05-12_17:18:31 File "/usr/lib64/python3.6/site-packages/nds2.py", line 3172, in __init__
2021-05-12_17:18:31 _nds2.connection_swiginit(self, _nds2.new_connection(*args))
2021-05-12_17:18:31 RuntimeError: Failed to establish a connection[INFO: Error occurred trying to write to socket]
```
This is always because the NDS server itself has died and needs to be
restarted/reset (frequently due to Tuesday maintenance).
Unfortunately the site admins aren't necessarily aware of this issue
and need to be poked about it.
Once the NDS server is back, the job should pick up again on its own.
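Before poking the admins, it can help to confirm the NDS server really is unreachable. This is a minimal sketch (not part of locklost) that probes the server with a plain TCP connect; the host and port are the example values from the log output above:

```python
import socket

def nds_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to the NDS server succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# example, using the host/port reported in the log above:
# nds_reachable('10.21.2.4', 31200)
```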
If the notifications are too much, you can stop the online analysis
until the server is back up:
```shell
$ locklost online stop
```
Don't forget to restart once things are fully recovered:
```shell
$ locklost online start
```
### analyze jobs failing because of cluster data problems
Another common failure mode is follow-up analysis jobs failing due to
data access problems in the cluster. These often occur during Tuesday
maintenance, but can also show up mysteriously at other times. These
kinds of failures are indicated by the following exceptions in the
event analyze log:
```
2021-05-07 16:02:51,722 [analyze.analyze_event] exception in discover_data:
Traceback (most recent call last):
File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.0-py3.6.egg/locklost/analyze.py", line 56, in analyze_event
func(event)
File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.0-py3.6.egg/locklost/plugins/discover.py", line 67, in discover_data
raise RuntimeError("data discovery timeout reached, data not found")
RuntimeError: data discovery timeout reached, data not found
```
or:
```
2021-05-07 16:02:51,750 [analyze.analyze_event] exception in find_previous_state:
Traceback (most recent call last):
File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.0-py3.6.egg/locklost/analyze.py", line 56, in analyze_event
func(event)
File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.0-py3.6.egg/locklost/plugins/history.py", line 26, in find_previous_state
gbuf = data.fetch(channels, segment)[0]
File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.0-py3.6.egg/locklost/data.py", line 172, in fetch
bufs = func(channels, start, stop)
File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.0-py3.6.egg/locklost/data.py", line 150, in frame_fetch_gwpy
data = gwpy.timeseries.TimeSeriesDict.find(channels, start, stop, frametype=config.IFO+'_R')
File "/usr/lib/python3.6/site-packages/gwpy/timeseries/core.py", line 1291, in find
on_gaps="error" if pad is None else "warn",
File "/usr/lib/python3.6/site-packages/gwpy/io/datafind.py", line 335, in wrapped
return func(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/gwpy/io/datafind.py", line 642, in find_urls
on_gaps=on_gaps)
File "/usr/lib/python3.6/site-packages/gwdatafind/http.py", line 433, in find_urls
raise RuntimeError(msg)
RuntimeError: Missing segments:
[1304463181 ... 1304463182)
```
This problem sometimes just corrects itself, but often needs admin
poking as well.
## back-filling events
After things have recovered from any of the issues mentioned above,
you'll want to back-fill any missed events. The best way to do that
is to run a condor `search` for missed events (e.g. from "4 weeks ago"
until "now"):
```shell
$ locklost search --condor '4 weeks ago' now
```
followed by a condor `analyze` to analyze any newly found events:
```shell
$ locklost analyze --condor '4 weeks ago' now
```
The `analyze` command will pick up both missed events and previously
"failed" events.
Make sure to wait until the event search job is finished before you
run the event analyze job, so you don't miss analyzing any new events.
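If you want to script that handoff, the gating logic is just "poll until the search DAG is done, then submit the analyze pass". A hedged sketch of the pattern; the `search_is_done` check in the usage comment is a hypothetical placeholder you would replace with something real (e.g. a `condor_q` query, or a check on the DAG lock file):

```python
import time

def wait_for(check, interval=60, timeout=3600):
    """Poll `check()` until it returns True; raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return
        time.sleep(interval)
    raise TimeoutError("job did not finish in time")

# hypothetical usage: block until the condor search DAG finishes,
# then kick off the analyze pass
# wait_for(search_is_done)
# subprocess.run(['locklost', 'analyze', '--condor', '4 weeks ago', 'now'])
```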
# Developing and contributing
......
......@@ -94,6 +94,7 @@ def analyze_condor(event):
         'analyze',
         [str(event.id)],
         local=False,
+        notify_user=os.getenv('CONDOR_NOTIFY_USER'),
     )
     sub.write()
     sub.submit()
......
......@@ -167,14 +167,17 @@ RETRY {jid} 1
         logging.info("condor DAG {} jobs".format(jid+1))
         return s

+    @property
+    def lock(self):
+        return os.path.join(self.condor_dir, 'dag.lock')
+
     @property
     def has_lock(self):
-        lock = os.path.join(self.condor_dir, 'dag.lock')
-        return os.path.exists(lock)
+        return os.path.exists(self.lock)

     def write(self):
         if self.has_lock:
-            raise RuntimeError("DAG already running: {}".format(lock))
+            raise RuntimeError("DAG already running: {}".format(self.lock))
         shutil.rmtree(self.condor_dir)
         try:
             os.makedirs(os.path.join(self.condor_dir, 'logs'))
......@@ -187,7 +190,7 @@ RETRY {jid} 1
     def submit(self):
         assert os.path.exists(self.dag_path), "Must write() before submitting"
         if self.has_lock:
-            raise RuntimeError("DAG already running: {}".format(lock))
+            raise RuntimeError("DAG already running: {}".format(self.lock))
         logging.info("condor submit dag: {}".format(self.dag_path))
         subprocess.call(['condor_submit_dag', self.dag_path])
         print("""
......
......@@ -102,7 +102,8 @@ if __name__ == '__main__':
         format='%(asctime)s %(message)s'
     )
     stat_file = os.path.join(config.CONDOR_ONLINE_DIR, 'stat')
-    open(stat_file, 'w').close()
+    if not os.path.exists(stat_file):
+        open(stat_file, 'w').close()
     search.search_iterate(
         event_callback=analyze.analyze_condor,
         stat_file=stat_file,
......
......@@ -25,9 +25,6 @@ def register_plugin(func):
 from .discover import discover_data
 register_plugin(discover_data)

-from .history import find_previous_state
-register_plugin(find_previous_state)
-
 from .refine import refine_time
 register_plugin(refine_time)
......@@ -70,3 +67,6 @@ register_plugin(check_fss)
 # add last since this needs to wait for additional data
 from .seismic import check_seismic
 register_plugin(check_seismic)
+
+from .history import find_previous_state
+register_plugin(find_previous_state)
......
......@@ -141,12 +141,14 @@ def plot_indicators(event, params, refined_gps=None, threshold=None):
     fig.savefig(outpath)


-def find_transition(channel, segment, std_threshold, gps, minimum=None):
+def find_transition(channel, segment, std_threshold, max_time, minimum=None):
     """Find transition in channel

     `segment` is the time segment to search, `std_threshold` is the % std
-    from mean defining the transition threshold, `minimum` is an optional
-    minimum value that the channel will be checked against.
+    from mean defining the transition threshold, `max_time` is the event.gps
+    lockloss gps time from the GRD state transition used as a maximum
+    refined time, `minimum` is an optional minimum value that the channel
+    will be checked against.

     returns `threshold`, `refined_gps` tuple
......@@ -176,9 +178,9 @@ def find_transition(channel, segment, std_threshold, gps, minimum=None):
     else:
         if std_threshold > 0:
-            inds = np.where((buf.data > threshold) & (buf.tarray < gps))[0]
+            inds = np.where((buf.data > threshold) & (buf.tarray < max_time))[0]
         else:
-            inds = np.where((buf.data < threshold) & (buf.tarray < gps))[0]
+            inds = np.where((buf.data < threshold) & (buf.tarray < max_time))[0]

     if inds.any():
         ind = np.min(inds)
......