Commit 8f3ea65f, authored 3 years ago by Jameson Rollins
update README to add a common issues section
Parent: f9cfd7b7
Branch: mass-ordering-check
README.md: 112 additions, 0 deletions
@@ -124,6 +124,118 @@ in new condor executions:
```shell
$ locklost online restart
```
## common issues
There are a couple of issues that crop up periodically, usually due to
problems with the site LDAS clusters where the jobs run.
### online analysis restarting due to NDS problems
One of the most common problems is the online analysis falling over
because it can't connect to the site NDS server, for example:
```
2021-05-12_17:18:31 2021-05-12 17:18:31,317 NDS connect: 10.21.2.4:31200
2021-05-12_17:18:31 Error in write(): Connection refused
2021-05-12_17:18:31 Error in write(): Connection refused
2021-05-12_17:18:31 Traceback (most recent call last):
2021-05-12_17:18:31 File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
2021-05-12_17:18:31 "__main__", mod_spec)
2021-05-12_17:18:31 File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
2021-05-12_17:18:31 exec(code, run_globals)
2021-05-12_17:18:31 File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.3-py3.6.egg/locklost/online.py", line 109, in <module>
2021-05-12_17:18:31 stat_file=stat_file,
2021-05-12_17:18:31 File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.3-py3.6.egg/locklost/search.py", line 72, in search_iterate
2021-05-12_17:18:31 for bufs in data.nds_iterate([channel], start_end=segment):
2021-05-12_17:18:31 File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.3-py3.6.egg/locklost/data.py", line 48, in nds_iterate
2021-05-12_17:18:31 with closing(nds_connection()) as conn:
2021-05-12_17:18:31 File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.3-py3.6.egg/locklost/data.py", line 28, in nds_connection
2021-05-12_17:18:31 conn = nds2.connection(HOST, PORT)
2021-05-12_17:18:31 File "/usr/lib64/python3.6/site-packages/nds2.py", line 3172, in __init__
2021-05-12_17:18:31 _nds2.connection_swiginit(self, _nds2.new_connection(*args))
2021-05-12_17:18:31 RuntimeError: Failed to establish a connection[INFO: Error occurred trying to write to socket]
```
This is always because the NDS server itself has died and needs to be
restarted/reset (frequently due to Tuesday maintenance).
Unfortunately the site admins aren't necessarily aware of this issue
and need to be poked about it.
Once the NDS server is back, the job should just pick up on its own.
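
Before poking the admins, it can help to confirm that the NDS port really is refusing connections. The following is a minimal sketch, not part of locklost: it assumes the host and port from the `NDS connect` log line above, and `nds_port_open` is a hypothetical helper name. A successful TCP connect only shows that something is listening on the port, not that the NDS server is healthy.

```python
import socket

# Host and port copied from the "NDS connect" log line above.
# Assumption: adjust these for your site.
NDS_HOST = "10.21.2.4"
NDS_PORT = 31200


def nds_port_open(host=NDS_HOST, port=NDS_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Connection refused", timeouts, and unreachable hosts.
        return False
```

Calling `nds_port_open()` with no arguments checks the address from the log; a `False` result matches the "Connection refused" errors shown above.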
If the notifications are too much, you can stop the online analysis
until the server is back up:
```shell
$ locklost online stop
```
Don't forget to restart once things are fully recovered:
```shell
$ locklost online start
```
### analyze jobs failing because of cluster data problems
Another common failure mode is follow-up analysis jobs failing because
of data access problems in the cluster. These often occur during
Tuesday maintenance, but also mysteriously at other times. These
kinds of failures are indicated by the following exceptions in the
event analyze log:
```
2021-05-07 16:02:51,722 [analyze.analyze_event] exception in discover_data:
Traceback (most recent call last):
File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.0-py3.6.egg/locklost/analyze.py", line 56, in analyze_event
func(event)
File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.0-py3.6.egg/locklost/plugins/discover.py", line 67, in discover_data
raise RuntimeError("data discovery timeout reached, data not found")
RuntimeError: data discovery timeout reached, data not found
```
or:
```
2021-05-07 16:02:51,750 [analyze.analyze_event] exception in find_previous_state:
Traceback (most recent call last):
File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.0-py3.6.egg/locklost/analyze.py", line 56, in analyze_event
func(event)
File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.0-py3.6.egg/locklost/plugins/history.py", line 26, in find_previous_state
gbuf = data.fetch(channels, segment)[0]
File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.0-py3.6.egg/locklost/data.py", line 172, in fetch
bufs = func(channels, start, stop)
File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.0-py3.6.egg/locklost/data.py", line 150, in frame_fetch_gwpy
data = gwpy.timeseries.TimeSeriesDict.find(channels, start, stop, frametype=config.IFO+'_R')
File "/usr/lib/python3.6/site-packages/gwpy/timeseries/core.py", line 1291, in find
on_gaps="error" if pad is None else "warn",
File "/usr/lib/python3.6/site-packages/gwpy/io/datafind.py", line 335, in wrapped
return func(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/gwpy/io/datafind.py", line 642, in find_urls
on_gaps=on_gaps)
File "/usr/lib/python3.6/site-packages/gwdatafind/http.py", line 433, in find_urls
raise RuntimeError(msg)
RuntimeError: Missing segments:
[1304463181 ... 1304463182)
```
This problem sometimes just corrects itself, but often needs admin
poking as well.
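
When triaging many logs, it can be handy to sort them by which of the known failures they contain. This is a rough sketch that matches the error strings from the example logs in this section; the labels and the `classify_failures` helper are made up for illustration and are not part of locklost.

```python
# Substrings copied from the example logs in this section, mapped to
# short labels. The labels themselves are made up for this sketch.
SIGNATURES = {
    "Failed to establish a connection": "nds-connection",
    "data discovery timeout reached, data not found": "data-discovery-timeout",
    "Missing segments:": "missing-segments",
}


def classify_failures(log_text):
    """Return the sorted labels of all known failure signatures in log_text."""
    return sorted(
        label for needle, label in SIGNATURES.items() if needle in log_text
    )
```

An empty list means none of the known signatures were found, so the failure is probably something else and worth reading in full.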
## back-filling events
After things have recovered from any of the issues mentioned above,
you'll want to back-fill any missed events. The best way to do that
is to run a condor `search` for missed events (e.g. from "4 weeks ago"
until "now"):
```shell
$ locklost search --condor '4 weeks ago' now
```
followed by a condor `analyze` to analyze any newly found events:
```shell
$ locklost analyze --condor '4 weeks ago' now
```
The `analyze` command will analyze both missed events and "failed"
events.
Make sure to wait until the event search job is finished before you
run the event analyze job, so you don't miss analyzing any new events.
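
The ordering above (search first, wait, then analyze) could be wrapped in a small script. This is only a sketch under assumptions: it shells out to the two commands shown above, and `search_finished` is a hypothetical callable supplied by the caller that blocks until the condor search job is done. How to detect that is deliberately left open here, since this section does not say.

```python
import subprocess


def backfill_commands(start="4 weeks ago", end="now"):
    """Build the search and analyze command lines shown above."""
    search = ["locklost", "search", "--condor", start, end]
    analyze = ["locklost", "analyze", "--condor", start, end]
    return search, analyze


def backfill(start, end, search_finished, run=subprocess.run):
    """Run the search, wait until it is reported finished, then analyze.

    search_finished is a hypothetical callable that blocks until the
    condor search job is done; how to detect that is left to the caller.
    """
    search, analyze = backfill_commands(start, end)
    run(search, check=True)
    search_finished()
    run(analyze, check=True)
```

One simple choice for `search_finished` is a manual prompt, e.g. `lambda: input("Press Enter once the search job has finished: ")`, which keeps the operator in the loop.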
# Developing and contributing
...