
locklost: aLIGO IFO lock loss tracking and analysis

This package provides a set of tools for analyzing LIGO detector "lock losses". It consists of four main components:

  • search for detector lock losses in past data based on guardian state transitions.
  • analyze individual lock losses to generate plots and look for identifying features.
  • online search to monitor for lock losses and automatically run follow-up analyses.
  • web interface to view lock loss event pages.

A command line interface is provided to launch the online analysis and condor jobs to search for and analyze events.

Usage

To start/stop/restart the online analysis use the online command:

$ locklost online start

This launches a condor job that runs the online analysis.
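
You can check that the job was submitted and is running with standard HTCondor tools (a generic HTCondor check, not part of locklost itself):

$ condor_q -nobatch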

To launch a condor job to search for lock losses within some time window:

$ locklost search --condor START END

This will find lock losses within the specified time range, but will not run the follow-up analyses. This is primarily needed to backfill times when the online analysis was not running (see below).

Any time argument ('START', 'END', 'TIME', etc.) can be either a GPS time or a full (even relative) date/time string, e.g. '1 week ago'.
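
For example, both of the following are valid ways to specify a search window (the GPS times here are purely illustrative):

$ locklost search --condor 1240000000 1240604800
$ locklost search --condor '1 week ago' now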

To run a full analysis on a specific lock loss time found from the search above:

$ locklost analyze TIME

To launch a condor job to analyze all un-analyzed events within a time range:

$ locklost analyze --condor START END

To re-analyze events, add the --rerun flag, e.g.:

$ locklost analyze TIME --rerun

or

$ locklost analyze --condor START END --rerun

It has happened that analysis jobs are improperly killed by condor, without a chance to clean up their run locks. The site locklost deployments include a command line utility to find and remove any old, stale analysis locks:

$ find-locks -r

Analysis plugins

Lock loss event analysis consists of a set of follow-up "plugin" analyses, located in the locklost.plugins sub-package. Each follow-up module is registered in locklost/plugins/__init__.py. Some of the currently enabled follow-ups are:

  • discover.discover_data: wait for data to be available
  • refine.refine_event: refine event time
  • saturations.find_saturations: find saturating channels before the event
  • lpy.find_lpy: find length/pitch/yaw oscillations in suspensions
  • glitch.analyze_glitches: look for glitches around the event
  • overflows.find_overflows: look for ADC overflows
  • state_start.find_lock_start: find the start of the lock leading to the current lock loss

Each plugin does its own analysis, although some depend on the output of other analyses. The output from any analysis (e.g. plots or data) should be written into the event directory.
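
As an illustration, a minimal plugin following these conventions might look like the sketch below; the event interface used here (event.path, event.gps) is assumed for the example and may differ from the actual locklost API:

import os

def find_example_feature(event):
    """Hypothetical follow-up: write a summary file into the event directory."""
    # A real plugin would fetch data around the lock loss time
    # (event.gps) and generate plots or data products here.
    outfile = os.path.join(event.path, 'example_feature.txt')
    with open(outfile, 'w') as f:
        f.write("analyzed event at GPS {}\n".format(event.gps))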

Site deployments

Each site (LHO and LLO) has a dedicated "lockloss" account on its local LDAS cluster where locklost runs. These accounts have deployments of the latest locklost package release, run the production online and analysis condor jobs, and host the web pages.

deploying new versions

When a new version is ready for release, create an annotated tag for the release and push it to the main repo (https://git.ligo.org/jameson.rollins/locklost):

$ git tag -m release 0.16
$ git push --tags

In the LDAS lockloss account, pull the new tag and run the test/deploy script:

$ ssh lockloss@detchar.ligo-la.caltech.edu
$ cd ~/src/locklost
$ git pull
$ locklost-deploy

If there are changes to the online search, restart the online condor process; otherwise, analysis changes should get picked up automatically in new condor executions:

$ locklost online restart

common issues

There are a couple of issues that crop up periodically, usually due to problems with the site LDAS clusters where the jobs run.

online analysis restarting due to NDS problems

One of the most common problems is the online analysis falling over because it can't connect to the site NDS server, for example:

2021-05-12_17:18:31 2021-05-12 17:18:31,317 NDS connect: 10.21.2.4:31200
2021-05-12_17:18:31 Error in write(): Connection refused
2021-05-12_17:18:31 Error in write(): Connection refused
2021-05-12_17:18:31 Traceback (most recent call last):
2021-05-12_17:18:31   File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
2021-05-12_17:18:31     "__main__", mod_spec)
2021-05-12_17:18:31   File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
2021-05-12_17:18:31     exec(code, run_globals)
2021-05-12_17:18:31   File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.3-py3.6.egg/locklost/online.py", line 109, in <module>
2021-05-12_17:18:31     stat_file=stat_file,
2021-05-12_17:18:31   File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.3-py3.6.egg/locklost/search.py", line 72, in search_iterate
2021-05-12_17:18:31     for bufs in data.nds_iterate([channel], start_end=segment):
2021-05-12_17:18:31   File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.3-py3.6.egg/locklost/data.py", line 48, in nds_iterate
2021-05-12_17:18:31     with closing(nds_connection()) as conn:
2021-05-12_17:18:31   File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.3-py3.6.egg/locklost/data.py", line 28, in nds_connection
2021-05-12_17:18:31     conn = nds2.connection(HOST, PORT)
2021-05-12_17:18:31   File "/usr/lib64/python3.6/site-packages/nds2.py", line 3172, in __init__
2021-05-12_17:18:31     _nds2.connection_swiginit(self, _nds2.new_connection(*args))
2021-05-12_17:18:31 RuntimeError: Failed to establish a connection[INFO: Error occurred trying to write to socket]

This is always because the NDS server itself has died and needs to be restarted/reset (frequently due to Tuesday maintenance). Unfortunately, the site admins aren't necessarily aware of this issue and need to be poked about it.
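
A quick way to confirm whether the server is back is to try a connection by hand with the nds2 client (host and port taken from the log above; adjust for your site):

import nds2

try:
    conn = nds2.connection('10.21.2.4', 31200)
    print('NDS server reachable')
    conn.close()
except RuntimeError as e:
    print('NDS connection failed: {}'.format(e))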

Once the NDS server is back, the job should just pick up on its own. If the notifications are too much, you can stop the online analysis until the server is back up:

$ locklost online stop

Don't forget to restart once things are fully recovered:

$ locklost online start

analyze jobs failing because of cluster data problems

Another common failure mode is follow-up analysis jobs failing due to data access problems in the cluster. These often occur during Tuesday maintenance, but sometimes mysteriously at other times as well. These kinds of failures are indicated by the following exceptions in the event analyze log:

2021-05-07 16:02:51,722 [analyze.analyze_event] exception in discover_data:
Traceback (most recent call last):
  File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.0-py3.6.egg/locklost/analyze.py", line 56, in analyze_event
    func(event)
  File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.0-py3.6.egg/locklost/plugins/discover.py", line 67, in discover_data
    raise RuntimeError("data discovery timeout reached, data not found")
RuntimeError: data discovery timeout reached, data not found

or:

2021-05-07 16:02:51,750 [analyze.analyze_event] exception in find_previous_state:
Traceback (most recent call last):
  File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.0-py3.6.egg/locklost/analyze.py", line 56, in analyze_event
    func(event)
  File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.0-py3.6.egg/locklost/plugins/history.py", line 26, in find_previous_state
    gbuf = data.fetch(channels, segment)[0]
  File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.0-py3.6.egg/locklost/data.py", line 172, in fetch
    bufs = func(channels, start, stop)
  File "/home/lockloss/.local/lib/python3.6/site-packages/locklost-0.21.0-py3.6.egg/locklost/data.py", line 150, in frame_fetch_gwpy
    data = gwpy.timeseries.TimeSeriesDict.find(channels, start, stop, frametype=config.IFO+'_R')
  File "/usr/lib/python3.6/site-packages/gwpy/timeseries/core.py", line 1291, in find
    on_gaps="error" if pad is None else "warn",
  File "/usr/lib/python3.6/site-packages/gwpy/io/datafind.py", line 335, in wrapped
    return func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/gwpy/io/datafind.py", line 642, in find_urls
    on_gaps=on_gaps)
  File "/usr/lib/python3.6/site-packages/gwdatafind/http.py", line 433, in find_urls
    raise RuntimeError(msg)
RuntimeError: Missing segments: 
[1304463181 ... 1304463182)

This problem sometimes just corrects itself, but often needs admin poking as well.
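
To check by hand whether the data has become available, you can repeat the same gwpy query the analysis makes; a minimal sketch, using the missing segment from the log above (channel name and frametype are illustrative):

import gwpy.timeseries

# segment taken from the "Missing segments" message above;
# the channel and frametype are illustrative examples
data = gwpy.timeseries.TimeSeriesDict.find(
    ['L1:GRD-ISC_LOCK_STATE_N'],
    1304463181, 1304463182,
    frametype='L1_R',
)
print(data)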

back-filling events

After things have recovered from any of the issues mentioned above, you'll want to back-fill any missed events. The best way to do that is to run a condor search for missed events (e.g. from "4 weeks ago" until "now"):

$ locklost search --condor '4 weeks ago' now

followed by a condor analyze to analyze any newly found events:

$ locklost analyze --condor '4 weeks ago' now

The analyze command will analyze both missed events and "failed" events.

Make sure to wait until the event search job is finished before you run the event analyze job, so you don't miss analyzing any new events.

Developing and contributing

See the CONTRIBUTING file for instructions on how to contribute to locklost.