Commit 92f0d6b7 authored by Patrick Godwin, committed by ChiWai Chan
add offline workflow documentation

.. _cbc-analysis:

CBC Analysis (Offline)
========================
To start an offline CBC analysis, you'll need a configuration file that
specifies the start/end times to analyze, the input data products
(e.g. template bank, mass model), and other workflow-related options.
All of the steps below assume a Singularity container with the GstLAL software
stack installed. Other installation methods follow a similar procedure,
with one caveat: those workflows will not work on the
Open Science Grid (OSG).
Running Workflows
^^^^^^^^^^^^^^^^^^
1. Build Singularity image (optional)
""""""""""""""""""""""""""""""""""""""
NOTE: If you are using a reference Singularity container (suitable in most cases), you can skip this step.
The ``<image>`` throughout this doc refers to ``singularity-image`` specified in the ``condor`` section of your configuration.
If not using the reference Singularity container, say for local development, you can specify a path
to a local container and use that for the workflow (non-OSG).
To pull a container with gstlal installed, run:

.. code:: bash

   $ singularity build --sandbox --fix-perms <image-name> docker://containers.ligo.org/lscsoft/gstlal:master

2. Set up workflow
""""""""""""""""""""
First, we create a new analysis directory and switch to it:

.. code:: bash

   $ mkdir <analysis-dir>
   $ cd <analysis-dir>

Default configuration files and data files (template bank/mass model) for a variety of different banks
are contained in the `offline-configuration <https://git.ligo.org/gstlal/offline-configuration>`_ repository.
For example, to grab the configuration and data files for the BNS test bank:

.. code:: bash

   $ curl -O https://git.ligo.org/gstlal/offline-configuration/-/raw/main/bns-small/config.yml
   $ curl -O https://git.ligo.org/gstlal/offline-configuration/-/raw/main/bns-small/mass_model/mass_model_small.h5
   $ curl -O https://git.ligo.org/gstlal/offline-configuration/-/raw/main/bns-small/bank/gstlal_bank_small.xml.gz

Alternatively, one can clone the repository and copy files as needed into the analysis directory.
Now, we'll need to modify the configuration as needed to run the analysis. At the very least, set the start/end times and the instruments to analyze:

.. code-block:: yaml

   start: 1187000000
   stop: 1187100000
   instruments: H1L1

We also require template bank(s) and a mass model. Ensure these point to the right place in the configuration:

.. code-block:: yaml

   data:
     template-bank: gstlal_bank_small.xml.gz

.. code-block:: yaml

   prior:
     mass-model: mass_model_small.h5

If you're creating a summary page for results, you'll need to point at a location where they are web-viewable:

.. code-block:: yaml

   summary:
     webdir: /path/to/summary

If you're running on LIGO compute resources and your username doesn't match your albert.einstein username, you'll also need to specify the accounting group user so condor can track accounting information:

.. code-block:: yaml

   condor:
     accounting-group-user: albert.einstein

In addition, update the ``singularity-image`` in the ``condor`` section of your configuration if needed:

.. code-block:: yaml

   condor:
     singularity-image: /cvmfs/singularity.opensciencegrid.org/lscsoft/gstlal:master

If you're not using the reference Singularity image, you can replace this line with the full path to a local container.
For more detailed configuration options, take a look at the :ref:`configuration section <analysis-configuration>` below.
Once you have the configuration and data products needed, you can set up the Makefile from the configuration. We'll then use this Makefile for everything else, including generating the data files needed for the workflow, the workflow itself, the summary page, etc.

.. code:: bash

   $ gstlal_inspiral_workflow init -c config.yml

By default, this will generate the full workflow. If you want to only run the filtering step, a rerank, or an injection-only
workflow, you can instead specify the workflow as well, e.g.

.. code:: bash

   $ gstlal_inspiral_workflow init -c config.yml -w injection

for an injection-only workflow.
If you already have a Makefile and need to update it based on an updated configuration, run ``gstlal_inspiral_workflow`` with ``--force``.
Next, set up your proxy to ensure you can get access to LIGO data:

.. code:: bash

   $ ligo-proxy-init -p albert.einstein
   $ make x509_proxy

Note that we are running this step outside of Singularity. This is because ``ligo-proxy-init``
is not installed within the image currently.
If you haven't installed site-specific profiles yet, you can run:

.. code:: bash

   $ singularity exec <image> gstlal_grid_profile install

which will install site-specific configurations, e.g. ``ldas`` and ``ics``.
You can select which profile to use in the ``condor`` section:

.. code-block:: yaml

   condor:
     profile: ldas

To view which profiles are available, you can run:

.. code:: bash

   $ singularity exec <image> gstlal_grid_profile list

Note, you can install :ref:`custom profiles <install-custom-profiles>` as well.
Finally, set up the rest of the workflow, including the DAG for submission:

.. code:: bash

   $ singularity exec -B $TMPDIR <image> make dag

This should create condor DAGs for the workflow. Mounting a temporary directory
is important, as some of the steps leverage temporary space to generate files.

3. Launch workflows
"""""""""""""""""""""""""

To launch the workflow and submit the DAG to condor, run:

.. code:: bash

   $ make launch

You can monitor the DAG with Condor CLI tools such as ``condor_q``.
4. Generate Summary Page
"""""""""""""""""""""""""
After the DAG has completed, you can generate the summary page for the analysis:

.. code:: bash

   $ make summary

.. _analysis-configuration:

Configuration
^^^^^^^^^^^^^^

The top-level configuration consists of the analysis times and detector configuration:

.. code-block:: yaml

   start: 1187000000
   stop: 1187100000
   instruments: H1L1
   min-instruments: 1

These set the start and stop times of the analysis, plus the detectors to use
(H1=Hanford, L1=Livingston, V1=Virgo). The start and stop times are GPS times;
a convenient online converter is available at https://www.gw-openscience.org/gps/,
or you can use the ``gpstime`` program. Note that these start and stop times
have no knowledge of science-quality data; the actual science-quality data
analyzed are typically a subset of the total time.
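
As a rough illustration of the GPS convention (the epoch is 1980-01-06 00:00:00 UTC, and GPS time does not include leap seconds), here is a minimal Python sketch that converts a GPS time to an approximate UTC date. The hard-coded 18 s leap-second offset is an assumption valid for times after 2017, and this is not GstLAL code:

.. code-block:: python

   from datetime import datetime, timedelta

   GPS_EPOCH = datetime(1980, 1, 6)

   def gps_to_utc_approx(gps_seconds, leap_seconds=18):
       """Convert GPS seconds to approximate UTC (GPS ignores leap seconds)."""
       return GPS_EPOCH + timedelta(seconds=gps_seconds - leap_seconds)

   # The example start time above falls on 2017-08-17
   print(gps_to_utc_approx(1187000000))  # 2017-08-17 10:13:02
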
``min-instruments`` sets the minimum number of instruments we will allow to form
an event, e.g. setting it to 1 means the analysis will consider single detector
events, 2 means we will only consider events that are coincident across at least
2 detectors.

Section: Data
""""""""""""""

.. code-block:: yaml

   data:
     template-bank: bank/gstlal_bank_small.xml.gz
     analysis-dir: /path/to/analysis/dir

The ``template-bank`` option points to the template bank file. These
are XML files that follow the LIGOLW (LIGO light weight) schema. The template
bank in particular contains a table listing the parameters of all of the
templates; it does not contain the actual waveforms themselves. Metadata such as
the waveform approximant and the frequency cutoffs are also listed in this file.
The ``analysis-dir`` option is used if the user wishes to point to an existing
analysis to perform a rerank or an injection-only workflow. This grabs existing files
from this directory to seed the rerank/injection workflows.

Section: Source
""""""""""""""""

.. code-block:: yaml

   source:
     data-find-server: datafind.gw-openscience.org
     frame-type:
       H1: H1_GWOSC_O2_16KHZ_R1
       L1: L1_GWOSC_O2_16KHZ_R1
     channel-name:
       H1: GWOSC-16KHZ_R1_STRAIN
       L1: GWOSC-16KHZ_R1_STRAIN
     frame-segments-file: segments.xml.gz
     frame-segments-name: datasegments

The ``data-find-server`` option points to a server that is queried to find the
location of frame files. The address shown above is a publicly available server
that will return the locations of public frame files on cvmfs. Each frame file
has a type that describes the contents of the frame file, and may contain
multiple channels of data, hence the channel names must also be specified.
``frame-segments-file`` points to a LIGOLW XML file that describes the actual
times to analyze, i.e. it lists the times that science-quality data are
available. These files are general enough that they could describe different
types of data, so ``frame-segments-name`` is used to specify which segments to
consider. In practice, the segments file we produce will only contain the
segments we want. Users will typically not change any of these options once they
are set for a given instrument and observing run.

Section: PSD
""""""""""""""

.. code-block:: yaml

   psd:
     fft-length: 8

The PSD estimation method used by GstLAL is a modified median-Welch method that
is described in detail in Section IIB of Ref [1]. The FFT length sets the length
of each section that is Fourier transformed. The default whitener will use
zero-padding of one-fourth the FFT length on either side and will overlap
Fourier-transformed segments by one-fourth the FFT length. For example, an
``fft-length`` of 8 means that each Fourier transformed segment used in the PSD
estimation (and consequently the whitener) will contain 4 seconds of data with 2
seconds of zero padding on either side, and will overlap the next segment by 2
seconds (i.e. the last two seconds of data in one segment will be the first two
seconds of data in the following window).
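
The arithmetic above can be restated in a few lines of Python (a plain illustration of the stated relationships, not GstLAL code):

.. code-block:: python

   fft_length = 8                          # seconds, from the ``psd`` section

   zero_pad = fft_length // 4              # zero padding on either side: 2 s
   data_span = fft_length - 2 * zero_pad   # actual data per segment: 4 s
   overlap = fft_length // 4               # adjacent segments share 2 s of data

   print(zero_pad, data_span, overlap)     # 2 4 2
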

Section: SVD
""""""""""""""

.. code-block:: yaml

   svd:
     f-low: 20.0
     num-chi-bins: 1
     approximant:
       - 0:1000:TaylorF2
     tolerance: 0.9999
     max-f-final: 512.0
     sample-rate: 1024
     num-split-templates: 200
     overlap: 30
     num-banks: 5
     samples-min: 2048
     samples-max-64: 2048
     samples-max-256: 2048
     samples-max: 4096
     autocorrelation-length: 351
     manifest: svd_manifest.json

``f-low`` sets the lower frequency cutoff for the analysis in Hz.
``num-chi-bins`` is a tunable parameter related to the template bank binning
procedure; specifically, it sets the number of effective-spin parameter bins to
use in the chirp-mass / effective-spin binning procedure described in Sec. IID
and Fig. 6 of Ref. [1].
``approximant`` specifies the waveform approximant that should be used, along
with the chirp-mass bounds in which to use that approximant. ``0:1000:TaylorF2``
means: use the TaylorF2 approximant for waveforms from systems with chirp masses
between 0 and 1000 solar masses. Multiple approximants and chirp-mass bounds can
be provided.
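
For instance, a configuration that splits the bank between two approximants might look like the following (the chirp-mass boundary and the second approximant name are illustrative choices, not values from this analysis):

.. code-block:: yaml

   svd:
     approximant:
       - 0:1.73:TaylorF2
       - 1.73:1000:SEOBNRv4_ROM
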
``tolerance`` is a tunable parameter related to the truncation of SVD basis
vectors. A tolerance of 0.9999 means the targeted matched-filter inner-product
of the original waveform and the waveform reconstructed from the SVD is 0.9999.
``max-f-final`` sets the max frequency of the template.
``num-split-templates``, ``overlap``, and ``num-banks`` are tunable parameters
related to the SVD process. ``num-split-templates`` sets the number of templates
to decompose at a time; ``overlap`` sets the number of templates from adjacent
template bank regions used to pad the region being considered when computing
the SVD (this helps the performance of the SVD, and these pad templates
are not reconstructed); ``num-banks`` sets the number of sets of decomposed
templates to include in a given bin for the analysis. For example,
``num-split-templates`` of 200, ``overlap`` of 30, and ``num-banks`` of 5 means
that each SVD bank file will contain 5 decomposed sets of 200 templates, where
each SVD was computed using an additional 15 templates on either side of the 200
(as defined by the binning procedure).
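
The counts in that example can be checked with simple arithmetic (plain Python restating the relationships above, not GstLAL code):

.. code-block:: python

   num_split_templates = 200
   num_banks = 5
   overlap = 30

   pad_each_side = overlap // 2                          # 15 pad templates per side
   templates_per_svd = num_split_templates + overlap     # 230 templates fed to each SVD
   templates_per_file = num_banks * num_split_templates  # 1000 reconstructed templates per file

   print(pad_each_side, templates_per_svd, templates_per_file)  # 15 230 1000
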
``samples-min``, ``samples-max-64``, ``samples-max-256``, and ``samples-max``
are tunable parameters related to the template time-slicing procedure used by
GstLAL (described in Sec. IID and Fig. 7 of Ref. [1], and references therein).
Templates are sliced in time before the SVD is applied, and each time slice is
only sampled at the rate necessary for the highest frequency it contains
(rounded up to a power of 2). For example, the low-frequency part of a waveform
may only be sampled at 32 Hz, while the high-frequency part may be sampled at
2048 Hz (depending on user settings). ``samples-min`` sets the minimum number of
samples to use in any time slice. ``samples-max`` sets the maximum number of
samples to use in any time slice with a sample rate below 64 Hz;
``samples-max-64`` sets the maximum number of samples to use in any time slice
with sample rates between 64 Hz and 256 Hz; ``samples-max-256`` sets the maximum
number of samples to use in any time slice with a sample rate greater than
256 Hz.
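
The power-of-2 rounding of the per-slice sample rate can be sketched as follows (an illustrative helper, not part of GstLAL):

.. code-block:: python

   import math

   def slice_sample_rate(f_high):
       """Smallest power-of-2 sample rate satisfying Nyquist for f_high (in Hz)."""
       return 2 ** math.ceil(math.log2(2 * f_high))

   print(slice_sample_rate(14))   # 32
   print(slice_sample_rate(900))  # 2048
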
``autocorrelation-length`` sets the number of samples to use when computing the
autocorrelation-based test statistic, described in Sec. IIIC of Ref. [1].
``manifest`` sets the name of a file that will contain metadata about the
template bank bins.
Users will not typically change these options.

Section: Filter
""""""""""""""""

.. code-block:: yaml

   filter:
     fir-stride: 1
     coincidence-threshold: 0.01
     ht-gate-threshold: 0.8:15.0-45.0:100.0
     veto-segments-file: vetoes.xml.gz
     time-slide-file: tisi.xml
     injection-time-slide-file: inj_tisi.xml
     injections:
       bns:
         file: injections/bns_injections.xml
         range: 0.01:1000.0

``fir-stride`` is a tunable parameter related to the matched-filter procedure,
setting the length in seconds of the output of the matched-filter element.
``coincidence-threshold`` is the time in seconds to add to the light-travel time
when searching for coincidences between detectors.
``ht-gate-threshold`` sets the h(t) gate threshold as a function of chirp mass.
The h(t) gate threshold is a value over which the output of the whitener, plus
some padding, will be set to zero (as described in Sec. IIC of Ref. [1]).
``0.8:15.0-45.0:100.0`` means that a template bank bin whose maximum-chirp-mass
template is 0.8 solar masses will use a gate threshold of 15, a bank bin with a
max chirp mass of 100 solar masses will use a threshold of 45, and thresholds
for all other bins are given by a linear function between those two points.
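
A minimal sketch of that interpolation (this reflects our reading of the ``min-mchirp:min-threshold-max-threshold:max-mchirp`` string from the example above, not GstLAL's actual parser):

.. code-block:: python

   def ht_gate_threshold(max_mchirp, spec="0.8:15.0-45.0:100.0"):
       """Linearly interpolate the h(t) gate threshold for a bin's max chirp mass."""
       lo, hi = spec.split("-")
       m_lo, t_lo = (float(x) for x in lo.split(":"))  # (0.8 Msun, threshold 15)
       t_hi, m_hi = (float(x) for x in hi.split(":"))  # (threshold 45, 100 Msun)
       frac = (max_mchirp - m_lo) / (m_hi - m_lo)
       return t_lo + frac * (t_hi - t_lo)

   print(ht_gate_threshold(0.8))    # 15.0
   print(ht_gate_threshold(100.0))  # 45.0
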
``veto-segments-file`` sets the name of a LIGOLW XML file that contains any
vetoes used for the analysis; this file is required even if there are no vetoes.
``time-slide-file`` and ``injection-time-slide-file`` are LIGOLW XML files that
describe any time slides used in the analysis. A typical analysis will only
analyze injections with the zerolag “time slide” (i.e. the data are not slid in
time), and will consider the zerolag and one other time slide for the
non-injection analysis. The time slide is used to perform a blind sanity check
of the noise model.
``injections`` lists a set of injection files, each with their own label. In
this example, there is only one injection set, labeled ``bns``. ``file`` is a
relative path to the injection file (a LIGOLW XML file that contains the
parameters of the injections, but not the actual waveforms themselves).
``range`` sets the chirp-mass range that should be considered when searching for
this particular set of injections. Multiple injection files can be provided,
each with their own label, ``file``, and ``range``.
The only option here that a user will normally interact with is the injections
option.
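
For example, a ``filter`` section referencing two injection sets might look like this (the ``nsbh`` label, file name, and range are hypothetical):

.. code-block:: yaml

   filter:
     injections:
       bns:
         file: injections/bns_injections.xml
         range: 0.01:1000.0
       nsbh:
         file: injections/nsbh_injections.xml
         range: 1.0:1000.0
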

Section: Injections
""""""""""""""""""""

.. code-block:: yaml

   injections:
     expected-snr:
       f-low: 15.0
     sets:
       bns:
         f-low: 14.0
         seed: 72338
         time:
           step: 32
           interval: 1
           shift: 0
         waveform: SpinTaylorT4threePointFivePN
         mass-distr: componentMass
         mass1:
           min: 1.1
           max: 2.8
         mass2:
           min: 1.1
           max: 2.8
         spin1:
           min: 0
           max: 0.05
         spin2:
           min: 0
           max: 0.05
         distance:
           min: 10000
           max: 80000
         file: bns_injections.xml

The ``sets`` subsection is used to create injection sets to be used within the
analysis and referenced by name in the ``filter`` section. In ``sets``, the
injections are grouped by key. In this case, a single ``bns`` injection set
creates the ``bns_injections.xml`` file, which is then used in the
``injections`` subsection of the ``filter`` section.
Besides creating injection sets, the ``expected-snr`` subsection is used for the
expected SNR jobs. These settings are used to override defaults as needed.
In the case of multiple injection sets that need to be combined, one can add
a few options to create a combined file and reference that within the filter
jobs. This can be useful for large banks with a large set of templates. To
do this, one can add the following:

.. code-block:: yaml

   injections:
     combine: true
     combined-file: combined_injections.xml

The injections are generated by the ``lalapps_inspinj`` program, with the
following mapping between configuration options and command-line options:

* ``f-low``: ``--f-lower``
* ``seed``: ``--seed``
* ``time`` section: ``--time-step``, ``--time-interval``; ``shift`` adjusts the
  start time appropriately.
* ``waveform``: ``--waveform``
* ``mass-distr``: ``--m-distr``
* ``mass``/``spin``/``distance`` sections: map to options like ``--min-mass1``

Section: Prior
""""""""""""""""

.. code-block:: yaml

   prior:
     mass-model: model/mass_model_small.h5

``mass-model`` is a relative path to the file that contains the mass model. This
model is used to weight templates appropriately when assigning ranking
statistics based on our understanding of the astrophysical distribution of
signals. Users will not typically change this option.

Section: Rank
""""""""""""""""

.. code-block:: yaml

   rank:
     ranking-stat-samples: 4194304

``ranking-stat-samples`` sets the number of samples to draw from the noise model
when computing the distribution of log likelihood-ratios (the ranking statistic)
under the noise hypothesis. Users will not typically change this option.

Section: Summary
""""""""""""""""""

.. code-block:: yaml

   summary:
     webdir: /path/to/public_html/folder

``webdir`` sets the path of the output results webpages produced by the
analysis. Users will typically change this option for each analysis.

Section: Condor
""""""""""""""""""

.. code-block:: yaml

   condor:
     profile: osg-public
     accounting-group: ligo.dev.o3.cbc.uber.gstlaloffline
     singularity-image: /cvmfs/singularity.opensciencegrid.org/lscsoft/gstlal:master

``profile`` sets a base level of configuration options for condor.
``accounting-group`` sets accounting group details on LDG resources. Currently
the machinery to produce an analysis dag requires this option, but the option is
not actually used by analyses running on non-LDG resources.
``singularity-image`` sets the path of the container on cvmfs that the analysis
should use. Users will not typically change this option.

.. _install-custom-profiles:

Installing Custom Site Profiles
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can define a site profile as YAML. As an example, we can create a file called ``custom.yml``:

.. code-block:: yaml

   scheduler: condor
   requirements:
     - "(IS_GLIDEIN=?=True)"

Both the directives and requirements sections are optional.
To install one so it's available for use, run:

.. code:: bash

   $ singularity exec <image> gstlal_grid_profile install custom.yml