Commit 92f0d6b7 authored by Patrick Godwin, committed by ChiWai Chan
add offline workflow documentation

.. _cbc-analysis:

CBC Analysis (Offline)
========================
To start an offline CBC analysis, you'll need a configuration file that
specifies the start/end times to analyze, the input data products
(e.g. template bank, mass model), and other workflow-related options.
All of the steps below assume a Singularity container with the GstLAL software
stack installed. Other installation methods follow a similar procedure,
with one caveat: those workflows will not work on the
Open Science Grid (OSG).
Running Workflows
^^^^^^^^^^^^^^^^^^
1. Build Singularity image (optional)
""""""""""""""""""""""""""""""""""""""
NOTE: If you are using a reference Singularity container (suitable in most cases), you can skip this step.
The ``<image>`` throughout this doc refers to ``singularity-image`` specified in the ``condor`` section of your configuration.
If not using the reference Singularity container, say for local development, you can specify a path
to a local container and use that for the workflow (non-OSG).
To pull a container with gstlal installed, run:

.. code:: bash

   $ singularity build --sandbox --fix-perms <image-name> docker://containers.ligo.org/lscsoft/gstlal:master

2. Set up workflow
""""""""""""""""""""
First, we create a new analysis directory and switch to it:

.. code:: bash

   $ mkdir <analysis-dir>
   $ cd <analysis-dir>

Default configuration files and data files (template bank/mass model) for a variety of different banks
are contained in the `offline-configuration <https://git.ligo.org/gstlal/offline-configuration>`_ repository.
For example, to grab the configuration and data files for the BNS test bank:

.. code:: bash

   $ curl -O https://git.ligo.org/gstlal/offline-configuration/-/raw/main/bns-small/config.yml
   $ curl -O https://git.ligo.org/gstlal/offline-configuration/-/raw/main/bns-small/mass_model/mass_model_small.h5
   $ curl -O https://git.ligo.org/gstlal/offline-configuration/-/raw/main/bns-small/bank/gstlal_bank_small.xml.gz

Alternatively, one can clone the repository and copy files as needed into the analysis directory.
Now, we'll need to modify the configuration as needed to run the analysis. At the very least, set the start/end times and the instruments to analyze:

.. code-block:: yaml

   start: 1187000000
   stop: 1187100000
   instruments: H1L1

We also require template bank(s) and a mass model. Ensure these point to the right place in the configuration:

.. code-block:: yaml

   data:
     template-bank: gstlal_bank_small.xml.gz

.. code-block:: yaml

   prior:
     mass-model: mass_model_small.h5

If you're creating a summary page for results, you'll need to point at a location where they are web-viewable:

.. code-block:: yaml

   summary:
     webdir: /path/to/summary

If you're running on LIGO compute resources and your username doesn't match your albert.einstein username, you'll also need to specify the accounting group user so condor can track accounting information:

.. code-block:: yaml

   condor:
     accounting-group-user: albert.einstein

In addition, update the ``singularity-image`` in the ``condor`` section of your configuration if needed:

.. code-block:: yaml

   condor:
     singularity-image: /cvmfs/singularity.opensciencegrid.org/lscsoft/gstlal:master

If you're not using the reference Singularity image, you can replace this line with the full path to a local container.
For more detailed configuration options, take a look at the :ref:`configuration section <analysis-configuration>` below.
Once you have the configuration and data products needed, you can set up the Makefile from the configuration. We'll then use this Makefile for everything else, including generating the data files needed for the workflow, the workflow itself, the summary page, etc.

.. code:: bash

   $ gstlal_inspiral_workflow init -c config.yml

By default, this will generate the full workflow. If you want to only run the filtering step, a rerank, or an injection-only
workflow, you can instead specify the workflow as well, e.g.

.. code:: bash

   $ gstlal_inspiral_workflow init -c config.yml -w injection

for an injection-only workflow.
If you already have a Makefile and need to update it based on an updated configuration, run ``gstlal_inspiral_workflow`` with ``--force``.
Next, set up your proxy to ensure you can get access to LIGO data:

.. code:: bash

   $ ligo-proxy-init -p albert.einstein
   $ make x509_proxy

Note that we are running this step outside of Singularity. This is because ``ligo-proxy-init``
is not installed within the image currently.
If you haven't installed site-specific profiles yet, you can run:

.. code:: bash

   $ singularity exec <image> gstlal_grid_profile install

which will install site-specific configurations, e.g. ``ldas`` and ``ics``.
You can select which profile to use in the ``condor`` section:

.. code-block:: yaml

   condor:
     profile: ldas

To view which profiles are available, you can run:

.. code:: bash

   $ singularity exec <image> gstlal_grid_profile list

Note, you can install :ref:`custom profiles <install-custom-profiles>` as well.
Finally, set up the rest of the workflow, including the DAG for submission:

.. code:: bash

   $ singularity exec -B $TMPDIR <image> make dag

This should create condor DAGs for the workflow. Mounting a temporary directory
is important, as some of the steps leverage temporary space to generate files.

3. Launch workflows
"""""""""""""""""""""""""

To launch the workflow and submit the DAG to condor, run:

.. code:: bash

   $ make launch

You can monitor the DAG with Condor CLI tools such as ``condor_q``.
4. Generate Summary Page
"""""""""""""""""""""""""
After the DAG has completed, you can generate the summary page for the analysis:

.. code:: bash

   $ make summary

.. _analysis-configuration:

Configuration
^^^^^^^^^^^^^^

The top-level configuration consists of the analysis times and detector configuration:

.. code-block:: yaml

   start: 1187000000
   stop: 1187100000
   instruments: H1L1
   min-instruments: 1

These set the start and stop times of the analysis, plus the detectors to use
(H1=Hanford, L1=Livingston, V1=Virgo). The start and stop times are GPS times;
a convenient online converter is available at https://www.gw-openscience.org/gps/,
or you can use the ``gpstime`` program. Note that these start and stop times
have no knowledge of science-quality data; the actual science-quality data
analyzed are typically a subset of the total time.
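
As a rough illustration of the GPS convention (the epoch is 1980-01-06 00:00:00 UTC, and GPS time does not include leap seconds), here is a minimal Python sketch that converts a GPS time to an approximate UTC date. The hard-coded 18 s leap-second offset is an assumption valid for times after 2017, and this is not GstLAL code:

.. code-block:: python

   from datetime import datetime, timedelta

   GPS_EPOCH = datetime(1980, 1, 6)

   def gps_to_utc_approx(gps_seconds, leap_seconds=18):
       """Convert GPS seconds to approximate UTC (GPS ignores leap seconds)."""
       return GPS_EPOCH + timedelta(seconds=gps_seconds - leap_seconds)

   # The example start time above falls on 2017-08-17
   print(gps_to_utc_approx(1187000000))  # 2017-08-17 10:13:02
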
``min-instruments`` sets the minimum number of instruments we will allow to form
an event, e.g. setting it to 1 means the analysis will consider single detector
events, 2 means we will only consider events that are coincident across at least
2 detectors.

Section: Data
""""""""""""""

.. code-block:: yaml

   data:
     template-bank: bank/gstlal_bank_small.xml.gz
     analysis-dir: /path/to/analysis/dir

The ``template-bank`` option points to the template bank file. These
are XML files that follow the LIGOLW (LIGO light weight) schema. The template
bank in particular contains a table listing the parameters of all of the
templates; it does not contain the actual waveforms themselves. Metadata such as
the waveform approximant and the frequency cutoffs are also listed in this file.
The ``analysis-dir`` option is used if the user wishes to point to an existing
analysis to perform a rerank or an injection-only workflow. This grabs existing files
from this directory to seed the rerank/injection workflows.

Section: Source
""""""""""""""""

.. code-block:: yaml

   source:
     data-find-server: datafind.gw-openscience.org
     frame-type:
       H1: H1_GWOSC_O2_16KHZ_R1
       L1: L1_GWOSC_O2_16KHZ_R1
     channel-name:
       H1: GWOSC-16KHZ_R1_STRAIN
       L1: GWOSC-16KHZ_R1_STRAIN
     frame-segments-file: segments.xml.gz
     frame-segments-name: datasegments

The ``data-find-server`` option points to a server that is queried to find the
location of frame files. The address shown above is a publicly available server
that will return the locations of public frame files on cvmfs. Each frame file
has a type that describes the contents of the frame file, and may contain
multiple channels of data, hence the channel names must also be specified.
``frame-segments-file`` points to a LIGOLW XML file that describes the actual
times to analyze, i.e. it lists the times that science-quality data are
available. These files are general enough that they could describe different
types of data, so ``frame-segments-name`` is used to specify which segments to
consider. In practice, the segments file we produce will only contain the
segments we want. Users will typically not change any of these options once they
are set for a given instrument and observing run.

Section: PSD
""""""""""""""

.. code-block:: yaml

   psd:
     fft-length: 8

The PSD estimation method used by GstLAL is a modified median-Welch method that
is described in detail in Section IIB of Ref [1]. The FFT length sets the length
of each section that is Fourier transformed. The default whitener will use
zero-padding of one-fourth the FFT length on either side and will overlap
Fourier-transformed segments by one-fourth the FFT length. For example, an
``fft-length`` of 8 means that each Fourier transformed segment used in the PSD
estimation (and consequently the whitener) will contain 4 seconds of data with 2
seconds of zero padding on either side, and will overlap the next segment by 2
seconds (i.e. the last two seconds of data in one segment will be the first two
seconds of data in the following window).
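
The arithmetic above can be restated in a few lines of Python (a plain illustration of the stated relationships, not GstLAL code):

.. code-block:: python

   fft_length = 8                          # seconds, from the ``psd`` section

   zero_pad = fft_length // 4              # zero padding on either side: 2 s
   data_span = fft_length - 2 * zero_pad   # actual data per segment: 4 s
   overlap = fft_length // 4               # adjacent segments share 2 s of data

   print(zero_pad, data_span, overlap)     # 2 4 2
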

Section: SVD
""""""""""""""

.. code-block:: yaml

   svd:
     f-low: 20.0
     num-chi-bins: 1
     approximant:
       - 0:1000:TaylorF2
     tolerance: 0.9999
     max-f-final: 512.0
     sample-rate: 1024
     num-split-templates: 200
     overlap: 30
     num-banks: 5
     samples-min: 2048
     samples-max-64: 2048
     samples-max-256: 2048
     samples-max: 4096
     autocorrelation-length: 351
     manifest: svd_manifest.json

``f-low`` sets the lower frequency cutoff for the analysis in Hz.
``num-chi-bins`` is a tunable parameter related to the template bank binning
procedure; specifically, it sets the number of effective-spin parameter bins to
use in the chirp-mass / effective-spin binning procedure described in Sec. IID
and Fig. 6 of Ref. [1].
``approximant`` specifies the waveform approximant that should be used, along
with the chirp-mass bounds in which to use that approximant. ``0:1000:TaylorF2``
means: use the TaylorF2 approximant for waveforms from systems with chirp masses
between 0 and 1000 solar masses. Multiple approximants and chirp-mass bounds can
be provided.
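
For instance, a configuration that splits the bank between two approximants might look like the following (the chirp-mass boundary and the second approximant name are illustrative choices, not values from this analysis):

.. code-block:: yaml

   svd:
     approximant:
       - 0:1.73:TaylorF2
       - 1.73:1000:SEOBNRv4_ROM
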
``tolerance`` is a tunable parameter related to the truncation of SVD basis
vectors. A tolerance of 0.9999 means the targeted matched-filter inner-product
of the original waveform and the waveform reconstructed from the SVD is 0.9999.
``max-f-final`` sets the max frequency of the template.
``num-split-templates``, ``overlap``, and ``num-banks`` are tunable parameters
related to the SVD process. ``num-split-templates`` sets the number of templates
to decompose at a time; ``overlap`` sets the number of templates from adjacent
template bank regions used to pad the region being considered when computing
the SVD (this helps the performance of the SVD, and these pad templates
are not reconstructed); ``num-banks`` sets the number of sets of decomposed
templates to include in a given bin for the analysis. For example,
``num-split-templates`` of 200, ``overlap`` of 30, and ``num-banks`` of 5 means
that each SVD bank file will contain 5 decomposed sets of 200 templates, where
each SVD was computed using an additional 15 templates on either side of the 200
(as defined by the binning procedure).
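
The counts in that example can be checked with simple arithmetic (plain Python restating the relationships above, not GstLAL code):

.. code-block:: python

   num_split_templates = 200
   num_banks = 5
   overlap = 30

   pad_each_side = overlap // 2                          # 15 pad templates per side
   templates_per_svd = num_split_templates + overlap     # 230 templates fed to each SVD
   templates_per_file = num_banks * num_split_templates  # 1000 reconstructed templates per file

   print(pad_each_side, templates_per_svd, templates_per_file)  # 15 230 1000
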
``samples-min``, ``samples-max-64``, ``samples-max-256``, and ``samples-max``
are tunable parameters related to the template time-slicing procedure used by
GstLAL (described in Sec. IID and Fig. 7 of Ref. [1], and references therein).
Templates are sliced in time before the SVD is applied, and each time slice is
only sampled at the rate necessary for the highest frequency it contains
(rounded up to a power of 2). For example, the low-frequency part of a waveform
may only be sampled at 32 Hz, while the high-frequency part may be sampled at
2048 Hz (depending on user settings). ``samples-min`` sets the minimum number of
samples to use in any time slice. ``samples-max`` sets the maximum number of
samples to use in any time slice with a sample rate below 64 Hz;
``samples-max-64`` sets the maximum number of samples to use in any time slice
with sample rates between 64 Hz and 256 Hz; ``samples-max-256`` sets the maximum
number of samples to use in any time slice with a sample rate greater than
256 Hz.
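
The power-of-2 rounding of the per-slice sample rate can be sketched as follows (an illustrative helper, not part of GstLAL):

.. code-block:: python

   import math

   def slice_sample_rate(f_high):
       """Smallest power-of-2 sample rate satisfying Nyquist for f_high (in Hz)."""
       return 2 ** math.ceil(math.log2(2 * f_high))

   print(slice_sample_rate(14))   # 32
   print(slice_sample_rate(900))  # 2048
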
``autocorrelation-length`` sets the number of samples to use when computing the
autocorrelation-based test statistic, described in Sec. IIIC of Ref. [1].
``manifest`` sets the name of a file that will contain metadata about the
template bank bins.
Users will not typically change these options.

Section: Filter
""""""""""""""""

.. code-block:: yaml

   filter:
     fir-stride: 1
     coincidence-threshold: 0.01
     ht-gate-threshold: 0.8:15.0-45.0:100.0
     veto-segments-file: vetoes.xml.gz
     time-slide-file: tisi.xml
     injection-time-slide-file: inj_tisi.xml
     injections:
       bns:
         file: injections/bns_injections.xml
         range: 0.01:1000.0

``fir-stride`` is a tunable parameter related to the matched-filter procedure,
setting the length in seconds of the output of the matched-filter element.
``coincidence-threshold`` is the time in seconds to add to the light-travel time
when searching for coincidences between detectors.
``ht-gate-threshold`` sets the h(t) gate threshold as a function of chirp mass.
The h(t) gate threshold is a value over which the output of the whitener, plus
some padding, will be set to zero (as described in Sec. IIC of Ref. [1]).
``0.8:15.0-45.0:100.0`` means that a template bank bin whose maximum-chirp-mass
template is 0.8 solar masses will use a gate threshold of 15, a bank bin with a
max chirp mass of 100 solar masses will use a threshold of 45, and thresholds
for all other bins are given by a linear function between those two points.
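
A minimal sketch of that interpolation (this reflects our reading of the ``min-mchirp:min-threshold-max-threshold:max-mchirp`` string from the example above, not GstLAL's actual parser):

.. code-block:: python

   def ht_gate_threshold(max_mchirp, spec="0.8:15.0-45.0:100.0"):
       """Linearly interpolate the h(t) gate threshold for a bin's max chirp mass."""
       lo, hi = spec.split("-")
       m_lo, t_lo = (float(x) for x in lo.split(":"))  # (0.8 Msun, threshold 15)
       t_hi, m_hi = (float(x) for x in hi.split(":"))  # (threshold 45, 100 Msun)
       frac = (max_mchirp - m_lo) / (m_hi - m_lo)
       return t_lo + frac * (t_hi - t_lo)

   print(ht_gate_threshold(0.8))    # 15.0
   print(ht_gate_threshold(100.0))  # 45.0
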
``veto-segments-file`` sets the name of a LIGOLW XML file that contains any
vetoes used for the analysis; this file is required even if there are no vetoes.
``time-slide-file`` and ``injection-time-slide-file`` are LIGOLW XML files that
describe any time slides used in the analysis. A typical analysis will only
analyze injections with the zerolag “time slide” (i.e. the data are not slid in
time), and will consider the zerolag and one other time slide for the
non-injection analysis. The time slide is used to perform a blind sanity check
of the noise model.
``injections`` lists a set of injection files, each with their own label. In
this example, there is only one injection set, labeled ``bns``. ``file`` is a
relative path to the injection file (a LIGOLW XML file that contains the
parameters of the injections, but not the actual waveforms themselves).
``range`` sets the chirp-mass range that should be considered when searching for
this particular set of injections. Multiple injection files can be provided,
each with their own label, ``file``, and ``range``.
The only option here that a user will normally interact with is the injections
option.
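
For example, a ``filter`` section referencing two injection sets might look like this (the ``nsbh`` label, file name, and range are hypothetical):

.. code-block:: yaml

   filter:
     injections:
       bns:
         file: injections/bns_injections.xml
         range: 0.01:1000.0
       nsbh:
         file: injections/nsbh_injections.xml
         range: 1.0:1000.0
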

Section: Injections
""""""""""""""""""""

.. code-block:: yaml

   injections:
     expected-snr:
       f-low: 15.0
     sets:
       bns:
         f-low: 14.0
         seed: 72338
         time:
           step: 32
           interval: 1
           shift: 0
         waveform: SpinTaylorT4threePointFivePN
         mass-distr: componentMass
         mass1:
           min: 1.1
           max: 2.8
         mass2:
           min: 1.1
           max: 2.8
         spin1:
           min: 0
           max: 0.05
         spin2:
           min: 0
           max: 0.05
         distance:
           min: 10000
           max: 80000
         file: bns_injections.xml

The ``sets`` subsection is used to create injection sets to be used within the
analysis and referenced by name in the ``filter`` section. In ``sets``, the
injections are grouped by key. In this case, a single ``bns`` injection set
creates the ``bns_injections.xml`` file, which is then used in the
``injections`` subsection of the ``filter`` section.
Besides creating injection sets, the ``expected-snr`` subsection is used for the
expected SNR jobs. These settings are used to override defaults as needed.
In the case of multiple injection sets that need to be combined, one can add
a few options to create a combined file and reference that within the filter
jobs. This can be useful for large banks with a large set of templates. To
do this, one can add the following:

.. code-block:: yaml

   injections:
     combine: true
     combined-file: combined_injections.xml

The injections are generated by the ``lalapps_inspinj`` program, with the
following mapping between configuration options and command-line options:

* ``f-low``: ``--f-lower``
* ``seed``: ``--seed``
* ``time`` section: ``--time-step``, ``--time-interval``; ``shift`` adjusts the
  start time appropriately.
* ``waveform``: ``--waveform``
* ``mass-distr``: ``--m-distr``
* ``mass``/``spin``/``distance`` sections: map to options like ``--min-mass1``

Section: Prior
""""""""""""""""

.. code-block:: yaml

   prior:
     mass-model: model/mass_model_small.h5

``mass-model`` is a relative path to the file that contains the mass model. This
model is used to weight templates appropriately when assigning ranking
statistics based on our understanding of the astrophysical distribution of
signals. Users will not typically change this option.

Section: Rank
""""""""""""""""

.. code-block:: yaml

   rank:
     ranking-stat-samples: 4194304

``ranking-stat-samples`` sets the number of samples to draw from the noise model
when computing the distribution of log likelihood-ratios (the ranking statistic)
under the noise hypothesis. Users will not typically change this option.

Section: Summary
""""""""""""""""""

.. code-block:: yaml

   summary:
     webdir: /path/to/public_html/folder

``webdir`` sets the path of the output results webpages produced by the
analysis. Users will typically change this option for each analysis.

Section: Condor
""""""""""""""""""

.. code-block:: yaml

   condor:
     profile: osg-public
     accounting-group: ligo.dev.o3.cbc.uber.gstlaloffline
     singularity-image: /cvmfs/singularity.opensciencegrid.org/lscsoft/gstlal:master

``profile`` sets a base level of configuration options for condor.
``accounting-group`` sets accounting group details on LDG resources. Currently
the machinery to produce an analysis dag requires this option, but the option is
not actually used by analyses running on non-LDG resources.
``singularity-image`` sets the path of the container on cvmfs that the analysis
should use. Users will not typically change this option.

.. _install-custom-profiles:

Installing Custom Site Profiles
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can define a site profile as YAML. As an example, we can create a file called ``custom.yml``:

.. code-block:: yaml

   scheduler: condor
   requirements:
     - "(IS_GLIDEIN=?=True)"

Both the directives and requirements sections are optional.
To install one so it's available for use, run:

.. code:: bash

   $ singularity exec <image> gstlal_grid_profile install custom.yml