diff --git a/doc/source/cbc_analysis.rst b/doc/source/cbc_analysis.rst
index 0396d5c63f10871d67f0ce58193574e2cb4be45b..9b0996355ded8974f03639140b63c96248351be5 100644
--- a/doc/source/cbc_analysis.rst
+++ b/doc/source/cbc_analysis.rst
@@ -1,6 +1,545 @@
.. _cbc-analysis:

CBC Analysis (Offline)
========================

To start an offline CBC analysis, you'll need a configuration file that points
at the start/end times to analyze, the input data products (e.g. template bank,
mass model), and other workflow-related configuration.

All of the steps below assume a Singularity container with the GstLAL software
stack installed. Other installation methods follow a similar procedure, with
one caveat: workflows will not run on the Open Science Grid (OSG).

Running Workflows
^^^^^^^^^^^^^^^^^^

1. Build Singularity image (optional)
""""""""""""""""""""""""""""""""""""""

NOTE: If you are using a reference Singularity container (suitable in most cases), you can skip this step.
The ``<image>`` throughout this doc refers to the ``singularity-image`` specified in the ``condor`` section of your configuration.

If you are not using the reference Singularity container, say for local development, you can specify a path
to a local container and use that for the (non-OSG) workflow.

To build a container with GstLAL installed, run:

.. code:: bash

   $ singularity build --sandbox --fix-perms <image-name> docker://containers.ligo.org/lscsoft/gstlal:master

2. Set up workflow
""""""""""""""""""""

First, we create a new analysis directory and switch to it:

.. code:: bash

   $ mkdir <analysis-dir>
   $ cd <analysis-dir>

Default configuration files and data files (template bank/mass model) for a variety of different banks
are contained in the `offline-configuration <https://git.ligo.org/gstlal/offline-configuration>`_ repository.

For example, to grab the configuration and data files for the BNS test bank:

.. code:: bash

   $ curl -O https://git.ligo.org/gstlal/offline-configuration/-/raw/main/bns-small/config.yml
   $ curl -O https://git.ligo.org/gstlal/offline-configuration/-/raw/main/bns-small/mass_model/mass_model_small.h5
   $ curl -O https://git.ligo.org/gstlal/offline-configuration/-/raw/main/bns-small/bank/gstlal_bank_small.xml.gz

Alternatively, one can clone the repository and copy files as needed into the analysis directory.

Now, we'll need to modify the configuration as needed to run the analysis. At the very least, set the start/end times and the instruments to run over:

.. code-block:: yaml

    start: 1187000000
    stop: 1187100000

    instruments: H1L1

We also require template bank(s) and a mass model. Ensure these are pointed to the right place in the configuration:

.. code-block:: yaml

    data:
      template-bank: gstlal_bank_small.xml.gz

.. code-block:: yaml

    prior:
      mass-model: mass_model_small.h5

If you're creating a summary page for results, you'll need to point at a location where they are web-viewable:

.. code-block:: yaml

    summary:
      webdir: /path/to/summary

If you're running on LIGO compute resources and your local username doesn't match your albert.einstein username, you'll also need to specify the accounting group user for condor to track accounting information:

.. code-block:: yaml

    condor:
      accounting-group-user: albert.einstein

In addition, update the ``singularity-image`` in the ``condor`` section of your configuration if needed:

.. code-block:: yaml

    condor:
      singularity-image: /cvmfs/singularity.opensciencegrid.org/lscsoft/gstlal:master

If you are not using the reference Singularity image, you can replace this line with the full path to a local container.

For more detailed configuration options, take a look at the `configuration section <analysis-configuration>` below.
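Putting the snippets above together, a minimal ``config.yml`` might look like the following sketch. The values follow the BNS test-bank example from this page; the ``webdir`` path and the accounting-group user are placeholders you would replace with your own:

```yaml
# Minimal example configuration assembled from the snippets above.
# Times, instruments, and file names follow the BNS test-bank example.
start: 1187000000
stop: 1187100000

instruments: H1L1

data:
  template-bank: gstlal_bank_small.xml.gz

prior:
  mass-model: mass_model_small.h5

summary:
  webdir: /path/to/summary

condor:
  accounting-group-user: albert.einstein
  singularity-image: /cvmfs/singularity.opensciencegrid.org/lscsoft/gstlal:master
```
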

Once you have the configuration and data products needed, you can set up the Makefile from the configuration.
We'll then use this Makefile for everything else, including the data files needed for the workflow, the workflow itself,
the summary page, etc.

.. code:: bash

   $ gstlal_inspiral_workflow init -c config.yml

By default, this will generate the full workflow. If you only want to run the filtering step, a rerank, or an injection-only
workflow, you can specify the workflow type instead, e.g.

.. code:: bash

   $ gstlal_inspiral_workflow init -c config.yml -w injection

for an injection-only workflow.

If you already have a Makefile and need to update it based on an updated configuration, run ``gstlal_inspiral_workflow`` with ``--force``.

Next, set up your proxy to ensure you can get access to LIGO data:

.. code:: bash

   $ ligo-proxy-init -p albert.einstein
   $ make x509_proxy

Note that we are running this step outside of Singularity. This is because ``ligo-proxy-init``
is not currently installed within the image.

If you haven't installed site-specific profiles yet, you can run:

.. code:: bash

   $ singularity exec <image> gstlal_grid_profile install

which will install site-specific configurations such as ``ldas`` and ``ics``.
You can select which profile to use in the ``condor`` section:

.. code-block:: yaml

    condor:
      profile: ldas

To view which profiles are available, you can run:

.. code:: bash

   $ singularity exec <image> gstlal_grid_profile list

Note that you can install `custom profiles <install-custom-profiles>` as well.

Finally, set up the rest of the workflow, including the DAG for submission:

.. code:: bash

   $ singularity exec -B $TMPDIR <image> make dag

This should create condor DAGs for the workflow. Mounting a temporary directory
is important, as some of the steps use temporary space to generate files.

3. Launch workflows
"""""""""""""""""""""""""

..
code:: bash

   $ make launch

You can monitor the DAG with Condor CLI tools such as ``condor_q``.

4. Generate Summary Page
"""""""""""""""""""""""""

After the DAG has completed, you can generate the summary page for the analysis:

.. code:: bash

   $ make summary


.. _analysis-configuration:

Configuration
^^^^^^^^^^^^^^

The top-level configuration consists of the analysis times and detector configuration:

.. code-block:: yaml

    start: 1187000000
    stop: 1187100000

    instruments: H1L1
    min-instruments: 1

These set the start and stop times of the analysis, plus the detectors to use
(H1=Hanford, L1=Livingston, V1=Virgo). The start and stop times are GPS times;
a convenient online converter is available at https://www.gw-openscience.org/gps/,
and the ``gpstime`` program can be used as well. Note that these start and stop
times have no knowledge of science-quality data; the science-quality data
actually analyzed are typically a subset of the total time.

``min-instruments`` sets the minimum number of instruments we will allow to form
an event, e.g. setting it to 1 means the analysis will consider single-detector
events, while 2 means we will only consider events that are coincident across at
least 2 detectors.

Section: Data
""""""""""""""

.. code-block:: yaml

    data:
      template-bank: bank/gstlal_bank_small.xml.gz
      analysis-dir: /path/to/analysis/dir

The ``template-bank`` option points to the template bank file. These
are xml files that follow the LIGOLW (LIGO light weight) schema. The template
bank in particular contains a table that lists the parameters of all of the
templates; it does not contain the actual waveforms themselves. Metadata such as
the waveform approximant and the frequency cutoffs are also listed in this file.

The ``analysis-dir`` option is used if the user wishes to point to an existing
analysis to perform a rerank or an injection-only workflow.
This grabs existing files
from this directory to seed the rerank/injection workflows.

Section: Source
""""""""""""""""

.. code-block:: yaml

    source:
      data-find-server: datafind.gw-openscience.org
      frame-type:
        H1: H1_GWOSC_O2_16KHZ_R1
        L1: L1_GWOSC_O2_16KHZ_R1
      channel-name:
        H1: GWOSC-16KHZ_R1_STRAIN
        L1: GWOSC-16KHZ_R1_STRAIN
      frame-segments-file: segments.xml.gz
      frame-segments-name: datasegments

The ``data-find-server`` option points to a server that is queried to find the
locations of frame files. The address shown above is a publicly available server
that will return the locations of public frame files on cvmfs. Each frame file
has a type that describes its contents, and may contain multiple channels of
data, hence the channel names must also be specified.
``frame-segments-file`` points to a LIGOLW xml file that describes the actual
times to analyze, i.e. it lists the times that science-quality data are
available. These files are generalized enough that they could describe different
types of data, so ``frame-segments-name`` is used to specify which segment to
consider. In practice, the segments file we produce will only contain the
segments we want. Users will typically not change any of these options once they
are set for a given instrument and observing run.

Section: PSD
""""""""""""""

.. code-block:: yaml

    psd:
      fft-length: 8

The PSD estimation method used by GstLAL is a modified median-Welch method that
is described in detail in Section IIB of Ref. [1]. The FFT length sets the length
of each section that is Fourier transformed. The default whitener will use
zero-padding of one-fourth the FFT length on either side and will overlap
Fourier-transformed segments by one-fourth the FFT length.
For example, an
``fft-length`` of 8 means that each Fourier-transformed segment used in the PSD
estimation (and consequently the whitener) will contain 4 seconds of data with 2
seconds of zero padding on either side, and will overlap the next segment by 2
seconds (i.e. the last two seconds of data in one segment will be the first two
seconds of data in the following window).

Section: SVD
""""""""""""""

.. code-block:: yaml

    svd:
      f-low: 20.0
      num-chi-bins: 1
      approximant:
        - 0:1000:TaylorF2
      tolerance: 0.9999
      max-f-final: 512.0
      sample-rate: 1024
      num-split-templates: 200
      overlap: 30
      num-banks: 5
      samples-min: 2048
      samples-max-64: 2048
      samples-max-256: 2048
      samples-max: 4096
      autocorrelation-length: 351
      manifest: svd_manifest.json

``f-low`` sets the lower frequency cutoff for the analysis in Hz.

``num-chi-bins`` is a tunable parameter related to the template bank binning
procedure; specifically, it sets the number of effective spin parameter bins to
use in the chirp-mass / effective spin binning procedure described in Sec. IID
and Fig. 6 of Ref. [1].

``approximant`` specifies the waveform approximant that should be used, along
with the chirp-mass bounds in which to use that approximant. ``0:1000:TaylorF2``
means use the TaylorF2 approximant for waveforms from systems with chirp masses
between 0 and 1000 solar masses. Multiple approximants and chirp-mass bounds can
be provided.

``tolerance`` is a tunable parameter related to the truncation of SVD basis
vectors. A tolerance of 0.9999 means the targeted matched-filter inner-product
of the original waveform and the waveform reconstructed from the SVD is 0.9999.

``max-f-final`` sets the maximum frequency of the templates.

``num-split-templates``, ``overlap``, and ``num-banks`` are tunable parameters
related to the SVD process.
``num-split-templates`` sets the number of templates
to decompose at a time; ``overlap`` sets the number of templates from adjacent
template bank regions used to pad the region being considered in order to actually
compute the SVD (this helps the performance of the SVD, and these pad templates
are not reconstructed); ``num-banks`` sets the number of sets of decomposed
templates to include in a given bin for the analysis. For example,
``num-split-templates`` of 200, ``overlap`` of 30, and ``num-banks`` of 5 means
that each SVD bank file will contain 5 decomposed sets of 200 templates, where
the SVD was computed using an additional 15 templates on either side of the 200
(as defined by the binning procedure).

``samples-min``, ``samples-max-64``, ``samples-max-256``, and ``samples-max``
are tunable parameters related to the template time-slicing procedure used by
GstLAL (described in Sec. IID and Fig. 7 of Ref. [1], and references therein).
Templates are sliced in time before the SVD is applied, and only sampled at the
rate necessary for the highest frequency in each time slice (rounded up to a
power of 2). For example, the low frequency part of a waveform may only be
sampled at 32 Hz, while the high frequency part may be sampled at 2048 Hz
(depending on user settings). ``samples-min`` sets the minimum number of samples
to use in any time slice. ``samples-max`` sets the maximum number of samples to
use in any time slice with a sample rate below 64 Hz; ``samples-max-64`` sets
the maximum number of samples to use in any time slice with sample rates between
64 Hz and 256 Hz; ``samples-max-256`` sets the maximum number of samples to use
in any time slice with a sample rate greater than 256 Hz.

``autocorrelation-length`` sets the number of samples to use when computing the
autocorrelation-based test-statistic, described in Sec. IIIC of Ref. [1].

``manifest`` sets the name of a file that will contain metadata about the
template bank bins.
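As a concrete illustration of the sample-cap rules above, here is a minimal Python sketch (not the GstLAL implementation; the function name is hypothetical) that maps a time slice's sample rate to its maximum sample count, using the default values from the configuration shown above:

```python
# Hypothetical illustration of the samples-max rules described above.
# Defaults match the svd section of the example configuration.
def max_samples_per_slice(sample_rate,
                          samples_max=4096,
                          samples_max_64=2048,
                          samples_max_256=2048):
    """Return the cap on samples for a time slice sampled at `sample_rate` Hz."""
    if sample_rate < 64:
        # slices sampled below 64 Hz use samples-max
        return samples_max
    elif sample_rate <= 256:
        # slices sampled between 64 Hz and 256 Hz use samples-max-64
        return samples_max_64
    else:
        # slices sampled above 256 Hz use samples-max-256
        return samples_max_256
```

With the defaults above, a slowly sampled slice (e.g. 32 Hz) may contain up to 4096 samples, while higher-rate slices are capped at 2048.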

Users will not typically change these options.

Section: Filter
""""""""""""""""

.. code-block:: yaml

    filter:
      fir-stride: 1
      coincidence-threshold: 0.01
      ht-gate-threshold: 0.8:15.0-45.0:100.0
      veto-segments-file: vetoes.xml.gz
      time-slide-file: tisi.xml
      injection-time-slide-file: inj_tisi.xml
      injections:
        bns:
          file: injections/bns_injections.xml
          range: 0.01:1000.0

``fir-stride`` is a tunable parameter related to the matched-filter procedure,
setting the length in seconds of the output of the matched-filter element.

``coincidence-threshold`` is the time in seconds to add to the light-travel time
when searching for coincidences between detectors.

``ht-gate-threshold`` sets the h(t) gate threshold as a function of chirp mass.
The h(t) gate threshold is a value over which the output of the whitener, plus
some padding, will be set to zero (as described in Sec. IIC of Ref. [1]).
``0.8:15.0-45.0:100.0`` means that a template bank bin that has a max chirp-mass
template of 0.8 solar masses will use a gate threshold of 15, a bank bin with a
max chirp-mass of 100 will use a threshold of 45, and all other thresholds are
described by a linear function between those two points.

``veto-segments-file`` sets the name of a LIGOLW xml file that contains any
vetoes used for the analysis; this file must be provided even if there are no
vetoes.

``time-slide-file`` and ``injection-time-slide-file`` are LIGOLW xml files that
describe any time slides used in the analysis. A typical analysis will only
analyze injections with the zerolag "time slide" (i.e. the data are not slid in
time), and will consider the zerolag and one other time slide for the
non-injection analysis. The time slide is used to perform a blind sanity check
of the noise model.

``injections`` lists a set of injections, each with their own label. In this
example, there is only one injection set, and it is labeled ``bns``.
``file`` is a
relative path to the injection file (a LIGOLW xml file that contains the
parameters of the injections, but not the actual waveforms themselves). ``range``
sets the chirp-mass range that should be considered when searching for this
particular set of injections. Multiple injection files can be provided, each
with their own label, file, and range.

The only option here that a user will normally interact with is the
``injections`` option.

Section: Injections
""""""""""""""""""""

.. code-block:: yaml

    injections:
      expected-snr:
        f-low: 15.0
      sets:
        bns:
          f-low: 14.0
          seed: 72338
          time:
            step: 32
            interval: 1
            shift: 0
          waveform: SpinTaylorT4threePointFivePN
          mass-distr: componentMass
          mass1:
            min: 1.1
            max: 2.8
          mass2:
            min: 1.1
            max: 2.8
          spin1:
            min: 0
            max: 0.05
          spin2:
            min: 0
            max: 0.05
          distance:
            min: 10000
            max: 80000
          file: bns_injections.xml

The ``sets`` subsection is used to create injection sets to be used within the
analysis and referenced by name in the ``filter`` section. In ``sets``, the
injections are grouped by key. In this case, there is one ``bns`` injection set,
which creates the ``bns_injections.xml`` file used in the ``injections`` option
of the ``filter`` section.

Besides creating injection sets, the ``expected-snr`` subsection is used for the
expected SNR jobs. These settings are used to override defaults as needed.

In the case of multiple injection sets that need to be combined, one can add
a few options to create a combined file and reference that within the filter
jobs. This can be useful for large banks with a large set of templates. To
do this, one can add the following:

..
code-block:: yaml

    injections:
      combine: true
      combined-file: combined_injections.xml

The injections are generated with the ``lalapps_inspinj`` program, with
the following mapping between configuration and command line options:

* ``f-low``: ``--f-lower``
* ``seed``: ``--seed``
* ``time`` section: ``--time-step``, ``--time-interval``; ``shift`` adjusts the
  start time appropriately.
* ``waveform``: ``--waveform``
* ``mass-distr``: ``--m-distr``
* ``mass/spin/distance`` sections: map to options like ``--min-mass1``

Section: Prior
""""""""""""""""

.. code-block:: yaml

    prior:
      mass-model: model/mass_model_small.h5

``mass-model`` is a relative path to the file that contains the mass model. This
model is used to weight templates appropriately when assigning ranking
statistics, based on our understanding of the astrophysical distribution of
signals. Users will not typically change this option.

Section: Rank
""""""""""""""""

.. code-block:: yaml

    rank:
      ranking-stat-samples: 4194304

``ranking-stat-samples`` sets the number of samples to draw from the noise model
when computing the distribution of log likelihood-ratios (the ranking statistic)
under the noise hypothesis. Users will not typically change this option.

Section: Summary
""""""""""""""""""

.. code-block:: yaml

    summary:
      webdir: /path/to/public_html/folder

``webdir`` sets the path of the results webpages produced by the
analysis. Users will typically change this option for each analysis.

Section: Condor
""""""""""""""""""

.. code-block:: yaml

    condor:
      profile: osg-public
      accounting-group: ligo.dev.o3.cbc.uber.gstlaloffline
      singularity-image: /cvmfs/singularity.opensciencegrid.org/lscsoft/gstlal:master

``profile`` sets a base level of configuration options for condor.

``accounting-group`` sets accounting group details on LDG resources.
Currently,
the machinery to produce an analysis DAG requires this option, but the option is
not actually used by analyses running on non-LDG resources.

``singularity-image`` sets the path of the container on cvmfs that the analysis
should use. Users will not typically change this option.

.. _install-custom-profiles:

Installing Custom Site Profiles
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can define a site profile as YAML. As an example, we can create a file called ``custom.yml``:

.. code-block:: yaml

    scheduler: condor
    requirements:
      - "(IS_GLIDEIN=?=True)"

Both the directives and requirements sections are optional.

To install one so it's available for use, run:

.. code:: bash

   $ singularity exec <image> gstlal_grid_profile install custom.yml
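Once installed, a custom profile is selected the same way as the built-in ones, via the ``condor`` section of your analysis configuration. The sketch below assumes the installed profile is named after its file, i.e. ``custom`` for ``custom.yml``; check ``gstlal_grid_profile list`` for the actual registered name:

```yaml
# Assumption: the profile installed from custom.yml is registered as "custom".
condor:
  profile: custom
```
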