Commit cedd780f authored by Patrick Godwin

add tutorials and overview for feature extraction docs

parent d3b1e609
Feature Extraction
====================================================================================================
The `fxtools` module and related feature-based executables contain the libraries used to identify
glitches in low latency using auxiliary channel data.
`gstlal_feature_extractor` functions as a modeled search for data quality by applying matched filtering
on auxiliary channel timeseries, using waveforms that model a large number of glitch classes. Its primary
purpose is to whiten incoming auxiliary channels and extract relevant features in low latency.
`gstlal_feature_extractor` can produce output in two different modes:

1. **Timeseries:** Regularly spaced feature rows, each containing the SNR, waveform parameters,
   and the time of the loudest event within a sampling time interval.
2. **ETG:** Output resembling that of a traditional event trigger generator (ETG), in which only
   feature rows above an SNR threshold are produced.
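As an illustration of the difference between the two modes, the sketch below builds hypothetical
feature rows (the field names are assumptions, not the exact schema) and shows how ETG mode
amounts to thresholding the timeseries output:

.. code-block:: python

   # Minimal sketch contrasting the two output modes. Field names
   # (time, channel, snr, frequency, q) are illustrative assumptions,
   # not the exact schema used by gstlal_feature_extractor.

   # Timeseries mode: one row per channel per sampling interval,
   # holding the loudest (max-SNR) feature in that interval.
   timeseries_rows = [
       {"time": 1187000000.00, "channel": "H1:AUX_CHANNEL_1", "snr": 4.2, "frequency": 128.0, "q": 20.0},
       {"time": 1187000000.25, "channel": "H1:AUX_CHANNEL_1", "snr": 9.7, "frequency": 256.0, "q": 20.0},
   ]

   # ETG mode: keep only rows whose SNR exceeds a threshold.
   SNR_THRESHOLD = 5.5
   etg_rows = [row for row in timeseries_rows if row["snr"] > SNR_THRESHOLD]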
One useful feature of the matched filter approach to detecting glitches is the ability to switch between
different glitch templates, or to generate a heterogeneous bank of templates. Currently, Sine-Gaussian
and half-Sine-Gaussian waveforms are implemented for use in detecting glitches, but the feature extractor
was designed to be fairly modular, so it isn't difficult to design and add new waveforms.
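As a rough illustration (not the project's exact implementation), a half-Sine-Gaussian can be
generated as a sine-Gaussian truncated at its envelope peak, so that the peak falls on the final
sample; the parameter conventions below are assumptions:

.. code-block:: python

   import numpy as np

   def half_sine_gaussian(f0, q, rate=4096):
       """Sketch of a half-Sine-Gaussian: a sine-Gaussian truncated at its
       envelope peak, so the loudest part of the waveform sits at the
       template's edge (t = 0). Conventions here are illustrative."""
       tau = q / (2 * np.pi * f0)              # Gaussian envelope width (assumed convention)
       t = np.arange(-6 * tau, 0, 1.0 / rate)  # samples up to the peak only
       return np.exp(-(t / tau) ** 2) * np.sin(2 * np.pi * f0 * t)

   template = half_sine_gaussian(f0=128.0, q=20.0)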
Since the GstLAL feature extractor uses time-domain convolution to matched-filter auxiliary channel
timeseries with glitch waveforms, its latencies can be much lower than in traditional ETGs. The latency
upon writing features to disk is O(5 s) in the current layout when using waveforms whose peak occurs at
the edge of the template (zero-latency templates). Otherwise, extra latency is incurred due to the
non-causal nature of the waveform itself.
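The extra latency for non-causal waveforms can be thought of as the gap between a template's peak
and its final sample. A hedged sketch of that bookkeeping (not gstlal's internal accounting):

.. code-block:: python

   import numpy as np

   def template_latency(template, rate):
       """Seconds between a template's peak and its final sample: the extra
       filter latency incurred beyond a zero-latency (edge-peaked) template."""
       peak = int(np.argmax(np.abs(template)))
       return (len(template) - 1 - peak) / float(rate)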
.. graphviz::

   digraph llpipe {
       labeljust = "r";
       label = "gstlal_feature_extractor";
       rankdir = LR;
       graph [fontname="Roman", fontsize=24];
       edge [fontname="Roman", fontsize=10];
       node [fontname="Roman", shape=box, fontsize=11];

       subgraph clusterNodeN {
           style = rounded;
           label = "gstreamer pipeline";
           labeljust = "r";
           fontsize = 14;

           H1L1src [label="H1(L1) data source:\n mkbasicmultisrc()", color=red4];

           Aux1 [label="Auxiliary channel 1", color=red4];
           Aux2 [label="Auxiliary channel 2", color=green4];
           AuxN [label="Auxiliary channel N", color=magenta4];

           Multirate1 [label="Auxiliary channel 1\nWhiten/Downsample", color=red4];
           Multirate2 [label="Auxiliary channel 2\nWhiten/Downsample", color=green4];
           MultirateN [label="Auxiliary channel N\nWhiten/Downsample", color=magenta4];

           FilterBankAux1Rate1 [label="Auxiliary Channel 1:\nGlitch Filter Bank", color=red4];
           FilterBankAux1Rate2 [label="Auxiliary Channel 1:\nGlitch Filter Bank", color=red4];
           FilterBankAux1RateN [label="Auxiliary Channel 1:\nGlitch Filter Bank", color=red4];
           FilterBankAux2Rate1 [label="Auxiliary Channel 2:\nGlitch Filter Bank", color=green4];
           FilterBankAux2Rate2 [label="Auxiliary Channel 2:\nGlitch Filter Bank", color=green4];
           FilterBankAux2RateN [label="Auxiliary Channel 2:\nGlitch Filter Bank", color=green4];
           FilterBankAuxNRate1 [label="Auxiliary Channel N:\nGlitch Filter Bank", color=magenta4];
           FilterBankAuxNRate2 [label="Auxiliary Channel N:\nGlitch Filter Bank", color=magenta4];
           FilterBankAuxNRateN [label="Auxiliary Channel N:\nGlitch Filter Bank", color=magenta4];

           TriggerAux1Rate1 [label="Auxiliary Channel 1:\nMax SNR Feature (N Hz)", color=red4];
           TriggerAux1Rate2 [label="Auxiliary Channel 1:\nMax SNR Feature (N Hz)", color=red4];
           TriggerAux1RateN [label="Auxiliary Channel 1:\nMax SNR Feature (N Hz)", color=red4];
           TriggerAux2Rate1 [label="Auxiliary Channel 2:\nMax SNR Feature (N Hz)", color=green4];
           TriggerAux2Rate2 [label="Auxiliary Channel 2:\nMax SNR Feature (N Hz)", color=green4];
           TriggerAux2RateN [label="Auxiliary Channel 2:\nMax SNR Feature (N Hz)", color=green4];
           TriggerAuxNRate1 [label="Auxiliary Channel N:\nMax SNR Feature (N Hz)", color=magenta4];
           TriggerAuxNRate2 [label="Auxiliary Channel N:\nMax SNR Feature (N Hz)", color=magenta4];
           TriggerAuxNRateN [label="Auxiliary Channel N:\nMax SNR Feature (N Hz)", color=magenta4];

           H1L1src -> Aux1;
           H1L1src -> Aux2;
           H1L1src -> AuxN;

           Aux1 -> Multirate1;
           Aux2 -> Multirate2;
           AuxN -> MultirateN;

           Multirate1 -> FilterBankAux1Rate1 [label="4096Hz"];
           Multirate2 -> FilterBankAux2Rate1 [label="4096Hz"];
           MultirateN -> FilterBankAuxNRate1 [label="4096Hz"];
           Multirate1 -> FilterBankAux1Rate2 [label="2048Hz"];
           Multirate2 -> FilterBankAux2Rate2 [label="2048Hz"];
           MultirateN -> FilterBankAuxNRate2 [label="2048Hz"];
           Multirate1 -> FilterBankAux1RateN [label="Nth-pow-of-2 Hz"];
           Multirate2 -> FilterBankAux2RateN [label="Nth-pow-of-2 Hz"];
           MultirateN -> FilterBankAuxNRateN [label="Nth-pow-of-2 Hz"];

           FilterBankAux1Rate1 -> TriggerAux1Rate1;
           FilterBankAux1Rate2 -> TriggerAux1Rate2;
           FilterBankAux1RateN -> TriggerAux1RateN;
           FilterBankAux2Rate1 -> TriggerAux2Rate1;
           FilterBankAux2Rate2 -> TriggerAux2Rate2;
           FilterBankAux2RateN -> TriggerAux2RateN;
           FilterBankAuxNRate1 -> TriggerAuxNRate1;
           FilterBankAuxNRate2 -> TriggerAuxNRate2;
           FilterBankAuxNRateN -> TriggerAuxNRateN;
       }

       Synchronize [label="Synchronize buffers by timestamp"];
       Extract [label="Extract features from buffer"];
       Save [label="Save triggers to disk"];
       Kafka [label="Push features to queue"];

       TriggerAux1Rate1 -> Synchronize;
       TriggerAux1Rate2 -> Synchronize;
       TriggerAux1RateN -> Synchronize;
       TriggerAux2Rate1 -> Synchronize;
       TriggerAux2Rate2 -> Synchronize;
       TriggerAux2RateN -> Synchronize;
       TriggerAuxNRate1 -> Synchronize;
       TriggerAuxNRate2 -> Synchronize;
       TriggerAuxNRateN -> Synchronize;

       Synchronize -> Extract;
       Extract -> Save [label="Option 1"];
       Extract -> Kafka [label="Option 2"];
   }
**Highlights:**

* Launch feature extractor jobs in online or offline mode:

  * Online: using /shm or the framexmit protocol
  * Offline: reading frames off disk

* Online/offline DAGs are available for launching jobs.

* The offline DAG parallelizes by time; channels are processed sequentially in subsets to reduce
  I/O concurrency issues. There are options to allow flexibility in choosing this, however.

* On-the-fly PSD generation (or take in a prespecified PSD).

* Auxiliary channels to be processed can be specified in two ways:

  * A channel list .INI file, provided by DetChar. This provides ways to filter channels by safety
    and subsystem.
  * A channel list .txt file, one line per channel in the form H1:CHANNEL_NAME:2048.

* Configurable min/max frequency bands for auxiliary channel processing, in powers of two. The
  default is 32-2048 Hz.

* Verbose latency output at various stages of the pipeline. If regular verbosity is specified,
  latencies are given only when files are written to disk.

* Various file transfer/saving options:

  * Disk: HDF5
  * Transfer: Kafka (used for the low-latency implementation)

* Various waveform configuration options (see the sketch after this list):

  * Waveform type (currently Sine-Gaussian and half-Sine-Gaussian only)
  * Parameter ranges (frequency, Q for Sine-Gaussian-based waveforms)
  * Minimum mismatch between templates
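To make the mismatch option concrete, one plausible scheme (an illustration only, not necessarily
the bank placement gstlal uses) spaces Q geometrically so that neighboring templates differ by a
step tied to the allowed mismatch; the step-size formula below is an assumption:

.. code-block:: python

   import numpy as np

   def q_values(q_min, q_max, mismatch):
       """Geometrically space Q between q_min and q_max so neighboring
       templates differ by a step tied to the allowed mismatch. The step
       size is an illustrative choice, not gstlal's exact placement."""
       step = 2.0 * np.sqrt(mismatch)  # assumed fractional spacing in log Q
       n = int(np.ceil(np.log(q_max / q_min) / step)) + 1
       return np.geomspace(q_min, q_max, n)

   print(q_values(q_min=4.0, q_max=40.0, mismatch=0.05))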
Running Offline Jobs
####################################################################################################
An offline DAG is provided in /gstlal-burst/share/feature_extractor/Makefile.gstlal_feature_extractor_offline
as a convenient way to launch offline feature extraction jobs. A condensed list of instructions for use is
also provided within the Makefile itself.
For general use cases, the only configuration options that need to be changed are:

* User/accounting tags: GROUP_USER, ACCOUNTING_TAG
* Analysis times: START, STOP
* Data ingestion: IFO, CHANNEL_LIST
* Waveform parameters: WAVEFORM, MISMATCH, QHIGH
Launching DAGs
====================================================================================================
In order to start up offline runs, you'll need an installation of gstlal. An installation Makefile that
includes Kafka dependencies is located at gstlal/gstlal-burst/share/feature_extractor/Makefile.gstlal_idq_icc.

To generate a DAG, making sure that the correct environment is sourced, run::

   $ make -f Makefile.gstlal_feature_extractor_offline
Then launch the DAG with::

   $ condor_submit_dag feature_extractor_pipe.dag
Configuration options
====================================================================================================
Analysis times:

* START: set the analysis GPS start time.
* STOP: set the analysis GPS stop time.

Data ingestion:

* IFO: select the IFO for auxiliary channels to be ingested (H1/L1).
* CHANNEL_LIST: a list of channels for the feature extractor to process. Lists for O1/O2 and
  H1/L1 are provided in gstlal/gstlal-burst/share/feature_extractor.
* MAX_SERIAL_STREAMS: maximum number of streams that a single gstlal_feature_extractor job will
  process at once. This is determined by summing the number of rates over all channels; the number
  of rates for a given channel is log2(max_rate/min_rate) + 1. See the sketch after this list.
* MAX_PARALLEL_STREAMS: maximum number of streams that a single job will run over its lifespan.
  This is distinct from serial streams: when a job is first launched, it caches auxiliary channel
  frames containing all channels that meet this criterion, and then processes each channel subset
  sequentially, as determined by the serial streams. This saves on input I/O.
* CONCURRENCY: the maximum number of concurrent reads from the same frame file. For most purposes,
  it should be set to 1. Use this at your own risk.
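A minimal sketch of this stream counting, assuming a dict mapping each channel to its native
sampling rate and a common minimum rate (the channel names and rates are illustrative):

.. code-block:: python

   import math

   def stream_count(channel_rates, min_rate=32):
       """Total number of (channel, rate) streams: each channel contributes
       log2(max_rate / min_rate) + 1 rates, summed over all channels.
       channel_rates maps channel name -> native max rate (an assumption)."""
       return sum(
           int(math.log2(max_rate / min_rate)) + 1
           for max_rate in channel_rates.values()
       )

   channels = {"H1:AUX_CHANNEL_1": 2048, "H1:AUX_CHANNEL_2": 4096}
   print(stream_count(channels))  # 7 + 8 = 15 streams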
Waveform parameters:

* WAVEFORM: type of waveform used to perform matched filtering (sine_gaussian/half_sine_gaussian).
* MISMATCH: maximum mismatch between templates (corresponding to Omicron's mismatch definition).
* QHIGH: maximum value of Q.

Data transfer/saving:

* OUTPATH: directory in which to save features.
* SAVE_CADENCE: span of a typical dataset within an HDF5 file.
* PERSIST_CADENCE: span of a typical HDF5 file.
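To make the two cadences concrete, the sketch below (file layout, naming, and values are
assumptions, not the exact on-disk schema) writes one dataset per SAVE_CADENCE span into HDF5
files that each cover PERSIST_CADENCE seconds:

.. code-block:: python

   import h5py
   import numpy as np

   SAVE_CADENCE = 20       # seconds per dataset (illustrative value)
   PERSIST_CADENCE = 200   # seconds per HDF5 file (illustrative value)

   def write_span(gps_start, features):
       """Append one SAVE_CADENCE span of features to the HDF5 file whose
       PERSIST_CADENCE window contains it. Layout is an assumption."""
       file_start = gps_start - (gps_start % PERSIST_CADENCE)
       with h5py.File("features-%d-%d.h5" % (file_start, PERSIST_CADENCE), "a") as f:
           f.create_dataset(str(gps_start), data=features)

   write_span(1187000020, np.zeros((10, 4)))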
Setting the number of streams (ADVANCED USAGE)
====================================================================================================
NOTE: this won't need to be changed for almost all use cases, and the current configuration has been
optimized to aim for short run times.

Definition: the target number of streams (N_channels x N_rates_per_channel) that each CPU will process.

* If max_serial_streams > max_parallel_streams, all jobs will be parallelized by channel.
* If max_parallel_streams > the number of channels in the channel list, all jobs will be processed
  serially, with processing driven by max_serial_streams.
* Any other combination will produce a mix of parallelization by channel and serial processing of
  channels within each job.

The combination of MAX_SERIAL_STREAMS, MAX_PARALLEL_STREAMS, and CONCURRENCY entirely determines the
structure of the offline DAG. It also changes the memory usage of each job, so tread lightly. Changing
CONCURRENCY in particular may cause I/O locks due to jobs fighting to read from the same frame file.
A sketch of the resulting partitioning follows.
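The sketch below shows how the two stream limits could partition channels into DAG jobs (outer
lists) and serial subsets within a job (inner lists); it is illustrative logic under stated
assumptions, not the DAG generator's exact algorithm:

.. code-block:: python

   def partition_streams(channels, rates_per_channel, max_parallel, max_serial):
       """Partition channels into jobs (outer lists) and serial subsets
       within each job (inner lists), given per-channel stream counts."""
       chans_per_job = max(1, max_parallel // rates_per_channel)    # channels cached per job
       chans_per_subset = max(1, max_serial // rates_per_channel)   # channels processed at once
       jobs = []
       for i in range(0, len(channels), chans_per_job):
           job = channels[i:i + chans_per_job]
           jobs.append([job[j:j + chans_per_subset]
                        for j in range(0, len(job), chans_per_subset)])
       return jobs

   channels = ["H1:AUX_%d" % i for i in range(8)]
   print(partition_streams(channels, rates_per_channel=7, max_parallel=28, max_serial=14))
   # -> 2 jobs of 4 channels, each processed in 2 serial subsets of 2 channels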
Running Online Jobs
####################################################################################################
An online DAG is provided in /gstlal-burst/share/feature_extractor/Makefile.gstlal_feature_extractor_online
as a convenient way to launch online feature extraction jobs, as well as auxiliary jobs as needed
(synchronizer/HDF5 file sinks). A condensed list of instructions for use is also provided within the
Makefile itself.
There are four separate modes that can be used to launch online jobs:

1. Auxiliary channel ingestion:

   a. Reading from the framexmit protocol (DATA_SOURCE=framexmit).
      This mode is recommended when reading in live data from LHO/LLO.
   b. Reading from shared memory (DATA_SOURCE=lvshm).
      This mode is recommended for reading in data for O2 replay (e.g. at UWM).

2. Data transfer of features:

   a. Saving features directly to disk, i.e. no data transfer.
      This saves features to disk directly from the feature extractor,
      writing them out periodically via HDF5.
   b. Transferring features via Kafka topics.
      This requires a running Kafka/Zookeeper service (either an existing LDG service or your
      own). Features are transferred via Kafka from the feature extractor, parallel instances
      of the extractor are synchronized, and the features are then sent downstream where they
      can be read by other processes (e.g. iDQ). In addition, a streaming HDF5 file sink is
      launched, which dumps features to disk periodically. A hedged sketch of this Kafka
      hand-off follows.
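As a rough illustration of option 2b (the topic name, field names, and JSON serialization are
assumptions, not the pipeline's exact wire format), a producer/consumer pair using the
kafka-python client might look like:

.. code-block:: python

   import json
   from kafka import KafkaProducer, KafkaConsumer

   # Producer side (feature extractor job): push one batch of features
   # to a Kafka topic. Topic and field names are illustrative assumptions.
   producer = KafkaProducer(
       bootstrap_servers="localhost:9092",
       value_serializer=lambda d: json.dumps(d).encode("utf-8"),
   )
   producer.send("gstlal_features", {"time": 1187000000.0, "channel": "H1:AUX_CHANNEL_1", "snr": 6.1})
   producer.flush()

   # Consumer side (synchronizer/file sink): read feature batches back
   # off the topic as they arrive.
   consumer = KafkaConsumer(
       "gstlal_features",
       bootstrap_servers="localhost:9092",
       group_id="feature_extractor_group",
       value_deserializer=lambda b: json.loads(b.decode("utf-8")),
   )
   for message in consumer:
       features = message.value  # process downstream (e.g. synchronize, write HDF5)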
Launching DAGs
====================================================================================================
In order to start up online runs, you'll need an installation of gstlal. An installation Makefile that
includes Kafka dependencies is located at gstlal/gstlal-burst/share/feature_extractor/Makefile.gstlal_idq_icc.

To generate a DAG, making sure that the correct environment is sourced, run::

   $ make -f Makefile.gstlal_feature_extractor_online

Then launch the DAG with::

   $ condor_submit_dag feature_extractor_pipe.dag
Configuration options
====================================================================================================
General:

* TAG: sets the name used for logging purposes, Kafka topic naming, etc.

Data ingestion:

* IFO: select the IFO for auxiliary channels to be ingested.
* CHANNEL_LIST: a list of channels for the feature extractor to process. Lists for O1/O2 and
  H1/L1 are provided in gstlal/gstlal-burst/share/feature_extractor.
* DATA_SOURCE: protocol for reading in auxiliary channels (framexmit/lvshm).
* MAX_STREAMS: maximum number of streams that a single gstlal_feature_extractor process will
  handle. This is determined by summing the number of rates over all channels; the number of
  rates for a given channel is log2(max_rate/min_rate) + 1.
Waveform parameters:

* WAVEFORM: type of waveform used to perform matched filtering (sine_gaussian/half_sine_gaussian).
* MISMATCH: maximum mismatch between templates (corresponding to Omicron's mismatch definition).
* QHIGH: maximum value of Q.

Data transfer/saving:

* OUTPATH: directory in which to save features.
* SAVE_FORMAT: determines whether to transfer features downstream or save directly (kafka/hdf5).
* SAVE_CADENCE: span of a typical dataset within an HDF5 file.
* PERSIST_CADENCE: span of a typical HDF5 file.
Kafka options:

* KAFKA_TOPIC: basename of the topic for features generated by feature_extractor.
* KAFKA_SERVER: address where the Kafka server is hosted. If features are produced in the same
  location, as in condor's local universe, setting localhost:port is fine. Otherwise you'll need
  to determine the IP address where your Kafka server is running (using 'ip addr show' or
  equivalent).
* KAFKA_GROUP: group to which Kafka producers for feature_extractor jobs report.

Synchronizer/file sink options:

* PROCESSING_CADENCE: cadence at which incoming features are processed, so as to limit repeated
  polling of topics. The default value of 0.1 s is fine.
* REQUEST_TIMEOUT: timeout when waiting for a single poll from a Kafka consumer.
* LATENCY_TIMEOUT: timeout for the feature synchronizer, after which older features are dropped.
  This prevents a single feature extractor job from holding up the online pipeline. Its value
  will also depend on the latency induced by the feature extractor itself, especially when using
  templates that carry intrinsic latencies, such as Sine-Gaussians. A sketch of this timeout
  logic follows.
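A minimal sketch of how such a latency cut might work (the buffering structure and field names
are assumptions, not the synchronizer's actual implementation; feature times and the reference
clock are assumed to be on a common timescale):

.. code-block:: python

   LATENCY_TIMEOUT = 10.0  # seconds; illustrative value

   def synchronize(buffered_features, now):
       """Drop features older than LATENCY_TIMEOUT relative to `now` and
       release the rest in time order, so one slow producer cannot hold
       up the pipeline. A sketch under stated assumptions."""
       fresh = [f for f in buffered_features if now - f["time"] <= LATENCY_TIMEOUT]
       return sorted(fresh, key=lambda f: f["time"])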