Overhaul handling of features/datasets (!207) · Merge requests · lscsoft / iDQ

Patrick Godwin requested to merge features_overhaul into master Mar 31, 2021

One big issue we've had in the past is that features and datasets (quivers) were not handled uniformly across classifiers. In particular, OVL used ClassifierData directly, while every other classifier used Quivers to train and evaluate its models. This was necessary because quivers expect their data in tabular form (one row per timestamp), whereas OVL requires data in a rawer form so that it can form coincidences between different auxiliary channels.
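
To make the distinction between the two forms concrete, here is a minimal sketch in plain Python. It is not iDQ code: the channel names, the `(time, snr)` trigger layout, the 0.5 s window, and all function names are assumptions made up for illustration.

```python
from bisect import bisect_left

# Hypothetical raw-like form: per-channel lists of (time, snr) triggers,
# each list sorted by time. This is the shape a coincidence search wants.
raw_triggers = {
    "AUX-CHANNEL_A": [(10.0, 8.5), (12.4, 6.1), (20.2, 15.0)],
    "AUX-CHANNEL_B": [(12.5, 9.3), (30.0, 7.7)],
}


def nearest_trigger(triggers, t, window):
    """Return the trigger closest to time t within +/- window, or None."""
    times = [time for time, _ in triggers]
    i = bisect_left(times, t)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = triggers[max(i - 1, 0): i + 1]
    in_window = [trg for trg in candidates if abs(trg[0] - t) <= window]
    if not in_window:
        return None
    return min(in_window, key=lambda trg: abs(trg[0] - t))


def to_tabular(raw, target_times, window=0.5):
    """Tabular form: one row per target timestamp, one SNR column per channel."""
    rows = []
    for t in target_times:
        row = {"time": t}
        for channel, triggers in raw.items():
            trg = nearest_trigger(triggers, t, window)
            # Assumed default of 0.0 when a channel has no nearby trigger.
            row[channel + "_snr"] = trg[1] if trg else 0.0
        rows.append(row)
    return rows


def coincident_channels(raw, t, window=0.5):
    """Raw-form query: which channels have a trigger within the window of t?"""
    return sorted(ch for ch, trgs in raw.items() if nearest_trigger(trgs, t, window))
```

The tabular view discards everything except the one nearest trigger per channel, which is why a coincidence-based classifier cannot be built on top of it alone.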

This ultimately meant that we needed to jump through many more hoops, both in stream and batch jobs and in I/O optimizations, to ensure that jobs finished in a reasonable amount of time. In particular, UmbrellaClassifierData existed to keep I/O and feature construction reasonably efficient, but it was difficult to maintain.

This PR does quite a few things. First, it leverages gwpy and gwtrigfind for trigger discovery and I/O. Instead of maintaining a separate set of routines for this, we use existing functionality for handling different types of LIGO-based triggers, which also lets us take advantage of gwpy's parallel reads. With gwpy's EventTable, I/O and feature construction are efficient enough that we no longer need UmbrellaClassifierData or several different sets of Kleine-Welle data loaders.
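
For readers unfamiliar with the idea of parallel reads, the sketch below shows the general pattern using only the Python standard library. It is a stand-in, not gwpy's API or implementation: the CSV layout with a `time,snr` header, the function names, and the `nproc` parameter are all assumptions for illustration.

```python
import csv
from concurrent.futures import ThreadPoolExecutor


def read_trigger_file(path):
    """Read one CSV file with a 'time,snr' header into a list of dicts."""
    with open(path, newline="") as f:
        return [
            {"time": float(row["time"]), "snr": float(row["snr"])}
            for row in csv.DictReader(f)
        ]


def read_triggers(paths, nproc=4):
    """Read several trigger files concurrently and merge them, sorted by time."""
    if not paths:
        return []
    # Threads suit this workload because it is dominated by file I/O.
    with ThreadPoolExecutor(max_workers=nproc) as pool:
        per_file = list(pool.map(read_trigger_file, paths))
    merged = [row for rows in per_file for row in rows]
    return sorted(merged, key=lambda row: row["time"])
```

Delegating this to an existing library means the per-format readers and the concurrency logic no longer have to be maintained in the project itself.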

Second, I took the liberty of updating some main class names for clarity:

ClassifierData -> DataLoader
Quiver -> Dataset

Finally, some work was done to unify all classifiers to work with the Dataset API by heavily relying on astropy's Table structure.

All in all, this speeds up I/O and feature construction across the board, by anywhere from 5-30%, and reduces the size of the codebase by about 15%. It also makes it much easier to implement training on timeseries directly rather than on triggers alone.

