Mixed bag of improvements for comparison run
This was originally meant to just improve some of the quiver I/O, but it ended up being a mixed bag of changes to help with the sklearn comparison run with gstlal-based features.
For context, the training jobs in the comparison runs were taking an ungodly amount of time. I narrowed it down to the quiver generation: part of the problem needed code changes, and part was that I had put the windowing logic in the wrong place, so quiver generation did more work than it needed to.
Anyways, here are the changes:
- Added an option to also use a stride to break up I/O for quivers when using gstlal-based CD. Before, it would just split up I/O based on file boundaries, but that is insufficient now that I've increased the sampling rate from 1 Hz to 16 Hz: a 2000 second file has way too many triggers to parse through for a single CD.
- Because of this, I've consolidated some of the code in `factories.py` so that all code just uses the quiver splitting logic I wrote rather than maintaining two similar sections of code separately. I've opted for mine because it creates all the CD boundaries first and then populates the quiver rather than doing it all at once, which allows me to factor out some parts of the splitting logic more easily.
- Added `split_segments_by_stride` in `utils.py` to help with making those boundaries for CDs. It leverages some of the code already in `segs2datasets`, with a little modification to that function.
- Added verbosity to the train/evaluate stages in sklearn-based classifiers for the quiver generation and training/eval.
- Fixed an issue in gstlal-based CD where it wasn't grabbing all the right metadata from hdf5 files needed for its logic. To maintain backwards compatibility, the code previously allowed explicit kwargs specifying this metadata, so the problem wasn't really caught. That's now fixed (stride -> cadence, plus some variables renamed to avoid clashing or being confused with the stride kwarg used here and elsewhere). This metadata is now also required to be set, which simplifies things a bit. Updated the unit tests to reflect this change.
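
For reviewers who want a picture of the stride-based splitting, here is a rough sketch of the idea, not the actual code in this change. The names `split_by_stride_sketch`, `build_quiver_sketch`, and `load_triggers` are hypothetical, and the real `split_segments_by_stride` may take different arguments and align boundaries differently. It only illustrates computing all the boundaries first and then populating the quiver in a second pass.

```python
def split_by_stride_sketch(segments, stride):
    """Split (start, end) segments into consecutive chunks at most `stride` long.

    Hypothetical illustration only; not the signature used in utils.py.
    """
    if stride <= 0:
        raise ValueError("stride must be positive")
    chunks = []
    for seg_start, seg_end in segments:
        chunk_start = seg_start
        while chunk_start < seg_end:
            # The last chunk of a segment is truncated at the segment end.
            chunk_end = min(chunk_start + stride, seg_end)
            chunks.append((chunk_start, chunk_end))
            chunk_start = chunk_end
    return chunks


def build_quiver_sketch(segments, stride, load_triggers):
    """Create all boundaries first, then populate one entry per boundary.

    `load_triggers(start, end)` is an assumed callable standing in for
    whatever reads the triggers covering one chunk.
    """
    boundaries = split_by_stride_sketch(segments, stride)
    return [load_triggers(start, end) for start, end in boundaries]
```

With a stride of 100 s, a single 2000 second file is read as 20 chunks instead of one, and because the boundaries are computed up front, the same helper can serve every code path that needs them.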
Anyways, with these changes and windowing/stride set to 100s, the time to create a vectorized quiver dropped from 3+ days to ~1 hour. I'm not sure how much of this was purely due to the windowing, but being able to manually shorten the strides should have helped considerably.