
Batch io optimizations

Patrick Godwin requested to merge batch_io_optimizations into master

This merge request fixes several bugs that caused batch-based jobs to crash. It also adds an alternative way to generate Umbrellas within factories.QuiverFactory for GstLAL-based ClassifierData objects, for which the full set of file paths cannot be predicted a priori. Finally, it includes a few other optimizations that make batch-based jobs much faster than before for GstLAL-based ClassifierData and Sklearn-based classifiers.

Bug fixes:

  • io.py: changed the glob call within DiskReporter for get_preferred_file from glob.iglob to glob.glob. The sort done on the next line (line 1015) expects a list, not a generator.
  • classifiers.py: in SupervisedSklearnClassifier.timeseries(), the segments passed to a quiver_factory to generate timeseries may, in rare cases, not span any data on disk. This causes problems when calling Quiver.vectorize(), which expects a non-empty quiver. There is now a small check on whether the returned quiver is empty.
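
The io.py fix can be illustrated with a minimal sketch. It assumes the sort in question is an in-place list sort; the helper name newest_first and the reverse ordering are hypothetical and are not taken from io.py.

```python
import glob
import os
import tempfile


def newest_first(pattern):
    """Return paths matching pattern in reverse lexical order (hypothetical helper)."""
    paths = glob.glob(pattern)  # a real list, so the in-place sort below works
    paths.sort(reverse=True)
    return paths


# glob.iglob returns a lazy iterator, which has no .sort() method at all.
lazy = glob.iglob("*.nonexistent")
iglob_is_sortable_in_place = hasattr(lazy, "sort")

# Small demonstration on throwaway files.
with tempfile.TemporaryDirectory() as tmpdir:
    for name in ("a.txt", "c.txt", "b.txt"):
        open(os.path.join(tmpdir, name), "w").close()
    pattern = os.path.join(tmpdir, "*.txt")
    found = [os.path.basename(p) for p in newest_first(pattern)]
```

Wrapping the iterator in sorted() would also work, but switching to glob.glob keeps the existing sort call unchanged.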

Feature additions:

  • Added a segs kwarg to both SupervisedSklearnClassifier.timeseries() and QuiverFactory.unlabeled() to pass in seglist subsets. This was discussed in the streaming_development email thread, and the changes from there have been incorporated here.
  • Renamed the glue segment import in factories.py from ligolw_segments to glue_segments. This is more accurate and less confusing, since other submodules already use the glue_segments name when both ligolw_segments and glue_segments are present.
  • Added a direct kwarg in factories.py for explicitly specifying that Umbrellas are not to be used.
  • Added a utils.find_unique_segs function that takes a sorted segment list and a segment, and finds all segments in the list that overlap it. It uses bisect to find segments very quickly and is used to generate Quivers in some of the optimizations below.
  • Quiver.append_quiver() now also updates segments, so that the new quiver contains all segments from both quivers.
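
The bisect-based lookup can be sketched as follows. This is not the code of utils.find_unique_segs: the name find_overlapping is hypothetical, and the sketch assumes segments are half-open (start, end) tuples that are sorted and do not overlap one another.

```python
from bisect import bisect_left, bisect_right


def find_overlapping(sorted_segs, seg):
    """Return the segments in sorted_segs that overlap seg.

    Assumes sorted_segs holds sorted, mutually disjoint, half-open
    (start, end) tuples, so both the starts and the ends are sorted.
    """
    # Rebuilt on every call for brevity; a real implementation would
    # cache these lists so each lookup stays O(log n).
    starts = [s for s, _ in sorted_segs]
    ends = [e for _, e in sorted_segs]

    lo = bisect_right(ends, seg[0])   # first segment ending after seg starts
    hi = bisect_left(starts, seg[1])  # first segment starting at or after seg ends
    return sorted_segs[lo:hi]


segs = [(0, 10), (10, 20), (25, 30), (40, 50)]
overlapping = find_overlapping(segs, (5, 26))
in_gap = find_overlapping(segs, (30, 40))
```

Because the list is sorted and disjoint, the overlapping segments always form one contiguous slice, which is why two binary searches are enough.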

Optimizations:

  • Added an option within factories.py to generate Umbrellas using one of two mechanisms. The first is what was already present (window + stride). The second uses a cache to create Umbrellas. This change is completely seamless to the user, so KW-based ClassifierData (or others without caches) will not have to change a thing. The motivation is that the I/O optimizations introduced in the io_optimizations patch could not be used by GstLAL-based ClassifierData, because the window + stride option assumes that file paths can be completely predetermined. That is not possible for GstLAL-based ClassifierData, so strides need to be determined from the files present on disk, and the cache makes this easy. These changes, together with preaggregation when producing GstLAL-based triggers from the feature extractor, have reduced training times by more than a factor of 100 (a 40-hour job now takes 15 minutes).
  • SupervisedSklearnClassifier now uses the num_procs kwarg to parallelize the cross-validation scheme used for grid-based hyperparameter tuning. With the I/O optimizations in place, this step has become a bottleneck, so parallelizing it reduces training time considerably. I'll have to think about a better scheme now that this is the case, but this is good low-hanging fruit for the time being.
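
To illustrate why a cache helps, here is a rough sketch of recovering time spans from files that already exist instead of predicting paths from window + stride. It is not the factories.py implementation. It assumes file names end in -GPSSTART-DURATION before the extension, and the names Span and spans_from_cache, as well as the example paths, are hypothetical.

```python
import os
from collections import namedtuple

# Hypothetical record for one file's time coverage.
Span = namedtuple("Span", ["start", "end", "path"])


def spans_from_cache(paths):
    """Derive sorted (start, end, path) spans from cached file paths.

    Assumes each file name ends in "-<gps_start>-<duration>" before
    its extension; this naming scheme is an assumption of the sketch.
    """
    spans = []
    for path in paths:
        stem = os.path.splitext(os.path.basename(path))[0]
        start, duration = stem.split("-")[-2:]
        spans.append(Span(int(start), int(start) + int(duration), path))
    return sorted(spans)  # ordered by start time, then end, then path


cache = [
    "/data/H1-FEATURES-1000000020-20.h5",
    "/data/H1-FEATURES-1000000000-20.h5",
]
spans = spans_from_cache(cache)
```

The strides then come from whatever the cache lists, so irregular or missing files do not break the grouping the way a fixed window + stride prediction would.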
Edited by Patrick Godwin
