Batch io optimizations
This merge request fixes some issues that caused batch-based jobs to crash. It also adds an alternative way to generate Umbrellas within `factories.QuiverFactory` for GstLAL-based ClassifierData objects, for which it is infeasible to predict all file paths a priori. Finally, it includes a few other optimizations that make batch-based jobs much faster than before for GstLAL-based ClassifierData and Sklearn-based classifiers.
Bug fixes:
- io.py: changed the glob call within `DiskReporter` for get_preferred_file from `glob.iglob` to `glob.glob`, because the sort done on the next line (line 1015) expects a list, not a generator.
- classifiers.py: in `SupervisedSklearnClassifier.timeseries()`, the segments passed into a quiver_factory to generate timeseries may, in rare cases, not span any data on disk. This causes issues when calling `Quiver.vectorize()`, since it expects a non-empty quiver. There is now a small check for whether the returned quiver is empty.
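To illustrate the first fix, here is a minimal sketch (not the actual `DiskReporter` code) of why an in-place sort needs `glob.glob` rather than `glob.iglob`. The helper name `list_sorted_files` and the `.h5` file names are hypothetical and only serve the example.

```python
import glob
import os
import tempfile


def list_sorted_files(pattern):
    """Return all paths matching pattern as a sorted list (hypothetical helper)."""
    paths = glob.glob(pattern)  # glob.glob returns a list
    paths.sort()                # in-place sort is only available on lists
    return paths


with tempfile.TemporaryDirectory() as tmpdir:
    # Create a few empty files in arbitrary order.
    for name in ("b.h5", "a.h5", "c.h5"):
        open(os.path.join(tmpdir, name), "w").close()

    pattern = os.path.join(tmpdir, "*.h5")

    # glob.iglob returns a lazy iterator, which has no .sort() method.
    lazy = glob.iglob(pattern)
    iglob_has_sort = hasattr(lazy, "sort")

    sorted_names = [os.path.basename(p) for p in list_sorted_files(pattern)]
```

An alternative would be to wrap the iterator in `sorted(...)`, which accepts any iterable; switching to `glob.glob` keeps the existing sort line unchanged.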
Feature additions:
- Added a segs kwarg to both `SupervisedSklearnClassifier.timeseries()` and `QuiverFactory.unlabeled()` to pass in seglist subsets. This was discussed in the streaming_development email thread, and the changes from that discussion have been incorporated here.
- Changed the glue segment import in factories.py to be `glue_segments` rather than `ligolw_segments`, which is more accurate and less confusing, since this same `glue_segments` import is used in other submodules where both `ligolw_segments` and `glue_segments` are present.
- Added a `direct` kwarg in factories.py for explicitly specifying that Umbrellas are not to be used.
- Added a `utils.find_unique_segs` function, which takes a sorted segment list and a segment, and finds all segments in the list that overlap that segment. It uses bisect to find segments very quickly and is used to generate Quivers in some of the optimizations below.
- `Quiver.append_quiver()` now updates segments so that the resulting quiver contains all segments from both quivers.
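The bisect-based overlap lookup described above can be sketched roughly as follows. This is not the actual `utils.find_unique_segs` implementation: the function name `find_overlapping_segs` is hypothetical, segments are modeled as plain half-open `(start, end)` tuples rather than glue segments, and the input list is assumed to be sorted and non-overlapping.

```python
import bisect


def find_overlapping_segs(sorted_segs, seg):
    """Return the segments in sorted_segs that overlap seg.

    Assumes sorted_segs is sorted and its segments are disjoint, so both the
    start times and the end times are in ascending order. Segments are
    treated as half-open intervals [start, end).
    """
    starts = [s for s, _ in sorted_segs]
    ends = [e for _, e in sorted_segs]
    query_start, query_end = seg

    # First segment that ends strictly after the query starts.
    lo = bisect.bisect_right(ends, query_start)
    # First segment that starts at or after the query ends.
    hi = bisect.bisect_left(starts, query_end)
    return sorted_segs[lo:hi]


segs = [(0, 10), (10, 20), (25, 30), (40, 50)]
overlapping = find_overlapping_segs(segs, (5, 26))
in_gap = find_overlapping_segs(segs, (20, 25))
```

Each lookup costs two binary searches. A real implementation would likely precompute or avoid the `starts` and `ends` lists, since building them here is linear in the list length.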
Optimizations:
- Added an option within factories.py to generate Umbrellas using one of two mechanisms. The first is what was already present (window + stride). The second uses a cache to create Umbrellas. This change is completely seamless to the user, so KW-based ClassifierData (or others without caches) will not have to change a thing. The motivation is that, since the io_optimizations patch was merged, none of its I/O optimizations could be used by GstLAL-based ClassifierData: the window + stride option assumes that file paths can be completely predetermined, which is not possible for GstLAL-based ClassifierData. Strides therefore need to be determined from the files present on disk, and the cache makes this easy. These changes, together with preaggregation when producing GstLAL-based triggers from the feature extractor, have reduced training times by a factor of more than 100 (not even kidding, a 40 hour job now takes 15 minutes).
- `SupervisedSklearnClassifier` now uses the `num_procs` kwarg to parallelize the cross-validation scheme used for grid-based hyperparameter tuning. With the I/O optimizations in place, this step has become the bottleneck, so parallelizing it reduces training time considerably. I'll have to think about a better scheme now that this is the case, but this is good low-hanging fruit for the time being.
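The difference between the two Umbrella mechanisms can be sketched as below. This is only an illustration under stated assumptions, not the code in factories.py: the function names are hypothetical, the cache is modeled as a plain list of paths, and file names are assumed to follow an `<IFO>-<DESCRIPTION>-<GPS_START>-<DURATION>.<ext>` layout.

```python
import os
import re

# Assumed file name layout: <IFO>-<DESCRIPTION>-<GPS_START>-<DURATION>.<ext>
_NAME_RE = re.compile(r"^[^-]+-[^-]+-(?P<start>\d+)-(?P<duration>\d+)\.\w+$")


def spans_from_stride(start, end, stride):
    """Predict spans a priori from a fixed stride (window + stride style)."""
    return [(t, min(t + stride, end)) for t in range(start, end, stride)]


def spans_from_cache(paths):
    """Derive spans from the files actually listed in a cache.

    Returns (start, end, path) tuples sorted by start time. Paths whose
    names do not match the assumed layout are skipped.
    """
    spans = []
    for path in paths:
        match = _NAME_RE.match(os.path.basename(path))
        if match is None:
            continue
        start = int(match.group("start"))
        spans.append((start, start + int(match.group("duration")), path))
    spans.sort()
    return spans


# Hypothetical cache contents, listed out of order.
cache = [
    "/data/H1-FEATURES-1000000020-20.h5",
    "/data/H1-FEATURES-1000000000-20.h5",
    "/data/README.txt",
]
cached_spans = spans_from_cache(cache)
predicted_spans = spans_from_stride(0, 50, 20)
```

The stride-based version needs no access to disk but only works when file boundaries are known in advance; the cache-based version reads the boundaries off whatever files exist.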
Edited by Patrick Godwin