Investigate performance issues
This merge request is the result of performance studies aimed at reducing I/O overhead and the time it takes to run the iDQ executables (train, timeseries, etc.). It also includes a few additions that address circular-dependency issues encountered while working on the I/O problems.
What this includes:
- Added a new option, `stride`, to the data discovery sections of the iDQ configuration file, which allows a single `ClassifierData` to be split into multiple `ClassifierData` objects with the granularity of a single file. For now, executables check this option to see whether the split needs to be done. `stride` is meant to be an integer representing the cadence of a single file, or possibly of datasets within the file. I didn't want to dictate exactly what specific `ClassifierData` objects do with it, because this could be implementation specific: for KW files, I'd imagine the stride corresponds to the file stride, but HDF5 files have two strides (the file stride and the dataset stride), and I've chosen `stride` to be the dataset stride.
- Added `ClassifierData` lists, which store tuples of the form (`ClassifierData`, `segment`) to aid in splitting a single `ClassifierData` into multiple objects. There are two helper functions within io.py that generate these lists: a dense version (no windowing around `FeatureVector` times, just the full file stride) and a windowed version that creates segments just surrounding the times we want to extract from files. The windowed version significantly lowers I/O overhead and runtime for sparsely separated training sets. In these functions, how segments are split at file boundaries depends on whether the `ClassifierData` implementation has a `cache` keyword argument. I've moved `cache` over to a keyword argument because I don't want to require it of all `ClassifierData` objects. However, since I have no way of knowing the spans of HDF5 files a priori, I use a cache to determine them; for now this still relies on globs, but that likely won't be the case in the future. All that to say, I've also specified an alternate option (currently not implemented, but with a suggestion of how to do so) that uses `stride` to predetermine the file segments needed to generate these lists.
- Completely stripped out the class declaration/instantiation routines at the bottom of io.py and classifiers.py in favor of `ClassifierFactory`, etc., which create instances of these objects. The known subclasses are collected when the factory is instantiated, and a make() method creates instances of the requested class. On the surface, this just appears to be a different implementation of what's already there. However, it solves the circular dependencies caused by having class instantiation live in configparser.py rather than in the respective modules. It also removes the need for code to run at the bottom of modules, isolating that behavior within a class instead.
- These factories are then used throughout batch.py to generate Classifiers, Reporters, etc., as well as within `ClassifierData` lists to aid in creating multiple instances of these classes.
- Added quite a few utilities to aid in I/O-based processing, living within utils.py.
- Reworked `GstlalHDF5ClassifierData._retrieve_triggers()` to reduce I/O overhead and runtime.
- In `features.segs2times()`, some segments passed in have a LIGOTimeGPS endpoint rather than a plain float/int, which causes issues in comparisons. I have type-cast the segment start and stop to float to alleviate this. I'm not sure whether this pops up because of the segments utilities or something else; I've gone with this fix because I'm not sure what option would be better. If there is a general issue with segments being inconsistent, we should investigate that separately and deal with it.
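As a sketch of the file-granularity splitting the `stride` option enables (the helper name and signature below are illustrative, not iDQ's actual API), a requested span can be divided into segments aligned to file boundaries of length `stride`:

```python
def split_by_stride(start, end, stride):
    """Split [start, end) into segments aligned to file boundaries of length stride.

    Illustrative only: iDQ's actual splitting lives in its ClassifierData-list
    utilities; this just shows the file-granularity idea behind the stride option.
    """
    # snap the first boundary down to an integer multiple of the stride
    t = (start // stride) * stride
    segs = []
    while t < end:
        # clip each file-length segment to the requested span
        segs.append((max(t, start), min(t + stride, end)))
        t += stride
    return segs
```

Each resulting segment would then back its own `ClassifierData` in the list, e.g. `split_by_stride(3, 14, 4)` yields `[(3, 4), (4, 8), (8, 12), (12, 14)]`.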
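The windowed `ClassifierData`-list generation described above could look roughly like the following (names are my own, not iDQ's actual helpers): build merged segments around each `FeatureVector` time so only data near the requested times is read.

```python
def window_segments(times, window):
    """Build merged segments of width 2*window around each requested time.

    Sketch of the 'windowed' ClassifierData-list generation: rather than
    reading the full file stride, only data surrounding the FeatureVector
    times is touched, lowering I/O for sparsely separated training sets.
    """
    segs = []
    for t in sorted(times):
        lo, hi = t - window, t + window
        if segs and lo <= segs[-1][1]:
            # overlapping or touching windows: coalesce into one segment
            segs[-1] = (segs[-1][0], max(segs[-1][1], hi))
        else:
            segs.append((lo, hi))
    return segs
```

Each merged segment would then be intersected with file boundaries to produce the (`ClassifierData`, `segment`) tuples.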
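The factory pattern replacing the module-bottom instantiation routines can be sketched as follows (the base-class and subclass names here are stand-ins, not iDQ's exact code): subclasses are discovered when the factory is built, and `make()` constructs instances by name, so no code needs to run at module import time.

```python
class Classifier:
    """Stand-in for iDQ's classifier base class (illustrative only)."""

class OVL(Classifier):
    """Example subclass the factory should discover."""

class ClassifierFactory:
    """Sketch of the factory described above.

    Known subclasses are collected upon instantiation, and make() creates
    instances of the requested class. Keeping this inside a class avoids
    module-level instantiation code and the circular imports caused by
    creating class instances from within configparser.py.
    """
    def __init__(self, base=Classifier):
        # map subclass name -> subclass, discovered when the factory is built
        self._known = {cls.__name__: cls for cls in base.__subclasses__()}

    def make(self, name, *args, **kwargs):
        return self._known[name](*args, **kwargs)
```

Usage would then be, e.g., `ClassifierFactory().make('OVL')`, with batch.py holding one factory per object family (Classifiers, Reporters, etc.).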
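The `segs2times()` fix amounts to coercing segment endpoints to float before any comparisons; a minimal sketch (using a stand-in `GPSTime` class in place of LIGOTimeGPS):

```python
class GPSTime:
    """Stand-in for LIGOTimeGPS supporting float() conversion (illustrative)."""
    def __init__(self, seconds):
        self.seconds = seconds
    def __float__(self):
        return float(self.seconds)

def normalize_segments(segs):
    """Cast segment endpoints to float, mirroring the segs2times() fix:
    mixed GPS-object/float endpoints can misbehave in comparisons, so
    both endpoints are coerced up front."""
    return [(float(start), float(stop)) for start, stop in segs]
```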
What this doesn't include:
- Changing the API of `SupervisedClassifier.timeseries()` to use `Quiver` rather than `ClassifierData`. In principle, the utilities I added to generate an unlabeled `Quiver` from a `ClassifierData` or a `ClassifierData` list would make this fairly trivial, but I'll leave this up to you since you'd have to deal with handling it within OVL.
- Creating `UmbrellaClassifierData` objects, which in principle could either directly leverage the `ClassifierData` list utilities or reuse their workflow for delegating to smaller `ClassifierData` objects. I'm inclined to keep the list utilities, even if they're modified or you decide on a different approach for `UmbrellaClassifierData`, because they may allow easy parallelization of `Quiver` generation once we run into denser-spaced features (either in frequency, by dumping all templates rather than the loudest one, or in time, by increasing the rate at which features are produced above 1 Hz). In my opinion, we could farm out this preprocessing step via Condor or similar without much more work when/if we go down this route.
- Implementing the generation of file segments within the `ClassifierData` list utilities when not using a cache. This would presumably use the `stride` kwarg of `ClassifierData` to generate predetermined file segments, but I'll leave the specific implementation open for now because I can't use this approach for gstlal-based features.
I hope I have covered everything within the scope of this merge request and haven't forgotten anything. It is intended to help with a lot of the I/O issues, and while it isn't comprehensive enough to deal with all the problems described, it should not affect current runs with KW-based features at all, since I've tried to isolate my changes until you have a chance to test/implement them on the KW side.