Investigate performance issues

Patrick Godwin requested to merge investigate-performance-issues into master

This is the result of performance studies aimed at reducing the I/O overhead and the amount of time iDQ takes to run executables (train, timeseries, etc.). It also includes a few additions to deal with circular-dependency issues that surfaced while addressing the I/O problems.

What this includes:

  • Added a new option, stride, to the data discovery sections of the iDQ configuration file. It allows a single ClassifierData to be split into multiple ClassifierData objects, each with the granularity of a single file. For now, executables check this option to see whether the splitting needs to be done. stride is meant to be an integer representing the cadence of a single file, or possibly of datasets within the file. I didn't want to dictate exactly how specific ClassifierData implementations handle this, because it could be implementation-specific. For KW files, I'd imagine the stride corresponds to the file stride; for hdf5 objects, there are two strides (the file stride and the dataset stride), and I've chosen stride to be the dataset stride.
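
For concreteness, a hedged sketch of how the option might look in a data discovery section of the config file (everything here other than stride itself is illustrative and not the actual iDQ schema):

```ini
[data discovery]
; hypothetical section; only "stride" is the new option described above
classifierdata = GstlalHDF5ClassifierData
; stride: integer cadence (seconds) of a single file, or of the
; datasets within a file, used to split one ClassifierData into many
stride = 64
```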

  • Added ClassifierData lists, which store lists of tuples of the form (ClassifierData, segment) to aid in splitting a single ClassifierData into multiple ones. Two helper functions within io.py generate these lists: a dense version (no windowing around FeatureVector, just the full file stride) and a windowed version that creates many small segments surrounding only the times we want to extract from files. The latter significantly lowers the I/O overhead and runtime for sparsely separated training sets. In these functions, how segments are split at file boundaries depends on whether the ClassifierData implementation has a cache keyword argument. I've made cache a keyword argument because I don't want to force it on all ClassifierData objects; however, since I have no way of predicting the span of files a priori for hdf5 files, I used a cache to determine them. For now this still uses globs, but that likely won't be the case in the future. I've also specified an alternate option (currently not implemented, but with a suggestion of how to do so) that uses the stride to predetermine the file strides needed to generate these lists.
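
A rough sketch of the two segment-generation strategies described above (function names and the exact alignment logic are illustrative, not the actual io.py helpers):

```python
# Hypothetical sketch of the dense vs. windowed list generation.

def dense_segments(span, stride):
    """Split (start, end) into contiguous stride-aligned segments,
    roughly one per file, with no windowing around feature times."""
    start, end = span
    first = int(start) - int(start) % stride  # align to file boundaries
    return [(max(s, start), min(s + stride, end))
            for s in range(first, int(end), stride)]

def windowed_segments(times, window):
    """Keep only small segments within +/- window of each requested
    time, merging overlaps; this is what lowers I/O for sparsely
    separated training sets."""
    segs = []
    for t in sorted(times):
        s, e = t - window, t + window
        if segs and s <= segs[-1][1]:
            segs[-1] = (segs[-1][0], max(segs[-1][1], e))  # merge overlap
        else:
            segs.append((s, e))
    return segs
```

Each resulting segment would then be paired with its own ClassifierData instance to form the (ClassifierData, segment) list.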

  • Completely stripped out the class declaration/instantiation routines at the bottom of io.py and classifiers.py in favor of factories (ClassifierFactory, etc.) that create instances of these objects. The known subclasses are collected upon factory instantiation, and each factory contains a make() method that creates instances of the requested class. On the surface, this just appears to be a different implementation of what's already there. However, it solves the circular-dependency problem caused by having the class instance creation live within configparser.py rather than within the respective modules. It also removes the need to have code run at the bottom of modules, isolating the needed behavior within a class instead.
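
A minimal sketch of the factory pattern described above (the class names and details are assumed for illustration, not the actual iDQ implementation):

```python
# Subclasses are discovered when the factory is instantiated, so no
# registration code needs to run at the bottom of any module.

class ClassifierData:
    pass

class KWClassifierData(ClassifierData):
    def __init__(self, **kwargs):
        self.kwargs = kwargs

class ClassifierDataFactory:
    def __init__(self):
        # walk the subclass tree at instantiation time
        self._known = {}
        stack = [ClassifierData]
        while stack:
            cls = stack.pop()
            self._known[cls.__name__] = cls
            stack.extend(cls.__subclasses__())

    def make(self, name, **kwargs):
        """Create an instance of the named subclass."""
        return self._known[name](**kwargs)

factory = ClassifierDataFactory()
data = factory.make('KWClassifierData', stride=64)
```

Because the factory looks up subclasses itself, configparser.py only needs the factory, not every concrete class, which is what breaks the circular imports.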

  • These factories are then used throughout batch.py to generate Classifiers, Reporters, etc., as well as within ClassifierData lists to aid in creating multiple instances of these classes.

  • Added quite a few utilities to aid in I/O-based processing; these live within utils.py.

  • Reworked GstlalHDF5ClassifierData._retrieve_triggers() to reduce I/O overhead and runtime.

  • In features.segs2times(), some segments passed in have a LIGOTimeGPS start/stop rather than a plain float/int, which causes issues in comparisons. I have type-cast the segment start and stop to float to alleviate this. I'm not sure whether this pops up because of the segments utilities or something else; I've gone with this fix because I'm not sure what better option there is. If there is a general issue with segments being inconsistent, we should investigate that separately and deal with it.
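
To illustrate the fix (with a stand-in for LIGOTimeGPS, since the real class lives in lal, and a simplified segs2times whose exact signature is assumed):

```python
class LIGOTimeGPS:
    """Stand-in for lal.LIGOTimeGPS: a GPS time with a float conversion."""
    def __init__(self, sec, ns=0):
        self.sec, self.ns = sec, ns
    def __float__(self):
        return self.sec + self.ns * 1e-9

def segs2times(segs, step=1.0):
    """Yield sample times covering each segment; casting both bounds
    to float makes mixed LIGOTimeGPS/float segments compare consistently."""
    times = []
    for start, stop in segs:
        t = float(start)            # cast guards against LIGOTimeGPS bounds
        while t < float(stop):
            times.append(t)
            t += step
    return times
```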

What this doesn't include:

  • Changing the API of SupervisedClassifier.timeseries() to use Quiver rather than ClassifierData. In principle, the utilities I added to generate an unlabeled Quiver from a ClassifierData or a ClassifierData list would make this fairly trivial, but I'll leave it up to you since you'd have to handle the OVL case.

  • Creating UmbrellaClassifierData objects. These could, in principle, either directly leverage the ClassifierData list utilities or reuse parts of their workflow for delegating to smaller ClassifierData objects. I'm inclined to keep the list utilities, even if they get modified or you decide on a different approach for UmbrellaClassifierData, because they may allow easy parallelization of Quiver generation when we run into the issue of processing more densely spaced features (either in frequency, by dumping all templates rather than just the loudest one, or in time, by increasing the rate at which features are produced beyond 1 Hz). In my opinion, we could farm out this preprocessing step via Condor or similar without much more work when/if we go down this route.

  • Implementation of generating file segments within the ClassifierData list utilities when not using a cache. This would presumably use the stride kwarg of ClassifierData to generate predetermined file segments, but I'll leave the specific implementation open for now because I can't use this approach for gstlal-based features.
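
One possible shape for that cache-free path, sketched under the assumption that files are stride-aligned on integer GPS times (the function name and alignment rule are hypothetical):

```python
def predetermined_file_segments(start, end, stride):
    """Predict the (file_start, file_end) span of every file overlapping
    [start, end), using only the stride kwarg instead of globbing a cache."""
    first = int(start) // stride * stride   # snap down to a file boundary
    return [(s, s + stride) for s in range(first, int(end), stride)]
```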

I hope I have covered everything within the scope of this merge request and haven't forgotten anything. This is intended to help with a lot of the I/O issues, and while it isn't comprehensive enough to deal with all the problems described, it should not affect current runs with KW-based features at all, since I've tried to isolate my changes until you have a chance to test/implement them on the KW side.
