
WIP: Reed investigate performance issues

This merge request is still a work in progress, mostly because the changes have NOT been tested.

This merge request should wrap up several things I've promised to deliver for the batch pipeline. In particular:

  • implementing io.UmbrellaClassifierData
  • reworking how quivers are built (and all that delegation chain) to utilize UmbrellaClassifierData in a smart way that should work for both densely spaced and sparsely spaced quivers.
    • we condition on a window kwarg to filter the ClassifierData around sample times
    • we condition on stride and window kwargs to split I/O into smaller objects, nested within Umbrellas
    • if samples are spaced closely, but not so closely that their windows overlap, it may be faster to load in all the data instead of filtering by the windows around each sample. Currently, we do not attempt to automate that choice because the break-even point could be quite implementation dependent. Instead, we provide a flag called do_not_window that stops the code from windowing around times. Users must set it by hand (i.e.: the burden is on them to know what they're doing).
  • re-working how Factory objects work a bit to avoid repeated work (more on Factories below)
    • I made factories callable (just a small change in syntax) and stripped out a lot of the explicit checking within them in favor of catching errors. This should avoid repeating conditionals that we know will evaluate to True
    • I also explicitly called out the flavor kwarg instead of relying on it being passed through **kwargs. This required declaring the input arguments explicitly for each type of Factory, which is only a problem for ReporterFactory because StreamReporter did not follow the API declared in Reporter. StreamReporter won't work as-is for several reasons, so we know it must be re-worked, and I'm therefore fine breaking it further in this way.
  • adding new PredictiveKW?ClassifierData objects that use a stride to predict file and directory names
    • changes associated with this trickle down to idq-kwm2kws and some sanitycheck scripts
  • changed how target_bounds are read in from config files to avoid using eval whenever possible.
    • this introduces a slight change to the syntax within the config files and also required changes to configparser.config2bounds (renamed from config2target_bounds)
  • moved some functions used to predict GSTLAL filenames and directory names from utils.py to names.py
  • added some verbosity statements within OVL and DOVL to help users track and time the progress of their training
  • misc cleaning up and removal of unused functions and objects
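As a rough illustration of the windowing logic described above: the helper below is hypothetical (the real logic lives in io.UmbrellaClassifierData and the quiver-building delegation chain), but it captures the role of the window kwarg and the do_not_window escape hatch.

```python
def filter_by_windows(times, samples, window, do_not_window=False):
    """Return the subset of trigger times within +/-window of any sample time.

    If do_not_window is True, skip filtering and return everything; users
    set this by hand when samples are dense enough that loading all the
    data is cheaper than windowing around each sample.
    (Hypothetical helper illustrating the logic, not the actual iDQ API.)
    """
    if do_not_window:
        return list(times)
    kept = []
    for t in times:
        # keep a trigger if it falls inside at least one sample's window
        if any(abs(t - s) <= window for s in samples):
            kept.append(t)
    return kept
```

The automatic choice between windowed and bulk loading is deliberately left to the user, as noted above, because the crossover point depends on the I/O backend.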
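Similarly, reading target_bounds without eval might look something like the sketch below; the "name min max" line layout and this exact config2bounds signature are assumptions for illustration, not necessarily the syntax adopted in the config files.

```python
def config2bounds(value):
    """Parse bounds listed as 'name min max' lines into {name: (min, max)}.

    Uses float() instead of eval, so arbitrary expressions in the config
    are rejected rather than executed. (Illustrative sketch; the real
    configparser.config2bounds may parse a different layout.)
    """
    bounds = {}
    for line in value.strip().splitlines():
        name, lo, hi = line.split()
        bounds[name] = (float(lo), float(hi))
    return bounds
```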

Things that this merge request does not do

  • it does not change the signature of SupervisedClassifier.timeseries to ingest a Quiver instead of a ClassifierData object, and I do not intend to make this change, for the following reasons:
    • we already have a class method that ingests a Quiver; it's called evaluate, and we don't need to duplicate it
    • we do not cache the results of Quiver.vectorize, so passing a single quiver to multiple calls to timeseries does not save time on calls to vectorize. The only savings would be in constructing the quiver itself, which is expected to be fast
    • there can be algorithm-specific optimizations that require direct access to a ClassifierData object instead of passing through a Quiver (e.g.: in both OVL and DOVL). These speed-ups are worthwhile enough that they should be maintained.
  • it does not fully implement parallelization of Quiver.vectorize through multiprocessing. I simply have not had time to implement this yet, but it should be coming.
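For reference, one possible shape for that parallelization, using the standard library's multiprocessing.Pool; the function names and per-sample payload here are invented, since the actual Quiver internals are not settled in this merge request.

```python
import multiprocessing as mp

def _vectorize_one(args):
    # stand-in for the per-sample work done inside Quiver.vectorize;
    # the real payload would be feature extraction around one sample time
    time, data = args
    return [time, sum(data)]

def vectorize_parallel(samples, num_proc=2):
    """Sketch of parallelizing Quiver.vectorize with a process pool.

    The loop over samples is embarrassingly parallel, so pool.map
    preserves ordering while distributing the work. (Illustrative only.)
    """
    with mp.Pool(num_proc) as pool:
        return pool.map(_vectorize_one, samples)
```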

Now, I'd like to comment on the Factory schema introduced in investigate-performance-issues. It provides a way to dynamically discover subclasses and map the names of those subclasses to their associated instantiation methods. It replaces loops that constructed a dictionary at the bottom of several modules with a recursive call that performs a similar loop, creating a similar dictionary upon instantiation of each Factory object. That would be useful if we expected the set of accessible subclasses to change dynamically, but we do not. Repeating the construction of that map upon each instantiation is wasteful; it would be better to construct the map once, at import time. The objects could then reference these static maps, much like the dictionaries that were constructed before. If we buy that, then the objects are equivalent to a single function call and should just be made functions. If we also buy that, then this is equivalent to the function calls I had previously defined within configparser.py. To wit, what we now use module.factory(..., flavor=..., ...) to do, we could have equivalently done with module.known[flavor](...) everywhere it is needed, without restructuring the code and stripping configparser.py of almost all its functionality.
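To make that equivalence concrete, here is a toy version (the class names are invented, not the real iDQ hierarchy): a recursive subclass walk run once at import time yields exactly the kind of module-level dictionary that a Factory rebuilds on every instantiation.

```python
class Reporter:
    pass

class StreamReporter(Reporter):
    pass

class DiskReporter(Reporter):
    pass

def _discover(cls):
    """Recursively map subclass names to classes, mimicking the loop a
    Factory runs on each instantiation."""
    known = {}
    for sub in cls.__subclasses__():
        known[sub.__name__] = sub
        known.update(_discover(sub))  # pick up sub-subclasses too
    return known

# computed once at import time, like the old module-level dictionaries
known = _discover(Reporter)

# module.factory(..., flavor='DiskReporter') is then equivalent to:
reporter = known['DiskReporter']()
```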

What's more, the criticism that circular dependencies were inevitably introduced was an artifact of the particular delegation scheme introduced to create classifier_data_lists, which are no longer used and have been removed entirely in favor of the unified logic within features.classifier_dataANDtimes2quiver; configparser.py probably didn't need to be stripped.

This logic chain depends on the claim that we don't/won't need to do dynamic re-evaluation of the mapping between declared subclasses and their names (strings). The new infrastructure allows for that possibility, which isn't necessarily bad, but I would argue that we don't want to rely on it if at all possible. Subclasses should be declared within a single module for sanity's sake.

Nonetheless, the additional overhead of re-computing this mapping is almost certainly small and will be incurred only once each time we call one of the main executable helper functions (e.g.: batch.train). The cost is small enough that continuing to discuss it seems more wasteful of our time than letting it stand, as long as the changes introduced in this merge request pass testing.
