WIP: Reed investigate performance issues
This merge request is still a work in progress, mostly because the changes have NOT been tested.
This merge request should wrap up several things I've promised to deliver for the batch pipeline. In particular:

- implementing `io.UmbrellaClassifierData`
- reworking how quivers are built (and all that delegation chain) to utilize `UmbrellaClassifierData` in a smart way that should work for both densely spaced and sparsely spaced quivers
  - we condition on a `window` kwarg to filter the `ClassifierData` around sample times
  - we condition on `stride` and `window` kwargs to split I/O into smaller objects, nested within `Umbrellas`
  - if samples are spaced closely, but not close enough that their windows overlap, it may be faster to just load in all the data instead of filtering by the windows around each sample. Currently, we do not support automatic logic to handle that because it could be quite implementation dependent. Instead, we provide a flag called `do_not_window` that stops the code from windowing around times. Users must specify this by hand (i.e., the burden is on them to know what they're doing).
- reworking how `Factory` objects work a bit to avoid repeated work (more on Factories below)
  - I made factories callable (just a small change in syntax) and stripped out a lot of the explicit checking within them in favor of catching errors. This should avoid repeating conditionals that we know will evaluate to True.
  - I also explicitly called out the `flavor` kwarg instead of relying on it being passed through `**kwargs`. This required us to declare the input arguments explicitly for each type of Factory. That would only be a problem for `ReporterFactory`, because `StreamReporter` did not follow the API declared in `Reporter`. `StreamReporter` won't work as-is for several reasons, so we know it must be reworked, and therefore I'm fine breaking it further in this way.
- adding new `PredictiveKW?ClassifierData` objects that use a `stride` to predict file and directory names
  - changes associated with this trickle down to `idq-kwm2kws` and some sanity-check scripts
- changed how `target_bounds` are read in from config files to avoid using `eval` whenever possible
  - this introduces a slight change to the syntax within the config files and also required changes to `configparser.config2bounds` (renamed from `config2target_bounds`)
- moved some functions used to predict GSTLAL filenames and directory names from `utils.py` to `names.py`
- added some verbosity statements within `OVL` and `DOVL` to help users track and time the progress of their training
- misc cleaning up and removal of unused functions and objects
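To illustrate the windowing behavior described above: the following is a minimal sketch, not the actual iDQ implementation — `select_segments`, its kwargs, and the merge logic are hypothetical stand-ins for how a `window` kwarg and the `do_not_window` flag could interact.

```python
def select_segments(times, window=1.0, do_not_window=False):
    """Return (start, end) segments around each sample time.

    If do_not_window is True, skip per-sample filtering and return a
    single segment spanning all samples, so callers can load the data
    in one shot (the user-controlled escape hatch described above).
    """
    times = sorted(times)
    if do_not_window:
        return [(times[0] - window, times[-1] + window)]
    segs = []
    for t in times:
        start, end = t - window, t + window
        if segs and start <= segs[-1][1]:
            # overlapping windows collapse into one segment
            segs[-1] = (segs[-1][0], end)
        else:
            segs.append((start, end))
    return segs
```

The interesting regime is samples spaced closely but with non-overlapping windows: the per-sample segments stay separate, and whether many small reads beat one big read is exactly the implementation-dependent question the flag leaves to the user.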
Things that this merge request does not do
- it does not change the signature of `SupervisedClassifier.timeseries` to ingest a `Quiver` instead of a `ClassifierData` object, and I do not intend to make this change. This was motivated by the following:
  - we already have a class method that ingests a `Quiver`; it's called `evaluate` and we don't need to repeat it
  - we do not cache the results of `Quiver.vectorize`, so passing a single quiver to multiple calls to `timeseries` does not save time on calls to `vectorize`. The only savings would be in constructing the quiver itself, which is expected to be fast
  - there can be algorithm-specific optimizations that require direct access to a `ClassifierData` object instead of passing through a `Quiver` (e.g., in both `OVL` and `DOVL`). These speed-ups are worthwhile enough that they should be maintained.
- it does not fully implement parallelization of `Quiver.vectorize` through multiprocessing. I simply have not had time to implement this yet, but it should be coming.
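For concreteness, one plausible shape for that future parallelization is sketched below; `Quiver`, `sample2vector`, and the `num_proc` kwarg are placeholders, not the real iDQ API.

```python
import multiprocessing as mp

def sample2vector(sample):
    # stand-in for the per-sample feature extraction that vectorize performs
    return [sample * sample]

class Quiver(object):
    """Toy quiver holding a list of samples."""

    def __init__(self, samples):
        self.samples = list(samples)

    def vectorize(self, num_proc=1):
        if num_proc <= 1:
            return [sample2vector(s) for s in self.samples]
        # fan the per-sample work out over a process pool; on spawn-based
        # platforms this requires the usual `if __name__ == "__main__"` guard
        pool = mp.Pool(num_proc)
        try:
            return pool.map(sample2vector, self.samples)
        finally:
            pool.close()
            pool.join()
```

Because the results are not cached (per the note above), each call to `vectorize` pays the full cost, which is precisely why parallelizing the map is the natural optimization.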
Now, I'd like to comment on the `Factory` schema introduced in `investigate-performance-issues`. This provides a way to dynamically discover subclasses and map the names of those subclasses to their associated instantiation methods. It replaces loops that constructed a dictionary at the bottom of several modules with a recursive call that performs a similar loop to create a similar dictionary upon instantiation of each `Factory` object. That would be useful if we expected to dynamically change the set of accessible subclasses, but we do not. Repeating the construction of that map upon each instantiation is wasteful, and it would be better to construct the map once at compilation time. The objects could then reference these static maps, much like the dictionaries that were constructed before. If we buy that, then the objects are equivalent to a single function call and should just be made functions. If we also buy that, then this is equivalent to the function calls that I had previously defined within `configparser.py`. To wit, what we now use `module.factory(..., flavor=..., ...)` to do, we could have equivalently done with `module.known[flavor](...)` everywhere it is needed, without restructuring the code and stripping `configparser.py` of almost all its functionality.
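To make that equivalence concrete, here is a toy sketch; the class names, the `flavor` attribute, and `_discover` are illustrative stand-ins, not the real iDQ objects.

```python
class Reporter(object):
    flavor = 'base'

class StreamReporter(Reporter):
    flavor = 'stream'

class FileReporter(Reporter):
    flavor = 'file'

def _discover(klass):
    """Recursively map flavor -> subclass, as the Factory schema does."""
    known = {}
    for sub in klass.__subclasses__():
        known[sub.flavor] = sub
        known.update(_discover(sub))
    return known

class ReporterFactory(object):
    def __init__(self):
        # this map is rebuilt on every instantiation of the Factory
        self._known = _discover(Reporter)

    def __call__(self, flavor, *args, **kwargs):
        # instantiate by name, catching errors rather than pre-checking
        return self._known[flavor](*args, **kwargs)

# the alternative argued for above: compute the map once at import time,
# after which known[flavor](...) does the same job as factory(flavor=...)
known = _discover(Reporter)
```

Here `ReporterFactory()('stream')` and `known['stream']()` produce the same object; the only difference is whether the flavor-to-class map is rebuilt per instantiation or referenced as module-level state.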
What's more, the criticism that circular dependencies were inevitably introduced was an artifact of the particular delegation scheme introduced to create `classifier_data_lists`, which are no longer used and have been removed entirely in favor of the unified logic within `features.classifier_data` and `times2quiver`; `configparser.py` probably didn't need to be stripped.
This logic chain depends on the claim that we don't/won't need to do dynamic re-evaluation of the mapping between declared subclasses and their names (strings). The new infrastructure allows for that possibility, which isn't necessarily bad, but I would argue that we don't want to rely on it if at all possible. Subclasses should be declared within a single module for sanity's sake.
Nonetheless, the additional overhead of re-computing this mapping is almost certainly small and will be incurred only once each time we call one of the main executable helper functions (e.g., `batch.train`). The cost is small enough that it seems more wasteful of our time to continue discussing this, as long as the changes introduced in this merge request pass testing.