modifications to I/O and parallelization (!19) · Merge requests · lscsoft / iDQ

Reed Essick requested to merge reed-investigate-performance-issues into master May 08, 2018

This merge request should wrap up several things I've promised to deliver for the batch pipeline. It should replace !17 (merged) because this branch was based on those developments.

adding new PredictiveKW?ClassifierData objects that use a stride to predict file and directory names
- changes associated with this trickle down to idq-kwm2kws and some sanitycheck scripts
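
To illustrate the stride idea, here is a minimal sketch of predicting paths from a time range alone, without listing any directory. The naming scheme, the `predict_paths` helper, and its arguments are assumptions made for illustration; the templates actually used by the new ClassifierData objects are not shown in this description and may differ.

```python
import os


def predict_paths(start, end, stride, rootdir=".", basename="KW_TRIGGERS", ext="trg"):
    """Predict the files covering [start, end) from the stride alone.

    Hypothetical layout: <rootdir>/<basename>-<t0 // 100000>/<basename>-<t0>-<stride>.<ext>
    """
    # align the first file with the stride boundary at or before `start`
    t0 = (int(start) // stride) * stride
    paths = []
    while t0 < end:
        dirname = "%s-%d" % (basename, t0 // 100000)
        filename = "%s-%d-%d.%s" % (basename, t0, stride, ext)
        paths.append(os.path.join(rootdir, dirname, filename))
        t0 += stride
    return paths
```

Because the paths follow from arithmetic on `start`, `end`, and `stride`, no glob or directory walk is needed to find candidate files.
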
implementing io.UmbrellaClassifierData
reworking how quivers are built (and all that delegation chain) to utilize UmbrellaClassifierData in a smart way that should work for both densely spaced and sparsely spaced quivers.
- this is handled in a QuiverFactory object now.
- we condition on a window kwarg to filter the ClassifierData around sample times
- we condition on stride and window kwargs to split I/O into smaller objects, nested within Umbrellas
- if samples are spaced closely but not close enough that their windows overlap, it may be faster to just load in all the data instead of filtering by the windows around each sample. Currently, we do not support automatic logic to handle that because it could be quite implementation dependent. Instead, we provide a flag called do_not_window that will stop the code from windowing around times. Users must specify this by hand (i.e.: the burden is on them to know what they're doing).
- Note, we currently do not re-use any of the triggers already loaded into the main ClassifierData passed to the QuiverFactory at instantiation when we make the smaller ClassifierData objects within the quiver. This should be a relatively simple iteration and could save repeated I/O
  - most likely associated with re-loading the target channel's triggers, which shouldn't be necessary in most cases but it's good to have this safety net in place.
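
The windowing and stride logic described above can be sketched roughly as follows. This is not the QuiverFactory code: `window_segments` and `split_by_stride` are hypothetical helpers, and only the `window`, `stride`, and `do_not_window` names come from this description.

```python
def window_segments(times, window, start, end, do_not_window=False):
    """Return merged (lo, hi) segments around each sample time, clipped to [start, end]."""
    if do_not_window:
        # caller opted out of windowing: cover the whole span in one segment
        return [(start, end)]
    segments = []
    for t in sorted(times):
        lo, hi = max(start, t - window), min(end, t + window)
        if segments and lo <= segments[-1][1]:
            # overlaps (or touches) the previous window, so extend it
            segments[-1] = (segments[-1][0], max(segments[-1][1], hi))
        else:
            segments.append((lo, hi))
    return segments


def split_by_stride(segment, stride):
    """Split one segment at stride-aligned boundaries into smaller pieces."""
    lo, hi = segment
    pieces = []
    while lo < hi:
        edge = min(hi, (lo // stride + 1) * stride)
        pieces.append((lo, edge))
        lo = edge
    return pieces
```

Densely spaced samples collapse into a few long segments, while sparse samples yield many short ones; each piece returned by `split_by_stride` would correspond to one small I/O object nested inside an Umbrella.
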
re-working how Factory objects work a bit to avoid repeated work (more on Factories below)
- I made factories callable (just a small change in syntax) and stripped out a lot of the explicit checking within them in favor of catching errors. This should avoid repeating conditionals that we know will evaluate to True
- I also explicitly called out the flavor kwarg instead of relying on it being passed through **kwargs. This required us to declare the input arguments explicitly for each type of Factory. That would only be a problem for ReporterFactory because StreamReporter did not follow the API declared in Reporter. StreamReporter won't work as is for several reasons, so we know it must be re-worked and therefore I'm fine breaking it further in this way.
- factory objects are now declared in factories.py to avoid circular dependencies
- implemented QuiverFactory and changed the API of SupervisedClassifier.timeseries to ingest a QuiverFactory instead of a ClassifierData. QuiverFactory objects are thin wrappers around ClassifierData references, so this still provides convenient access to the underlying ClassifierData object while removing the need for anything in classifiers.py to import from features.py or factories.py, thereby simplifying the dependency structure.
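
As a rough sketch of the callable-factory pattern, assuming a toy class and registry that are not part of iDQ: the factory is invoked directly, `flavor` is an explicit keyword, and an unknown flavor is handled by catching the lookup error instead of checking up front.

```python
class ToyReporter(object):
    """Stand-in for a real Reporter; hypothetical and only used for illustration."""

    def __init__(self, rootdir, start, end, **kwargs):
        self.rootdir, self.start, self.end = rootdir, start, end
        self.kwargs = kwargs


class ReporterFactory(object):
    """Callable factory: instances are invoked directly to build objects."""

    # hypothetical registry mapping flavor names to classes
    known = {"toy": ToyReporter}

    def __call__(self, rootdir, start, end, flavor=None, **kwargs):
        # flavor is declared explicitly instead of being pulled out of **kwargs
        try:
            cls = self.known[flavor]
        except KeyError:
            # diagnostics are only built on the failure path
            raise ValueError(
                "unknown flavor=%r; expected one of %s" % (flavor, sorted(self.known))
            )
        return cls(rootdir, start, end, **kwargs)
```

Usage would look like `ReporterFactory()("/some/dir", 0, 10, flavor="toy")`; the common path performs a single dictionary lookup and no conditionals.
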
parallelization of quiver.vectorize via delegation to multiprocessing pool.map. The scaling appears to be very much sub-linear, meaning we run less than twice as fast if we give it twice as many cores. The cause of that is not completely understood, but parallelization can save >30% of run time.
- added a children attribute to the main ClassifierData API so that identification of a set of ClassifierData within parallelized Quiver.vectorize is easier.
- we may still want to structure FeatureVector.vectorize more cleverly to try to save time there. This has not been investigated. This could involve
  - relying on pre-sorted triggers when searching through a trigger list
  - avoiding repeated construction of names within the function call
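
A minimal sketch of delegating per-sample work to `multiprocessing.Pool.map`, combined with the pre-sorted trigger search mentioned above. The feature (a trigger count per window), the function names, and the `num_proc` argument are assumptions for illustration and do not reflect the actual Quiver.vectorize implementation.

```python
import bisect
import multiprocessing


def count_in_window(args):
    """Count triggers within +/- window of one sample time.

    Takes a single tuple because Pool.map passes exactly one argument.
    `triggers` must already be sorted so bisect can locate the edges.
    """
    triggers, time, window = args
    lo = bisect.bisect_left(triggers, time - window)
    hi = bisect.bisect_right(triggers, time + window)
    return hi - lo


def vectorize(triggers, times, window, num_proc=1):
    """Compute one feature per sample time, optionally across several processes."""
    triggers = sorted(triggers)  # sort once instead of once per sample
    jobs = [(triggers, t, window) for t in times]
    if num_proc <= 1:
        return [count_in_window(job) for job in jobs]
    pool = multiprocessing.Pool(num_proc)
    try:
        # map preserves the order of `jobs`, so results line up with `times`
        return pool.map(count_in_window, jobs)
    finally:
        pool.close()
        pool.join()
```

In this sketch every job carries a copy of the trigger list to the worker, and that serialization cost is one plausible reason a pool can scale worse than linearly; whether it explains the scaling observed here has not been established. On platforms that spawn rather than fork worker processes, calls with `num_proc > 1` should sit under an `if __name__ == "__main__":` guard.
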
changed how target_bounds are read in from config files to avoid using eval whenever possible.
- this introduces a slight change to the syntax within the config files and also required changes to configparser.config2bounds (renamed from config2target_bounds)
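
For illustration, bounds can be parsed without `eval` along these lines. The config syntax shown (one `name min max` triple per line), the `[samples]` section name, and the `parse_bounds` helper are all hypothetical; the syntax actually adopted in the config files is not reproduced in this description. The sketch uses the Python 3 standard-library `configparser`, not the project's own module of the same name.

```python
import configparser

# hypothetical config fragment: one "name min max" triple per line
EXAMPLE = """
[samples]
target_bounds =
    significance 25.0 inf
    frequency 16 2048
"""


def parse_bounds(value):
    """Parse whitespace-separated 'name min max' lines into {name: (min, max)}."""
    bounds = {}
    for line in value.strip().splitlines():
        if not line.strip():
            continue
        name, low, high = line.split()
        # float() accepts 'inf' and rejects arbitrary expressions, unlike eval
        bounds[name] = (float(low), float(high))
    return bounds


config = configparser.ConfigParser()
config.read_string(EXAMPLE)
bounds = parse_bounds(config.get("samples", "target_bounds"))
```

A malformed or malicious entry raises `ValueError` from `float()` or from tuple unpacking instead of being executed.
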
moved some functions used to predict GSTLAL filenames and directory names from utils.py to names.py
added some verbosity statements within OVL and DOVL to help users track and time the progress of their training
fixed DOVL's calculation of eff_fap so it does not discard configurations that remove glitches but not clean times (i.e.: configs that perform super well). This would almost certainly be associated with poor sampling of the clean times distribution (too low random_rate) combined with small coincidence windows.
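
One way to keep such configurations is to treat a vanishing false-alarm probability explicitly instead of letting the ratio fail or be dropped. This is only a sketch under that assumption; the function below is hypothetical and DOVL's actual handling is not shown in this description.

```python
def eff_fap(eff, fap):
    """Ratio of efficiency to false-alarm probability, with fap == 0 handled explicitly."""
    if fap > 0:
        return eff / fap
    # no clean samples were removed: rank the configuration as highly as possible
    # when it removes glitches, rather than discarding it
    return float("inf") if eff > 0 else 0.0
```
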
fixed a bug in batch._batch involving cross validation. We now only perform automatic cross validation if more than one bin is supplied; otherwise we train and evaluate on the same time period (the single bin).
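
The branching described here can be sketched as below. The leave-one-out pairing is an assumption about what "automatic cross validation" means in this context, and `train_evaluate_pairs` is a hypothetical helper, not the code in batch._batch.

```python
def train_evaluate_pairs(bins):
    """Return (training_bins, evaluation_bin) pairs for a list of time bins."""
    bins = list(bins)
    if len(bins) > 1:
        # cross validation: evaluate each bin with a model trained on all the others
        return [(bins[:i] + bins[i + 1:], b) for i, b in enumerate(bins)]
    # a single bin cannot be held out, so train and evaluate on the same period
    return [(list(bins), b) for b in bins]
```
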
added option to ignore segment queries within the [segments] section of the config file.
- we may want to move this to the individual [data discovery] sections so we can specify whether or not to ignore science segments in the train and timeseries jobs separately.
renamed how segment utilities are imported to avoid naming conflicts. This was particularly prevalent within io.py
changed Quiver to extend collections.deque and confirmed this does not slow down our main use cases. This also involved changing the API for how Quivers add/append to one another a bit.
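
A toy sketch of a container that extends `collections.deque`; it is not the iDQ Quiver, and the `__add__` behavior shown is an assumption about what a combining API might look like.

```python
from collections import deque


class Quiver(deque):
    """Toy container of feature vectors built on deque; hypothetical."""

    def __add__(self, other):
        # return a new Quiver and leave both operands untouched
        new = Quiver(self)
        new.extend(other)
        return new


q = Quiver([1, 2])
q.append(3)        # O(1) append on the right
q.appendleft(0)    # O(1) append on the left, which a list cannot offer
combined = q + Quiver([4])
```
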

Things that this merge request does not do

it does not change the signature for SupervisedClassifier.timeseries to ingest a Quiver instead of a ClassifierData object, and I do not intend to make this change. This was motivated by the following
- we already have a class method that ingests a Quiver; it's called evaluate and we don't need to repeat it
- we do not cache the results of Quiver.vectorize, so passing a single quiver to multiple calls to timeseries does not save time on calls to vectorize. The only savings would be in constructing the quiver itself, which is expected to be fast
- there can be algorithm-specific optimizations that require direct access to a ClassifierData object instead of passing through a Quiver (e.g.: in both OVL and DOVL). These speed-ups are worthwhile enough that they should be maintained.
fully implement parallelization of Quiver.vectorize through Multiprocessing. I simply have not had time to implement this yet, but it should be coming.
made logs.get_logger check for the existence of a directory before attempting to set up a FileHandler pointing at that directory
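
A minimal sketch of such a check, assuming the missing directory should be created (the description only says its existence is checked) and assuming a simplified signature that is not necessarily that of logs.get_logger:

```python
import logging
import os


def get_logger(name, rootdir=None):
    """Return a logger; attach a FileHandler only once its directory exists."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if rootdir is not None:
        # FileHandler opens the file immediately and fails if the directory is missing
        if not os.path.isdir(rootdir):
            os.makedirs(rootdir)
        handler = logging.FileHandler(os.path.join(rootdir, name + ".log"))
        logger.addHandler(handler)
    return logger
```
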
changed segs2datasets to use collections.defaultdict instead of just a regular dictionary. This function has also been moved to utils.py.
various changes to tests/sanitycheck_* to support and test these changes.
misc cleaning up and removal of unused functions and objects

Edited May 08, 2018 by Reed Essick
