modifications to I/O and parallelization
This merge request should wrap up several things I've promised to deliver for the batch pipeline. It should replace !17 (merged) because this branch was based on those developments.
- adding new `PredictiveKW?ClassifierData` objects that use a `stride` to predict file and directory names
  - changes associated with this trickle down to `idq-kwm2kws` and some sanitycheck scripts
- implementing `io.UmbrellaClassifierData`
- reworking how quivers are built (and the whole delegation chain) to utilize `UmbrellaClassifierData` in a smart way that should work for both densely spaced and sparsely spaced quivers.
  - this is handled in a `QuiverFactory` object now.
  - we condition on a `window` kwarg to filter the `ClassifierData` around sample times
  - we condition on `stride` and `window` kwargs to split I/O into smaller objects, nested within `Umbrellas`
  - if samples are spaced closely but not close enough that their windows overlap, it may be faster to just load in all the data instead of filtering by the windows around each sample. Currently, we do not support automatic logic to handle that because it could be quite implementation dependent. Instead, we provide a flag called `do_not_window` that will stop the code from windowing around times. Users must specify this by hand (i.e.: the burden is on them to know what they're doing).
  - Note, we currently do not re-use any of the triggers already loaded into the main `ClassifierData` passed to the `QuiverFactory` at instantiation when we make the smaller `ClassifierData` objects within the quiver. This should be a relatively simple iteration and could save repeated I/O
    - most likely associated with re-loading the target channel's triggers, which shouldn't be necessary in most cases but it's good to have this safety net in place.
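As a rough illustration of the windowing behaviour described above, here is a minimal sketch. It is not the `QuiverFactory` implementation: the function name `select_triggers` and its signature are hypothetical, and triggers are reduced to bare timestamps. Only the `window` and `do_not_window` names come from this merge request.

```python
from bisect import bisect_left, bisect_right


def select_triggers(trigger_times, sample_times, window, do_not_window=False):
    """Return the sorted trigger times within +/- window of any sample time.

    Hypothetical helper: the real code filters ClassifierData objects,
    not plain lists of floats.
    """
    trigger_times = sorted(trigger_times)
    if do_not_window:
        # Skip the filtering entirely; the caller has decided that
        # loading everything is cheaper than windowing per sample.
        return trigger_times

    kept = set()
    for t in sample_times:
        # Binary search for the slice of triggers inside [t - window, t + window].
        lo = bisect_left(trigger_times, t - window)
        hi = bisect_right(trigger_times, t + window)
        kept.update(range(lo, hi))
    # A set of indices avoids duplicates when windows overlap.
    return [trigger_times[i] for i in sorted(kept)]
```

With sparsely spaced samples the windowed path touches only a few triggers per sample, while `do_not_window=True` returns everything and leaves the trade-off to the user.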
- re-working how `Factory` objects work a bit to avoid repeated work (more on Factories below)
  - I made factories callable (just a small change in syntax) and stripped out a lot of the explicit checking within them in favor of catching errors. This should avoid repeating conditionals that we know will evaluate to True
  - I also explicitly called out the `flavor` kwarg instead of relying on it being passed through `**kwargs`. This required us to declare the input arguments explicitly for each type of Factory. That would only be a problem for `ReporterFactory` because `StreamReporter` did not follow the API declared in `Reporter`. `StreamReporter` won't work as is for several reasons, so we know it must be re-worked and therefore I'm fine breaking it further in this way.
  - factory objects are now declared in `factories.py` to avoid circular dependencies
  - implemented `QuiverFactory` and changed the API of `SupervisedClassifier.timeseries` to ingest a `QuiverFactory` instead of a `ClassifierData`. `QuiverFactory` objects are thin wrappers around `ClassifierData` references, so this still provides convenient access to the underlying `ClassifierData` object while removing the need for anything in `classifiers.py` to import from `features.py` or `factories.py`, thereby simplifying the dependency structure.
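The callable-factory pattern with an explicit `flavor` kwarg might look roughly like the sketch below. The constructor arguments (`rootdir`, `start`, `end`), the `DiskReporter` subclass, and the `_registry` attribute are assumptions made for illustration; they are not taken from `factories.py`.

```python
class Reporter:
    """Stand-in for the declared Reporter API (signature is hypothetical)."""

    def __init__(self, rootdir, start, end, **kwargs):
        self.rootdir = rootdir
        self.start = start
        self.end = end


class DiskReporter(Reporter):
    """Hypothetical concrete flavor."""


class ReporterFactory:
    """Callable factory that maps a flavor name onto a Reporter class."""

    # Hypothetical registry; the real code may discover flavors differently.
    _registry = {"DiskReporter": DiskReporter}

    def __call__(self, rootdir, start, end, flavor=None, **kwargs):
        # Catch the failure instead of checking membership up front,
        # so the common (valid) case pays for no extra conditional.
        try:
            cls = self._registry[flavor]
        except KeyError:
            raise ValueError("unknown Reporter flavor: %r" % (flavor,))
        return cls(rootdir, start, end, **kwargs)
```

Because the positional arguments are spelled out, a class that does not follow the declared `Reporter` signature fails immediately at construction, which is the breakage the `StreamReporter` note above accepts.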
- parallelization of `Quiver.vectorize` via delegation to multiprocessing `pool.map`. The scaling appears to be very much sub-linear, meaning we run less than twice as fast if we give it twice as many cores. The cause of that is not completely understood, but parallelization can save >30% of run time.
  - added a `children` attribute to the main `ClassifierData` API so that identification of a set of `ClassifierData` within parallelized `Quiver.vectorize` is easier.
  - we may still want to structure `FeatureVector.vectorize` more cleverly to try to save time there. This has not been investigated. This could involve
    - relying on pre-sorted triggers when searching through a trigger list
    - avoiding repeated construction of `names` within the function call
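For readers unfamiliar with the delegation pattern, a minimal sketch of mapping per-sample work over a `multiprocessing.Pool` follows. The names `vectorize_one`, `vectorize_all`, and `num_proc` are hypothetical, and the per-sample work is a trivial placeholder rather than real feature extraction.

```python
import multiprocessing


def vectorize_one(sample_time):
    """Placeholder for per-sample feature extraction (hypothetical)."""
    return [sample_time, sample_time ** 2]


def vectorize_all(sample_times, num_proc=1):
    """Vectorize every sample, optionally spreading work over processes."""
    if num_proc <= 1:
        # Serial path: no pickling or process start-up overhead.
        return [vectorize_one(t) for t in sample_times]
    # The worker must be a module-level function so it can be pickled.
    with multiprocessing.Pool(num_proc) as pool:
        # pool.map preserves input order in its result list.
        return pool.map(vectorize_one, sample_times)
```

Pickling inputs and results for each task is one commonly cited source of sub-linear scaling in this pattern, although the cause in this code base has not been established.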
- changed how `target_bounds` are read in from config files to avoid using `eval` whenever possible.
  - this introduces a slight change to the syntax within the config files and also required changes to `configparser.config2bounds` (renamed from `config2target_bounds`)
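To show the general idea of reading bounds without `eval`, here is a small sketch. The line format used (`name low high`, whitespace separated) is purely an assumption for illustration; it is not the new config syntax, and `parse_bounds` is a hypothetical name, not `config2bounds`.

```python
def parse_bounds(text):
    """Parse lines of the assumed form 'name low high' into a dict of float pairs."""
    bounds = {}
    for line in text.strip().splitlines():
        name, low, high = line.split()
        # float() accepts 'inf' and '-inf', so open-ended bounds need no eval.
        bounds[name] = (float(low), float(high))
    return bounds
```

Unlike `eval`, `float()` can only ever produce a number or raise `ValueError`, so a malformed or malicious config entry cannot execute code.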
- moved some functions used to predict GSTLAL filenames and directory names from `utils.py` to `names.py`
- added some verbosity statements within `OVL` and `DOVL` to help users track and time the progress of their training
- fixed `DOVL`'s calculation of `eff_fap` so it does not discard configurations that remove glitches but not clean times (i.e.: configs that perform super well). This would almost certainly be associated with poor sampling of the clean times distribution (too low `random_rate`) combined with small coincidence windows.
- fixed a bug in `batch._batch` involving cross validation. We now only perform automatic cross validation if more than one bin is supplied; otherwise we train and evaluate on the same time period (the single bin).
- added option to ignore segment queries within the [segments] section of the config file.
  - we may want to move this to the individual [data discovery] sections so we can specify whether or not to ignore science segments in the train and timeseries jobs separately.
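The cross-validation rule for `batch._batch` described above can be sketched as follows. The leave-one-bin-out pairing is an assumption chosen for illustration (the actual scheme in `batch._batch` is not specified here), and `train_eval_pairs` is a hypothetical name.

```python
def train_eval_pairs(bins):
    """Return (training bins, evaluation bins) pairs for a list of time bins."""
    bins = list(bins)
    if len(bins) == 1:
        # A single bin cannot be cross validated:
        # train and evaluate on the same time period.
        return [(bins, bins)]
    pairs = []
    for i, held_out in enumerate(bins):
        # Assumed scheme: train on every bin except the one being evaluated.
        train = bins[:i] + bins[i + 1:]
        pairs.append((train, [held_out]))
    return pairs
```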
- renamed how segment utilities are imported to avoid naming conflicts. This was particularly prevalent within `io.py`
- changed `Quiver` to extend `collections.deque` and confirmed this does not slow down our main use cases. This also involved changing the API for how `Quiver`s add/append to one another a bit.
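A deque-backed container of this kind could look like the sketch below. The `__add__` and `__iadd__` semantics shown are assumptions for illustration, not the actual `Quiver` API, and the elements are plain objects rather than feature vectors.

```python
from collections import deque


class Quiver(deque):
    """Ordered collection of feature vectors (plain objects in this sketch)."""

    def __add__(self, other):
        # Concatenation returns a new Quiver and leaves both operands untouched.
        return Quiver(list(self) + list(other))

    def __iadd__(self, other):
        # In-place addition extends this Quiver on the right.
        self.extend(other)
        return self
```

Subclassing `deque` gives constant-time appends and pops at both ends, at the cost of slower indexed access into the middle of a long container.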
Things that this merge request does not do
- it does not change the signature for `SupervisedClassifier.timeseries` to ingest a `Quiver` instead of a `ClassifierData` object, and I do not intend to make this change. This was motivated by the following
  - we already have a class method that ingests a `Quiver`; it's called `evaluate` and we don't need to repeat it
  - we do not cache the results of `Quiver.vectorize`, so passing a single quiver to multiple calls to `timeseries` does not save time on calls to `vectorize`. The only savings would be in constructing the quiver itself, which is expected to be fast
  - there can be algorithm-specific optimizations that require direct access to a `ClassifierData` object instead of passing through a `Quiver` (e.g.: in both `OVL` and `DOVL`). These speed-ups are worthwhile enough that they should be maintained.
- fully implement parallelization of `Quiver.vectorize` through Multiprocessing. I simply have not had time to implement this yet, but it should be coming.
- made `logs.get_logger` check for the existence of a directory before attempting to set up a `FileHandler` pointing at that directory
- changed `segs2datasets` to use `collections.defaultdict` instead of just a regular dictionary. This function has also been moved to `utils.py`.
- various changes to `tests/sanitycheck_*` to support and test these changes.
- misc cleaning up and removal of unused functions and objects
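As an illustration of the directory check mentioned for `logs.get_logger`, here is a minimal sketch. The signature, the log file name, and the choice to create a missing directory (rather than skip the handler or raise) are all assumptions; the real function may behave differently.

```python
import logging
import os


def get_logger(name, log_dir=None):
    """Return a logger, attaching a FileHandler only when log_dir is given."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if log_dir is not None:
        # logging.FileHandler raises if the parent directory is missing,
        # so check first. Creating the directory is an assumption here.
        if not os.path.isdir(log_dir):
            os.makedirs(log_dir)
        handler = logging.FileHandler(os.path.join(log_dir, name + ".log"))
        logger.addHandler(handler)
    return logger
```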