Allow batch.timeseries to process small strides
This merge request was aimed primarily at allowing `batch.timeseries` to process timeseries in small strides rather than all at once, similar to what's being done in `stream.timeseries`. In doing so, `factories.QuiverFactory` was reworked to allow this to take place, along with some other modifications made while testing.
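For context, the batch job now walks the analysis segment in fixed strides rather than building a single timeseries over the whole range, mirroring the streaming behaviour. A minimal sketch of that pattern is below; the function name and the commented call site are purely illustrative, not the actual `batch.timeseries` API.

```python
# Illustrative only: cover [start, end) in fixed-size strides instead of
# processing the whole range at once.
def iterate_strides(start, end, stride):
    """Yield (t0, t1) boundaries that tile [start, end) in chunks of `stride` seconds."""
    t0 = start
    while t0 < end:
        t1 = min(t0 + stride, end)
        yield t0, t1
        t0 = t1

# hypothetical usage, analogous to how the streaming pipeline advances:
# for t0, t1 in iterate_strides(start, end, stride=20):
#     ...build the quiver and evaluate the timeseries for [t0, t1) only...
```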
The other changes besides `batch.timeseries`:
- The division of I/O within `factories.QuiverFactory` is now done within `__init__` rather than `__call__`. This was done to avoid repeating this step over and over again in `batch.timeseries` when working with dense quivers, which can get costly. There was also a rework of how the classifier data objects get assigned to dense quivers when the I/O gets split. When first adding the dense quiver implementation, I added a naive implementation for the case where the I/O gets split up, assigning a single umbrella containing all the different classifier data. This has since been reworked using the work done in `__init__` to only assign the classifier data relevant to the times requested. I also defined a `divide_io` property based on all the different conditions needed to trigger this (i.e. whether `direct` is set to `True`, etc.) to make it a bit clearer. (See the first sketch after this list.)
- `features.py`: `DenseQuiver.extract_times()` now passes segments into `CD.triggers()` to get only the relevant triggers needed to vectorize the quiver. This speeds up timeseries generation in batch mode by about a factor of 2 (at least in my case, with a stride of 20 s) because of the number of extra triggers grabbed by adjacent classifier data in the umbrella due to edge effects. I'm using the `window` property of the select to determine these segments, which is probably going to be defined when we define new selects in the future, so I think it's a safe bet. (See the second sketch after this list.)
- There are a couple of segment-related changes in `io.py` that already got merged into master, so those are irrelevant here.
- I have changed `UmbrellaClassifierData` and child classes to not store data from triggers in `self._data` but rather return data that gets generated on each call. In doing so, I had to update `self._cached` at the end of `triggers()` and redefine the `pop()` function in order for all the other functionality to still make sense. Doing it this way prevents the same data from being stored in memory twice (or possibly more times, in the case of overlapping umbrellas). For a batch job with gstlal triggers, this reduced the memory usage of a single batch training job by a factor of 4 (from 72 GB to 18 GB in the stretch of data I was running over). Regenerating the triggers each time appears to take a negligible amount of time (I wasn't able to notice any difference when running several classifiers sequentially, which should run `triggers()` multiple times). Note: I left `MultiSourceClassifierData.pop()` unimplemented since I don't know the internals well enough to do this properly. (See the third sketch after this list.)
- Line 244 of `io.py`: iterate through `self._data.keys()` rather than `self._data` to loop through channels. I thought it wasn't doing the right thing, but I have since confirmed they're identical. I don't care whether this stays in at this point.
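To make the first bullet a bit more concrete, here is a rough sketch of the pattern of splitting the I/O once at construction time and collecting the triggering conditions into a property. The class name, the `_split_io` helper, and the specific conditions are simplified stand-ins, not the real `factories.QuiverFactory` internals.

```python
class SketchQuiverFactory(object):
    """Toy sketch: do the expensive I/O split once in __init__ instead of in __call__.

    Names and conditions are illustrative stand-ins for the pattern described
    in this merge request, not the actual factories.QuiverFactory code.
    """

    def __init__(self, classifier_data, direct=False, dense=False):
        self.direct = direct
        self.dense = dense
        self._classifier_data = classifier_data
        # do the split once here, mapping each segment to its classifier data
        self._io_map = self._split_io(classifier_data) if self.divide_io else None

    @property
    def divide_io(self):
        """Gather all the conditions that trigger the I/O split in one place."""
        return self.dense and not self.direct  # stand-in for the real conditions

    @staticmethod
    def _split_io(classifier_data):
        # hypothetical: key each classifier data object by its (start, end) segment
        return {(cd.start, cd.end): cd for cd in classifier_data}

    def __call__(self, gps_times):
        if self._io_map is None:
            return list(self._classifier_data)
        # only hand back the classifier data whose segment covers a requested time,
        # rather than a single umbrella containing everything
        return [cd for (start, end), cd in self._io_map.items()
                if any(start <= t < end for t in gps_times)]
```

The point is just that `__call__` no longer redoes the split; it only picks out the classifier data relevant to the requested times.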
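Similarly, the segment restriction in the second bullet amounts to padding the requested times by the select's `window` and merging the overlaps before handing them to `triggers()`. The `segs` keyword and the commented call site below are assumptions based on the description above, not verified against the code.

```python
def times_to_segments(times, window):
    """Pad each requested time by `window` and merge overlaps into segments.

    Sketch of the idea behind DenseQuiver.extract_times() only requesting the
    relevant segments from CD.triggers(); not the actual implementation.
    """
    padded = sorted((t - window, t + window) for t in times)
    segments = []
    for start, end in padded:
        if segments and start <= segments[-1][1]:
            segments[-1][1] = max(segments[-1][1], end)  # merge overlapping pads
        else:
            segments.append([start, end])
    return [tuple(seg) for seg in segments]

# hypothetical call site: only fetch triggers that can contribute to the vectors
# triggers = classifier_data.triggers(channel, segs=times_to_segments(times, select.window))
```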
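Finally, the memory saving in the fourth bullet comes from the umbrella returning a freshly assembled view of its children's triggers instead of keeping its own copy in `self._data`. The sketch below shows the general shape; everything beyond the `triggers()`/`pop()` names and the `_cached` flag mentioned above is an assumption.

```python
class SketchUmbrellaClassifierData(object):
    """Toy sketch: delegate storage to the child ClassifierData objects.

    triggers() rebuilds the combined data on every call instead of keeping a
    second copy in self._data, and pop() forwards to the children so the rest
    of the interface still behaves as before.
    """

    def __init__(self, children):
        self.children = children
        self._cached = False

    def triggers(self, **kwargs):
        data = {}
        for child in self.children:
            for channel, triggers in child.triggers(**kwargs).items():
                data.setdefault(channel, []).extend(triggers)
        self._cached = True  # mark the data as available without storing it here
        return data

    def pop(self, channel):
        # remove the channel from every child so it disappears from future
        # triggers() calls, mirroring what popping from self._data used to do
        popped = []
        for child in self.children:
            popped.extend(child.pop(channel))
        return popped
```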
I had done enough testing at this point in streaming and batch modes, for both OVL and quiver-based classifiers, to convince myself that the results are the same as before. This was my major concern after reworking the internals of `QuiverFactory`.