Allow batch.timeseries to process small strides

Patrick Godwin requested to merge batch_timeseries_strides into master

This merge request primarily allows batch.timeseries to process timeseries in small strides rather than all at once, similar to what stream.timeseries does. Enabling this required a rework of factories.QuiverFactory, along with some other modifications made while testing.
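
For reference, the basic striding pattern looks roughly like the sketch below; the function name and the factory/classifier signatures are illustrative, not the exact batch.timeseries interface:

```python
# Minimal sketch of stride-based batch processing; names here are
# illustrative stand-ins, not the real batch.timeseries API.

def timeseries_by_stride(classifier, quiver_factory, start, end, stride):
    """Evaluate a classifier over [start, end) one stride at a time."""
    results = []
    t = start
    while t < end:
        stop = min(t + stride, end)
        # build a quiver covering only this stride, rather than one
        # spanning the full analysis segment
        quiver = quiver_factory(t, stop)
        results.append(classifier.timeseries(quiver))
        t = stop
    return results
```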

The other changes besides batch.timeseries:

  • I/O division within factories.QuiverFactory is now done in __init__ rather than __call__. This avoids repeating that step on every call from batch.timeseries when working with dense quivers, which can get costly. I also reworked how classifier data objects get assigned to dense quivers when the I/O is split: the original dense quiver implementation handled the split case naively by assigning a single umbrella containing all the classifier data, whereas it now uses the work done in __init__ to assign only the classifier data relevant to the requested times. I also defined a divide_io property that collects all the conditions needed to trigger the split (e.g. whether direct is set to True) to make this a bit clearer (see the first sketch after this list).
  • features.py: DenseQuiver.extract_times() now passes segments to CD.triggers() to fetch only the triggers needed to vectorize the quiver. This speeds up timeseries generation in batch mode by about a factor of 2 (at least in my case, with a stride of 20 s), since adjacent classifier data in the umbrella otherwise grab a lot of extra triggers due to edge effects. I use the select's window property to determine these segments; that property will probably be defined when we define new selects in the future, so I think it's a safe bet (second sketch below).
  • There are a couple of segment-related changes in io.py that were already merged into master, so those are irrelevant here.
  • UmbrellaClassifierData and its child classes no longer store trigger data in self._data; instead they return data that is regenerated on each call. This required updating self._cached at the end of triggers() and redefining pop() so the rest of the functionality still makes sense. Doing it this way prevents the same data from being stored in memory twice (or more, in the case of overlapping umbrellas). For a batch job with gstlal triggers, this reduced the memory usage of a single batch training job by a factor of 4 (from 72 GB to 18 GB over the stretch of data I was running). Regenerating the triggers each time costs a negligible amount of time: I couldn't measure any difference running several classifiers sequentially, which should call triggers() multiple times. Note that I left MultiSourceClassifierData.pop() unimplemented since I don't know the internals well enough to do this properly (third sketch after this list).
  • line 244 of io.py: iterate through self._data.keys() rather than self._data to loop over channels. I initially thought the original wasn't doing the right thing, but I have since confirmed the two are identical; I don't mind whether this change stays in.
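
To make the __init__/__call__ move and the divide_io property concrete, here is a rough sketch; the attribute names, the stand-in split condition, and the tuple representation of classifier data are all assumptions, not the real factories.QuiverFactory internals:

```python
# Sketch of doing the I/O split once at construction time.

class QuiverFactory:
    def __init__(self, classifier_data, direct=False, dense=False):
        self.classifier_data = classifier_data  # list of (start, end, datum) tuples
        self.direct = direct
        self.dense = dense
        if self.divide_io:
            # split once here instead of on every __call__, so repeated
            # calls from batch.timeseries don't redo the I/O work
            self._split = sorted(classifier_data, key=lambda cd: cd[0])

    @property
    def divide_io(self):
        """Gather the conditions that trigger splitting I/O in one place."""
        # stand-in condition; the real property checks things like direct
        return self.dense and not self.direct

    def __call__(self, start, end):
        if self.divide_io:
            # assign only the classifier data overlapping [start, end)
            # instead of one umbrella holding every datum
            return [cd for cd in self._split if cd[0] < end and cd[1] > start]
        return self.classifier_data
```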
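Second, a sketch of the segment restriction in extract_times(); the triggers(segments=...) signature and the window handling follow the description above but are assumptions, not the exact iDQ interfaces:

```python
# Sketch: restrict trigger retrieval to padded segments around the
# requested times, dropping edge-effect triggers from adjacent data.

def extract_times(classifier_data, times, window):
    """Fetch only the triggers needed to vectorize the requested times."""
    # pad each requested time by the select's window; triggers outside
    # these segments would only be edge-effect extras we don't need
    segments = [(t - window, t + window) for t in times]
    return classifier_data.triggers(segments=segments)
```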
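Third, a sketch of the regenerate-on-demand pattern; the class and attribute names mirror the description, but the internals shown are assumptions:

```python
# Sketch: build the merged trigger dict fresh on each triggers() call
# instead of storing it in self._data, so the same triggers are never
# resident in memory twice across overlapping umbrellas.

class UmbrellaClassifierData:
    def __init__(self, children):
        self.children = children  # the member ClassifierData objects
        self._cached = False

    def triggers(self, segments=None):
        data = {}
        for child in self.children:
            # merge child triggers into a transient dict; nothing is
            # kept on self, so memory is held only while in use
            for channel, trgs in child.triggers(segments=segments).items():
                data.setdefault(channel, []).extend(trgs)
        self._cached = True  # children have now been read at least once
        return data

    def pop(self, channel):
        # with no self._data to pop from, forward the removal to children
        for child in self.children:
            child.pop(channel)
```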

At this point I have done enough testing in streaming and batch modes, for both OVL and quiver-based classifiers, to convince myself that the results are the same as before. This was my major concern after reworking the internals of QuiverFactory.
