known/possible issues for streaming pipeline
Below is a list of known/possible issues with the streaming pipeline that were identified as part of !67 (merged) but not addressed.
- padding `new_umbrella` returned by `StreamProcessor.poll` to avoid edge effects when delegating to `FeatureVector.vectorize` (see the sketch below)
  - calling `restrict_segs` when adding `new_umbrella` to the big `umbrella` may update shared references with feature vectors that would undo the padding, so we need to check that this is not the case
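A minimal sketch of the aliasing concern, using `ligo.segments`; `padded_segs` and the `PAD` value are hypothetical, and the real umbrella/`FeatureVector` internals may differ:

```python
from ligo.segments import segment, segmentlist

PAD = 1.0  # hypothetical padding, in seconds

def padded_segs(segs, pad=PAD):
    """return a *new* segmentlist with `pad` added on either side, leaving the
    original untouched so that a later restrict_segs on the umbrella cannot
    silently undo the padding through a shared reference"""
    return segmentlist(segment(s[0] - pad, s[1] + pad) for s in segs).coalesce()

# if new_umbrella and its feature vectors share the same segmentlist object,
# restricting one restricts the other; building a fresh list breaks the aliasing
segs = segmentlist([segment(0, 10)])
padded = padded_segs(segs)
assert padded is not segs
assert padded == segmentlist([segment(-PAD, 10 + PAD)])
```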
- `CalibrationMap` needs a concept of time (segments) to keep provenance of which samples are included
  - these segments should be used with the `Reporter` that writes `CalibrationMap`s (see the sketch below)
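A minimal sketch, assuming `ligo.segments`; the `segs` attribute, the `add` signature, and the toy class are hypothetical, not the current `CalibrationMap` API:

```python
from ligo.segments import segment, segmentlist

class CalibrationMap(object):
    """toy stand-in showing how a CalibrationMap could carry segments that
    record which stretches of time contributed samples"""

    def __init__(self):
        self.segs = segmentlist()  # provenance: spans of the included samples

    def add(self, rank, gps, dt=1.0):
        # alongside whatever bookkeeping the real object does with `rank`,
        # remember the span this sample came from
        self.segs |= segmentlist([segment(gps, gps + dt)])
        self.segs.coalesce()  # keep the provenance list coalesced
```

The `Reporter` that writes `CalibrationMap`s could then serialize `segs` alongside the map itself.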
- `idq-streaming_calibrate` may be systematically missing some of the data from `idq-streaming_evaluate` because of the way `DiskReporter`s manage their caches and `preferred` options
  - make calibrate's stride shorter?
  - muck with `CadenceManager.timestamp` to keep it in line with what was actually read?
  - confirm there is actually a problem...
  - change `DiskReporter`'s behavior to remove this issue (see the sketch below)
    - keep a counter of which line the "preferred" file is in the cache and increment it like `KafkaReporter` would, returning `None` as appropriate if we're off the end of the file
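A minimal sketch of the counter-based fix, not `DiskReporter`'s actual API; `PreferredCache` and its `poll` method are hypothetical names:

```python
class PreferredCache(object):
    """toy stand-in for the proposed DiskReporter bookkeeping"""

    def __init__(self, path):
        self.path = path
        self._lineno = 0  # how many lines we've already handed out

    def poll(self):
        """return the next unread line of the 'preferred' cache file, or None
        if we're off the end of the file (i.e. nothing new has shown up yet),
        mimicking how KafkaReporter consumes its stream"""
        with open(self.path) as obj:
            lines = obj.read().splitlines()
        if self._lineno >= len(lines):
            return None
        line = lines[self._lineno]
        self._lineno += 1
        return line
```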
- KW and GSTLAL ClassifierData need to raise `NoDataError` if they can't find any files within the requested period
- KW and GSTLAL ClassifierData need to raise `IncompleteDataError` (or just `BadSpanError`?) if there is only partial coverage (see the sketch below)
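A minimal sketch of the requested behavior; the exception names come from the items above (they may already exist elsewhere in the codebase), while `check_coverage` and the use of `ligo.segments` segmentlists are assumptions about how the coverage check might look:

```python
from ligo.segments import segment, segmentlist

class NoDataError(Exception):
    """raised when no files overlap the requested span"""

class IncompleteDataError(Exception):
    """raised when files only partially cover the requested span"""

def check_coverage(requested, covered):
    """both arguments are segmentlists: the queried span and the span actually
    covered by the files that were found"""
    if not covered:
        raise NoDataError('no files found within %s' % requested)
    if requested - covered:  # any part of the request left uncovered
        raise IncompleteDataError('only partial coverage of %s' % requested)

# e.g. files covering [0, 50) against a request of [0, 100) should raise
# IncompleteDataError; no files at all should raise NoDataError
```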
- change the order of nested iterations when writing timeseries so that all classifiers are written to disk for a given segment before moving on to the next segment (see the sketch below)
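A minimal sketch of the re-ordered iteration; `write_all_timeseries`, `write_timeseries`, and the container names are hypothetical stand-ins:

```python
def write_all_timeseries(classifiers, segs, write_timeseries):
    """iterate segments in the outer loop so every classifier's timeseries for
    a given segment lands on disk before the next segment is touched"""
    for seg in segs:                      # outer: segments
        for classifier in classifiers:    # inner: classifiers
            write_timeseries(classifier, seg)
```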
- improve provenance for all samples/inferences made. This could be as simple as making FeatureVectors store a unique identifier for the model used to evaluate them (this should be an attribute of the model, preferably searchable so that we can look up which model was used without iterating over all models) in addition to the ranks. Then, CalibrationMap can reference FeatureVectors instead of just ranks and maintain provenance for what went in (vectors already contain gps time, but we may want to add segments to CalibrationMaps as well).
  - we also want to check the "checksum" of the model to make sure we don't have any associated issues (see the sketch below)
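A minimal sketch of one way to get a searchable model identifier plus checksum; `model_hash`, the pickle-based serialization, and the toy `FeatureVector` are all assumptions, not the existing API:

```python
import hashlib
import pickle

def model_hash(model):
    """checksum of the serialized model, so inferences can be tied back to
    exactly the object that produced them"""
    return hashlib.sha256(pickle.dumps(model)).hexdigest()

class FeatureVector(object):
    """toy stand-in: store the rank together with an identifier of the model
    that produced it, so CalibrationMap can keep provenance per sample"""

    def __init__(self, gps, rank, model):
        self.gps = gps
        self.rank = rank
        self.model_id = model_hash(model)  # searchable without iterating models
```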