State vector check is running into the latency of /dev/shm frames
The detchar task checks state vectors around superevent times to make sure that data quality is OK. As part of this, it builds a Cache object of llhoft data and reads the state vectors from it. Sometimes we run into a gwpy ValueError (example from E406327 on production):
ValueError: StateVector with span [1367156859.6875 ... 1367156863.0) does not cover requested interval [1367156859.71 ... 1367156863.774)
The event was submitted to GraceDB at 1367156867.000, 4 seconds after the end of the requested data, so the above data should already be on disk. Looking through the Sentry traceback linked above, gwcelery is able to create a cache with 10 1-second frames from /dev/shm/kafka, which is fed to gwpy.timeseries.core.timeseries_reader() as the source kwarg. timeseries_reader() then passes this cache through to gwpy.timeseries.io.cache.preformat_cache(), where the sieve results in a cache that has only 4 frames, which are then passed downstream:
Since the last frame ends before the end of the requested interval (1367156863.774), we get this ValueError.
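To make the failure mode concrete, here is a minimal sketch of the interval-coverage check that trips here, with the spans hard-coded from the E406327 error above (this is an illustration only, not gwpy's actual _pad_series code):

# Illustration only: coverage check with the E406327 spans hard-coded.
span_start, span_end = 1367156859.6875, 1367156863.0   # data actually read from the cache
req_start, req_end = 1367156859.71, 1367156863.774     # interval requested by check_vectors

if span_start > req_start or span_end < req_end:
    raise ValueError(
        f"StateVector with span [{span_start} ... {span_end}) does not cover "
        f"requested interval [{req_start} ... {req_end})"
    )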
We're seeing this issue a few times a day in both the production and playground environments, tracked on Sentry. I'm not sure why it's only happening sometimes.
Perhaps @duncanmmacleod can shed some light on the gwpy workings that are going on here?
There's a bit of test code in the comment below which reproduces the error, but I think this isn't a gwpy problem; it's just a consequence of the latency of the frames. Sorry for the false alarm!
Some of these occurrences are caused by the detchar check skipping its wait if the requested StateVector's start time is before now; however, the end can still be after now, which means we end up requesting data that doesn't exist yet. !1193 (merged) should fix that.
However, I don't think this fixes everything. For example, for E406368, this is what the gwcelery log reveals:
[2023-05-03 19:54:03,620: INFO/ForkPoolWorker-61/MainThread] Task gwcelery.tasks.gracedb.create_event[4dd4dd4d-07ab-438d-b9ec-38f6a314b4ff] succeeded in 2.3868746110238135s: {'warnings': [], 'submitter': 'emfollow', 'created': '2023-05-04 02:54:01 UTC', 'group': 'External', 'graceid': 'E406368', 'pipeline': 'Fermi', 'gpstime': 1367204052.99, 'reporting_latency': 6.294359, 'instruments': '', 'nevents': None, 'offline': False, 'search': 'GRB', 'far': None, 'far_is_upper_limit': False, 'likelihood': None, 'labels': ['NOT_GRB'], 'extra_attributes': {'GRB': {'ivorn': 'ivo://nasa.gsfc.gcn/Fermi#GBM_Alert_2023-05-04T02:53:54.99_704861639_1-716', 'author_ivorn': 'ivo://nasa.gsfc.tan/gcn', 'author_shortname': 'Fermi (via VO-GCN)', 'observatory_location_id': 'GEOLUN', 'coord_system': 'UTC-FK5-GEO', 'ra': 0.0, 'dec': 0.0, 'error_radius': 0.0, 'how_description': 'Fermi Satellite, GBM Instrument', 'how_reference_url': 'http://gcn.gsfc.nasa.gov/fermi.html', 'T90': None, 'trigger_duration': 0.256, 'designation': None, 'redshift': None, 'trigger_id': '704861639'}}, 'superevent': None, 'superevent_neighbours': {}, 'links': {'neighbors': 'https://gracedb.ligo.org/api/events/E406368/neighbors/', 'lo...', ...}}
[2023-05-03 19:54:03,625: INFO/MainProcess/MainThread] Task gwcelery.tasks.detchar.check_vectors[6f4fa02c-92b9-462d-ab5d-c2d81507d032] received
[2023-05-03 19:54:03,692: ERROR/ForkPoolWorker-61/MainThread] Task gwcelery.tasks.detchar.check_vectors[6f4fa02c-92b9-462d-ab5d-c2d81507d032] raised unexpected: ValueError('StateVector with span [1367204050.9375 ... 1367204055.0) does not cover requested interval [1367204050.99 ... 1367204055.246)')
Traceback (most recent call last):
  File "/home/emfollow/.local/lib/python3.9/site-packages/celery/app/trace.py", line 451, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/emfollow/.local/lib/python3.9/site-packages/sentry_sdk/integrations/celery.py", line 207, in _inner
    reraise(*exc_info)
  File "/home/emfollow/.local/lib/python3.9/site-packages/sentry_sdk/_compat.py", line 60, in reraise
    raise value
  File "/home/emfollow/.local/lib/python3.9/site-packages/sentry_sdk/integrations/celery.py", line 202, in _inner
    return f(*args, **kwargs)
  File "/home/emfollow/.local/lib/python3.9/site-packages/celery/app/trace.py", line 734, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/emfollow/.local/lib/python3.9/site-packages/gwcelery/tasks/detchar.py", line 454, in check_vectors
    states.update(check_vector(caches[channel.split(':')[0]], channel,
  File "/home/emfollow/.local/lib/python3.9/site-packages/gwcelery/tasks/detchar.py", line 350, in check_vector
    statevector = StateVector.read(cache, channel,
  File "/home/emfollow/.local/lib/python3.9/site-packages/gwpy/timeseries/statevector.py", line 689, in read
    return super().read(source, *args, **kwargs)
  File "/home/emfollow/.local/lib/python3.9/site-packages/gwpy/timeseries/core.py", line 310, in read
    return timeseries_reader(cls, source, *args, **kwargs)
  File "/home/emfollow/.local/lib/python3.9/site-packages/gwpy/timeseries/io/core.py", line 50, in read
    return io_read_multi(joiner, cls, source, *args, **kwargs)
  File "/home/emfollow/.local/lib/python3.9/site-packages/gwpy/io/mp.py", line 101, in read_multi
    return flatten(out)
  File "/home/emfollow/.local/lib/python3.9/site-packages/gwpy/timeseries/io/core.py", line 81, in _join
    return _pad_series(
  File "/home/emfollow/.local/lib/python3.9/site-packages/gwpy/timeseries/io/core.py", line 139, in _pad_series
    raise ValueError(
ValueError: StateVector with span [1367204050.9375 ... 1367204055.0) does not cover requested interval [1367204050.99 ... 1367204055.246)
The important part here is the time at which the check_vectors task was received: 2023-05-03 19:54:03,625 Pacific, i.e. 2023-05-04 02:54:03,625 UTC, or GPS 1367204061.
The requested interval is [1367204050.99 ... 1367204055.246), and the StateVector span is [1367204050.9375 ... 1367204055.0). All of this is ~6 seconds before the check_vectors task is called.
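For reference, the UTC-to-GPS conversion above can be checked with gwpy (a quick sketch; the fractional seconds are dropped here):

from gwpy.time import tconvert

# 2023-05-04 02:54:03 UTC should print as GPS 1367204061
# (consistent with gpstime 1367204052.99 for the 02:53:54.99 Fermi trigger).
print(tconvert('2023-05-04 02:54:03'))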
I think what's happening instead is that we're running into the inherent latency of the low-latency frames. Running this bit of test code on the emfollow machine with end_diff = 5 results in the above ValueError every time. The same code with end_diff = 6 runs without any errors.
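The actual test snippet isn't visible in this extract, but it was roughly of the following form (a sketch only; the channel name, frame directory, and request duration are assumptions, not the original code):

from glob import glob

from gwpy.time import tconvert
from gwpy.timeseries import StateVector

end_diff = 5                            # seconds between "now" and the requested end time
channel = 'H1:GDS-CALIB_STATE_VECTOR'   # assumed llhoft state-vector channel
frame_dir = '/dev/shm/kafka/H1_llhoft'  # assumed low-latency frame directory

end = float(tconvert('now')) - end_diff
start = end - 4                         # request a few seconds of data, as check_vectors does

# 1-second frames currently on disk; the newest ones may not have arrived yet
cache = sorted(glob(f'{frame_dir}/*.gwf'))

# With end_diff = 5 this usually raises the "does not cover requested interval"
# ValueError; with end_diff = 6 it succeeds.
sv = StateVector.read(cache, channel, start=start, end=end)
print(sv.span)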
@stuart.anderson @patrick.brockill Is ~6 seconds the expected latency for the population of /dev/shm/kafka/*/*.gwf?
If so, then our check_vectors task should make sure that at least 6 seconds have passed since the end of the requested StateVector window before running. If they haven't, it could either wait (which increases the latency of the whole system) or handle the error and skip the check (which risks allowing bad data, injections, or overflows to end up in public alerts). We already run the latter risk with EarlyWarning events, which don't get a check_vectors check, but skipping it for all events that come in at low latency seems risky as well.
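As an illustration of the first option (waiting), here is a minimal sketch of such a guard; the 6-second constant and the helper name are assumptions for illustration, not the actual gwcelery change:

from gwpy.time import tconvert

LLHOFT_LATENCY = 6  # assumed worst-case /dev/shm/kafka latency, in seconds

def seconds_until_data_expected(request_end_gps):
    """Return how long to wait before frames covering ``request_end_gps``
    should have landed in /dev/shm/kafka (0 if they should already be there)."""
    now = float(tconvert('now'))
    return max(0.0, request_end_gps + LLHOFT_LATENCY - now)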
Geoffrey Mo changed title from Caches for state vector checking are sometimes being over-sieved when being read by gwpy to State vector check is running into the latency of /dev/shm frames
I did this testing on the emfollow machine, but I'm not sure whether this (~6 s to data availability) is a larger latency than expected, or whether it's normal and just something we hadn't run into before.
@geoffrey.mo while we wait for the definitive answer from @pb, please also check emfollow-{dev,test,playground}, which I suspect have the same performance, and then one other random bare-metal login system at CIT, e.g., ldas-pcdev3.
I get the same result on emfollow-{dev,test,playground} and ldas-pcdev3 as on emfollow: asking for data 5 seconds before now gets me the StateVector error about 70% of the time (13-14 out of 20 queries), and asking for data 6 seconds before gives me no errors. (FWIW, the test snippet runs much faster on pcdev3 than on the emfollow nodes, but that could just be a function of the load on the nodes.)
Bump on this @stuart.anderson (it seems like Patrick Brockill isn't in the emfollow group and so can't be tagged). I'm guessing the 6 seconds or so from acquisition to data availability at CIT is the expected latency? Maybe @patrick.godwin or @jameson.rollins might know?
It depends on which stream we're talking about: 6 seconds may be true for the replay but not necessarily for the live stream. What should be the case, however, is that by the time a full-bandwidth trigger comes in, the data should be available up to t_merger, since the search has already processed that data.
There are two (known) issues with the latency of the O3 Replay MDC streams:
We have about 1830 machines pulling this data, which causes the latency to become bursty. More on this below.
We have seen a bit of extra burstiness due to infrequent spikes in latency from the server where that data is produced, lowlatency@cit; these spikes should not be there (the cause could be hardware, software, etc.). I hope this will get better when we move to a new server, lowlatency1@cit.
Here is a graph of the latency of the O3 Replay MDC streams on emfollow over the last 7 hours:
Note that it's log-log; otherwise it would be difficult to notice the periodic outliers, particularly the worst ones at just under 12 seconds.
Generally speaking, even if the new server performs better, there are always going to be some outliers with this number of clients - probably not 12 seconds, but a few seconds nonetheless.
The live data will have burstiness for various reasons, but it should not have the extra burstiness coming from the lowlatency@cit server, as the data is being pulled directly from the observatories to the brokers which distribute that data to the cluster, bypassing that server.
Is this also true for live data? I've been querying the live data here and am mostly concerned about that for O4.
While this problem isn't happening for full-bandwidth GW searches, we are seeing it with external events. If a trigger from, e.g., Fermi comes in, we run the detchar check on it so that we have an assessment of the GW data quality/availability around that time. However, if it lands in GraceDB and triggers this task less than ~6 seconds after the trigger time, we won't be able to process it. One way to fix this is to retry the task after t_event + tau_latency, where tau_latency is the constant (6 seconds?) we might expect for the latency of the live O4 frames.
For live data, the latencies will be shifted due to latencies in data generation, calibration, etc. For the replay, there's only the latencies due to the data distribution itself.
I think retries are essential here, as there is no guarantee that 100% of the data will arrive within T seconds. Instead of retrying once, what about retrying N times, with the retry times set by t_event + i * tau_latency?
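A minimal sketch of what such a retry schedule could look like using Celery's built-in retry support (the task wrapper, run_state_vector_check, N, and TAU_LATENCY below are placeholders for illustration, not the actual gwcelery implementation):

from celery import shared_task

N = 3            # placeholder number of retries
TAU_LATENCY = 6  # placeholder retry spacing in seconds

@shared_task(bind=True, max_retries=N)
def check_vectors_with_retry(self, *args, **kwargs):
    try:
        # hypothetical wrapper around the real state-vector check
        return run_state_vector_check(*args, **kwargs)
    except ValueError as exc:
        # the i-th retry fires roughly at t_event + i * tau_latency
        raise self.retry(exc=exc, countdown=TAU_LATENCY)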
Just to finish this discussion up, here are the same graphs for the live data:
LHO:
LLO:
Virgo:
The second "hump" in the LHO and LLO plots may be due to a number of things. In any event, as you can see, it's rare for the LHO or LLO latency to go over about 9.5 seconds.
The Virgo data has a higher latency for different reasons, one being that they send four one-second frame files every four seconds, which is inherently bursty. I'm not sure what happened with the outliers at about 100 seconds, but it's likely that some kind of work was being performed at that time.
@pb I'm giving a talk in a week. Is it OK if I use the data distribution time plots you made above? I also wanted to check whether they're roughly representative of the current status.
@deep.chatterjee Use the plots: Sure, no problem. Roughly representative: as far as I'm aware, nothing major has changed, but you might want to add some kind of footnote saying that those were the plots as of 4 months ago.
Still debugging this for some playground GW events with no external triggers. S230511p is an early warning event whose superevent was created at 2023-05-11 03:29:10 UTC = 2023-05-10 20:29:10 Pacific. The EARLY_WARNING label is applied two seconds later, at 2023-05-11 03:29:12 UTC = 2023-05-10 20:29:12 Pacific. But somehow the check_vectors task is called (and fails with this error) one second before that, at 2023-05-10 20:29:11 Pacific, as seen in the gwcelery log below and on Sentry here. I'll continue looking at this, but wanted to get something written down here.
[2023-05-10 20:29:11,013: INFO/MainProcess/MainThread] Task gwcelery.tasks.detchar.omegascan[aa73196c-485d-43f5-bd89-afa952caba52] received
[2023-05-10 20:29:11,015: INFO/MainProcess/MainThread] Task gwcelery.tasks.detchar.check_vectors[5b283fc5-6169-436b-8b04-442d5e029004] received
[2023-05-10 20:29:11,017: INFO/ForkPoolWorker-61] gwcelery.tasks.detchar.omegascan[aa73196c-485d-43f5-bd89-afa952caba52]: Delaying omegascan because S230511p is in the future
[2023-05-10 20:29:11,021: INFO/MainProcess/MainThread] Task gwcelery.tasks.detchar.make_omegascan[686517bf-4d87-4811-8744-4d94f10fccb4] received
[2023-05-10 20:29:11,023: INFO/ForkPoolWorker-61/MainThread] Task gwcelery.tasks.detchar.omegascan[aa73196c-485d-43f5-bd89-afa952caba52] succeeded in 0.006974603980779648s: None
[2023-05-10 20:29:11,023: INFO/MainProcess/MainThread] Task gwcelery.tasks.detchar.make_omegascan[f44299ed-494b-4292-9fa8-3d82fa493821] received
[2023-05-10 20:29:11,026: INFO/MainProcess/MainThread] Task gwcelery.tasks.detchar.make_omegascan[7886d8d4-ac4f-4495-802e-ec22260a6298] received
[2023-05-10 20:29:11,046: INFO/MainProcess/IGWNReceiverThread] calling handlers [<@task: gwcelery.tasks.superevents.handle of gwcelery at 0x14ba9fbb2880>, <@task: gwcelery.tasks.orchestrator.handle_cbc_event of gwcelery at 0x14ba9fbb2880>] for key 'cbc_gstlal'
[2023-05-10 20:29:11,050: INFO/MainProcess/MainThread] Task gwcelery.tasks.superevents.handle[bab8f61c-eb83-49f0-a0e0-a1afaaab0f91] received
[2023-05-10 20:29:11,053: INFO/MainProcess/MainThread] Task gwcelery.tasks.orchestrator.handle_cbc_event[62521bd1-4753-488b-9693-165739b8588e] received
[2023-05-10 20:29:11,054: INFO/ForkPoolWorker-61] gwcelery.tasks.superevents.handle[bab8f61c-eb83-49f0-a0e0-a1afaaab0f91]: Ignoring since G1046895 has ['EMBRIGHT_READY', 'EARLY_WARNING'] labels. It is not complete
[2023-05-10 20:29:11,054: INFO/ForkPoolWorker-63/MainThread] Task gwcelery.tasks.orchestrator.handle_cbc_event[62521bd1-4753-488b-9693-165739b8588e] succeeded in 5.6097982451319695e-05s: None
[2023-05-10 20:29:11,056: INFO/ForkPoolWorker-61/MainThread] Task gwcelery.tasks.superevents.handle[bab8f61c-eb83-49f0-a0e0-a1afaaab0f91] succeeded in 0.0011677530128508806s: None
[2023-05-10 20:29:11,072: INFO/MainProcess/MainThread] Task gwcelery.tasks.gracedb.upload[b8767f10-c0c8-4a1c-8e29-157907c4f250] received
[2023-05-10 20:29:11,084: ERROR/ForkPoolWorker-62/MainThread] Task gwcelery.tasks.detchar.check_vectors[5b283fc5-6169-436b-8b04-442d5e029004] raised unexpected: ValueError('StateVector with span [1367810964.875 ... 1367810968.0) does not cover requested interval [1367810964.892661 ... 1367810970.892661)')
Traceback (most recent call last):
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/celery/app/trace.py", line 451, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/sentry_sdk/integrations/celery.py", line 207, in _inner
    reraise(*exc_info)
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/sentry_sdk/_compat.py", line 60, in reraise
    raise value
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/sentry_sdk/integrations/celery.py", line 202, in _inner
    return f(*args, **kwargs)
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/celery/app/trace.py", line 734, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/gwcelery/tasks/detchar.py", line 457, in check_vectors
    states.update(check_vector(caches[channel.split(':')[0]], channel,
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/gwcelery/tasks/detchar.py", line 353, in check_vector
    statevector = StateVector.read(cache, channel,
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/gwpy/timeseries/statevector.py", line 689, in read
    return super().read(source, *args, **kwargs)
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/gwpy/timeseries/core.py", line 310, in read
    return timeseries_reader(cls, source, *args, **kwargs)
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/gwpy/timeseries/io/core.py", line 50, in read
    return io_read_multi(joiner, cls, source, *args, **kwargs)
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/gwpy/io/mp.py", line 101, in read_multi
    return flatten(out)
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/gwpy/timeseries/io/core.py", line 81, in _join
    return _pad_series(
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/gwpy/timeseries/io/core.py", line 139, in _pad_series
    raise ValueError(
ValueError: StateVector with span [1367810964.875 ... 1367810968.0) does not cover requested interval [1367810964.892661 ... 1367810970.892661)