State vector check is running into the latency of /dev/shm frames
The detchar task checks state vectors around superevent times to make sure that data quality is OK. As part of this, it builds a Cache object of llhoft data and reads the state vectors from it. Sometimes we run into a gwpy ValueError (example from E406327 on production):
ValueError: StateVector with span [1367156859.6875 ... 1367156863.0) does not cover requested interval [1367156859.71 ... 1367156863.774)
The event was submitted to GraceDB at 1367156867.000, 4 seconds after the end of the requested data, so the above data should already be on disk. Looking through the Sentry traceback linked above, gwcelery is able to create a cache with 10 1-second frames from /dev/shm/kafka, which is fed to gwpy.timeseries.core.timeseries_reader() as the source kwarg. timeseries_reader() then passes this cache through to gwpy.timeseries.io.cache.preformat_cache(), where the sieve results in a cache that has only 4 frames, which are then passed downstream:
Since the last frame ends before the end of the requested interval (1367156863.774), we get this ValueError.
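To make the failure mode concrete, here is a minimal sketch of the interval-coverage check that trips here, with the spans hard-coded from the E406327 error above (this is an illustration only, not gwpy's actual _pad_series code):

# Illustration only: coverage check with the E406327 spans hard-coded.
span_start, span_end = 1367156859.6875, 1367156863.0   # data actually read from the cache
req_start, req_end = 1367156859.71, 1367156863.774     # interval requested by check_vectors

if span_start > req_start or span_end < req_end:
    raise ValueError(
        f"StateVector with span [{span_start} ... {span_end}) does not cover "
        f"requested interval [{req_start} ... {req_end})"
    )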
We're seeing this issue a few times a day in both the production and playground environments, tracked on Sentry. I'm not sure why it's only happening sometimes.
Perhaps @duncanmmacleod can shed some light on the gwpy workings that are going on here?
There's a bit of test code in the comment below which reproduces the error, but I think this isn't a gwpy problem; it's just a consequence of the latency of the frames. Sorry for the false alarm!
Some of these occurrences are caused by the detchar check skipping its wait if the requested StateVector's start time is before now; however, the end can still be after now, which means we end up requesting data that doesn't exist yet. !1193 (merged) should fix that.
However, I don't think this fixes everything. For example, for E406368, this is what the gwcelery log reveals:
[2023-05-03 19:54:03,620: INFO/ForkPoolWorker-61/MainThread] Task gwcelery.tasks.gracedb.create_event[4dd4dd4d-07ab-438d-b9ec-38f6a314b4ff] succeeded in 2.3868746110238135s: {'warnings': [], 'submitter': 'emfollow', 'created': '2023-05-04 02:54:01 UTC', 'group': 'External', 'graceid': 'E406368', 'pipeline': 'Fermi', 'gpstime': 1367204052.99, 'reporting_latency': 6.294359, 'instruments': '', 'nevents': None, 'offline': False, 'search': 'GRB', 'far': None, 'far_is_upper_limit': False, 'likelihood': None, 'labels': ['NOT_GRB'], 'extra_attributes': {'GRB': {'ivorn': 'ivo://nasa.gsfc.gcn/Fermi#GBM_Alert_2023-05-04T02:53:54.99_704861639_1-716', 'author_ivorn': 'ivo://nasa.gsfc.tan/gcn', 'author_shortname': 'Fermi (via VO-GCN)', 'observatory_location_id': 'GEOLUN', 'coord_system': 'UTC-FK5-GEO', 'ra': 0.0, 'dec': 0.0, 'error_radius': 0.0, 'how_description': 'Fermi Satellite, GBM Instrument', 'how_reference_url': 'http://gcn.gsfc.nasa.gov/fermi.html', 'T90': None, 'trigger_duration': 0.256, 'designation': None, 'redshift': None, 'trigger_id': '704861639'}}, 'superevent': None, 'superevent_neighbours': {}, 'links': {'neighbors': 'https://gracedb.ligo.org/api/events/E406368/neighbors/', 'lo...', ...}}
[2023-05-03 19:54:03,625: INFO/MainProcess/MainThread] Task gwcelery.tasks.detchar.check_vectors[6f4fa02c-92b9-462d-ab5d-c2d81507d032] received
[2023-05-03 19:54:03,692: ERROR/ForkPoolWorker-61/MainThread] Task gwcelery.tasks.detchar.check_vectors[6f4fa02c-92b9-462d-ab5d-c2d81507d032] raised unexpected: ValueError('StateVector with span [1367204050.9375 ... 1367204055.0) does not cover requested interval [1367204050.99 ... 1367204055.246)')
Traceback (most recent call last):
  File "/home/emfollow/.local/lib/python3.9/site-packages/celery/app/trace.py", line 451, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/emfollow/.local/lib/python3.9/site-packages/sentry_sdk/integrations/celery.py", line 207, in _inner
    reraise(*exc_info)
  File "/home/emfollow/.local/lib/python3.9/site-packages/sentry_sdk/_compat.py", line 60, in reraise
    raise value
  File "/home/emfollow/.local/lib/python3.9/site-packages/sentry_sdk/integrations/celery.py", line 202, in _inner
    return f(*args, **kwargs)
  File "/home/emfollow/.local/lib/python3.9/site-packages/celery/app/trace.py", line 734, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/emfollow/.local/lib/python3.9/site-packages/gwcelery/tasks/detchar.py", line 454, in check_vectors
    states.update(check_vector(caches[channel.split(':')[0]], channel,
  File "/home/emfollow/.local/lib/python3.9/site-packages/gwcelery/tasks/detchar.py", line 350, in check_vector
    statevector = StateVector.read(cache, channel,
  File "/home/emfollow/.local/lib/python3.9/site-packages/gwpy/timeseries/statevector.py", line 689, in read
    return super().read(source, *args, **kwargs)
  File "/home/emfollow/.local/lib/python3.9/site-packages/gwpy/timeseries/core.py", line 310, in read
    return timeseries_reader(cls, source, *args, **kwargs)
  File "/home/emfollow/.local/lib/python3.9/site-packages/gwpy/timeseries/io/core.py", line 50, in read
    return io_read_multi(joiner, cls, source, *args, **kwargs)
  File "/home/emfollow/.local/lib/python3.9/site-packages/gwpy/io/mp.py", line 101, in read_multi
    return flatten(out)
  File "/home/emfollow/.local/lib/python3.9/site-packages/gwpy/timeseries/io/core.py", line 81, in _join
    return _pad_series(
  File "/home/emfollow/.local/lib/python3.9/site-packages/gwpy/timeseries/io/core.py", line 139, in _pad_series
    raise ValueError(
ValueError: StateVector with span [1367204050.9375 ... 1367204055.0) does not cover requested interval [1367204050.99 ... 1367204055.246)
The important part here is the time at which the check_vectors task was received: 2023-05-03 19:54:03,625 Pacific, i.e. 2023-05-04 02:54:03,625 UTC, or GPS 1367204061.
The requested interval is [1367204050.99 ... 1367204055.246), and the StateVector span is [1367204050.9375 ... 1367204055.0). All of this is ~6 seconds before the check_vectors task is called.
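For reference, the UTC-to-GPS conversion above can be checked with gwpy (a quick sketch; the fractional seconds are dropped here):

from gwpy.time import tconvert

# 2023-05-04 02:54:03 UTC should print as GPS 1367204061
# (consistent with gpstime 1367204052.99 for the 02:53:54.99 Fermi trigger).
print(tconvert('2023-05-04 02:54:03'))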
I think what's happening instead is that we're running into the inherent latency of the low-latency frames. Running this bit of test code on the emfollow machine with end_diff = 5 results in the above ValueError every time. The same code with end_diff = 6 runs without any errors.
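The actual test snippet isn't visible in this extract, but it was roughly of the following form (a sketch only; the channel name, frame directory, and request duration are assumptions, not the original code):

from glob import glob

from gwpy.time import tconvert
from gwpy.timeseries import StateVector

end_diff = 5                            # seconds between "now" and the requested end time
channel = 'H1:GDS-CALIB_STATE_VECTOR'   # assumed llhoft state-vector channel
frame_dir = '/dev/shm/kafka/H1_llhoft'  # assumed low-latency frame directory

end = float(tconvert('now')) - end_diff
start = end - 4                         # request a few seconds of data, as check_vectors does

# 1-second frames currently on disk; the newest ones may not have arrived yet
cache = sorted(glob(f'{frame_dir}/*.gwf'))

# With end_diff = 5 this usually raises the "does not cover requested interval"
# ValueError; with end_diff = 6 it succeeds.
sv = StateVector.read(cache, channel, start=start, end=end)
print(sv.span)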
@stuart.anderson @patrick.brockill Is ~6 seconds the expected latency for the population of /dev/shm/kafka/*/*.gwf?
If so, then our check_vectors task should make sure that at least 6 seconds have passed since the end of the requested StateVector window before running. If they haven't, it could either wait (which increases the latency of the whole system) or handle the error and skip the check (which risks allowing bad data, injections, or overflows to end up in public alerts). We already run the latter risk with EarlyWarning events, which don't get a check_vectors check, but skipping it for all events that come in at low latency seems risky as well.
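As an illustration of the first option (waiting), here is a minimal sketch of such a guard; the 6-second constant and the helper name are assumptions for illustration, not the actual gwcelery change:

from gwpy.time import tconvert

LLHOFT_LATENCY = 6  # assumed worst-case /dev/shm/kafka latency, in seconds

def seconds_until_data_expected(request_end_gps):
    """Return how long to wait before frames covering ``request_end_gps``
    should have landed in /dev/shm/kafka (0 if they should already be there)."""
    now = float(tconvert('now'))
    return max(0.0, request_end_gps + LLHOFT_LATENCY - now)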
Geoffrey Mo changed title from Caches for state vector checking are sometimes being over-sieved when being read by gwpy to State vector check is running into the latency of /dev/shm frames
I did this testing on the emfollow machine, but I'm not sure whether this (~6 s to data availability) is a larger latency than expected, or whether it's normal and just something we hadn't run into before.
@geoffrey.mo while we wait for the definitive answer from @pb, please also check emfollow-{dev,test,playground}, which I suspect have the same performance, and then one other random bare-metal login system at CIT, e.g., ldas-pcdev3.
I get the same result on emfollow-{dev,test,playground} and ldas-pcdev3 as on emfollow: asking for data 5 seconds before now gets me the StateVector error about 70% of the time (13-14 out of 20 queries), and asking for data 6 seconds before gives me no errors. (FWIW, the test snippet runs much faster on pcdev3 than on the emfollow nodes, but that could just be a function of the load on the nodes.)
Bump on this @stuart.anderson (it seems like Patrick Brockill isn't in the emfollow group and so can't be tagged). I'm guessing the 6 seconds or so from acquisition to data availability at CIT is the expected latency? Maybe @patrick.godwin or @jameson.rollins might know?
It depends on which stream we're talking about: 6 seconds may be true for the replay but not necessarily for the live stream. What should be the case, however, is that by the time a full-bandwidth trigger comes in, the data should be available up to t_merger, since the search has already processed that data.
There are two (known) issues with the latency of the O3 Replay MDC streams:
We have about 1830 machines pulling this data, which causes the latency to become bursty. More on this below.
We have seen a bit of extra burstiness due to infrequent spikes in latency from the server where that data is produced, lowlatency@cit; these spikes should not be there (the cause could be hardware, software, etc.). I hope this will get better when we move to a new server, lowlatency1@cit.
Here is a graph of the latency of the O3 Replay MDC streams on emfollow over the last 7 hours:
Note that it's log-log; otherwise it would be difficult to notice the periodic outliers, particularly the worst ones at just under 12 seconds.
Generally speaking, even if the new server performs better, there are always going to be some outliers with this number of clients - probably not 12 seconds, but a few seconds nonetheless.
The live data will have burstiness for various reasons, but it should not have the extra burstiness coming from the lowlatency@cit server, as the data is being pulled directly from the observatories to the brokers which distribute that data to the cluster, bypassing that server.
Is this also true for live data? I've been querying the live data here and am mostly concerned about that for O4.
While this problem isn't happening for full-bandwidth GW searches, we are seeing it with external events. If a trigger from, e.g., Fermi comes in, we run the detchar check on it so that we have an assessment of the GW data quality/availability around that time. However, if it lands in GraceDB and triggers this task less than ~6 seconds after the trigger time, we won't be able to process it. One way to fix this is to retry the task after t_event + tau_latency, where tau_latency is the constant (6 seconds?) we might expect for the latency of the live O4 frames.
For live data, the latencies will be shifted due to latencies in data generation, calibration, etc. For the replay, there's only the latencies due to the data distribution itself.
I think retries are essential here, as there is no guarantee that 100% of the data will arrive within T seconds. Instead of retrying once, what about retrying N times, with the retry times set by t_event + i * tau_latency?
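A minimal sketch of what such a retry schedule could look like using Celery's built-in retry support (the task wrapper, run_state_vector_check, N, and TAU_LATENCY below are placeholders for illustration, not the actual gwcelery implementation):

from celery import shared_task

N = 3            # placeholder number of retries
TAU_LATENCY = 6  # placeholder retry spacing in seconds

@shared_task(bind=True, max_retries=N)
def check_vectors_with_retry(self, *args, **kwargs):
    try:
        # hypothetical wrapper around the real state-vector check
        return run_state_vector_check(*args, **kwargs)
    except ValueError as exc:
        # the i-th retry fires roughly at t_event + i * tau_latency
        raise self.retry(exc=exc, countdown=TAU_LATENCY)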
Just to finish this discussion up, here are the same graphs for the live data:
LHO:
LLO:
Virgo:
The second "hump" in the LHO and LLO plots may be due to a number of things. In any event, as you can see, it's rare for the LHO or LLO latency to go over about 9.5 seconds.
The Virgo data has a higher latency for different reasons, one being that they send four one-second frame files every four seconds, which is inherently bursty. I'm not sure what happened with the outliers at about 100 seconds, but it's likely that some kind of work was being performed at that time.
@pb I'm giving a talk in a week. Is it OK if I use the data distribution time plots you made above? I also wanted to check whether they're roughly representative of the current status.
@deep.chatterjee Use the plots: Sure, no problem. Roughly representative: as far as I'm aware, nothing major has changed, but you might want to add some kind of footnote saying that those were the plots as of 4 months ago.
Still debugging this for some playground GW events with no external triggers. S230511p is an early warning event whose superevent was created at 2023-05-11 03:29:10 UTC = 2023-05-10 20:29:10 Pacific. The EARLY_WARNING label is applied two seconds later, at 2023-05-11 03:29:12 UTC = 2023-05-10 20:29:12 Pacific. But somehow the check_vectors task is called (and fails with this error) one second before that, at 2023-05-10 20:29:11 Pacific, as seen in the gwcelery log below and on Sentry here. I'll continue looking at this, but wanted to get something written down here.
[2023-05-10 20:29:11,013: INFO/MainProcess/MainThread] Task gwcelery.tasks.detchar.omegascan[aa73196c-485d-43f5-bd89-afa952caba52] received
[2023-05-10 20:29:11,015: INFO/MainProcess/MainThread] Task gwcelery.tasks.detchar.check_vectors[5b283fc5-6169-436b-8b04-442d5e029004] received
[2023-05-10 20:29:11,017: INFO/ForkPoolWorker-61] gwcelery.tasks.detchar.omegascan[aa73196c-485d-43f5-bd89-afa952caba52]: Delaying omegascan because S230511p is in the future
[2023-05-10 20:29:11,021: INFO/MainProcess/MainThread] Task gwcelery.tasks.detchar.make_omegascan[686517bf-4d87-4811-8744-4d94f10fccb4] received
[2023-05-10 20:29:11,023: INFO/ForkPoolWorker-61/MainThread] Task gwcelery.tasks.detchar.omegascan[aa73196c-485d-43f5-bd89-afa952caba52] succeeded in 0.006974603980779648s: None
[2023-05-10 20:29:11,023: INFO/MainProcess/MainThread] Task gwcelery.tasks.detchar.make_omegascan[f44299ed-494b-4292-9fa8-3d82fa493821] received
[2023-05-10 20:29:11,026: INFO/MainProcess/MainThread] Task gwcelery.tasks.detchar.make_omegascan[7886d8d4-ac4f-4495-802e-ec22260a6298] received
[2023-05-10 20:29:11,046: INFO/MainProcess/IGWNReceiverThread] calling handlers [<@task: gwcelery.tasks.superevents.handle of gwcelery at 0x14ba9fbb2880>, <@task: gwcelery.tasks.orchestrator.handle_cbc_event of gwcelery at 0x14ba9fbb2880>] for key 'cbc_gstlal'
[2023-05-10 20:29:11,050: INFO/MainProcess/MainThread] Task gwcelery.tasks.superevents.handle[bab8f61c-eb83-49f0-a0e0-a1afaaab0f91] received
[2023-05-10 20:29:11,053: INFO/MainProcess/MainThread] Task gwcelery.tasks.orchestrator.handle_cbc_event[62521bd1-4753-488b-9693-165739b8588e] received
[2023-05-10 20:29:11,054: INFO/ForkPoolWorker-61] gwcelery.tasks.superevents.handle[bab8f61c-eb83-49f0-a0e0-a1afaaab0f91]: Ignoring since G1046895 has ['EMBRIGHT_READY', 'EARLY_WARNING'] labels. It is not complete
[2023-05-10 20:29:11,054: INFO/ForkPoolWorker-63/MainThread] Task gwcelery.tasks.orchestrator.handle_cbc_event[62521bd1-4753-488b-9693-165739b8588e] succeeded in 5.6097982451319695e-05s: None
[2023-05-10 20:29:11,056: INFO/ForkPoolWorker-61/MainThread] Task gwcelery.tasks.superevents.handle[bab8f61c-eb83-49f0-a0e0-a1afaaab0f91] succeeded in 0.0011677530128508806s: None
[2023-05-10 20:29:11,072: INFO/MainProcess/MainThread] Task gwcelery.tasks.gracedb.upload[b8767f10-c0c8-4a1c-8e29-157907c4f250] received
[2023-05-10 20:29:11,084: ERROR/ForkPoolWorker-62/MainThread] Task gwcelery.tasks.detchar.check_vectors[5b283fc5-6169-436b-8b04-442d5e029004] raised unexpected: ValueError('StateVector with span [1367810964.875 ... 1367810968.0) does not cover requested interval [1367810964.892661 ... 1367810970.892661)')
Traceback (most recent call last):
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/celery/app/trace.py", line 451, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/sentry_sdk/integrations/celery.py", line 207, in _inner
    reraise(*exc_info)
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/sentry_sdk/_compat.py", line 60, in reraise
    raise value
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/sentry_sdk/integrations/celery.py", line 202, in _inner
    return f(*args, **kwargs)
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/celery/app/trace.py", line 734, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/gwcelery/tasks/detchar.py", line 457, in check_vectors
    states.update(check_vector(caches[channel.split(':')[0]], channel,
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/gwcelery/tasks/detchar.py", line 353, in check_vector
    statevector = StateVector.read(cache, channel,
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/gwpy/timeseries/statevector.py", line 689, in read
    return super().read(source, *args, **kwargs)
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/gwpy/timeseries/core.py", line 310, in read
    return timeseries_reader(cls, source, *args, **kwargs)
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/gwpy/timeseries/io/core.py", line 50, in read
    return io_read_multi(joiner, cls, source, *args, **kwargs)
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/gwpy/io/mp.py", line 101, in read_multi
    return flatten(out)
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/gwpy/timeseries/io/core.py", line 81, in _join
    return _pad_series(
  File "/home/emfollow-playground/.local/lib/python3.9/site-packages/gwpy/timeseries/io/core.py", line 139, in _pad_series
    raise ValueError(
ValueError: StateVector with span [1367810964.875 ... 1367810968.0) does not cover requested interval [1367810964.892661 ... 1367810970.892661)