Cannot append discontiguous TimeSeries error in detchar.py
idq_probs = dict(
    check_idq(caches[channel.split(':')[0]], channel, start, end)
    for channel in app.conf['idq_channels']
)

### calling code at line 244
idq_prob = TimeSeries.read(cache, channel, start=start, end=end)
Please check that this is not a bug in gwpy connected to rounding that makes two TimeSeries appear discontiguous even when they are not. Floating-point numbers have only about 16 decimal digits of precision.
I was not able to locate the two time series that the program tries to join, so I could not check whether this is the case; if it is, it may result in a DQV/DQOK label being applied incorrectly.
@duncanmmacleod This has come up again with the O3 replay frames. I've got a test case for you though, so I'm hoping you can help. I haven't been able to set this up on CIT because I don't have access to the python3.9 header file, but I set up a virtual environment on my own machine and only installed gwpy, lscsoft-glue, and lalsuite using pip.
Traceback (most recent call last):
  File "/home/cmessick/gwcelery_personal/240104_gwpy_timeseries_read_error/runme", line 12, in <module>
    ts = TimeSeries.read(cache, 'V1:Hrec_hoft_16384Hz_INJ1_O3Replay', start=1388417648.252626, end=1388417662.252626)
  File "/home/cmessick/virtual_envs/gwpy_testing_2/lib/python3.9/site-packages/gwpy/timeseries/core.py", line 310, in read
    return timeseries_reader(cls, source, *args, **kwargs)
  File "/home/cmessick/virtual_envs/gwpy_testing_2/lib/python3.9/site-packages/gwpy/timeseries/io/core.py", line 50, in read
    return io_read_multi(joiner, cls, source, *args, **kwargs)
  File "/home/cmessick/virtual_envs/gwpy_testing_2/lib/python3.9/site-packages/gwpy/io/mp.py", line 101, in read_multi
    return flatten(out)
  File "/home/cmessick/virtual_envs/gwpy_testing_2/lib/python3.9/site-packages/gwpy/timeseries/io/core.py", line 79, in _join
    joined = list_.join(pad=pad, gap=gap)
  File "/home/cmessick/virtual_envs/gwpy_testing_2/lib/python3.9/site-packages/gwpy/timeseries/core.py", line 1657, in join
    out.append(series, gap=gap, pad=pad)
  File "/home/cmessick/virtual_envs/gwpy_testing_2/lib/python3.9/site-packages/gwpy/types/series.py", line 813, in append
    raise ValueError(
ValueError: Cannot append discontiguous TimeSeries
    TimeSeries 1 span: [1388417648.2526243 ... 1388417648.9999387)
    TimeSeries 2 span: [1388417649.0 ... 1388417650.0)
You can find the frame files needed for this test on CIT at /home/cody.messick/playground/240104_gwpy_timeseries_read_error/frames, and the script I'm running is one directory above that, though I've pasted it here (with the glob pattern modified) too.
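A minimal sketch of such a reproducer, assuming the frames are globbed into a list of paths (the directory and pattern here are placeholders; the channel and GPS times are taken from the traceback above):

import glob

from gwpy.timeseries import TimeSeries

# Collect the one-second low-latency frame files (placeholder path/pattern).
cache = sorted(glob.glob('frames/V-V1_llhoft-*.gwf'))

# Same channel and sub-second GPS times as in the traceback above.
ts = TimeSeries.read(
    cache,
    'V1:Hrec_hoft_16384Hz_INJ1_O3Replay',
    start=1388417648.252626,
    end=1388417662.252626,
)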
I searched for "Cannot append discontiguous TimeSeries" on Sentry and didn't get any hits, and I grepped the logs on playground (which go back to Jan 18) and also didn't get any hits, so I think we can close this.
The issue seems to be a small discontinuity in the 1 s-long segments. I'm wondering if there is an easy way to ignore the small segment, maybe with zero padding? I'll reopen this issue.
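A minimal sketch of the zero-padding idea, assuming the gap/pad keywords are accepted by TimeSeries.read (the traceback shows them being forwarded to the join step); whether padding is the right fix is exactly what is debated below:

import glob

from gwpy.timeseries import TimeSeries

cache = sorted(glob.glob('frames/V-V1_llhoft-*.gwf'))  # placeholder pattern

# Fill any gap with zeros instead of raising
# "Cannot append discontiguous TimeSeries".
ts = TimeSeries.read(
    cache,
    'V1:Hrec_hoft_16384Hz_INJ1_O3Replay',
    start=1388417648.252626,
    end=1388417662.252626,
    gap='pad',
    pad=0.0,
)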
This is no longer about the idq probabilities from before; this is now straight-up strain data. That makes me extremely nervous. Why are there ~6k discontinuities in the data near uploads over a 30-day period?
Flower shows the task from the Sentry report having completed, which makes me wonder if this is an issue with the data not being where it's expected to be when the omega scan task first runs.
But why is the ValueError happening in the first place? Is it trying to access data that isn't available yet? If so, I don't think zero padding is the correct solution; waiting a few more seconds is.
What I'm asking is: where do those segments come from? We don't specify them as inputs, so something (I assume gwpy) is returning those discontiguous segments.
And the fact that the problem goes away with a retry suggests the segments aren't actually discontiguous and that it's just an error in computing the segments, which I would assume are computed using the data. Which is what makes me think a frame is being dropped somewhere. The other error you linked seems to occur when there's no data at all, which is different from just missing a single frame or frame file.
What I'm asking is: where do those segments come from?
This is the cache finder. It's globbing the frame files and creating a LAL cache. @geoffrey.mo can say more about the other corner cases, but that's what it's doing in essence.
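A minimal sketch of what that cache finder does, assuming glue.lal is used to build the cache (the directory and pattern are placeholders):

import glob

from glue.lal import Cache

# Glob the low-latency frame files and wrap them in a LAL cache object.
frame_files = sorted(glob.glob('/dev/shm/kafka/V1_O3ReplayMDC/*.gwf'))
cache = Cache.from_urls(frame_files)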
And the fact that the problem goes away with a retry
I was mistaken; I missed that we mark the task as complete as part of our error catching.
I'm warming up to the idea of checking for 9s like you suggest, but I think we should try to understand exactly what is causing the cache to give us an end time of 0.999939 instead of 0.999999. All of the low-latency frame files should be one second long, and they should all cover 0 to 0.999999 (disclaimer: I don't know how many 9s there should be).
I'm wondering if we could use numpy.isclose on this line to fix errors like this, but I think we need more information. Specifically, is this some weird round-off error being introduced by the Cache object in glue.lal, and if not, what is it? If it is some weird rounding issue in glue, then we can implement a workaround with a FIXME and open an issue with them.
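A minimal sketch of the numpy.isclose idea, treating two spans as the same boundary when they agree to within a fraction of a sample (the numbers below are illustrative, not taken from the traceback, and this is not the gwcelery code):

import numpy as np

dt = 1.0 / 16384.0                    # sample spacing for a 16384 Hz channel
end_of_first = 1388417649.0 - 2e-7    # hypothetical round-off-sized mismatch
start_of_second = 1388417649.0

# Treat the spans as contiguous if they agree to within half a sample.
print(np.isclose(end_of_first, start_of_second, rtol=0.0, atol=0.5 * dt))  # True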
I'd feel a lot better about all of this if we could snag a snapshot of what files are available at the time one of these errors happens, and maybe even make a copy of as many of them as possible to poke at. I just want to make sure this isn't indicative of some deeper problem; it'd really suck to find out later that there is some underlying problem and that the only reason we didn't catch it is that we put a band-aid on before understanding this issue.
The number of nines is not consistent; I have seen at least two nines, though. Another point to note is that I did not find an instance of this particular error in production in the last 90 days. The cases I looked at all came from the O3 injection replay channel (and hence only playground).
Do you have an example of this error being raised where the discontinuity is not in the first frame in the cache? Every example I've seen so far seems to suggest that the discontinuity corresponds to a fraction of a second of data requested from the first frame in the cache. If this were a data-distribution issue (due to dropouts, etc.), I'd expect it to be more random and to pop up anywhere in the time being requested. My point is, I don't think this is coming from the underlying data itself.
I've seen it not be the first frame in the cache, but it is the first frame file that contains the time you're requesting (the cache passed to the read function contains more than just the requested times of data). Just in case that changes your interpretation, Patrick.
So I was playing with this earlier and found that I could reproduce the error on playground, but if I used the exact same versions of gwpy and glue on a different headnode, I couldn't reproduce it. Assuming I didn't make a mistake, I think that, plus what you found, points to some dependency causing this. If that's the case, we should be able to compare the lock files between production and playground and see what else is different.
It seems like this issue is isolated to the playground host. @geoffrey.mo, is it easy enough for you to run this quickly on emfollow-test? One difference between the production and other hosts is that they don't have the igwn conda distribution.
glue depends on ligo-segments. Given the nature of the error with intervals, it might be worthwhile to check the versions of ligo-segments in the igwn conda envs vs. our poetry lock file.
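One quick way to compare, sketched with importlib.metadata (the distribution names below are an assumption on my part; run the same snippet in the igwn conda env and in the poetry env and diff the output):

from importlib.metadata import version

# Print the installed versions of the interval-handling stack in this environment.
for dist in ('ligo-segments', 'lscsoft-glue', 'gwpy'):
    print(dist, version(dist))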
Duncan, I know there's a lot to read here. This comment has a case where you can reproduce the problem, and Roberto's comment just above this (link) describes an interesting behavior we found that we think is related.
File "/home/emfollow-dev/.local/lib/python3.9/site-packages/gwpy/timeseries/io/gwf/framecpp.py", line 28, in <module> from LDAStools import frameCPP # noqa: F401
Indeed, we do have to install LDAStools and possibly NDS2.
@roberto.depietri, the GWpy issue is known to be limited to using LALFrame format="gwf.lalframe" as the GWF interface. If you are able to install python-ldas-tools-framecpp and use that, that should be sufficient.
Otherwise, GWpy 3.0.8 should be issued this week and will include a workaround for this.
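For reference, a sketch of selecting the frameCPP backend explicitly, assuming python-ldas-tools-framecpp is installed and that "gwf.framecpp" is the parallel of the "gwf.lalframe" format named above (the glob pattern is a placeholder; channel and times are from the earlier traceback):

import glob

from gwpy.timeseries import TimeSeries

cache = sorted(glob.glob('frames/V-V1_llhoft-*.gwf'))  # placeholder pattern

# Use the frameCPP GWF reader instead of the LALFrame one.
ts = TimeSeries.read(
    cache,
    'V1:Hrec_hoft_16384Hz_INJ1_O3Replay',
    start=1388417648.252626,
    end=1388417662.252626,
    format='gwf.framecpp',
)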
@duncanmmacleod Thank you! Let me know as soon as GWpy 3.0.8 is available to test it. That would be the simplest solution, and I think we should follow this path and require GWpy >= 3.0.8 in our dependencies.
It seems like the omega scans for this event (and many others) are failing due to the same(?) gwpy.types.series "Cannot append discontiguous TimeSeries" error. Here's a short traceback for the above event:
The problem is fundamentally (I think) that you are passing in a microsecond-precision GPS time, which Python can't represent exactly as a float, so when it comes to working out the start and end times of arrays, precision is lost and mistakes are made.
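A minimal illustration of that precision limit, using a GPS time from the earlier example (numpy.spacing shows the gap between adjacent representable floats at this magnitude):

import numpy as np

gps = 1388417648.252626   # microsecond-precision GPS time from the earlier traceback

# Near 1.4e9, adjacent doubles are ~2.4e-7 apart, so the microsecond digits
# are only stored approximately.
print(np.spacing(gps))    # ~2.4e-07
print(f"{gps:.10f}")      # the stored value is not exactly the literal above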
Do you get the same problem if you attempt the exact same scan using a GPS time rounded to 1/2/3 decimal places?
I thought I had done some work to resolve this in GWpy, but am overloaded with many other things and GWpy is suffering as a result, so I might have imagined that entirely or done the work and never released it, or something else.
Thanks for the suggestion, @duncanmmacleod! Here are some findings:
I'm now using 1403777831.873724, 1403777845.873724 as the start and end times since this also broke for this GW event.
Our usual way of accessing frames is through /dev/shm/kafka, but since the buffer has passed, I'm using find_urls from gwdatafind instead.
This discontiguous TimeSeries problem does not happen with H1_HOFT_C00, but does with H1_llhoft (see snippet 1 below). Maybe @pb can say whether this is expected, or whether there are differences between the frame types that might cause this?
@duncanmmacleod was right about a rounded GPS time working (see snippet 2 below). I think we can use this as a solution.
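A sketch of that test, assuming gwdatafind.find_urls for the frame lookup and a placeholder strain channel (the frame types and GPS times come from the findings above; the rounding of the start/end times is the part that mattered):

from gwdatafind import find_urls
from gwpy.timeseries import TimeSeries

start, end = 1403777831.873724, 1403777845.873724

# Locate archived frame files for the two frame types being compared
# (requires a datafind server configured in the environment).
c00_urls = find_urls('H', 'H1_HOFT_C00', int(start), int(end) + 1)
llhoft_urls = find_urls('H', 'H1_llhoft', int(start), int(end) + 1)

# Rounding the GPS times (here to one decimal place) avoided the
# discontiguous-TimeSeries error; the channel name is a placeholder.
ts = TimeSeries.read(
    llhoft_urls,
    'H1:GDS-CALIB_STRAIN',
    start=round(start, 1),
    end=round(end, 1),
)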