Cannot append discontiguous TimeSeries error in detchar.py
idq_probs = dict(
    check_idq(caches[channel.split(':')[0]], channel, start, end)
    for channel in app.conf['idq_channels']
)

### calling code at line 244
idq_prob = TimeSeries.read(cache, channel, start=start, end=end)
Please check that this is not a bug in gwpy connected to rounding that makes two TimeSeries appear discontiguous even when they are not. Floating-point numbers have only about 16 decimal digits of precision.
I was not able to locate the two time series that the program tries to join, so I could not check whether this is the case; if it is, it may result in a DQV/DQOK label being applied incorrectly.
@duncanmmacleod This has come up again with the O3 replay frames. I've got a test case for you though, so I'm hoping you can help. I haven't been able to set this up on CIT because I don't have access to the python3.9 header file, but I set up a virtual environment on my own machine and only installed gwpy, lscsoft-glue, and lalsuite using pip.
Traceback (most recent call last):
  File "/home/cmessick/gwcelery_personal/240104_gwpy_timeseries_read_error/runme", line 12, in <module>
    ts = TimeSeries.read(cache, 'V1:Hrec_hoft_16384Hz_INJ1_O3Replay', start=1388417648.252626, end=1388417662.252626)
  File "/home/cmessick/virtual_envs/gwpy_testing_2/lib/python3.9/site-packages/gwpy/timeseries/core.py", line 310, in read
    return timeseries_reader(cls, source, *args, **kwargs)
  File "/home/cmessick/virtual_envs/gwpy_testing_2/lib/python3.9/site-packages/gwpy/timeseries/io/core.py", line 50, in read
    return io_read_multi(joiner, cls, source, *args, **kwargs)
  File "/home/cmessick/virtual_envs/gwpy_testing_2/lib/python3.9/site-packages/gwpy/io/mp.py", line 101, in read_multi
    return flatten(out)
  File "/home/cmessick/virtual_envs/gwpy_testing_2/lib/python3.9/site-packages/gwpy/timeseries/io/core.py", line 79, in _join
    joined = list_.join(pad=pad, gap=gap)
  File "/home/cmessick/virtual_envs/gwpy_testing_2/lib/python3.9/site-packages/gwpy/timeseries/core.py", line 1657, in join
    out.append(series, gap=gap, pad=pad)
  File "/home/cmessick/virtual_envs/gwpy_testing_2/lib/python3.9/site-packages/gwpy/types/series.py", line 813, in append
    raise ValueError(
ValueError: Cannot append discontiguous TimeSeries
    TimeSeries 1 span: [1388417648.2526243 ... 1388417648.9999387)
    TimeSeries 2 span: [1388417649.0 ... 1388417650.0)
You can find the frame files needed for this test on CIT at /home/cody.messick/playground/240104_gwpy_timeseries_read_error/frames, and the script I'm running is one directory above that, though I've pasted it here (with the glob pattern modified) too.
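A minimal sketch of such a reproducer, assuming the frames are globbed into a list of paths (the directory and pattern here are placeholders; the channel and GPS times are taken from the traceback above):

import glob

from gwpy.timeseries import TimeSeries

# Collect the one-second low-latency frame files (placeholder path/pattern).
cache = sorted(glob.glob('frames/V-V1_llhoft-*.gwf'))

# Same channel and sub-second GPS times as in the traceback above.
ts = TimeSeries.read(
    cache,
    'V1:Hrec_hoft_16384Hz_INJ1_O3Replay',
    start=1388417648.252626,
    end=1388417662.252626,
)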
I searched for "Cannot append discontiguous TimeSeries" on Sentry and didn't get any hits, and I grepped the logs on playground (which go back to Jan 18) and also didn't get any hits, so I think we can close this.
The issue seems to be a small discontinuity in the 1 s-long segments. I'm wondering if there is an easy way to ignore the small segment, maybe with zero padding? I'll reopen this issue.
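A minimal sketch of the zero-padding idea, assuming the gap/pad keywords are accepted by TimeSeries.read (the traceback shows them being forwarded to the join step); whether padding is the right fix is exactly what is debated below:

import glob

from gwpy.timeseries import TimeSeries

cache = sorted(glob.glob('frames/V-V1_llhoft-*.gwf'))  # placeholder pattern

# Fill any gap with zeros instead of raising
# "Cannot append discontiguous TimeSeries".
ts = TimeSeries.read(
    cache,
    'V1:Hrec_hoft_16384Hz_INJ1_O3Replay',
    start=1388417648.252626,
    end=1388417662.252626,
    gap='pad',
    pad=0.0,
)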
This is no longer about the idq probabilities from before; this is now straight-up strain data. That makes me extremely nervous. Why are there ~6k discontinuities in the data near uploads over a 30-day period?
Flower shows the task from the Sentry report having completed, which makes me wonder if this is an issue with the data not being where it's expected to be when the omega scan task first runs.
But why is the ValueError happening in the first place? Is it trying to access data that isn't available yet? If so, I don't think zero padding is the correct solution; waiting a few more seconds is.
What I'm asking is: where do those segments come from? We don't specify them as inputs, so something (I assume gwpy) is returning those discontiguous segments.
And the fact that the problem goes away with a retry suggests the segments aren't actually discontiguous and that it's just an error in computing the segments, which I would assume are computed using the data. Which is what makes me think a frame is being dropped somewhere. The other error you linked seems to occur when there's no data at all, which is different from just missing a single frame or frame file.
What I'm asking is: where do those segments come from?
This is the cache finder. It's globbing the frame files and creating a LAL cache. @geoffrey.mo can say more about the other corner cases, but that's what it's doing in essence.
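A minimal sketch of what that cache finder does, assuming glue.lal is used to build the cache (the directory and pattern are placeholders):

import glob

from glue.lal import Cache

# Glob the low-latency frame files and wrap them in a LAL cache object.
frame_files = sorted(glob.glob('/dev/shm/kafka/V1_O3ReplayMDC/*.gwf'))
cache = Cache.from_urls(frame_files)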
And the fact that the problem goes away with a retry
I was mistaken; I missed that we mark the task as complete as part of our error catching.
I'm warming up to the idea of checking for 9s like you suggest, but I think we should try to understand exactly what is causing the cache to give us an end time of 0.999939 instead of 0.999999. All of the low-latency frame files should be one second long, and they should all cover 0 to 0.999999 (disclaimer: I don't know how many 9s there should be).
I'm wondering if we could use numpy.isclose on this line to fix errors like this, but I think we need more information. Specifically, is this some weird round-off error being introduced by the Cache object in glue.lal, and if not, what is it? If it is some weird rounding issue in glue, then we can implement a workaround with a FIXME and open an issue with them.
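A minimal sketch of the numpy.isclose idea, treating two spans as the same boundary when they agree to within a fraction of a sample (the numbers below are illustrative, not taken from the traceback, and this is not the gwcelery code):

import numpy as np

dt = 1.0 / 16384.0                    # sample spacing for a 16384 Hz channel
end_of_first = 1388417649.0 - 2e-7    # hypothetical round-off-sized mismatch
start_of_second = 1388417649.0

# Treat the spans as contiguous if they agree to within half a sample.
print(np.isclose(end_of_first, start_of_second, rtol=0.0, atol=0.5 * dt))  # True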
I'd feel a lot better about all of this if we could snag a snapshot of what files are available at the time one of these errors happens, and maybe even make a copy of as many of them as possible to poke at. I just want to make sure this isn't indicative of some deeper problem; it'd really suck to find out later that there is some underlying problem and that the only reason we didn't catch it is that we put a band-aid on before understanding this issue.
The number of nines is not consistent; I have seen at least two nines, though. Another point to note is that I did not find an instance of this particular error in production in the last 90 days. The cases I looked at all came from the O3 injection replay channel (and hence only playground).
Do you have an example of this error being raised where the discontinuity is not in the first frame in the cache? Every example I've seen so far seems to suggest that the discontinuity corresponds to a fraction of a second of data requested from the first frame in the cache. If this were a data-distribution issue (due to dropouts, etc.), I'd expect it to be more random and to pop up anywhere in the time being requested. My point is, I don't think this is coming from the underlying data itself.
I've seen it not be the first frame in the cache, but it is the first frame file that contains the time you're requesting (the cache passed to the read function contains more than just the requested times of data). Just in case that changes your interpretation, Patrick.
So I was playing with this earlier and found that I could reproduce the error on playground, but if I used the exact same versions of gwpy and glue on a different headnode, I couldn't reproduce it. Assuming I didn't make a mistake, I think that, plus what you found, points to some dependency causing this. If that's the case, we should be able to compare the lock files between production and playground and see what else is different.
It seems like this issue is isolated to the playground host. @geoffrey.mo, is it easy enough for you to run this quickly on emfollow-test? One difference between the production and other hosts is that they don't have the igwn conda distribution.
glue depends on ligo-segments. Given the nature of the error with intervals, it might be worthwhile to check the versions of ligo-segments in the igwn conda envs vs. our poetry lock file.
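One quick way to compare, sketched with importlib.metadata (the distribution names below are an assumption on my part; run the same snippet in the igwn conda env and in the poetry env and diff the output):

from importlib.metadata import version

# Print the installed versions of the interval-handling stack in this environment.
for dist in ('ligo-segments', 'lscsoft-glue', 'gwpy'):
    print(dist, version(dist))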
Duncan, I know there's a lot to read here. This comment has a case where you can reproduce the problem, and Roberto's comment just above this (link) describes an interesting behavior we found that we think is related.
File "/home/emfollow-dev/.local/lib/python3.9/site-packages/gwpy/timeseries/io/gwf/framecpp.py", line 28, in <module> from LDAStools import frameCPP # noqa: F401
Indeed, we do have to install LDAStools and possibly NDS2.
@roberto.depietri, the GWpy issue is known to be limited to using LALFrame format="gwf.lalframe" as the GWF interface. If you are able to install python-ldas-tools-framecpp and use that, that should be sufficient.
Otherwise, GWpy 3.0.8 should be issued this week and will include a workaround for this.
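For reference, a sketch of selecting the frameCPP backend explicitly, assuming python-ldas-tools-framecpp is installed and that "gwf.framecpp" is the parallel of the "gwf.lalframe" format named above (the glob pattern is a placeholder; channel and times are from the earlier traceback):

import glob

from gwpy.timeseries import TimeSeries

cache = sorted(glob.glob('frames/V-V1_llhoft-*.gwf'))  # placeholder pattern

# Use the frameCPP GWF reader instead of the LALFrame one.
ts = TimeSeries.read(
    cache,
    'V1:Hrec_hoft_16384Hz_INJ1_O3Replay',
    start=1388417648.252626,
    end=1388417662.252626,
    format='gwf.framecpp',
)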
@duncanmmacleod Thank you! Let me know as soon as GWpy 3.0.8 is available to test it. That would be the simplest solution, and I think we should follow this path and require GWpy >= 3.0.8 in our dependencies.
It seems like the omega scans for this event (and many others) are failing due to the same(?) gwpy.types.series "Cannot append discontiguous TimeSeries" error. Here's a short traceback for the above event:
The problem is fundamentally (I think) that you are passing in a microsecond-precision GPS time, which Python can't represent exactly as a float, so when it comes to working out the start and end times of arrays, precision is lost and mistakes are made.
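A minimal illustration of that precision limit, using a GPS time from the earlier example (numpy.spacing shows the gap between adjacent representable floats at this magnitude):

import numpy as np

gps = 1388417648.252626   # microsecond-precision GPS time from the earlier traceback

# Near 1.4e9, adjacent doubles are ~2.4e-7 apart, so the microsecond digits
# are only stored approximately.
print(np.spacing(gps))    # ~2.4e-07
print(f"{gps:.10f}")      # the stored value is not exactly the literal above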
Do you get the same problem if you attempt the exact same scan using a GPS time rounded to 1/2/3 decimal places?
I thought I had done some work to resolve this in GWpy, but am overloaded with many other things and GWpy is suffering as a result, so I might have imagined that entirely or done the work and never released it, or something else.
Thanks for the suggestion, @duncanmmacleod! Here are some findings:
I'm now using 1403777831.873724, 1403777845.873724 as the start and end times since this also broke for this GW event.
Our usual way of accessing frames is through /dev/shm/kafka, but since the buffer has passed, I'm using find_urls from gwdatafind instead.
This discontiguous TimeSeries problem does not happen with H1_HOFT_C00, but does with H1_llhoft (see snippet 1 below). Maybe @pb can say whether this is expected, or whether there are differences between the frame types that might cause this?
@duncanmmacleod was right about a rounded GPS time working (see snippet 2 below). I think we can use this as a solution.
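A sketch of that test, assuming gwdatafind.find_urls for the frame lookup and a placeholder strain channel (the frame types and GPS times come from the findings above; the rounding of the start/end times is the part that mattered):

from gwdatafind import find_urls
from gwpy.timeseries import TimeSeries

start, end = 1403777831.873724, 1403777845.873724

# Locate archived frame files for the two frame types being compared
# (requires a datafind server configured in the environment).
c00_urls = find_urls('H', 'H1_HOFT_C00', int(start), int(end) + 1)
llhoft_urls = find_urls('H', 'H1_llhoft', int(start), int(end) + 1)

# Rounding the GPS times (here to one decimal place) avoided the
# discontiguous-TimeSeries error; the channel name is a placeholder.
ts = TimeSeries.read(
    llhoft_urls,
    'H1:GDS-CALIB_STRAIN',
    start=round(start, 1),
    end=round(end, 1),
)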