Segments: Add segment querying utils, refactoring

changed the description

some comments:

unless there is a reason for adding type restrictions to the arguments, they should all be removed. there are places in our code where it is necessary for a variable to be a specific type. for example when converting the bytes contained in gstreamer buffers to trigger objects in Python it is necessary to use the Python wrapping of the exact C structure definition that was packed into the buffer by the trigger generator. no flexibility can be permitted or terrible things happen. so there certainly are times when you need to restrict the type of a variable, but, for example, why does this code only allow integers and LIGOTimeGPS to be used for process start and stop times? the I/O library contains tests to determine if a suitable object has been provided, are those tests not sufficient?
please don't introduce functions that read(write) one named segment list from(to) one named file. that's literally exactly what lalsuite's XML I/O interface looked like 15 years ago when I took on the task of fixing that problem (i.e. the inefficient unmaintainable programs that design leads to) by writing the python I/O library that allows whole documents to be loaded, manipulated, and written back to disk. reading a coalesced named segmentlist from a named file is already only one line of code. it's kind of a long line of code, and you maybe have to look up what it is each time, and it's only one line if you're already importing XML I/O modules for other reasons, but it is a one-liner. gstlal_cs_triggergen, gstlal_play, gstlal_ligo_data_find_check, etc., need more lines of code to use the new function than to just do it inline themselves. Chad can explain whether you can sire an inca or inca a sire. he'll know what that means, and he can explain how writing "one function to read one table from a file" lead to no programs being able to produce output that any other programs could read, except for whichever one special program was originally intended to come after each. this is a bad direction to take the code in.
as I said on the call, I'd encourage thought to be put into thinking about pushing things upstream to help. for example, there's a case to be made for making coalescing be the default behaviour of the segment list extraction routines in ligo.lw. I think experience shows that that is overwhelmingly the use case of the code, I think exactly nobody relies on the current default behaviour (preserving the segments verbatim as they appear in the XML file). that API change would have to be co-ordinated with the collaboration just in case, but it would shorten the get-a-segmentlist one-liners by the ".coalesce()" at the end of each one.
another thing I've been threatening to do in the ligo.lw library is to flatten the API. the code was written at a time when basically nobody could read or write these files except through libmetaio-style hacks, and it wasn't clear how compatible all the different ways people were doing it really were. that meant I had to create a layered library with the lowest level imposing only the DTD rules, no policy about file contents, so that it would be sure to be able to read anything. people who knew they were working with files containing tables compatible with the standard definitions could enable a higher-level interface (the lsctables module) that added features to support those tables but then would refuse to load documents that didn't match the structure even if the documents were valid LIGOLW XML. this is why you always need to construct a content handler with your particular desired set of features turned on. at this time, all XML files used with the ligo.lw library are compatible with the highest level interfaces, and if ever one is found that isn't there's a case to be made for telling the author of the software that generated it to get with the program. so really there's no need to preserve the ability to turn those features off. that's the threat: make ligo.lw be the definition of the format, make non-compliance with its expected document structure a fatal I/O error. all of the imports of lsctables, array, param, the creation of content handlers, and the passing of those content handlers to XML reading functions could all go away if people are generally onboard with this. but if this is done, then there'll be really no reason for the segment list read and write functions to exist, it really will be just a couple of lines of code to get a segment list even in a program that is not otherwise reading or writing XML files. so if the line count is important, I'd say this is the way to do it, not writing wrapper functions.

Thanks @kipp.cannon for the feedback. I'll reply as I have time to do so so this may be split up into a few replies.

unless there is a reason for adding type restrictions to the arguments, they should all be removed. there are places in our code where it is necessary for a variable to be a specific type. for example when converting the bytes contained in gstreamer buffers to trigger objects in Python it is necessary to use the Python wrapping of the exact C structure definition that was packed into the buffer by the trigger generator. no flexibility can be permitted or terrible things happen. so there certainly are times when you need to restrict the type of a variable, but, for example, why does this code only allow integers and LIGOTimeGPS to be used for process start and stop times? the I/O library contains tests to determine if a suitable object has been provided, are those tests not sufficient?

I think I may be misunderstanding the issue. These aren't type restrictions, they are type annotations and there are no checks done by the interpreter on runtime to validate. With type annotations, you can run a static type checker (e.g. mypy) externally to do validation at "compile" time but that's out of scope for this. These are purely for the benefit for documentation unless your project uses external tools to enforce this.

Here's one example: https://git.ligo.org/lscsoft/gstlal/-/blob/master/gstlal/python/segutil.py#L48. What am I allowed to pass in for allsegs? What about boundary_seg? Based on the names, you can probably guess what you're supposed to pass in. How about the time_slices arg in this constructor? https://git.ligo.org/lscsoft/gstlal/-/blob/master/gstlal-inspiral/python/cbc_template_fir.py#L204. I probably wouldn't know without doing some digging and seeing how it's used in other places.

As for "why does this code only allow integers and LIGOTimeGPS to be used for process start and stop time", I assume you're talking about the query_*_segments functions. It's true that ligo-segments and LIGOTimeGPS don't care whether this is an int, but GWOSC/DQSegDB only handle integer second boundaries and passing in an arbitrary float would likely either return an error or at least not have the expected behavior. If this is in fact not the case, then the type annotation for start/stop times should be corrected.

After the explanation above, do you still have the same thoughts on type annotations?

please don't introduce functions that read(write) one named segment list from(to) one named file. that's literally exactly what lalsuite's XML I/O interface looked like 15 years ago when I took on the task of fixing that problem (i.e. the inefficient unmaintainable programs that design leads to) by writing the python I/O library that allows whole documents to be loaded, manipulated, and written back to disk. reading a coalesced named segmentlist from a named file is already only one line of code. it's kind of a long line of code, and you maybe have to look up what it is each time, and it's only one line if you're already importing XML I/O modules for other reasons, but it is a one-liner. gstlal_cs_triggergen, gstlal_play, gstlal_ligo_data_find_check, etc., need more lines of code to use the new function than to just do it inline themselves. Chad can explain whether you can sire an inca or inca a sire. he'll know what that means, and he can explain how writing "one function to read one table from a file" lead to no programs being able to produce output that any other programs could read, except for whichever one special program was originally intended to come after each. this is a bad direction to take the code in.

That's a fair point, I'll be happy to remove/revert as needed. My motivation for this in case it's helpful is I was not looking to have a general replacement for any I/O code, but rather its focus is accessing data products on the analysis side. The analysis generates data products including segments either as part of the DAG or used as inputs to the DAG. The question I'm asking as a user is, how do I access these data products? Maybe this is where we're diverging in opinions, but as a user, I don't care about implementation details of the data format (and things like the name of the named segments within the XML file). What I do care about is, how do I access in these files that the analysis generated that are stored on disk (or elsewhere). In the LIGOLW/HDF5/etc libraries, there's a general way to create and read in these files, but in the analysis we've made a choice of how we want to store that information. That choice I've typically seen in runs for segments/vetoes are two separate LIGOLW files: a file containing analysis segments stored under the "datasegments" name, and a file containing vetoes stored under the "vetoes" name. Those names are options that can be specified, but typically the Makefile pinned that down to these specific names. The intention for these additional I/O functions were to answer the question: Given this choice, how do I access segments/vetoes in a way that doesn't care about the particular representation of the data on disk?

as I said on the call, I'd encourage thought to be put into thinking about pushing things upstream to help. for example, there's a case to be made for making coalescing be the default behaviour of the segment list extraction routines in ligo.lw. I think experience shows that that is overwhelmingly the use case of the code, I think exactly nobody relies on the current default behaviour (preserving the segments verbatim as they appear in the XML file). that API change would have to be co-ordinated with the collaboration just in case, but it would shorten the get-a-segmentlist one-liners by the ".coalesce()" at the end of each one.

I have no strong opinions on this but that sounds like it would be nice. I'm also not aware of anyone expecting non-coalesced segments stored on disk. Was your idea to have a way to opt out of this default behavior, or to make this the only way to store/retrieve segments on disk?

another thing I've been threatening to do in the ligo.lw library is to flatten the API. the code was written at a time when basically nobody could read or write these files except through libmetaio-style hacks, and it wasn't clear how compatible all the different ways people were doing it really were. that meant I had to create a layered library with the lowest level imposing only the DTD rules, no policy about file contents, so that it would be sure to be able to read anything. people who knew they were working with files containing tables compatible with the standard definitions could enable a higher-level interface (the lsctables module) that added features to support those tables but then would refuse to load documents that didn't match the structure even if the documents were valid LIGOLW XML. this is why you always need to construct a content handler with your particular desired set of features turned on. at this time, all XML files used with the ligo.lw library are compatible with the highest level interfaces, and if ever one is found that isn't there's a case to be made for telling the author of the software that generated it to get with the program. so really there's no need to preserve the ability to turn those features off. that's the threat: make ligo.lw be the definition of the format, make non-compliance with its expected document structure a fatal I/O error. all of the imports of lsctables, array, param, the creation of content handlers, and the passing of those content handlers to XML reading functions could all go away if people are generally onboard with this. but if this is done, then there'll be really no reason for the segment list read and write functions to exist, it really will be just a couple of lines of code to get a segment list even in a program that is not otherwise reading or writing XML files. so if the line count is important, I'd say this is the way to do it, not writing wrapper functions.

With the historical context, that makes sense. For this: "make ligo.lw be the definition of the format, make non-compliance with its expected document structure a fatal I/O error. all of the imports of lsctables, array, param, the creation of content handlers, and the passing of those content handlers to XML reading functions could all go away if people are generally onboard with this.", honestly, that sounds great and would love to see this.

@kipp.cannon @patrick.godwin This does sound nice, potentially. But I could also imagine it going the other way - though maybe there is something that I do not understand. It would be nice if the ligolw library allowed people to continue to make generic documents, but that it was somehow aware of when it needed to do strict checking based on the document not on the code. At a conceptual level, why doesn't the document itself specify the content handler? Why not both simplify the code and preserve the ability to write less structured documents? Sorry if that comment doesn't make sense. I guess it depends on what you think of ligolw as a library going forward - if you only want to support the high level stuff then I guess this makes sense, but that spec is basically dying so it sort of seems EOL to only support that.

added 10 commits

1c7bd7c2...d90a5aa3 - 5 commits from branch master
22450d28 - rename segutil.py -> segments.py, minor cleanup
371fe2ce - move gwosc segment query logic from gstlal_query_gwosc_segments to segments.query_gwosc_segments
358ee499 - segments.py: add query_dqsegdb_segments
68bd98ed - add gstlal_query_dqsegdb_segments, migrate gstlal_query_gwosc_segments from gstlal-ugly to gstlal
59d665e1 - add utilities to query for veto segments from GWOSC/DQSegDB

Compare with previous version

added 5 commits

198242ee - rename segutil.py -> segments.py, minor cleanup
c5ce9c0f - move gwosc segment query logic from gstlal_query_gwosc_segments to segments.query_gwosc_segments
0f8bea6d - segments.py: add query_dqsegdb_segments
fc6b2f35 - add gstlal_query_dqsegdb_segments, migrate gstlal_query_gwosc_segments from gstlal-ugly to gstlal
700977dc - add utilities to query for veto segments from GWOSC/DQSegDB

Compare with previous version

changed the description

added 1 commit

6b2c88de - gstlal_query_*_veto_segments: Fix thinko

Compare with previous version

@kipp.cannon, have my replies addressed the comments/concerns you had?

added 19 commits

6b2c88de...b6020f6c - 13 commits from branch master
5637cc93 - rename segutil.py -> segments.py, minor cleanup
3715ef86 - move gwosc segment query logic from gstlal_query_gwosc_segments to segments.query_gwosc_segments
572f6289 - segments.py: add query_dqsegdb_segments
8c3f2140 - add gstlal_query_dqsegdb_segments, migrate gstlal_query_gwosc_segments from gstlal-ugly to gstlal
8dbc4ac7 - add utilities to query for veto segments from GWOSC/DQSegDB
c6cdd684 - gstlal_query_*_veto_segments: Fix thinko

Compare with previous version

added 1 commit

b656296d - update conda envs with dqsegdb2 lib

Compare with previous version

this appears to add a dependency to a python3-dqsegdb2 package, but there aren't corresponding changes to the .rpm and .deb package metadata. otherwise it seems fine to me.

Thanks @kipp.cannon, good catch. I'll add that in.

added 23 commits

b656296d...b1a0b5d8 - 15 commits from branch master
66cdf4e4 - rename segutil.py -> segments.py, minor cleanup
4481b29b - move gwosc segment query logic from gstlal_query_gwosc_segments to segments.query_gwosc_segments
cdfd8ccb - segments.py: add query_dqsegdb_segments
0a23e7f0 - add gstlal_query_dqsegdb_segments, migrate gstlal_query_gwosc_segments from gstlal-ugly to gstlal
af1325c6 - add utilities to query for veto segments from GWOSC/DQSegDB
a9a0745e - gstlal_query_*_veto_segments: Fix thinko
67631e61 - update conda envs with dqsegdb2 lib
c0ece453 - add dqsegdb2 dependency to gstlal

Compare with previous version

merged

Segments: Add segment querying utils, refactoring

Python API

Programs

Merged by Patrick Godwin 3 years ago (Oct 14, 2021 2:38am UTC) 3 years ago

Activity

Segments: Add segment querying utils, refactoring

Python API

Programs

Merge request reports

Merged by Patrick Godwin 3 years ago (Oct 14, 2021 2:38am UTC) 3 years ago

Activity