Segments: Add segment querying utils, refactoring
This merge request is focused on adding utilties and refactoring for segment-related logic:
Python API
- Move common segments utils to segments module (segutil -> segments)
- Utilities for generating job segment boundaries (originally part of inspiral_pipe.py)
- Add new utilities
- Query science/veto segments from DQSegDB and GWOSC
- Note: some of this exists in standalone scripts in various places
Programs
- Add gstlal_query_(gwosc/dqsegdb)_segments
- Get science segments from GWOSC/DQSegDB with similar CLI
- Leverages functions from gstlal.segments
- Add gstlal_query_(gwosc/dqsegdb)_veto_segments
- Get vetoes from GWOSC/DQSegDB with similar CLI
- Leverages functions from gstlal.segments
- For GWOSC, scripts existed for this for O2. generalize for all runs
- For DQSegDB, we currently rely on ligolw_segments_from_cats_dqsegdb
- Scripts not been ported to python3, unclear if script sees any usage beyond GstLAL
- No python API analog to this without depending on gwpy
Update: removed the I/O segments utilities
Merge request reports
Activity
some comments:
- unless there is a reason for adding type restrictions to the arguments, they should all be removed. there are places in our code where it is necessary for a variable to be a specific type. for example when converting the bytes contained in gstreamer buffers to trigger objects in Python it is necessary to use the Python wrapping of the exact C structure definition that was packed into the buffer by the trigger generator. no flexibility can be permitted or terrible things happen. so there certainly are times when you need to restrict the type of a variable, but, for example, why does this code only allow integers and LIGOTimeGPS to be used for process start and stop times? the I/O library contains tests to determine if a suitable object has been provided, are those tests not sufficient?
- please don't introduce functions that read(write) one named segment list from(to) one named file. that's literally exactly what lalsuite's XML I/O interface looked like 15 years ago when I took on the task of fixing that problem (i.e. the inefficient unmaintainable programs that design leads to) by writing the python I/O library that allows whole documents to be loaded, manipulated, and written back to disk. reading a coalesced named segmentlist from a named file is already only one line of code. it's kind of a long line of code, and you maybe have to look up what it is each time, and it's only one line if you're already importing XML I/O modules for other reasons, but it is a one-liner. gstlal_cs_triggergen, gstlal_play, gstlal_ligo_data_find_check, etc., need more lines of code to use the new function than to just do it inline themselves. Chad can explain whether you can sire an inca or inca a sire. he'll know what that means, and he can explain how writing "one function to read one table from a file" lead to no programs being able to produce output that any other programs could read, except for whichever one special program was originally intended to come after each. this is a bad direction to take the code in.
- as I said on the call, I'd encourage thought to be put into thinking about pushing things upstream to help. for example, there's a case to be made for making coalescing be the default behaviour of the segment list extraction routines in ligo.lw. I think experience shows that that is overwhelmingly the use case of the code, I think exactly nobody relies on the current default behaviour (preserving the segments verbatim as they appear in the XML file). that API change would have to be co-ordinated with the collaboration just in case, but it would shorten the get-a-segmentlist one-liners by the ".coalesce()" at the end of each one.
- another thing I've been threatening to do in the ligo.lw library is to flatten the API. the code was written at a time when basically nobody could read or write these files except through libmetaio-style hacks, and it wasn't clear how compatible all the different ways people were doing it really were. that meant I had to create a layered library with the lowest level imposing only the DTD rules, no policy about file contents, so that it would be sure to be able to read anything. people who knew they were working with files containing tables compatible with the standard definitions could enable a higher-level interface (the lsctables module) that added features to support those tables but then would refuse to load documents that didn't match the structure even if the documents were valid LIGOLW XML. this is why you always need to construct a content handler with your particular desired set of features turned on. at this time, all XML files used with the ligo.lw library are compatible with the highest level interfaces, and if ever one is found that isn't there's a case to be made for telling the author of the software that generated it to get with the program. so really there's no need to preserve the ability to turn those features off. that's the threat: make ligo.lw be the definition of the format, make non-compliance with its expected document structure a fatal I/O error. all of the imports of lsctables, array, param, the creation of content handlers, and the passing of those content handlers to XML reading functions could all go away if people are generally onboard with this. but if this is done, then there'll be really no reason for the segment list read and write functions to exist, it really will be just a couple of lines of code to get a segment list even in a program that is not otherwise reading or writing XML files. so if the line count is important, I'd say this is the way to do it, not writing wrapper functions.
Thanks @kipp.cannon for the feedback. I'll reply as I have time to do so so this may be split up into a few replies.
unless there is a reason for adding type restrictions to the arguments, they should all be removed. there are places in our code where it is necessary for a variable to be a specific type. for example when converting the bytes contained in gstreamer buffers to trigger objects in Python it is necessary to use the Python wrapping of the exact C structure definition that was packed into the buffer by the trigger generator. no flexibility can be permitted or terrible things happen. so there certainly are times when you need to restrict the type of a variable, but, for example, why does this code only allow integers and LIGOTimeGPS to be used for process start and stop times? the I/O library contains tests to determine if a suitable object has been provided, are those tests not sufficient?
I think I may be misunderstanding the issue. These aren't type restrictions, they are type annotations and there are no checks done by the interpreter on runtime to validate. With type annotations, you can run a static type checker (e.g. mypy) externally to do validation at "compile" time but that's out of scope for this. These are purely for the benefit for documentation unless your project uses external tools to enforce this.
Here's one example: https://git.ligo.org/lscsoft/gstlal/-/blob/master/gstlal/python/segutil.py#L48. What am I allowed to pass in for
allsegs
? What aboutboundary_seg
? Based on the names, you can probably guess what you're supposed to pass in. How about thetime_slices
arg in this constructor? https://git.ligo.org/lscsoft/gstlal/-/blob/master/gstlal-inspiral/python/cbc_template_fir.py#L204. I probably wouldn't know without doing some digging and seeing how it's used in other places.As for "why does this code only allow integers and LIGOTimeGPS to be used for process start and stop time", I assume you're talking about the
query_*_segments
functions. It's true that ligo-segments and LIGOTimeGPS don't care whether this is an int, but GWOSC/DQSegDB only handle integer second boundaries and passing in an arbitrary float would likely either return an error or at least not have the expected behavior. If this is in fact not the case, then the type annotation for start/stop times should be corrected.After the explanation above, do you still have the same thoughts on type annotations?
please don't introduce functions that read(write) one named segment list from(to) one named file. that's literally exactly what lalsuite's XML I/O interface looked like 15 years ago when I took on the task of fixing that problem (i.e. the inefficient unmaintainable programs that design leads to) by writing the python I/O library that allows whole documents to be loaded, manipulated, and written back to disk. reading a coalesced named segmentlist from a named file is already only one line of code. it's kind of a long line of code, and you maybe have to look up what it is each time, and it's only one line if you're already importing XML I/O modules for other reasons, but it is a one-liner. gstlal_cs_triggergen, gstlal_play, gstlal_ligo_data_find_check, etc., need more lines of code to use the new function than to just do it inline themselves. Chad can explain whether you can sire an inca or inca a sire. he'll know what that means, and he can explain how writing "one function to read one table from a file" lead to no programs being able to produce output that any other programs could read, except for whichever one special program was originally intended to come after each. this is a bad direction to take the code in.
That's a fair point, I'll be happy to remove/revert as needed. My motivation for this in case it's helpful is I was not looking to have a general replacement for any I/O code, but rather its focus is accessing data products on the analysis side. The analysis generates data products including segments either as part of the DAG or used as inputs to the DAG. The question I'm asking as a user is, how do I access these data products? Maybe this is where we're diverging in opinions, but as a user, I don't care about implementation details of the data format (and things like the name of the named segments within the XML file). What I do care about is, how do I access in these files that the analysis generated that are stored on disk (or elsewhere). In the LIGOLW/HDF5/etc libraries, there's a general way to create and read in these files, but in the analysis we've made a choice of how we want to store that information. That choice I've typically seen in runs for segments/vetoes are two separate LIGOLW files: a file containing analysis segments stored under the "datasegments" name, and a file containing vetoes stored under the "vetoes" name. Those names are options that can be specified, but typically the Makefile pinned that down to these specific names. The intention for these additional I/O functions were to answer the question: Given this choice, how do I access segments/vetoes in a way that doesn't care about the particular representation of the data on disk?
as I said on the call, I'd encourage thought to be put into thinking about pushing things upstream to help. for example, there's a case to be made for making coalescing be the default behaviour of the segment list extraction routines in ligo.lw. I think experience shows that that is overwhelmingly the use case of the code, I think exactly nobody relies on the current default behaviour (preserving the segments verbatim as they appear in the XML file). that API change would have to be co-ordinated with the collaboration just in case, but it would shorten the get-a-segmentlist one-liners by the ".coalesce()" at the end of each one.
I have no strong opinions on this but that sounds like it would be nice. I'm also not aware of anyone expecting non-coalesced segments stored on disk. Was your idea to have a way to opt out of this default behavior, or to make this the only way to store/retrieve segments on disk?
another thing I've been threatening to do in the ligo.lw library is to flatten the API. the code was written at a time when basically nobody could read or write these files except through libmetaio-style hacks, and it wasn't clear how compatible all the different ways people were doing it really were. that meant I had to create a layered library with the lowest level imposing only the DTD rules, no policy about file contents, so that it would be sure to be able to read anything. people who knew they were working with files containing tables compatible with the standard definitions could enable a higher-level interface (the lsctables module) that added features to support those tables but then would refuse to load documents that didn't match the structure even if the documents were valid LIGOLW XML. this is why you always need to construct a content handler with your particular desired set of features turned on. at this time, all XML files used with the ligo.lw library are compatible with the highest level interfaces, and if ever one is found that isn't there's a case to be made for telling the author of the software that generated it to get with the program. so really there's no need to preserve the ability to turn those features off. that's the threat: make ligo.lw be the definition of the format, make non-compliance with its expected document structure a fatal I/O error. all of the imports of lsctables, array, param, the creation of content handlers, and the passing of those content handlers to XML reading functions could all go away if people are generally onboard with this. but if this is done, then there'll be really no reason for the segment list read and write functions to exist, it really will be just a couple of lines of code to get a segment list even in a program that is not otherwise reading or writing XML files. so if the line count is important, I'd say this is the way to do it, not writing wrapper functions.
With the historical context, that makes sense. For this: "make ligo.lw be the definition of the format, make non-compliance with its expected document structure a fatal I/O error. all of the imports of lsctables, array, param, the creation of content handlers, and the passing of those content handlers to XML reading functions could all go away if people are generally onboard with this.", honestly, that sounds great and would love to see this.
@kipp.cannon @patrick.godwin This does sound nice, potentially. But I could also imagine it going the other way - though maybe there is something that I do not understand. It would be nice if the ligolw library allowed people to continue to make generic documents, but that it was somehow aware of when it needed to do strict checking based on the document not on the code. At a conceptual level, why doesn't the document itself specify the content handler? Why not both simplify the code and preserve the ability to write less structured documents? Sorry if that comment doesn't make sense. I guess it depends on what you think of ligolw as a library going forward - if you only want to support the high level stuff then I guess this makes sense, but that spec is basically dying so it sort of seems EOL to only support that.
added 10 commits
-
1c7bd7c2...d90a5aa3 - 5 commits from branch
master
- 22450d28 - rename segutil.py -> segments.py, minor cleanup
- 371fe2ce - move gwosc segment query logic from gstlal_query_gwosc_segments to segments.query_gwosc_segments
- 358ee499 - segments.py: add query_dqsegdb_segments
- 68bd98ed - add gstlal_query_dqsegdb_segments, migrate gstlal_query_gwosc_segments from gstlal-ugly to gstlal
- 59d665e1 - add utilities to query for veto segments from GWOSC/DQSegDB
Toggle commit list-
1c7bd7c2...d90a5aa3 - 5 commits from branch
added 5 commits
- 198242ee - rename segutil.py -> segments.py, minor cleanup
- c5ce9c0f - move gwosc segment query logic from gstlal_query_gwosc_segments to segments.query_gwosc_segments
- 0f8bea6d - segments.py: add query_dqsegdb_segments
- fc6b2f35 - add gstlal_query_dqsegdb_segments, migrate gstlal_query_gwosc_segments from gstlal-ugly to gstlal
- 700977dc - add utilities to query for veto segments from GWOSC/DQSegDB
Toggle commit list@kipp.cannon, have my replies addressed the comments/concerns you had?
added 19 commits
-
6b2c88de...b6020f6c - 13 commits from branch
master
- 5637cc93 - rename segutil.py -> segments.py, minor cleanup
- 3715ef86 - move gwosc segment query logic from gstlal_query_gwosc_segments to segments.query_gwosc_segments
- 572f6289 - segments.py: add query_dqsegdb_segments
- 8c3f2140 - add gstlal_query_dqsegdb_segments, migrate gstlal_query_gwosc_segments from gstlal-ugly to gstlal
- 8dbc4ac7 - add utilities to query for veto segments from GWOSC/DQSegDB
- c6cdd684 - gstlal_query_*_veto_segments: Fix thinko
Toggle commit list-
6b2c88de...b6020f6c - 13 commits from branch
this appears to add a dependency to a python3-dqsegdb2 package, but there aren't corresponding changes to the .rpm and .deb package metadata. otherwise it seems fine to me.
Edited by Kipp CannonThanks @kipp.cannon, good catch. I'll add that in.
added 23 commits
-
b656296d...b1a0b5d8 - 15 commits from branch
master
- 66cdf4e4 - rename segutil.py -> segments.py, minor cleanup
- 4481b29b - move gwosc segment query logic from gstlal_query_gwosc_segments to segments.query_gwosc_segments
- cdfd8ccb - segments.py: add query_dqsegdb_segments
- 0a23e7f0 - add gstlal_query_dqsegdb_segments, migrate gstlal_query_gwosc_segments from gstlal-ugly to gstlal
- af1325c6 - add utilities to query for veto segments from GWOSC/DQSegDB
- a9a0745e - gstlal_query_*_veto_segments: Fix thinko
- 67631e61 - update conda envs with dqsegdb2 lib
- c0ece453 - add dqsegdb2 dependency to gstlal
Toggle commit list-
b656296d...b1a0b5d8 - 15 commits from branch