glue.segmentdb.segmentdb_utils.get_all_files_in_range() silently fails to remove 2nd bad filename (and 4th, 6th, etc.)
This is an outgrowth of dqsegdb issue 111, in which it was noted that ligolw_publish_threaded_dqxml_dqsegdb would not publish DQXML files from a directory containing 2 or more files with bad filenames. It would silently skip the unpublished files in that directory until the number of bad filenames was reduced to 1 or 0. Quoted from that page:
[quote]
The problem is in glue.segmentdb.segmentdb_utils.get_all_files_in_range() (which corresponds to /usr/lib64/python3.6/site-packages/glue/segmentdb/segmentdb_utils.py). The function gets a list of filenames, sorts it, then iterates over it, and whenever it finds a bad filename it removes that entry from the list. The problem is that the removal happens at the position the iterator is currently pointing to: the next entry in the list shifts down into the slot that was just vacated, and the iterator then advances to the following index, so the entry that shifted down is never examined. If there is only one bad filename in the directory, this is not a problem. If there are 2 bad filenames in the directory, they will be sorted into the first 2 slots of the list; the first one is removed, the 2nd one shifts into the 1st slot, and the iterator moves on to the 2nd slot, leaving the bad filename in the 1st slot, which causes trouble somewhere else down the line (probably around line 199 in ligolw_publish_threaded_dqxml_dqsegdb:
pending_files += lal.Cache.from_urls(segmentdb_utils.get_all_files_in_range(options.input_directory,s[0],s[1]),coltype=int).sieve(segment=s)
). This is a well-known pitfall (not necessarily a bug) in Python, and there are standard workarounds, one of the simplest being to traverse the list in reverse order, so that any removal only affects items that have already been checked, e.g., for filename in file_list[::-1]:
[/quote]
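
For reference, below is a minimal, self-contained sketch of the failure mode and the reverse-iteration workaround described in the quote. The filenames and the looks_bad() predicate are hypothetical stand-ins for illustration only; they are not the actual validity check performed inside get_all_files_in_range().

    # Hypothetical predicate standing in for whatever filename check
    # get_all_files_in_range() actually performs.
    def looks_bad(name):
        return not name.endswith(".xml")

    # Two bad filenames sort into the first two slots, as in the report.
    file_list = sorted(["BAD-1.txt", "BAD-2.txt", "GOOD-1.xml", "GOOD-2.xml"])

    # Buggy pattern: removing the current element while iterating shifts its
    # successor into the slot the iterator just visited, so that successor is
    # never examined.
    buggy = list(file_list)
    for filename in buggy:
        if looks_bad(filename):
            buggy.remove(filename)
    print(buggy)  # ['BAD-2.txt', 'GOOD-1.xml', 'GOOD-2.xml'] -- one bad name survives

    # Workaround from the quote: iterate over a reversed copy, so removals only
    # affect entries that have already been checked.
    fixed = list(file_list)
    for filename in fixed[::-1]:
        if looks_bad(filename):
            fixed.remove(filename)
    print(fixed)  # ['GOOD-1.xml', 'GOOD-2.xml']

Because fixed[::-1] is a copy, the loop's iteration sequence is unaffected by the removals; building a new filtered list (e.g. with a list comprehension) would avoid the in-place mutation entirely and works just as well.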