HDF5 support
This series adds support for saving and loading noise budgets as HDF5 files: save/load functions are added, along with support in the CLI.
The only discussion point I can see here is the format of the saved HDF5 files. I've gone with a fairly straightforward layout, but maybe everyone would like to settle on something else before we merge this.
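To make the discussion concrete, here is a minimal sketch of the kind of straightforward layout described in this thread (traces under `/traces`, the IFO YAML dump in a file attribute). The function names `save_budget`/`load_budget` and the `SCHEMA_VERSION` attribute are illustrative assumptions, not the merged implementation:

```python
import os
import tempfile
import numpy as np
import h5py


def save_budget(path, freq, traces, ifo_yaml):
    """Write a frequency vector, per-noise traces, and IFO YAML to HDF5.

    Illustrative sketch only; names and layout are assumptions.
    """
    with h5py.File(path, 'w') as f:
        f.attrs['SCHEMA_VERSION'] = '0.1'   # versioned, per the discussion below
        f.attrs['IFO'] = ifo_yaml           # YAML dump of the IFO Struct
        tgrp = f.create_group('traces')
        tgrp.create_dataset('Freq', data=freq)
        for name, psd in traces.items():
            tgrp.create_dataset(name, data=psd)


def load_budget(path):
    """Read back the frequency vector, traces, and IFO YAML string."""
    with h5py.File(path, 'r') as f:
        freq = f['traces/Freq'][:]
        traces = {k: f['traces'][k][:] for k in f['traces'] if k != 'Freq'}
        return freq, traces, f.attrs['IFO']


# Round-trip example with dummy data
freq = np.logspace(np.log10(5), np.log10(6000), 10)
traces = {'Quantum': 1e-47 * np.ones_like(freq)}
path = os.path.join(tempfile.mkdtemp(), 'budget.h5')
save_budget(path, freq, traces, "Infrastructure:\n  Length: 3995\n")
freq2, traces2, ifo_yaml = load_budget(path)
```

The round trip preserves the trace arrays and the YAML string verbatim; everything else in this thread is about where exactly those pieces should live in the container.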
Merge request reports
Activity
added 9 commits
- 24118c35 - 1 commit from branch gwinc:master
- 517277ea - move dhdl function from util module into gwinc module
- 2409a205 - struct: rename YAML loader for clarity
- f9e82273 - struct: Struct.from_yaml class method to load from yaml string
- f7067598 - initial HDF5 file read/write support
- d268a030 - CLI: support saving budget to HDF5 file
- f26b6d2a - CLI: load noise budget from HDF5 for plotting
- 5cc8bc5e - CLI: add ability to dump IFO YAML data to stdout
- d7ce981c - CLI: clean up help
Two discussion points:
Schema version:
- The schema should probably have a version associated with it so that we're not stuck with today's format forever. I'd vote for using something <1.0 right now in case this schema is changed before widespread use.
Container format:
- The YAML dump should maybe not be stored directly in `f.attrs['IFO']`. You may want to support different configuration formats later, so the YAML-formatted IFO dump should be labelled as such, or contained within a YAML namespace somewhere else in the container.
- Should the noise data be stored somewhere in the HDF5 container more precise than just `/traces`? It would be nice to support transfer functions and other raw measurements as well as noise. Later we may wish to distinguish data from something else (e.g. the configuration settings used to generate the data), so perhaps use e.g. `/data/noise/` as the root for GWINC outputs. Should each trace also have its own subfolder, to allow metadata related to each trace to be stored close by?
The above points depend on what the purpose of this format will be. I'd like to have the option of combining GWINC data with non-GWINC data in the same file, in case some other tool wants to do something with both data. It may be that Finesse and other tools can one day export into this format too, if the schema is designed to allow this. I'd vote for at least namespacing the GWINC specific stuff in the container, but keeping the raw noise traces separate.
...But with a schema version all of these changes can be worked out over time instead of up front
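Pulling the suggestions above together, a hypothetical namespaced layout might look something like the following (all names illustrative, nothing decided):

```
SCHEMA_VERSION = "0.1"        # file-level attribute
/data/noise/Quantum           # GWINC noise traces, one dataset per trace
/data/noise/Seismic
/data/transfer/...            # room for transfer functions, other measurements
/gwinc/ifo                    # IFO configuration dump, labelled as YAML
```

This keeps raw noise traces separate from GWINC-specific configuration, leaving room for non-GWINC data alongside.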
@sean-leavey I think you've been working on specifying a common data format, as has @duncanmmacleod. In fact, maybe we should just follow as closely as possible the spec defined by the common data formats group.
I don't see any need to allow for multiple specifications for the structured data format of the IFO parameters. We should just pick one and fix it in the schema (YAML, JSON, etc.).
@jameson.rollins I've not worked on anything concrete, just been saying how nice it'd be to have one. Thanks, I didn't know about the common data format already prepared by the detchar team - that looks useful!
I agree the main point for discussion here is the file format.
Have you considered how this new format might overlap with the existing DTT XML? (which is defined in T990013 appendix B)
Of course, much of what's in there is not strictly relevant for gwinc. But if we are trying to establish a generic noise budget data format, I think we'd be interested in capturing as much metadata about a spectrum or transfer function as DTT generates when measuring it. (And someday DTT v3000 will be able to open noise budget files in this format and update the traces for us...)
That's a good point - DTT XML might be a nice ready-made format to adopt, or to fork.
It sounds like this topic needs more careful thought, and it might disrupt development to think too deeply about this now. Perhaps it's worth opening discussion with the common data formats group about a plan for this kind of data, but in the meantime just use the container format in this merge request.
I've been investigating how we could better conform to the common data format.
T1800014 section 2.1.1 defines an HDF5 format for "series" data with regular sample step defined by parameters x0, dx, and xunit, i.e. no explicit xarray. I had a discussion with the group about how to handle logarithmic sequences and the suggestion is just to specify a "log()" unit.
So in our case the default frequency vector would be stored as:

xunit = "log10(Hz)"
x0 = 0.69897
dx = 0.00102674

or:

xunit = "log(Hz)"
x0 = 8.69005816
dx = 0.00236415
Clients that read the file would have to parse the xunit appropriately. Does that seem reasonable to people?
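For illustration, a client-side decode of this scheme might look like the sketch below. The `"log10(…)"` handling is my reading of the suggestion above, not part of any agreed spec, and the 3000-sample count is an assumption consistent with the quoted `dx`:

```python
import numpy as np


def reconstruct_x(xunit, x0, dx, n):
    """Rebuild the sample vector for a regularly-sampled series.

    Sketch of the proposed xunit/x0/dx convention; not an agreed spec.
    """
    x = x0 + dx * np.arange(n)
    if xunit.startswith('log10('):
        return 10.0 ** x          # log10-spaced samples
    if xunit.startswith('log('):
        return np.exp(x)          # natural-log-spaced samples
    return x                      # plain linear sampling


# Decoding the log10 parameters quoted above, assuming 3000 samples,
# should recover a log-spaced frequency vector starting at 5 Hz.
freq = reconstruct_x("log10(Hz)", 0.69897, 0.00102674, 3000)
```

The awkward part, as noted, is that every client has to string-match the unit to know which transform to apply.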
I also suggested that the common data format be extended to handle log arrays with something like a "log" bool flag.
Thoughts appreciated.
I don't like the way this common data format spec is going. We should not be obliged to parse a unit string just to decode the frequency vector. And how would we handle irregularly sampled data, e.g. when stitching together a low frequency spectrum and a high frequency spectrum?
HDF5 supports compression, so having an xarray in the format does not necessarily bloat the file size very much.
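As a minimal illustration of that point, h5py exposes HDF5's built-in gzip filter directly on `create_dataset`, so an explicit xarray can be stored compressed with no extra machinery (dataset name and path are illustrative):

```python
import os
import tempfile
import numpy as np
import h5py

# An explicit log-spaced frequency vector, stored with gzip compression
freq = np.logspace(np.log10(5), np.log10(6000), 3000)
path = os.path.join(tempfile.mkdtemp(), 'xarray.h5')

with h5py.File(path, 'w') as f:
    f.create_dataset('xarray', data=freq,
                     compression='gzip', compression_opts=9)

# Round-trip: the array is transparently decompressed on read
with h5py.File(path, 'r') as f:
    restored = f['xarray'][:]
```

The compression is lossless and transparent to readers, which removes the file-size argument for implicit x0/dx-only sampling.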
I agree that decoding the unit string seems less than ideal. And I guess that irregularly sampled data could be encoded in a table?
I submitted a bug report (detchar/common-data-formats#2) about a scaling flag.
Maybe we could just suggest to them an extension of the series spec to handle specifying the x array.
Or @christopher.wipf, do you think we shouldn't bother spec'ing for interoperability with their effort?
I think it's good that you've engaged with them. I hope they end up with a standard that's useful and that we'd want to support. But I'm inclined to agree with Sean's point above: let's not get bogged down waiting for the perfect standard to arrive. So I'll merge this, and we can certainly evolve the format or add others if warranted down the line.
I just found xarray, based on pandas, which provides a `DataArray` class that handles data with labels and metadata. It can read and write netCDF files, which are HDF5-based containers for self-described datasets. Could be useful.