HDF5 support
This series adds support for saving and loading noise budgets as HDF5 files: save/load functions are added, along with support in the CLI.
The only discussion point I can see here is the format of the saved HDF5 files. I've gone with a fairly straightforward layout, but maybe everyone would like to settle on something else before we merge this.
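To make the discussion concrete, here is a minimal sketch of the kind of straightforward layout described in this thread (traces under `/traces`, the IFO YAML dump in a file attribute). The function names `save_budget`/`load_budget` and the `SCHEMA_VERSION` attribute are illustrative assumptions, not the merged implementation:

```python
import os
import tempfile
import numpy as np
import h5py


def save_budget(path, freq, traces, ifo_yaml):
    """Write a frequency vector, per-noise traces, and IFO YAML to HDF5.

    Illustrative sketch only; names and layout are assumptions.
    """
    with h5py.File(path, 'w') as f:
        f.attrs['SCHEMA_VERSION'] = '0.1'   # versioned, per the discussion below
        f.attrs['IFO'] = ifo_yaml           # YAML dump of the IFO Struct
        tgrp = f.create_group('traces')
        tgrp.create_dataset('Freq', data=freq)
        for name, psd in traces.items():
            tgrp.create_dataset(name, data=psd)


def load_budget(path):
    """Read back the frequency vector, traces, and IFO YAML string."""
    with h5py.File(path, 'r') as f:
        freq = f['traces/Freq'][:]
        traces = {k: f['traces'][k][:] for k in f['traces'] if k != 'Freq'}
        return freq, traces, f.attrs['IFO']


# Round-trip example with dummy data
freq = np.logspace(np.log10(5), np.log10(6000), 10)
traces = {'Quantum': 1e-47 * np.ones_like(freq)}
path = os.path.join(tempfile.mkdtemp(), 'budget.h5')
save_budget(path, freq, traces, "Infrastructure:\n  Length: 3995\n")
freq2, traces2, ifo_yaml = load_budget(path)
```

The round trip preserves the trace arrays and the YAML string verbatim; everything else in this thread is about where exactly those pieces should live in the container.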
Merge request reports
Activity
added 9 commits
- 24118c35 - 1 commit from branch gwinc:master
- 517277ea - move dhdl function from util module into gwinc module
- 2409a205 - struct: rename YAML loader for clarity
- f9e82273 - struct: Struct.from_yaml class method to load from yaml string
- f7067598 - initial HDF5 file read/write support
- d268a030 - CLI: support saving budget to HDF5 file
- f26b6d2a - CLI: load noise budget from HDF5 for plotting
- 5cc8bc5e - CLI: add ability to dump IFO YAML data to stdout
- d7ce981c - CLI: clean up help
Two discussion points:
Schema version:
- The schema should probably have a version associated with it so that we're not stuck with today's format forever. I'd vote for using something <1.0 right now in case this schema is changed before widespread use.
Container format:
- The YAML dump should maybe not be stored directly in `f.attrs['IFO']`. You may want to support different configuration formats later, so the YAML-formatted IFO dump should be labelled as such, or contained within a YAML namespace somewhere else in the container.
- Should the noise data be stored somewhere in the HDF5 container more precise than just `/traces`? It would be nice to support transfer functions and other raw measurements as well as noise. Later we may wish to distinguish data from something else (e.g. the configuration settings used to generate the data), so perhaps use e.g. `/data/noise/` as the root for GWINC outputs. Should each trace also have its own subfolder, to allow metadata related to each trace to be stored close by?
The above points depend on what the purpose of this format will be. I'd like to have the option of combining GWINC data with non-GWINC data in the same file, in case some other tool wants to do something with both data. It may be that Finesse and other tools can one day export into this format too, if the schema is designed to allow this. I'd vote for at least namespacing the GWINC specific stuff in the container, but keeping the raw noise traces separate.
...But with a schema version all of these changes can be worked out over time instead of up front
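Pulling the suggestions above together, a hypothetical namespaced layout might look something like the following (all names illustrative, nothing decided):

```
SCHEMA_VERSION = "0.1"        # file-level attribute
/data/noise/Quantum           # GWINC noise traces, one dataset per trace
/data/noise/Seismic
/data/transfer/...            # room for transfer functions, other measurements
/gwinc/ifo                    # IFO configuration dump, labelled as YAML
```

This keeps raw noise traces separate from GWINC-specific configuration, leaving room for non-GWINC data alongside.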
@sean-leavey I think you've been working on specifying a common data format, as has @duncanmmacleod. In fact, maybe we should just follow as closely as possible the spec defined by the common data formats group.
I don't see any need to allow for multiple specifications for the structured data format of the IFO parameters. We should just pick one and fix it in the schema (YAML, JSON, etc.).
@jameson.rollins I've not worked on anything concrete, just been saying how nice it'd be to have one. Thanks, I didn't know about the common data format already prepared by the detchar team - that looks useful!
I agree the main point for discussion here is the file format.
Have you considered how this new format might overlap with the existing DTT XML? (which is defined in T990013 appendix B)
Of course, much of what's in there is not strictly relevant for gwinc. But if we are trying to establish a generic noise budget data format, I think we'd be interested in capturing as much metadata about a spectrum or transfer function as DTT generates when measuring it. (And someday DTT v3000 will be able to open noise budget files in this format and update the traces for us...)
That's a good point - DTT XML might be a nice ready-made format to adopt, or to fork.
It sounds like this topic needs more careful thought, and it might disrupt development to think too deeply about this now. Perhaps it's worth opening discussion with the common data formats group about a plan for this kind of data, but in the meantime just use the container format in this merge request.
I've been investigating how we could better conform to the common data format.
T1800014 section 2.1.1 defines an HDF5 format for "series" data with regular sample step defined by parameters x0, dx, and xunit, i.e. no explicit xarray. I had a discussion with the group about how to handle logarithmic sequences and the suggestion is just to specify a "log()" unit.
So in our case the default frequency vector would be stored as:

xunit = "log10(Hz)"
x0 = 0.69897
dx = 0.00102674

or:

xunit = "log(Hz)"
x0 = 8.69005816
dx = 0.00236415
Clients that read the file would have to parse the xunit appropriately. Does that seem reasonable to people?
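For illustration, a client-side decode of this scheme might look like the sketch below. The `"log10(…)"` handling is my reading of the suggestion above, not part of any agreed spec, and the 3000-sample count is an assumption consistent with the quoted `dx`:

```python
import numpy as np


def reconstruct_x(xunit, x0, dx, n):
    """Rebuild the sample vector for a regularly-sampled series.

    Sketch of the proposed xunit/x0/dx convention; not an agreed spec.
    """
    x = x0 + dx * np.arange(n)
    if xunit.startswith('log10('):
        return 10.0 ** x          # log10-spaced samples
    if xunit.startswith('log('):
        return np.exp(x)          # natural-log-spaced samples
    return x                      # plain linear sampling


# Decoding the log10 parameters quoted above, assuming 3000 samples,
# should recover a log-spaced frequency vector starting at 5 Hz.
freq = reconstruct_x("log10(Hz)", 0.69897, 0.00102674, 3000)
```

The awkward part, as noted, is that every client has to string-match the unit to know which transform to apply.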
I also suggested that the common data format be extended to handle log arrays with something like a "log" bool flag.
Thoughts appreciated.
I don't like the way this common data format spec is going. We should not be obliged to parse a unit string just to decode the frequency vector. And how would we handle irregularly sampled data, e.g. when stitching together a low frequency spectrum and a high frequency spectrum?
HDF5 supports compression, so having an xarray in the format does not necessarily bloat the file size very much.
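As a minimal illustration of that point, h5py exposes HDF5's built-in gzip filter directly on `create_dataset`, so an explicit xarray can be stored compressed with no extra machinery (dataset name and path are illustrative):

```python
import os
import tempfile
import numpy as np
import h5py

# An explicit log-spaced frequency vector, stored with gzip compression
freq = np.logspace(np.log10(5), np.log10(6000), 3000)
path = os.path.join(tempfile.mkdtemp(), 'xarray.h5')

with h5py.File(path, 'w') as f:
    f.create_dataset('xarray', data=freq,
                     compression='gzip', compression_opts=9)

# Round-trip: the array is transparently decompressed on read
with h5py.File(path, 'r') as f:
    restored = f['xarray'][:]
```

The compression is lossless and transparent to readers, which removes the file-size argument for implicit x0/dx-only sampling.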
I agree that decoding the unit string seems less than ideal. And I guess that irregularly sampled data could be encoded in a table?
I submitted a bug report (detchar/common-data-formats#2) about a scaling flag.
Maybe we could just suggest to them an extension of the series spec to handle specifying the x array.
Or @christopher.wipf, do you think we shouldn't bother spec'ing for interoperability with their effort?
I think it's good that you've engaged with them. I hope they end up with a standard that's useful and that we'd want to support. But I'm inclined to agree with Sean's point above: let's not get bogged down waiting for the perfect standard to arrive. So I'll merge this, and we can certainly evolve the format or add others if warranted down the line.
I just found xarray, based on pandas, which provides a `DataArray` class that handles data with labels and metadata. It can read and write netCDF files, which are HDF5-based containers for self-described datasets. Could be useful.