HDF5 support

Merged: Jameson Rollins requested to merge jameson.rollins/pygwinc:hdf5-support into master

This series adds support for saving noise budgets to, and loading them from, HDF5 files. Functions to save and load HDF5 files are added, along with corresponding support in the CLI.

The only point of discussion I can see is the format of the saved HDF5 files. I've used a fairly straightforward layout, but maybe everyone would like to settle on something else before we merge this.
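
For reference, here is a minimal sketch of the kind of save/load round trip this enables, written directly against h5py (the function names are illustrative, not the actual pygwinc API; the layout assumes traces stored as datasets under a "traces" group and the IFO parameters stored as a YAML string attribute, roughly as discussed below):

    # Minimal sketch, not the actual pygwinc API: round-trip a noise budget
    # through HDF5 with h5py, storing traces as datasets and the IFO
    # parameters as a YAML string attribute.
    import h5py
    import yaml

    def save_budget_hdf5(path, freq, traces, ifo):
        with h5py.File(path, 'w') as f:
            f.attrs['IFO'] = yaml.dump(ifo)         # IFO parameters as YAML text
            f.create_dataset('Freq', data=freq)
            grp = f.create_group('traces')
            for name, psd in traces.items():
                grp.create_dataset(name, data=psd)  # one dataset per noise trace

    def load_budget_hdf5(path):
        with h5py.File(path, 'r') as f:
            ifo = yaml.safe_load(f.attrs['IFO'])
            freq = f['Freq'][:]
            traces = {name: dset[:] for name, dset in f['traces'].items()}
        return freq, traces, ifo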

Merge request reports

Pipeline #20971 passed

Pipeline passed for d7ce981c on jameson.rollins:hdf5-support

Approval is optional

Merged by Christopher Wipf 6 years ago (Jun 8, 2018 9:36pm UTC)

Merge details

  • Changes merged into master with d7ce981c.
  • Deleted the source branch.

Pipeline #21608 passed

Pipeline passed for d7ce981c on master

Activity

  • Jameson Rollins added 9 commits

    • 24118c35 - 1 commit from branch gwinc:master
    • 517277ea - move dhdl function from util module into gwinc module
    • 2409a205 - struct: rename YAML loader for clarity
    • f9e82273 - struct: Struct.from_yaml class method to load from yaml string
    • f7067598 - initial HDF5 file read/write support
    • d268a030 - CLI: support saving budget to HDF5 file
    • f26b6d2a - CLI: load noise budget from HDF5 for plotting
    • 5cc8bc5e - CLI: add ability to dump IFO YAML data to stdout
    • d7ce981c - CLI: clean up help

  • Two discussion points:

    Schema version:

    • The schema should probably have a version associated with it so that we're not stuck with today's format forever. I'd vote for using something <1.0 right now, in case the schema changes before it sees widespread use.

    Container format:

    • The YAML dump should maybe not be stored directly in f.attrs['IFO']. You may want to support different configuration formats later, so the YAML-formatted IFO dump should either be labelled as such or be contained within a YAML namespace somewhere else in the container.
    • Should the noise data be stored under something more specific than just /traces in the HDF5 container? It would be nice to support transfer functions and other raw measurements as well as noise. Later we may wish to distinguish data from other things (e.g. the configuration settings used to generate the data), so perhaps use e.g. /data/noise/ as the root for GWINC outputs. Should each trace also have its own subfolder, so that metadata related to each trace can be stored close by? (A rough sketch of such a layout follows at the end of this comment.)

    The above points depend on what the purpose of this format will be. I'd like to have the option of combining GWINC data with non-GWINC data in the same file, in case some other tool wants to do something with both datasets. It may be that Finesse and other tools can one day export into this format too, if the schema is designed to allow this. I'd vote for at least namespacing the GWINC-specific stuff in the container, but keeping the raw noise traces separate.

    ...But with a schema version all of these changes can be worked out over time instead of up front :smiley:
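
    For concreteness, a hypothetical layout along these lines might look like the following (the group and attribute names are illustrative only, not a proposed fixed schema; ifo_yaml and traces are assumed to already exist):

    # Hypothetical layout (illustrative names, not a fixed schema): a schema
    # version attribute, a labelled YAML config dump, GWINC outputs namespaced
    # under /data/noise/, and a subgroup per trace for nearby metadata.
    import h5py

    with h5py.File('budget.h5', 'w') as f:
        f.attrs['schema_version'] = '0.1'       # <1.0 while the schema is in flux
        f.attrs['config_format'] = 'yaml'       # label the IFO dump format
        f.attrs['IFO'] = ifo_yaml               # assumed: YAML dump of the IFO struct
        noise = f.create_group('data/noise')    # namespace for GWINC outputs
        for name, psd in traces.items():        # assumed: dict of trace name -> PSD array
            g = noise.create_group(name)
            g.create_dataset('PSD', data=psd)
            g.attrs['units'] = '1/Hz'           # example per-trace metadata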

    Edited by Sean Leavey
  • @sean-leavey I think you've been working on specifying a common data format, as has @duncanmmacleod. In fact, maybe we should just follow the spec defined by the common data formats group as closely as possible.

    I don't see any need to allow for multiple specifications for the structured data format of the IFO parameters. We should just pick one and fix it in the schema (YAML, JSON, etc.).

  • @jameson.rollins I've not worked on anything concrete, just been saying how nice it'd be to have one. Thanks, I didn't know about the common data format already prepared by the detchar team - that looks useful!

  • I agree the main point for discussion here is the file format.

    Have you considered how this new format might overlap with the existing DTT XML format (defined in T990013, appendix B)?

    Of course, much of what's in there is not strictly relevant for gwinc. But if we are trying to establish a generic noise budget data format, I think we'd be interested in capturing as much metadata about a spectrum or transfer function as DTT generates when measuring it. (And someday DTT v3000 will be able to open noise budget files in this format and update the traces for us...)

  • That's a good point - DTT XML might be a nice ready-made format to adopt, or to fork.

    It sounds like this topic needs more careful thought, and it might disrupt development to think too deeply about it now. Perhaps it's worth opening a discussion with the common data formats group about a plan for this kind of data, but in the meantime we can just use the container format in this merge request.

  • I've been investigating how we could better conform to the common data format.

    T1800014 section 2.1.1 defines an HDF5 format for "series" data with a regular sample step defined by the parameters x0, dx, and xunit, i.e. with no explicit x array. I had a discussion with the group about how to handle logarithmic sequences, and the suggestion is just to specify a "log()" unit.

    So in our case the default frequency vector would be stored as:

    xunit = "log10(Hz)"
    x0 = 0.69897
    dx = 0.00102674

    or:

    xunit = "log(Hz)"
    x0 = 8.69005816
    dx = 0.00236415

    Clients that read the file would have to parse the xunit appropriately (see the sketch at the end of this comment). Does that seem reasonable to people?

    I also suggested that the common data format be extended to handle log arrays with something like a "log" bool flag.

    Thoughts appreciated.
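
    For concreteness, a client reading such a file might reconstruct the frequency vector roughly like this (a sketch: the "log10()"/"log()" unit convention is as described above, and the sample count of 3000 is an assumption about the default frequency vector):

    import numpy as np

    def reconstruct_x(x0, dx, xunit, n):
        # rebuild the sample vector from the series parameters
        x = x0 + dx * np.arange(n)
        if xunit.startswith('log10('):
            return 10.0 ** x
        if xunit.startswith('log('):
            return np.exp(x)
        return x                                # plain linear axis, e.g. "Hz"

    freq = reconstruct_x(0.69897, 0.00102674, 'log10(Hz)', 3000)
    # freq[0] is 5 Hz; with these parameters the vector spans roughly 5 Hz to 6 kHz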

  • Looks good!

    The specification listed in T1800014 looks like a draft, so perhaps we can ask for a proper scale flag to be included in the final version, not just for log-scaled plots but also for linear ones, i.e. an "xscale" field.

  • I don't like the way this common data format spec is going. We should not be obliged to parse a unit string just to decode the frequency vector. And how would we handle irregularly sampled data, e.g. when stitching together a low frequency spectrum and a high frequency spectrum?

    HDF5 supports compression, so having an xarray in the format does not necessarily bloat the file size very much.
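
    For example, an explicit (possibly stitched) frequency array stored with gzip compression might look like this sketch (made-up data, purely to illustrate the point):

    import h5py
    import numpy as np

    # stitched low- and high-frequency spectra -> irregular overall sampling
    freq = np.concatenate([np.logspace(0, 2, 500), np.logspace(2, 4, 2000)])
    psd = np.ones_like(freq)                    # placeholder trace data

    with h5py.File('stitched.h5', 'w') as f:
        f.create_dataset('Freq', data=freq, compression='gzip')
        f.create_dataset('traces/Total', data=psd, compression='gzip')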

  • I agree that decoding the unit string seems less than ideal. And I guess that irregularly sampled data could be encoded in a table?

    I submitted a bug report (detchar/common-data-formats#2) about a scaling flag.

    Maybe we could also suggest to them an extension of the series spec that allows specifying the x array explicitly.

    Or @christopher.wipf, do you think we shouldn't bother spec'ing for interoperability with their effort?

  • I think it's good that you've engaged with them. I hope they end up with a standard that's useful and that we'd want to support. But I'm inclined to agree with Sean's point above: let's not get bogged down waiting for the perfect standard to arrive. So I'll merge this, and we can certainly evolve the format or add others if warranted down the line.

  • :thumbsup:

    I'd still suggest adding a schema version field.

  • I just found xarray, built on pandas, which provides a DataArray class that handles data with labels and metadata. It can read and write netCDF files, which are HDF5-based containers for self-described datasets. Could be useful.
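
    A small sketch of what that might look like (assuming the xarray and netCDF4 packages are available; the trace name and units here are illustrative):

    import numpy as np
    import xarray as xr

    freq = np.logspace(np.log10(5), np.log10(6000), 3000)
    da = xr.DataArray(np.ones_like(freq),       # placeholder noise trace
                      coords={'frequency': freq},
                      dims=['frequency'],
                      name='Quantum',
                      attrs={'units': '1/Hz'})
    da.to_netcdf('budget.nc')                   # write a self-described dataset
    roundtrip = xr.open_dataarray('budget.nc')  # labels and metadata preserved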

    Edited by Sean Leavey