Our largest source of bugs right now is deepdish's handling of h5 files. I don't think there is any need to move away from deepdish (which was initially chosen simply because it has an easy-to-use interface), but if we do the following, things will be much easier (a rough conversion sketch is given after the list):

- Make all arrays in the deepdish file named numpy arrays rather than pandas data frames.
- Ensure that everything else in the deepdish result file is either a string, float, list, or dictionary of these things.
During this process we should check that the results files are not dependent on bilby (@charlie.hoy reports this is not the case at the moment).
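For concreteness, here is a rough sketch of the kind of conversion the first item is asking for; the DataFrame below is just a placeholder, not a real bilby result:

```python
import numpy as np
import pandas as pd
import deepdish

# Placeholder posterior standing in for result.posterior
posterior = pd.DataFrame({"mass_1": np.random.uniform(20, 50, 100),
                          "mass_2": np.random.uniform(20, 50, 100)})

# A named (structured/record) numpy array keeps the column names but does
# not require pandas to be unpickled when the file is read back
posterior_array = posterior.to_records(index=False)

deepdish.io.save("example_result.h5", {"posterior": posterior_array})
```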
Using bilby==0.4.0 I generated an h5 file and then, inside a Singularity container, was able to read the file in using h5py.
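Something like the following is enough to confirm the file itself is readable (same path as the example file); h5py just walks the raw groups and datasets, so it never needs bilby:

```python
import h5py

path = "results_GW150914/result/GW150914_L1_emcee_G184098_result.h5"

# h5py only reads raw groups/datasets and never unpickles stored objects,
# so this works even in an environment without bilby installed
with h5py.File(path, "r") as f:
    f.visit(print)  # print the name of every group and dataset in the file
```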
However, when reading it in with deepdish (with bilby not installed), I received the ModuleNotFoundError that @charlie.hoy reported.
For future reference here is the full traceback:
```
In [2]: deepdish.io.load('results_GW150914/result/GW150914_L1_emcee_G184098_result.h5')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/deepdish-0.3.6-py3.6.egg/deepdish/io/hdf5io.py in _load_level(handler, level, pathtable)
    478     try:
--> 479         return pathtable[pathname]
    480     except KeyError:

KeyError: '/data'

During handling of the above exception, another exception occurred:

[the same KeyError chain repeats in _load_level for '/data/meta_data',
 '/data/meta_data/likelihood' and
 '/data/meta_data/likelihood/frequency_domain_source_model']

During handling of the above exception, another exception occurred:

ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-2-54fe4f7d6bde> in <module>()
----> 1 deepdish.io.load('results_GW150914/result/GW150914_L1_emcee_G184098_result.h5')

~/anaconda3/lib/python3.6/site-packages/deepdish-0.3.6-py3.6.egg/deepdish/io/hdf5io.py in load(path, group, sel, unpack)
    654             name = next(iter(grp._v_children))
    655             data = _load_specific_level(h5file, grp, name, sel=sel,
--> 656                                         pathtable=pathtable)
    657             do_unpack = False
    658         elif sel is not None:

~/anaconda3/lib/python3.6/site-packages/deepdish-0.3.6-py3.6.egg/deepdish/io/hdf5io.py in _load_specific_level(handler, grp, path, sel, pathtable)
    317                 return _load_sliced_level(handler, getattr(grp, vv[0]), sel)
    318             else:
--> 319                 return _load_level(handler, getattr(grp, vv[0]), pathtable)
    320         elif hasattr(grp, '_v_attrs') and vv[0] in grp._v_attrs:
    321             if sel is not None:

~/anaconda3/lib/python3.6/site-packages/deepdish-0.3.6-py3.6.egg/deepdish/io/hdf5io.py in _load_level(handler, level, pathtable)
    480     except KeyError:
    481         pathtable[pathname] = _load_nonlink_level(handler, node, pathtable,
--> 482                                                   pathname)
    483     return pathtable[pathname]
    484

~/anaconda3/lib/python3.6/site-packages/deepdish-0.3.6-py3.6.egg/deepdish/io/hdf5io.py in _load_nonlink_level(handler, level, pathtable, pathname)
    368         # Load sub-groups
    369         for grp in level:
--> 370             lev = _load_level(handler, grp, pathtable)
    371             n = grp._v_name
    372             # Check if it's a complicated pair or a string-value pair

[_load_level and _load_nonlink_level recurse through the nested groups a few
 more times before reaching the pickled node]

~/anaconda3/lib/python3.6/site-packages/deepdish-0.3.6-py3.6.egg/deepdish/io/hdf5io.py in _load_nonlink_level(handler, level, pathtable, pathname)
    435     elif isinstance(level, tables.VLArray):
    436         if level.shape == (1,):
--> 437             return _load_pickled(level)
    438         else:
    439             return level[:]

~/anaconda3/lib/python3.6/site-packages/deepdish-0.3.6-py3.6.egg/deepdish/io/hdf5io.py in _load_pickled(level)
    341
    342 def _load_pickled(level):
--> 343     if isinstance(level[0], ForcePickle):
    344         return level[0].obj
    345     else:

~/anaconda3/lib/python3.6/site-packages/tables/vlarray.py in __getitem__(self, key)
    679                 key += self.nrows
    680             (start, stop, step) = self._process_range(key, key + 1, 1)
--> 681             return self.read(start, stop, step)[0]
    682         elif isinstance(key, slice):
    683             start, stop, step = self._process_range(

~/anaconda3/lib/python3.6/site-packages/tables/vlarray.py in read(self, start, stop, step)
    823         atom = self.atom
    824         if not hasattr(atom, 'size'):    # it is a pseudo-atom
--> 825             outlistarr = [atom.fromarray(arr) for arr in listarr]
    826         else:
    827             # Convert the list to the right flavor

~/anaconda3/lib/python3.6/site-packages/tables/vlarray.py in <listcomp>(.0)
    823         atom = self.atom
    824         if not hasattr(atom, 'size'):    # it is a pseudo-atom
--> 825             outlistarr = [atom.fromarray(arr) for arr in listarr]
    826         else:
    827             # Convert the list to the right flavor

~/anaconda3/lib/python3.6/site-packages/tables/atom.py in fromarray(self, array)
   1226         if array.size == 0:
   1227             return None
-> 1228         return six.moves.cPickle.loads(array.tostring())

ModuleNotFoundError: No module named 'bilby'
```
It looks to me like the culprit is the `frequency_domain_source_model`.
Is the issue definitely the DataFrame objects? If the issue is with bilby objects, shouldn't we focus on making everything serialisable, even if that means turning things like the source model into strings?
It would be good to have a super-safe version: the option to print the essentials (posterior, prior, some sampler details) to CSV or similar.
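As a very rough sketch of that super-safe fallback (the file names, posterior values and prior line here are purely illustrative):

```python
import pandas as pd

# Placeholder posterior; in practice this would be the result's posterior
posterior = pd.DataFrame({"mass_1": [36.0, 35.5], "mass_2": [29.1, 30.2]})

# Plain CSV can be read back with nothing more than the standard library
posterior.to_csv("posterior.csv", index=False)

# The prior and a few sampler details could go into a simple text file
with open("essentials.txt", "w") as f:
    f.write("prior: mass_1 = Uniform(minimum=10, maximum=80)\n")
    f.write("sampler: emcee, nwalkers=500\n")
```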
Regarding the frequency_domain_source_model (i.e. the problem of getting ModuleNotFoundError: No module named 'bilby'): this is occurring because the bilby.gw.result.CBCResult object naively passes the frequency_domain_source function in directly. Functions are serialisable, but deepdish does this oh-so-clever thing of importing the defining module at read-in. So we can fix that easily by just converting the function to a string. I'll submit an MR soon.
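A minimal sketch of that fix (the helper name is made up here; the actual MR may do this differently):

```python
def function_to_string(func):
    """Hypothetical helper: store a function by its qualified name so that
    deepdish never needs to import the defining module at read time."""
    if callable(func):
        return "{}.{}".format(func.__module__, func.__name__)
    return func  # already a string (or None)

# e.g. the frequency_domain_source_model entry in the meta data would then be
# saved as a string such as "bilby.gw.source.lal_binary_black_hole"
```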
Regarding the other problems: having investigated a little, I've found that

- you can't store named numpy arrays in an HDF5 file with deepdish, and
- deepdish is intrinsically linked to pandas/tables.

So maybe it is worth exploring using h5py or something similar?
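For reference, a rough sketch of what storing the posterior with h5py directly could look like (one dataset per named column; the values are placeholders):

```python
import h5py
import numpy as np

# Placeholder posterior columns; in practice these would come from the result
posterior = {"mass_1": np.random.uniform(20, 50, 100),
             "mass_2": np.random.uniform(20, 50, 100)}

with h5py.File("example_result.h5", "w") as f:
    grp = f.create_group("posterior")
    for name, values in posterior.items():
        # each column becomes a named dataset, so the column names survive
        grp.create_dataset(name, data=values)
```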
We decided to explore replacing the result hdf5 files with json. We are keeping the old functionality to read old results in from hdf5. Result.save_to_file should now save to json.
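The gist of the json route is just reducing everything to strings, floats, lists and dicts before dumping; a rough sketch (not the actual bilby implementation) is:

```python
import json

import numpy as np
import pandas as pd


def to_serialisable(obj):
    # Reduce numpy/pandas objects to plain python types that json can handle
    if isinstance(obj, pd.DataFrame):
        return {col: obj[col].tolist() for col in obj.columns}
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    return obj


# Placeholder result dictionary
result_dict = {"label": "GW150914",
               "posterior": pd.DataFrame({"mass_1": [36.0, 35.5]})}

with open("GW150914_result.json", "w") as f:
    json.dump({key: to_serialisable(value) for key, value in result_dict.items()}, f)
```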
Gregory Ashton marked the checklist item "Make all arrays in the deepdish file named numpy arrays rather than pandas data frames" as completed.
Gregory Ashton marked the checklist item "Ensure that everything else in the deepdish result file is either a string, float, list or dictionary of these things." as completed.
For the GW150914 example, the json file is 17 MB and the hdf5 file is 15 MB for 13,725 posterior samples. I've also investigated how long loading and saving take:
```
json save time: 0.797025 s
json load time: 3.504098 s
hdf5 save time: 0.419306 s
hdf5 load time: 0.186613 s
dict load time: 0.604379 s
```
The dict load time is the time it takes to load the raw results dictionary from a json file. The reason the json load time is so much longer is that, in addition to loading the dictionary, we have to convert the elements of the PriorDict back to bilby Prior class objects from unicode strings in a loop, and convert the posterior from a dictionary to a pandas.DataFrame for compatibility with the plotting module.
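To make the breakdown concrete, here is a sketch of how the raw dict load and the DataFrame conversion can be timed separately (the prior-string reconstruction is bilby-specific and omitted here; the file name is the placeholder from above):

```python
import json
import time

import pandas as pd

t0 = time.perf_counter()
with open("GW150914_result.json") as f:
    data = json.load(f)            # the raw dict load
t1 = time.perf_counter()

posterior = pd.DataFrame(data["posterior"])  # conversion needed by the plotting module
t2 = time.perf_counter()

print("dict load: {:.3f} s".format(t1 - t0))
print("DataFrame conversion: {:.3f} s".format(t2 - t1))
```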
Awesome work @sylvia.biscoveanu. Is it possible to separate the amount of time taken to convert the posterior to a DataFrame from the time taken to convert the PriorDict?
I ask because it would be quite reasonable to load the prior as a PriorDict only when it is needed (i.e., it just loads as a string and isn't evaluated until required). However, the posterior is probably going to be required every time.
This is encouraging, but for cases where people load hundreds of results it might be worth investigating just loading the dict directly.
Right, so in addition to the dict load time, converting the posterior and the nested_samples to DataFrames takes 0.789895 s total. So the DataFrame conversion adds ~0.1 s, and the rest of the several seconds comes from evaluating the prior strings.
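A sketch of the lazy-prior idea floated above (attribute and helper names are illustrative, not bilby's actual API): keep the prior in its raw string form and only rebuild the PriorDict the first time it is accessed, so that bulk loading of results only pays the cheap dict-load cost.

```python
class LazyResult:
    """Illustrative wrapper: defer the expensive prior reconstruction."""

    def __init__(self, data):
        self._data = data      # raw dict straight from json.load
        self._priors = None    # built on first access

    @property
    def priors(self):
        if self._priors is None:
            # hypothetical helper that turns each prior string back into a
            # Prior object; this is the slow step identified above
            self._priors = build_prior_dict(self._data["priors"])
        return self._priors
```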