Our largest source of bugs right now is deepdish's handling of h5 files. I don't think there is any need to move away from deepdish (which was initially chosen simply because it has an easy-to-use interface), but if we do the following, things will be much easier (a rough conversion sketch is given after the list):

- Make all arrays in the deepdish file named numpy arrays rather than pandas data frames.
- Ensure that everything else in the deepdish result file is either a string, float, list, or dictionary of these things.
During this process we should check that the results files are not dependent on bilby (@charlie.hoy reports this is not the case at the moment).
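For concreteness, here is a rough sketch of the kind of conversion the first item is asking for; the DataFrame below is just a placeholder, not a real bilby result:

```python
import numpy as np
import pandas as pd
import deepdish

# Placeholder posterior standing in for result.posterior
posterior = pd.DataFrame({"mass_1": np.random.uniform(20, 50, 100),
                          "mass_2": np.random.uniform(20, 50, 100)})

# A named (structured/record) numpy array keeps the column names but does
# not require pandas to be unpickled when the file is read back
posterior_array = posterior.to_records(index=False)

deepdish.io.save("example_result.h5", {"posterior": posterior_array})
```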
Using bilby==0.4.0 I generated an h5 file and then, inside a Singularity container, was able to read the file in using h5py.
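Something like the following is enough to confirm the file itself is readable (same path as the example file); h5py just walks the raw groups and datasets, so it never needs bilby:

```python
import h5py

path = "results_GW150914/result/GW150914_L1_emcee_G184098_result.h5"

# h5py only reads raw groups/datasets and never unpickles stored objects,
# so this works even in an environment without bilby installed
with h5py.File(path, "r") as f:
    f.visit(print)  # print the name of every group and dataset in the file
```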
However, when reading it in with deepdish (with bilby not installed), I received the ModuleNotFoundError that @charlie.hoy reported.
For future reference here is the full traceback:
```
In [2]: deepdish.io.load('results_GW150914/result/GW150914_L1_emcee_G184098_result.h5')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/deepdish-0.3.6-py3.6.egg/deepdish/io/hdf5io.py in _load_level(handler, level, pathtable)
    478     try:
--> 479         return pathtable[pathname]
    480     except KeyError:

KeyError: '/data'

During handling of the above exception, another exception occurred:

[the same KeyError chain repeats in _load_level for '/data/meta_data',
 '/data/meta_data/likelihood' and
 '/data/meta_data/likelihood/frequency_domain_source_model']

During handling of the above exception, another exception occurred:

ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-2-54fe4f7d6bde> in <module>()
----> 1 deepdish.io.load('results_GW150914/result/GW150914_L1_emcee_G184098_result.h5')

~/anaconda3/lib/python3.6/site-packages/deepdish-0.3.6-py3.6.egg/deepdish/io/hdf5io.py in load(path, group, sel, unpack)
    654             name = next(iter(grp._v_children))
    655             data = _load_specific_level(h5file, grp, name, sel=sel,
--> 656                                         pathtable=pathtable)
    657             do_unpack = False
    658         elif sel is not None:

~/anaconda3/lib/python3.6/site-packages/deepdish-0.3.6-py3.6.egg/deepdish/io/hdf5io.py in _load_specific_level(handler, grp, path, sel, pathtable)
    317                 return _load_sliced_level(handler, getattr(grp, vv[0]), sel)
    318             else:
--> 319                 return _load_level(handler, getattr(grp, vv[0]), pathtable)
    320         elif hasattr(grp, '_v_attrs') and vv[0] in grp._v_attrs:
    321             if sel is not None:

~/anaconda3/lib/python3.6/site-packages/deepdish-0.3.6-py3.6.egg/deepdish/io/hdf5io.py in _load_level(handler, level, pathtable)
    480     except KeyError:
    481         pathtable[pathname] = _load_nonlink_level(handler, node, pathtable,
--> 482                                                   pathname)
    483     return pathtable[pathname]
    484

~/anaconda3/lib/python3.6/site-packages/deepdish-0.3.6-py3.6.egg/deepdish/io/hdf5io.py in _load_nonlink_level(handler, level, pathtable, pathname)
    368         # Load sub-groups
    369         for grp in level:
--> 370             lev = _load_level(handler, grp, pathtable)
    371             n = grp._v_name
    372             # Check if it's a complicated pair or a string-value pair

[_load_level and _load_nonlink_level recurse through the nested groups a few
 more times before reaching the pickled node]

~/anaconda3/lib/python3.6/site-packages/deepdish-0.3.6-py3.6.egg/deepdish/io/hdf5io.py in _load_nonlink_level(handler, level, pathtable, pathname)
    435     elif isinstance(level, tables.VLArray):
    436         if level.shape == (1,):
--> 437             return _load_pickled(level)
    438         else:
    439             return level[:]

~/anaconda3/lib/python3.6/site-packages/deepdish-0.3.6-py3.6.egg/deepdish/io/hdf5io.py in _load_pickled(level)
    341
    342 def _load_pickled(level):
--> 343     if isinstance(level[0], ForcePickle):
    344         return level[0].obj
    345     else:

~/anaconda3/lib/python3.6/site-packages/tables/vlarray.py in __getitem__(self, key)
    679                 key += self.nrows
    680             (start, stop, step) = self._process_range(key, key + 1, 1)
--> 681             return self.read(start, stop, step)[0]
    682         elif isinstance(key, slice):
    683             start, stop, step = self._process_range(

~/anaconda3/lib/python3.6/site-packages/tables/vlarray.py in read(self, start, stop, step)
    823         atom = self.atom
    824         if not hasattr(atom, 'size'):    # it is a pseudo-atom
--> 825             outlistarr = [atom.fromarray(arr) for arr in listarr]
    826         else:
    827             # Convert the list to the right flavor

~/anaconda3/lib/python3.6/site-packages/tables/vlarray.py in <listcomp>(.0)
    823         atom = self.atom
    824         if not hasattr(atom, 'size'):    # it is a pseudo-atom
--> 825             outlistarr = [atom.fromarray(arr) for arr in listarr]
    826         else:
    827             # Convert the list to the right flavor

~/anaconda3/lib/python3.6/site-packages/tables/atom.py in fromarray(self, array)
   1226         if array.size == 0:
   1227             return None
-> 1228         return six.moves.cPickle.loads(array.tostring())

ModuleNotFoundError: No module named 'bilby'
```
It looks to me like the culprit is the `frequency_domain_source_model`.
Is the issue definitely the DataFrame objects? If the issue is with bilby objects, shouldn't we focus on making everything serialisable, even if that means turning things like the source model into strings?
It would be good to have a super-safe version: the option to print the essentials (posterior, prior, some sampler details) to CSV or similar.
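As a very rough sketch of that super-safe fallback (the file names, posterior values and prior line here are purely illustrative):

```python
import pandas as pd

# Placeholder posterior; in practice this would be the result's posterior
posterior = pd.DataFrame({"mass_1": [36.0, 35.5], "mass_2": [29.1, 30.2]})

# Plain CSV can be read back with nothing more than the standard library
posterior.to_csv("posterior.csv", index=False)

# The prior and a few sampler details could go into a simple text file
with open("essentials.txt", "w") as f:
    f.write("prior: mass_1 = Uniform(minimum=10, maximum=80)\n")
    f.write("sampler: emcee, nwalkers=500\n")
```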
Regarding the frequency_domain_source_model (i.e. the problem of getting ModuleNotFoundError: No module named 'bilby'): this is occurring because the bilby.gw.result.CBCResult object naively passes the frequency_domain_source function in directly. Functions are serialisable, but deepdish does this oh-so-clever thing of importing the defining module at read-in. So we can fix that easily by just converting the function to a string. I'll submit an MR soon.
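A minimal sketch of that fix (the helper name is made up here; the actual MR may do this differently):

```python
def function_to_string(func):
    """Hypothetical helper: store a function by its qualified name so that
    deepdish never needs to import the defining module at read time."""
    if callable(func):
        return "{}.{}".format(func.__module__, func.__name__)
    return func  # already a string (or None)

# e.g. the frequency_domain_source_model entry in the meta data would then be
# saved as a string such as "bilby.gw.source.lal_binary_black_hole"
```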
Regarding the other problems: having investigated a little, I've found that

- you can't store named numpy arrays in an HDF5 file with deepdish, and
- deepdish is intrinsically linked to pandas/tables.

So maybe it is worth exploring using h5py or something similar?
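For reference, a rough sketch of what storing the posterior with h5py directly could look like (one dataset per named column; the values are placeholders):

```python
import h5py
import numpy as np

# Placeholder posterior columns; in practice these would come from the result
posterior = {"mass_1": np.random.uniform(20, 50, 100),
             "mass_2": np.random.uniform(20, 50, 100)}

with h5py.File("example_result.h5", "w") as f:
    grp = f.create_group("posterior")
    for name, values in posterior.items():
        # each column becomes a named dataset, so the column names survive
        grp.create_dataset(name, data=values)
```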
We decided to explore replacing the result hdf5 files with json. We are keeping the old functionality to read old results in from hdf5. Result.save_to_file should now save to json.
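The gist of the json route is just reducing everything to strings, floats, lists and dicts before dumping; a rough sketch (not the actual bilby implementation) is:

```python
import json

import numpy as np
import pandas as pd


def to_serialisable(obj):
    # Reduce numpy/pandas objects to plain python types that json can handle
    if isinstance(obj, pd.DataFrame):
        return {col: obj[col].tolist() for col in obj.columns}
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    return obj


# Placeholder result dictionary
result_dict = {"label": "GW150914",
               "posterior": pd.DataFrame({"mass_1": [36.0, 35.5]})}

with open("GW150914_result.json", "w") as f:
    json.dump({key: to_serialisable(value) for key, value in result_dict.items()}, f)
```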
Gregory Ashton marked the checklist item "Make all arrays in the deepdish file named numpy arrays rather than pandas data frames" as completed.
Gregory Ashton marked the checklist item "Ensure that everything else in the deepdish result file is either a string, float, list or dictionary of these things." as completed.
For the GW150914 example, the json file is 17 MB and the hdf5 file is 15 MB for 13,725 posterior samples. I've also investigated how long loading and saving take:
```
json save time: 0.797025 s
json load time: 3.504098 s
hdf5 save time: 0.419306 s
hdf5 load time: 0.186613 s
dict load time: 0.604379 s
```
The dict load time is the time it takes to load the raw results dictionary from a json file. The reason the json load time is so much longer is that, in addition to loading the dictionary, we have to convert the elements of the PriorDict back to bilby Prior class objects from unicode strings in a loop, and convert the posterior from a dictionary to a pandas.DataFrame for compatibility with the plotting module.
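To make the breakdown concrete, here is a sketch of how the raw dict load and the DataFrame conversion can be timed separately (the prior-string reconstruction is bilby-specific and omitted here; the file name is the placeholder from above):

```python
import json
import time

import pandas as pd

t0 = time.perf_counter()
with open("GW150914_result.json") as f:
    data = json.load(f)            # the raw dict load
t1 = time.perf_counter()

posterior = pd.DataFrame(data["posterior"])  # conversion needed by the plotting module
t2 = time.perf_counter()

print("dict load: {:.3f} s".format(t1 - t0))
print("DataFrame conversion: {:.3f} s".format(t2 - t1))
```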
Awesome work @sylvia.biscoveanu. Is it possible to separate the amount of time taken to convert the posterior to a DataFrame from the time taken to convert the PriorDict?
I ask because it would be quite reasonable to load the prior as a PriorDict only when it is needed (i.e., it just loads as a string and isn't evaluated until required). However, the posterior is probably going to be required every time.
This is encouraging, but for cases where people load hundreds of results it might be worth investigating just loading the dict directly.
Right, so in addition to the dict load time, converting the posterior and the nested_samples to DataFrames takes 0.789895 s total. So the DataFrame conversion adds ~0.1 s, and the rest of the several seconds comes from evaluating the prior strings.
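A sketch of the lazy-prior idea floated above (attribute and helper names are illustrative, not bilby's actual API): keep the prior in its raw string form and only rebuild the PriorDict the first time it is accessed, so that bulk loading of results only pays the cheap dict-load cost.

```python
class LazyResult:
    """Illustrative wrapper: defer the expensive prior reconstruction."""

    def __init__(self, data):
        self._data = data      # raw dict straight from json.load
        self._priors = None    # built on first access

    @property
    def priors(self):
        if self._priors is None:
            # hypothetical helper that turns each prior string back into a
            # Prior object; this is the slow step identified above
            self._priors = build_prior_dict(self._data["priors"])
        return self._priors
```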