Skip to content

Suggestion: Add support for hyperparameters with multiple dimensions

For example, when we fit a B-Spline model to a population parameter like mass, the coefficients \vec{\alpha} that multiply the spline matrix B are drawn as a vector at each step of the sampler and saved in a dictionary as hypersamples labeled something like 'mass_spline_coefficients'. For a spline with 10 coefficients and 1000 steps in the chain, the resulting hypersamples will have shape (1000,10), while other hyperparameters in the population model, say the standard deviation on a gaussian 'sigma', will only have hypersamples with shape (1000,).

Currently, popsummary expects all hyperparameters to have dimension (N_samples, ) so that the resulting hyperparameter sample dataset will have dimension (N_samples, N_hyperparams). In order for our spline coefficient hypersamples to fit in this framework, each individual spline coefficient \alpha_0, \alpha_1, ..., \alpha_n, will have to be labeled as it's own separate hyperparameter, i.e. hyperparameters = ['mass_spline_coefficient_0', 'mass_spline_coefficient_1', ...]. While we will be able to accommodate this by reformatting our posterior samples, the resulting list of hyperparameters and array of hypersamples taken in by popsummary will be rather long, unwieldy, and unclear for people unfamiliar with our analyses.

A possible solution to this issue would be to create a separate dataset for each hyperparameter within popsummary's hyperparameter_samples group. Each hyperparameter dataset can then store a numpy array for that specific hyperparameter, with dimension (N_samples, *N) where *N is the shape of the hyperparameter, e.g. (n,), (n,m), etc. The code snippet below demonstrates a way to do this (note that it assumes hyperparameter_samples is a list of numpy arrays, is ordered the same way as hyperparameters, and len(hyperparameter_samples) = len(hyperparameters)):

def set_hyperparameter_samples(self, hyperparameter_samples, overwrite=False, group='posterior'):
        """
        save hyperparameter samples to results file
        
        Parameters
        ----------
        hyperparameter_samples: List of N-D numpy arrays
            list of length (NumberOfPopulationDimensions,) of numpy arrays of data with shape (NumberOfHyperSamples, *HyperparameterDimension)
        overwrite: bool
            whether to overwrite existing dataset
        group: str
            group to save samples to ('posterior' or 'prior')
        """
        with h5py.File(self.fname, 'a') as f:
            if 'hyperparameter_samples' in list(f[group].keys()):
                if overwrite:
                    del f[group]['hyperparameter_samples']
                else:
                    raise Exception('dataset already exists, use the `overwrite` argument to overwrite it')
            f[group].create_group('hyperparameter_samples')
            for (i,hp) in enumerate(f.attrs['hyperparameters']):
                f[group]['hyperparameter_samples'].create_dataset(hp, data=hyperparameter_samples[i])

Implementing something like the above would obviously require adjustments to other parts of popsummary and may conflict with some necessary features, which is why I am offering it as a suggestion. If it will be too difficult to implement, we can use the workaround I described above, with the caveat that our popsummary files will be rather opaque to other users, especially for population models comprised of multiple B-Spline models like in the Cover Your Basis paper (Edelman 2022).