# Evidence review

A page for reviewing the evidence calculation produced by bilby.

## Analytical likelihood

To review the evidence calculation using bilby (in particular using the `dynesty` sampler) it is useful to sample a likelihood with a known normalisation. In the review of LALInference this consisted of checking the evidence calculation using two distributions: i) a 15D multivariate Gaussian distribution, and ii) a bi-modal multivariate Gaussian distribution. Equivalent distributions are implemented in bilby in `15d_gaussian.py` and have been tested on this page.
The LALInference review used a specific set of means and standard deviations in each dimension. It is not necessary to emulate these exactly, because they trivially transform into a problem with zero mean and similar standard deviations in each dimension.

A more extensive review has been performed using this repository.
While this review is meant to evaluate the performance of `dynesty`, it was trivial to redo the tests with other nested sampling packages.
Thus, all tests have been performed using `bilby==0.6.5`, `cpnest==0.9.7`, `dynesty==1.0.1`, `Multinest=3.10`, `nestle==0.2.0`, `Polychord==1.15.1`.

### Convergence to analytical evidence

As we increase the number of live points, stochastic and systematic sampling errors should decrease and the mean value of the evidence should converge to the analytically known evidence. We perform 100 runs each with 32, 64, 128, ..., 4096 live points on the analytical likelihood, in both the unimodal and the bimodal case. All other settings are the defaults specified in `bilby`. `bilby` mostly tries to emulate the defaults the samplers set themselves. The prior is taken to be uniform from -20 to 20 in each dimension.
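The analytical value the runs are compared against can be sketched under two assumptions that are not spelled out above: the likelihood is normalised to integrate to one, and essentially all of its mass lies inside the prior box (plausible here, since the widths implied by the review covariance matrix are far smaller than the prior range). The evidence then reduces to the uniform prior density. The function name `analytic_log_evidence` is illustrative and not part of bilby.

```python
import math


def analytic_log_evidence(ndim=15, prior_min=-20.0, prior_max=20.0):
    """Log evidence of a unit-normalised likelihood under a uniform box prior.

    Assumes the likelihood mass outside the prior box is negligible, so the
    integral of likelihood * prior is just the constant prior density.
    """
    prior_width = prior_max - prior_min
    # Uniform prior density is (1 / width) ** ndim in ndim dimensions.
    return -ndim * math.log(prior_width)


log_z_true = analytic_log_evidence()
print(f"ln Z = {log_z_true:.4f}")  # ln Z = -55.3332
```

If the bimodal likelihood is built as a normalised mixture of two Gaussians, the same value would apply; that construction is an assumption here, not something this page states.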

The results can be seen below. The displayed error bars are the standard deviation of the 100 measured log evidences. If one is interested in the uncertainty on the *mean* of the log evidence after 100 runs, these error bars need to be divided by `\sqrt{100} = 10`. While `dynesty`, `nestle`, and `polychord` converge to the analytical value, `cpnest`, `dynamic_dynesty`, and `pymultinest` are systematically biased even for high numbers of live points. We see this result as generally encouraging for `dynesty`. We also note that it is generally more difficult for samplers to recover the bimodal case, which is why it takes more live points to converge to the same level.
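The distinction between the run-to-run scatter and the uncertainty on the mean can be illustrated with a minimal sketch. The `log_evidences` list below is synthetic stand-in data, not output from the review runs.

```python
import math
import random
import statistics

random.seed(0)

# Hypothetical stand-in for the 100 log evidences measured at one nlive value.
log_evidences = [random.gauss(-55.3, 0.2) for _ in range(100)]

# Run-to-run scatter: the quantity described above as the displayed error bar.
scatter = statistics.stdev(log_evidences)

# Uncertainty on the mean: scatter divided by sqrt(number of runs) = sqrt(100).
standard_error = scatter / math.sqrt(len(log_evidences))

print(f"mean ln Z       = {statistics.mean(log_evidences):.3f}")
print(f"scatter (std)   = {scatter:.3f}")
print(f"error on mean   = {standard_error:.3f}")
```

A systematic bias shows up when the mean differs from the analytical value by many times the error on the mean, even if it sits well inside the scatter.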

### Empirical evidence uncertainties vs. K-L-divergence uncertainties

The sampling packages quote evidence uncertainties by calculating a K-L divergence. We want to test whether this quoted uncertainty is truly Gaussian, i.e. whether the true evidence is covered by the 1(2)-sigma interval ~68(95)% of the time, etc.
We can display these results most conveniently by creating percentile-percentile style plots. Specifically, we look at the results of `dynesty` and also `pypolychord` as a reference.
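A minimal sketch of such a coverage test is given below. It is not the code used for the review. For each run it computes the smallest two-sided Gaussian confidence level at which the quoted interval contains the true value; if the quoted errors are truly Gaussian, these levels are uniformly distributed and the resulting curve follows the diagonal. The function names and the synthetic data are illustrative.

```python
import math
import random


def coverage_levels(log_z_estimates, log_z_errors, log_z_true):
    """Sorted confidence levels at which each run just covers the true value."""
    levels = []
    for estimate, error in zip(log_z_estimates, log_z_errors):
        n_sigma = abs(estimate - log_z_true) / error
        # Two-sided Gaussian probability contained within +/- n_sigma.
        levels.append(math.erf(n_sigma / math.sqrt(2)))
    return sorted(levels)


def pp_curve(levels):
    """Pairs of (confidence level, fraction of runs covered at that level)."""
    n_runs = len(levels)
    return [(level, (i + 1) / n_runs) for i, level in enumerate(levels)]


random.seed(1)
log_z_true = -55.33
quoted_error = 0.1  # hypothetical quoted uncertainty, identical for all runs

# Synthetic runs whose scatter matches the quoted error by construction.
estimates = [random.gauss(log_z_true, quoted_error) for _ in range(1000)]
errors = [quoted_error] * len(estimates)

levels = coverage_levels(estimates, errors, log_z_true)
one_sigma = math.erf(1 / math.sqrt(2))  # about 0.68
frac_1sigma = sum(level <= one_sigma for level in levels) / len(levels)
print(f"fraction covered at 1 sigma: {frac_1sigma:.3f}")
```

A curve below the diagonal means the true value is covered less often than claimed, which can come from either a biased estimate or underestimated errors; a curve above the diagonal means the quoted errors are conservative.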

With `dynesty` we see that the evidences for low numbers of live points are significantly biased. This is expected, because the evidence estimates are systematically biased in this regime and the true value is therefore rarely covered by the uncertainty interval. As we increase the number of live points to 1024, we see that the curves generally stay inside the grey 95% confidence band. Interestingly, the curves tend to lie a bit below the band for low values of the confidence interval (CI), and overshoot for high values of the CI. This indicates that the errors quoted by `dynesty` are not truly Gaussian. Instead, they tail off faster than a Gaussian distribution, which means that there are very rarely any outliers past twice the quoted error. Because of that, `dynesty` at and above 1000 live points can in general be used confidently to report log evidences and the associated errors.

`pypolychord` in the unimodal case performs extremely well even for a low number of live points. The uncertainties generally appear to be truly Gaussian. In the bimodal case a higher number of live points is required to calculate accurate log evidence errors. We find again that 1000 or more live points are required for errors to be accurate.

## LALInference comparison

Compare the evidence produced by the standard bilby run on GW150914 with the LALInference evidence. Run for individual detectors and a coherent multi-detector analysis.

## Supplemental

Relevant `dynesty` sampler settings other than `nlive` for the analytical likelihood:

`{'bound': 'multi', 'sample': 'rwalk', 'verbose': True, 'periodic': [], 'reflective': [], 'check_point_delta_t': 600, 'nlive': 2048, 'first_update': None, 'walks': 100, 'npdim': None, 'rstate': None, 'queue_size': None, 'pool': None, 'use_pool': None, 'logl_args': None, 'logl_kwargs': None, 'ptform_args': None, 'ptform_kwargs': None, 'enlarge': 1.5, 'bootstrap': None, 'vol_dec': 0.5, 'vol_check': 8.0, 'facc': 0.2, 'slices': 5, 'update_interval': 1228, 'dlogz': 0.1, 'maxiter': None, 'maxcall': None, 'logl_max': inf, 'add_live': True, 'print_progress': True, 'save_bounds': False, 'n_effective': None, 'maxmcmc': 5000, 'nact': 5}`

Review matrix

```
review_covariance_matrix = [
[0.045991865933182365, -0.005489748382557155, -0.01025067223674548, 0.0020087713726603213, -0.0032648855847982987,
-0.0034218261781145264, -0.0037173401838545774, -0.007694897715679858, 0.005260905282822458, 0.0013607957548231718,
0.001970785895702776, 0.006708452591621081, -0.005107684668720825, 0.004402554308030673, -0.00334987648531921],
[-0.005489748382557152, 0.05478640427684032, -0.004786202916836846, -0.007930397407501268, -0.0005945107515129139,
0.004858466255616657, -0.011667819871670204, 0.003169780190169035, 0.006761345004654851, -0.0037599761532668475,
0.005571796842520162, -0.0071098291510566895, -0.004477773540640284, -0.011250694688474672, 0.007465228985669282],
[-0.01025067223674548, -0.004786202916836844, 0.044324704403674524, -0.0010572820723801645, -0.009885693540838514,
-0.0048321205972943464, -0.004576186966267275, 0.0025107211483955676, -0.010126911913571181, 0.01153595152487264,
0.005773054728678472, 0.005286558230422045, -0.0055438798694137734, 0.0044772210361854765, -0.00620416958073918],
[0.0020087713726603213, -0.007930397407501268, -0.0010572820723801636, 0.029861342087731065, -0.007803477134405363,
-0.0011466944120756021, 0.009925736654597632, -0.0007664415942207051, -0.0057593957402320385,
-0.00027248233573270216, 0.003885350572544307, 0.00022362281488693097, 0.006609741178005571, -0.003292722856107429,
-0.005873218251875897],
[-0.0032648855847982987, -0.0005945107515129156, -0.009885693540838514, -0.007803477134405362, 0.0538403407841302,
-0.007446654755103316, -0.0025216534232170153, 0.004499568241334517, 0.009591034277125177, 0.00008612746932654223,
0.003386296829505543, -0.002600737873367083, 0.000621148057330571, -0.006603857049454523, -0.009241221870695395],
[-0.0034218261781145264, 0.004858466255616657, -0.004832120597294347, -0.0011466944120756015, -0.007446654755103318,
0.043746559133865104, 0.008962713024625965, -0.011099652042761613, -0.0006620240117921668, -0.0012591530037708058,
-0.006899982952117269, 0.0019732354732442878, -0.002445676747004324, -0.006454778807421816, 0.0033303577606412765],
[-0.00371734018385458, -0.011667819871670206, -0.004576186966267273, 0.009925736654597632, -0.0025216534232170153,
0.008962713024625965, 0.03664582756831382, -0.009470328827284009, -0.006213741694945105, 0.007118775954484294,
-0.0006741237990418526, -0.006003374957986355, 0.005718636997353189, -0.0005191095254772077,
-0.008466566781233205],
[-0.007694897715679857, 0.0031697801901690347, 0.002510721148395566, -0.0007664415942207059, 0.004499568241334515,
-0.011099652042761617, -0.009470328827284016, 0.057734267068088, 0.005521731225009532, -0.017008048805405164,
0.006749693090695894, -0.006348460110898, -0.007879244727681924, -0.005321753837620446, 0.011126783289057604],
[0.005260905282822458, 0.0067613450046548505, -0.010126911913571181, -0.00575939574023204, 0.009591034277125177,
-0.0006620240117921668, -0.006213741694945106, 0.005521731225009532, 0.04610670018969681, -0.010427010812879566,
-0.0009861561285861987, -0.008896020395949732, -0.0037627528719902485, 0.00033704453138913093,
-0.003173552163182467],
[0.0013607957548231744, -0.0037599761532668475, 0.01153595152487264, -0.0002724823357326985, 0.0000861274693265406,
-0.0012591530037708062, 0.007118775954484294, -0.01700804880540517, -0.010427010812879568, 0.05909125052583998,
0.002192545816395299, -0.002057672237277737, -0.004801518314458135, -0.014065326026662672, -0.005619012077114913],
[0.0019707858957027763, 0.005571796842520162, 0.005773054728678472, 0.003885350572544309, 0.003386296829505542,
-0.006899982952117272, -0.0006741237990418522, 0.006749693090695893, -0.0009861561285862005, 0.0021925458163952988,
0.024417715762416557, -0.003037163447600162, -0.011173674374382736, -0.0008193127407211239, -0.007137012700864866],
[0.006708452591621083, -0.0071098291510566895, 0.005286558230422046, 0.00022362281488693216, -0.0026007378733670806,
0.0019732354732442886, -0.006003374957986352, -0.006348460110897999, -0.008896020395949732, -0.002057672237277737,
-0.003037163447600163, 0.04762367868805726, 0.0008818947598625008, -0.0007262691465810616, -0.006482422704208912],
[-0.005107684668720825, -0.0044777735406402895, -0.005543879869413772, 0.006609741178005571, 0.0006211480573305693,
-0.002445676747004324, 0.0057186369973531905, -0.00787924472768192, -0.003762752871990247, -0.004801518314458137,
-0.011173674374382736, 0.0008818947598624995, 0.042639958466440225, 0.0010194948614718209, 0.0033872675386130637],
[0.004402554308030674, -0.011250694688474675, 0.004477221036185477, -0.003292722856107429, -0.006603857049454523,
-0.006454778807421815, -0.0005191095254772072, -0.005321753837620446, 0.0003370445313891318, -0.014065326026662679,
-0.0008193127407211239, -0.0007262691465810616, 0.0010194948614718226, 0.05244900188599414, -0.000256550861960499],
[-0.00334987648531921, 0.007465228985669282, -0.006204169580739178, -0.005873218251875899, -0.009241221870695395,
0.003330357760641278, -0.008466566781233205, 0.011126783289057604, -0.0031735521631824654, -0.005619012077114915,
-0.007137012700864866, -0.006482422704208912, 0.0033872675386130632, -0.000256550861960499, 0.05380987317762257]]
```
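Before using a matrix like this as a Gaussian covariance it is worth confirming that it is symmetric and positive definite. The sketch below shows one way to do that in pure Python. It is demonstrated on the top-left 2x2 block of the matrix above to stay short, and the helper names are illustrative. Note that the off-diagonal entries in the listing agree only to floating-point precision, so the symmetry check uses a tolerance.

```python
import math


def is_symmetric(matrix, tol=1e-12):
    """True if matrix[i][j] and matrix[j][i] agree within tol for all i, j."""
    n = len(matrix)
    return all(
        abs(matrix[i][j] - matrix[j][i]) <= tol
        for i in range(n)
        for j in range(i)
    )


def cholesky(matrix):
    """Lower-triangular Cholesky factor; raises if not positive definite."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= 0.0:
                    raise ValueError("matrix is not positive definite")
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


# Top-left 2x2 block of review_covariance_matrix, copied from the listing.
block = [
    [0.045991865933182365, -0.005489748382557155],
    [-0.005489748382557152, 0.05478640427684032],
]

print("symmetric:", is_symmetric(block))
factor = cholesky(block)  # succeeds only for a positive-definite matrix
print("positive definite: True")
```

Passing the full 15x15 `review_covariance_matrix` to the same two functions performs the check on the whole matrix.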

## Review statement

### Date 2020-03-09

Reviewers: @matthew-pitkin, @simon-stevenson

To review the evidence (marginal likelihood) estimates produced by the nested sampling algorithms implemented in bilby, we requested checks on their ability to reproduce the solutions to a pair of analytically known 15D integrals (a unimodal multivariate Gaussian distribution, and a pair of multivariate Gaussian distributions). These tests are consistent with those that were performed on the LALInference-based nested sampling codes.

Based on the tests above we sign off the evidence evaluation as reviewed with the following criteria:

- The evidences for the `dynesty`, `nestle` and `pypolychord` samplers appear valid when using more than 1000 live points. In these cases any small systematic bias on the evidence is well within the statistical variation of the evidence. We therefore recommend that evidences from these samplers should only be quoted if using more than 1000 live points.

- For these samplers, and using greater than 1000 live points, the evidence uncertainties output by bilby should be reliable and provide conservative bounds (i.e., they may be slight overestimates of the true uncertainty).

- The evidences for other samplers can suffer significant systematic biases across a broad range of numbers of live points. If evidences for these samplers are required then the tests in this review would need to be reproduced to show specific settings that can reduce the bias.

At the time of writing, consistency tests between bilby and LALInference for the evidence produced for a real gravitational-wave signal are underway. However, the tests performed so far are equivalent to those used to evaluate LALInference and are deemed sufficient.