A page for reviewing the evidence calculation produced by bilby.
|
|
|
|
|
## Analytical likelihood
|
|
|
|
|
|
|
|
|
To review the evidence calculation using bilby (in particular using the `dynesty` sampler) it is useful to sample a likelihood with a known normalisation. In the review of LALInference this consisted of checking the evidence calculation using [two distributions](https://www.lsc-group.phys.uwm.edu/ligovirgo/cbcnote/LALInferenceReviewAnalyticGaussianLikelihood): i) a 15D multivariate Gaussian distribution, and ii) a bi-modal multivariate Gaussian distribution. Equivalent distributions are implemented in bilby in [`15d_gaussian.py`](https://git.ligo.org/lscsoft/bilby/blob/master/examples/core_examples/15d_gaussian.py) and have been tested on [this page](https://git.ligo.org/lscsoft/bilby_pipe/-/wikis/O3-review/15D_Gaussian).
|
|
|
The LALInference review used a specific set of means and standard deviations in each dimension. It is not necessary to emulate these exactly, because they trivially transform into an equivalent problem with zero mean and similar standard deviations in each dimension.
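To make the target concrete, the analytical log evidence for such a product-of-independent-Gaussians likelihood under a uniform prior can be written down in closed form. The sketch below (plain Python; the means and standard deviations are illustrative, not the exact review settings) also shows why shifting the means hardly matters: as long as each Gaussian sits well inside the prior, the evidence is essentially `-N ln(prior width)`.

```python
import math

def analytic_log_evidence(means, sigmas, prior_bounds=(-20.0, 20.0)):
    """Analytic log evidence for a likelihood that is a product of
    normalised 1D Gaussians, integrated against a uniform prior on
    each dimension."""
    a, b = prior_bounds
    log_z = 0.0
    for mu, sigma in zip(means, sigmas):
        # Integral of the normalised Gaussian over the prior support
        mass = 0.5 * (math.erf((b - mu) / (sigma * math.sqrt(2)))
                      - math.erf((a - mu) / (sigma * math.sqrt(2))))
        # Uniform prior density is 1 / (b - a) in each dimension
        log_z += math.log(mass) - math.log(b - a)
    return log_z

# With sigma << prior width the Gaussians are fully contained in the
# prior, so ln Z ~= -15 * ln 40 ~= -55.33; shifting the means within
# the prior leaves the result essentially unchanged.
print(analytic_log_evidence([0.0] * 15, [1.0] * 15))
print(analytic_log_evidence([3.0] * 15, [1.0] * 15))
```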
|
|
|
|
|
|
|
|
|
A more extensive review has been performed using [this repository](https://git.ligo.org/moritz.huebner/evidence_review).
|
|
|
While this review is primarily meant to evaluate the performance of `dynesty`, it was straightforward to repeat the tests with other nested-sampling packages.
|
|
|
|
|
|
All tests have been performed using `bilby==0.6.5`, `cpnest==0.9.7`, `dynesty==1.0.1`, `MultiNest==3.10`, `nestle==0.2.0`, and `PolyChord==1.15.1`.
|
|
|
|
|
|
### Convergence to analytical evidence
|
|
|
|
|
|
As we increase the number of live points, stochastic and systematic sampling errors should decrease and the mean value of the evidence should converge to the analytically known evidence. We perform 100 runs with 32, 64, 128, ..., 4096 live points with the analytical likelihood, in both the unimodal and the bimodal case. All other settings are the `bilby` defaults, which mostly emulate the defaults the samplers set themselves. The prior is taken to be uniform from -20 to 20 in each dimension.
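As a rough guide to how fast the stochastic errorbars should shrink, the standard nested-sampling error estimate (due to Skilling) gives Var(ln Z) ≈ H / n_live, where H is the information gain from prior to posterior. The back-of-the-envelope sketch below assumes unit standard deviations in each dimension, which is illustrative rather than the exact review setting:

```python
import math

# Skilling's estimate: Var(ln Z) ~= H / n_live, with H the KL
# divergence from the U(-20, 20) prior to the Gaussian posterior.
# Per dimension: H = ln(prior width) - differential entropy of a
# unit Gaussian = ln(40) - 0.5 * ln(2 * pi * e).
n_dim = 15
prior_width = 40.0
h_per_dim = math.log(prior_width) - 0.5 * math.log(2 * math.pi * math.e)
h_total = n_dim * h_per_dim  # ~34 nats for these illustrative settings

for n_live in [32, 64, 128, 256, 512, 1024, 2048, 4096]:
    expected_std = math.sqrt(h_total / n_live)
    print(f"n_live={n_live:5d}  expected std(ln Z) ~ {expected_std:.3f}")
```

Doubling the number of live points should therefore shrink the stochastic errorbar by roughly a factor of sqrt(2), which is the trend the plots below can be checked against.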
|
|
|
|
|
|
|
|
|
The results can be seen below. The displayed errorbars are taken to be the standard deviation of the 100 measured log evidences. If one is interested in the uncertainty on the *mean* of the log evidence after 100 runs, these errorbars need to be divided by `sqrt(100) = 10`. While `dynesty`, `nestle`, and `polychord` converge to the analytical value, `cpnest`, `dynamic_dynesty`, and `pymultinest` are significantly biased. We see this result as generally encouraging for `dynesty`. We also note that the bimodal case is generally more difficult for samplers to recover, which is why it takes more live points to converge to the same level.
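The division by `sqrt(100)` mentioned above is just the usual standard error of the mean. A minimal illustration with the standard library (the log-evidence values are made up for demonstration):

```python
import math
import statistics

# Hypothetical spread of 100 measured log evidences (illustrative numbers)
log_evidences = [-55.3 + 0.01 * ((i % 20) - 9.5) for i in range(100)]

spread = statistics.stdev(log_evidences)      # the errorbar shown in the plots
sem = spread / math.sqrt(len(log_evidences))  # uncertainty on the mean value
print(f"std = {spread:.4f}, standard error of the mean = {sem:.4f}")
```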
|
|
|
|
|
|
![all_samplers_review_unimodal_summary](uploads/3904f7c48207f4abab73505ad4ef6fbc/all_samplers_review_unimodal_summary.png)
|
|
|
![all_samplers_review_bimodal_summary](uploads/dfb698950ab60945b6c869a8544d8096/all_samplers_review_bimodal_summary.png)
|
|
|
### Empirical evidence uncertainties vs. K-L divergence uncertainties
|
|
|
|
|
|
The sampling packages quote evidence uncertainties calculated from a K-L divergence. We want to test whether this quoted uncertainty is truly Gaussian, i.e. whether the true evidence is covered by the 1(2)-sigma interval ~68(95)% of the time, and so on.
|
|
|
|
|
|
We can display these results most conveniently with percentile-percentile style plots. Specifically, we look at the results of `dynesty`, with `PolyChord` as a reference.
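One way such a P-P curve can be built from the quoted uncertainties (a sketch of the general technique, not the review's actual code) is to convert each run's deviation from the true log evidence into a two-sided p-value under its quoted Gaussian errorbar, then compare the empirical coverage at each credible level to the nominal level:

```python
from statistics import NormalDist

def pp_curve(estimates, quoted_sigmas, true_log_z, credible_levels):
    """Fraction of runs whose quoted Gaussian errorbar covers the true
    value at each credible level.  If the quoted uncertainties really
    are Gaussian, the curve lies on the diagonal."""
    norm = NormalDist()
    # Two-sided p-value of the true value under each run's quoted Gaussian
    pvals = [2 * norm.cdf(-abs(est - true_log_z) / sig)
             for est, sig in zip(estimates, quoted_sigmas)]
    # Covered at level L  <=>  two-sided p-value >= 1 - L
    return [sum(p >= 1 - level for p in pvals) / len(pvals)
            for level in credible_levels]

# Perfectly calibrated synthetic runs: deviations taken deterministically
# from the inverse Gaussian CDF, so the curve should sit on the diagonal.
true_log_z = -55.33
grid = [(i + 0.5) / 100 for i in range(100)]
estimates = [true_log_z + 0.5 * NormalDist().inv_cdf(u) for u in grid]
levels = [0.1 * k for k in range(1, 10)]
coverage = pp_curve(estimates, [0.5] * 100, true_log_z, levels)
print([round(c, 2) for c in coverage])  # -> [0.1, 0.2, ..., 0.9]
```

A biased or over-confident sampler shows up as a curve that departs from the diagonal, which is what the plots below are probing.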
|
|
|
|
|
|
![dynesty_review_unimodal_summary_pp](uploads/da9cff0a6c9d6128b7bae9afd571c3d0/dynesty_review_unimodal_summary_pp.png)
|
|
|
![dynesty_review_bimodal_summary_pp](uploads/751a53cd96288a0f893e53805572937e/dynesty_review_bimodal_summary_pp.png)
|