Choices for PP test confidence intervals
Following on from some discussions on Mattermost, I wanted to get my head around this properly. I'm creating this issue to focus the discussion and decide on what constitutes a "pass" for a PP test.
What are we worried about?
The issue @katerina.chatziioannou alluded to is that, for some PP tests, certain parameters wander outside of the 90% C.I. and stay there for a while. In particular, she says:
> How long a line stays outside the 90% tells you how extensive the bias is in the parameter space. If it stays outside only for a bit, you can argue that there might only be a couple of wonky injections (high SNR? edge of the parameter space?). But if they stay outside for a long time, the problem is in a large region. Similarly, where exactly it exits and reenters might help you pin down the problem by focusing on specific injections and their settings.
I agree with this statement, but how worried we should be depends on which C.I. bound we are talking about. The plots Katerina points to for BW show little wandering outside the shaded regions, but I'm not 100% sure what those regions are. My guess would be the 1-2-3 sigma levels ([0.68, 0.95, 0.997]), but I'd like to hear that confirmed.
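For reference, and assuming the shaded regions are the usual pointwise binomial confidence intervals on the cumulative fraction (which I believe is what `bilby`'s `make_pp_plot` draws, but treat this as my own back-of-the-envelope rather than a statement about the BW code), here is a rough comparison of how wide those bands are for 100 injections:

```python
# Rough widths of the different bands for 100 injections, assuming the shaded
# regions are pointwise binomial intervals on the cumulative fraction
# (evaluated at credible level p = 0.5, where they are widest).
from scipy.stats import binom

N = 100  # number of injections (assumed)
for ci in [0.9, 0.68, 0.95, 0.997]:
    lower, upper = binom.interval(ci, N, 0.5)
    print(f"CI={ci}: band spans [{lower / N:.2f}, {upper / N:.2f}] at p=0.5")
```

At the centre of the plot the 3-sigma band comes out roughly twice as wide as the 90% band, which already goes some way towards explaining why 1-2-3 sigma plots look more forgiving.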
Checking behaviour
In order to understand the behaviour of PP plots in a setting where we know there isn't any bias, I set up a script which generates fake 10D (uncorrelated) Gaussian posteriors and makes a PP plot from them. (Note: to run this you'll need !721 (merged) in order to show the 3-sigma background.)
run.py
```python
import numpy as np
import pandas as pd
import tqdm

import bilby
from bilby.core.prior import Uniform

np.random.seed(1234)

sigma = 1
Nresults = 100
Nsamples = 1000
Nparameters = 10
Nruns = 3

priors = {f"x{jj}": Uniform(-1, 1, f"x{jj}") for jj in range(Nparameters)}

for x in range(Nruns):
    results = []
    for ii in tqdm.tqdm(range(Nresults)):
        posterior = dict()
        injections = dict()
        for key, prior in priors.items():
            # Draw an injected value from the prior, offset it by Gaussian
            # noise to get a "measured" value, then build the posterior by
            # sampling directly from the corresponding Gaussian: by
            # construction there is no bias.
            sim_val = prior.sample()
            rec_val = sim_val + np.random.normal(0, sigma)
            posterior[key] = np.random.normal(rec_val, sigma, Nsamples)
            injections[key] = sim_val
        posterior = pd.DataFrame(posterior)
        result = bilby.result.Result(
            label="test",
            injection_parameters=injections,
            posterior=posterior,
            search_parameter_keys=list(injections.keys()),
            priors=priors)
        results.append(result)

    # Make the same PP plot twice: once with a single 90% C.I. band and once
    # with the 1-2-3 sigma bands
    bilby.result.make_pp_plot(results, filename=f"run{x}_90CI",
                              confidence_interval=0.9)
    bilby.result.make_pp_plot(results, filename=f"run{x}_3sigma",
                              confidence_interval=[0.68, 0.95, 0.997])
```
Below is a table of the output figures for two of the runs: on the left with the 90% C.I. and on the right with the 1-2-3 sigma intervals. As expected, the 1-2-3 sigma plots look much more convincing (because the C.I. are of course wider). But, perhaps more importantly, even in this case where I've guaranteed that the results are unbiased (by drawing samples directly), we see lines wander outside the 90% C.I. This shouldn't be surprising: the shaded region is (at least approximately) a pointwise interval for a single curve, so with 10 independent parameters it is more likely than not that at least one curve sits outside a 90% band at any given credible level (see the short calculation after the table).
| | 90% | 1-2-3 sigma |
|---|---|---|
| run0 | *(figure: run0_90CI)* | *(figure: run0_3sigma)* |
| run1 | *(figure: run1_90CI)* | *(figure: run1_3sigma)* |
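Here is the short calculation referred to above: a back-of-the-envelope estimate (my own, under the same assumption that the band is a pointwise binomial interval) of how often at least one of the ten curves should sit outside a 90% band even with zero bias.

```python
# Probability that at least one of the independent PP curves lies outside a
# pointwise 90% band at a given credible level, assuming the band is a
# binomial confidence interval on the cumulative fraction.
from scipy.stats import binom

Nresults = 100    # injections per PP plot (as in the script above)
Nparameters = 10  # independent curves on one PP plot
p = 0.5           # credible level at which we evaluate the band

lower, upper = binom.interval(0.9, Nresults, p)
coverage_single = binom.cdf(upper, Nresults, p) - binom.cdf(lower - 1, Nresults, p)
print(f"Single-curve coverage: {coverage_single:.2f}")
print(f"P(at least one of {Nparameters} curves outside): "
      f"{1 - coverage_single ** Nparameters:.2f}")
```

So, at any given credible level, seeing one or two curves outside a 90% band is more likely than not; it is the persistent excursions Katerina describes that signal a real problem.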
Summary
In summary, I think the inference that bilby was failing the PP tests because some of the lines wandered outside the 90% C.I. needs to be revisited. If the intuition that "it shouldn't wander outside the C.I." is based on looking at 1-2-3 sigma contours, then we should switch the bilby default to 1-2-3 sigma contours as well.
PS: For those interested, I also played around with the script above to look at how forms of bias manifest. I know this is well documented in the literature, but it was nice to have an example I could fiddle with myself.
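As a concrete example of that fiddling (a hypothetical tweak of my own, not part of any MR): adding a constant offset to the "measured" value drags the injection credible levels away from uniform, which is what eventually shows up as a PP curve leaving the band.

```python
# Hypothetical, self-contained illustration of how a systematic bias shows up:
# compute the credible level of each injection (fraction of posterior samples
# below it) for unbiased and offset posteriors, and compare the distributions.
import numpy as np

np.random.seed(1234)
sigma, Nresults, Nsamples = 1, 100, 1000
offset = 0.5 * sigma  # size of the injected bias (hypothetical)

for label, bias in [("unbiased", 0.0), ("biased", offset)]:
    credible_levels = []
    for _ in range(Nresults):
        sim_val = np.random.uniform(-1, 1)
        rec_val = sim_val + bias + np.random.normal(0, sigma)
        samples = np.random.normal(rec_val, sigma, Nsamples)
        credible_levels.append(np.mean(samples < sim_val))
    # For an unbiased analysis the credible levels are uniform on [0, 1];
    # a systematic offset drags the whole distribution to one side.
    print(f"{label}: mean credible level = {np.mean(credible_levels):.2f}")
```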