Choices for PP test confidence intervals
Following on from some discussions on Mattermost, I wanted to get my head around this properly. I'm creating this issue to focus the discussion and decide on what constitutes a "pass" for a PP test.
What are we worried about?
The issue @katerina.chatziioannou alluded to is that, for some PP tests, certain parameters wander outside of the 90% C.I. and stay there for a while. In particular, she says:
> How long a line stays outside the 90% tells you how extensive the bias is in the parameter space. If it stays outside only for a bit, you can argue that there might only be a couple of wonky injections (high SNR? edge of the parameter space?). But if they stay outside for a long time, the problem is in a large region. Similarly, where exactly it exits and reenters might help you pin down the problem by focusing on specific injections and their settings.
I agree with this statement, but how worried we should be depends on which C.I. bound we are talking about. The plots Katerina points to for BW show little wandering outside the shaded regions, but I'm not 100% sure what those regions are. My guess would be the 1-2-3 sigma levels ([0.68, 0.95, 0.997]), but I'd like to hear that confirmed.
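For reference, and assuming the shaded regions are the usual pointwise binomial confidence intervals on the cumulative fraction (which I believe is what `bilby`'s `make_pp_plot` draws, but treat this as my own back-of-the-envelope rather than a statement about the BW code), here is a rough comparison of how wide those bands are for 100 injections:

```python
# Rough widths of the different bands for 100 injections, assuming the shaded
# regions are pointwise binomial intervals on the cumulative fraction
# (evaluated at credible level p = 0.5, where they are widest).
from scipy.stats import binom

N = 100  # number of injections (assumed)
for ci in [0.9, 0.68, 0.95, 0.997]:
    lower, upper = binom.interval(ci, N, 0.5)
    print(f"CI={ci}: band spans [{lower / N:.2f}, {upper / N:.2f}] at p=0.5")
```

At the centre of the plot the 3-sigma band comes out roughly twice as wide as the 90% band, which already goes some way towards explaining why 1-2-3 sigma plots look more forgiving.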
Checking behaviour
In order to understand the behaviour of PP plots in a setting where we know there isn't any bias, I set up a script which generates fake 10D (uncorrelated) Gaussian posteriors and makes a PP plot from them. (Note: to run this you'll need !721 (merged) in order to show the 3-sigma background.)
run.py
```python
import numpy as np
import pandas as pd
import tqdm

import bilby
from bilby.core.prior import Uniform

np.random.seed(1234)

sigma = 1
Nresults = 100
Nsamples = 1000
Nparameters = 10
Nruns = 3

priors = {f"x{jj}": Uniform(-1, 1, f"x{jj}") for jj in range(Nparameters)}

for x in range(Nruns):
    results = []
    for ii in tqdm.tqdm(range(Nresults)):
        posterior = dict()
        injections = dict()
        for key, prior in priors.items():
            # Draw an injected value from the prior, offset it by Gaussian
            # noise to get a "measured" value, then build the posterior by
            # sampling directly from the corresponding Gaussian: by
            # construction there is no bias.
            sim_val = prior.sample()
            rec_val = sim_val + np.random.normal(0, sigma)
            posterior[key] = np.random.normal(rec_val, sigma, Nsamples)
            injections[key] = sim_val
        posterior = pd.DataFrame(posterior)
        result = bilby.result.Result(
            label="test",
            injection_parameters=injections,
            posterior=posterior,
            search_parameter_keys=list(injections.keys()),
            priors=priors)
        results.append(result)

    # Make the same PP plot twice: once with a single 90% C.I. band and once
    # with the 1-2-3 sigma bands
    bilby.result.make_pp_plot(results, filename=f"run{x}_90CI",
                              confidence_interval=0.9)
    bilby.result.make_pp_plot(results, filename=f"run{x}_3sigma",
                              confidence_interval=[0.68, 0.95, 0.997])
```
Below is a table of the output figures for two of the runs: on the left with the 90% C.I. and on the right with the 1-2-3 sigma intervals. As expected, the 1-2-3 sigma plots look much more convincing (because the C.I. are of course wider). But, perhaps more importantly, even in this case where I've guaranteed that the results are unbiased (by drawing samples directly), we see lines wander outside the 90% C.I. This shouldn't be surprising: the shaded region is (at least approximately) a pointwise interval for a single curve, so with 10 independent parameters it is more likely than not that at least one curve sits outside a 90% band at any given credible level (see the short calculation after the table).
| | 90% | 1-2-3 sigma |
|---|---|---|
| run0 | *(figure: run0_90CI)* | *(figure: run0_3sigma)* |
| run1 | *(figure: run1_90CI)* | *(figure: run1_3sigma)* |
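Here is the short calculation referred to above: a back-of-the-envelope estimate (my own, under the same assumption that the band is a pointwise binomial interval) of how often at least one of the ten curves should sit outside a 90% band even with zero bias.

```python
# Probability that at least one of the independent PP curves lies outside a
# pointwise 90% band at a given credible level, assuming the band is a
# binomial confidence interval on the cumulative fraction.
from scipy.stats import binom

Nresults = 100    # injections per PP plot (as in the script above)
Nparameters = 10  # independent curves on one PP plot
p = 0.5           # credible level at which we evaluate the band

lower, upper = binom.interval(0.9, Nresults, p)
coverage_single = binom.cdf(upper, Nresults, p) - binom.cdf(lower - 1, Nresults, p)
print(f"Single-curve coverage: {coverage_single:.2f}")
print(f"P(at least one of {Nparameters} curves outside): "
      f"{1 - coverage_single ** Nparameters:.2f}")
```

So, at any given credible level, seeing one or two curves outside a 90% band is more likely than not; it is the persistent excursions Katerina describes that signal a real problem.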
Summary
In summary, I think the inference that bilby was failing the PP tests because some of the lines wandered outside the 90% C.I. needs to be revisited. If the intuition that "it shouldn't wander outside the C.I." is based on looking at 1-2-3 sigma contours, then we should switch the bilby default to 1-2-3 sigma contours as well.
PS: For those interested, I also played around with the script above to look at how forms of bias manifest. I know this is well documented in the literature, but it was nice to have an example I could fiddle with myself.
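As a concrete example of that fiddling (a hypothetical tweak of my own, not part of any MR): adding a constant offset to the "measured" value drags the injection credible levels away from uniform, which is what eventually shows up as a PP curve leaving the band.

```python
# Hypothetical, self-contained illustration of how a systematic bias shows up:
# compute the credible level of each injection (fraction of posterior samples
# below it) for unbiased and offset posteriors, and compare the distributions.
import numpy as np

np.random.seed(1234)
sigma, Nresults, Nsamples = 1, 100, 1000
offset = 0.5 * sigma  # size of the injected bias (hypothetical)

for label, bias in [("unbiased", 0.0), ("biased", offset)]:
    credible_levels = []
    for _ in range(Nresults):
        sim_val = np.random.uniform(-1, 1)
        rec_val = sim_val + bias + np.random.normal(0, sigma)
        samples = np.random.normal(rec_val, sigma, Nsamples)
        credible_levels.append(np.mean(samples < sim_val))
    # For an unbiased analysis the credible levels are uniform on [0, 1];
    # a systematic offset drags the whole distribution to one side.
    print(f"{label}: mean credible level = {np.mean(credible_levels):.2f}")
```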