Draft: Split chisq normalization from postcoh kernel
Draft
Adding the draft tag since I'm still looking at how best to split things up, and trying to avoid any drop in performance.
Changes
I'm pulling some steps out of the postcoh kernel for the following reasons, in priority order:
- So that in a following MR, I can change chisq normalization with a feature flag, by launching a different, small kernel.
- So that in a later project we can separate foreground & background calculation, to significantly reduce upload latency (we do literally 100x more work than we need to in postcoh_kernel)
- Readability (opinions may vary).
- Performance
There are other options for (1), the feature-flagged chisq normalization. We could:
- Put a branch in the cmbchisq calculation. It wouldn't be hard, but it feels like adding to technical debt.
- Calculate cmbchisq on CPU (the overall work is typically small, but it is parallelizable, and we've already copied the necessary data to GPU).
  - In theory there are as many as 500 peaks per second per bank, 101 triggers (fg + bg) per peak, 4 banks per node, and 2 fields getting set, ~= 400k floats to calculate. Typically there'd be just several peaks per second.
I think the kernel needs a rework anyway, so I'm inclined to dismiss those.
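As a quick check of the worst-case CPU workload quoted above, here is a minimal sketch that multiplies the figures out. The constant and function names are made up for illustration and do not come from the codebase.

```cpp
#include <cstdint>

// Worst-case figures from the estimate above; names are illustrative only.
constexpr std::int64_t kMaxPeaksPerSecondPerBank = 500;
constexpr std::int64_t kTriggersPerPeak = 101;   // fg + bg
constexpr std::int64_t kBanksPerNode = 4;
constexpr std::int64_t kFieldsPerTrigger = 2;    // the 2 floats set per trigger

// Floats to calculate per second on one node in the theoretical worst case.
constexpr std::int64_t worst_case_floats_per_second() {
    return kMaxPeaksPerSecondPerBank * kTriggersPerPeak *
           kBanksPerNode * kFieldsPerTrigger;
}

// 500 * 101 * 4 * 2 = 404,000, i.e. roughly 400k.
static_assert(worst_case_floats_per_second() == 404000,
              "worst case is roughly 400k floats per second per node");
```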
The original kernel is kinda weird. It has these jobs:
- For each foreground peak (256 threads per peak):
  - Coherent search calculating cohsnr (threads per pixel, num pixels > num threads, with a warp reduce across all 256 threads at the end).
  - Given the direction, store the results in ntoff, snglsnr, coaphase. (I'm pretty sure all 256 threads do the same writes; I don't know whether those memory writes happen in parallel.)
  - Calculate chisq (threads per sample, num pixels > num threads, with a warp reduce across all samples).
- For each background peak (32 threads per peak):
  - As above, but with only 32 threads.
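The "more pixels than threads, then reduce" pattern in the list above can be emulated on the CPU to make its shape concrete. This is only a sketch under assumptions: the function name is hypothetical, the reduction is assumed to keep the maximum value (as a search for the best cohsnr would), values are assumed non-negative, and the thread count is assumed to be a power of two. It is not the actual postcoh kernel code.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// CPU emulation of one thread block working on one peak.
// `values` stands in for a per-pixel quantity such as cohsnr (assumed >= 0).
// `threads_per_block` must be a power of two (e.g. 256 or 32).
float block_strided_max(const std::vector<float>& values,
                        std::size_t threads_per_block) {
    // Phase 1: "thread" t visits items t, t + T, t + 2T, ... and keeps its own best.
    std::vector<float> partial(threads_per_block, 0.0f);
    for (std::size_t t = 0; t < threads_per_block; ++t) {
        for (std::size_t i = t; i < values.size(); i += threads_per_block) {
            partial[t] = std::max(partial[t], values[i]);
        }
    }
    // Phase 2: tree reduction across threads. On a GPU this step would
    // typically use warp shuffles and/or shared memory instead of a loop.
    for (std::size_t stride = threads_per_block / 2; stride > 0; stride /= 2) {
        for (std::size_t t = 0; t < stride; ++t) {
            partial[t] = std::max(partial[t], partial[t + stride]);
        }
    }
    return partial[0];
}
```

Calling it with 256 and then 32 threads on the same input gives the same answer; only the number of items each thread has to walk changes. Replacing `std::max` with a sum would give the corresponding shape for a chisq-style accumulation.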
I think there are a few opportunities to improve performance there.
In this MR, I've pulled out the chisq reduction & cmbchisq calculation into a new kernel that allocates effectively a thread per peak, where all each thread has to do is calculate & set 2 floats. That shouldn't work out to be any more threads than the original kernel uses:
Old kernel: max peaks * threads per block = 500 * 256
vs
New kernel: max peaks * triggers per peak * (threads per block - 1) / threads per block = 500 * 101 * 255 / 256 ~= 500 * 101
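To put numbers on the comparison above, here is a small sketch of the thread counts. It assumes the new kernel is launched with 256 threads per block and that the grid is sized with the usual round-up division; both are assumptions for illustration, and all names are hypothetical.

```cpp
#include <cstdint>

constexpr std::int64_t kMaxPeaks = 500;
constexpr std::int64_t kTriggersPerPeak = 101;   // fg + bg
constexpr std::int64_t kThreadsPerBlock = 256;   // assumed for both kernels

// Old kernel: one 256-thread block per peak, as in the estimate above.
constexpr std::int64_t old_kernel_threads() {
    return kMaxPeaks * kThreadsPerBlock;
}

// Number of blocks needed to cover `work_items` threads, rounding up.
constexpr std::int64_t blocks_for(std::int64_t work_items,
                                  std::int64_t threads_per_block) {
    return (work_items + threads_per_block - 1) / threads_per_block;
}

// New kernel: one thread per trigger, rounded up to whole blocks.
constexpr std::int64_t new_kernel_threads() {
    return blocks_for(kMaxPeaks * kTriggersPerPeak, kThreadsPerBlock) *
           kThreadsPerBlock;
}

static_assert(old_kernel_threads() == 128000, "500 * 256");
static_assert(new_kernel_threads() == 50688, "198 blocks of 256 threads");
```

Under these assumptions the new kernel launches at most 198 blocks (50,688 threads) against 128,000 threads for the old one, so the rounding to whole blocks does not change the conclusion.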
Tests
I'm running the 8000s injection test to confirm the change is functionally correct.
For performance testing, I'll run 1 node on CIT before/after, with some debug info showing how long each step in postcoh_kernel.cu takes.