Revert multiratespiir warp reduce
We've traced some serious MDC performance issues back to the warp reduce change in !20 (merged). It's still not clear what exactly went wrong there, especially since we've had successful MDC runs with that change in the last year.
CIT has had some OS changes and we've rebuilt dependencies, so it's possible that the change in !20 (merged) relies on some undefined behaviour (UB) that now behaves differently.
This change reverts it, but adds a compiler flag (commented out) that enables warp reduce for deterministic testing.
Tests
We've run a number of tests.
The primary test is a run on MDC, checking that:
- CPU load is <= 8 on the 15 minute average (cohfar_calcfap causes spikes, pushing load over 8 and creating a queue of buffers, which are then worked through, after which latency recovers).
- Latencies are between 8 and 10 (checked in the latency_history.txt file).
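For anyone repeating these checks, the two pass criteria above could be scripted roughly as follows. This is only a sketch, not what we actually ran: the function names are hypothetical, and the layout of latency_history.txt is an assumption (one record per line, latency in the last whitespace-separated field), so the parser will need adjusting to the real file.

```python
import os

LOAD_LIMIT = 8.0               # ceiling for the 15 minute load average
LATENCY_RANGE = (8.0, 10.0)    # acceptable latency band


def load_ok(load_15min, limit=LOAD_LIMIT):
    """True if the 15 minute load average is within the limit."""
    return load_15min <= limit


def parse_latencies(lines):
    """Pull one latency per line.

    ASSUMPTION: the latency is the last whitespace-separated field of each
    non-empty line, and lines starting with '#' are comments.
    """
    values = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        values.append(float(line.split()[-1]))
    return values


def latencies_ok(values, lo=LATENCY_RANGE[0], hi=LATENCY_RANGE[1]):
    """True if there is at least one latency and all lie within [lo, hi]."""
    return bool(values) and all(lo <= v <= hi for v in values)


def check_run(latency_path="latency_history.txt"):
    """Evaluate both criteria on the current host; returns (load_pass, latency_pass)."""
    _, _, load_15min = os.getloadavg()  # (1 min, 5 min, 15 min) averages
    with open(latency_path) as fh:
        latencies = parse_latencies(fh)
    return load_ok(load_15min), latencies_ok(latencies)
```

Calling `check_run()` from the run directory returns a pair of booleans. Because only the 15 minute average is checked, the short cohfar_calcfap spikes do not fail the load criterion by themselves.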
We've run this test on the following branches (as well as others, but these were the main ones):
- O4-dev on CUDA 11.2, fails
- Tiebreak-clustering on CUDA 9.2 (which we've previously had successful MDC runs on), fails
- !20 (merged) on CUDA 9.2, fails
- O3-reviewed on CUDA 9.2, passes
- O4-dev with warp reduce reverted, passes
The other changes in !20 (merged) shouldn't be significant. We're trying a few new runs with sorted & FFT estimation reverted. These shouldn't have an impact, but if they do, they'll be changed in another MR.