Revert multiratespiir warp reduce

Timothy Davies requested to merge tdavies__test_atomic_add into spiir-O4-EW-development

We've traced some serious MDC performance issues back to the warp reduce change in !20 (merged). It's still not clear what exactly went wrong there, especially since we've had successful MDC runs with that change in the last year.

CIT has had some OS changes and we've rebuilt dependencies, so it's possible that the change in !20 (merged) relies on some undefined behaviour (UB) whose outcome has now changed.

This change reverts it, but adds a compiler flag (commented out) that enables warp reduce for deterministic testing.

Tests

We've run a number of tests.

The primary test was a run on MDC, checking that:

  • CPU load is <= 8 on the 15-minute average
    • cohfar_calcfap causes spikes, pushing load over 8 and creating a queue of buffers, which are then worked through, after which latency recovers.
  • Latencies are between 8 and 10 (checked in the latency_history.txt file).
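
The two pass criteria above can be checked with a small script. The following is only a minimal sketch, not the tooling we actually used: the function names are hypothetical, and it assumes that each non-empty line of latency_history.txt ends with the latency value as its last whitespace-separated field, which may not match the real file layout.

```python
import os

MAX_LOAD_15MIN = 8.0         # load threshold from the checklist above
LATENCY_RANGE = (8.0, 10.0)  # inclusive latency bounds from the checklist above


def load_ok(max_load=MAX_LOAD_15MIN):
    """Return True if the 15-minute load average is at or below max_load (Unix only)."""
    _, _, load15 = os.getloadavg()
    return load15 <= max_load


def parse_latencies(lines):
    """Extract one latency per non-empty line.

    Assumption: the latency is the last whitespace-separated field on each line.
    """
    latencies = []
    for line in lines:
        fields = line.split()
        if fields:
            latencies.append(float(fields[-1]))
    return latencies


def out_of_range(latencies, bounds=LATENCY_RANGE):
    """Return the latencies that fall outside the inclusive bounds."""
    low, high = bounds
    return [x for x in latencies if not (low <= x <= high)]


def latencies_ok(latencies, bounds=LATENCY_RANGE):
    """Return True if there is at least one latency and all are within bounds."""
    return bool(latencies) and not out_of_range(latencies, bounds)
```

To use it, pass the lines of latency_history.txt to `parse_latencies` and the result to `latencies_ok`. Since cohfar_calcfap spikes can push latency up temporarily, a strict all-in-range check may be too harsh for a long run; `out_of_range` lets you inspect how many samples fall outside the window instead.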

We've run this test on the following branches (as well as others, but these were the main ones):

  • O4-dev on CUDA 11.2, fails
  • Tiebreak-clustering on CUDA 9.2, fails (we've previously had successful MDC runs on this branch)
  • !20 (merged) on CUDA 9.2, fails
  • O3-reviewed on CUDA 9.2, passes
  • O4-dev with warp reduce reverted, passes

The other changes in !20 (merged) shouldn't be significant. We're trying a few new runs that revert the sorted & FFT estimation changes. These shouldn't have an impact, but if they do, they'll be changed in another MR.
