Replace shfl_xor with shfl_down where possible
When switching from the deprecated shfl_xor to shfl_xor_sync in !65 (merged), I accidentally introduced a hang. shfl_xor differs from shfl_down in that each thread in its thread mask ends up with a copy of the result. However, one of our shfl_xor calls isn't run on all threads, but is given the ALL_THREADS_MASK. This causes the hang, as it waits for all threads to catch up to get a copy of the results (I think). !93 (merged) fixes this by adding a ballot_sync and running shfl_xor_sync with a thread mask corresponding to the threads that are actually running.
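For reference, the ballot-then-shuffle pattern described above can be sketched roughly as follows. This is not the code from !93: the kernel name half_warp_sum, the choice of "the low 16 lanes are active", and the value 0xffffffff for ALL_THREADS_MASK are all assumptions made for illustration. The active set is deliberately chosen so that every xor partner is itself active; reading from a lane outside the mask would yield an undefined value.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumed value: a full 32-lane mask. The project's own definition may differ.
constexpr unsigned ALL_THREADS_MASK = 0xffffffffu;

// Hypothetical kernel: only the low 16 lanes of the warp take part in the reduction.
__global__ void half_warp_sum(const int* in, int* out) {
    int lane = threadIdx.x & 31;
    bool active = lane < 16;

    // Every lane of the warp executes the ballot, so the full mask is valid here.
    unsigned mask = __ballot_sync(ALL_THREADS_MASK, active);

    if (active) {
        int v = in[lane];
        // Butterfly reduction: with offsets 8, 4, 2, 1 every partner lane is < 16,
        // so each shuffle only reads from lanes that are named in `mask`.
        for (int offset = 8; offset > 0; offset >>= 1) {
            v += __shfl_xor_sync(mask, v, offset);
        }
        out[lane] = v;  // every active lane now holds the same sum
    }
}

int main() {
    int h_in[32], h_out[16];
    for (int i = 0; i < 32; ++i) h_in[i] = i + 1;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    half_warp_sum<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    printf("lane 0: %d, lane 15: %d\n", h_out[0], h_out[15]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Passing ALL_THREADS_MASK to the shuffle inside the if branch instead of the ballot result would name lanes that never reach the call, which is the situation this issue describes.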
shfl_down_sync may not have the same problem. We do the exact same process for the Warp Reduce added in !20 (merged), but don't get a hang.
Read through the code changed in !93 (merged) and make sure that shfl_down_sync is valid (in that only the 0th thread needs the result), then try making the switch and removing the ballot.
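As a starting point for that experiment, here is a minimal sketch of a shfl_down_sync reduction in which only lane 0 consumes the result. It is an assumption about how the switch could look, not the existing Warp Reduce from !20: the names warp_reduce_down and sum_first_n are hypothetical, and ALL_THREADS_MASK is assumed to be 0xffffffff. To avoid needing a ballot, every lane executes the shuffle and lanes without data contribute 0, the identity for addition.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumed value: a full 32-lane mask. The project's own definition may differ.
constexpr unsigned ALL_THREADS_MASK = 0xffffffffu;

// Hypothetical helper: must be called by all 32 lanes of the warp.
__device__ int warp_reduce_down(int v) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        // Lanes whose source (lane + offset) is past the end of the warp
        // get their own value back; that only affects lanes other than 0.
        v += __shfl_down_sync(ALL_THREADS_MASK, v, offset);
    }
    return v;  // only lane 0 is guaranteed to hold the full sum
}

// Hypothetical kernel: sums the first n (<= 32) inputs with one warp.
__global__ void sum_first_n(const int* in, int n, int* out) {
    int lane = threadIdx.x & 31;
    // Lanes without data still take part, contributing the identity value.
    int v = (lane < n) ? in[lane] : 0;
    int total = warp_reduce_down(v);
    if (lane == 0) {
        *out = total;
    }
}

int main() {
    int h_in[32];
    for (int i = 0; i < 32; ++i) h_in[i] = i + 1;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    sum_first_n<<<1, 32>>>(d_in, 20, d_out);

    int h_out = 0;
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sum of first 20 inputs: %d\n", h_out);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Whether this applies to the code in !93 depends on whether the inactive threads can be made to reach the shuffle at all; if they have already exited or sit in a different branch, the full mask remains invalid for shfl_down_sync as well.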
Even if it can't be changed, @andrewmichael.gozzard suggests the following change: !93 (comment 608515)