Add a common lib for CUDA device functions
Our CUDA kernels are generally hard to follow.
I assume this is because the code needs to be fast, so we have tended to use bit hacks and avoid helper functions.
But the compiler should inline small functions aggressively, so avoiding helpers mostly just makes the code harder to read.
To make a refactor easier, we should add a 'cuda_utils.cu' file or common lib that provides __device__
helper functions for kernels to use.
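
As a rough sketch of what such a lib could contain (the helper names, the CU_HD macro, and the choice of a header rather than a .cu file are all assumptions for illustration, not existing code in our tree), a few of the usual bit hacks could be given names like this. Putting the definitions in a header is one way to keep them visible for inlining in every kernel that includes them, and the macro fallback lets the same helpers be unit-tested with a plain host compiler:

```cpp
// cuda_utils.cuh (hypothetical file name and contents; a sketch only)
#pragma once

#ifdef __CUDACC__
// Compiled by nvcc: usable from both host and device code, inlining requested.
#define CU_HD __host__ __device__ __forceinline__
#else
// Compiled by a plain C++ compiler: lets the helpers be tested on the host.
#define CU_HD inline
#endif

// Round x up to the next multiple of align.
// Assumes align is a nonzero power of two.
CU_HD unsigned align_up_pow2(unsigned x, unsigned align) {
    return (x + align - 1u) & ~(align - 1u);
}

// Ceiling division, e.g. for computing how many blocks cover n elements.
// Assumes d is nonzero and n + d - 1 does not overflow.
CU_HD unsigned ceil_div(unsigned n, unsigned d) {
    return (n + d - 1u) / d;
}

// True if x is a nonzero power of two.
CU_HD bool is_pow2(unsigned x) {
    return x != 0u && (x & (x - 1u)) == 0u;
}
```

A kernel would then call something like ceil_div(n, blockDim.x) instead of spelling out the arithmetic inline, which should cost nothing once the call is inlined.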