WIP: ComputeFstat: Add OpenCL version of ComputeFstat Resampling
This is a WIP merge request that implements an OpenCL version of the ComputeFstat Resampling method, based on a hacked version from an Einstein@Home volunteer. A CUDA version will be built on top of this later, after this is merged. I am opening it here to get some comments.
My approach:
- rewrite the hacked code to integrate it into lalpulsar
- to do so, first separate the OpenCL and generic versions completely
- then try to share as much code as is sensible with ResampGeneric
Notes:
- I add a generic OpenCL module (`OpenCLutils`), which provides some basic OpenCL functionality, such as initialising the necessary OpenCL objects, and generic VectorMath functions (with a test, CLMEMVectorTest, similar to VectorMathTest).
- The device selection function picks a GPU device that supports at least OpenCL 1.2 and double precision. If there is more than one, the device with the most memory is chosen. If no GPU device is found, it selects any other device that meets these requirements. A device can be chosen explicitly with the environment variable `CLDEVICE`.
- includes the configure script from @bernd.machenschalk discussed so far here.
- a common approach to this issue should be found before merging this
- supports two FFT implementations for now: the Einstein@Home implementation eclfft and the AMD implementation clFFT.
- eclfft seems to be faster on non-AMD devices, but has a higher setup time, except on NVIDIA devices.
- The clFFT library was under active development until recently, but is now in maintenance mode.
- The default FFT for now is clFFT (if available) on all devices except NVIDIA, where it is eclfft. An FFT implementation can be chosen explicitly with the environment variable `OPENCLFFT`.
- clFFT has a switch (`CLFFT_REQUEST_LIB_NOMEMALLOC`) to avoid allocating extra device memory whenever possible, so this is a good way to reduce memory use on the device. Timings are comparable. Here is the memory usage with and without this switch, compared to the memory usage of the generic version, obtained from `lalapps_ComputeFstatBenchmark --numTrials=80` (the OpenCL version needs about 100-150MB RAM):
- FFT timing for various devices:
- more timings can be found here
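
The device and FFT selection rules from the notes above can be sketched roughly as follows. This is only an illustration in plain C, not the code in `OpenCLutils`: the `DeviceInfo` struct, the function names `select_device` and `select_fft`, and the override strings `"clFFT"` and `"eclfft"` are all hypothetical. Real code would fill the struct from OpenCL device queries and read the override from `OPENCLFFT`. Falling back to eclfft when clFFT is not available, and ignoring an unrecognised override, are assumptions as well.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified device description. Real code would fill
 * this from OpenCL device queries. */
typedef struct {
    const char *vendor;            /* e.g. "NVIDIA Corporation" */
    int is_gpu;                    /* nonzero for a GPU device */
    int cl_major, cl_minor;        /* supported OpenCL version */
    int has_fp64;                  /* nonzero if double precision works */
    unsigned long long mem_bytes;  /* global device memory */
} DeviceInfo;

typedef enum { FFT_CLFFT, FFT_ECLFFT } FftImpl;

/* A device is usable if it offers OpenCL >= 1.2 and double precision. */
static int device_is_usable(const DeviceInfo *d)
{
    int version_ok = d->cl_major > 1 ||
                     (d->cl_major == 1 && d->cl_minor >= 2);
    return version_ok && d->has_fp64;
}

/* Returns the index of the chosen device, or -1 if none is usable.
 * Prefers the usable GPU with the most memory; otherwise takes the
 * first usable non-GPU device. */
int select_device(const DeviceInfo *devs, size_t n)
{
    int best_gpu = -1, fallback = -1;
    for (size_t i = 0; i < n; i++) {
        if (!device_is_usable(&devs[i]))
            continue;
        if (devs[i].is_gpu) {
            if (best_gpu < 0 ||
                devs[i].mem_bytes > devs[best_gpu].mem_bytes)
                best_gpu = (int)i;
        } else if (fallback < 0) {
            fallback = (int)i;
        }
    }
    return best_gpu >= 0 ? best_gpu : fallback;
}

/* Default FFT choice: eclfft on NVIDIA, clFFT elsewhere if available.
 * `override` stands in for the value of OPENCLFFT (NULL if unset);
 * the accepted strings are an assumption of this sketch. */
FftImpl select_fft(const DeviceInfo *d, int clfft_available,
                   const char *override)
{
    if (override != NULL) {
        if (strcmp(override, "eclfft") == 0)
            return FFT_ECLFFT;
        if (strcmp(override, "clFFT") == 0 && clfft_available)
            return FFT_CLFFT;
        /* Unrecognised or unavailable: fall through to the default. */
    }
    if (strstr(d->vendor, "NVIDIA") != NULL)
        return FFT_ECLFFT;
    /* Assumed fallback when clFFT was not built in. */
    return clfft_available ? FFT_CLFFT : FFT_ECLFFT;
}
```

Keeping the selection logic free of OpenCL calls like this would also make it testable without a device, which might be useful for the CLMEMVectorTest-style tests.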