Enable IGWN-OSG Usage
I've run a few test use cases of Bilby through OSG resources, using the CVMFS-deployed igwn-py37 conda environment. In these cases the pipeline only requires minor modifications to the pre- and post-processing jobs (specifically, to make sure those jobs do not run on the OSG).
Use-cases tested:
- single-core GW150914 analysis
- 8-core GW150914 analysis
- single BBH injection
- pp-test injections
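For orientation, the single-core and 8-core GW150914 cases differ only in the number of CPUs requested for the sampler. A minimal, hypothetical ini sketch follows; the option values (trigger time, channels, sampler, output names) are illustrative and are not the exact configurations used for these tests:

# hypothetical GW150914 ini sketch -- not the exact configuration used for these tests
label = GW150914
outdir = outdir_GW150914
detectors = [H1, L1]
trigger-time = 1126259462.4
channel-dict = {H1: GWOSC, L1: GWOSC}
sampler = dynesty
# the single-core vs 8-core cases correspond to request-cpus = 1 vs 8
request-cpus = 8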
Required Changes
The changes required to the workflow currently produced by bilby_pipe
consist of the following additions to the data-generation, merging and plotting submit files (i.e., NOT the main engine job):
+flock_local = True
+DESIRED_Sites = "nogrid"
+should_transfer_files = NO
These additional directives ensure the jobs run locally, on the same cluster the workflow was submitted from, and that they can take advantage of the shared filesystem (e.g., to resolve relative paths).
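One way to sanity-check this after the fact (not something the tests above relied on; the cluster ID below is a placeholder) is to ask condor where those jobs actually ran:

# confirm the data-generation / merging / plotting jobs landed on local cluster nodes
# (12345 is a placeholder cluster ID)
condor_history 12345 -af:j LastRemoteHost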
The main analysis job itself runs out of the box on OSG resources since you have already done the hard work of setting up condor file transfer and deploying the executable to CVMFS.
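For comparison, an OSG-ready analysis submit file looks roughly like the sketch below; the CVMFS path, executable name, file names, and accounting tag are placeholders rather than the exact contents of the generated submit files:

# sketch of an OSG-ready analysis submit file; all paths and names are placeholders
universe = vanilla
executable = /cvmfs/oasis.opensciencegrid.org/ligo/sw/conda/envs/igwn-py37/bin/bilby_pipe_analysis
# the generated arguments line is omitted here
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
transfer_input_files = config_complete.ini, GW150914_data0.pickle
request_cpus = 1
accounting_group = ligo.dev.o3.cbc.pe.lalinference
queue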
Assumptions
I claim the bilby workflow will run OK with these changes, assuming:
- You submit from ldas-osg.ligo.caltech.edu or a submit point with an identical / equivalent configuration (namely, that you have access to a local pool as well as the OSG pool, and that the filesystem domains are named to match the submit host).
- Any other job configurations which I did not test (a) already make use of condor file transfer for any additional files (e.g., calibration files), and (b) write any files the jobs create to locations the user has write access to (specifically, I saw problems with the distance-marginalization lookup table when something tried to write into CVMFS).
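A quick pre-submission check along these lines might look like the sketch below; the file names and lookup-table path are placeholders for whatever a given configuration actually references:

#!/bin/bash
# pre-flight sanity checks for the assumptions above; paths are placeholders
extra_inputs="H1_calibration.txt L1_calibration.txt"
lookup_table="${HOME}/bilby_dist_marg_lookup.npz"

# (a) files handed to condor file transfer must actually exist
for f in ${extra_inputs}; do
    [ -f "${f}" ] || echo "missing input file: ${f}"
done

# (b) anything the jobs write (e.g. the distance-marginalization lookup table)
#     must live somewhere writable -- NOT inside CVMFS
case "${lookup_table}" in
    /cvmfs/*) echo "lookup table points into CVMFS (read-only): ${lookup_table}" ;;
esac
[ -w "$(dirname "${lookup_table}")" ] || echo "cannot write to $(dirname "${lookup_table}")"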
Example Runs / Test Location
The tests listed above can be found in:
CIT:/home/james.clark/Projects/osg-check/bilby
where you will also find a bash script modify_subs
which explicitly modifies an existing workflow to make the required changes.
(you will also see some runs with a "glidein" suffix there - you can ignore those. By default, jobs will land in the local AND OSG pools; these runs were just to test what happens if I force them onto the OSG.)
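A minimal sketch of what a modify_subs-style helper might look like is below. This is NOT the actual script in the directory above: the output directory and the submit-file glob patterns are guesses at bilby_pipe's naming, and GNU sed is assumed.

#!/bin/bash
# minimal sketch of a modify_subs-style helper (not the actual script above);
# outdir and glob patterns are guesses at bilby_pipe's submit-file naming
shopt -s nullglob
outdir="outdir_GW150914"

for sub in "${outdir}"/submit/*generation*.submit \
           "${outdir}"/submit/*merge*.submit \
           "${outdir}"/submit/*plot*.submit; do
    # insert the three directives just before the queue statement so they
    # apply to the job being queued (GNU sed syntax)
    for directive in '+flock_local = True' \
                     '+DESIRED_Sites = "nogrid"' \
                     '+should_transfer_files = NO'; do
        sed -i "/^queue/i ${directive}" "${sub}"
    done
done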
Additional notes
Most resources available to LIGO on the OSG consist of a single core. There are a handful of multi-core slots, up to a maximum (I think) of 8 cores. Once things start running at scale and demand ramps up, it would be a good idea to spend some time optimizing the number and type of multi-core slots available.
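For reference, a rough way to see the current breakdown of slot sizes from a submit host (stock condor_status; no LIGO-specific constraint applied here) is:

# tally the slots currently visible from the submit host by CPU count
condor_status -af Cpus | sort -n | uniq -c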
Also pinging @john-veitch @josh.willis