
Add results directory to condor inputs transfer

Daniel Williams requested to merge daniel-williams/bilby_pipe:master into master

What?

This MR attempts to address the bug identified in #234 (closed), where analyses created with bilby_pipe fail to resume correctly if they have been set up with the Transfer_files = True argument in the condor submit file.

This bug affects jobs running with the recommended settings on the LDG, and all jobs running on the OSG.

In addition, the MR adds a preserve_relative_paths = True line to the submit file, so that returned data keeps its location relative to the working directory from which the job was submitted.

Why?

When a job is resumed after eviction, the resume pickle is not currently transferred back to the worker node. The issue does not appear when the entire sampling process completes without eviction, because the resume file is then only transferred as an output and is never needed again as an input. Nor does it manifest in jobs which run with access to a shared filesystem. Jobs which transfer files to the worker node, however, cannot see the shared filesystem, so they only see the resume file if it is explicitly transferred as an input.
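
The failure mode described above can be illustrated with a minimal sketch. This is not bilby_pipe's or the sampler's actual resume code: the function names, the `<label>_resume.pickle` filename pattern, and the `iteration` key are assumptions made for illustration only. The point is that a missing resume file is indistinguishable from a fresh start, so the job silently begins again at iteration zero.

```python
import os
import pickle
import tempfile


def load_resume_state(outdir, label):
    """Return the saved sampler state, or None if no resume file is visible.

    The filename pattern is hypothetical, chosen only for this sketch.
    """
    resume_file = os.path.join(outdir, f"{label}_resume.pickle")
    if not os.path.isfile(resume_file):
        # On a worker node without the transferred results directory,
        # this branch is taken after every eviction.
        return None
    with open(resume_file, "rb") as handle:
        return pickle.load(handle)


def start_iteration(outdir, label):
    """Pick the iteration to start from: 0 unless a resume state exists."""
    state = load_resume_state(outdir, label)
    return 0 if state is None else state["iteration"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as outdir:
        # No resume file on the node: sampling starts from scratch.
        print(start_iteration(outdir, "run"))
        # Resume file present (i.e. transferred as an input): sampling continues.
        with open(os.path.join(outdir, "run_resume.pickle"), "wb") as handle:
            pickle.dump({"iteration": 1500}, handle)
        print(start_iteration(outdir, "run"))
```

No error is raised in the first case, which is why the restart is easy to miss.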

Who?

This MR was prepared by @daniel.williams with assistance from @patricia.schmidt while debugging a run affected by this bug.

How?

This MR adds the results directory to the list of transferred files in the submit file for analysis jobs. This directory holds the resume pickle; if it isn't transferred back to the compute node, the job starts sampling from the beginning again.
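
As a rough sketch of the change, the following shows how the relevant submit-file lines might be assembled. It is not the code in this MR: the helper name `build_transfer_lines` and the example paths are hypothetical. Only the HTCondor submit commands themselves (`should_transfer_files`, `transfer_input_files`, `preserve_relative_paths`) are real.

```python
def build_transfer_lines(input_files, result_dir):
    """Return HTCondor submit-file lines that transfer the inputs plus the
    results directory, keeping paths relative to the submit directory.

    Hypothetical helper for illustration; not part of bilby_pipe.
    """
    files = list(input_files)
    # Add the results directory (which holds the resume pickle) exactly once.
    if result_dir not in files:
        files.append(result_dir)
    return [
        "should_transfer_files = YES",
        "transfer_input_files = " + ",".join(files),
        "preserve_relative_paths = True",
    ]


if __name__ == "__main__":
    # Example paths are placeholders, not the names bilby_pipe generates.
    lines = build_transfer_lines(
        ["outdir/config_complete.ini", "outdir/data/run_data_dump.pickle"],
        "outdir/result",
    )
    print("\n".join(lines))
```

With the results directory listed as an input, a job restarted after eviction receives the resume pickle on the new worker node in the same relative location the job expects.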

Impact

The bug this MR is designed to fix is a critical issue for long-running jobs which are evicted through any standard eviction route in htcondor. The source of the problem is especially difficult to track down, and could lead a less experienced user to assume sampling is just taking an unusually long time when it is in fact being restarted from the beginning on each job eviction.

Edited by Daniel Williams
