Distributed race condition during directory creation (os.mkdir)
Summary
There's a race condition here: https://git.ligo.org/lscsoft/spiir/-/blob/spiir-O4-EW-development/gstlal-spiir/bin/gstlal_inspiral_postcohspiir_online#L590.
This race condition caused an error on node #2 of a 24-node BNS run. The error log is here: /fred/oz016/dtang/analysis/pastro/injection/bns/bns_bg_1257992907-604800-created_06-01-22_19-56-52/logs_HL/pipe_27466942_2.err
Explanation
Consider two nodes running this check. There is a non-zero chance that both nodes evaluate os.path.exists(dir) before either has created the directory, so both see it as missing and both proceed to call os.mkdir(dir). The first call succeeds. The second call then fails with an OSError, because the directory was created in the gap between that node's existence check and its own os.mkdir(dir) call.
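The interleaving described above can be replayed deterministically in a single process. This is only an illustrative sketch, not the pipeline code: the function name `simulate_check_then_create` and the temporary working directory are assumptions, and the two "nodes" are simply two stale results of the same existence check.

```python
import errno
import os
import shutil
import tempfile


def simulate_check_then_create(path):
    """Replay two nodes doing check-then-create on the same path.

    Returns the errno raised by the second os.mkdir call, or None if it succeeded.
    """
    # Both nodes run the existence check before either creates the directory.
    node_a_sees_dir = os.path.exists(path)
    node_b_sees_dir = os.path.exists(path)

    if not node_a_sees_dir:
        os.mkdir(path)  # node A gets there first
    if not node_b_sees_dir:
        try:
            os.mkdir(path)  # node B acts on its stale check
        except OSError as exc:
            return exc.errno
    return None


workdir = tempfile.mkdtemp()
try:
    race_errno = simulate_check_then_create(os.path.join(workdir, "H1L1_skymap"))
finally:
    shutil.rmtree(workdir)

print(race_errno == errno.EEXIST)  # True
```

The second call fails with EEXIST, which is the "[Errno 17] File exists" seen in the traceback below.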
Traceback
  File "/fred/oz016/gwdc_spiir_pipeline_codebase/scripts_n_things/build/dtang/install/bin/gstlal_inspiral_postcohspiir_online", line 620, in <module>
    options, filenames, process_params, iir_banks, detectors = parse_command_line()
  File "/fred/oz016/gwdc_spiir_pipeline_codebase/scripts_n_things/build/dtang/install/bin/gstlal_inspiral_postcohspiir_online", line 591, in parse_command_line
    os.mkdir(skymap_path)
OSError: [Errno 17] File exists: 'H1L1_skymap'
srun: error: john65: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=27466945.0
Solution
Python 3 has solutions for this built into the standard library, os.makedirs(dir, exist_ok=True) being one of them. However, the exist_ok parameter is not available in Python 2.7.
A potential work-around is discussed here: https://stackoverflow.com/questions/45283093/how-to-workaround-exist-ok-missing-on-python-2-7.
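For reference, here is a minimal Python 3 sketch of how exist_ok behaves. The temporary directory and the file name `not_a_dir` are assumptions made for illustration, and the directory name is borrowed from the traceback. Repeated calls are harmless, but exist_ok only tolerates an existing directory, so a regular file at the same path still raises.

```python
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "H1L1_skymap")
try:
    os.makedirs(target, exist_ok=True)  # creates the directory
    os.makedirs(target, exist_ok=True)  # already exists: no error
    created = os.path.isdir(target)

    # A regular file occupying the path is still treated as an error.
    blocker = os.path.join(workdir, "not_a_dir")
    with open(blocker, "w"):
        pass
    try:
        os.makedirs(blocker, exist_ok=True)
        file_conflict_raised = False
    except FileExistsError:
        file_conflict_raised = True
finally:
    shutil.rmtree(workdir)
```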
Alternatively, a try ... except block may be appropriate, see below. However, catching OSError wholesale is too broad: it can mask other failures (for example, permission denied or a missing parent directory) that aren't simply "file already exists". Python 2.7 does not have an equivalent to Python 3's FileExistsError, so telling these cases apart there means inspecting the exception's errno attribute.
    try:
        os.mkdir(dir)
    except OSError:
        logger.debug("%s already created." % dir)
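A narrower variant of the try/except above checks the exception's errno, so that only "file already exists" is swallowed. This is a sketch that should run on both Python 2.7 and Python 3. The helper name `mkdir_if_missing` is hypothetical, and the temporary directory and the `missing_parent` path exist only to exercise it.

```python
import errno
import os
import shutil
import tempfile


def mkdir_if_missing(path):
    """Create `path`, tolerating another process having created it first.

    Hypothetical helper name; not part of the pipeline code.
    """
    try:
        os.mkdir(path)
    except OSError as exc:
        # Swallow only "already exists", and only if it really is a directory.
        # Permission errors, a missing parent, or a file in the way still propagate.
        if exc.errno != errno.EEXIST or not os.path.isdir(path):
            raise


workdir = tempfile.mkdtemp()
try:
    skymap_path = os.path.join(workdir, "H1L1_skymap")
    mkdir_if_missing(skymap_path)  # creates the directory
    mkdir_if_missing(skymap_path)  # a second caller gets no error
    exists_after = os.path.isdir(skymap_path)

    try:
        mkdir_if_missing(os.path.join(workdir, "missing_parent", "child"))
        other_errno = None
    except OSError as exc:
        other_errno = exc.errno  # unrelated failures are not masked
finally:
    shutil.rmtree(workdir)
```

Because the helper calls os.mkdir directly and never checks for existence first, there is no window between a check and the creation for another node to slip into.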
All file/folder creation done by multiple nodes like this should be similarly addressed in the pipeline code.