pmdc failing to identify new files
When run repeatedly for file discovery, pmdc seems to stop finding new files after a few iterations.
To work around this and guarantee new file discovery, I have to remove the cache state, which obviously slows things down significantly and removes many of the benefits of this application.
The problem seems most acute when I'm trying to scan directories of symlinked files* (e.g. LLO: /home/rucio/dev/LIGO.frames.postO3/L1_R), but I believe I was able to reproduce the problem by first running on a "daily directory" like L1/L-L1_R-13636 and then going up a level to L1, after which it stopped discovering new files in the original directory.
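To confirm independently of pmdc that new files really did appear between two passes, something like the following can serve as a baseline. It is only a sketch that diffs two os.walk snapshots. The helper names snapshot and new_since are hypothetical and are not part of pmdc, and it makes no assumption about how pmdc's cache works internally.

```python
import os


def snapshot(root):
    """Return the set of file paths under root, relative to root.

    Uses the os.walk default (followlinks=False), so symlinked files are
    listed but symlinked directories are not descended into.
    """
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            found.add(os.path.relpath(full_path, root))
    return found


def new_since(previous, current):
    """Return the sorted list of paths present in current but not in previous."""
    return sorted(current - previous)
```

Taking a snapshot before and after a pass and comparing new_since(before, after) against what pmdc reports should show whether files are genuinely being missed.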
Some example scripts for executing pmdc and reading the resulting cache can be found on CIT: /home/james.clark/pmdc-test (where I had a bit more luck scanning the original, non-symlinked directories).
Given the behavior I'm seeing, I'm wondering if one, or some combination of, the following is a problem:
- large numbers of files
- sudden massive jumps in the numbers of files / directories scanned on each pass
- problems handling symlinks (I had a play around with os.walk('./') vs os.walk('./', followlinks=True) but don't see differences in behavior for my use case)
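On the last point, a small self-contained check of what followlinks actually changes may be useful. In os.walk, followlinks only controls whether symlinked directories are descended into. Symlinks to regular files appear in the filenames list either way, which may be why the two calls behave the same for directories of symlinked files. The tree layout and the helper names (build_demo_tree, walk_files) below are made up for illustration.

```python
import os
import tempfile


def walk_files(root, followlinks):
    """List every file os.walk reports under root, relative to root."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=followlinks):
        for name in filenames:
            found.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(found)


def build_demo_tree(base):
    """Create a hypothetical layout: one real file, one file symlink, one directory symlink."""
    real_dir = os.path.join(base, "real")
    os.makedirs(real_dir)
    with open(os.path.join(real_dir, "frame.gwf"), "w"):
        pass

    scan_dir = os.path.join(base, "scan")
    os.makedirs(scan_dir)
    # Symlink to a regular file: reported by os.walk regardless of followlinks.
    os.symlink(os.path.join(real_dir, "frame.gwf"),
               os.path.join(scan_dir, "file_link.gwf"))
    # Symlink to a directory: only descended into when followlinks=True.
    os.symlink(real_dir, os.path.join(scan_dir, "dir_link"))
    return scan_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        scan_dir = build_demo_tree(base)
        print("followlinks=False:", walk_files(scan_dir, followlinks=False))
        print("followlinks=True: ", walk_files(scan_dir, followlinks=True))
```

With followlinks=True, a symlink that points back to one of its own parent directories will make os.walk recurse indefinitely, so it is worth enabling only on trees known to be free of such loops.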
Notes / clarifications:
- The symlinks are a convenience to let me operate with a rucio-friendly LFN -> PFN scheme, close to what we plan to move to in the very near future
- I expect to use ASCII dumps from the "real" diskcache in production, but it is useful to have a lightweight/standalone utility in general (I haven't had much luck today getting diskcache to run like I have in the past)
- I can live with just hacking around it by removing the old cache at each pass, particularly if I remove older links whose files have ultimately moved to cold storage anyway