I think we need to monitor memory use as a function of time for all of our processes. @patrick.godwin, @stuart.anderson, any suggestions on how to set that up?
Ah, this is making sense now. The memory use on emfollow-playground is about 20 GB, and there are only 32 GB available and 4 GB of swap. It's totally understandable that it's running out of memory when many of these jobs are running at once.
@leo-singer, one way you could do it is to use psutil, make a Process using the pid and log the memory usage regularly. You can then use Prometheus to scrape that information.
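For reference, here is a minimal sketch of that approach: sample the process's resident memory with psutil and expose it as a Prometheus gauge. The PID, metric name, port, and sampling interval below are placeholders, not anything taken from our current deployment.

```python
# Minimal sketch: periodically sample a process's resident memory with psutil
# and expose it as a Prometheus gauge for scraping.
import time

import psutil
from prometheus_client import Gauge, start_http_server

PID = 12345            # hypothetical: PID of the gwcelery worker to watch
SAMPLE_INTERVAL = 15   # seconds between samples (placeholder)

rss_gauge = Gauge('gwcelery_worker_rss_bytes',
                  'Resident set size of the monitored process')

def main():
    proc = psutil.Process(PID)
    start_http_server(8000)  # Prometheus scrapes http://host:8000/metrics
    while True:
        # memory_info().rss is the resident set size in bytes
        rss_gauge.set(proc.memory_info().rss)
        time.sleep(SAMPLE_INTERVAL)

if __name__ == '__main__':
    main()
```

A Prometheus scrape job pointed at that port would then give us memory-vs-time history for each process we care about.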
Should we get rid of the 4 GB of swap, to force an error sooner when too much memory is requested rather than first going into a state of significantly degraded performance? Note: my preference is to remove swap for these low-latency processing systems.
Should we increase the system memory allocation from 32 GB, throttle the number of concurrent flatten jobs, or reduce the per-job memory footprint?
> Should we get rid of the 4 GB of swap, to force an error sooner when too much memory is requested rather than first going into a state of significantly degraded performance? Note: my preference is to remove swap for these low-latency processing systems.
Well, I don't know... this is probably a case where a little swapping doesn't hurt. The ligo_skymap_flatten job has a predictable, linear memory access pattern, so even if it is working out of swap, it shouldn't be that slow.
> Should we increase the system memory allocation from 32 GB, throttle the number of concurrent flatten jobs, or reduce the per-job memory footprint?
I've turned on Celery autoscaling, which will reduce the memory footprint by keeping the number of threads low most of the time. Let's see if that helps.
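For the record, a hedged sketch of how autoscaling is typically enabled for a Celery worker; the actual gwcelery deployment may configure this differently, and the pool bounds below are illustrative only.

```python
# Hedged sketch: enabling Celery pool autoscaling. The app name and the
# bounds (max 8, min 2 subprocesses) are illustrative, not gwcelery's
# actual configuration.
from celery import Celery

app = Celery('example')  # hypothetical app, not gwcelery's real entry point

if __name__ == '__main__':
    # Equivalent to: celery -A example worker --autoscale=8,2
    # The pool shrinks to 2 subprocesses when idle and grows up to 8 under
    # load, which keeps the steady-state memory footprint low.
    app.worker_main(['worker', '--autoscale=8,2', '--loglevel=INFO'])
```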
OK, autoscaling helped some. But now I can clearly see that when the pool grows in concurrency by 1 subprocess, the machine's memory use goes up by 4 GB. So to make full use of the machine's capacity, we need 4 GB for each hardware thread.
That is how emfollow-playground is currently configured: 8 CPUs and 32 GB of memory (+4 GB of swap). Should we try adding another 4 (or 8?) GB of memory, or reduce the number of CPUs from 8 to 6?
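To spell out the arithmetic behind that trade-off, using the ~4 GB per subprocess figure observed above (the 4 GB allowance for the OS and other services is just a guess, not a measurement):

```python
# Back-of-the-envelope memory budget, assuming ~4 GB per concurrent flatten
# subprocess and a rough 4 GB reserve for the OS and other services.
PER_SUBPROCESS_GB = 4
SYSTEM_RESERVE_GB = 4

for cpus in (8, 6):
    needed = cpus * PER_SUBPROCESS_GB + SYSTEM_RESERVE_GB
    print(f'{cpus} CPUs -> ~{needed} GB total RAM needed')

# 8 CPUs -> ~36 GB total RAM needed  (more than the current 32 GB)
# 6 CPUs -> ~28 GB total RAM needed  (fits within 32 GB with headroom)
```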
I will add that all SCCB-approved LSCSOFT updates and the latest kernel security update are also now applied on emfollow-playground. gwcelery is fairly isolated from the system-installed packages, but there are a number of new and updated Python 3 packages installed.
@philippe.grassia not urgent, but when you get a chance please also make all the same changes on emfollow-test and reboot.