I think we need to monitor memory use as a function of time for all of our processes. @patrick.godwin, @stuart.anderson, any suggestions on how to set that up?
Ah, this is making sense now. The memory use on emfollow-playground is about 20 GB, and there are only 32 GB available and 4 GB of swap. It's totally understandable that it's running out of memory when many of these jobs are running at once.
@leo-singer, one way you could do it is to use psutil, make a Process using the pid and log the memory usage regularly. You can then use Prometheus to scrape that information.
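For reference, here is a minimal sketch of that approach: sample the process's resident memory with psutil and expose it as a Prometheus gauge. The PID, metric name, port, and sampling interval below are placeholders, not anything taken from our current deployment.

```python
# Minimal sketch: periodically sample a process's resident memory with psutil
# and expose it as a Prometheus gauge for scraping.
import time

import psutil
from prometheus_client import Gauge, start_http_server

PID = 12345            # hypothetical: PID of the gwcelery worker to watch
SAMPLE_INTERVAL = 15   # seconds between samples (placeholder)

rss_gauge = Gauge('gwcelery_worker_rss_bytes',
                  'Resident set size of the monitored process')

def main():
    proc = psutil.Process(PID)
    start_http_server(8000)  # Prometheus scrapes http://host:8000/metrics
    while True:
        # memory_info().rss is the resident set size in bytes
        rss_gauge.set(proc.memory_info().rss)
        time.sleep(SAMPLE_INTERVAL)

if __name__ == '__main__':
    main()
```

A Prometheus scrape job pointed at that port would then give us memory-vs-time history for each process we care about.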
Should we get rid of the 4 GB of swap, to force an error sooner when too much memory is requested rather than first going into a state of significantly degraded performance? Note: my preference is to remove swap for these low-latency processing systems.
Should we increase the system memory allocation from 32 GB, throttle the number of concurrent flatten jobs, or reduce the per-job memory footprint?
> Should we get rid of the 4 GB of swap, to force an error sooner when too much memory is requested rather than first going into a state of significantly degraded performance? Note: my preference is to remove swap for these low-latency processing systems.
Well, I don't know... this is probably a case where a little swapping doesn't hurt. The ligo_skymap_flatten job has a predictable, linear memory access pattern, so even if it is working out of swap, it shouldn't be that slow.
> Should we increase the system memory allocation from 32 GB, throttle the number of concurrent flatten jobs, or reduce the per-job memory footprint?
I've turned on Celery autoscaling, which will reduce the memory footprint by keeping the number of threads low most of the time. Let's see if that helps.
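For the record, a hedged sketch of how autoscaling is typically enabled for a Celery worker; the actual gwcelery deployment may configure this differently, and the pool bounds below are illustrative only.

```python
# Hedged sketch: enabling Celery pool autoscaling. The app name and the
# bounds (max 8, min 2 subprocesses) are illustrative, not gwcelery's
# actual configuration.
from celery import Celery

app = Celery('example')  # hypothetical app, not gwcelery's real entry point

if __name__ == '__main__':
    # Equivalent to: celery -A example worker --autoscale=8,2
    # The pool shrinks to 2 subprocesses when idle and grows up to 8 under
    # load, which keeps the steady-state memory footprint low.
    app.worker_main(['worker', '--autoscale=8,2', '--loglevel=INFO'])
```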
OK, autoscaling helped some. But now I can clearly see that when the pool grows in concurrency by 1 subprocess, the machine's memory use goes up by 4 GB. So to make full use of the machine's capacity, we need 4 GB for each hardware thread.
That is how emfollow-playground is currently configured: 8 CPUs and 32 GB of memory (+4 GB of swap). Should we try adding another 4 (or 8?) GB of memory, or reduce the number of CPUs from 8 to 6?
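To spell out the arithmetic behind that trade-off, using the ~4 GB per subprocess figure observed above (the 4 GB allowance for the OS and other services is just a guess, not a measurement):

```python
# Back-of-the-envelope memory budget, assuming ~4 GB per concurrent flatten
# subprocess and a rough 4 GB reserve for the OS and other services.
PER_SUBPROCESS_GB = 4
SYSTEM_RESERVE_GB = 4

for cpus in (8, 6):
    needed = cpus * PER_SUBPROCESS_GB + SYSTEM_RESERVE_GB
    print(f'{cpus} CPUs -> ~{needed} GB total RAM needed')

# 8 CPUs -> ~36 GB total RAM needed  (more than the current 32 GB)
# 6 CPUs -> ~28 GB total RAM needed  (fits within 32 GB with headroom)
```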
I will add that all SCCB-approved LSCSOFT updates and the latest kernel security update are also now applied on emfollow-playground. gwcelery is fairly isolated from the system-installed packages, but there are a number of new and updated Python 3 packages installed.
@philippe.grassia not urgent, but when you get a chance please also make all the same changes on emfollow-test and reboot.