Emfollow machine emfollow.ligo.caltech.edu is running out of memory once per day, with the kernel OOM killer making a pseudo-random guess about which processes to kill.
This follows a note by Stuart (https://git.ligo.org/computing/helpdesk/-/issues/4741) reporting that the machine is running out of memory about once per day, and that the automatic kernel OOM killer may not always pick the best process to kill. An emfollow software expert should investigate which task(s) are using up the current 64 GB of memory on this system, and either see whether that can be reduced or indicate how much more memory the system needs to run stably.
- Ganglia monitoring (https://ldas-gridmon.ligo.caltech.edu/ganglia/?c=Servers&h=emfollow.ldas.cit) shows memory usage holding at over 44 GB of RAM, with a steady escalation of memory in use from 12 GB up to 56 GB. (The same trend also appears on emfollow-playground.)
- The machines appear to be oversubscribed on the memory side, while the CPU is about 90% idle on average.
- Logging in to emfollow.ligo.caltech.edu shows that the machine is usually idle, while "ps -aux --sort -vsz" shows that most of the memory is allocated by the primary gwcelery worker (see below). [The status of the processes is very different on playground and test (?).]
No single process seems responsible for the system running out of available RAM; rather, there is an escalation of allocated memory that is freed. (Python memory garbage collector?)
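To check how the memory of each worker evolves over time, PSS (proportional set size) may be more informative than RSS: forked workers share pages, and RSS counts those pages once per process. The sketch below parses the `Pss:` line from the text of `/proc/<pid>/smaps_rollup` (available on Linux 4.14 and later). `parse_pss_kib` and `read_pss_kib` are hypothetical helpers, and the sample text is made up for illustration.

```python
def parse_pss_kib(smaps_rollup_text):
    """Return the Pss value in kB from /proc/<pid>/smaps_rollup text.

    Returns None if no "Pss:" line is present.
    """
    for line in smaps_rollup_text.splitlines():
        # Only the total line; "Pss_Anon:" etc. do not start with "Pss:".
        if line.startswith("Pss:"):
            # Line format: "Pss:    123456 kB"
            return int(line.split()[1])
    return None


def read_pss_kib(pid):
    """Read PSS for a live process (Linux only; needs read permission)."""
    with open(f"/proc/{pid}/smaps_rollup") as handle:
        return parse_pss_kib(handle.read())


# Made-up sample in the smaps_rollup layout, for illustration only.
sample = """\
00400000-7ffd5a3f6000 ---p 00000000 00:00 0    [rollup]
Rss:              236816 kB
Pss:               41250 kB
Shared_Clean:      18000 kB
"""
print(parse_pss_kib(sample))
```

Sampling `read_pss_kib` for each worker PID at regular intervals would show whether individual workers grow steadily or whether the growth only appears when many workers are active at once.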
Should we reduce the concurrency of the main celery queue from 64 to 32? The reasons to consider this (see the memory trace below) are:
- Each worker allocates at least ~1 GB of virtual memory and ~230 MB of physical memory (we have just 64 GB of RAM).
- Most of the workers are almost never woken from sleep (TIME == 0).
- cpuinfo reports "Brand Raw: AMD EPYC 7543 32-Core Processor", and the machine is running other queues. (8x8=64 cores.)
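As a back-of-the-envelope check of the figures above, the sketch below multiplies the ~230 MB idle resident size by the pool size. It is only an upper bound on the idle footprint, assuming every idle worker really holds ~230 MB privately; forked workers share pages copy-on-write, so the true figure is likely lower. The function name and constants are illustrative and not part of gwcelery.

```python
# Rough estimate of the memory held by idle pool workers.
# Assumptions (taken from the ps trace in this issue, not measured precisely):
#   - an idle ForkPoolWorker has an RSS of about 230 MB
#   - the machine has 64 GB of RAM
IDLE_RSS_MB = 230
TOTAL_RAM_GB = 64


def idle_footprint_gb(concurrency, idle_rss_mb=IDLE_RSS_MB):
    """Upper bound on RAM held by idle workers, in GB (1 GB = 1024 MB)."""
    return concurrency * idle_rss_mb / 1024


for concurrency in (64, 32):
    gb = idle_footprint_gb(concurrency)
    share = 100 * gb / TOTAL_RAM_GB
    print(f"concurrency={concurrency}: ~{gb:.1f} GB idle ({share:.0f}% of RAM)")
```

Under these assumptions, halving the concurrency would free roughly 7 GB at idle. The active workers in the trace hold 1 to 2.5 GB each, so peak usage depends mostly on how many workers are active at once.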
```
ps -aux --forest | grep gwcel
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
emfollow 2644187 0.0 0.0 11824 1112 pts/0 S+ 07:25 0:00 | \_ grep --color=auto gwcel
emfollow 3434932 0.0 0.2 994096 194764 ? Ss Oct07 0:09 | | \_ [celery beat] beat -f gwcelery-beat.log
emfollow 3435819 0.4 0.2 1146600 180292 ? Ssl Oct07 20:38 | | \_ [celeryd: gwcelery-voevent-worker@emfollow.ligo.caltech.edu:MainProcess] -active- (worker -l info -n gwcelery-voevent-worker@%h -f %n.log -Q voevent -P solo)
emfollow 3435925 0.0 0.3 1412872 260724 ? Ssl Oct07 1:04 | | \_ /cvmfs/software.igwn.org/conda/envs/igwn-py39-20221118/bin/python /home/emfollow/.local/bin/gwcelery flask -l info -f gwcelery-flask.log run --with-threads --host 127.0.0.1
emfollow 3435926 0.6 1.6 2463892 1109976 ? Ssl Oct07 28:11 | | \_ /cvmfs/software.igwn.org/conda/envs/igwn-py39-20221118/bin/python /home/emfollow/.local/bin/gwcelery flower --address=127.0.0.1 --log-file-prefix=gwcelery-flower.log
emfollow 3435929 0.4 0.2 1008576 175176 ? Ss Oct07 20:54 | | \_ [celeryd: gwcelery-superevent-worker@emfollow.ligo.caltech.edu:MainProcess] -active- (worker -l info -n gwcelery-superevent-worker@%h -f %n.log -Q superevent -c 1 --prefetch-multiplier 1)
emfollow 3436047 0.0 0.2 1013008 186936 ? S Oct07 0:11 | | \_ [celeryd: gwcelery-superevent-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-1]
emfollow 3435930 0.8 0.4 2447828 300984 ? Ssl Oct07 35:22 | | \_ [celeryd: gwcelery-kafka-producer-worker@emfollow.ligo.caltech.edu:MainProcess] -active- (worker -l info -n gwcelery-kafka-producer-worker@%h -f %n.log -Q kafka-producer -P solo)
emfollow 3435931 0.7 0.3 2095580 244504 ? Ssl Oct07 33:39 | | \_ [celeryd: gwcelery-kafka-consumer-worker@emfollow.ligo.caltech.edu:MainProcess] -active- (worker -l info -n gwcelery-kafka-consumer-worker@%h -f %n.log -Q kafka-consumer -P solo)
emfollow 3435935 0.4 0.2 1008312 180216 ? Ss Oct07 20:41 | | \_ [celeryd: gwcelery-exttrig-worker@emfollow.ligo.caltech.edu:MainProcess] -active- (worker -l info -n gwcelery-exttrig-worker@%h -f %n.log -Q exttrig -c 1 --prefetch-multiplier 1)
emfollow 3436046 0.0 0.2 1008312 177652 ? S Oct07 0:00 | | \_ [celeryd: gwcelery-exttrig-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-1]
emfollow 1428516 5.2 31.1 25881560 20367200 ? Ssl Oct09 80:18 | | \_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:MainProcess] -active- (worker -l info -n gwcelery-worker@%h -f %n.log -Q celery --igwn-alert --email --concurrency 64)
emfollow 1428643 0.2 1.7 1956596 1157736 ? Sl Oct09 3:59 | | \_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-1]
emfollow 1428644 0.2 1.7 1967232 1163588 ? Sl Oct09 3:12 | | \_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-2]
emfollow 1428645 0.1 1.6 1887452 1082536 ? Sl Oct09 1:46 | | \_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-3]
emfollow 1428646 0.0 0.8 1351164 553020 ? Sl Oct09 0:50 | | \_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-4]
emfollow 1428647 0.0 1.8 1980344 1237984 ? S Oct09 0:47 | | \_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-5]
emfollow 1428648 0.0 3.8 3311260 2496980 ? Sl Oct09 0:43 | | \_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-6]
emfollow 1428649 0.0 0.4 1051652 303992 ? S Oct09 0:08 | | \_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-7]
emfollow 1428650 0.0 0.6 1169236 415236 ? S Oct09 0:02 | | \_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-8]
emfollow 1428651 0.0 0.5 1123236 385708 ? S Oct09 0:03 | | \_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-9]
emfollow 1428652 0.0 2.8 2600476 1846540 ? S Oct09 0:09 | | \_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-10]
emfollow 1428653 0.0 0.5 1104364 350912 ? S Oct09 0:01 | | \_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-11]
emfollow 1428654 0.0 0.5 1104584 348932 ? S Oct09 0:16 | | \_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-12]
emfollow 1428655 0.0 0.3 1013076 248468 ? S Oct09 0:00 | | \_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-13]
emfollow 1428656 0.0 0.3 1011868 246020 ? S Oct09 0:00 | | \_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-14]
emfollow 1428657 0.0 2.2 2261580 1504552 ? S Oct09 0:10 | | \_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-15]
emfollow 1428658 0.0 0.3 1011084 244704 ? S Oct09 0:00 | | \_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-16]
emfollow 1428659 0.0 0.4 1035112 276228 ? S Oct09 0:05 | | \_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-17]
emfollow 1428660 0.0 0.3 1011084 244668 ? S Oct09 0:00 | | \_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-18]
.......
emfollow 1428736 0.0 0.3 1008520 236816 ? S Oct09 0:00 | | \_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-54]
emfollow 1428737 0.0 0.3 1008524 236816 ? S Oct09 0:00 | | \_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-55]
emfollow 1428738 0.0 0.3 1008528 236828 ? S Oct09 0:00 | | \_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-56]
emfollow 1428739 0.0 0.3 1008532 236808 ? S Oct09 0:00 | | \_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-57]
emfollow 1428740 0.0 0.3 1008536 236808 ? S Oct09 0:00 | | \_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-58]
emfollow 1428741 0.0 0.3 1008540 236820 ? S Oct09 0:00 | | \_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-59]
emfollow 1428742 0.0 0.3 1008544 236824 ? S Oct09 0:00 | | \_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-60]
emfollow 1428743 0.8 2.0 2118148 1318832 ? Sl Oct09 12:26 | | \_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-61]
emfollow 1428744 0.7 2.0 2133448 1332396 ? Sl Oct09 10:44 | | \_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-62]
emfollow 1428745 0.3 1.8 2020624 1219596 ? Sl Oct09 4:41 | | \_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-63]
emfollow 1428746 0.2 1.9 2099440 1298012 ? Sl Oct09 3:43 | | \_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-64]
```
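
To put numbers on the listing above, the following sketch sums the RSS column (in KiB) per celery worker name from `ps aux`-style lines. It assumes the column order shown in the header (USER PID %CPU %MEM VSZ RSS ...); `sum_rss_by_group` and the grouping rule are illustrative choices, and the sample reuses three rows from the trace. Because RSS counts shared pages once per process, the totals may overstate real usage for forked workers.

```python
import re
from collections import defaultdict

# Matches the worker name inside "[celeryd: NAME@host:ROLE]".
WORKER_RE = re.compile(r"\[celeryd: ([^@\]]+)@")


def sum_rss_by_group(lines):
    """Sum RSS (KiB) per celery worker name from `ps aux`-style lines.

    Assumes the column order USER PID %CPU %MEM VSZ RSS ...; lines that
    do not look like process rows (header, ".......") are skipped.
    """
    totals = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) < 11 or not fields[1].isdigit():
            continue
        match = WORKER_RE.search(line)
        group = match.group(1) if match else "other"
        totals[group] += int(fields[5])  # RSS is the sixth column
    return dict(totals)


sample = [
    "USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND",
    "emfollow 1428643 0.2 1.7 1956596 1157736 ? Sl Oct09 3:59 | | \\_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-1]",
    "emfollow 1428644 0.2 1.7 1967232 1163588 ? Sl Oct09 3:12 | | \\_ [celeryd: gwcelery-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-2]",
    "emfollow 3436047 0.0 0.2 1013008 186936 ? S Oct07 0:11 | | \\_ [celeryd: gwcelery-superevent-worker@emfollow.ligo.caltech.edu:ForkPoolWorker-1]",
]
for group, kib in sorted(sum_rss_by_group(sample).items()):
    print(f"{group}: {kib / 1024**2:.2f} GiB")
```

Feeding the full, unelided `ps` output through the same function would give the total attributed to the 64-worker pool versus the other queues.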
Edited by Roberto DePietri