Dolphin errors cropping up on some H1 front ends
We are getting recurring dolphin errors on multiple front ends that lead to IOP delays, which lead to dropped packets and even DAQ restarts if too many front ends are erring.
In the week of 2021-10-03, h1seih23 and h1seih45 ran out of memory. During that time they started producing "disconnected" messages in dmesg for certain dolphin nodes. These were the sus nodes that send to the two sei nodes. The sei nodes were generating one error per second per connecting sus.
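To check a rate like "one error per second per connecting sus", it can help to tally the disconnect messages per dolphin node. The following is only a rough sketch: the exact text of the driver's dmesg lines is not recorded in this entry, so the regular expression is a hypothetical placeholder that would need to be adjusted to the real output, and the function name count_disconnects is ours.

```python
import re
from collections import Counter

# HYPOTHETICAL pattern: the real Dolphin driver message format is not given in
# this entry. Adjust the regex to match what dmesg actually prints.
DEFAULT_PATTERN = re.compile(r"node\s+(?P<node>\d+).*disconnected", re.IGNORECASE)


def count_disconnects(lines, pattern=DEFAULT_PATTERN):
    """Return a Counter mapping node id (as a string) to the number of matching lines."""
    counts = Counter()
    for line in lines:
        match = pattern.search(line)
        if match:
            # The named group "node" must exist in whatever pattern is supplied.
            counts[match.group("node")] += 1
    return counts
```

Feeding it the lines of a saved dmesg dump (for example, the result of open(path).read().splitlines()) gives a per-node count that can be compared against the length of time the dump covers.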
On the sus nodes, remote allocation errors were appearing. We freed up memory on the two sei nodes, but we needed to reboot them before the errors went away. They did not come back for the weekend.
On Tuesday Oct 12, we restarted h1lsc0 to change its timing. We added dolphin back in to h1seih7 and restarted them.
h1lsc0 quickly (within 30 minutes to an hour) started producing the same disconnect errors, even though lsc0 has plenty of free memory.
We restarted lsc0 and the errors went away, but they resumed the morning of Oct 13. At this time, we noticed that h1seih7 and h1seih45 were also producing these errors, and h1seih23 gave a "heartbeat failed" message instead of a disconnect message, but referencing the same sus nodes.
Interestingly, dis_diag on the sus nodes showed a failed connection to h1seih45 but not to h1seih23.
We restarted all 4 nodes (seih23, seih45, lsc0, seih7), and the errors have not yet resumed.