investigate possible dropped messages
Anecdotal evidence from O2 suggested that packets were occasionally dropped, meaning lvalert messages were not received by all listeners. It is not clear where the fault lies here, as I'll try to explain below. We should determine whether dropped messages are actually an issue within the lvalert back-end. If that can be ruled out, then we know the problem was due to specific users' usage patterns.
My recollection is that the majority of these issues occurred with ApprovalProcessor, which was built on top of lvalertMP. Because of the way lvalertMP processed alerts (fundamentally in series), time-consuming delegations to parse_alert could cause messages to back up in the Python multiprocessing.Pipe connection used to pass lvalert messages to separate subprocesses. If the Pipe had a limited size, this could have caused messages to be dropped. Alternatively, and what I believe to be more likely, it would cause delays in processing alerts. This could be interpreted as "dropping alerts" because of incomplete logging within ApprovalProcessor, etc.
As an example, ApprovalProcessor often queried GraceDb or attempted to annotate events as part of the call to parse_alert. When it hit time-out errors, it would block for several minutes at a time and effectively stop digesting new alerts even though they were delivered to the listener without issue.
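To make the "delayed, not dropped" hypothesis concrete, here is a minimal sketch of a serial consumer reading from a multiprocessing.Pipe. It is not lvalertMP code: `slow_parse_alert`, `consume`, and `run_demo` are hypothetical names, the sleep stands in for a blocking GraceDb call, and the consumer runs in a thread rather than a subprocess purely to keep the example small. As far as I understand, `Connection.send` blocks when the underlying OS buffer is full rather than discarding data, so in this sketch every alert arrives, just later and later.

```python
import threading
import time
from multiprocessing import Pipe


def slow_parse_alert(alert, work_seconds):
    """Hypothetical stand-in for a parse_alert call that blocks (e.g. on GraceDb)."""
    time.sleep(work_seconds)


def consume(conn, work_seconds, results):
    """Process alerts strictly in series, recording how long each one waited."""
    while True:
        alert = conn.recv()
        if alert is None:  # sentinel: no more alerts
            break
        waited = time.monotonic() - alert["sent"]
        results.append((alert["id"], waited))
        slow_parse_alert(alert, work_seconds)


def run_demo(n_alerts=10, work_seconds=0.02):
    """Send a burst of alerts to a slow serial consumer; return (id, wait) pairs."""
    # duplex=False returns (receive-only end, send-only end)
    recv_end, send_end = Pipe(duplex=False)
    results = []
    worker = threading.Thread(target=consume, args=(recv_end, work_seconds, results))
    worker.start()
    for i in range(n_alerts):
        send_end.send({"id": i, "sent": time.monotonic()})
    send_end.send(None)
    worker.join()
    return results


if __name__ == "__main__":
    for alert_id, waited in run_demo():
        print("alert %d waited %.3f s before processing" % (alert_id, waited))
```

If the real listener behaves like this, the wait time for each alert grows with the backlog, and with incomplete logging that could easily look like alerts going missing.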
My belief that the fundamental issue lay with how ApprovalProcessor blocked when processing alerts (exacerbated by time-out errors from GraceDb) is also motivated by the fact that "regular" lvalert_listen instances did not seem to run into this issue, or at least not as much. Those listeners simply forked a subprocess, often orphaning it, and then processed the next alert immediately. Therefore, if the subprocess blocked for a long time due to a time-out from GraceDb, this did not affect the listener's ability to process newer alerts.
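For contrast, the fork-and-move-on pattern described above can be sketched as follows. This is an illustration under assumptions, not the lvalert_listen source: `dispatch` and `listen` are hypothetical names, and the child process simply sleeps to imitate a handler stuck on a time-out. Unlike the listeners described above, this sketch waits for its children at the end so that it cleans up after itself.

```python
import subprocess
import sys
import time

# Child program: sleep for the number of seconds given as the first argument.
CHILD_CODE = "import sys, time; time.sleep(float(sys.argv[1]))"


def dispatch(alert, handler_seconds):
    """Launch a child process for one alert and return without waiting on it."""
    return subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE, str(handler_seconds), str(alert)]
    )


def listen(alerts, handler_seconds=0.5):
    """Dispatch every alert immediately; return (dispatch_time, total_time)."""
    start = time.monotonic()
    children = [dispatch(alert, handler_seconds) for alert in alerts]
    dispatch_time = time.monotonic() - start
    # Only the demo waits here; the dispatch loop above never blocks on a handler.
    for child in children:
        child.wait()
    total_time = time.monotonic() - start
    return dispatch_time, total_time


if __name__ == "__main__":
    dispatch_time, total_time = listen(["alert-1", "alert-2", "alert-3"])
    print("dispatched in %.3f s, handlers finished after %.3f s"
          % (dispatch_time, total_time))
```

The point of the comparison is that a slow handler only delays its own alert here, whereas in the serial sketch it delays every alert queued behind it.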
I also do not remember EventSupervisor suffering from dropped messages. EventSupervisor was built on top of lvalertMP like ApprovalProcessor, but its calls to parse_alert were quick and did not block for extended periods of time. This meant that it may have been able to process alerts without the same delays as ApprovalProcessor, even though other parts of EventSupervisor may have blocked for long periods of time.
/cc @alexander-pace @patrick-brady