investigate possible dropped messages

lscsoft/lvalert#3, opened Aug 01, 2018 by Reed Essick (@reed.essick)

Anecdotal evidence from O2 suggested that packets were occasionally dropped, meaning lvalert messages were not received by all listeners. It is not clear where the fault lies, as I'll try to explain below. We should determine whether dropped messages are actually an issue within the lvalert back-end. If that can be ruled out, then we know the problem was due to specific users' usage patterns.
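
To help settle this, something as simple as the sketch below could be wired into an lvalert_listen config as the executable for the relevant nodes, so that every payload actually delivered to the listener gets logged with a timestamp and can be compared against what was published. This is only a diagnostic sketch; it assumes the standard lvalert_listen behaviour of handing the alert to the executable on stdin, and the log path is a placeholder.

```python
#!/usr/bin/env python
# log_alert.py -- hypothetical diagnostic helper, not part of lvalert itself.
# Appends every alert delivered by lvalert_listen to a file with a timestamp,
# so the received stream can be compared against what was actually published.

import sys
import time

LOGFILE = '/tmp/lvalert_received.log'  # placeholder path

def main():
    payload = sys.stdin.read()  # lvalert_listen passes the alert on stdin
    with open(LOGFILE, 'a') as log:
        log.write('%.6f %s\n' % (time.time(), payload.replace('\n', ' ')))

if __name__ == '__main__':
    main()
```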

My recollection is that the majority of these issues occurred with ApprovalProcessor, which was built on top of lvalertMP. Because lvalertMP processed alerts fundamentally in series, time-consuming delegations to parse_alert could cause messages to back up in the Python multiprocessing.Pipe connection used to pass lvalert messages to separate subprocesses. If the Pipe had a limited size, this could have caused messages to be dropped. Alternatively, and what I believe to be more likely, it would cause delays in processing alerts, which could be interpreted as "dropping alerts" because of incomplete logging within ApprovalProcessor, etc.
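
To make that failure mode concrete, here is a toy model (not the actual lvalertMP code) of in-series handling: a parent pushes alerts into a multiprocessing.Pipe and a single child drains it, calling a slow stand-in for parse_alert on each one. The later alerts are not lost, but their lag grows with every slow call, which would look like a drop if nothing records delivery times.

```python
# Toy model of serial alert handling over a multiprocessing.Pipe.
import time
from multiprocessing import Pipe, Process

def parse_alert(alert):
    time.sleep(1.0)              # stand-in for a slow GraceDb query/annotation

def worker(conn):
    while True:
        msg = conn.recv()        # alerts queue up here while parse_alert runs
        if msg is None:
            break
        sent, payload = msg
        print('%s received after %.1f s' % (payload, time.time() - sent))
        parse_alert(payload)

if __name__ == '__main__':
    parent, child = Pipe()
    proc = Process(target=worker, args=(child,))
    proc.start()
    for i in range(5):           # five alerts arriving in quick succession
        parent.send((time.time(), 'alert-%d' % i))
    parent.send(None)            # sentinel to stop the worker
    proc.join()                  # lag grows steadily for later alerts
```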

As an example, ApprovalProcessor often queried GraceDb or attempted to annotate events as part of the call to parse_alert. When it hit time-out errors, it would block for several minutes at a time and effectively stop digesting new alerts even though they were delivered to the listener without issue.
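
One way to tell delay apart from genuine loss would be to wrap whatever parse_alert the process uses with some timing, roughly as below. The 'uid' field name is an assumption about the alert schema and may need adjusting; the point is just to record when each alert arrived and how long parse_alert blocked on it.

```python
# Hypothetical timing wrapper around an existing parse_alert callable.
import json
import logging
import time

logging.basicConfig(filename='/tmp/parse_alert_timing.log', level=logging.INFO)

def timed_parse_alert(parse_alert, payload):
    """Call parse_alert(payload) and record how long it blocked."""
    try:
        uid = json.loads(payload).get('uid', 'unknown')  # assumed field name
    except ValueError:
        uid = 'unparseable'
    start = time.time()
    logging.info('start uid=%s t=%.6f', uid, start)
    try:
        return parse_alert(payload)
    finally:
        logging.info('done uid=%s dt=%.1f s', uid, time.time() - start)
```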

My belief that the fundamental issue lay with how ApprovalProcessor blocked when processing alerts (exacerbated by time-out errors from GraceDb) is also motivated by the fact that "regular" lvalert_listen instances did not seem to run into this issue, or at least not as much. Those listeners simply forked a subprocess, often orphaning it, and then processed the next alert immediately. Therefore, if the subprocess blocked for a long time due to a time-out from GraceDb, this did not affect the listener's ability to process newer alerts.
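
For comparison, the fork-and-forget behaviour described above looks roughly like this sketch ('handle_alert.py' is a placeholder for whatever executable a real lvalert_listen config would name). Because the listener never waits on the child, a GraceDb time-out inside one child cannot delay delivery of later alerts, at the cost of potentially orphaned subprocesses.

```python
# Rough sketch of dispatching each alert to its own child process.
import subprocess

def dispatch(payload):
    proc = subprocess.Popen(['python', 'handle_alert.py'],  # placeholder handler
                            stdin=subprocess.PIPE)
    proc.stdin.write(payload.encode())
    proc.stdin.close()   # hand off the alert and return immediately
    return proc          # caller goes straight back to waiting for alerts
```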

I also do not remember EventSupervisor suffering from dropped messages. EventSupervisor was built on top of lvalertMP like ApprovalProcessor, but its calls to parse_alert were quick and did not block for extended periods of time. This meant it may have been able to process alerts without the same delays as ApprovalProcessor, even though other parts of EventSupervisor may have blocked for long periods of time.

/cc @alexander-pace @patrick-brady
