Draft: Remove duplicate labels on g-events, and improve query performance

Alexander Pace requested to merge fix-duplicate-labels into master

Background:

Bulk g-event queries (1000s to 10000s of events) are still really slow. One non-negligible contributor is the .distinct() method that makes sure the results contain only unique events. For queries on events with a given label or set of labels, this is only necessary when the query contains an "OR". For example, querying for "event has label A or B" returns all the events that have label A, followed by all the events that have label B, so events that have both labels A and B are double-counted in the results. If the query logic can be restructured so that .distinct() is only applied to the subset of queries that need it, every other query gets faster.
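
To make the double-counting concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than the actual GraceDB Django models. The table names, columns, and the `events_with_labels` helper are hypothetical and only illustrate the join behavior described above.

```python
import sqlite3

# Hypothetical two-table schema standing in for events and their labels.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE event (id INTEGER PRIMARY KEY, graceid TEXT);
    CREATE TABLE labelling (event_id INTEGER, label TEXT);
    INSERT INTO event VALUES (1, 'G1'), (2, 'G2'), (3, 'G3');
    INSERT INTO labelling VALUES (1, 'A'), (1, 'B'), (2, 'A'), (3, 'B');
""")


def events_with_labels(labels, distinct):
    """Return graceids of events carrying any of the given labels."""
    placeholders = ", ".join("?" for _ in labels)
    sql = (
        f"SELECT {'DISTINCT ' if distinct else ''}e.graceid "
        "FROM event e JOIN labelling l ON l.event_id = e.id "
        f"WHERE l.label IN ({placeholders}) ORDER BY e.graceid"
    )
    return [row[0] for row in conn.execute(sql, labels)]


# "label A or B": G1 carries both labels, so the join yields it twice.
or_raw = events_with_labels(["A", "B"], distinct=False)
# The same query with DISTINCT collapses the repeated row.
or_unique = events_with_labels(["A", "B"], distinct=True)
# A single-label query has at most one matching row per event here.
single = events_with_labels(["A"], distinct=False)

print(or_raw, or_unique, single)
```

In this model the single-label query is duplicate-free only because no event carries two copies of the same label, which is why the duplicate-label cleanup matters for dropping .distinct() safely.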

While testing, I noticed that a few g-events on production have multiple copies of the same label (ex: https://gracedb.ligo.org/events/G459308/view/). That's not supposed to happen since there's an API-level check, but these events exposed a race condition: when two processes try to add a label at the same time, the Labelling object can get created before the API check can catch it. For example, spiir and gwcelery applying the PASTRO_READY label at the same time:

[Screenshot: Screen_Shot_2024-09-25_at_5.46.05_PM]

Side note: there are only 19 g-events in the production database with duplicate labels, and 100% of them are spiir uploads. 🤷
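
The race can be illustrated with a check-then-insert sequence in which both writers run their existence check before either one inserts. This is a sketch under assumptions: it uses sqlite3 with a hypothetical `labelling` table and helper functions, and it interleaves the two writers by hand instead of using real concurrent processes.

```python
import sqlite3

# Hypothetical table with no uniqueness constraint on (event_id, label).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE labelling (event_id INTEGER, label TEXT)")


def label_exists(event_id, label):
    """Application-level check, analogous to an API-level guard."""
    row = conn.execute(
        "SELECT 1 FROM labelling WHERE event_id = ? AND label = ?",
        (event_id, label),
    ).fetchone()
    return row is not None


def insert_label(event_id, label):
    conn.execute("INSERT INTO labelling VALUES (?, ?)", (event_id, label))


# Both writers check first; neither sees the label yet.
writer_a_sees_label = label_exists(1, "PASTRO_READY")
writer_b_sees_label = label_exists(1, "PASTRO_READY")

# Each writer then acts on its own stale check result.
if not writer_a_sees_label:
    insert_label(1, "PASTRO_READY")
if not writer_b_sees_label:
    insert_label(1, "PASTRO_READY")

copies = conn.execute(
    "SELECT COUNT(*) FROM labelling WHERE event_id = 1 AND label = 'PASTRO_READY'"
).fetchone()[0]
print(copies)
```

Because the check and the insert are separate steps, only a constraint enforced by the database itself can close this window.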

All that being said, the steps to fix:

  • Add a tool to remove duplicate labels on g-events (and write a log message that it happened)
  • Add a database constraint (via unique_together) so that an event/superevent can have only one copy of any given label
  • Modify the write_label routines (in events, superevents, and event creation (maybe? probably not)) and write_log to catch the database constraint violation and return a meaningful message to the user.
  • Restructure g-event queries to add the .distinct() method only on queries that require it.
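
The first three steps above can be sketched end to end as follows. This is not the code in this merge request: it uses sqlite3 instead of the Django ORM, and the table, the index name, and both helper functions are hypothetical. In Django, the constraint would come from unique_together on the model and the error to catch would be django.db.IntegrityError.

```python
import sqlite3

# Hypothetical table seeded with one duplicated label on event 1.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE labelling (
        id INTEGER PRIMARY KEY, event_id INTEGER, label TEXT
    );
    INSERT INTO labelling (event_id, label) VALUES
        (1, 'PASTRO_READY'), (1, 'PASTRO_READY'),
        (1, 'SKYMAP_READY'), (2, 'PASTRO_READY');
""")


def remove_duplicate_labels(conn):
    """Keep the oldest row per (event_id, label); return how many were removed."""
    cur = conn.execute("""
        DELETE FROM labelling
        WHERE id NOT IN (
            SELECT MIN(id) FROM labelling GROUP BY event_id, label
        )
    """)
    return cur.rowcount


# Step 1: clean up existing duplicates and log that it happened.
removed = remove_duplicate_labels(conn)
print(f"removed {removed} duplicate label row(s)")

# Step 2: enforce uniqueness in the database (only possible after cleanup).
conn.execute(
    "CREATE UNIQUE INDEX uniq_event_label ON labelling (event_id, label)"
)


def write_label(conn, event_id, label):
    """Step 3: turn a constraint violation into a readable message."""
    try:
        conn.execute(
            "INSERT INTO labelling (event_id, label) VALUES (?, ?)",
            (event_id, label),
        )
    except sqlite3.IntegrityError:
        return f"Label {label} already exists on event {event_id}"
    return "created"


duplicate_result = write_label(conn, 1, "PASTRO_READY")
new_result = write_label(conn, 2, "SKYMAP_READY")
print(duplicate_result, "|", new_result)
```

The order matters in this sketch: creating the unique index before the cleanup would fail, since the existing duplicate rows already violate it.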
Edited by Alexander Pace
