Remove duplicate labels on g-events, and improve query performance
Background:
Bulk g-event queries (1,000s to 10,000s of events) are still really slow. One non-negligible contributor is the `.distinct()` method, which makes sure that the results contain only unique events. For queries on a given label or set of labels, this is only necessary when the query contains an "OR". For example, querying for "event has label A or B" returns all the events that have label A, followed by the events that have label B, so events that have both labels A and B are double-counted in the results. If the query logic were restructured so that `.distinct()` is only applied to the subset of queries that need it, queries would be faster for everyone else.
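To make the double-counting concrete, here is a minimal SQL-level sketch using Python's built-in `sqlite3` module rather than the actual GraceDB models. The table names, column names, and the `events_with_labels`/`query` helpers are all hypothetical; the point is only that the join produces repeated rows when more than one label is matched, so `DISTINCT` can be skipped for single-label queries.

```python
import sqlite3

# Hypothetical, simplified schema: one row per (event, label) pair.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE event (id INTEGER PRIMARY KEY, graceid TEXT);
    CREATE TABLE labelling (event_id INTEGER, label TEXT);
    INSERT INTO event VALUES (1, 'G1'), (2, 'G2'), (3, 'G3');
    INSERT INTO labelling VALUES (1, 'A'), (1, 'B'), (2, 'A'), (3, 'B');
""")


def events_with_labels(labels, distinct):
    """Return graceids of events that carry any of `labels`, via a JOIN."""
    placeholders = ", ".join("?" for _ in labels)
    sql = (
        f"SELECT {'DISTINCT ' if distinct else ''}e.graceid "
        "FROM event e JOIN labelling l ON l.event_id = e.id "
        f"WHERE l.label IN ({placeholders}) ORDER BY e.graceid"
    )
    return [row[0] for row in conn.execute(sql, labels)]


def query(labels):
    # Only an OR over several labels can match the same event more than once,
    # so DISTINCT is requested only in that case.
    return events_with_labels(labels, distinct=len(labels) > 1)


print(events_with_labels(["A"], distinct=False))       # single label: no repeats
print(events_with_labels(["A", "B"], distinct=False))  # G1 appears twice
print(query(["A", "B"]))                               # G1 appears once
```

Note that skipping `DISTINCT` for single-label queries is only safe if an event can never hold two copies of the same label.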
While doing some testing I noticed that a few g-events on production have multiple copies of the same label (ex: https://gracedb.ligo.org/events/G459308/view/). That's not supposed to happen, since there's an API-level check, but these events exposed a race condition: when two processes try to add a label at the same time, a second `Labelling` object can get created before the API check can catch it. This can happen when, for example, `spiir` and `gwcelery` both apply the `PASTRO_READY` label at the same time.
Side note: there are only 19 g-events in the production database with duplicate labels, and all of them are `spiir` uploads.
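The check-then-create race described above can be reproduced in a small simulation. This is not GraceDB code: the in-memory `labellings` list stands in for the `Labelling` table, the `add_label` helper and the graceid `G0001` are hypothetical, and a `threading.Barrier` is used only to force the unlucky interleaving deterministically.

```python
import threading

# Stands in for the Labelling table, with no uniqueness constraint.
labellings = []
both_checked = threading.Barrier(2)


def add_label(event, label, creator):
    # Step 1: the API-level check. It passes for both callers because
    # neither of them has written a row yet.
    already_present = any(e == event and l == label for e, l, _ in labellings)

    # Demo only: hold both callers here until each has finished its check.
    both_checked.wait()

    # Step 2: create the row based on the (now stale) result of the check.
    if not already_present:
        labellings.append((event, label, creator))


threads = [
    threading.Thread(target=add_label, args=("G0001", "PASTRO_READY", who))
    for who in ("spiir", "gwcelery")
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(labellings))  # two copies of the same label on one event
```

Because the check and the write are separate steps, no amount of application-level checking closes this window; the database has to enforce the uniqueness.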
All that being said, the steps to fix:
- Add a tool to remove duplicate labels on g-events (and write a log message that it happened)
- Add a database constraint (via `unique_together`) so that an event/superevent can have only one copy of any given label
- Modify `write_label` routines (in events, superevents, event creation (maybe? probably not), and `write_log`) to catch the database constraint violation and return a meaningful message to the user.
- Restructure g-event queries to add the `.distinct()` method only on queries that require it.
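The first three steps can be sketched end to end with Python's built-in `sqlite3` module. This is an illustration under assumptions, not the proposed implementation: the `labelling` table, the `EXAMPLE_LABEL` value, and the helper names and messages are hypothetical. In Django, `unique_together` on the model's `Meta` creates a comparable unique constraint, and a violation surfaces as `django.db.IntegrityError` rather than `sqlite3.IntegrityError`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE labelling (id INTEGER PRIMARY KEY, event_id INTEGER, label TEXT);
    INSERT INTO labelling (event_id, label) VALUES
        (1, 'PASTRO_READY'), (1, 'PASTRO_READY'), (1, 'EXAMPLE_LABEL');
""")


def remove_duplicate_labels():
    """Keep the oldest row per (event, label), delete the rest, log each removal."""
    dupes = conn.execute(
        "SELECT id, event_id, label FROM labelling "
        "WHERE id NOT IN (SELECT MIN(id) FROM labelling GROUP BY event_id, label)"
    ).fetchall()
    for row_id, event_id, label in dupes:
        conn.execute("DELETE FROM labelling WHERE id = ?", (row_id,))
        print(f"Removed duplicate label {label} from event {event_id}")
    return len(dupes)


def add_unique_constraint():
    # Must run after the cleanup, otherwise index creation itself fails.
    conn.execute(
        "CREATE UNIQUE INDEX uniq_event_label ON labelling (event_id, label)"
    )


def write_label(event_id, label):
    """Insert a label and turn a constraint violation into a readable message."""
    try:
        conn.execute(
            "INSERT INTO labelling (event_id, label) VALUES (?, ?)",
            (event_id, label),
        )
    except sqlite3.IntegrityError:
        return f"Label {label} has already been applied to event {event_id}"
    return f"Label {label} applied to event {event_id}"


removed = remove_duplicate_labels()
add_unique_constraint()
print(write_label(1, "PASTRO_READY"))  # rejected by the constraint
print(write_label(2, "PASTRO_READY"))  # accepted: different event
```

The order matters here: existing duplicates have to be cleaned up before the constraint is added, since the constraint cannot be created while violating rows are still present.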