Only apply .distinct() operator on label queries that contain an 'OR'
Querying events and superevents in GraceDB is a three-step process:
- First query for event/superevent parameters (such as far, created time, snr, pipeline, any event attribute)
- filter for whatever labels might be associated with the event.
- Construct a unique (distinct) queryset based on the results of the last two filters.
After performing some test queries on gracedb-dev1 and gracedb-playground I found that the .distinct()
operator is only necessary in a subset of queries: when a user searches for events/superevents that have multiple labels, and there is an OR
conditional in the query. The short answer is because labels <--> events/superevents are a many-to-many relationship with the Labelling
object as a go-between for the label and event models. So querying for events with label A or label B will fetch the Labelling
objects for events with A or B. If an event has both labels A and B, the query would fetch two Labelling
objects and the event would be double-counted in the query.
As a test I queried events for two labels individually, then together with an &
and then with a |
. I then repeated the query without the .distinct()
operator and looked at the discrepancy:
Without applying the .distinct()
operator, events in the "or" search are double-counted for the reason above. For non-label queries, the many-to-many relationship does not exist. So when queries on event/superevent parameters are passed to the database, postgres returns only unique events that meet those parameters. So for example, the and/or tests were repeated for far and snr ranges, with and without the distinct()
:
Here the .distinct()
is unnecessary since the database will retrieve only the events whose columns meet the query criteria.
Okay so what does it matter? Asking for a unique set of objects is more work for the database, especially when there are large numbers of results in the query. Here's a scenario: a user queries for a large number of events. The event query API will serve events back in chunks (I think it's 100 events/chunk?), and part of the process is that the queryset gets (counted) each time a chunk is processed. Here is database load during a large query on gracedb-playground this week, and the two largest contributors each contain a SELECT DISTINCT
that traces back to the operator described above.
In real world terms, analyzing the SQL that counts the number of events for the non-distinct query:
gracedb_playground=> explain analyze SELECT COUNT(*) AS "__count" FROM "events_event" INNER JOIN "events_group" ON ("events_event"."group_id" = "events_group"."id") LEFT OUTER JOIN "events_search" ON ("events_event"."search_id" = "events_search"."id") INNER JOIN "events_labelling" ON ("events_event"."id" = "events_labelling"."event_id") INNER JOIN "events_label" ON ("events_labelling"."label_id" = "events_label"."id") WHERE (NOT ("events_group"."name" = 'Test') AND NOT ("events_search"."name" = 'MDC' AND "events_search"."name" IS NOT NULL) AND "events_label"."name" = 'PASTRO_READY');
---------
1153822
Planning Time: 0.825 ms
Execution Time: 4649.512 ms
It's over a million objects so it took a while to count. And for the distinct query:
gracedb_playground=> explain analyze SELECT COUNT(*) FROM (SELECT DISTINCT "events_event"."id" AS "col1", "events_event"."submitter_id" AS "col2", "events_event"."created" AS "col3", "events_event"."group_id" AS "col4", "events_event"."superevent_id" AS "col5", "events_event"."pipeline_preferred_id" AS "col6", "events_event"."pipeline_id" AS "col7", "events_event"."search_id" AS "col8", "events_event"."instruments" AS "col9", "events_event"."nevents" AS "col10", "events_event"."far" AS "col11", "events_event"."likelihood" AS "col12", "events_event"."gpstime" AS "col13", "events_event"."perms" AS "col14", "events_event"."offline" AS "col15", "events_event"."graceid" AS "col16", "events_event"."reporting_latency" AS "col17" FROM "events_event" INNER JOIN "events_group" ON ("events_event"."group_id" = "events_group"."id") LEFT OUTER JOIN "events_search" ON ("events_event"."search_id" = "events_search"."id") INNER JOIN "events_labelling" ON ("events_event"."id" = "events_labelling"."event_id") INNER JOIN "events_label" ON ("events_labelling"."label_id" = "events_label"."id") WHERE (NOT ("events_group"."name" = 'Test') AND NOT ("events_search"."name" = 'MDC' AND "events_search"."name" IS NOT NULL) AND "events_label"."name" = 'PASTRO_READY')) subquery;
---------
1153822
Planning Time: 1.090 ms
Execution Time: 7556.512 ms
7.5 seconds vs 4.6 seconds. Bear in mind however that time difference is (for a large query), per chunk returned. So for hundreds of chunks that time difference could mean minutes or hours off a bulk query.
After trying a few permutations, I finally arrived on this solution that only applies the distinct()
operator to queries that i described above: when search for multiple labels, with an "OR". The patch passes the unit tests that specifically test label queries and ensure the results are as expected.
I'm going to run some test queries, with and without the patch to see that it produces the same results and in what time difference and then report back on this ticket before merging in.