GraceDB Server issues
https://git.ligo.org/computing/gracedb/server/-/issues

https://git.ligo.org/computing/gracedb/server/-/issues/22
Overhaul of search feature (Tanner Prestegard, updated 2022-08-03)

Started on April 15, 2017. Copied from redmine (https://bugs.ligo.org/redmine/issues/5432).
The search feature really needs to be redone. There are several requests for new features (#1337, #2175, #3543, #5052) and the code (gracedb/query.py) is really clunky. There is also a serious lack of consistency regarding when logical operators, quotes, keywords, etc. can/should be used.
Ideas from Patrick:
- define a "language" for the search and STICK TO IT. Can get ideas from Google and other search syntaxes (see the sketch after this list).
- get feedback from users on any commonly used searches (primarily by automated systems) in order to make sure they don't break with the update (may have to break them, we'll see)
- could be similar to natural language processing
- expand search capabilities beyond what we have now, including the ability to search by mass and other parameters
- improve overall architecture
- think about design, understand uses, make a ~1 page write-up describing your plan
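As a concrete (and entirely hypothetical) illustration of the first idea, a small, consistent `field:value` grammar could be tokenized and mapped onto ORM filters. The field names and lookups below are placeholders, not GraceDB's actual schema or the existing behavior of gracedb/query.py:

```python
# Hypothetical sketch of a small, consistent "field:value" search grammar.
# Field names and lookups are placeholders, not GraceDB's real query syntax.
import shlex

from django.db.models import Q

SEARCHABLE_FIELDS = {
    "pipeline": "pipeline__name__iexact",
    "group": "group__name__iexact",
    "label": "labels__name__iexact",
    "far": "far__lte",
}

def parse_query(text):
    """Turn e.g. 'pipeline:gstlal label:INJ 1126259462' into a Q object."""
    q = Q()
    for token in shlex.split(text):
        field, sep, value = token.partition(":")
        if sep:
            lookup = SEARCHABLE_FIELDS.get(field.lower())
            if lookup is None:
                raise ValueError(f"unknown search field: {field!r}")
            q &= Q(**{lookup: value})
        else:
            q &= Q(gpstime=float(token))  # bare numbers mean an exact GPS time
    return q
```

A single `filter(parse_query(text))` call on whatever the real event model is could then serve both the web form and the API, which would go a long way toward the consistency goal.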

https://git.ligo.org/computing/gracedb/server/-/issues/21
Introduce type-ahead- or tab-completion-like features to the GraceDB search (Tanner Prestegard, updated 2022-08-03)

Started on May 9, 2014 by Branson. Copied from redmine (https://bugs.ligo.org/redmine/issues/1337)
From a conversation with Fan, Erik, and Patrick on May 8, 2014.
Fan suggested a type-ahead feature, as in Google search. You start typing, and the event list is narrowed down as you go, before your very eyes.
I pointed out that this might be difficult, as we can't load huge numbers of events into a datastore in order to facilitate this.
Patrick suggested that even a keyword-completion feature would be really useful. If you start typing 'Te...', GraceDB could fill in 'Test' by looking at its lexicon of keywords.
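A minimal sketch of that keyword-completion idea, assuming a server-side lexicon of known keywords (the lexicon contents here are invented for illustration):

```python
# Prefix completion over a sorted lexicon of search keywords. The lexicon
# contents are invented; a real one would be built from the pipelines, groups,
# labels, etc. known to the server.
import bisect

LEXICON = sorted(["Burst", "CBC", "EM_READY", "GRB", "Hardware Injection", "Test"])

def complete(prefix, limit=10):
    """Return up to `limit` lexicon entries starting with `prefix`."""
    start = bisect.bisect_left(LEXICON, prefix)
    matches = []
    for word in LEXICON[start:start + limit]:
        if not word.startswith(prefix):
            break
        matches.append(word)
    return matches

print(complete("Te"))  # ['Test']
```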

https://git.ligo.org/computing/gracedb/server/-/issues/17
Unit tests (Tanner Prestegard, updated 2022-08-03)

The unit tests are really lacking and are absolutely needed, especially for authentication and permissions.

https://git.ligo.org/computing/gracedb/server/-/issues/16
Refurbish events API (Tanner Prestegard, updated 2022-08-03)

The events API needs to be redone for a few reasons:
1. Incomplete validation and error handling
2. Difficult to implement permissions - redoing this would make #15 much easier
3. Many redundancies and inefficiencies
4. Doesn't make use of the builtin features in django-rest-framework
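On point 4, a hedged sketch of what leaning on django-rest-framework's built-ins could look like; the model, fields, and import path are placeholders rather than the actual GraceDB events code:

```python
# Illustrative only: a serializer/viewset pair that gets field validation,
# pagination, and consistent error responses from django-rest-framework
# instead of hand-rolled request handling. Names are placeholders.
from rest_framework import serializers, viewsets

from events.models import Event  # placeholder import path

class EventSerializer(serializers.ModelSerializer):
    class Meta:
        model = Event
        fields = ["graceid", "group", "pipeline", "gpstime", "far"]

class EventViewSet(viewsets.ModelViewSet):
    queryset = Event.objects.all()
    serializer_class = EventSerializer
```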
One possible difficulty is that some changes might require corresponding client changes, so we might run into yet another server-client incompatibility.

https://git.ligo.org/computing/gracedb/server/-/issues/344
Unsafe search response DB operations (Daniel Wysocki, updated 2024-03-27)

Sentry reported an `IndexError` in `search.response.event_datatables_response` [here](https://ligo-caltech.sentry.io/issues/5107058502/?alert_rule_id=710526&alert_timestamp=1711566827464&alert_type=email&environment=production&notification_uuid=abebba78-dd00-4d76-b6e1-5f05b8265faa&project=1456379&referrer=alert_email).
Looking into it, I've realized that [this call to `count()`](https://git.ligo.org/computing/gracedb/server/-/blob/77f15d0b34598612f347216aa0e323296b400fe3/gracedb/search/response.py#L348) performs a SQL [`SELECT COUNT(*)`](https://docs.djangoproject.com/en/4.2/ref/models/querysets/#django.db.models.query.QuerySet.count), which can then be outdated by the time we [iterate over a second query](https://git.ligo.org/computing/gracedb/server/-/blob/77f15d0b34598612f347216aa0e323296b400fe3/gracedb/search/response.py#L354). This all seems to be part of optimizations made in !163. I would consider reverting that MR, or, if avoiding `list.append` is actually having a measurable performance benefit, using something like a pre-allocated 2D buffer array.
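For illustration, another way around the race (a sketch, not the code in `response.py`) is to evaluate the filtered queryset once and derive both the rows and the count from that single snapshot, so they cannot disagree:

```python
# Sketch: evaluate the queryset once so the reported total and the rows sent
# back to DataTables come from the same snapshot of the table. Column choices
# are illustrative, not the actual response.py fields.
def event_datatables_payload(queryset, draw):
    rows = list(queryset)               # single SELECT; no later, second query
    return {
        "draw": draw,
        "recordsFiltered": len(rows),   # consistent with `data` by construction
        "data": [[e.graceid, e.created] for e in rows],
    }
```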

https://git.ligo.org/computing/gracedb/server/-/issues/342
Smooth deployment on Kubernetes (Sara Vallero, updated 2024-03-15)

This is to upstream all the patches currently implemented for the deployment of gracedb-test01.igwn.org or a sandboxed deployment on Minikube.
- [ ] unauthenticated access to hopskotch (https://git.ligo.org/computing/gracedb/server/-/merge_requests/205)
- [ ] generic site name (https://git.ligo.org/computing/gracedb/server/-/merge_requests/206)
- [ ] username/password auth

https://git.ligo.org/computing/gracedb/server/-/issues/334
Expanded API calls for analytics (Alexander Pace, updated 2023-11-13)

From an email chain with @andrew.toivonen, @michael-coughlin, @sushant.sharma-chaudhary:
```
Alex,
Following up on your email, we had a discussion as a group about what GraceDB API changes could be useful.
For some context, these are the scripts (and what they fetch) that we have used in the past to fetch from GraceDB/GraceDB Playground:
Playground:
All MDC events (from a range of gpstimes): https://git.ligo.org/emfollow/em-properties/mdc-analytics/-/blob/main/fetch_data/events_from_gracedb.py
MDC Skymaps: https://git.ligo.org/emfollow/em-properties/mdc-analytics/-/blob/main/fetch_data/fetch_skymaps.py
MDC Posterior Samples (from a range of gpstimes): https://git.ligo.org/emfollow/em-properties/mdc-analytics/-/blob/main/fetch_data/fetch_all_PE.py
GraceDB
All data products from a superevent: https://git.ligo.org/emfollow/em-properties/mdc-analytics/-/blob/main/fetch_data/fetch_superevent.py
Posterior Samples from a single event: https://git.ligo.org/emfollow/em-properties/mdc-analytics/-/blob/main/fetch_data/fetch_PE.py
GCN latencies: https://git.ligo.org/emfollow/em-properties/mdc-analytics/-/blob/main/fetch_data/fetch_O4_gcn.py
First off, if you feel any of these scripts are poorly optimized feel free to let us know. This brings me to my next thought, we know that bulk fetching from Playground for the MDC is very resource intensive and has caused issues in the past. I however think there will always be a need for bulk fetching when it comes to the MDC, simply due to the nature of the study and the numerous triggers. Part of the strain was also caused due to the fact that we did not fetch in an optimized manner (and maybe our method could be optimized event further), so one possible addition to the API would be adding a call to fetch a table of all event quantities as we did, yet done how you would optimize such a query. The same could be said for event data products, such as PE and skymaps. We were maybe wondering if there was a way to add a call that would simply download a file, without having to save it or a list of files as an object?
As for fetching from GraceDB, I think in general our studies will be focused on specific or a small subset of events. What could be most useful would be a call to download the latest skymap or latest posterior samples for a given event. Finally, I know latency was added to the GraceDB page, how is that latency defined? And is there an easy way to fetch that value? Fetching all the latencies for a range of gpstimes or just the entire observing run would be useful as well. Maybe it would also be good to include the ability to fetch all superevents, or just significant ones.
These were our initial thoughts without a great idea of which of these are most easily implemented and would make a difference.
Let us know what you think,
Andrew
```
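To make the bulk-fetch discussion concrete, here is a hedged sketch using the `ligo-gracedb` REST client; the query string and skymap filename are assumptions for illustration, and the exact method signatures should be checked against the client version in use:

```python
# Hedged sketch with the ligo-gracedb client: list superevents matching a
# query, then download one named data product per superevent. The query
# string and filename below are placeholders, not a recommended workflow.
from ligo.gracedb.rest import GraceDb

client = GraceDb()  # defaults to the production server

for s in client.superevents(query="far < 3.9e-8"):      # placeholder query
    sid = s["superevent_id"]
    try:
        payload = client.files(sid, "bayestar.multiorder.fits").read()
    except Exception:
        continue  # this superevent may not have that file
    with open(f"{sid}-bayestar.multiorder.fits", "wb") as out:
        out.write(payload)
```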

https://git.ligo.org/computing/gracedb/server/-/issues/333
search feature not working (Kipp Cannon, updated 2023-11-13)

## Description of problem
Search feature is not working
## Expected behavior
Go to https://gracedb.ligo.org/search/ and enter "S231020", select "Superevent", click "Search". Only one entry appears, labelled "S231020a". But there were many events that day, including "S231020bw", which is the one I was trying to find. Entering "S231020bw" into the search term produces the desired entry. If the search is not implicitly a wild-card search, why does the "a" event appear? If it is a wild-card search, why doesn't the "bw" event appear?
## Steps to reproduce
See above.
## Context/environment
My web browser.
## Suggested solutions
Fix the search feature, or modify the Query Help page to explain how to properly do wild-card searches. Thanks.

https://git.ligo.org/computing/gracedb/server/-/issues/332
number of log annotations on S190412m causes browser requests to hit timeout (Alexander Pace, updated 2023-10-16)

Attempting to load the internal page for [S190412m](https://gracedb.ligo.org/superevents/S190412m/view/) results in a timeout because the time to retrieve the number of log entries on that event exceeds the 30 second timeout in gunicorn. Note that I had previously implemented a check for a maximum number of log messages to display for g-events (in response to RAVEN repeatedly annotating external events for years on end), but this check never got ported over to superevents. @roberto.depietri brought this up on the [emfollow dev call](https://git.ligo.org/emfollow/gwcelery/-/wikis/telcons/2023-10-16) this morning.

Okay, so what is it about this superevent, and who's writing all those log messages? I went into the database console to see where all the annotations were coming from, and I believe they were from the `detchar` user, who annotated the superevent 877 times:
```
In: m = Superevent.get_by_date_id('S190412m')
In: m.log_set.exclude(comment__contains='Tagged message').filter(issuer=detchar).count()
Out: 877
```
This user accessed GraceDB with one of the following certificate subjects back in 2019.
And it looks like there was a server error of some sort (not related to GraceDB as far as I can tell) that prevented some data products from being uploaded, because the `Detchar` log messages are mostly ones like these:
```
2019-04-12 05:33:01.508952+00:00  Attempted upload of 'L1ligocam-S190412m.json' failed due to server issues [message edited by administrator]
2019-04-12 05:33:00.555024+00:00  Attempted upload of 'L1ligocam-S190412m.json' failed due to server issues [message edited by administrator]
2019-04-12 05:32:59.357731+00:00  Attempted upload of 'L1ligocam-S190412m.json' failed due to server issues [message edited by administrator]
2019-04-12 05:32:58.161597+00:00  Attempted upload of 'L1ligocam-S190412m.json' failed due to server issues [message edited by administrator]
2019-04-12 05:32:57.160221+00:00  Attempted upload of 'L1ligocam-S190412m.json' failed due to server issues [message edited by administrator]
2019-04-12 05:32:56.204876+00:00  Attempted upload of 'L1ligocam-S190412m.json' failed due to server issues [message edited by administrator]
2019-04-12 05:32:55.161672+00:00  Attempted upload of 'L1ligocam-S190412m.json' failed due to server issues [message edited by administrator]
2019-04-12 05:32:54.276859+00:00  Attempted upload of 'L1ligocam-S190412m.json' failed due to server issues [message edited by administrator]
2019-04-12 05:32:06.365500+00:00  Attempted upload of 'L1ligocam-S190412m.json' failed due to server issues [message edited by administrator]
2019-04-12 05:32:04.341545+00:00  Attempted upload of 'L1ligocam-S190412m.json' failed due to server issues [message edited by administrator]
2019-04-12 05:32:03.398589+00:00  Attempted upload of 'L1ligocam-S190412m.json' failed due to server issues [message edited by administrator]
```
I've attached the timestamp and comment of each one of the detchar log messages to this issue. [S190412m-detchar-errors.txt](/uploads/3cf5e528934a4af0d2d5af6498531875/S190412m-detchar-errors.txt)
I'll go ahead and implement the maximum-log-messages check for superevents. @roberto.depietri, if there's anything else you need to help interrogate this 4-year-old superevent, please let me know.
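For reference, a sketch of what the superevent-side cap could look like; the setting name, field names, and view structure are assumptions, not the existing g-event implementation:

```python
# Sketch only: cap how many log messages the superevent page renders, and
# flag when the list has been truncated. The setting and field names are
# assumptions for illustration.
MAX_DISPLAYED_LOG_MESSAGES = 500

def superevent_log_context(superevent):
    logs = superevent.log_set.all().order_by("-created")
    total = logs.count()
    return {
        "logs": logs[:MAX_DISPLAYED_LOG_MESSAGES],        # becomes a SQL LIMIT
        "log_count": total,
        "logs_truncated": total > MAX_DISPLAYED_LOG_MESSAGES,
    }
```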

https://git.ligo.org/computing/gracedb/server/-/issues/329
GraceDB uploads from the lensing pipelines (Ian Harry, updated 2024-01-22)

I was asked to move this discussion here from an email thread. I'll copy/paste the emails into here one by one, starting with the top-level description/problem statement:
We've been discussing on the PyCBC end about the deployment of the O4a
lensing search pipeline, and there was a question about GraceDB
interaction, which I wanted to bring to some experts. I hope I'm
reaching the right GraceDB experts here (alongside the GstLAL lensing
search leads, and search chairs in CC), but let me know if I'm missing
anyone.
Just as an overview/reminder. The lensing searches are run as a
followup to known CBC triggers. They perform a focused search on a
narrow range of parameters around the values obtained from the known
event. Motivation is that there might be a lensed event which appears
as two "images" on Earth, one with SNR > 8 and one with SNR < ~8. The
first "image" can be found by our standard all-sky searches, but the
second might only be extracted if we use information from the first
image.
Practically this means that we will have a set of search triggers for
*every* CBC candidate at *all* times in O4, from both GstLAL and
PyCBC.
These searches will be recovering *other* known BBH events, and given
a bulk of events around mchirp ~ 40, we will likely have some events
recovered in *multiple* lensing searches.
The question is how would we process this in GraceDB? Would we be
uploading all triggers (above some threshold?) to GraceDB? This has
the potential to make some superevents quite confusing on internal
views if there are numerous lensed triggers alongside the numerous
online and offline all sky triggers. What search
tags/columns/names/whatever would be used? Has GstLAL already got a
plan in place for doing this? Any other thoughts?
Thanks!
Ian

https://git.ligo.org/computing/gracedb/server/-/issues/328
DQR link in the RRT view should be linked to DQR 5-minutes tier URL (Keita Kawabe, updated 2023-10-03)
In the RRT view for S-event, "Data quality report" is linked to https://ldas-jobs.ligo.caltech.edu/~dqr/o4dqr/online/events/YYYYMM/SYYMMDDabcd/, e.g. https://ldas-jobs.ligo.caltech.edu/~dqr/o4dqr/online/events/202309/S230927l/.
Responders have to open the link, click "tasks by tier" and click "5 min" to open a different URL in the form of https://ldas-jobs.ligo.caltech.edu/~dqr/o4dqr/online/events/YYYYMM/SYYMMDDabcd/5_min_tier_index.html, e.g. https://ldas-jobs.ligo.caltech.edu/~dqr/o4dqr/online/events/202309/S230927l/5_min_tier_index.html.
Since the only tasks RRT shifters are interested in are under the 5-minutes tier anyway, the link should point to https://ldas-jobs.ligo.caltech.edu/~dqr/o4dqr/online/events/YYYYMM/SYYMMDDabcd/5_min_tier_index.html.
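The change amounts to pointing the existing link at the 5-minute-tier index page; a tiny, hypothetical sketch of the URL construction:

```python
# Hypothetical helper following the URL pattern quoted above; not the actual
# GraceDB template or view code.
DQR_BASE = "https://ldas-jobs.ligo.caltech.edu/~dqr/o4dqr/online/events"

def dqr_5min_url(superevent_id, yyyymm):
    """e.g. dqr_5min_url('S230927l', '202309')"""
    return f"{DQR_BASE}/{yyyymm}/{superevent_id}/5_min_tier_index.html"
```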

https://git.ligo.org/computing/gracedb/server/-/issues/324
Traefik returns 502 when Gunicorn restarts (Alexander Pace, updated 2023-10-04)

Here's the scenario: GraceDB will instantly (no 30 second timeout) return a 502 proxy error to the client, then the client code retries and everything works.
Further investigation shows that there are no errors in the gracedb (django/gunicorn) logs, but there is one in the webgateway/traefik logs, e.g.:
```
# grep -n '" 502 ' *.log
gracedb-swarm-production-us-west-2c-docker-mgr-01.log:49112:Aug 1 13:45:34 gracedb-swarm-production-us-west-2c-docker-mgr-01 gracedb_docker_webgateway_webgateway.3.91ly1u1voskgcnyo78u2t0xq1: 131.215.113.150 - - [01/Aug/2023:13:45:34 +0000] "POST /api/events/ HTTP/1.1" 502 11 "-" "-" 613542 "gracedb@docker" "http://10.0.1.56:80" 1ms
```
At the same time there's a block in gracedb's logs like:
```
Aug 1 13:45:34 gracedb-swarm-production-us-west-2c-docker-mgr-01 gracedb_docker_gracedb_gracedb.1.we7ztc1fre4kje94k14lz3cy4: GUNICORN | [2023-08-01 13:45:34 +0000] [3589] [INFO] Autorestarting worker after current request.
Aug 1 13:45:34 gracedb-swarm-production-us-west-2c-docker-mgr-01 gracedb_docker_gracedb_gracedb.1.we7ztc1fre4kje94k14lz3cy4: GUNICORN | [2023-08-01 13:45:34 +0000] [3589] [INFO] Worker exiting (pid: 3589)
Aug 1 13:45:35 gracedb-swarm-production-us-west-2c-docker-mgr-01 gracedb_docker_gracedb_gracedb.1.we7ztc1fre4kje94k14lz3cy4: GUNICORN | [2023-08-01 13:45:35 +0000] [11186] [INFO] Booting worker with pid: 11186
Aug 1 13:45:35 gracedb-swarm-production-us-west-2c-docker-mgr-01 gracedb_docker_gracedb_gracedb.1.we7ztc1fre4kje94k14lz3cy4: GUNICORN | [2023-08-01 13:45:35 +0000] [11186] [INFO] Worker spawned (pid: 11186)
```
Automatic restarting is controlled [here](https://git.ligo.org/computing/gracedb/server/-/blob/81847bbf401c99dabd36d39d66aab5f95deae6d3/config/gunicorn_config.py#L74-86) and [here](https://git.ligo.org/computing/gracedb/deployment/-/blob/0cd096d8230e9a01dadeeed66609d8939dc1129c/swarm-stacks/gracedb-prod-stack.yml#L100-101), and it is used to avoid possible memory leaks. As far as I can tell, this restart/502 hasn't actually affected low-latency operations, as gwcelery has retried and succeeded each time. `pycbclive` did ping about a 502 twice (2023-07-24 15:21:20 UTC on playground and 2023-07-29 10:48:19 on prod), but as far as I can tell the request was subsequently retried by the client code and succeeded.
Possible solutions: do nothing, since clients are retrying and succeeding; or increase the maximum number of requests, and the jitter, to space the restarts out and make them less frequent.
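If the "space them out" option is pursued, the relevant knobs live in the gunicorn config file; the numbers below are illustrative, not the values in the linked deployment files:

```python
# gunicorn_config.py sketch: recycle workers less often and desynchronize the
# recycling so the proxy is less likely to hit a worker mid-restart. Values
# are illustrative only.
max_requests = 5000          # requests a worker handles before being recycled
max_requests_jitter = 1000   # per-worker randomization so restarts don't align
graceful_timeout = 30        # seconds an exiting worker gets to finish requests
```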

https://git.ligo.org/computing/gracedb/server/-/issues/323
Consider increasing the configuration parameter "max_wal_size". (Alexander Pace, updated 2023-07-28)

There were some timeouts on `gracedb-playground` this afternoon (2023-07-28) from around 18:40-18:43ish UTC that I think were triggered in some part by a `VACUUM FULL` when I was doing some exploratory maintenance on playground's db. During the period in question there were the following lines in `gracedb-playground`'s RDS logs:

```
2023-07-28 18:35:50 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:36:12 UTC::@:[393]:LOG: checkpoint complete: wrote 39902 buffers (16.5%); 0 WAL file(s) added, 0 removed, 16 recycled; write=20.183 s, sync=1.326 s, total=21.691 s; sync files=211, longest=1.323 s, average=0.007 s; distance=1048579 kB, estimate=1048579 kB
2023-07-28 18:36:13 UTC::@:[393]:LOG: checkpoints are occurring too frequently (23 seconds apart)
2023-07-28 18:36:13 UTC::@:[393]:HINT: Consider increasing the configuration parameter "max_wal_size".
2023-07-28 18:36:13 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:36:39 UTC::@:[393]:LOG: checkpoint complete: wrote 231 buffers (0.1%); 0 WAL file(s) added, 0 removed, 13 recycled; write=25.661 s, sync=0.420 s, total=26.123 s; sync files=112, longest=0.399 s, average=0.004 s; distance=1048586 kB, estimate=1048586 kB
2023-07-28 18:36:49 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:37:14 UTC::@:[393]:LOG: checkpoint complete: wrote 2019 buffers (0.8%); 0 WAL file(s) added, 2 removed, 17 recycled; write=24.321 s, sync=0.191 s, total=25.505 s; sync files=138, longest=0.190 s, average=0.002 s; distance=1049475 kB, estimate=1049475 kB
2023-07-28 18:37:17 UTC::@:[393]:LOG: checkpoints are occurring too frequently (28 seconds apart)
2023-07-28 18:37:17 UTC::@:[393]:HINT: Consider increasing the configuration parameter "max_wal_size".
2023-07-28 18:37:17 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:37:24 UTC::@:[393]:LOG: checkpoint complete: wrote 69 buffers (0.0%); 0 WAL file(s) added, 0 removed, 10 recycled; write=6.996 s, sync=0.342 s, total=7.539 s; sync files=34, longest=0.342 s, average=0.011 s; distance=1065103 kB, estimate=1065103 kB
2023-07-28 18:37:30 UTC::@:[393]:LOG: checkpoints are occurring too frequently (13 seconds apart)
2023-07-28 18:37:30 UTC::@:[393]:HINT: Consider increasing the configuration parameter "max_wal_size".
2023-07-28 18:37:30 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:37:33 UTC::@:[393]:LOG: checkpoint complete: wrote 4 buffers (0.0%); 0 WAL file(s) added, 0 removed, 9 recycled; write=0.480 s, sync=0.190 s, total=2.933 s; sync files=4, longest=0.190 s, average=0.048 s; distance=1056458 kB, estimate=1064239 kB
2023-07-28 18:38:33 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:38:49 UTC::@:[393]:LOG: checkpoint complete: wrote 171 buffers (0.1%); 0 WAL file(s) added, 0 removed, 19 recycled; write=15.533 s, sync=0.120 s, total=16.420 s; sync files=89, longest=0.120 s, average=0.002 s; distance=1034294 kB, estimate=1061244 kB
2023-07-28 18:39:19 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:39:36 UTC::@:[393]:LOG: checkpoint complete: wrote 171 buffers (0.1%); 0 WAL file(s) added, 0 removed, 14 recycled; write=17.051 s, sync=0.006 s, total=17.104 s; sync files=94, longest=0.006 s, average=0.001 s; distance=1063328 kB, estimate=1063328 kB
2023-07-28 18:40:59 UTC::@:[393]:LOG: checkpoint complete: wrote 517 buffers (0.2%); 0 WAL file(s) added, 11 removed, 17 recycled; write=28.949 s, sync=0.112 s, total=29.842 s; sync files=181, longest=0.111 s, average=0.001 s; distance=1040638 kB, estimate=1061059 kB
2023-07-28 18:41:00 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:41:11 UTC::@:[393]:LOG: checkpoint complete: wrote 118 buffers (0.0%); 0 WAL file(s) added, 0 removed, 14 recycled; write=10.732 s, sync=0.280 s, total=11.601 s; sync files=47, longest=0.280 s, average=0.006 s; distance=1084223 kB, estimate=1084223 kB
2023-07-28 18:41:14 UTC::@:[393]:LOG: checkpoints are occurring too frequently (14 seconds apart)
2023-07-28 18:41:14 UTC::@:[393]:HINT: Consider increasing the configuration parameter "max_wal_size".
2023-07-28 18:41:14 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:41:16 UTC::@:[393]:LOG: checkpoint complete: wrote 4 buffers (0.0%); 0 WAL file(s) added, 0 removed, 5 recycled; write=1.227 s, sync=0.054 s, total=2.786 s; sync files=2, longest=0.054 s, average=0.027 s; distance=1037553 kB, estimate=1079556 kB
2023-07-28 18:42:12 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:42:16 UTC::@:[393]:LOG: checkpoint complete: wrote 34 buffers (0.0%); 0 WAL file(s) added, 0 removed, 18 recycled; write=3.448 s, sync=0.090 s, total=3.948 s; sync files=22, longest=0.090 s, average=0.005 s; distance=1012093 kB, estimate=1072810 kB
2023-07-28 18:43:39 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:43:41 UTC::@:[393]:LOG: checkpoint complete: wrote 11 buffers (0.0%); 0 WAL file(s) added, 0 removed, 16 recycled; write=1.116 s, sync=0.181 s, total=2.198 s; sync files=8, longest=0.181 s, average=0.023 s; distance=1103069 kB, estimate=1103069 kB
```
This also occurred during a period of high relational load in the database:
![Screen_Shot_2023-07-28_at_3.13.56_PM](/uploads/95a62730a64f5a8d0c75d39d8c809705/Screen_Shot_2023-07-28_at_3.13.56_PM.png)
I haven't seen these hints and warnings on production, even when the database gets `VACUUM`'ed, so hopefully we can chalk this up to another example of playground's growing pains. Either way, consider some of the recommendations that the internet has to offer:
* https://www.crunchydata.com/blog/tuning-your-postgres-database-for-high-write-loads
* https://www.enterprisedb.com/blog/tuning-maxwalsize-postgresql
* https://stackoverflow.com/questions/75134262/why-do-i-have-the-message-max-wal-size-suddenly-appearing-in-my-postgres-logs
And once those parameters are tuned and validated in the `gracedb-postgresql-dev` parameter group, apply them to production.
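Before and after the parameter-group change, the checkpoint pressure can be sanity-checked through Django's database connection; a sketch assuming a PostgreSQL version where these counters still live in `pg_stat_bgwriter`:

```python
# Sketch: report the current max_wal_size and how many checkpoints were forced
# by WAL volume (checkpoints_req) versus the timed schedule (checkpoints_timed).
# Assumes pg_stat_bgwriter still carries these columns (PostgreSQL <= 16).
from django.db import connection

with connection.cursor() as cur:
    cur.execute("SHOW max_wal_size")
    print("max_wal_size:", cur.fetchone()[0])
    cur.execute("SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter")
    timed, requested = cur.fetchone()
    print(f"checkpoints: {timed} timed, {requested} forced by WAL volume")
```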

https://git.ligo.org/computing/gracedb/server/-/issues/322
apache returns 502 (bad gateway) instead of 403 unauthorized (Alexander Pace, updated 2023-07-19)

Here's the scenario: a client attempts to upload an event (`POST /api/events`) without a proper cert or valid auth. Gunicorn properly returns a `403 Unauthorized` error, but when it gets sent back to apache and then to the client, it gets turned into a `502 Bad Gateway` error. For instance, this happened on CIT early this morning. Here's an example of the gracedb log line with the 502 and the two lines before it.

```
Jul 18 09:40:12 : DJANGO | 2023-07-18 09:40:12.610 | 9e6074462859 | 10.0.1.42 | performance | INFO | middleware.py, line 58 | create: 403:
Jul 18 09:40:12 : GUNICORN | 131.215.113.168 - - [18/Jul/2023:09:40:12 +0000] "POST /api/events/ HTTP/1.1" 403 58 "-" "gracedb-client/2.10.0"
Jul 18 09:40:12 : APACHE | 10.0.1.35 - - [18/Jul/2023:09:40:12 +0000] "POST /api/events/ HTTP/1.1" 502 315 "-" "gracedb-client/2.10.0"
```
The `DJANGO` performance middleware recognizes it as a `403`, `GUNICORN` says it's a `403`, `APACHE` says `502`.
What's going to happen in this scenario is, a user will see `502` in their error logs, when the issue isn't with GraceDB, per se, but rather it's returning a catch-all error instead of the proper `403`. Manual intervention by looking in the gracedb error logs is required to get the user the correct information.
I think I remember seeing this before, and the issue was a parameter in apache that controlled the maximum size of an unauthorized request... and since `POST` requests to create new events are ~O(1Mb), they exceed this value, so when gunicorn says the request is unauthorized, apache returns the bad gateway error instead.
This isn't a showstopper, but more of a fix for dev sanity. As in, the complaint will be "we got another 502 when creating an event, gracedb is broken", but the real issue was that the user wasn't authorized.

https://git.ligo.org/computing/gracedb/server/-/issues/313
Include p-astro in super event and/or g-event tables (Ryan Magee, updated 2023-06-13)
## Description of feature request
Change the web interface to include p-astro values for the super-event and g-event tables. This would make it easier to find those values rather than clicking through to the json / scrolling for the image. This is also already implemented in the upcoming per-pipeline table, so hopefully this is a relatively small change.
## Use cases
Facilitates quick checking of p-astro when events come in.
## Benefits
## Drawbacks
None
## Suggested solutions

https://git.ligo.org/computing/gracedb/server/-/issues/312
Add time format popup menu for t_start, t_0 and t_end in superevent page (Tito Dal Canton, updated 2023-06-09)

The "Superevent Information" table on superevent pages lists t_start, t_0 and t_end as GPS times only, while the "Submitted" time has a nice popup menu where one can choose different time formats. It would be useful to have this menu for the other times as well (especially for t_0).

https://git.ligo.org/computing/gracedb/server/-/issues/310
Study: increase overall request throughput. (Alexander Pace, updated 2023-07-03)
### Problem
The last 16 months of improving the REST API performance via the MDC on `gracedb-playground` have been fruitful, but `gracedb-playground`'s use case differs from the production server in a few critical ways. `gracedb-playground` is constantly getting hammered by API requests at a rate that isn't seen in production. It also does not see nearly as many visitors on the web as the production server does. And production sees a mix of public and LVK users, whereas public users of `gracedb-playground` are nearly non-existent.
The last issue rapidly became apparent when the [database was overloaded](https://git.ligo.org/computing/gracedb/server/-/issues/301), but it's been addressed and fixed. But the issue of people visiting the site and complaining about slow load times remains.
I'm going to approach this by throwing more gunicorn workers at the problem in an attempt to increase overall throughput. Why not just throw a bigger computer at the problem? Below is the last 24 hours of CPU usage on production GraceDB, including a spike when [S230601bf](https://gracedb.ligo.org/superevents/S230601bf/view/) dropped:
![Screen_Shot_2023-06-02_at_2.50.06_PM](/uploads/e44280be7397be8e594e1f7fef22707d/Screen_Shot_2023-06-02_at_2.50.06_PM.png)
Even in periods of peak activity we're not getting anywhere near to saturating the cpu, so that makes me think any perceived slowness is due to a limited number of workers that are IO-bound and not serving new requests.
### Method
I'm going to focus on the gunicorn [sync](https://docs.gunicorn.org/en/stable/design.html#sync-workers) worker for this. Despite the claim that "each connection is closed after response has been sent", I've shown (to myself..?) previously that while the connection to the worker is closed, the outside connection is maintained from user --> apache --> gunicorn _server_. The byproduct of this is that there's not a TLS renegotiation after each request, which is what really murders latency. Also, I've noticed that the gthread worker can provide some more overall throughput at lower memory use, but at the expense of memory leaks and "stuck" threads ([issues](https://github.com/benoitc/gunicorn/issues?q=is%3Aissue+is%3Aopen+gthread)).
The systematic testing will take place on gracedb-dev1 (a dual-core, 4GB RAM VM), then be translated to playground (which has the same specs as production), and then moved to production. Note that playground/production are running on [c4.xlarge](https://instances.vantage.sh/aws/ec2/c4.xlarge) nodes (4 core, 8GB RAM), so I hope I can just... double the results from dev1 and it will scale. We'll see.
Testing will take place with the [`siege`](https://git.ligo.org/computing/sccb/-/issues/1234) command. Sample output is below:
```
$ siege https://gracedb-dev1.ligo.org/superevents/S230525c/view/ -c 1 -t 30S
...
...
Lifting the server siege...
Transactions: 222 hits
Availability: 100.00 %
Elapsed time: 29.72 secs
Data transferred: 6.72 MB
Response time: 0.13 secs
Transaction rate: 7.47 trans/sec
Throughput: 0.23 MB/sec
Concurrency: 0.99
Successful transactions: 222
Failed transactions: 0
Longest transaction: 3.87
Shortest transaction: 0.06
```
I think I should focus on the Transaction rate (trans/sec) and ~~Concurrency (specified as 1 in this example, realized as 0.99)~~ 'Concurrency Efficiency' (Realized Concurrency / Specified Concurrency). What I think will happen: if you run a siege over and over and increase the specified concurrency each time, the transaction rate will eventually plateau, and then the efficiency will continue to decrease as the limited pool of workers saturates.
I'm also going to look at three scenarios to see if I can find an optimal setup for each (or all):
1) Siege of easy requests (loading one public superevent page)
2) Siege of DB-heavy requests (loading all public superevents, from the API)
3) Siege of easy requests WHILE under load (loading a superevent page, while i'm bombarding the server with GETs to both use connections and use the CPU).
Other notes:
- Gunicorn recommends 2*N_{cpu} + 1 workers, which is what we've been using. I think baselining the number of workers vs the number of CPUs is smart. So try [0, 1, 2, 4, 8,...]*N_{cpu}+1 and see how it scales.
- It's also important to establish a baseline with the current setup to see if there's a performance increase or what.
I'll update this ticket with results as I get them.
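A sketch of the sweep described in the notes above, driving `siege` from Python and extracting the transaction rate and concurrency efficiency; the URL is the dev1 example from this ticket, and the parsing assumes siege's plain-text summary (which normally lands on stderr):

```python
# Sketch: sweep siege concurrency against one URL and record transaction rate
# and concurrency efficiency (realized / specified), per the plan above.
import re
import subprocess

URL = "https://gracedb-dev1.ligo.org/superevents/S230525c/view/"

def run_siege(concurrency, duration="30S"):
    proc = subprocess.run(["siege", URL, "-c", str(concurrency), "-t", duration],
                          capture_output=True, text=True)
    out = proc.stderr + proc.stdout   # siege's summary usually goes to stderr
    rate = float(re.search(r"Transaction rate:\s+([\d.]+)", out).group(1))
    realized = float(re.search(r"Concurrency:\s+([\d.]+)", out).group(1))
    return rate, realized / concurrency

for c in (1, 2, 4, 8, 16, 32):
    rate, efficiency = run_siege(c)
    print(f"c={c:3d}  trans/sec={rate:8.2f}  efficiency={efficiency:5.2f}")
```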

https://git.ligo.org/computing/gracedb/server/-/issues/306
Adding robot SciToken support (Duncan Meacher, updated 2023-09-06)

I will now be working on adding support for robot SciTokens within GraceDB. My understanding of how to do this is to modify the [update_user_accounts_from_ligo_ldap.py](https://git.ligo.org/computing/gracedb/server/-/blob/master/gracedb/ligoauth/management/commands/update_user_accounts_from_ligo_ldap.py) management tool to scan the LDAP for new robot SciToken accounts, create/modify accounts as needed, and then apply the per-pipeline permissions to those accounts as needed.

@satyanarayan.raypitambarmohapatra, @warren-anderson, what is the status of robot accounts within the LDAP? It's been a while since I've looked at them, but my understanding is that they each have an eppn that will link a robot SciToken to an LDAP account?
Including @duncanmmacleod in this issue.

https://git.ligo.org/computing/gracedb/server/-/issues/302
investigation of unauthorized (public) queries (get_objects_for_user) (Alexander Pace, updated 2024-03-26)

It's been established here: https://git.ligo.org/computing/gracedb/server/-/issues/249#note_689232 that unauthorized (public) queries are slow. Context: there's one call coming from `django-guardian` called `get_objects_for_user` that takes in a user, a permission (like "view log"), and a list of objects, and it returns the subset of those objects that the user can actually see. Please see this ticket: https://git.ligo.org/computing/gracedb/server/-/issues/289
I'm going to document the process for making this call faster. I think it's going to be two steps:
1) Mitigation- reducing the number of objects that this function has to filter. Also see the above ticket.
2) Optimization- we very well might be calling this function sub-optimally. So after the first step, see what we might be doing wrong.
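For context, the call in question is `guardian.shortcuts.get_objects_for_user`; a hedged sketch of the step-1 mitigation is to shrink the candidate queryset before guardian performs its per-object permission join (the model field and permission string below are placeholders):

```python
# Sketch of the step-1 mitigation: filter cheaply at the SQL level first, and
# only run the expensive per-object permission lookup on what's left. The
# `is_exposed` field and permission codename are placeholders.
from guardian.shortcuts import get_objects_for_user

def visible_superevents(user, queryset):
    public = queryset.filter(is_exposed=True)        # no per-object check needed
    restricted = queryset.filter(is_exposed=False)
    permitted = get_objects_for_user(user, "superevents.view_superevent",
                                     klass=restricted)
    return public | permitted
```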

https://git.ligo.org/computing/gracedb/server/-/issues/301
What is the plan to anticipate and mitigate future problems like those seen with S230518h (DB being overwhelmed by internal LVK or external public page loads)? (Peter Couvares, updated 2023-06-07)

First, thanks @alexander.pace for the quick id and fix of this problem. (If there is a postmortem, or relevant tickets, or even a LIGO Chat URL of the debugging to link to, please add them to this ticket here for context.)
**What is the plan to anticipate and mitigate future problems like those seen with S230518h (GraceDB being overwhelmed by internal LVK or external public page loads)?**
This plan should probably include:
1. Load specification & testing
- Define a spec / requirements for how many simultaneous public & private requests (and of what sort) a production GracedB instance should be able to respond to with a certain latency.
- As part of the CI process, perform synthetic load testing w/simulated public & private users up to the spec and ensure it passes.
- Outside of CI, perform synthetic load testing w/simulated public & private users _beyond_ the spec to understand where it fails and why.
- Do some cost/benefit analysis of preemptive fixes to the bottlenecks identified in the beyond-spec load testing, so we can decide whether to fix them now or wait until it's known to be necessary.
2. Document an emergency procedure for an overwhelmed DB that someone other than Alex can execute.
- How to identify when the DB is overwhelmed (vs. other problems – AWS down, network down, auth down, etc.)
- How to temporarily turn off public access (and turn it back on again, and how/when to decide)
- Can/should we give privileged access to certain MM partners in such an emergency, so they can follow up?
- Anything else that can temporarily speed things up in a pinch.
I would expect this to be part of Phase II of the O4 LLAI review. (If there is a milestone or tag you'd like to use for such LLAI review tasks, please add it to this ticket so we can find them all easily in the future.)