GraceDB Server issues (updated 2023-10-03)
https://git.ligo.org/computing/gracedb/server/-/issues

# Issue #328: DQR link in the RRT view should be linked to DQR 5-minutes tier URL
(Keita Kawabe, 2023-10-03)
https://git.ligo.org/computing/gracedb/server/-/issues/328

In the RRT view for an S-event, "Data quality report" is linked to https://ldas-jobs.ligo.caltech.edu/~dqr/o4dqr/online/events/YYYYMM/SYYMMDDabcd/, e.g. https://ldas-jobs.ligo.caltech.edu/~dqr/o4dqr/online/events/202309/S230927l/.
Responders have to open the link, click "tasks by tier" and click "5 min" to open a different URL in the form of https://ldas-jobs.ligo.caltech.edu/~dqr/o4dqr/online/events/YYYYMM/SYYMMDDabcd/5_min_tier_index.html, e.g. https://ldas-jobs.ligo.caltech.edu/~dqr/o4dqr/online/events/202309/S230927l/5_min_tier_index.html.
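The 5-minutes tier URL can be derived directly from the superevent ID, since the ID encodes the event date; a sketch (the helper name is hypothetical, and the URL template is the one quoted above):

```python
def dqr_5min_url(superevent_id):
    """Build the DQR 5-minutes tier URL for a superevent ID like 'S230927l'.

    The YYYYMM directory is derived from the date encoded in the ID
    (S + YYMMDD + suffix); assumes O4-era (20xx) superevents.
    """
    base = "https://ldas-jobs.ligo.caltech.edu/~dqr/o4dqr/online/events"
    yyyymm = "20" + superevent_id[1:5]  # 'S230927l' -> '202309'
    return "{}/{}/{}/5_min_tier_index.html".format(base, yyyymm, superevent_id)
```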
Since the only tasks RRT shifters are interested in are under the 5-minutes tier anyway, the link should point directly to https://ldas-jobs.ligo.caltech.edu/~dqr/o4dqr/online/events/YYYYMM/SYYMMDDabcd/5_min_tier_index.html.

# Issue #307: Add ability to disable alerts from (pipeline, searches) combinations
(Tito Dal Canton, 2023-05-23)
https://git.ligo.org/computing/gracedb/server/-/issues/307

Followup from the semi-regular RRT call of Tuesday, May 23.
It seems we currently have the ability to disable alerts from individual pipelines, but not from (pipeline, search) combinations. I would like to request the latter ability as well. The use case is that we could in principle have problems from, e.g., PyCBC Live early-warning but not from PyCBC Live full-bandwidth.

Milestone: O4

# Issue #313: Include p-astro in super event and/or g-event tables
(Ryan Magee, 2023-06-13)
https://git.ligo.org/computing/gracedb/server/-/issues/313
## Description of feature request
Change the web interface to include p-astro values for the super-event and g-event tables. This would make it easier to find those values rather than clicking through to the json / scrolling for the image. This is also already implemented in the upcoming per-pipeline table, so hopefully this is a relatively small change.
## Use cases
Facilitates quick checking of p-astro when events come in.
## Benefits
## Drawbacks
None
## Suggested solutions

# Issue #302: investigation of unauthorized (public) queries (get_objects_for_user)
(Alexander Pace, 2024-03-26)
https://git.ligo.org/computing/gracedb/server/-/issues/302

It's been established here: https://git.ligo.org/computing/gracedb/server/-/issues/249#note_689232 that unauthorized queries are slow. Context: there's one call coming from `django-guardian` called `get_objects_for_user` that takes in a user, a permission (like "view log"), and a list of objects, and it returns the subset of those objects that the user can actually see. Please see this ticket: https://git.ligo.org/computing/gracedb/server/-/issues/289
I'm going to document the process for making this call faster. I think it's going to be two steps:
1) Mitigation: reduce the number of objects that this function has to filter. Also see the above ticket.
2) Optimization: we very well might be calling this function sub-optimally. So after the first step, see what we might be doing wrong.

Assigned: Alexander Pace

# Issue #297: Support other file formats (other than XML VOEvent) to ingest external events from
(Brandon Piotrzkowski, 2024-03-15)
https://git.ligo.org/computing/gracedb/server/-/issues/297

Currently we can only ingest VOEvent XML files to create external events via `gracedb.create_event`. There is already a need to ingest events delivered via Kafka that have a `.json` format, which we are currently working around by converting to a VOEvent packet here (note this code has not been merged yet and is subject to change):
https://git.ligo.org/emfollow/gwcelery/-/blob/dfdd84a97dec60257c4d7bd91d6c0c9442ec3de6/gwcelery/tasks/external_triggers.py#L561-626
Example alert:
https://git.ligo.org/emfollow/gwcelery/-/blob/a35b3ba998ab4726f90d5fb3cdf87d365cccbc65/gwcelery/tests/data/kafka_alert_fermi.json
In general, we should make a more flexible system for ingesting external events as GCN moves towards Kafka, one that can add new notice types/formats as needed (e.g. KamLAND notices also have a different format).
I assume we need to make additional parser functions such as [`populateGrbEventFromVOEventFile`](https://git.ligo.org/computing/gracedb/server/-/blob/79c9b1ead0086fc4789a32c597b84c7abaee9513/gracedb/events/translator.py#L646) and add logic [here](https://git.ligo.org/computing/gracedb/server/-/blob/79c9b1ead0086fc4789a32c597b84c7abaee9513/gracedb/events/translator.py#L352) to determine which schema is being used and which parser to call.
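One possible shape for such a system (a sketch only; the names below are hypothetical and do not exist in `translator.py`) is a registry mapping each payload format to its parser, so that new notice formats can be added without touching the dispatch logic:

```python
import json

# Hypothetical registry of external-event parsers, keyed by payload format.
PARSERS = {}

def register(fmt):
    """Decorator that registers a parser for a payload format."""
    def wrap(func):
        PARSERS[fmt] = func
        return func
    return wrap

@register("voevent")
def parse_voevent(data):
    # Placeholder: the real version would parse the VOEvent XML packet.
    return {"format": "voevent"}

@register("json")
def parse_json_notice(data):
    notice = json.loads(data)
    return {"format": "json", "notice_type": notice.get("notice_type")}

def ingest(data, fmt):
    """Dispatch an incoming payload to the parser registered for its format."""
    try:
        parser = PARSERS[fmt]
    except KeyError:
        raise ValueError("unsupported external-event format: %s" % fmt)
    return parser(data)
```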
I can personally help with this development if needed, especially after the start of O4.

# Issue #210: Reducing queries by packing igwn-alert
(Alexander Pace, 2022-03-18)
https://git.ligo.org/computing/gracedb/server/-/issues/210

As discussed on the low-latency call of December 8, 2021: the purpose of this ticket is to solicit feedback and suggestions for what information to include in `igwn-alert` packets, with the goal of reducing costly queries to GraceDB.
Relevant past commits:
* https://git.ligo.org/lscsoft/gracedb/-/commit/e52b0c2ea248efbbb221ed51c58d55a3e5c4a3de
* https://git.ligo.org/lscsoft/gracedb/-/commit/2402e914dd6afd28035f6d06086bd6519f8018a9
Current MRs:
* https://git.ligo.org/lscsoft/gracedb/-/merge_requests/52

Milestone: O4 Infrastructure Improvements. Assigned: Alexander Pace

# Issue #161: User's favourites or Followed Events list
(Nicola De Lillo <nicola.delillo@ligo.org>, 2019-07-09)
https://git.ligo.org/computing/gracedb/server/-/issues/161
## Description of feature request
Insert a tab in between the already present tabs LATEST and ALERTS. This tab could be called "FOLLOWED EVENTS" (not PREFERRED, which would confuse people with "preferred events" in the case of superevents). The tab would link to a "FOLLOWED EVENTS" page that looks much like the current LATEST page; the difference is that it would show only the events or superevents flagged as "FOLLOWED" by the user.
Events should be flaggable either from the SEARCH page or from the LATEST page.
## Use cases
1) Internal use for LIGO: it is useful for expert rota members, advocates, or PE rota members, who can easily keep track of the events they are assigned.
2) In general, any scientist interested in a particular event can follow it, e.g. if I am only interested in tracking the analysis of binary neutron star systems, I would flag only those events as FOLLOWED.
## Benefits
I think it really helps with tracking the events a scientist wants to follow.
## Drawbacks
No real drawbacks at the moment as far as I can see, apart from having to redesign the toolbar to add the "FOLLOWED EVENTS" tab.
## Suggested solutions
Attached is a figure that shows how to display the boxes to flag in the FOLLOW column (the example is for the LATEST page, but please consider doing this for the SEARCH page as well). All flagged events (marked with a flag, dot, or fill color) will appear on the FOLLOWED EVENTS page. Of course, remember to add an 'X' symbol on the FOLLOWED EVENTS page so one can remove an event.
![LATEST_examp](/uploads/4b8a8e328fff42644b539cc6ee63d061/LATEST_examp.png)

# Issue #148: Visually distinguish private vs public information
(Stuart Anderson, 2022-08-04)
https://git.ligo.org/computing/gracedb/server/-/issues/148

Consider adding an option to visually distinguish private vs public information for privileged users while they are logged in, e.g., a different background color or a watermark overlay. Note, if that is too visually distracting, there could be a toggle button, e.g., "highlight public" or "highlight private", to enable an inline comparison of public vs private information.

Milestone: Backlog

# Issue #344: Unsafe search response DB operations
(Daniel Wysocki, 2024-03-27)
https://git.ligo.org/computing/gracedb/server/-/issues/344

Sentry reported an `IndexError` in `search.response.event_datatables_response` [here](https://ligo-caltech.sentry.io/issues/5107058502/?alert_rule_id=710526&alert_timestamp=1711566827464&alert_type=email&environment=production&notification_uuid=abebba78-dd00-4d76-b6e1-5f05b8265faa&project=1456379&referrer=alert_email).
Looking into it, I've realized that [this call to `count()`](https://git.ligo.org/computing/gracedb/server/-/blob/77f15d0b34598612f347216aa0e323296b400fe3/gracedb/search/response.py#L348) performs a SQL [`SELECT COUNT(*)`](https://docs.djangoproject.com/en/4.2/ref/models/querysets/#django.db.models.query.QuerySet.count), which can then be outdated by the time we [iterate over a second query](https://git.ligo.org/computing/gracedb/server/-/blob/77f15d0b34598612f347216aa0e323296b400fe3/gracedb/search/response.py#L354). This all seems to be part of optimizations made in !163. I would consider reverting that MR, or, if avoiding `list.append` is actually having a measurable performance benefit, using something like a pre-allocated 2D buffer array.

Assigned: Alexander Pace, Daniel Wysocki

# Issue #342: Smooth deployment on Kubernetes
(Sara Vallero, 2024-03-15)
https://git.ligo.org/computing/gracedb/server/-/issues/342
This is to upstream all the patches currently implemented for the deployment of gracedb-test01.igwn.org or a sandboxed deployment on Minikube.
- [ ] unauthenticated access to hopskotch (https://git.ligo.org/computing/gracedb/server/-/merge_requests/205)
- [ ] generic site name (https://git.ligo.org/computing/gracedb/server/-/merge_requests/206)
- [ ] username/password auth

Assigned: Sara Vallero

# Issue #337: Virgo O4b workflow
(Michael William Coughlin, 2024-01-26)
https://git.ligo.org/computing/gracedb/server/-/issues/337
There are ongoing discussions about how to use Virgo in LL for O4b.
See DAC issue here: https://git.ligo.org/dac/preparations-for-using-virgo-data-in-o4b-low-latency-analyses/-/issues/1
See gwcelery issue here: https://git.ligo.org/emfollow/gwcelery/-/issues/749
It is possible that some changes to GraceDB will be needed to enable this.

Milestone: O4b

# Issue #334: Expanded API calls for analytics
(Alexander Pace, 2023-11-13)
https://git.ligo.org/computing/gracedb/server/-/issues/334
From an email chain with @andrew.toivonen, @michael-coughlin, @sushant.sharma-chaudhary:
```
Alex,
Following up on your email, we had a discussion as a group about what GraceDB API changes could be useful.
For some context, these are the scripts (and what they fetch) that we have used in the past to fetch from GraceDB/GraceDB Playground:
Playground:
All MDC events (from a range of gpstimes): https://git.ligo.org/emfollow/em-properties/mdc-analytics/-/blob/main/fetch_data/events_from_gracedb.py
MDC Skymaps: https://git.ligo.org/emfollow/em-properties/mdc-analytics/-/blob/main/fetch_data/fetch_skymaps.py
MDC Posterior Samples (from a range of gpstimes): https://git.ligo.org/emfollow/em-properties/mdc-analytics/-/blob/main/fetch_data/fetch_all_PE.py
GraceDB
All data products from a superevent: https://git.ligo.org/emfollow/em-properties/mdc-analytics/-/blob/main/fetch_data/fetch_superevent.py
Posterior Samples from a single event: https://git.ligo.org/emfollow/em-properties/mdc-analytics/-/blob/main/fetch_data/fetch_PE.py
GCN latencies: https://git.ligo.org/emfollow/em-properties/mdc-analytics/-/blob/main/fetch_data/fetch_O4_gcn.py
First off, if you feel any of these scripts are poorly optimized feel free to let us know. This brings me to my next thought, we know that bulk fetching from Playground for the MDC is very resource intensive and has caused issues in the past. I however think there will always be a need for bulk fetching when it comes to the MDC, simply due to the nature of the study and the numerous triggers. Part of the strain was also caused due to the fact that we did not fetch in an optimized manner (and maybe our method could be optimized event further), so one possible addition to the API would be adding a call to fetch a table of all event quantities as we did, yet done how you would optimize such a query. The same could be said for event data products, such as PE and skymaps. We were maybe wondering if there was a way to add a call that would simply download a file, without having to save it or a list of files as an object?
As for fetching from GraceDB, I think in general our studies will be focused on specific or a small subset of events. What could be most useful would be a call to download the latest skymap or latest posterior samples for a given event. Finally, I know latency was added to the GraceDB page, how is that latency defined? And is there an easy way to fetch that value? Fetching all the latencies for a range of gpstimes or just the entire observing run would be useful as well. Maybe it would also be good to include the ability to fetch all superevents, or just significant ones.
These were our initial thoughts without a great idea of which of these are most easily implemented and would make a difference.
Let us know what you think,
Andrew
```

# Issue #333: search feature not working
(Kipp Cannon, 2023-11-13)
https://git.ligo.org/computing/gracedb/server/-/issues/333
## Description of problem
Search feature is not working
## Expected behavior
Go to https://gracedb.ligo.org/search/ and enter "S231020", select "Superevent", click "Search". Only one entry appears, labelled "S231020a". But there were many events that day, including "S231020bw", which is the one I was trying to find. Entering "S231020bw" into the search term produces the desired entry. If the search is not implicitly a wild-card search, why does the "a" event appear? If it is a wild-card search, why doesn't the "bw" event appear?
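For illustration, the implicit wild-card behavior one might expect from this query (a standalone sketch, not GraceDB's actual search implementation) would look like:

```python
import fnmatch

# Superevent IDs from the same day plus one from the next day.
ids = ["S231020a", "S231020bw", "S231021x"]

# If "S231020" were treated as an implicit wild-card prefix, both
# same-day superevents would match, including the "bw" event:
assert fnmatch.filter(ids, "S231020*") == ["S231020a", "S231020bw"]

# Exact matching, by contrast, would match nothing, since "S231020"
# itself is not a complete superevent ID:
assert fnmatch.filter(ids, "S231020") == []
```

Neither behavior explains returning only the "a" event, which is what makes the observed result surprising.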
## Steps to reproduce
See above.
## Context/environment
My web browser.
## Suggested solutions
Fix the search feature, or modify the Query Help page to explain how to properly do wild-card searches. Thanks.

# Issue #332: number of log annotations on S190412m causes browser requests to hit timeout
(Alexander Pace, 2023-10-16)
https://git.ligo.org/computing/gracedb/server/-/issues/332

Attempting to load the internal page for [S190412m](https://gracedb.ligo.org/superevents/S190412m/view/) results in a timeout because the time to retrieve the number of log entries on that event exceeds the 30-second timeout in gunicorn. Note that I had previously implemented a check for a maximum number of log messages to display for g-events (in response to RAVEN repeatedly annotating external events for years on end), but this check never got ported over to superevents. @roberto.depietri brought this up on the [emfollow dev call](https://git.ligo.org/emfollow/gwcelery/-/wikis/telcons/2023-10-16) this morning.
Okay, so what is it about this superevent, and who's writing all those log messages? I went into the database console to see where all the annotations were coming from and I believe they were from the `detchar` user, who annotated the superevent 877 times:
```
In: m = Superevent.get_by_date_id('S190412m')
In: m.log_set.exclude(comment__contains='Tagged message').filter(issuer=detchar).count()
Out: 877
```
This user accessed GraceDB with one of the following certificate subjects back in 2019.
And it looks like there was a server error of some sort (not related to GraceDB as far as I can tell) that prevented some data products from being uploaded, because the `Detchar` log messages are mostly ones like these:
```
2019-04-12 05:33:01.508952+00:00  Attempted upload of 'L1ligocam-S190412m.json' failed due to server issues [message edited by administrator]
2019-04-12 05:33:00.555024+00:00  Attempted upload of 'L1ligocam-S190412m.json' failed due to server issues [message edited by administrator]
2019-04-12 05:32:59.357731+00:00  Attempted upload of 'L1ligocam-S190412m.json' failed due to server issues [message edited by administrator]
2019-04-12 05:32:58.161597+00:00  Attempted upload of 'L1ligocam-S190412m.json' failed due to server issues [message edited by administrator]
2019-04-12 05:32:57.160221+00:00  Attempted upload of 'L1ligocam-S190412m.json' failed due to server issues [message edited by administrator]
2019-04-12 05:32:56.204876+00:00  Attempted upload of 'L1ligocam-S190412m.json' failed due to server issues [message edited by administrator]
2019-04-12 05:32:55.161672+00:00  Attempted upload of 'L1ligocam-S190412m.json' failed due to server issues [message edited by administrator]
2019-04-12 05:32:54.276859+00:00  Attempted upload of 'L1ligocam-S190412m.json' failed due to server issues [message edited by administrator]
2019-04-12 05:32:06.365500+00:00  Attempted upload of 'L1ligocam-S190412m.json' failed due to server issues [message edited by administrator]
2019-04-12 05:32:04.341545+00:00  Attempted upload of 'L1ligocam-S190412m.json' failed due to server issues [message edited by administrator]
2019-04-12 05:32:03.398589+00:00  Attempted upload of 'L1ligocam-S190412m.json' failed due to server issues [message edited by administrator]
```
I've attached the timestamp and comment of each one of the detchar log messages to this issue. [S190412m-detchar-errors.txt](/uploads/3cf5e528934a4af0d2d5af6498531875/S190412m-detchar-errors.txt)
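Porting the g-event cap over to superevents could look roughly like the following (a standalone sketch; the helper name and threshold are illustrative, not GraceDB's actual code):

```python
# Illustrative cap on the number of log messages a view will render,
# so one heavily annotated event cannot push a page past the timeout.
MAX_LOG_MESSAGES = 1000  # made-up threshold, not the real setting

def logs_for_display(log_messages, limit=MAX_LOG_MESSAGES):
    """Return (messages_to_render, was_truncated).

    The view can then show only the most recent messages plus a warning
    banner instead of timing out on events with thousands of annotations.
    """
    if len(log_messages) > limit:
        return log_messages[:limit], True
    return log_messages, False
```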
I'll go ahead and implement the maximum-log-messages check for superevents. @roberto.depietri, if there's anything else you need to help interrogate this 4-year-old superevent, please let me know.

Assigned: Alexander Pace

# Issue #329: GraceDB uploads from the lensing pipelines
(Ian Harry, 2024-01-22)
https://git.ligo.org/computing/gracedb/server/-/issues/329
I was asked to move this discussion here from an email thread. I'll copy/paste the emails into here one by one, starting with the top-level description/problem statement:
We've been discussing on the PyCBC end about the deployment of the O4a lensing search pipeline, and there was a question about GraceDB interaction, which I wanted to bring to some experts. I hope I'm reaching the right GraceDB experts here (alongside the GstLAL lensing search leads, and search chairs in CC), but let me know if I'm missing anyone.

Just as an overview/reminder: the lensing searches are run as a followup to known CBC triggers. They perform a focused search on a narrow range of parameters around the values obtained from the known event. The motivation is that there might be a lensed event which appears as two "images" on Earth, one with SNR > 8 and one with SNR < ~8. The first "image" can be found by our standard all-sky searches, but the second might only be extracted if we use information from the first image.

Practically this means that we will have a set of search triggers for *every* CBC candidate at *all* times in O4, from both GstLAL and PyCBC.

These searches will be recovering *other* known BBH events, and given a bulk of events around mchirp ~ 40, we will likely have some events recovered in *multiple* lensing searches.

The question is how we would process this in GraceDB. Would we be uploading all triggers (above some threshold?) to GraceDB? This has the potential to make some superevents quite confusing on internal views if there are numerous lensed triggers alongside the numerous online and offline all-sky triggers. What search tags/columns/names/whatever would be used? Has GstLAL already got a plan in place for doing this? Any other thoughts?
Thanks!
Ian

# Issue #324: Traefik returns 502 when Gunicorn restarts
(Alexander Pace, 2023-10-04)
https://git.ligo.org/computing/gracedb/server/-/issues/324
Here's the scenario: GraceDB will instantly (no 30-second timeout) return a 502 proxy error to the client, then the client code retries and everything works.
Further investigation shows that there are no errors in the gracedb (django/gunicorn) logs, but there will be one in the webgateway/traefik logs, e.g.:
```
# grep -n '" 502 ' *.log
gracedb-swarm-production-us-west-2c-docker-mgr-01.log:49112:Aug 1 13:45:34 gracedb-swarm-production-us-west-2c-docker-mgr-01 gracedb_docker_webgateway_webgateway.3.91ly1u1voskgcnyo78u2t0xq1: 131.215.113.150 - - [01/Aug/2023:13:45:34 +0000] "POST /api/events/ HTTP/1.1" 502 11 "-" "-" 613542 "gracedb@docker" "http://10.0.1.56:80" 1ms
```
At the same time there's a block in gracedb's logs like:
```
Aug 1 13:45:34 gracedb-swarm-production-us-west-2c-docker-mgr-01 gracedb_docker_gracedb_gracedb.1.we7ztc1fre4kje94k14lz3cy4: GUNICORN | [2023-08-01 13:45:34 +0000] [3589] [INFO] Autorestarting worker after current request.
Aug 1 13:45:34 gracedb-swarm-production-us-west-2c-docker-mgr-01 gracedb_docker_gracedb_gracedb.1.we7ztc1fre4kje94k14lz3cy4: GUNICORN | [2023-08-01 13:45:34 +0000] [3589] [INFO] Worker exiting (pid: 3589)
Aug 1 13:45:35 gracedb-swarm-production-us-west-2c-docker-mgr-01 gracedb_docker_gracedb_gracedb.1.we7ztc1fre4kje94k14lz3cy4: GUNICORN | [2023-08-01 13:45:35 +0000] [11186] [INFO] Booting worker with pid: 11186
Aug 1 13:45:35 gracedb-swarm-production-us-west-2c-docker-mgr-01 gracedb_docker_gracedb_gracedb.1.we7ztc1fre4kje94k14lz3cy4: GUNICORN | [2023-08-01 13:45:35 +0000] [11186] [INFO] Worker spawned (pid: 11186)
```
Automatic restarting is controlled [here](https://git.ligo.org/computing/gracedb/server/-/blob/81847bbf401c99dabd36d39d66aab5f95deae6d3/config/gunicorn_config.py#L74-86) and [here](https://git.ligo.org/computing/gracedb/deployment/-/blob/0cd096d8230e9a01dadeeed66609d8939dc1129c/swarm-stacks/gracedb-prod-stack.yml#L100-101), and it is used to avoid possible memory leaks. As far as I can tell, this restart/502 hasn't actually affected low-latency operations, as gwcelery has retried and succeeded each time. `pycbclive` did ping about a 502 twice (2023-07-24 15:21:20 UTC on playground and 2023-07-29 10:48:19 on prod), but as far as I can tell the request was subsequently retried by the client code and succeeded.
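For reference, this worker recycling is driven by Gunicorn's `max_requests` and `max_requests_jitter` settings; a sketch of the relevant fragment of a `gunicorn_config.py` (values are illustrative, not the production configuration):

```python
# Fragment of a gunicorn_config.py; values are illustrative only.
# A worker is restarted after serving max_requests requests, plus a
# random jitter so that workers don't all recycle at the same moment.
max_requests = 1000
max_requests_jitter = 250

# Seconds to wait for a worker to finish in-flight requests on restart.
graceful_timeout = 30
```

Raising `max_requests` makes restarts rarer; raising the jitter spreads them out so the proxy is less likely to hit a worker mid-restart.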
One possible solution is to do nothing, since clients are retrying and succeeding. We could also try increasing the maximum number of requests and the jitter, to see if we can space out the restarts and make them less frequent.

# Issue #323: Consider increasing the configuration parameter "max_wal_size".
(Alexander Pace, 2023-07-28)
https://git.ligo.org/computing/gracedb/server/-/issues/323

There were some timeouts on `gracedb-playground` this afternoon (2023-07-28) from around 18:40-18:43ish UTC that I think were triggered in some part by a `VACUUM FULL` when I was doing some exploratory maintenance on playground's db. During the period in question there were the following lines in `gracedb-playground`'s RDS logs:
```
2023-07-28 18:35:50 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:36:12 UTC::@:[393]:LOG: checkpoint complete: wrote 39902 buffers (16.5%); 0 WAL file(s) added, 0 removed, 16 recycled; write=20.183 s, sync=1.326 s, total=21.691 s; sync files=211, longest=1.323 s, average=0.007 s; distance=1048579 kB, estimate=1048579 kB
2023-07-28 18:36:13 UTC::@:[393]:LOG: checkpoints are occurring too frequently (23 seconds apart)
2023-07-28 18:36:13 UTC::@:[393]:HINT: Consider increasing the configuration parameter "max_wal_size".
2023-07-28 18:36:13 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:36:39 UTC::@:[393]:LOG: checkpoint complete: wrote 231 buffers (0.1%); 0 WAL file(s) added, 0 removed, 13 recycled; write=25.661 s, sync=0.420 s, total=26.123 s; sync files=112, longest=0.399 s, average=0.004 s; distance=1048586 kB, estimate=1048586 kB
2023-07-28 18:36:49 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:37:14 UTC::@:[393]:LOG: checkpoint complete: wrote 2019 buffers (0.8%); 0 WAL file(s) added, 2 removed, 17 recycled; write=24.321 s, sync=0.191 s, total=25.505 s; sync files=138, longest=0.190 s, average=0.002 s; distance=1049475 kB, estimate=1049475 kB
2023-07-28 18:37:17 UTC::@:[393]:LOG: checkpoints are occurring too frequently (28 seconds apart)
2023-07-28 18:37:17 UTC::@:[393]:HINT: Consider increasing the configuration parameter "max_wal_size".
2023-07-28 18:37:17 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:37:24 UTC::@:[393]:LOG: checkpoint complete: wrote 69 buffers (0.0%); 0 WAL file(s) added, 0 removed, 10 recycled; write=6.996 s, sync=0.342 s, total=7.539 s; sync files=34, longest=0.342 s, average=0.011 s; distance=1065103 kB, estimate=1065103 kB
2023-07-28 18:37:30 UTC::@:[393]:LOG: checkpoints are occurring too frequently (13 seconds apart)
2023-07-28 18:37:30 UTC::@:[393]:HINT: Consider increasing the configuration parameter "max_wal_size".
2023-07-28 18:37:30 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:37:33 UTC::@:[393]:LOG: checkpoint complete: wrote 4 buffers (0.0%); 0 WAL file(s) added, 0 removed, 9 recycled; write=0.480 s, sync=0.190 s, total=2.933 s; sync files=4, longest=0.190 s, average=0.048 s; distance=1056458 kB, estimate=1064239 kB
2023-07-28 18:38:33 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:38:49 UTC::@:[393]:LOG: checkpoint complete: wrote 171 buffers (0.1%); 0 WAL file(s) added, 0 removed, 19 recycled; write=15.533 s, sync=0.120 s, total=16.420 s; sync files=89, longest=0.120 s, average=0.002 s; distance=1034294 kB, estimate=1061244 kB
2023-07-28 18:39:19 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:39:36 UTC::@:[393]:LOG: checkpoint complete: wrote 171 buffers (0.1%); 0 WAL file(s) added, 0 removed, 14 recycled; write=17.051 s, sync=0.006 s, total=17.104 s; sync files=94, longest=0.006 s, average=0.001 s; distance=1063328 kB, estimate=1063328 kB
2023-07-28 18:40:59 UTC::@:[393]:LOG: checkpoint complete: wrote 517 buffers (0.2%); 0 WAL file(s) added, 11 removed, 17 recycled; write=28.949 s, sync=0.112 s, total=29.842 s; sync files=181, longest=0.111 s, average=0.001 s; distance=1040638 kB, estimate=1061059 kB
2023-07-28 18:41:00 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:41:11 UTC::@:[393]:LOG: checkpoint complete: wrote 118 buffers (0.0%); 0 WAL file(s) added, 0 removed, 14 recycled; write=10.732 s, sync=0.280 s, total=11.601 s; sync files=47, longest=0.280 s, average=0.006 s; distance=1084223 kB, estimate=1084223 kB
2023-07-28 18:41:14 UTC::@:[393]:LOG: checkpoints are occurring too frequently (14 seconds apart)
2023-07-28 18:41:14 UTC::@:[393]:HINT: Consider increasing the configuration parameter "max_wal_size".
2023-07-28 18:41:14 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:41:16 UTC::@:[393]:LOG: checkpoint complete: wrote 4 buffers (0.0%); 0 WAL file(s) added, 0 removed, 5 recycled; write=1.227 s, sync=0.054 s, total=2.786 s; sync files=2, longest=0.054 s, average=0.027 s; distance=1037553 kB, estimate=1079556 kB
2023-07-28 18:42:12 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:42:16 UTC::@:[393]:LOG: checkpoint complete: wrote 34 buffers (0.0%); 0 WAL file(s) added, 0 removed, 18 recycled; write=3.448 s, sync=0.090 s, total=3.948 s; sync files=22, longest=0.090 s, average=0.005 s; distance=1012093 kB, estimate=1072810 kB
2023-07-28 18:43:39 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:43:41 UTC::@:[393]:LOG: checkpoint complete: wrote 11 buffers (0.0%); 0 WAL file(s) added, 0 removed, 16 recycled; write=1.116 s, sync=0.181 s, total=2.198 s; sync files=8, longest=0.181 s, average=0.023 s; distance=1103069 kB, estimate=1103069 kB
```
This also occurred during a period of high relational load in the database:
![Screen_Shot_2023-07-28_at_3.13.56_PM](/uploads/95a62730a64f5a8d0c75d39d8c809705/Screen_Shot_2023-07-28_at_3.13.56_PM.png)
I haven't seen these hints and warnings on production, even when the database gets `VACUUM`'ed, so hopefully this can be chalked up as another example of playground's growing pains. Either way, consider some of the recommendations the internet has to offer:
* https://www.crunchydata.com/blog/tuning-your-postgres-database-for-high-write-loads
* https://www.enterprisedb.com/blog/tuning-maxwalsize-postgresql
* https://stackoverflow.com/questions/75134262/why-do-i-have-the-message-max-wal-size-suddenly-appearing-in-my-postgres-logs
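Before touching any parameters, a quick sanity check is to turn the `distance` figures from the checkpoint log above into a sustained WAL rate and size `max_wal_size` for a target checkpoint interval. This is only a back-of-the-envelope sketch (the numbers are read off the 18:41 log lines, which reflect a burst rather than steady-state load, and the 5-minute target simply matches the default `checkpoint_timeout`):

```python
# Back-of-the-envelope sizing for max_wal_size, using the `distance`
# figures from the checkpoint log above (~1 GB of WAL every ~14 s).

def suggested_max_wal_size_gb(distance_kb, seconds_apart, target_interval_s=300):
    """Sustained WAL rate (kB/s) times the interval we'd like between
    WAL-triggered checkpoints (here the 5-minute checkpoint_timeout
    default), converted to GB."""
    wal_rate_kb_per_s = distance_kb / seconds_apart
    return wal_rate_kb_per_s * target_interval_s / 1024 ** 2

print(round(suggested_max_wal_size_gb(distance_kb=1084223, seconds_apart=14)))  # ≈ 22 GB
```

Whatever value comes out should still be validated in the dev parameter group first, as below; a larger `max_wal_size` also means a longer redo window on crash recovery.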
And once those parameters are tuned and validated in the `gracedb-postgresql-dev` parameter group, apply it to production.

https://git.ligo.org/computing/gracedb/server/-/issues/322
apache returns 502 (bad gateway) instead of 403 unauthorized
2023-07-19T14:49:59Z · Alexander Pace

Here's the scenario: a client attempts to upload an event (`POST /api/events`) without a proper cert or valid auth. Gunicorn properly returns a `403 Unauthorized` error, but by the time it has been sent back through apache to the client, it has been turned into a `502 Bad Gateway` error. For instance, this happened on CIT early this morning. Here's an example of the gracedb log line with the 502 and the two lines before it.
```
Jul 18 09:40:12 : DJANGO | 2023-07-18 09:40:12.610 | 9e6074462859 | 10.0.1.42 | performance | INFO | middleware.py, line 58 | create: 403:
Jul 18 09:40:12 : GUNICORN | 131.215.113.168 - - [18/Jul/2023:09:40:12 +0000] "POST /api/events/ HTTP/1.1" 403 58 "-" "gracedb-client/2.10.0"
Jul 18 09:40:12 : APACHE | 10.0.1.35 - - [18/Jul/2023:09:40:12 +0000] "POST /api/events/ HTTP/1.1" 502 315 "-" "gracedb-client/2.10.0"
```
The `DJANGO` performance middleware recognizes it as a `403`, `GUNICORN` says it's a `403`, `APACHE` says `502`.
What's going to happen in this scenario is that a user will see a `502` in their error logs when the issue isn't with GraceDB per se; it's just returning a catch-all error instead of the proper `403`. Manual intervention, by looking in the gracedb error logs, is required to get the user the correct information.
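Until the apache behavior is fixed, one way to reduce that manual digging is to scan the logs for requests where gunicorn and apache disagree on the status code. A hypothetical helper, assuming the log format shown above (the line layout and field names are taken from the sample; the function name is ours):

```python
import re

# Flag requests where gunicorn and apache logged different status codes,
# using the combined-log-style lines shown above.
LINE_RE = re.compile(r'(GUNICORN|APACHE) \| \S+ - - \[([^\]]+)\] "([^"]+)" (\d{3})')

def status_mismatches(lines):
    """Group entries by (timestamp, request line); report any pair where
    the apache status differs from the gunicorn status."""
    seen = {}
    mismatches = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        source, ts, request, status = m.groups()
        codes = seen.setdefault((ts, request), {})
        codes[source] = status
        if len(codes) == 2 and codes["GUNICORN"] != codes["APACHE"]:
            mismatches.append(((ts, request), codes["GUNICORN"], codes["APACHE"]))
    return mismatches

log = [
    'Jul 18 09:40:12 : GUNICORN | 131.215.113.168 - - [18/Jul/2023:09:40:12 +0000] "POST /api/events/ HTTP/1.1" 403 58 "-" "gracedb-client/2.10.0"',
    'Jul 18 09:40:12 : APACHE | 10.0.1.35 - - [18/Jul/2023:09:40:12 +0000] "POST /api/events/ HTTP/1.1" 502 315 "-" "gracedb-client/2.10.0"',
]
print(status_mismatches(log))  # flags the 403 → 502 pair from the log above
```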
I think I remember seeing this before, and the issue was a parameter in apache that controls the maximum size of an unauthorized request... since `POST` requests to create new events are ~O(1 MB), they exceed this value, and so when gunicorn says the request is unauthorized, apache returns the catch-all bad gateway error instead.
This isn't a showstopper, but more of a fix for dev sanity. As in, the complaint will be "we got another 502 when creating an event, gracedb is broken", when the real issue was that the user wasn't authorized.

https://git.ligo.org/computing/gracedb/server/-/issues/312
Add time format popup menu for t_start, t_0 and t_end in superevent page
2023-06-09T11:02:01Z · Tito Dal Canton

The "Superevent Information" table on superevent pages lists t_start, t_0 and t_end as GPS times only, while the "Submitted" time has a nice popup menu where one can choose different time formats. It would be useful to have this menu for the other times as well (especially for t_0).

https://git.ligo.org/computing/gracedb/server/-/issues/310
Study: increase overall request throughput.
2023-07-03T17:09:32Z · Alexander Pace

### Problem
The last 16 months of improving REST API performance via the MDC on `gracedb-playground` have been fruitful, but `gracedb-playground`'s use case differs from the production server in a few critical ways. `gracedb-playground` is constantly getting hammered by API requests at a rate that isn't seen in production, yet it sees far fewer web visitors than the production server does. Production also serves a mix of public and LVK users, whereas public users on `gracedb-playground` are nearly non-existent.
That last difference rapidly became apparent when the [database was overloaded](https://git.ligo.org/computing/gracedb/server/-/issues/301), though that has since been addressed and fixed. The issue of people visiting the site and complaining about slow load times remains.
I'm going to approach this by throwing more gunicorn workers at the problem, in an attempt to increase overall throughput. Why not just throw a bigger computer at it? Below is the last 24 hours of CPU usage on production GraceDB, including a spike when [S230601bf](https://gracedb.ligo.org/superevents/S230601bf/view/) dropped:
![Screen_Shot_2023-06-02_at_2.50.06_PM](/uploads/e44280be7397be8e594e1f7fef22707d/Screen_Shot_2023-06-02_at_2.50.06_PM.png)
Even in periods of peak activity we're nowhere near saturating the CPU, which makes me think any perceived slowness is due to a limited number of workers that are IO-bound and not free to serve new requests.
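The worker-bound hypothesis can be sketched with a rough capacity model: a sync worker handles one request at a time, so peak throughput is roughly workers divided by mean response time (Little's law). The 9-worker and 0.13 s figures below are illustrative, taken from the 2*N_cpu + 1 rule on a 4-core node and the siege response time measured later in this ticket:

```python
# Rough capacity model for sync workers: each worker serves one request
# at a time, so peak throughput ≈ number of workers / mean response time.

def peak_throughput(workers, mean_latency_s):
    return workers / mean_latency_s

# e.g. 2*4 + 1 = 9 workers on a 4-core node, at a 0.13 s mean response time
print(round(peak_throughput(9, 0.13), 1))  # ≈ 69.2 requests/sec
```

Anything above that rate queues, regardless of how idle the CPU looks, which is consistent with the graph above.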
### Method
I'm going to focus on the gunicorn [sync](https://docs.gunicorn.org/en/stable/design.html#sync-workers) worker for this. Despite the claim that "each connection is closed after response has been sent", I've previously shown (to myself..?) that while the connection to the worker is closed, the outside connection is maintained from user → apache → gunicorn _server_. The byproduct is that there's no TLS renegotiation after each request, which is what really murders latency. I've also noticed that the gthread worker can provide some more overall throughput at lower memory use, but at the expense of memory leaks and "stuck" threads ([issues](https://github.com/benoitc/gunicorn/issues?q=is%3Aissue+is%3Aopen+gthread)).
The systematic testing will take place on gracedb-dev1 (a dual-core, 4GB RAM VM), then be translated to playground (which has the same specs as production), and then moved to production. Note that playground/production run on [c4.xlarge](https://instances.vantage.sh/aws/ec2/c4.xlarge) nodes (4 core, 8GB RAM), so I hope I can just... double the results from dev1 and it will scale. We'll see.
Testing will take place with the [`siege`](https://git.ligo.org/computing/sccb/-/issues/1234) command. Sample output is below:
```
$ siege https://gracedb-dev1.ligo.org/superevents/S230525c/view/ -c 1 -t 30S
...
...
Lifting the server siege...
Transactions: 222 hits
Availability: 100.00 %
Elapsed time: 29.72 secs
Data transferred: 6.72 MB
Response time: 0.13 secs
Transaction rate: 7.47 trans/sec
Throughput: 0.23 MB/sec
Concurrency: 0.99
Successful transactions: 222
Failed transactions: 0
Longest transaction: 3.87
Shortest transaction: 0.06
```
I think I should focus on the Transaction rate (trans/sec) and ~~Concurrency (specified as 1 in this example, realized as 0.99)~~ 'Concurrency Efficiency' (Realized Concurrency / Specified Concurrency). What I think will happen is: if you run a siege over and over, increasing the specified concurrency each time, the transaction rate will eventually plateau, and then the efficiency will continue to decrease as the limited pool of workers saturates.
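Both metrics fall straight out of the siege output above; a small sketch of the arithmetic (the function names are ours):

```python
# Derived metrics from a siege run, using the sample output above.

def transaction_rate(transactions, elapsed_s):
    return transactions / elapsed_s

def concurrency_efficiency(realized, specified):
    return realized / specified

print(round(transaction_rate(222, 29.72), 2))  # 7.47 trans/sec, matching siege
print(concurrency_efficiency(0.99, 1))         # 0.99
```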
I'm also going to look at three scenarios to see if I can find an optimal setup for each (or all):
1) Siege of easy requests (loading one public superevent page)
2) Siege of DB-heavy requests (loading all public superevents, from the API)
3) Siege of easy requests WHILE under load (loading a superevent page, while i'm bombarding the server with GETs to both use connections and use the CPU).
Other notes:
- Gunicorn recommends 2*N_{cpu} + 1 workers, which is what we've been using. I think baselining the number of workers against the number of CPUs is smart, so try [0, 1, 2, 4, 8, ...]*N_{cpu} + 1 and see how it scales.
- It's also important to establish a baseline with the current setup to see if there's a performance increase or what.
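Since gunicorn config files are plain Python, the worker-count sweep can be expressed as a single knob. A hypothetical `gunicorn.conf.py` sketch (`workers` and `worker_class` are standard gunicorn settings; the `scale` variable is ours):

```python
# Hypothetical gunicorn.conf.py: express the worker count as a multiple of
# the CPU count so the [0, 1, 2, 4, 8, ...] sweep is a one-line edit.
import multiprocessing

cpus = multiprocessing.cpu_count()
scale = 2  # the 2*N_cpu + 1 rule of thumb; sweep this during testing
workers = scale * cpus + 1
worker_class = "sync"  # the worker type under study

print(workers)
```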
I'll update this ticket with results as I get them.