GraceDB Server issues
https://git.ligo.org/computing/gracedb/server/-/issues

Issue #302: investigation of unauthorized (public) queries (get_objects_for_user) (Alexander Pace, 2024-03-26)
https://git.ligo.org/computing/gracedb/server/-/issues/302

It's been established here: https://git.ligo.org/computing/gracedb/server/-/issues/249#note_689232 that unauthorized queries are slow. Context: there's one call coming from `django-guardian` called `get_objects_for_user` that takes in a user, a permission (like "view log"), and a list of objects, and it returns the subset of those objects that the user can actually see. Please see this ticket: https://git.ligo.org/computing/gracedb/server/-/issues/289
I'm going to document the process for making this call faster. I think it's going to be two steps:
1) Mitigation: reducing the number of objects that this function has to filter. Also see the above ticket.
2) Optimization: we very well might be calling this function sub-optimally. So after the first step, see what we might be doing wrong. (Assignee: Alexander Pace)

Issue #301: What is the plan to anticipate and mitigate future problems like those seen with S230518h (DB being overwhelmed by internal LVK or external public page loads)? (Peter Couvares, 2023-06-07)
https://git.ligo.org/computing/gracedb/server/-/issues/301

First, thanks @alexander.pace for the quick id and fix of this problem. (If there is a postmortem, or relevant tickets, or even a LIGO Chat URL of the debugging to link to, please add them to this ticket for context.)
**What is the plan to anticipate and mitigate future problems like those seen with S230518h (GraceDB being overwhelmed by internal LVK or external public page loads)?**
This plan should probably include:
1. Load specification & testing
- Define a spec / requirements for how many simultaneous public & private requests (and of what sort) a production GraceDB instance should be able to respond to within a certain latency.
- As part of the CI process, perform synthetic load testing w/simulated public & private users up to the spec and ensure it passes.
- Outside of CI, perform synthetic load testing w/simulated public & private users _beyond_ the spec to understand where it fails and why.
- Do some cost/benefit analysis of preemptive fixes to the bottlenecks identified in the beyond-spec load testing, so we can decide whether to fix them now or wait until it's known to be necessary.
2. Document an emergency procedure for an overwhelmed DB that someone other than Alex can execute.
- How to identify when the DB is overwhelmed (vs. other problems – AWS down, network down, auth down, etc.)
- How to temporarily turn off public access (and turn it back on again, and how/when to decide)
- Can/should we give privileged access to certain MM partners in such an emergency, so they can follow up?
- Anything else that can temporarily speed things up in a pinch.
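The load testing in item 1 could start as small as the following sketch. The simulated request function, the request count, and the 500 ms p95 threshold are all placeholder assumptions, not an agreed spec:

```python
import random
import statistics
import time

def simulated_request():
    # Placeholder for an HTTP GET against a GraceDB test instance;
    # here it just sleeps for a few milliseconds.
    time.sleep(random.uniform(0.001, 0.005))

latencies = []
for _ in range(50):  # placeholder request count
    start = time.perf_counter()
    simulated_request()
    latencies.append(time.perf_counter() - start)

# statistics.quantiles with n=20 yields 19 cut points; the last one is the p95.
p95 = statistics.quantiles(latencies, n=20)[-1]
print(p95 < 0.5)  # placeholder spec: 95th-percentile latency under 500 ms
```

A real version would issue concurrent authenticated and unauthenticated HTTP requests and run both at-spec (CI, pass/fail) and beyond-spec (exploratory) variants.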
I would expect this to be part of Phase II of the O4 LLAI review. (If there is a milestone or tag you'd like to use for such LLAI review tasks, please add it to this ticket so we can find them all easily in the future.) (Assignee: Alexander Pace)

Issue #300: navbar css breaks for alert and pipeline management pages (Alexander Pace, 2023-05-22)
https://git.ligo.org/computing/gracedb/server/-/issues/300

![Screen_Shot_2023-05-18_at_7.57.12_AM](/uploads/8f43bb5c865d37eff88d195de08fb795/Screen_Shot_2023-05-18_at_7.57.12_AM.png)

The css for the `btn-sm` buttons further down on the page is taking priority for some reason. Fixed by https://git.ligo.org/computing/gracedb/server/-/commit/38ac7269494411fe3488dd4cdcdaad6e0329bca7

Issue #299: RRT view should include t_0, not t_end (Peter Shawhan, 2023-05-22)
https://git.ligo.org/computing/gracedb/server/-/issues/299

I noticed last night that the RRT view includes the `t_end` value for the superevent. However, `t_end` is **not** the time of the GW signal; it is the ending time of the superevent's time window, which is typically 1 second after the GW signal time (for CBC events). The RRT view should be changed to display `t_0` (*) instead of `t_end`. I will email Sushant to alert him to this directly; I can't seem to get his user ID to pop up here using `@<name>`.
(*) In fact, `t_0` is not guaranteed to be the time of the preferred event either, it seems; I think it is set for the first preferred event and not updated if some other event becomes the preferred event. I say this based on some recent examples on gracedb-playground: https://gracedb-playground.ligo.org/superevents/S230516ea/view/, https://gracedb-playground.ligo.org/superevents/S230516ep/view/ . But the time differences are typically at the milliseconds level, so `t_0` should normally be good enough.

Issue #298: Add new BBH search to GraceDB Playground (William Benoit, 2024-01-30)
https://git.ligo.org/computing/gracedb/server/-/issues/298

Our group would like our binary black hole search pipeline, `aframe`, to be added to GraceDB Playground. An example output file is attached.

[event_1368221446.json](/uploads/850dede670d9e1bff55a637616c67b65/event_1368221446.json)
As `aframe` is a machine learning-based pipeline that performs binary classification of the strain data, we do not provide any parameters associated with the signal.
We have a simple online implementation prepared, and are currently developing the complete workflow, which will include automatically re-training and evaluating the network. Please let me know if you need further information. (Milestone: O4 Debugging and Improvements)

Issue #297: Support other file formats (other than XML VOEvent) to ingest external events from (Brandon Piotrzkowski, 2024-03-15)
https://git.ligo.org/computing/gracedb/server/-/issues/297

Currently we can only ingest VOEvent XML files to create external events via `gracedb.create_event`. There is already a need to ingest events delivered via Kafka that have a `.json` format; we are currently working around this by converting them to a VOEvent packet here (note this code has not been merged yet and is subject to change):
https://git.ligo.org/emfollow/gwcelery/-/blob/dfdd84a97dec60257c4d7bd91d6c0c9442ec3de6/gwcelery/tasks/external_triggers.py#L561-626
Example alert:
https://git.ligo.org/emfollow/gwcelery/-/blob/a35b3ba998ab4726f90d5fb3cdf87d365cccbc65/gwcelery/tests/data/kafka_alert_fermi.json
In general we should make a more flexible system to ingest external events as GCN moves towards Kafka, potentially able to add new notice types/formats as needed (e.g. Kamland notices also have a different format, etc.)
I assume we need to make additional parser functions such as [`populateGrbEventFromVOEventFile`](https://git.ligo.org/computing/gracedb/server/-/blob/79c9b1ead0086fc4789a32c597b84c7abaee9513/gracedb/events/translator.py#L646), add options to use them, and determine which schema is being used [here](https://git.ligo.org/computing/gracedb/server/-/blob/79c9b1ead0086fc4789a32c597b84c7abaee9513/gracedb/events/translator.py#L352).
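A hypothetical sketch of that dispatch. `populateGrbEventFromVOEventFile` is a real function in translator.py, but the JSON parser and the `PARSERS` registry below are invented names for illustration only:

```python
# Hypothetical format dispatch for external-event ingestion; not GraceDB code.
import json
import os

def populate_from_voevent(path):
    # stand-in for populateGrbEventFromVOEventFile
    return {"format": "voevent", "path": path}

def populate_from_kafka_json(path):
    # hypothetical parser for Kafka-delivered JSON notices
    with open(path) as f:
        return {"format": "kafka-json", "alert": json.load(f)}

PARSERS = {".xml": populate_from_voevent, ".json": populate_from_kafka_json}

def ingest_external_event(path):
    # Choose the parser from the file extension; new notice types
    # (e.g. Kamland's format) would register additional entries in PARSERS.
    ext = os.path.splitext(path)[1].lower()
    if ext not in PARSERS:
        raise ValueError(f"unsupported external event format: {ext}")
    return PARSERS[ext](path)
```

In practice the schema probably cannot be inferred from the extension alone, so a real implementation would likely also sniff the payload or take an explicit format argument.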
I can personally help with this development if needed, especially after the start of O4.

Issue #295: Public Alerts Page: L1/H1 omegascan locations (Alexander Pace, 2023-05-11)
https://git.ligo.org/computing/gracedb/server/-/issues/295

tagging @derek.davis @joseph-areeda
The static url location for DQR is different in O4 than in O3, and that change has been [reflected in the RRT table](https://git.ligo.org/computing/gracedb/server/-/blob/master/gracedb/templates/superevents/rrt_info_table.html#L127).
The [public alerts page](https://gracedb.ligo.org/superevents/public/O3/) has links for L1/H1 omegascans for publicly-exposed superevents. In O3 those URLs were:
* L1: `https://ldas-jobs.ligo-la.caltech.edu/~detchar/dqr/events/{superevent_id}/L1deepomegascan/`
* H1: `https://ldas-jobs.ligo-wa.caltech.edu/~detchar/dqr/events/{superevent_id}/H1deepomegascan/`
But this needs to be updated for O4.
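For reference, the O3 scheme above as Python format strings (the O4 pattern is exactly what this issue is requesting, so only O3 is shown):

```python
# O3 omegascan URL templates, copied verbatim from the list above.
O3_OMEGASCAN_URLS = {
    "L1": "https://ldas-jobs.ligo-la.caltech.edu/~detchar/dqr/events/{superevent_id}/L1deepomegascan/",
    "H1": "https://ldas-jobs.ligo-wa.caltech.edu/~detchar/dqr/events/{superevent_id}/H1deepomegascan/",
}

print(O3_OMEGASCAN_URLS["L1"].format(superevent_id="S200316bj"))
# https://ldas-jobs.ligo-la.caltech.edu/~detchar/dqr/events/S200316bj/L1deepomegascan/
```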
Could you provide the new URL format for omegascans in O4? (Milestone: Critical Path O4 Development)

Issue #294: RRT Table: "S" vs "GW" in GCN Circular Link (Alexander Pace, 2023-05-22)
https://git.ligo.org/computing/gracedb/server/-/issues/294

@keita.kawabe @sushant.sharma-chaudhary
I was fixing GraceDB's [public alerts page](https://gracedb.ligo.org/superevents/public/O3/) for O4 and I noticed a discrepancy between the link URL for GCN Circulars between the public alerts page and the RRT table.
Both links have the same URL prefix (`https://gcn.gsfc.nasa.gov/other/`). However on the public alerts page, the circular file name has the `GW` prefix, while it begins with `S` for the RRT table.
Example: for [S200316bj](https://gracedb.ligo.org/superevents/S200316bj/view/), the circular URL is correctly [`https://gcn.gsfc.nasa.gov/other/GW200316bj.gcn3`](https://gcn.gsfc.nasa.gov/other/GW200316bj.gcn3).
The RRT Table shows the circular URL for [S230502m](https://gracedb-test.ligo.org/superevents/S230502m/view/) as `https://gcn.gsfc.nasa.gov/other/S230502m.gcn3`.
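In code terms, the two behaviors differ only in whether the leading `S` of the superevent ID is swapped for `GW` before building the file name. A minimal illustration (not the actual template code):

```python
GCN_PREFIX = "https://gcn.gsfc.nasa.gov/other/"

def public_alerts_circular_url(superevent_id):
    # Public alerts page behavior: replace the leading "S" with "GW".
    return GCN_PREFIX + "GW" + superevent_id[1:] + ".gcn3"

def rrt_table_circular_url(superevent_id):
    # RRT table behavior: use the superevent ID unchanged.
    return GCN_PREFIX + superevent_id + ".gcn3"

print(public_alerts_circular_url("S200316bj"))  # .../GW200316bj.gcn3
print(rrt_table_circular_url("S230502m"))       # .../S230502m.gcn3
```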
Is this an oversight in the RRT table or a change for O4 that I need to reflect in the public alerts page? (Milestone: Critical Path O4 Development)

Issue #293: Allow an easy deployment on k8s infrastructure (gracedb-test01.igwn.org/minikube) (Roberto DePietri, 2023-05-16)
https://git.ligo.org/computing/gracedb/server/-/issues/293

To follow the advice of the LLAI reviewer, we should allow easy deployment of the GraceDB server code on k8s using the created docker container.
- Brainstorming on LLAI tiers and local development. [DCC](https://dcc.ligo.org/LIGO-G2300724)
- Telecon technical call [2023 05 01](https://git.ligo.org/emfollow/gwcelery/-/wikis/telecons/2023-05-01)
- Standalone GraceDB test instance with Minikube [dcc](https://dcc.ligo.org/LIGO-G2201921)
- old merge request [link](https://git.ligo.org/computing/gracedb/server/-/merge_requests/61)
-- To be completed with the requirements ----
Associated merge requests:
1. https://git.ligo.org/computing/gracedb/server/-/merge_requests/130
1. https://git.ligo.org/computing/igwn-alert/overseer/-/merge_requests/3

Issue #292: Improve ingestion of Burst cWB upload. (ingest chirp mass) (Roberto DePietri, 2023-07-18)
https://git.ligo.org/computing/gracedb/server/-/issues/292

Ingest from cWB the additional information:
SOLVED BY MERGE REQUEST: https://git.ligo.org/computing/gracedb/server/-/merge_requests/134
- new database variable for MultiBurst [strain,mchirp,hoft,code]
- update the translator to fix the ingestion problems
- update the dictionary returned by the API
- update visualization
Provided trigger examples:
* BBH: [G1064199](https://gracedb-playground.ligo.org/events/G1064199/view/), event_time=segment\[0\]+chirp\[7\]=1369177046.0+156.851562=1369177202.851562
* AllSky: [G1064211](https://gracedb-playground.ligo.org/events/G1064211/view/), event_time=time\[0\]=1369177488.7303
Roberto: The format is OK, and the injection of new and legacy triggers works.
========== ORIGINAL REQUEST ==========
- “**chirp mass**”
- change the definition of “event time” of Burst-cWB-BBH
- "**hoft version**" to ingest the channel provided in "trigger.txt"
- instead of the "**#significance based on the last day**" we should use "**#significance based on the last week**"
We should add the additional labels **cWB_XP** and **cWB_2G**.
cWB provides a trigger.txt file with the parameters of the reconstructed event.
Each line in this file describes a parameter or an array of parameters.
For example, the chirp string has an array of 9 parameters chirp[0:8], where
- chirp[1] is the reconstructed chirp mass
- chirp[7] is the CBC merger time [s] relative to the start of the analysis segment
which is segment[0] in the corresponding segment string in trigger.txt
The array time[0:1] gives the peak time of the event in L1 (time[0]) and H1 (time[1])
The difference between these two gives the measured time-of-flight between the detectors.
[Example_description__CWB.rtf](/uploads/6e76da959e6a7b7f2899bead30c09129/Example_description__CWB.rtf)
Currently, time[0], which is the peak time of the event, is used by GWcelery as the “Event time”.
I think it will be wrong to hack the trigger.txt file changing the definition of time[0] to
be the merger time for cWB BBH search. Instead, GWcelery should use
BBH search
- “event time” = segment[0]+chirp[7]
- “peak time" = time[0]
- “**chirp mass**” = chirp[1]
All-sky search
- “event time” = time[0]
- “peak time" = time[0]
- “**chirp mass**” = 0 (not defined for all-sky search)
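The proposed conventions above can be collected into one helper. The numbers below are from the G1064199 example earlier in this issue, except the chirp-mass value, which is a made-up placeholder (the real trigger file would supply chirp[1]):

```python
# Sketch of the proposed cWB conventions; trigger.txt parsing is omitted,
# and segment/chirp/time are the arrays described above.

def cwb_event_info(search, time, segment=None, chirp=None):
    """Return (event_time, peak_time, chirp_mass) under the proposed rules."""
    peak_time = time[0]                      # time[0]: peak time in L1
    if search == "BBH":
        event_time = segment[0] + chirp[7]   # segment start + merger offset
        chirp_mass = chirp[1]
    else:                                    # all-sky search
        event_time = time[0]
        chirp_mass = 0.0                     # not defined for all-sky
    return event_time, peak_time, chirp_mass

# BBH example G1064199: segment[0] = 1369177046.0, chirp[7] = 156.851562.
# chirp[1] = 25.0 is a placeholder, not from the real trigger file.
chirp = [0.0, 25.0, 0.0, 0.0, 0.0, 0.0, 0.0, 156.851562, 0.0]
event_time, peak_time, mc = cwb_event_info(
    "BBH", [1369177488.7303], segment=[1369177046.0], chirp=chirp)
print(round(event_time, 6))  # matches the 1369177202.851562 quoted above
```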
RELATED issue: https://git.ligo.org/emfollow/gwcelery/-/issues/579

Issue #291: VersionedFile symlink inconsistency (Alexander Pace, 2023-05-04)
https://git.ligo.org/computing/gracedb/server/-/issues/291

When a user uploads multiple files that have the same filename within an exceedingly small time window, there's a chance that the [block of code](https://git.ligo.org/computing/gracedb/server/-/blob/d32071c941c905a13f043dbec16fa41d0fd9bfb4/gracedb/core/vfile.py#L102-110) that creates a symlinked version file can hit a race condition.
This happens pretty rarely, but whenever it does, it's always from gwcelery uploading multiple circular templates, which is a [known](https://git.ligo.org/emfollow/gwcelery/-/issues/480) [bug](https://git.ligo.org/emfollow/gwcelery/-/issues/616) that's being addressed.
That being said, examining the files in question in this superevent [S230504an](https://gracedb-playground.ligo.org/superevents/S230504an/view/):
![Screen_Shot_2023-05-04_at_3.31.12_PM](/uploads/50bfe26f2ba327d66f5969e70f0b4d38/Screen_Shot_2023-05-04_at_3.31.12_PM.png)
The file versioning seems to have worked like it should have? And the symlink seems to be pointing at the right file? But honestly it's difficult to tell when there are so many duplicates of the same file. So I don't know if the Error that Brian Moe raised in that routine is correct.... or if there was a brief moment in that superevent's timeline when the symlink was inconsistent with the intended file, or if that broken symlink was fixed the next time a new file came in, or if it's still broken and just pointing to the wrong file (which happens to be the same?).
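For reference, a common generic pattern for closing this class of race (an assumption about a possible fix, not what `vfile.py` does today) is to create the symlink under a temporary name and then rename it into place, since rename is atomic on POSIX filesystems:

```python
# Generic atomic-symlink pattern (assumes a POSIX filesystem).
import os
import tempfile

def atomic_symlink(target, linkpath):
    """Repoint `linkpath` at `target` with no window where the link is absent."""
    tmp = linkpath + ".tmp"
    if os.path.lexists(tmp):           # clear leftovers from a crashed attempt
        os.remove(tmp)
    os.symlink(target, tmp)            # build the new link under a temp name...
    os.replace(tmp, linkpath)          # ...then atomically rename it into place

d = tempfile.mkdtemp()
link = os.path.join(d, "initial-circular.txt")
atomic_symlink("initial-circular.txt,0", link)
atomic_symlink("initial-circular.txt,1", link)  # readers never see a gap
print(os.readlink(link))  # initial-circular.txt,1
```

This protects readers; concurrent writers would still need locking or per-writer temp names to avoid colliding on the `.tmp` path.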
Given that, and that it only occurs during the gwcelery bug that's going to get fixed, I'm kind of afraid to touch it without knowing what's really going on and having a good way to test it. (Milestone: O4 Debugging and Improvements)

Issue #290: RRT View: initital-circular.txt vs initial-emcoinc-circular.txt (Alexander Pace, 2023-05-05)
https://git.ligo.org/computing/gracedb/server/-/issues/290

@keita.kawabe @sushant.sharma-chaudhary
Check out the RRT table on this MDC superevent on gracedb-test: https://gracedb-test.ligo.org/superevents/MS230504f/view/#rapid-response-information
The link to the [circular text](https://gracedb-test.ligo.org/api/superevents/MS230504f/files/initial-circular.txt) is broken, because that file isn't present. It looks like there's no `initial-circular.txt`, but there is an `initial-emcoinc-circular.txt`:
![Screen_Shot_2023-05-04_at_11.53.05_AM](/uploads/b7769b8218d76ad15eadd4486c6badec/Screen_Shot_2023-05-04_at_11.53.05_AM.png)
Do you know when and under what circumstances the filename changes, and if this circular is the file that the RRT expects to read?
This might be a RAVEN thing, so I'm tagging @brandon.piotrzkowski, who might know.

Issue #289: Proposal to hide exposed hourly MDC superevents on production (Alexander Pace, 2023-05-23)
https://git.ligo.org/computing/gracedb/server/-/issues/289

**Description:** Moving into O4, I've been monitoring the load on the production database, and I noticed that the highest load on the database (over two orders of magnitude more CPU usage than other requests) occurs under a very specific circumstance: when an _unauthenticated_ user makes a request to view _public_ data products. An example would be when a member of the public views a public superevent page, or a script scrapes for public skymaps, etc.
I traced this down to the SQL that's generated by a `django-guardian` function called `get_objects_for_user`. There has to be an underlying bug with GraceDB's public `viewexposed` permission, but I haven't been able to find it yet.
That being said, there are a couple of [stackoverflow](https://stackoverflow.com/a/19444128) posts and github issues about this function, and this statement matches my experience:
> Also, if possible, i suggest you don't use get_objects_for_user shortcut when project gets bigger. Its VERY slow query once you get more objects/permissions in the database.
:arrow_up: that seems consistent with some [testing](https://git.ligo.org/computing/gracedb/server/-/issues/249#note_689232) that i've seen this week.
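For intuition about why the cost scales with the number of objects, the operation `get_objects_for_user` performs can be caricatured in plain Python. This is not django-guardian's implementation (the real query does SQL joins against per-object permission tables), just the shape of the work:

```python
# Caricature only: each candidate object must be checked against the
# permission grants, so the cost grows with the candidate set. That is
# why shrinking the set of exposed superevents (step 1 above) helps.

def get_objects_for_user(granted, permission, objects):
    """Return the subset of `objects` on which the user holds `permission`.

    granted: set of (permission, object_id) pairs the user has been given.
    """
    return [obj for obj in objects if (permission, obj) in granted]

granted = {("view_superevent", "S200316bj")}
candidates = ["S200316bj", "MS230504f", "S230502m"]
print(get_objects_for_user(granted, "view_superevent", candidates))
# ['S200316bj']
```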
So why wasn't this an issue before? At the end of O3, there were 80 exposed (public) superevents. That's a trivial number of items from a database standpoint. But in the three years since O3 ended, the hourly first-two-years MDC uploads have been exposed to the public. Multiply 24 daily superevents by three years and all of a sudden....
```
In [11]: Superevent.objects.filter(is_exposed=True).filter(category='M').count()
Out[11]: 35354
```
There's over 35,000 exposed superevents and growing by the hour.
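That count is consistent with the hourly upload rate (the three-year span is approximate, which accounts for the shortfall):

```python
# One MDC superevent per hour, every day, for roughly three years:
per_day = 24
approx_days = 3 * 365
print(per_day * approx_days)  # 26280, same order as the 35,354 counted above
```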
A quick test can be to open this file list: https://gracedb.ligo.org/superevents/S200316bj/files/
as an authenticated user (243ms):
![Screen_Shot_2023-05-03_at_11.37.54_AM](/uploads/538d6706b93b2e0b10593b03e72b5c0d/Screen_Shot_2023-05-03_at_11.37.54_AM.png)
and in incog (13.5s :sob:):
![Screen_Shot_2023-05-03_at_11.39.53_AM](/uploads/9d0d798392f395b6467e2a88670b50a9/Screen_Shot_2023-05-03_at_11.39.53_AM.png)
**Proposal:**
1) Unless there are objections, I'm going to hide exposed MDC uploads and see the performance impact.
2) If it works, then I'm going to set up a tool to hide all (or a subset..?) of MDC superevents (which is a bandaid)
3) Figure out what's wrong with the permissions, because finding the bug might have other wider-ranging performance implications
4) Unless there is the desire to have the test uploads public, then modify GWCelery not to expose the test uploads. We can revisit this request based on the results of 1-3.Critical Path O4 DevelopmentAlexander PaceAlexander Pacehttps://git.ligo.org/computing/gracedb/server/-/issues/288RRT view: DQR URL for O4 production2023-05-05T16:30:21ZKeita KawabeRRT view: DQR URL for O4 productionOn gracedb-test, DQR URL (`https://ldas-jobs.ligo.caltech.edu/\\\~dqr/o4dqr/online/events/${YYYMM}/${graceid}/`) is different from the URL the RRT view code expects (`https://ldas-jobs.ligo.caltech.edu/\\\~dqr/o4dqr-results/production/ev...On gracedb-test, DQR URL (`https://ldas-jobs.ligo.caltech.edu/\\\~dqr/o4dqr/online/events/${YYYMM}/${graceid}/`) is different from the URL the RRT view code expects (`https://ldas-jobs.ligo.caltech.edu/\\\~dqr/o4dqr-results/production/events/${YYYMM}/${graceid}/`).
I'd like @derek.davis to chime in. If the URL scheme is final for O4, we'll have to change the code on the GraceDB side.

Issue #287: Updating current list of CBC pycbc uploaders for GraceDB (Bhooshan Uday Gadre, 2023-05-04)
https://git.ligo.org/computing/gracedb/server/-/issues/287

We request to **add** the following names to the list of pycbc users who can upload to GraceDB:
1. pycbc.offline
2. max.trevor
3. kanchan.soni
4. shreejit.jadhav
5. xan.morice-atkinson
6. stephanie.hoang
7. praveen.kumar
8. ana.lorenzo
9. adrianofrattale.mascioli
10. barna.fekecs
11. bhooshan.gadre
We would like to **remove** the following names from the current list:
1. andrewlawrence.miller@ligo.org
2. henning.fehrmann@ligo.org
3. karsten.wiesner@ligo.org
4. collin.capano@ligo.org
5. alex.nitz@ligo.org
6. christopher.biwer@ligo.org
7. stanislav.babak@ligo.org
8. gergely.debreczeni@ligo.org
9. duncan.brown@ligo.org
10. badri.krishnan@ligo.org
11. saeed.mirshekari@ligo.org

(Milestone: O4 Debugging and Improvements)

Issue #286: Creating Test External Events on GraceDB-Playground (Ryan Fisher, 2023-07-28)
https://git.ligo.org/computing/gracedb/server/-/issues/286

I would like to either create test External GRB events on GraceDB-Playground or set up a system where I am able to ask for some to be created, if needed.
The need for these events is that the search I am running requires External short GRB events to appear in the database, such that they overlap with observing mode data. This triggers the medium latency PyGRB search to run. I would replicate previous GRB events, with updated event times such that the events overlap with the observing mode data.
I am not attempting to submit new GW events. Group would be External, Pipeline would be Fermi. Search would not be applicable, etc. It would be exactly like the External event here: https://gracedb-playground.ligo.org/documentation/models.html
I would like to learn how to submit these events (just a pointer to the correct instructions would be fine) and how to get authorization (and authentication) to do so.
If there is already a guarantee that new short GRB events will appear in GraceDB-Playground overlapping with ER15 at a rate of at least 1 per day, then this request can be closed.
Thank you! (Assignee: Alexander Pace)

Issue #285: Occasional shuffling of containers in swarm deployment (Alexander Pace, 2024-03-13)
https://git.ligo.org/computing/gracedb/server/-/issues/285

Occasionally docker swarm will shuffle which containers are running on which nodes, and I'm still trying to figure out why. So for example, earlier this morning:
![Screen_Shot_2023-04-21_at_11.36.02_AM](/uploads/d4e8b473b9d61ccbcad288725ff4b33a/Screen_Shot_2023-04-21_at_11.36.02_AM.png)
The containers were running stably on nodes a, b, c, but then the gracedb container on b switched to a different node. Looking at the logs for gracedb/traefik/haproxy showed these warnings coming from HAproxy (`webgateway_dockersocket`):
```
Apr 21 14:45:24 gracedb_docker_webgateway_dockersocket.3.zl5suk39kwwg50fty1u6f3vyt: Stopping backend dockerbackend in 0 ms.
Apr 21 14:45:24 gracedb_docker_webgateway_dockersocket.3.zl5suk39kwwg50fty1u6f3vyt: Stopping frontend dockerfrontend in 0 ms.
Apr 21 14:45:24 gracedb_docker_webgateway_dockersocket.3.zl5suk39kwwg50fty1u6f3vyt: Proxy dockerbackend stopped (FE: 0 conns, BE: 166378 conns).
Apr 21 14:45:24 gracedb_docker_webgateway_dockersocket.3.zl5suk39kwwg50fty1u6f3vyt: Proxy dockerfrontend stopped (FE: 23768 conns, BE: 0 conns).
Apr 21 14:45:24 gracedb_docker_webgateway_dockersocket.1.ntddvqu1j5e3eqms12t8h12d9: [WARNING] 110/144524 (1) : Exiting Master process...
Apr 21 14:45:24 gracedb_docker_webgateway_dockersocket.3.zl5suk39kwwg50fty1u6f3vyt: [WARNING] 110/144524 (1) : Exiting Master process...
Apr 21 14:45:24 gracedb_docker_webgateway_dockersocket.3.zl5suk39kwwg50fty1u6f3vyt: [WARNING] 110/144524 (7) : Stopping backend dockerbackend in 0 ms.
Apr 21 14:45:24 gracedb_docker_webgateway_dockersocket.3.zl5suk39kwwg50fty1u6f3vyt: [WARNING] 110/144524 (7) : Stopping frontend dockerfrontend in 0 ms.
Apr 21 14:45:24 gracedb_docker_webgateway_dockersocket.3.zl5suk39kwwg50fty1u6f3vyt: [WARNING] 110/144524 (7) : Stopping frontend GLOBAL in 0 ms.
Apr 21 14:45:24 gracedb_docker_webgateway_dockersocket.3.zl5suk39kwwg50fty1u6f3vyt: [WARNING] 110/144524 (7) : Proxy dockerbackend stopped (FE: 0 conns, BE: 166378 conns).
Apr 21 14:45:24 gracedb_docker_webgateway_dockersocket.3.zl5suk39kwwg50fty1u6f3vyt: [WARNING] 110/144524 (7) : Proxy dockerfrontend stopped (FE: 23768 conns, BE: 0 conns).
Apr 21 14:45:24 gracedb_docker_webgateway_dockersocket.3.zl5suk39kwwg50fty1u6f3vyt: [WARNING] 110/144524 (7) : Proxy GLOBAL stopped (FE: 0 conns, BE: 0 conns).
Apr 21 14:45:24 gracedb_docker_webgateway_dockersocket.1.ntddvqu1j5e3eqms12t8h12d9: [WARNING] 110/144524 (7) : Stopping backend dockerbackend in 0 ms.
Apr 21 14:45:24 gracedb_docker_webgateway_dockersocket.1.ntddvqu1j5e3eqms12t8h12d9: [WARNING] 110/144524 (7) : Stopping frontend dockerfrontend in 0 ms.
Apr 21 14:45:24 gracedb_docker_webgateway_dockersocket.1.ntddvqu1j5e3eqms12t8h12d9: Stopping backend dockerbackend in 0 ms.
Apr 21 14:45:24 gracedb_docker_webgateway_dockersocket.1.ntddvqu1j5e3eqms12t8h12d9: Stopping frontend dockerfrontend in 0 ms.
Apr 21 14:45:24 gracedb_docker_webgateway_dockersocket.1.ntddvqu1j5e3eqms12t8h12d9: Proxy dockerbackend stopped (FE: 0 conns, BE: 63211 conns).
Apr 21 14:45:24 gracedb_docker_webgateway_dockersocket.1.ntddvqu1j5e3eqms12t8h12d9: Proxy dockerfrontend stopped (FE: 9029 conns, BE: 0 conns).
Apr 21 14:45:24 gracedb_docker_webgateway_dockersocket.1.ntddvqu1j5e3eqms12t8h12d9: [WARNING] 110/144524 (7) : Stopping frontend GLOBAL in 0 ms.
Apr 21 14:45:24 gracedb_docker_webgateway_dockersocket.1.ntddvqu1j5e3eqms12t8h12d9: [WARNING] 110/144524 (7) : Proxy dockerbackend stopped (FE: 0 conns, BE: 63211 conns).
Apr 21 14:45:24 gracedb_docker_webgateway_dockersocket.1.ntddvqu1j5e3eqms12t8h12d9: [WARNING] 110/144524 (7) : Proxy dockerfrontend stopped (FE: 9029 conns, BE: 0 conns).
Apr 21 14:45:24 gracedb_docker_webgateway_dockersocket.1.ntddvqu1j5e3eqms12t8h12d9: [WARNING] 110/144524 (7) : Proxy GLOBAL stopped (FE: 0 conns, BE: 0 conns).
Apr 21 14:45:24 gracedb_docker_webgateway_dockersocket.3.zl5suk39kwwg50fty1u6f3vyt: [ALERT] 110/144524 (1) : Current worker #1 (7) exited with code 0 (Exit)
Apr 21 14:45:24 gracedb_docker_webgateway_dockersocket.1.ntddvqu1j5e3eqms12t8h12d9: [ALERT] 110/144524 (1) : Current worker #1 (7) exited with code 0 (Exit)
```
followed in the same log file by these errors from traefik (`webgateway_webgateway`):
```
Apr 21 14:45:24 gracedb_docker_webgateway_webgateway.2.rn0rimzkaijqp324b90r9ts09: 131.215.113.226 - - [21/Apr/2023:14:45:24 +0000] "GET /api/events/G989764/log/ HTTP/1.1" 200 1082 "-" "-" 621950 "gracedb@docker" "http://10.0.0.47:80" 95ms
Apr 21 14:45:24 gracedb_docker_webgateway_webgateway.2.rn0rimzkaijqp324b90r9ts09: time="2023-04-21T14:45:24Z" level=error msg="accept tcp [::]:443: use of closed network connection" entryPointName=websecure
Apr 21 14:45:24 gracedb_docker_webgateway_webgateway.2.rn0rimzkaijqp324b90r9ts09: time="2023-04-21T14:45:24Z" level=error msg="accept tcp [::]:80: use of closed network connection" entryPointName=web
Apr 21 14:45:24 gracedb_docker_webgateway_webgateway.2.rn0rimzkaijqp324b90r9ts09: time="2023-04-21T14:45:24Z" level=error msg="Error while starting server: http: Server closed" entryPointName=web
Apr 21 14:45:24 gracedb_docker_webgateway_webgateway.2.rn0rimzkaijqp324b90r9ts09: time="2023-04-21T14:45:24Z" level=error msg="Error while starting server: http: Server closed" entryPointName=websecure
Apr 21 14:45:24 gracedb_docker_webgateway_webgateway.2.rn0rimzkaijqp324b90r9ts09: time="2023-04-21T14:45:24Z" level=error msg="Error while starting server: http: Server closed" entryPointName=web
Apr 21 14:45:24 gracedb_docker_webgateway_webgateway.2.rn0rimzkaijqp324b90r9ts09: time="2023-04-21T14:45:24Z" level=error msg="close tcp [::]:80: use of closed network connection" entryPointName=web
Apr 21 14:45:24 gracedb_docker_webgateway_webgateway.2.rn0rimzkaijqp324b90r9ts09: time="2023-04-21T14:45:24Z" level=error msg="Error while starting server: http: Server closed" entryPointName=websecure
```
On the gracedb side, there was a string of errors coming from kafka:
```
Apr 21 14:45:24 gracedb-swarm-playground-us-west-2b-docker-mgr-01 gracedb_docker_gracedb_gracedb.3.rn3qo0xic2k63c3j2oflftrbz: 2023-04-21 14:45:24,354 INFO stopped: shibd (exit status 0)
Apr 21 14:45:24 gracedb-swarm-playground-us-west-2b-docker-mgr-01 gracedb_docker_gracedb_gracedb.3.rn3qo0xic2k63c3j2oflftrbz: internal kafka error: KafkaError{code=_TRANSPORT,val=-195,str="sasl_ssl://kb-2.prod.hop.scimma.org:9092/2: Disconnected (after 1245141ms in state UP)"}
Apr 21 14:45:24 gracedb-swarm-playground-us-west-2b-docker-mgr-01 gracedb_docker_gracedb_gracedb.3.rn3qo0xic2k63c3j2oflftrbz: internal kafka error: KafkaError{code=_TRANSPORT,val=-195,str="sasl_ssl://kb-2.prod.hop.scimma.org:9092/2: Disconnected (after 710755ms in state UP, 1 identical error(s) suppressed)"}
Apr 21 14:45:24 gracedb-swarm-playground-us-west-2b-docker-mgr-01 gracedb_docker_gracedb_gracedb.3.rn3qo0xic2k63c3j2oflftrbz: internal kafka error: KafkaError{code=_TRANSPORT,val=-195,str="sasl_ssl://kb-1.prod.hop.scimma.org:9092/1: Disconnected (after 710652ms in state UP)"}
Apr 21 14:45:24 gracedb-swarm-playground-us-west-2b-docker-mgr-01 gracedb_docker_gracedb_gracedb.3.rn3qo0xic2k63c3j2oflftrbz: internal kafka error: KafkaError{code=_TRANSPORT,val=-195,str="sasl_ssl://kb-1.prod.hop.scimma.org:9092/1: Disconnected (after 11863073ms in state UP, 1 identical error(s) suppressed)"}
Apr 21 14:45:24 gracedb-swarm-playground-us-west-2b-docker-mgr-01 gracedb_docker_gracedb_gracedb.3.rn3qo0xic2k63c3j2oflftrbz: internal kafka error: KafkaError{code=_TRANSPORT,val=-195,str="sasl_ssl://kb-0.prod.hop.scimma.org:9092/0: Disconnected (after 22878079ms in state UP)"}
Apr 21 14:45:24 gracedb-swarm-playground-us-west-2b-docker-mgr-01 gracedb_docker_gracedb_gracedb.3.rn3qo0xic2k63c3j2oflftrbz: internal kafka error: KafkaError{code=_TRANSPORT,val=-195,str="sasl_ssl://kb-1.prod.hop.scimma.org:9092/1: Disconnected (after 9059107ms in state UP, 1 identical error(s) suppressed)"}
Apr 21 14:45:24 gracedb-swarm-playground-us-west-2b-docker-mgr-01 gracedb_docker_gracedb_gracedb.3.rn3qo0xic2k63c3j2oflftrbz: internal kafka error: KafkaError{code=_TRANSPORT,val=-195,str="sasl_ssl://kb-2.prod.hop.scimma.org:9092/2: Disconnected (after 22094640ms in state UP, 1 identical error(s) suppressed)"}
```
But given that they all occurred within the same second, it's hard to tell what caused what. If there's any saving grace, it's that the service rolled over to the other nodes automatically... but users connected to node b probably got disconnection errors from the client.
**To recover:**
Trigger the services to scale down to two nodes:
```
docker service scale webgateway_webgateway=2 webgateway_dockersocket=2 gracedb_memcached=2 gracedb_gracedb=2
```
then back up to three:
```
docker service scale webgateway_webgateway=3 webgateway_dockersocket=3 gracedb_memcached=3 gracedb_gracedb=3
```

(Milestone: O4 Debugging and Improvements)

---

**https://git.ligo.org/computing/gracedb/server/-/issues/284 — RRT view: cWB BBH is displayed as unmodelled burst** (Keita Kawabe, 2023-05-11)

As of April 21, 2023 on gracedb-playground, the cWB BBH search is displayed as "burst (unmodelled)" in the RRT view, while the expected behavior is that cWB BBH events are displayed in the same row as CBC events. It's OK to leave it as is for ER15, but this needs to be fixed for O4.
I don't know if the cWB BBH search group will be changed to CBC in the future, but until/unless that's done, this will continue to be an issue.
For example, in https://gracedb-playground.ligo.org/superevents/S230421em/view/ the preferred event G1016967 is cWB BBH, but it's in the "burst (unmodelled)" row.
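The misgrouping can be illustrated with a hypothetical sketch of the row-assignment logic; the function and field names here (`rrt_row`, `group`, `pipeline`, `search`) are assumptions for illustration, not GraceDB's actual server code. A fix would special-case the pipeline/search combination rather than keying on the event's group alone:

```python
# Hypothetical sketch of RRT-view row assignment. The field names
# mirror common GraceDB event attributes (group/pipeline/search),
# but this is not the actual server implementation.

def rrt_row(group, pipeline, search):
    """Pick the RRT display row for an event."""
    # Special case: cWB's BBH search targets CBC-like signals even
    # though the event's group is "Burst", so show it with CBC events.
    if pipeline == "CWB" and search == "BBH":
        return "CBC"
    if group == "Burst":
        return "Burst (unmodelled)"
    return group

# The preferred event of S230421em (G1016967) is cWB BBH:
print(rrt_row("Burst", "CWB", "BBH"))  # → CBC
```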
![Screenshot_2023-04-21_071208](/uploads/5b692441331238d82c3e3bf3719a4216/Screenshot_2023-04-21_071208.png)
![image](/uploads/8b6c5da31e2de3ae48dce465bb67c9cf/image.png)

(Milestone: O4 Debugging and Improvements)

---

**https://git.ligo.org/computing/gracedb/server/-/issues/283 — Alert notification form silently fails if label query invalid** (Daniel Wysocki, 2023-04-20)

## Description of problem
When entering a `Label query` in the Notification create/edit forms, the validation step correctly stops you from submitting an invalid query. However, it does not provide any message about which validation failed, or even that it failed at all; the page just flickers for a second while it reloads.
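A minimal sketch of surfacing the error instead of failing silently. The validator, its grammar (label names combined with `&`, `|`, `~`, and parentheses), and the `InvalidLabelQuery` exception are all assumptions for illustration, not GraceDB's actual code; the point is that the form view should catch the error and re-render with the message attached to the field:

```python
import re

# Hypothetical grammar: label names (word characters, hyphens) combined
# with &, |, ~, parentheses, and whitespace. Anything else is rejected.
_ALLOWED = re.compile(r"^[\w\-&|~()\s]+$")


class InvalidLabelQuery(ValueError):
    """Carries a user-facing message instead of failing silently."""


def validate_label_query(query):
    if not query.strip():
        raise InvalidLabelQuery("Label query must not be empty.")
    if not _ALLOWED.match(query):
        bad = sorted({c for c in query if not _ALLOWED.match(c)})
        raise InvalidLabelQuery(
            "Invalid label query: unexpected character(s) %s. "
            "See the documentation on creating label queries." % bad)
    return query


# The form view would catch InvalidLabelQuery and re-render the page
# with the message shown next to the Label query field:
try:
    validate_label_query("f00b@r")
except InvalidLabelQuery as e:
    print(e)
```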
## Expected behavior
There should be a message indicating why it failed to validate, e.g., `Invalid label query`. Ideally we'd also link to the [docs on creating label queries](https://gracedb.ligo.org/documentation/notifications.html#creating-a-notification)
## Steps to reproduce
- Go to https://gracedb.ligo.org/alerts/notification/create/
- Fill out the description, select a contact, and then enter anything invalid in `Label query` (e.g., `f00b@r`)
- Click submit
## Context/environment
* OS: Arch Linux
* Browser: Firefox 112.0.1 (64-bit)
* Note: this was tested on gracedb-dev.ligo.org
## Suggested solutions
We already seem to have validation messages for the `Contacts` field on that page. We should just do whatever we did for that.

---

**https://git.ligo.org/computing/gracedb/server/-/issues/282 — oLIB and MLy VOEvents do not have central_frequency and duration values** (Roberto DePietri, 2023-05-05)

The VOEvent alerts are missing `p_central_freq` and `p_duration` for MLy, and `p_duration` for oLIB. For oLIB, the "fluence" is still present, which the collaboration has decided not to distribute.
Related issue: https://git.ligo.org/emfollow/gwcelery/-/issues/594
The code should be changed, my guess being something like:
```python
elif isinstance(event, LalInferenceBurstEvent):
p_freq = vp.Param(
"frequency",
value=float(event.frequency_mean),
ucd="gw.frequency",
unit="Hz",
ac=True,
)
p_freq.Description = "Mean frequency of GW burst signal"
v.What.append(p_freq)
# Calculate the fluence.
# From Min-A Cho: fluence = pi*(c**3)*(freq**2)*(hrss_max**2)*(10**3)/(4*G)
# Note that hrss here actually has units of s^(-1/2)
# XXX obviously need to refactor here.
# try:
# fluence = pi * pow(c,3) * pow(event.frequency,2)
# fluence = fluence * pow(event.hrss,2)
# fluence = fluence / (4.0*G)
#
# p_fluence = vp.Param(
# "Fluence",
# value=fluence,
# ucd="gw.fluence",
# unit="erg/cm^2",
# ac=True
# )
# p_fluence.Description = "Estimated fluence of GW burst signal"
# v.What.append(p_fluence)
### Duration
p_duration = vp.Param(
"Duration",
value=float(event.duration),
unit="s",
ucd="time.duration",
ac=True,
)
p_duration.Description = "Measured duration of GW burst signal"
v.What.append(p_duration)
elif isinstance(event, MLyBurstEvent):
p_central_freq = vp.Param(
"CentralFreq",
value=float(event.central_freq),
ucd="gw.frequency",
unit="Hz",
ac=True,
)
p_central_freq.Description = \
"Central frequency of GW burst signal"
v.What.append(p_central_freq)
### Duration
duration = event.quality_mean / (2 * np.pi * event.frequency_mean)
p_duration = vp.Param(
"Duration",
value=float(duration),
unit="s",
ucd="time.duration",
ac=True,
)
p_duration.Description = "Measured duration of GW burst signal"
v.What.append(p_duration)
except Exception as e:
logger.exception(e)
```

(Milestone: O4 Debugging and Improvements)
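For reference, the commented-out fluence formula in the snippet above can be evaluated numerically. This is only a sketch using CGS constants; the sample `frequency` and `hrss` values are made up for illustration and do not come from any real event:

```python
from math import pi

# CGS constants: speed of light (cm/s) and gravitational constant
# (cm^3 g^-1 s^-2), so the result comes out in erg/cm^2.
c = 2.99792458e10
G = 6.67430e-8


def burst_fluence(frequency, hrss):
    """Fluence in erg/cm^2, per the commented-out formula:
    fluence = pi * c**3 * freq**2 * hrss**2 / (4 * G)."""
    return pi * c**3 * frequency**2 * hrss**2 / (4.0 * G)


# Illustrative values only (frequency in Hz, hrss in s^(-1/2)):
print(burst_fluence(150.0, 1e-22))
```

Note that the prose comment in the snippet also carries a factor of `10**3` that the commented-out code omits; the discrepancy would need resolving before reinstating the calculation.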