gwcelery issues
https://git.ligo.org/emfollow/gwcelery/-/issues
2023-05-31T15:39:49Z

https://git.ligo.org/emfollow/gwcelery/-/issues/299
Add alarms for large changes in resource usage
2023-05-31T15:39:49Z
Leo P. Singer

Add alarms for large changes in resource usage (free memory, disk space, load, number of Redis records, etc.) on emfollow.ligo.caltech.edu.
@stuart.anderson, I bet this already exists on your side, right?

Leo P. Singer, Stuart Anderson, Deep Chatterjee (deep.chatterjee@ligo.org), Pierre Chanial

https://git.ligo.org/emfollow/gwcelery/-/issues/298
Audit notification settings for monitoring systems
2023-05-31T15:40:13Z
Leo P. Singer

Systematically check notification features of all of our monitoring services. For each system, we should answer the following questions:
1. What conditions, services, or subsystems are monitored?
2. Are urgent, high severity conditions instantly distinguishable from minor ones?
3. Who is able to act on the alert? Are they subscribed to the system? Are they subscribed in a manner that will immediately get their attention (e.g. phone/SMS) for high severity alerts?
Monitoring services that are relevant to subsystems that we are responsible for include:
1. GraceDB
2. Nagios/Icinga
3. Sentry
[DoD](https://www.agilealliance.org/glossary/definition-of-done/): a table in the docs or on a Wiki page stating the answers to the above questions for all monitoring services.

Leo P. Singer, Stuart Anderson, Deep Chatterjee (deep.chatterjee@ligo.org), Pierre Chanial

https://git.ligo.org/emfollow/gwcelery/-/issues/297
Break GWCelery nagios check into several services
2023-04-07T16:47:39Z
Leo P. Singer

I am guilty of basically ignoring Nagios alerts from GWCelery. The reason is that the GCN connection is flaky and goes up and down several times a day. Since there is only one Nagios service for each GWCelery deployment, the flakiness of that one subsystem dilutes the urgency associated with other services. We should break the GWCelery nagios check into several individual services so that we can set alert notifications on the other, non-flaky components.

post-O4
Leo P. Singer, Stuart Anderson, Luca Rei, Deep Chatterjee (deep.chatterjee@ligo.org)

https://git.ligo.org/emfollow/gwcelery/-/issues/198
Add Prometheus monitoring of task latency
2023-05-31T15:58:56Z
Leo P. Singer

Leo P. Singer
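
As a rough illustration of the split proposed in issue 297, the sketch below contrasts one combined check with per-subsystem checks. The exit codes follow the standard Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The subsystem names, the sample statuses, and the functions `combined_status` and `alerting_subsystems` are hypothetical and are not taken from GWCelery.

```python
from enum import IntEnum


class Status(IntEnum):
    """Standard Nagios plugin exit codes."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


def combined_status(results):
    """Single-service model: the highest exit code among all subsystems wins."""
    return max(results.values(), default=Status.UNKNOWN)


def alerting_subsystems(results, notify):
    """Per-service model: names of non-OK subsystems that have notifications on.

    `notify` is the set of subsystem names whose alerts should reach a person.
    """
    return sorted(
        name
        for name, status in results.items()
        if name in notify and status != Status.OK
    )


# Hypothetical snapshot: only the flaky GCN connection is down.
flaky_only = {"gcn": Status.CRITICAL, "redis": Status.OK, "gracedb": Status.OK}

# Hypothetical snapshot: a non-flaky component is down as well.
real_outage = {"gcn": Status.CRITICAL, "redis": Status.CRITICAL, "gracedb": Status.OK}

# Notifications are enabled only for the components assumed to be stable.
stable = {"redis", "gracedb"}
```

With a single service, `combined_status` reports CRITICAL for both snapshots, so the two cases look identical to whoever receives the alert. With one service per subsystem, `alerting_subsystems(flaky_only, stable)` is empty while `alerting_subsystems(real_outage, stable)` names `redis`, which is the distinction the issue asks for.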