Alerts for outages
The grid-exerciser workflow provides only very coarse alerting: usually just when the DAG cannot be submitted because of pending or held jobs from the previous instance.
Grafana supports threshold-based alerts, but it's not clear that I can (easily) set thresholds based on performance outside the current time window.
What I would like is some kind of heuristic-based alerting for jobs which suddenly start exhibiting anomalous behaviour, as a function of e.g. command and site. This may require an additional data source.
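
One possible starting point for such a heuristic is a robust z-score computed per (command, site) pair: build a baseline from historical job durations using the median and the median absolute deviation (MAD), then flag recent jobs that sit far from that baseline. The sketch below is only an illustration under assumptions: the record layout `(command, site, duration_seconds)`, the function names, the threshold of 3.5, and the minimum sample count are all hypothetical, and the historical data would have to come from whatever additional data source is chosen.

```python
from collections import defaultdict
from statistics import median

# Scales the MAD so it is comparable to a standard deviation for normal data.
MAD_SCALE = 1.4826


def build_baselines(history, min_samples=5):
    """Return {(command, site): (median, mad)} from historical job records.

    `history` is an iterable of (command, site, duration_seconds) tuples;
    this layout is an assumption made for the sketch.
    """
    groups = defaultdict(list)
    for command, site, duration in history:
        groups[(command, site)].append(duration)

    baselines = {}
    for key, durations in groups.items():
        if len(durations) < min_samples:
            continue  # too little history to judge this pair
        med = median(durations)
        mad = median([abs(d - med) for d in durations])
        baselines[key] = (med, mad)
    return baselines


def find_anomalies(baselines, recent, threshold=3.5):
    """Return recent records whose robust z-score exceeds `threshold`."""
    anomalies = []
    for command, site, duration in recent:
        baseline = baselines.get((command, site))
        if baseline is None:
            continue  # no baseline for this pair yet
        med, mad = baseline
        if mad == 0:
            continue  # no measurable spread; skip rather than divide by zero
        score = abs(duration - med) / (MAD_SCALE * mad)
        if score > threshold:
            anomalies.append((command, site, duration, score))
    return anomalies


# Illustrative data only: command and site names are made up for the example.
history = [("xrdcp", "SITE_A", d) for d in [10, 11, 9, 10, 12, 10, 11]]
recent = [
    ("xrdcp", "SITE_A", 11),   # close to the baseline
    ("xrdcp", "SITE_A", 60),   # far from the baseline
    ("xrdcp", "SITE_B", 500),  # no history for this site, so not judged
]
baselines = build_baselines(history)
flagged = find_anomalies(baselines, recent)
```

The median and MAD are used instead of the mean and standard deviation so that a handful of earlier bad jobs does not inflate the baseline and hide a new problem. Because the baseline comes from a separate history rather than the dashboard's current time window, it sidesteps the limitation noted above for Grafana thresholds, at the cost of needing that history stored somewhere.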