Study of missing notifications during O3 (not attempted by Twilio)
During O3, some people reported not receiving the notifications that they had configured GraceDB to send them. Only a small fraction of people reported this, but for those affected the problem seemed to be consistent rather than sporadic. I spent some time looking into it in the summer of 2019. For the record, here are copies of some email messages I sent to a few people (principally Tanner) at that time.
Email on July 25, 2019:
I have gotten input from a number of people and cross-checked with the Twilio logs. I have not figured out what is happening, but I have learned some things so I thought I would distill my notes and share them with you.
- The problems people are having are with Call and Text notifications, not Email notifications. Well, I haven't paid much attention to what people mentioned about email notifications, so there could be problems there too, but anyway the problems are not ALL with Email notifications. The people who have communicated with me are primarily relying on calls and/or texts.
- The Twilio logs corroborate what people have told me. For example, if they said they haven't gotten text messages and phone calls recently, the Twilio logs agree: it really looks like Twilio was not asked to call/text them. (Well, occasionally a phone call will fail and that will be shown in the Twilio log, but that is not common. It's not the explanation for people's reports of missing notifications.)
- Lots of people ARE being notified of relevant events. For instance, when S190718y was labeled with ADVREQ, the Twilio logs list 98 text messages and 41 voice calls to people to notify them. When S190720a was labeled with ADVREQ, I see 112 text messages; I didn't count the voice calls in that case. When S190724g was labeled with EM_COINC, I see 67 text messages delivered and about 42 voice calls, most of which went through and were answered.
- Some people are receiving notifications reliably, while others are not receiving any. Some people used to receive notifications but have not been receiving them recently. A few people have observed that it seems like people who set up notifications a long time ago are receiving them, while people who set up notifications recently tend not to be receiving them.
So I think there are two general types of possible reasons: either (1) some call/text requests passed to Twilio are getting lost before Twilio attempts them, or (2) there is something funny in the software that the gracedb server is using to construct the list of contacts to call or text, leading it to omit some. (e.g., before I started looking into this, I had a hypothesis that a database query was being used to get the list of contacts and there was a maximum number of records returned by the query. But having looked at the code, that doesn't fit.)
I know you mentioned that logging is not working reliably on AWS; that's too bad, because from gracedb/alerts/phone.py I can see that every call/text attempt passed to Twilio is being logged. If you have a log file that you believe to be complete for some time that includes an event, I could compare it against the Twilio logs (which I have now exported into spreadsheets, cumulative since January).
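The comparison proposed above, GraceDB's own record of call/text attempts against the Twilio export, could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not the procedure that was actually used: it assumes both logs have been reduced to CSV files covering the same time window, and the column names (`to`, `kind`, `To`, `Type`), the helper names, and the phone numbers are all hypothetical.

```python
import csv
import io
from collections import Counter


def load_attempts(fileobj, number_field, kind_field):
    """Count (phone number, kind) pairs in a CSV file object.

    The field names are passed in because the real column headers of the
    GraceDB log and the Twilio export are not known here (assumption).
    """
    reader = csv.DictReader(fileobj)
    return Counter(
        (row[number_field].strip(), row[kind_field].strip().lower())
        for row in reader
    )


def missing_from_twilio(gracedb_attempts, twilio_records):
    """Return attempts that GraceDB logged but that Twilio has no record of.

    Counter subtraction keeps only positive counts, so an entry survives
    when GraceDB logged more attempts of that kind than Twilio recorded.
    """
    return gracedb_attempts - twilio_records


# Hypothetical sample data standing in for the two exported logs.
gracedb_csv = io.StringIO(
    "to,kind\n"
    "+15555550101,text\n"
    "+15555550101,call\n"
    "+15555550102,text\n"
)
twilio_csv = io.StringIO(
    "To,Type\n"
    "+15555550101,text\n"
    "+15555550102,text\n"
)

logged = load_attempts(gracedb_csv, "to", "kind")
seen = load_attempts(twilio_csv, "To", "Type")
lost = missing_from_twilio(logged, seen)
print(sorted(lost.items()))
```

A non-empty result would point toward explanation (1), requests being lost between GraceDB and Twilio. An empty result for an event where people still missed notifications would point toward explanation (2), contacts being omitted before any request is made.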
There is a note here that "You can send messages to Twilio at a rapid rate as long as the requests do not reach Twilio's API concurrency limit which is at 100", but I don't THINK we would be running into that since call/text requests are made serially and I'm positive that Twilio is designed to queue requests and feed them out at the appropriate rate.
In terms of the software in the gracedb server, I spent some time studying the code in the gracedb/alerts directory, but it is complex enough that I can't trace it by inspection to check how it filters to get a list of matching notifications and then looks up the contact information from the notifications.
If you want a case to try debugging, look at Giacomo Ciani. His notification settings are:
Notifications
Once per year | Superevent created or updated & FAR < 3e-08 -> Text +393476487948, Email giacomo.ciani@unipd.it
Advocate request | Superevent labeled with ADVREQ -> Email giacomo.ciani@unipd.it, Call and text +393476487948
For S190720a he received several "A superevent with GraceDB ID S190720a was updated" text messages and emails (nobody received a "superevent created" message for that because the initial preferred event had too high a FAR), but did not receive any ADVREQ label messages for either S190720a or S190718y, either by text or by voice call. So it seems that the first of his two notifications was acted on but the second was not.
Email on July 26, 2019:
I've taken some more time to digest the input I've received (and collected notes in a Google doc: https://docs.google.com/document/d/1QzDS-JWxi2EAXgYP64sKJaNde29x7MN7v0JtA5IGxl8/edit). Here are my high-level findings:
- For some people, all of their notifications are working.
- Working or not working seems to be associated with specific "lines" in a user's notifications configuration. For some people, SOME of their notification lines are working while others are not. For instance, Jenne Driggers has four notification lines configured, but only the first of them is working (i.e. generating text messages logged by Twilio); the other three lines are having no effect. Looking back through the logs, that seems to have been the case ever since she created the first two lines in early April and added two more around July 18 or 19: her first notification line (which is interesting because it includes the NS candidate condition) has been working reliably, while none of the other lines has produced any notification through Twilio. Similarly, for Giacomo Ciani and Andrea Miani, their first notification line has been working while their second has not. Marco Bazzan's SECOND line has been working while his first has not.
- At least one user -- Daniel Sigg -- has two notification lines and neither is working. Daniel has in the past received phone calls through Twilio (on July 1 and 6), but he updated his alerts configuration and has not gotten any notifications since July 6.
- Giacomo Ciani, whom I mentioned above, added another line to his configuration today with voice call notifications, and it worked.
So my best picture of this is that some notification lines work and others don't. I can't tell what determines which ones work and which ones don't. It does seem to be the case that notification lines established a long time ago, or first in a person's list, are more likely to work; but that does not seem to be universal. And anyway, it's pretty clear to me that this is a GraceDB bookkeeping issue of some sort, not a problem with Twilio or individual users' phones or cell providers.
Oh, and for people who are not getting notifications according to their configuration, when they use the Test buttons to send test notifications, those work. (In most cases... Deep does not seem to be able to receive calls or text messages from Twilio on his phone.)