Incorporate LVAlert Heartbeat into nagios monitors
This is a migration of Redmine issue #5416.
Original Description (2017-04-19):
lvalert-heartbeat was packaged and deployed. The request from Reed Essick is below:
"Hi Patrick;
We've finally got lvalert-heartbeat installed on the clusters and hooked up to most (but not all) mission-critical listeners. I've confirmed that I can query the production listeners and get responses back in a timely manner (less than 5 seconds per query). The next step is to begin integrating this with Nagios so that we can leverage that notification scheme to warn developers of unresponsive code.
Could you please tell me what you need from me to make this happen? I believe lvalert_heartbeat-server should return the correct format for Nagios to interpret the result. However, I'm afraid I'm at a bit of a loss as to how to test and deploy this on monitor.ligo.org.
For your reference, the production lvalert-heartbeat repo lives here:
https://git.ligo.org/lscsoft/lvalert-heartbeat
and I've got a test environment of what Nagios might run set up on CIT here:
/home/reed.essick/lvalert-dev/lvalert-heartbeat/test/
In particular, I imagine the plugin would be run with something like:
lvalert_heartbeat-server -V /home/reed.essick/lvalert-dev/lvalert-heartbeat/test/test_config.ini
We may want to break that config up into separate queries for each node (ie, each section in test_config.ini would become a separate file), but that's really a matter of convenience.
Any thoughts or suggestions you have would be much appreciated. cheers"
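The per-node split suggested in the email could be sketched as follows. This is a minimal illustration, not part of lvalert-heartbeat: it assumes test_config.ini is a standard INI file that Python's configparser can read and that section names are usable as file names. The function name split_config and the output directory are hypothetical.

```python
import configparser
from pathlib import Path


def split_config(config_path, out_dir):
    """Write each section of an INI file to its own file; return the new paths."""
    # interpolation=None keeps values containing '%' exactly as written.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # preserve the case of option names
    with open(config_path) as handle:
        parser.read_file(handle)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for section in parser.sections():
        single = configparser.ConfigParser(interpolation=None)
        single.optionxform = str
        single[section] = dict(parser.items(section))
        # Assumption: the section name is a valid file name.
        target = out_dir / f"{section}.ini"
        with open(target, "w") as handle:
            single.write(handle)
        written.append(target)
    return written
```

Each resulting file could then be passed to its own Nagios service check, so that a failure is reported per node rather than for the whole configuration.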
This ticket is to track progress on this issue.
Update:
The comments on this ticket are as follows:
#1 Updated by Patrick Brockill over 1 year ago
After discussion with Reed and Alex, if I understand correctly I think this is what is needed:
(*) This is a simple Nagios script (prints a line of output to stdout, returns a numerical value for OK, WARNING, CRITICAL or UNKNOWN);
(*) The service has to be able to connect to lvalert.cgca.uwm.edu;
(*) This script only needs to be run on one machine. The preference is that it not be run on lvalert.cgca.uwm.edu (apparently: it should be run "far away from all online follow-up processes", including "somewhere besides CIT, LHO and LLO"). So UWM is a good fit (just not on the lvalert machine);
(*) Reed has written the script to be called from Nagios and use few system resources. He has profiled it and it uses less than 40MB of memory. It does not use Condor, does not have to be run continuously, doesn't require user home directories to be mounted, is not CPU intensive, not network intensive, not disk intensive. It just requires access to a few netrc files and a config file.
(*) It would be nice if Reed could get occasional access to the machine.
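For reference, the Nagios plugin convention mentioned in the first point can be sketched as below. This is not the lvalert_heartbeat-server code. The exit codes 0 to 3 are the standard Nagios values; the listener names, the HEARTBEAT prefix, and the policy that any silent listener is CRITICAL are assumptions made for illustration.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def summarize(responses):
    """Map {listener_name: responded?} to a (status, message) pair.

    Assumed policy: any listener that did not respond makes the check CRITICAL.
    """
    if not responses:
        return UNKNOWN, "no listeners were queried"
    silent = sorted(name for name, ok in responses.items() if not ok)
    if silent:
        return CRITICAL, "no response from: " + ", ".join(silent)
    return OK, f"all {len(responses)} listeners responded"


def main(responses):
    """Print the single status line Nagios displays and return the exit code."""
    status, message = summarize(responses)
    print(f"HEARTBEAT {LABELS[status]} - {message}")  # one line on stdout
    return status
```

A script built this way would end with sys.exit(main(responses)), where responses comes from actually querying the listeners, so that Nagios sees both the status line and the exit code.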
#2 Updated by Patrick Brockill over 1 year ago
This service is in the process of being tested by Reed and Alex on lowlatency0@uwm and can be found under the "emfollow" group on monitor. As lvalert_heartbeat-server may take some time to complete, it is currently being run with a timeout of 180 seconds.
#3 Updated by Patrick Brockill over 1 year ago
We ran into more issues with timeouts. The heartbeat monitor usually completes in just under 60 seconds, but occasionally takes slightly longer (e.g. 61 seconds). Raising the limit to 75 seconds requires three changes in Nagios:
(1) Change "service_check_timeout=60" to "service_check_timeout=75" in /etc/nagios3/nagios.cfg on dashboard;
(2) Change "command_timeout=60" to "command_timeout=75" in /etc/nagios/nrpe.cfg on lowlatency0;
(3) Use "/usr/lib/nagios/plugins/check_nrpe -t 75" in the service definitions.
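All three limits have to be raised together because each layer enforces its own timeout, so the shortest one decides when a slow check is cut off. The effect of a single layer can be illustrated with the following sketch, which runs a command under a time limit and reports a timeout as a status of its own. The function name run_check is hypothetical, and mapping a timeout to UNKNOWN is an assumption of this sketch; how Nagios and NRPE report a timeout depends on their configuration.

```python
import subprocess
import sys

UNKNOWN = 3  # Nagios exit code for "state could not be determined"


def run_check(command, timeout_seconds=75):
    """Run a plugin command; return (exit_code, first_output_line)."""
    try:
        done = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        # The child is killed once the limit is reached.
        return UNKNOWN, f"check timed out after {timeout_seconds} seconds"
    lines = done.stdout.splitlines()
    return done.returncode, (lines[0] if lines else "")
```

With a limit of 60 seconds, a run that takes 61 seconds would be reported as a timeout even though the heartbeat check itself would have succeeded, which is the behaviour described above.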
I'm leaving this ticket open, as I would like the new status monitor to be tied into Nagios. This also has overlap with #2.