Jameson Rollins · 019c57d2
--- a/guardctrl.md
+++ b/guardctrl.md
+# guardctrl: guardian process supervision
+
+`guardctrl` is a tool for managing guardian node processes.  It is a thin wrapper around [systemd](https://www.freedesktop.org/wiki/Software/systemd/), the init and service supervision system that is standard on all major Linux distributions.  guardctrl uses systemd and journald to start, stop, and track guardian daemons and to capture and view their log messages.
+
+Under the hood each guardian process is handled by a systemd templated service unit, `guardian@.service`, which describes how the processes should be supervised by systemd.
+
+## guardctrl host setup
+
+This section describes how to set up a computer as a guardctrl host.
+
+### install packages
+
+The `guardctrl` package is available through the [LIGO Debian apt archives](http://apt.ligo-wa.caltech.edu/debian/).  Enable the archives by installing the keyring and release packages:
+```shell
+$ wget http://software.ligo.org/lscsoft/debian/pool/contrib/l/lscsoft-archive-keyring/lscsoft-archive-keyring_2016.06.20-2_all.deb
+$ wget https://apt.ligo-wa.caltech.edu/debian/pool/stretch/cdssoft-release-stretch/cdssoft-release-stretch_1.3.0_all.deb
+$ sudo dpkg -i *.deb
+$ sudo apt-get update
+```
+Once that archive is enabled the package can be installed directly:
+```shell
+$ sudo apt install guardctrl
+```
+The `guardctrl` package depends on the `guardian` package, so you'll automatically get them both.  The guardctrl package installs the command line interface, as well as all the needed systemd service unit files.
+
+### creating and configuring the guardctrl user
+
+`guardctrl` uses the `systemd --user` instance of the invoking user.  This means that `guardctrl` should always be invoked as the same user so that processes are managed in a unified way.  The guardctrl interface knows it's running as the correct user by the presence of the `~/.guardctrl-home` file.  If this file is not present, guardctrl will assume it's running remotely and will try to ssh to `GUARDCTRL_USER@GUARDCTRL_HOST` to issue the command.
+
+For the LIGO site installations we run everything under the `guardian` user.  We therefore start by creating the `guardian` user account on the machine:
+```shell
+$ sudo adduser --gecos '' --uid 1010 --ingroup controls --disabled-password guardian
+```
+LIGO uses `uid=1010` so as not to collide with any of the other standard system users, and the `controls` group, but neither setting is required.  (NOTE: For a site setup, where guardctrl will be accessed through the passwordless SSH interface described below, the guardian user should not have a password.  Otherwise the guardian user can have a password as usual.)
+
+*The rest of the non-root commands specified below are assumed to be run as the guardctrl user created above (in this case the `guardian` user).*
+
+Create the `~/.guardctrl-home` file in the guardctrl (guardian) user's home directory, to indicate that this is the user handling systemd supervision:
+```shell
+$ touch ~guardian/.guardctrl-home
+```
+
+Finally, enable the `guardian.target` unit for auto-starting nodes on startup:
+```shell
+$ systemctl --user enable guardian.target
+```
+
+### user systemd persistence
+
+Once the desired guardian user account is ready, we need to inform the system systemd instance that the user is "persistent".  This prevents systemd from shutting down the `systemd --user` process when the user is not logged in.  We do this with `loginctl enable-linger`, with the guardian user as argument:
+```shell
+$ sudo loginctl enable-linger guardian
+```
+You might also need to extend the startup timeout for this user, as starting all the guardian processes at boot can take a while if there are many of them.  10 minutes should be enough, but this can be adjusted.  We handle this with a system-level "drop-in" for the relevant user's user service (NOTE: the number after the escaped `\@` is the relevant user's uid):
+```
+# /etc/systemd/system/user\@1010.service.d/timeout.conf
+[Service]
+TimeoutStartSec=10min
+```
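+
+As a rough sketch of how such a drop-in could be put in place, something like the following might work.  The `dropin_dir` helper is hypothetical and not part of guardctrl; it only builds the drop-in directory name from a numeric uid.  The commented commands assume they are run as root and that the `guardian` user already exists:
+```shell
+#!/bin/sh
+# Hypothetical helper (not part of guardctrl): print the system-level
+# drop-in directory for the user service of the given numeric uid.
+dropin_dir() {
+    printf '/etc/systemd/system/user@%s.service.d\n' "$1"
+}
+
+# Example usage, as root (assumes the guardian user exists):
+#   dir=$(dropin_dir "$(id -u guardian)")
+#   mkdir -p "$dir"
+#   printf '[Service]\nTimeoutStartSec=10min\n' > "$dir/timeout.conf"
+#   systemctl daemon-reload    # make systemd re-read its unit files
+```
+The drop-in path embeds the numeric uid rather than the user name, so deriving it from `id -u` avoids hard-coding `1010`.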
+
+### caRepeater service
+
+The EPICS caRepeater should be running system-wide before any of the guardian nodes start.  It's therefore a good idea to declare a dependency of the guardian user's service on the caRepeater service.  This can also be done with a drop-in:
+```
+# /etc/systemd/system/user\@1010.service.d/ca.conf
+[Unit]
+Wants=caRepeater.service
+After=caRepeater.service
+```
+NOTE: the above assumes the existence of a `caRepeater.service`.  If you don't already have one, here's an example service description:
+```shell
+# /etc/systemd/system/caRepeater.service
+[Unit]
+Description=EPICS caRepeater
+Wants=network-online.target
+After=network-online.target
+
+[Service]
+ExecStart=/usr/bin/caRepeater
+User=nobody
+
+[Install]
+WantedBy=multi-user.target
+```
+
+### configuring journald for persistent logs
+
+The LIGO setups store logs from all guardian processes in perpetuity.  To this end, the journald system logger is configured for "persistent" storage.  This is done by setting `Storage=persistent` in `/etc/systemd/journald.conf` (included below are some other variables for increasing the log rate limit, and for increasing the disk storage limits for the logs):
+```
+# /etc/systemd/journald.conf
+[Journal]
+Storage=persistent
+RateLimitBurst=100000
+SystemMaxUse=200G
+SystemMaxFiles=100000
+```
+Reload the journald config after these changes are made:
+```shell
+$ sudo systemctl force-reload systemd-journald
+```
+
+### specifying local environment
+
+The `guardian@.service` expects an `/etc/guardian/local-env` environment file to exist, for providing any needed environment variables to the supervised guardian processes.  Here's an example of the file for H1 at LHO:
+```shell
+# /etc/guardian/local-env
+IFO=H1
+SITE=LHO
+GUARD_CHANFILE=/opt/rtcds/userapps/release/cds/h1/daqfiles/ini/H1EDCU_GRD.ini
+GUARD_ARCHIVE_ROOT=/ligo/cds/lho/h1/guardian/archive
+NDSSERVER=h1nds0:8088,h1nds1:8088
+```
+Some notes:
+* The `IFO` and `SITE` variables name the interferometer and site that this host serves (`H1` and `LHO` in this example).
+* The `GUARD_CHANFILE` variable points to the location where the guardian channel list ini file, used by the CDS DAQ, will be written.  *This file location must be writable by the guardctrl user.*  (For the LIGO case above we touch the file and make it writable by the `controls` group, which the `guardian` user is a member of.)
+* The `GUARD_ARCHIVE_ROOT` variable points to a [guardian code archive](guardian-code-archives).
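+
+The channel file preparation mentioned in the notes above could be sketched as follows.  The `prepare_chanfile` helper is hypothetical and not part of guardctrl; the path and group in the usage comment are the ones from the H1 example and would differ at other installations:
+```shell
+#!/bin/sh
+# Hypothetical helper (not part of guardctrl): create the channel list
+# file if it is missing and make it writable by the given group.
+prepare_chanfile() {
+    file=$1
+    group=$2
+    touch "$file" &&
+    chgrp "$group" "$file" &&
+    chmod g+w "$file"
+}
+
+# Example usage, as root, with the values from the H1 example:
+#   prepare_chanfile /opt/rtcds/userapps/release/cds/h1/daqfiles/ini/H1EDCU_GRD.ini controls
+```
+The parent directory of the file must already exist, since `touch` does not create it.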
+
+## passwordless SSH interface
+
+The best way to allow remote control of guardctrl is via ssh.  For a site install on a protected network, where you want "remote" users (i.e. users on the same network but on hosts other than the guardctrl host) to be able to control the nodes without entering a password, you can set up a passwordless ssh "ForceCommand" for the guardctrl user.
+
+First, modify the system PAM stack to allow passwordless login via ssh.  Usually PAM is configured to not allow passwordless login on anything except for special TTYs.  To loosen that restriction, on Debian systems, we modify `/etc/pam.d/common-auth` to change the following line:
+```
+auth	[success=1 default=ignore]	pam_unix.so nullok_secure
+```
+to:
+```
+auth	[success=1 default=ignore]	pam_unix.so nullok
+```
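+
+If you prefer to script this change rather than edit the file by hand, a minimal sketch might look like the following.  The `allow_nullok` helper is hypothetical; it keeps a copy of the original file with an `.orig` suffix and assumes the file contains the `nullok_secure` option exactly as shown above:
+```shell
+#!/bin/sh
+# Hypothetical helper: replace the nullok_secure option with nullok in
+# the given PAM file, keeping the original as FILE.orig.
+allow_nullok() {
+    cp -p "$1" "$1.orig" &&
+    sed 's/nullok_secure/nullok/' "$1.orig" > "$1"
+}
+
+# Example usage, as root on a Debian system:
+#   allow_nullok /etc/pam.d/common-auth
+```
+Keep a root shell open while testing changes to the PAM stack, so that a mistake can be reverted from the `.orig` copy.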
+
+Then add to the `sshd_config` a special "Match" stanza for the guardctrl user which specifies that it may log in without a password, but is forced to execute only a single command (`guardctrl`).  On most systems this would go in `/etc/ssh/sshd_config`:
+```
+# /etc/ssh/sshd_config
+Match User guardian
+  PermitEmptyPasswords yes
+  PermitTTY yes
+  X11Forwarding no
+  AllowTcpForwarding no
+  ForceCommand /usr/bin/guardctrl
+```
+After adding the Match stanza, reload sshd:
+```shell
+$ sudo systemctl force-reload sshd
+```
+If for some reason you need to pass special environment variables to `guardctrl`, you can point the ForceCommand to something like `/etc/guardian/guardctrl-ssh-bridge` which can be a shell script that sets the needed environment and then execs `/usr/bin/guardctrl` (without arguments).  Make sure the wrapper script is executable.
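+
+A wrapper of the kind described above might be installed with a sketch like the following.  The `write_bridge` helper and the `EXAMPLE_VAR` variable are hypothetical placeholders; substitute whatever environment your installation actually needs:
+```shell
+#!/bin/sh
+# Hypothetical helper: write a minimal ssh bridge script to the given
+# path and make it executable.
+write_bridge() {
+    cat > "$1" <<'EOF'
+#!/bin/sh
+# Set whatever environment guardctrl needs (placeholder variable below),
+# then replace this shell with guardctrl itself, without arguments.
+export EXAMPLE_VAR=value
+exec /usr/bin/guardctrl
+EOF
+    chmod 755 "$1"
+}
+
+# Example usage, as root:
+#   write_bridge /etc/guardian/guardctrl-ssh-bridge
+```
+Using `exec` means no extra shell process lingers between sshd and guardctrl, and the exported variables are inherited by guardctrl.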
+
+### local guardian user access
+
+Occasionally it might be necessary to access the guardian user directly, e.g. via a terminal.  If passwordless SSH access has been enabled as described above, then it won't be possible to access a guardian user terminal via ssh directly, and you'll need to change user from root.  However, su and sudo do not provide access to the user dbus session needed to interact with `systemctl --user`.  The `systemd-container` package includes the `machinectl` interface whose `shell` command allows for a clean user environment with all dbus interfaces available:
+```shell
+root@h1guardian1:~# machinectl shell guardian@ /bin/bash
+guardian@h1guardian1:~$ systemctl --user status
+* h1guardian1
+    State: running
+     Jobs: 0 queued
+   Failed: 0 units
+    Since: Tue 2018-02-27 15:44:08 PST; 11s ago
+   CGroup: /user.slice/user-1010.slice/user@1010.service
+           `-init.scope
+             |-11818 /lib/systemd/systemd --user
+             `-11820 (sd-pam)
+guardian@h1guardian1:~$ exit
+logout
+Connection to the local host terminated.
+root@h1guardian1:~#
+```
+
+## debugging
+
+If you happen to be cursed with segfaulting processes, installing the Python and EPICS debug packages along with `systemd-coredump` might help:
+```shell
+$ sudo apt install python-dbg libepics3.15.3-dbg systemd-coredump
+```
+
+### systemd-coredump
+
+`systemd-coredump` is a particularly useful utility in this situation, since it captures and logs core dump files produced by crashing processes under systemd supervision.
+
+NOTE: by default the coredump files expire and are cleaned out after 3 days.  To remove this expiration entirely, create the following file to override the defaults:
+```shell
+# /etc/tmpfiles.d/00_coredump.conf
+d /var/lib/systemd/coredump 0755 root root -
+```
\ No newline at end of file