|
|
|
# NDS2 system troubleshooting
|
|
|
|
|
|
|
|
This document outlines what to check, and how, when diagnosing problems with the nds2 system.
|
|
|
|
|
|
|
|
## General Resources
|
|
|
|
|
|
|
|
The first place to look is the [IGWN dashboard](https://dashboard.igwn.org).
|
|
|
|
|
|
|
|
Summary checks run on each nds cluster; these are very useful for confirming that the right services are running and all filesystems are mounted.
|
|
|
|
|
|
|
|
* [nodes on nds-cit](https://dashboard.igwn.org/icingaweb2/icingadb/service?name=nodes&host.name=nds-cit)
|
|
|
|
* [nodes on nds-lho](https://dashboard.igwn.org/icingaweb2/icingadb/service?name=nodes&host.name=nds-lho)
|
|
|
|
* [nodes on nds-llo](https://dashboard.igwn.org/icingaweb2/icingadb/service?name=nodes&host.name=nds-llo)
|
|
|
|
|
|
|
|
### Sample output from the nodes check
|
|
|
|
|
|
|
|
As an example, the output from nds-llo looks like this (the order of checks within each server does not matter):
|
|
|
|
|
|
|
|
<pre>
|
|
|
|
OK
|
|
|
|
nds-1
|
|
|
|
OK daqd-stream-server is ok
|
|
|
|
OK /archive/frames has files
|
|
|
|
OK memcached is ok
|
|
|
|
OK /var/lib/diskcache has files
|
|
|
|
OK nds_chanlist_sync is ok
|
|
|
|
OK /ceph/scratch has files
|
|
|
|
OK /ceph/frames has files
|
|
|
|
OK nds-proxy is ok
|
|
|
|
OK nds-metadata-server is ok
|
|
|
|
|
|
|
|
nds-2
|
|
|
|
OK /ceph/frames has files
|
|
|
|
OK /archive/frames has files
|
|
|
|
OK memcached is ok
|
|
|
|
OK nds2-io-node is ok
|
|
|
|
OK daqd-stream-server is ok
|
|
|
|
OK /ceph/scratch has files
|
|
|
|
|
|
|
|
nds-3
|
|
|
|
OK /archive/frames has files
|
|
|
|
OK nds2-io-node is ok
|
|
|
|
OK daqd-stream-server is ok
|
|
|
|
OK memcached is ok
|
|
|
|
OK /ceph/scratch has files
|
|
|
|
OK /ceph/frames has files
|
|
|
|
|
|
|
|
ndso1i
|
|
|
|
OK /var/lib/diskcache has files
|
|
|
|
OK /archive/frames has files
|
|
|
|
OK /ceph/scratch has files
|
|
|
|
OK nds_chanlist_sync is ok
|
|
|
|
OK /ceph/frames has files
|
|
|
|
</pre>
|
|
|
|
|
|
|
|
If any of these checks are in an error state, that tells you which service or disk system to investigate.
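The filesystem portion of these checks boils down to "the mount point exists and is non-empty". A minimal sketch of that logic (paths taken from the sample output above; the real check plugin may differ):

```shell
# check_mount: the mount point must exist and contain at least one entry,
# mirroring the "has files" lines in the sample output above.
check_mount() {
    if [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]; then
        echo "OK $1 has files"
    else
        echo "CRITICAL $1 is missing or empty"
    fi
}

# Paths taken from the sample output; run on an nds node.
for d in /archive/frames /ceph/frames /ceph/scratch; do
    check_mount "$d"
done
```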
|
|
|
|
|
|
|
|
### Nodes job updates and period
|
|
|
|
|
|
|
|
The nodes job is run by cron, and its output is read periodically, so simply re-running the output plugin on the dashboard will not update anything. You must wait for cron to run the check (and then run the summary that combines the outputs of all the checks); only then can the dashboard check update and turn green.
|
|
|
|
|
|
|
|
## NDS2 clusters
|
|
|
|
|
|
|
|
There are no standalone nds2 servers anymore; every location (LHO, LLO, CIT) runs a cluster of two or more servers.
|
|
|
|
|
|
|
|
### NDS2 at the observatories
|
|
|
|
|
|
|
|
At LLO and LHO the nds2 system is a cluster of three machines:
|
|
|
|
|
|
|
|
* nds-1
|
|
|
|
* nds-2
|
|
|
|
* nds-3
|
|
|
|
|
|
|
|
plus the old nds system (still used for frame indexing).
|
|
|
|
|
|
|
|
#### nds-1
|
|
|
|
|
|
|
|
nds-1 runs:
|
|
|
|
|
|
|
|
* nds-proxy
|
|
|
|
* daqd-stream-server - receives live IFO data from CDS every 1/16 s and ingests the h(t) frames from the local DMT every 1 s
|
|
|
|
    * on nds-1 this is used only to obtain the channel list
|
|
|
|
* memcached - part of the distributed memory cache
|
|
|
|
* nds-metadata-server - this holds the channel database
|
|
|
|
* nds_chanlist_sync - this syncs indexed data from the old nds system
|
|
|
|
|
|
|
|
#### nds-2 and nds-3
|
|
|
|
|
|
|
|
nds-2 and nds-3 run:
|
|
|
|
|
|
|
|
* daqd-stream-server - receives live IFO data from CDS every 1/16 s and ingests the h(t) frames from the local DMT every 1 s
|
|
|
|
* nds2-io-node - the nds2 server
|
|
|
|
* memcached - the distributed memory cache
|
|
|
|
|
|
|
|
#### nds.dcs and ndso1i
|
|
|
|
|
|
|
|
At the sites, the old nds server is still running (nds.dcs at LHO, ndso1i at LLO). It exists strictly to run the frame indexing; an ad-hoc process syncs the channel lists to nds-1.
|
|
|
|
|
|
|
|
In addition, this machine must have the same frame directories and disk cache access as everything else.
|
|
|
|
|
|
|
|
### NDS2 at CIT
|
|
|
|
|
|
|
|
The NDS2 cluster at CIT is made up of two machines:
|
|
|
|
|
|
|
|
* nds
|
|
|
|
* nds-alt
|
|
|
|
|
|
|
|
#### nds.ligo
|
|
|
|
|
|
|
|
This runs:
|
|
|
|
|
|
|
|
* nds-proxy - proxies connections between nds and nds-alt
|
|
|
|
* nds2 - the monolithic nds2 server (cache, metadata, io in one process)
|
|
|
|
* nds_chanlist_sync_static - pushes channel lists to nds-alt to help keep them in sync
|
|
|
|
|
|
|
|
#### nds-alt.ligo
|
|
|
|
|
|
|
|
This runs:
|
|
|
|
|
|
|
|
* nds2 - the monolithic nds2 server
|
|
|
|
|
|
|
|
|
|
|
|
### Data flow
|
|
|
|
|
|
|
|
#### At the sites
|
|
|
|
|
|
|
|
Requests generally follow this sequence:
|
|
|
|
|
|
|
|
1. client talks to the proxy
|
|
|
|
2. proxy talks to a nds2-io-node instance (on nds-2 or nds-3)
|
|
|
|
3. nds2-io-node gets metadata from nds-metadata-server
|
|
|
|
4. nds2-io-node does frame lookup from diskcache
|
|
|
|
5. nds2-io-node reads the frames and returns data via the proxy
|
|
|
|
|
|
|
|
Live data requests follow a different flow:
|
|
|
|
|
|
|
|
1. client talks to the proxy
|
|
|
|
2. proxy talks to a nds2-io-node instance (on nds-2 or nds-3)
|
|
|
|
3. nds2-io-node reads data from the daqd-stream-server
|
|
|
|
4. nds2-io-node returns the data via the proxy
|
|
|
|
|
|
|
|
|
|
|
|
#### At CIT
|
|
|
|
|
|
|
|
1. client talks to the proxy
|
|
|
|
2. proxy talks to a nds2 server (on nds or nds-alt)
|
|
|
|
3. nds2 then reads the data from disk or shared memory and returns it via the proxy
|
|
|
|
|
|
|
|
### Troubleshooting
|
|
|
|
|
|
|
|
#### Check dashboard
|
|
|
|
|
|
|
|
Check the [IGWN dashboard](https://dashboard.igwn.org) first to see if there is a systemic problem.
|
|
|
|
|
|
|
|
If a service is down, try starting it. If there is a problem accessing disk, consult the local LDAS admins.
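For the service case, a quick way to check (and then restart) a unit is with systemctl. This is a sketch; the service names come from the sections above, and which names apply depends on the machine:

```shell
# svc_state reports whether a systemd unit is active.
svc_state() {
    if systemctl is-active --quiet "$1"; then
        echo "$1 is running"
    else
        echo "$1 is NOT running"
    fi
}

# Pick the names that apply to the machine: nds-1 also runs nds-proxy,
# nds-metadata-server and nds_chanlist_sync; nds-2/nds-3 run nds2-io-node.
for svc in daqd-stream-server memcached; do
    svc_state "$svc"
done
# To start a stopped service: systemctl start <service>
```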
|
|
|
|
|
|
|
|
#### Simple queries: is it a proxy issue, a node issue, or ...
|
|
|
|
|
|
|
|
Run a simple query a few times:
|
|
|
|
|
|
|
|
<pre>
|
|
|
|
nds_query -n <nds server> -k
|
|
|
|
</pre>
|
|
|
|
|
|
|
|
If it fails every time, it may be a proxy issue; restart the nds-proxy service on nds-1 (or nds.ligo).
|
|
|
|
|
|
|
|
If it fails some of the time, a node behind the proxy may be having issues; check the nds2-io-node processes on nds-2 and nds-3 (or the nds2 process on nds.ligo and nds-alt.ligo) and restart them if needed.
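To distinguish the two cases, the simple query can be run in a loop and the failures counted. A sketch (nds.example.org is a placeholder; substitute the proxy you are testing):

```shell
# Run the keep-alive query N times and count failures.
# nds.example.org is a placeholder hostname.
SERVER=${SERVER:-nds.example.org}
N=5
FAILS=0
i=1
while [ "$i" -le "$N" ]; do
    nds_query -n "$SERVER" -k >/dev/null 2>&1 || FAILS=$((FAILS + 1))
    i=$((i + 1))
done
echo "failures: $FAILS / $N"
# N/N failures points at the proxy; intermittent failures point at a
# backend node behind it.
```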
|
|
|
|
|
|
|
|
#### Simple live data check
|
|
|
|
|
|
|
|
You should be able to run a query such as this (change H1 to L1 for LLO):
|
|
|
|
|
|
|
|
<pre>
|
|
|
|
nds_query -n <nds server> -s 0 -e 3 -d 1 H1:GDS-CALIB_STRAIN
|
|
|
|
</pre>
|
|
|
|
|
|
|
|
This should read from online data.
|
|
|
|
|
|
|
|
If it fails sporadically, check the proxy.
|
|
|
|
|
|
|
|
Next, at the sites, check that the kafka low-latency h(t) transfer is stable. A script (/root/watchdir.py) will tell you if kafka is bunching up data: it runs and reports any unexpected jumps it sees in /dev/shm/.... If there are problems there, contact the low-latency admins to check on kafka.
|
|
|
|
|
|
|
|
If kafka is stable, try restarting daqd-stream-server:
|
|
|
|
|
|
|
|
<pre>
|
|
|
|
systemctl stop nds2-io-node
systemctl stop daqd-stream-server
sleep 1
systemctl start daqd-stream-server
systemctl start nds2-io-node
|
|
|
|
</pre>
|
|
|
|
|
|
|
|
#### Issues reading archived data
|
|
|
|
|
|
|
|
Check for directory/file share issues on the nds nodes. If /archive, /ceph (/hdfs at CIT), ... are not available, NDS2 cannot read from them; contact the local LDAS admin. If the disk systems are very slow, nds clients will hit timeouts as well.
|
|
|
|
|
|
|
|
Make sure the diskcache service is running (contact the local LDAS admin).
|
|
|
|
|
|
|
|
For LLO/LHO, make sure nds-metadata-server is running on nds-1.
|
|
|
|
|
|
|
|
Make sure the load isn't too high (e.g. no RAM available, ...).
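A quick spot check for load and memory pressure, using standard Linux tools:

```shell
# Load averages and uptime.
uptime
# Memory: watch the "available" column; near-zero means the node is
# starved of RAM.
free -h
# The biggest memory consumers (RSS in kB).
ps -eo pid,rss,comm --sort=-rss | head -n 5
```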