Create a status webpage to monitor runs and display errors
Currently, our monitoring process involves a lot of manual checking, and relies on knowing what has gone wrong in the past and what a working run looks like. While we can't automate the entire process, we can at least automate some of the basics, and leave room for improvements.
Here's how I'd implement this:
- Anywhere it's straightforward, have each job write a JSON file with a status report at each step of the process, in a location that is accessible from the summary page (so probably in the webdir). For instance, if a job is running ILE iteration `N`, with job ID `X`, then at the very beginning of the job's lifetime, it can write a file called `${WEB_DIR}/status/ILE-start-${N}-${X}.json`, containing any relevant information (e.g., the grid point's physical parameters). And when the job finishes, it writes a file called `${WEB_DIR}/status/ILE-end-${N}-${X}.json`, containing the status (`SUCCESS`, `ERROR`, and possibly `WARN`), and additional metadata, either relating to the reason it failed, or information about how it succeeded (e.g., how many samples did it generate, what was the max lnL). Basically any piece of information that might be helpful in diagnosis can go here.
- In the summary page itself, we have some JavaScript that fetches all of the JSON files, and shows the current status in a user friendly way. For instance, a table like:

  | Iteration | Started | Running | Succeeded | Warned | Errored |
  |-----------|---------|---------|-----------|--------|---------|
  | 0         | 100     | 20      | 60        | 10     | 10      |
  | 1         | N/A     | N/A     | N/A       | N/A    | N/A     |

  and additionally a table showing a list of unique errors, if we're able to group them up nicely.
- In addition, we can write some helpful tools that need to be run from the terminal on the machine in question. For instance, we've been checking manually for some signifiers of problems, but a utility that just `grep`s for them given the directory name would make that easier for a non-expert.
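
To make the proposal a bit more concrete, here's a rough Python sketch of the three pieces: writing the status files, rolling them up into the per-iteration counts from the table, and a `grep`-like scan of a run directory. Everything here is an assumption for illustration, not existing code: the function names (`write_status`, `summarize`, `scan_for_signatures`), the `"status"` key in the end-of-job JSON, integer iteration numbers and job IDs, and the `*.out` log glob are all hypothetical. The roll-up is done in Python only to show the counting logic; the summary page itself would do the equivalent in JavaScript.

```python
import json
import re
from collections import Counter, defaultdict
from pathlib import Path

# Matches the proposed names: ILE-start-<N>-<X>.json and ILE-end-<N>-<X>.json.
# Assumption: iteration numbers and job IDs are non-negative integers.
STATUS_NAME = re.compile(r"^ILE-(start|end)-(\d+)-(\d+)\.json$")


def write_status(web_dir, phase, iteration, job_id, payload):
    """Write one status report; `phase` is "start" or "end"."""
    status_dir = Path(web_dir) / "status"
    status_dir.mkdir(parents=True, exist_ok=True)
    target = status_dir / f"ILE-{phase}-{iteration}-{job_id}.json"
    # Write under a temporary name, then rename, so a reader never
    # picks up a half-written file.
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_text(json.dumps(payload))
    tmp.replace(target)
    return target


def summarize(web_dir):
    """Return {iteration: counts} with the columns of the summary table."""
    started = defaultdict(set)   # iteration -> job IDs with a start file
    ended = defaultdict(dict)    # iteration -> {job ID: reported status}
    for path in (Path(web_dir) / "status").glob("ILE-*.json"):
        match = STATUS_NAME.match(path.name)
        if match is None:
            continue
        phase = match.group(1)
        iteration, job_id = int(match.group(2)), int(match.group(3))
        if phase == "start":
            started[iteration].add(job_id)
        else:
            report = json.loads(path.read_text())
            # Assumption: an end file without a "status" key counts as an error.
            ended[iteration][job_id] = report.get("status", "ERROR")

    table = {}
    for iteration in sorted(set(started) | set(ended)):
        finished = ended[iteration]
        jobs = started[iteration] | set(finished)
        counts = Counter(finished.values())
        table[iteration] = {
            "started": len(jobs),
            "running": len(jobs - set(finished)),  # started, no end file yet
            "succeeded": counts["SUCCESS"],
            "warned": counts["WARN"],
            "errored": counts["ERROR"],
        }
    return table


def scan_for_signatures(run_dir, patterns, glob="*.out"):
    """Grep-like helper: return (path, line number, line) for matching lines."""
    compiled = [re.compile(pattern) for pattern in patterns]
    hits = []
    for path in sorted(Path(run_dir).rglob(glob)):
        if not path.is_file():
            continue
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if any(rx.search(line) for rx in compiled):
                    hits.append((path, number, line.rstrip("\n")))
    return hits
```

A job would call `write_status(web_dir, "start", N, X, {...})` when it begins and `write_status(web_dir, "end", N, X, {"status": "SUCCESS", ...})` when it finishes. A job counts as "running" if it has a start file but no end file, which is why no separate "running" report is needed. The list of problem signatures passed to `scan_for_signatures` is left open here, since it depends on what we've actually been checking for by hand.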