Modify DB backup on segments.ligo.org to run entirely on segments.ligo.org
Try this:
Modify the DB backup procedure on segments.ligo.org ('seg') to run entirely on seg up through tarballing the backup, then move that tarball to the /backup/ dir.
Background:
The DB backup on seg currently runs by having user ldbd run /usr1/ldbd/bin/run_backup_dqsegdb.sh at 00:00 PT, which runs /usr1/ldbd/bin/backup_dqsegdb_mysqldatabase_newdir.sh. That script dumps the current DB to a temp dir (/backup/segdb/segments/tmp/mysql_dump_seg_$(date +%Y.%m.%d-%H.%M.%S)), then tarballs it, writing the output to the final backup dir (/backup/segdb/segments/primary/${date_string}.tar.gz), and deletes the dumped DB afterward. I think this was done because seg didn't use to have enough disk space to hold the dumped DB, which currently (early Oct. 2023) runs ~31 GB, with the tarballed backup currently ~3.1 GB (1/10 the untarballed size). It might also have been done this way to improve performance back when we had spinning-platter disks, when reading from and writing to the same disk took much more time than it now does with solid-state disks and put considerably more stress on the disk, so writing to a different disk was advantageous.

Either way, that system worked fine, but sometimes the backup takes longer than the planned time (originally 30 min, now 45 min), so the DB restore on segments-backup has to wait for the backup and tarballing to finish. The restore then runs by untarballing the backup to a temp dir on /backup/ and restoring from that dumped DB (then deleting the untarballed DB). If that takes too long, either because the start was delayed by too much or because the DB restore itself took too long (or both), the regression tests (which start at 03:30 PT, regardless of the DB restore's state) fail, because the DB changes during the tests' run.

Lately (starting in mid/late Sept. 2023), the DB backups have been taking noticeably longer than usual, as the output in /backup/segdb/segments/monitor/backup_dqsegdb.txt shows: backups used to take < 45 min (often < 40 min), but from 2023.09.22 through 2023.10.06 (today), none has been under 45 min, and 6 of the last 8 backups took > 1 hour. DB restores (times tracked in /backup/segdb/segments/monitor/populate_from_backup.txt) have gone from usually finishing before 03:30 (though often close to it) to much later: no DB restore has finished before 03:30 since 2023.09.22, and the last 8 restores finished after 04:00.

Note that it is not just a matter of the extra backup time pushing the restore to finish later by the same amount. Backups finishing before 00:45 would typically result in restores finishing the same number of minutes before 03:30, but recently the restores have been delayed by more than the backup's overrun past 00:45, e.g., a backup finishing at 01:10 (25 min past 00:45) with the restore finishing at 04:13 (43 min past 03:30). I would ask the admins to look into this performance issue, but it has been an ongoing, intermittent problem for a long time, so if we can fix this on our end, that would be much preferable.
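For reference, here is a minimal sketch of the current backup flow as described above; the actual script contents, mysqldump options, and dump file name are assumptions, not copied from backup_dqsegdb_mysqldatabase_newdir.sh:

    # Sketch only: dump the DB to a temp dir on /backup/, tarball it,
    # then delete the dumped DB (sizes as of early Oct. 2023).
    date_string=$(date +%Y.%m.%d-%H.%M.%S)
    dump_dir=/backup/segdb/segments/tmp/mysql_dump_seg_${date_string}
    mkdir -p "${dump_dir}"
    mysqldump --all-databases > "${dump_dir}/dump.sql"      # ~31 GB dumped
    tar -czf "/backup/segdb/segments/primary/${date_string}.tar.gz" \
        -C /backup/segdb/segments/tmp "mysql_dump_seg_${date_string}"  # ~3.1 GB
    rm -rf "${dump_dir}"                                    # delete the dumped DB
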
seg itself does not have enough space on the / partition (which includes the /tmp dir) to hold the dumped DB (25 GB free), but /dqxml/ does, as long as old DQXML dirs have been cleared out (57 GB free today, with ~1 GB of new data per calendar day; additional tens of GB could be cleared out with some effort). segments-backup ('-backup') has 46 GB free in / and 75 GB free in /dqxml/.
Plan:
- modify the backup script to save the start time, end time, and duration in the log file (/backup/segdb/segments/monitor/backup_dqsegdb.txt), for convenience (though the backup always starts at an expected time); see the logging sketch after this list
- modify the restore script to save the start time, end time, and duration of the actual restore process (not any delay spent waiting for the DB backup to finish) in the log file (/backup/segdb/segments/monitor/populate_from_backup.txt), since the DB restore's start is sometimes delayed by waiting for the DB backup to finish
- on seg, run a test DB dump to /dqxml/, to compare how long it takes with how long a recent run dumping to the temp dir on /backup/ took; see the test-timing sketch after this list
- on seg, run a test tarball of the dumped DB to /dqxml/, to compare how long it takes with how long a recent run tarballing the dumped DB on /backup/ took
- if the above tests indicate that the results are comparable to or better than using /backup/ (even just 'the same' is OK, if the times are consistent), modify the DB backup script to check the size of the most recent tarballed DB (the newest /backup/segdb/segments/primary/${date_string}.tar.gz); check that / has more than 15x that amount of free space (1x for the tarball + 10x for the untarballed DB + a margin); if so, use / as the working dir; if not, do the same check for /dqxml/, and if it clears, use /dqxml/ as the working dir; if not, continue to use the same dir on /backup/ (no change to the run) [this ticket]; see the working-dir sketch after this list
- on -backup, run a test that copies the most recent tarballed DB to /tmp and then untarballs it, to compare how long it takes with how long the untarballing took in a recent run on /backup/
- on -backup, run a temporarily modified version of the DB restore script that restores from the untarballed DB in /tmp, to compare how long it takes with how long the restore took in a recent run from /backup/
- if the above tests indicate that the results are comparable to or better than using /backup/ (even just 'the same' is OK, if the times are consistent), modify the DB restore script to check the size of the tarballed DB to be restored and apply the same working-dir check as above (more than 15x that amount of free space on /, then /dqxml/, else continue to use the same dir on /backup/, with no change to the run) [this will be its own ticket]
- on the replacement (virtualized) machines for seg, -backup, and -dev, have a separate partition created just for the DB backup and/or restore; it probably needs to be big enough for at least 2 dumped DBs, in case one run gets stuck
- see if we can move these tasks from running under user ldbd (with dedicated accounts on the different machines) to running under the shared account segdb, using /home/segdb/ for log files
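
Logging sketch: a minimal way to record start time, end time, and duration, assuming the log files are free-form text (the actual format of backup_dqsegdb.txt / populate_from_backup.txt is not shown here, so the line layout is an assumption):

    log=/backup/segdb/segments/monitor/backup_dqsegdb.txt
    start_epoch=$(date +%s)
    start_stamp=$(date +%Y.%m.%d-%H.%M.%S)
    # ... dump and tarball steps go here ...
    end_epoch=$(date +%s)
    duration=$(( end_epoch - start_epoch ))
    printf 'start=%s end=%s duration=%02d:%02d\n' \
        "${start_stamp}" "$(date +%Y.%m.%d-%H.%M.%S)" \
        $(( duration / 60 )) $(( duration % 60 )) >> "${log}"

The same pattern fits the restore script: take the timestamps around the untarball-plus-restore steps only, so any time spent waiting for the backup to finish is excluded.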
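
Test-timing sketch: a hypothetical one-off comparison on seg; the mysqldump options and file names are assumptions, and the 'real' times should be compared against a recent run on /backup/ from the logs:

    test_dir=/dqxml/mysql_dump_test_$(date +%Y.%m.%d-%H.%M.%S)
    mkdir -p "${test_dir}"
    time mysqldump --all-databases > "${test_dir}/dump.sql"   # test dump
    time tar -czf /dqxml/mysql_dump_test.tar.gz \
        -C /dqxml "$(basename "${test_dir}")"                 # test tarball
    rm -rf "${test_dir}" /dqxml/mysql_dump_test.tar.gz        # clean up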
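
Working-dir sketch: one way the 15x free-space check could look in both scripts, assuming GNU stat/df are available; the function name and fallback path are assumptions:

    choose_work_dir () {
        # $1 is the tarball whose size drives the space estimate
        # (the most recent backup tarball, or the one to be restored)
        local tarball=$1 need avail candidate
        # 15x = 1x for the tarball + 10x for the untarballed DB + a margin
        need=$(( $(stat -c %s "${tarball}") * 15 ))   # bytes
        for candidate in / /dqxml; do
            avail=$(df -B1 --output=avail "${candidate}" | tail -n 1)
            if [ "${avail}" -gt "${need}" ]; then
                echo "${candidate}"
                return 0
            fi
        done
        echo /backup/segdb/segments/tmp   # fall back: no change to the run
    }
    work_dir=$(choose_work_dir "${latest_tarball}")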