This analysis should be performed on the CIT cluster. The exact node shouldn't matter, but the testing was done on `ldas-pcdev6` and it seems like a good choice. Just remember to run all the analyses on the same node to avoid confusing asimov. You can log in with `ssh albert.einstein@ldas-pcdev6.ligo.caltech.edu`, replacing the name with your own credentials.
To grab reviewed and stable versions of common packages, clone the igwn conda environment with:
```
conda create --name mdr --clone igwn-py310
conda activate mdr
```
You can choose a different name, but I will keep using `mdr` in these instructions.
Next, we update the important packages:
`mamba update -c conda-forge bilby bilby_pipe`
For the final two packages it is important to get development versions, so we will clone them with git. I recommend keeping them in the same folder for convenience. In your home directory (you can move into it with `cd ~/`):
```
mkdir mdr_gwtc3
cd mdr_gwtc3
git clone git@git.ligo.org:lscsoft/bilby_tgr.git
git clone git@git.ligo.org:asimov/asimov.git
cd asimov
pip install .
cd ../bilby_tgr
git checkout mdr_review
pip install .
cd ..
```
This will get you the proper package versions.
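To sanity-check that the development branch really got checked out, you can read the branch name back with git. The sketch below demonstrates the command on a throwaway repo (the scratch repo and its branch are just for the demo); on the cluster you would instead run `git -C ~/mdr_gwtc3/bilby_tgr symbolic-ref --short HEAD` and expect `mdr_review`:

```shell
# Self-contained demo: create a scratch repo on a named branch and read
# the branch name back. On the cluster, point git at ~/mdr_gwtc3/bilby_tgr
# instead and expect "mdr_review".
REPO=$(mktemp -d)
git -C "$REPO" init -q -b mdr_review
git -C "$REPO" symbolic-ref --short HEAD
```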
Now, due to cluster changes there have been problems with some packages. The fixes had not been forwarded to the stable versions at the time of writing, so you may need to make 2 manual changes to the code:
```
vi ~/.conda/envs/mdr/lib/python3.10/site-packages/bilby_pipe/utils.py
:919 [enter]
```
This should move you to the line containing `run == "O3"`, which you should change to `run = "O3"`. (If you cannot find it, you can search by typing `/<search string> [enter]`.)
To edit with vim: press `i` to enter insert mode, make your changes, press `[esc]`, then save and exit with `:x` (or close without saving with `:q!`).
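If you prefer not to edit by hand, the same fix can be applied with a `sed` one-liner. The sketch below demonstrates it on a scratch file; on the cluster you would point `FILE` at `~/.conda/envs/mdr/lib/python3.10/site-packages/bilby_pipe/utils.py` (path assumes the env name `mdr`), and it's worth backing the file up first:

```shell
# Demo on a scratch copy; on the cluster, set FILE to the bilby_pipe
# utils.py path instead (and back it up first: cp "$FILE" "$FILE".bak).
FILE=$(mktemp)
echo '        run == "O3"' > "$FILE"
# Turn the comparison into an assignment - the same fix as the manual edit.
sed -i 's/run == "O3"/run = "O3"/' "$FILE"
grep 'run = "O3"' "$FILE"
```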
Before running the ~12 analyses that each of us will do, it would be good to verify that everything is running OK - it took me some time to get rid of all the bugs, so there is a possibility I forgot a step in the setup above.
The `apply_events.py` script applies your set of events to asimov. For the test, run it (`python apply_events.py`) with just one event in the set:
This should populate your project with the selected event. Now run
`asimov manage build` to build config files for your analysis.
**Important:** Before submitting asimov jobs, as below, you have to ensure you have the right credentials. To get them, run:
```
kinit
htgettoken -a vault.ligo.org -i igwn
```
and use your LIGO password. The credentials expire when you log out of the cluster, so you will have to regenerate them next time.
Now, you can submit your job with:
`asimov manage submit`
This first job will be very quick, as it just computes the PSD (it should be done in less than 15 min).
You can check the status of the job with `condor_q`:
1. If you see that the status of this job is idle or running, you have to wait some more.
2. If the status of the job is `held`, it probably needs more resources. Run `condor_q -hold` to see the reason behind the problem. The probable cause is not enough disk or memory. For an individual job, run `condor_qedit jobID RequestMemory 8000` or some other number bigger than the one that caused the problem (`RequestDisk` if the disk is the problem). If you have multiple problematic jobs, I suggest `condor_qedit -constraint 'JobStatus == 5' RequestMemory 8000` to modify all held jobs at once. You then have to release the jobs for them to start again with `condor_release -all`.
3. If the job is not appearing in the queue, then either it finished successfully or an error occurred. Run `asimov monitor` and asimov will check the job completion status and tell you of any errors. It updates information at most every 15 min; if you want to force an earlier update, you have to delete the cache file `.asimov/_cache_jobs.yaml`. If there is an error, you can find the error logs in `working/<eventname>/Prod0/logs/` with the `.err` suffix. Let me know if it happens, as it shouldn't.
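A quick way to pull up the most recent error log is to sort by modification time. The sketch below builds a scratch directory standing in for `working/<eventname>/Prod0/logs/` (the event name used is a placeholder); on the cluster you would run only the final `ls -t ... | head` line against your real logs directory:

```shell
# Scratch layout standing in for working/<eventname>/Prod0/logs/;
# "GW_example" is a placeholder event name for the demo.
LOGS=$(mktemp -d)/working/GW_example/Prod0/logs
mkdir -p "$LOGS"
touch -d '1 hour ago' "$LOGS/older.err"
touch "$LOGS/newest.err"
# Newest .err file first; pipe the result into `less` to read it.
ls -t "$LOGS"/*.err | head -n 1
```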
If `asimov monitor` told you that the jobs finished OK, you can now submit the analysis proper. Run (remember to regenerate your credentials if you haven't done so this session):
`asimov manage build`
`asimov manage submit`
It should inform you that it has submitted 10 proper analyses, which will take around a day to finish. If after ~30 min you don't see active jobs with `condor_q`, some error occurred - let me know (logs are in `working/<eventname>/<productiontype>/log_data_generation/`).
The next day, check the progress with `condor_q` (to check for possible hold reasons) and with `asimov monitor` (this command only works inside your `project` directory). If the analyses are finished, the postprocessing (creation of the webpages) will start. It should be done in a few minutes, and you will have to run `asimov monitor` again to catch it. You can check the webpages at https://ldas-jobs.ligo.caltech.edu/~<albert.einstein>/mdr-gwtc3/. You might need to run `asimov report html` for the updates to catch up.
If the webpages for all analyses are complete, then everything works OK and you can run the full analysis.
## Running full analysis
Essentially you follow the steps above, but now populate multiple events:
Again, ensure you have credentials, then run:
`asimov manage build`
`asimov manage submit`
to compute the PSDs. Then, after some time:
```
asimov monitor
asimov manage build
asimov manage submit
```
to run the analyses proper. You might need to run these multiple times if not all PSD calculations had finished the first time.
After all analyses are launched (i.e. `asimov manage submit` no longer launches anything new), you can switch to running these just once in a while (every day or few):
`condor_q -hold` to monitor if something is held
`asimov monitor` - to check for completion
`asimov report html` - to update webpages when you want
until everything finishes.
If `asimov monitor` tells you there are any problems, let me know and we can try to check what is wrong.
Technically, everything after the `python apply_events.py` step can be replaced with a single `asimov start`, which performs the steps above every 15 min, but condor stops it after ~1 day, so at this point I think just running `asimov monitor` every once in a while is better.