Update 100323 authored by Shio Sakon
......@@ -15,35 +15,125 @@ Shio
Chair / Minute Taker / Focus Session [Rota](https://git.ligo.org/groups/gstlal/-/wikis/West-call/Rota-table)
## Action Items for next week
* [ ] Pratyusava, Cort, Divya: Discuss Condor upgrades at CIT
* [ ] Debnandini: communicate that we’re doing an IMBH search to the CBC chairs
(**Leave action items here for next week.**)
## Last week's call
* [ ] Rachael: Discuss Condor upgrades at CIT
- Rachael: I think it was Becca/Shomik who agreed to implement the changes.
- Becca: I remember talking at the F2F about making changes to get rid of getenv=True. I have not worked on it and I don’t have time. I can talk to someone else.
- Divya: I can tell exactly what to do.
- Leo: There was this action item. Make patch for master to remove getenv=True from subs (Divya, Becca)
- Becca: as of now we’re not letting them update condor, as a workaround.
- Cort: condor is updated but with a hack.
- Rachael: the hack is not preferred
- Prathamesh: keep it as an action item.
- Divya: can someone volunteer for this?
- Chad: can I nominate someone? This would be close to what Cort is doing? Amanda or Zach?
- Leo: And we need to specify all env variables in sub files instead?
- Pratyusava: I can work on it.
- Chad: this will be necessary for running offline inj runs
- Prathamesh: change the person in charge to Pratyusava (lead) and Cort, with Divya helping.
- Cort: I have a lot on my plate right now and won’t be able to do it urgently.
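
As a rough illustration of the change discussed above (dropping `getenv = True` and listing the needed variables explicitly in the submit files), here is a minimal sketch. It assumes the submit file is plain text with `getenv = True` on its own line, and that the variable values contain no whitespace or quotes. The function name, variable names and paths are hypothetical, and this is not the actual gstlal patch.

```python
import re

# Matches a line that is exactly "getenv = True" (case-insensitive, flexible spacing).
_GETENV_LINE = re.compile(r"^[ \t]*getenv[ \t]*=[ \t]*true[ \t]*$",
                          re.IGNORECASE | re.MULTILINE)


def replace_getenv(submit_text, env):
    """Swap `getenv = True` for an explicit `environment = "..."` line.

    `env` maps variable names to values. Values with whitespace or quotes
    need extra quoting in HTCondor, so this sketch simply rejects them.
    """
    for name, value in env.items():
        if re.search(r"[\s'\"]", value):
            raise ValueError(f"value of {name} needs quoting; not handled here")
    pairs = " ".join(f"{name}={value}" for name, value in sorted(env.items()))
    env_line = f'environment = "{pairs}"'
    # A function replacement avoids backslash interpretation in the new line.
    return _GETENV_LINE.sub(lambda match: env_line, submit_text)


if __name__ == "__main__":
    sub = "executable = /bin/true\ngetenv = True\nqueue 1\n"
    print(replace_getenv(sub, {"EXAMPLE_VAR_B": "2", "EXAMPLE_VAR_A": "1"}))
```

Listing the variables explicitly makes each job's environment reproducible, at the cost of maintaining the list when a job needs a new variable.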
## Agenda / Minutes
* Announcements (5 minutes)
- Please check the rota for next week's call
- Confirmation of next week's focus session
- We will be doing review, but please add topics for future focus sessions.
* Last week's East call
* Quick updates (45 minutes)
- Operations (5 minutes)
- LL CBC operations
- Shomik: zombie jobs on Charlie and Edward, notified admins. Edward: kill jobs and restart. Charlie: problematic node hasn’t produced data since Sep 27. It was running fine before taking it down for maintenance.
- Zach: something was going on with the nodes. Stuart just replied.
- Shomik: Stuart: - There were several condor_starter processes hung on node1679 trying to start jobs, which I believe was due to a hung NFS mount for /ldcg that I could not clear with “umount -f”, so I have rebooted that node and expect that it should be working now. However, please let us know if you still see any problems.
- Becca: last night there were issues. condor_q and dagman.out were not agreeing. There are issues with nodes but also our pipeline has error handling issues.
- Chad: zombie jobs have been an issue. How are we not capturing and monitoring these zombie jobs?
- Becca: the ICINGA alerts caught them. But Charlie didn’t have all the ICINGA alerts set up yet.
- Prathamesh: rare, but some errors slip through ICINGA alerts.
- Rachael: @chad that got rectified btw. Ron and I set it up this morning.
- Divya: thanks Rachael and Ron
- Becca: there is a panel in the dashboard. It’s the max time since the last metric, which is aggregated.
- Ron: is this what you’re talking about Chad? Or are we talking about CIT? https://ldas-jobs.gwave.ics.psu.edu/grafana/d/MOIkdd67z/frame-lag?orgId=1
- Divya: jobs were running for Charlie
- Chad: is there something in influx that could have caught this?
- Divya: on the dashboard, if you query individual jobs you’d see that data wasn’t produced. But we wouldn’t have been able to notice without ICINGA producing alerts.
- Chad: ICINGA should show red
- Rachael: Also, if you look at the Charlie dashboard right now, that metric says like 5,000 seconds since last metric. But is that just because the detectors are down?
- Rachael: influx alerts should spit out job ID
- Becca: someone should volunteer for that. Someone who is interested in the monitoring side of things.
- Surabhi: I can voluntell Urja (she is not here today)
- Prathamesh: Becca will provide guidance.
- Rachael: added an issue here for this: https://git.ligo.org/gstlal/projects/-/issues/794
- Prathamesh: we should have a knowledge transfer discussion. Let’s talk about this in sprint.
- Prathamesh: have node issues (nodes not having frames) died out? are we still seeing that?
- Becca: No
- LL IDQ operations
- Rachael: iDQ is super stable other than the cluster catching fire last week
- O4 Dev (30 minutes)
- Low latency integrated testing and Monitoring
- Becca: I’m meeting with Stuart, Ron, Phillip etc. to get things moving on burning v-stat on Kubernetes
- Chad: get new generation people working on MDC for O4b code preparedness. That team will be in charge of the O4b online production task. That team can also take on monitoring issue tasks. Put that on the action items for this call. There will be a paper with this. Document on gwsci. Shomik, Zach, Shio, Amanda, Pratyusava, etc. Team should meet.
- Leo: I can at least lead the feature dev part of the team. + ops
- Prathamesh: is this team responsible for testing O4b wishlist?
- Chad: they should be the ones who do the integrated testing on MDC branch for an extended period of time. The container should be ready for March 2024, reviewed.
- Leo: (Am I too old to sign up…?)
- Becca: I *think* the point is to transfer knowledge to next generation of students so… yes?
- Prathamesh: issue is here: https://git.ligo.org/gstlal/projects/-/issues/795
- Template bank
- Debnandini: I cannot be on the gstlal dev call today, so here is my update: I am currently finishing a run using the latest build (including latest noise model changes), IMBH-specific dtdphi and more svd bins, to improve background from the previous run (based on the discussion at the F2F). I am also finishing some flopulator tests to implement the mu based sorting. I will post results and summary pages as these complete.
- Leo: I suggested to Debnandini to run floppulators for IMBH bank.
- Surabhi: mu1 mu2 might be more relevant for inspiral part of the waveform so it might be worth exploring m_total sorting for IMBH.
- Leo: Yeah I agree with Surabhi
- Chad: I agree with Surabhi. Computational cost isn’t a worry for this search so we should use what makes the search sensitive. We don’t want to spend too much time on tuning. Get IMBH inj from R&P, run those inj and zerolag for the proposed IMBH search, get VT, run the same thing with the all sky bank. This search will not have a dedicated paper, so it needs to come out at the same time as the catalog paper. We need to have started DAGs. Latest by the end of the month.
- Surabhi: is there a dedicated call for the IMBH dev work?
- Chad: I don’t think so. Just this one? Shio and Debnandini have met a few times as one-offs?
- Chad: manifold over-covers the IMBH space. Prathamesh has interesting results from MASS in terms of glitchiness. Bandwidth and chi_eff? If someone wants to try grouping for glitchiness, they can look into this.
- Jolien: is bank chisq part of the IMBH search?
- Chad: I wish. It’s reviewed, right?
- Prathamesh: yes.
- Chad: it’ll be great if this is the first search to use this.
- Surabhi: wouldn’t bandwidth be related to total mass?
- Chad: not as much as you think?
- Divya: About IMBHs, are people particularly interested in the event from Oct 1 which was q>10? Any news on that event?
- Prathamesh: I think it's in the IMBH list of candidates, right? The one that the IMBH science case team is in charge of.
- Likelihood ratio, background and foreground sampling
- Leo: I talked to Kipp about the background collection issue. https://git.ligo.org/lscsoft/gstlal/-/merge_requests/540 Pinged Ryan
- Surabhi: we’d want to start MDC with inj and we need to get it reviewed.
- Leo: this will be an emergency patch for EW
- Chad: O4a might end in January. We would want to run two analyses (the currently running one and test), compare, and switch after confirming it’s working
- Surabhi: we can run two analyses in parallel, one for live data, the other for MDC.
- Becca: the `--min-instruments` and `--min-instruments-candidates` options in Leo’s code can be confusing for future users.
- Leo: I could have added more description
- Prathamesh: does setting `--min-instruments` = 1 and `--min-instruments-candidates` = 2 throw out single time candidates?
- Leo: yes. Patch needs to be tested.
- Injection file format
- Victoria: I’m meeting with Jolien and Kipp tomorrow at the east call. Chad, please join if able.
- Chad: I will be there
- offline DAG
- Prathamesh: Merging with online branch is done, there is only 1 offline branch now.
- Leo: Action items from F2F: - Having config validation in CI for offline repo, - Put some warning msg for total number of samples in calc_rank_pdf as part of validation (Leo). The first one was mentioned during the F2F.
- Rachael: someone keeping an eye on R&P inj?
- Divya: Cody pinged
- Prathamesh: Amanda, Pratyusava etc. are keeping an eye on that?
- Amanda: yes
- Divya: https://git.ligo.org/reed.essick/rpo4-injections/#offline-injections
- Leo: I will take a look at his patch
- DQ dev
- HM search
- Exploratory development (5 minutes)
- Misc projects
- Leo: we should form a small team to wrap this up: https://docs.google.com/document/d/19iCH4dMObYB9O60v33vLoCTo13OFaxu1PiKIlv1v_4A/edit#heading=h.xly7p14o7llv Let us know if you want to work on this.
- Prathamesh: update on SNR optimizer. Becca added Kafka stuff so that SNR optimizer can pick up Kafka queue. Started testing on Jacob MDC. First online test that we’re doing. Zach is working on making it more efficient. Jim will be working on review next Thursday.
- Paper Updates
- Prathamesh: count tracker paper accepted
* Focus session
- Review call
- continue with offline workflow. Recorded.
* AOB
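
The monitoring request under LL CBC operations (an alert on "time since last metric" that also reports which job went quiet) could be sketched roughly as below. This is plain Python over an in-memory mapping, not an ICINGA or InfluxDB configuration. The job IDs, the threshold and the function name are hypothetical, and fetching the real timestamps is left out.

```python
def stale_jobs(last_metric_time, now, max_age_s):
    """Return the sorted IDs of jobs whose newest metric is older than max_age_s.

    last_metric_time: dict mapping job ID -> timestamp (seconds) of its latest metric
    now:              current time in the same units
    max_age_s:        allowed silence before a job counts as stale
    """
    return sorted(job for job, stamp in last_metric_time.items()
                  if now - stamp > max_age_s)


if __name__ == "__main__":
    # Hypothetical values: job_0001 has been silent for 5000 s, job_0002 for 100 s.
    last_seen = {"job_0001": 1000.0, "job_0002": 5900.0}
    for job in stale_jobs(last_seen, now=6000.0, max_age_s=300.0):
        print(f"ALERT: no metrics from {job} in over 300 s")
```

Checking each job separately, rather than only the aggregated maximum, lets the alert name the offending job. As raised in the discussion, a threshold like this would also fire when the detectors are down, so that case needs separate handling.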
## Chat log