Plan for archiving MDC data at CIT
Context:
Before the MDCs started, the policy on gracedb-playground was to remove events and their associated data after 21 days. After some pushback from the low-latency chairs, that operation was suspended, and the whole of the MDC remains archived on gracedb-playground, in the cloud. The costs of storage in Amazon EFS notwithstanding, this has been a useful exercise from a GraceDB development and optimization standpoint: having multiple users and pipelines interact with a database that is stuffed with test events (there are approximately 3x more events and superevents in gracedb-playground than in the production system) has been invaluable for identifying and fixing some fundamental low-level performance bottlenecks (see: #249, !95 (merged), !96 (merged)).
However, in the past two weeks, I have received three private communications over email and Mattermost (@roberto.depietri, @shaon.ghosh, @geoffrey.mo, @gaurav.waratkar) regarding bulk transfers of MDC data from AWS to CIT. In debugging and optimizing low-latency operations over the past months, I have observed other periods of increased download and query activity as well, where users are moving large numbers of files (O(1,000)-O(10,000)) from AWS to various user accounts and headnodes at CIT. These periods of activity correlate with the beginning of new MDC rounds, which I suspect is because users are analyzing data from the previous round.
There hasn't been a clear definition of what constitutes "fair use" of resources; GraceDB is sort of just there for the collaboration to use, so no individual user is at "fault" in this situation. That being said, these ad hoc data transfers do affect the performance of low-latency operations, and they result in redundant storage and network traffic at CIT.
Action Required:
I am requesting that the low-latency chairs who initially requested that MDC data be retained (again, a worthwhile effort) coordinate with the admins at CIT for a permanent and organized transfer and archive of MDC data from AWS to CIT. This would involve (and I'm thinking off the top of my head):
- deciding on a namespace in which to store the data (other than random users' home directories)
- deciding on a system and folder hierarchy (GraceDB uses its own system, which is opaque to someone not using the database)
- communicating to users in the various working groups that the MDC data is locally available on the LDG, to use instead of making tens of thousands of requests over the internet
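To make the hierarchy question concrete, here is a minimal sketch of one possible layout, offered only as a starting point for discussion. It groups each event or superevent directory under the alphabetic prefix of its ID. The archive root `/archive/mdc`, the function name `archive_path`, and the assumption that IDs consist of letters followed by a digit-led suffix are all hypothetical and not taken from GraceDB itself.

```python
import re
from pathlib import PurePosixPath

# Hypothetical archive root; the real namespace is still to be decided.
ARCHIVE_ROOT = PurePosixPath("/archive/mdc")

# Assumed ID shape: one or more letters, then a suffix that starts with a digit.
_ID_PATTERN = re.compile(r"^([A-Za-z]+)(\d\w*)$")


def archive_path(graceid, root=ARCHIVE_ROOT):
    """Return the proposed archive directory for a single event ID.

    Layout (illustrative only): <root>/<ID prefix>/<full ID>
    """
    match = _ID_PATTERN.match(graceid)
    if match is None:
        raise ValueError(f"unrecognized ID format: {graceid!r}")
    prefix = match.group(1).upper()
    return root / prefix / graceid


if __name__ == "__main__":
    # Example IDs are made up for illustration.
    for example_id in ("M123456", "MS230518h"):
        print(archive_path(example_id))
```

Whatever layout is chosen, keeping the mapping in one small function like this would let users locate an event's files without querying the database.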
When it comes time to do the actual transfer, I can coordinate with the CIT admins to open up a security group to directly mount the EFS partition at CIT for a bulk rsync, if need be. There might be a better idea, I dunno.
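For the bulk copy itself, a rough sketch of how the rsync invocation could be assembled is below. It assumes the EFS partition is already mounted at CIT; the mount point `/mnt/efs/mdc`, the destination `/archive/mdc`, and the helper name `build_rsync_command` are placeholders. It defaults to a dry run and only prints the command, so nothing is transferred.

```python
import shlex


def build_rsync_command(source, destination, dry_run=True):
    """Build an rsync argument list for copying one directory tree.

    --archive preserves permissions, timestamps, and symlinks;
    --partial keeps partially transferred files so a rerun can resume them.
    """
    command = ["rsync", "--archive", "--partial"]
    if dry_run:
        # Report what would be copied without writing anything.
        command.append("--dry-run")
    # A trailing slash on the source copies its contents, not the directory itself.
    command.append(source.rstrip("/") + "/")
    command.append(destination.rstrip("/") + "/")
    return command


if __name__ == "__main__":
    # Both paths are hypothetical placeholders.
    print(shlex.join(build_rsync_command("/mnt/efs/mdc", "/archive/mdc")))
```

Running the dry run first would give the CIT admins a file count and size estimate before the security group is opened for the real transfer.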
@roberto.depietri, @shaon.ghosh: as we move into O4 low-latency operations, please coordinate with @stuart.anderson and @philippe.grassia to get MDC data out of the cloud and onto an LDG resource. If anyone tagged on this ticket has other proposals, please chime in.