Implement hot/popular file replication on the pools #7675
Labels: enhancement (a request that enhances existing behaviour)
Comments
@DmitryLitvintsev Please explain, "MAPE?"
greenc-FNAL added a commit that referenced this issue on Dec 20, 2024:
**Motivation:**

Allow modified replication behavior to support hotspot mitigation. Specifically we need replication jobs to complete without error when the number of replicas is maximized with respect to the available pools, even if more replicas were requested.

**Modification:**

* In `Job.java`: `Error` -> `Note`. The names of all related methods are changed accordingly (e.g. `addError()` -> `addNote()`).
* `MigrationModule` gets new option `-target-deficit-mode=(wait|limit)` (default `wait`) determining behavior when requested replicas exceeds target pool availability:
  * `wait`: The job will wait for new pools to become available (e.g. added to specified pool group). This is the default.
  * `limit`: If more replicas are requested than exist available pools, the job will complete successfully (with a note) when the number of replicas is maximized for the pools currently available.
* In interface `TaskCompletionHandler`: new method `taskCompletedWithNote(Task, String)` with default implementation forwarding to `taskCompleted(Task)`.
* New (manually-generated) files `Task_sm.dot`, and `Task_sm.dot.png` showing the state machine model encoded by `Task.sm`.
* New method in `Task.java`: `moreReplicasPossible()` for use in state transition guards.
* New finite state machine (FSM) action `notifyCompletedWithInsufficientReplicas()` in `Task.java`.
* In `Task.sm`, `Done` entry action `notifyCompleted()` removed in favor of per-transition actions `notifyCompleted()`, or `notifyCompletedWithInsufficientReplicas()` as appropriate. It was determined by analysis that moving the post-state-change entry action to a pre-state-change transition did not disrupt the intended semantics of the state machine.
* New guarded transitions to `Done` when more replicas are requested than are possible with the target pools available currently.
* `Task_sm.dot`, and `Task_sm.dot.png` updated manually to reflect the changes to the state machine.

**Result:**

* Partially implements #7675
* Replication jobs can be issued with `-target-deficit-mode=limit` to terminate successfully when all pools available currently have been populated with replicas, but more replicas were requested.

**Target:** master
**Request:** -
**Patch:** https://rb.dcache.org/r/14362/
**Closes:**
**Requires-notes:** yes
**Requires-book:**
**Acked-by:** Tigran Mkrtchyan, Dmitry Litvintsev
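The difference between the two `-target-deficit-mode` values described in the commit message can be illustrated with a small decision function. This is only a sketch of the described behavior, not the code from the patch: the class `TargetDeficitSketch`, its enums, and the method `next` are hypothetical names introduced here, and the real logic lives in the `Task.sm` state machine and its guards.

```java
// Hypothetical illustration only; none of these names exist in dCache.
final class TargetDeficitSketch {
    enum Mode { WAIT, LIMIT }
    enum Outcome { COPY_MORE, DONE, DONE_WITH_NOTE, WAIT_FOR_POOLS }

    /**
     * Decide what a replication task does next.
     *
     * @param requested        replicas requested via -replicas
     * @param existing         replicas that already exist
     * @param poolsStillUsable target pools that do not yet hold a replica
     */
    static Outcome next(Mode mode, int requested, int existing, int poolsStillUsable) {
        if (existing >= requested) {
            return Outcome.DONE;              // request fully satisfied
        }
        if (poolsStillUsable > 0) {
            return Outcome.COPY_MORE;         // another replica is still possible
        }
        // Deficit: more replicas requested than the available pools can hold.
        return mode == Mode.LIMIT
                ? Outcome.DONE_WITH_NOTE      // complete successfully, with a note
                : Outcome.WAIT_FOR_POOLS;     // default: wait for new pools
    }

    public static void main(String[] args) {
        // 5 replicas requested, 3 exist, no further pools available.
        System.out.println(next(Mode.WAIT, 5, 3, 0));
        System.out.println(next(Mode.LIMIT, 5, 3, 0));
    }
}
```

In this sketch the two modes differ only in the deficit case; while further pools are usable, or once the request is satisfied, they behave identically.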
Currently, hot pool replication is triggered by PoolManager when it is enabled and the thresholds for hot pool replication are met. Any request to read a file whose replica is on that pool then triggers a p2p request. This solution does not work well because of a certain "inertia": PoolManager determines that a pool is too hot only when it is already too late, and shuffling all files via p2p creates even more load on the hot pool.
The proposal is to perform hot/popular file replication on the pool itself, and to make it proactive. That is, when a pool detects N transfers (active and queued) for a particular file, it issues `migration copy -replicas=COPIES` to spread that file across multiple pools before receiving new requests to read it.

If COPIES > number of available pools, the migration job goes to sleep once it has placed a replica on every pool. This might be OK, or the behavior needs to be modified so that the job completes with a warning in the log file.
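A minimal sketch of the proactive trigger described above, under stated assumptions: the class `HotFileTrigger` and its methods are hypothetical and not part of dCache, files are identified by a plain string ID, and "issuing" the migration job is modelled by recording the file and the `-replicas` value rather than calling the migration module.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in dCache.
final class HotFileTrigger {
    private final int threshold; // N: active + queued transfers per file
    private final int copies;    // COPIES: value to pass as -replicas
    private final Map<String, Integer> transfers = new HashMap<>();
    private final Set<String> alreadyRequested = new HashSet<>();
    // File ID -> requested replica count, in the order requests were issued.
    private final Map<String, Integer> issued = new LinkedHashMap<>();

    HotFileTrigger(int threshold, int copies) {
        this.threshold = threshold;
        this.copies = copies;
    }

    /** Call when a transfer for the file becomes active or is queued. */
    void transferAdded(String fileId) {
        int count = transfers.merge(fileId, 1, Integer::sum);
        // Request replication once per file, when the threshold is first reached.
        if (count >= threshold && alreadyRequested.add(fileId)) {
            issued.put(fileId, copies); // stands in for: migration copy -replicas=COPIES
        }
    }

    /** Call when a transfer for the file finishes or is cancelled. */
    void transferFinished(String fileId) {
        // Decrement, dropping the entry when the count reaches zero.
        transfers.computeIfPresent(fileId, (id, c) -> c > 1 ? c - 1 : null);
    }

    Map<String, Integer> issuedRequests() {
        return new LinkedHashMap<>(issued);
    }

    public static void main(String[] args) {
        HotFileTrigger trigger = new HotFileTrigger(3, 2);
        for (int i = 0; i < 4; i++) {
            trigger.transferAdded("file-A");
        }
        System.out.println(trigger.issuedRequests());
    }
}
```

The sketch requests replication only once per file. Whether a file should become eligible again after its replicas are removed, and how long counts are retained, are policy questions the issue leaves open.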
So the pool config would get 3 parameters:
For a possible future MAPE-based implementation, we can envisage a policy-driven "central" service that would just send the necessary parameters to the pools to execute the copy (that is, N and COPIES would be changed dynamically by messages from such a central service). For now we want to lay a foundation for this work on the pools.