Implement hot/popular file replication on the pools #7675

DmitryLitvintsev · 2024-10-08T17:25:01Z

Currently, hot pool replication is triggered by PoolManager when the thresholds for hot pool replication are met and the feature is enabled. Any request to read a file whose replica is on that pool then triggers a p2p request. This solution does not work well because of a certain "inertia": PoolManager determines that a pool is too hot only when it is already too late, and shuffling all files via p2p creates even more load on the hot pool.

The proposal is to perform hot/popular file replication on the pool itself, and to make it proactive. That is, when a pool detects N transfers (active and queued) for a particular file, it issues migration copy -replicas=COPIES to spread that file across multiple pools before receiving new requests to read it.

If COPIES > number of available pools, the migration job goes into sleeping mode once it has placed replicas on all pools. This might be OK, or the behavior may need to be modified, e.g. to complete with a warning in the log file.

So the pool config would get 3 parameters:

* one parameter that enables the feature
* two parameters, the replication threshold N and COPIES, that control the module
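
As a rough illustration of the proposal above, the per-file transfer counting could look like the following. This is only a minimal sketch under stated assumptions: `HotFileMonitor`, `transferAdded`, `transferRemoved` and the string file ID are hypothetical names that do not exist in dCache, and the actual trigger would issue the migration copy rather than return a flag.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names exist in dCache.
class HotFileMonitor {
    private final boolean enabled;  // parameter 1: feature switch
    private final int threshold;    // parameter 2: N, transfers per file
    private final int copies;       // parameter 3: COPIES, replicas to request

    // Active plus queued transfers per file, keyed by file ID.
    private final Map<String, Integer> transfers = new HashMap<>();
    // Files for which replication has already been requested.
    private final Set<String> requested = new HashSet<>();

    HotFileMonitor(boolean enabled, int threshold, int copies) {
        this.enabled = enabled;
        this.threshold = threshold;
        this.copies = copies;
    }

    // Number of replicas the caller should ask the migration job for.
    int copies() {
        return copies;
    }

    // Record a new active or queued transfer. Returns true at most once per
    // file: the first time its transfer count reaches the threshold while
    // the feature is enabled.
    synchronized boolean transferAdded(String fileId) {
        int count = transfers.merge(fileId, 1, Integer::sum);
        return enabled && count >= threshold && requested.add(fileId);
    }

    // Record that a transfer finished or was cancelled.
    synchronized void transferRemoved(String fileId) {
        transfers.computeIfPresent(fileId, (id, n) -> n > 1 ? n - 1 : null);
    }
}
```

When `transferAdded` returns true, the caller would issue migration copy -replicas=COPIES for that file. Remembering which files have already been requested avoids starting a second job for a file that stays hot.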

For a possible future MAPE-based implementation, we can envisage a policy-driven "central" service that would just send the necessary parameters to the pools to execute the copy (that is, N and COPIES would be changed dynamically per message from such a central service). For now we want to lay a foundation for this work on the pools.


greenc-FNAL · 2024-10-15T12:52:31Z

@DmitryLitvintsev Please explain, "MAPE?"

DmitryLitvintsev · 2024-10-15T13:03:27Z

https://en.wikipedia.org/wiki/Autonomic_computing

**Motivation:**

Allow modified replication behavior to support hotspot mitigation. Specifically, we need replication jobs to complete without error when the number of replicas is maximized with respect to the available pools, even if more replicas were requested.

**Modification:**

* In `Job.java`: `Error` -> `Note`. The names of all related methods are changed accordingly (e.g. `addError()` -> `addNote()`).
* `MigrationModule` gets new option `-target-deficit-mode=(wait|limit)` (default `wait`) determining behavior when requested replicas exceeds target pool availability:
  * `wait` The job will wait for new pools to become available (e.g. added to specified pool group). This is the default.
  * `limit` If more replicas are requested than exist available pools, the job will complete successfully (with a note) when the number of replicas is maximized for the pools currently available.
* In interface `TaskCompletionHandler`: new method `taskCompletedWithNote(Task, String)` with default implementation forwarding to `taskCompleted(Task)`.
* New (manually-generated) files `Task_sm.dot`, and `Task_sm.dot.png` showing the state machine model encoded by `Task.sm`.
* New method in `Task.java`: `moreReplicasPossible()` for use in state transition guards.
* New finite state machine (FSM) action `notifyCompletedWithInsufficientReplicas()` in `Task.java`.
* In `Task.sm`, `Done` entry action `notifyCompleted()` removed in favor of per-transition actions `notifyCompleted()`, or `notifyCompletedWithInsufficientReplicas()` as appropriate. It was determined by analysis that moving the post-state-change entry action to a pre-state-change transition did not disrupt the intended semantics of the state machine.
* New guarded transitions to `Done` when more replicas are requested than are possible with the target pools available currently.
* `Task_sm.dot`, and `Task_sm.dot.png` updated manually to reflect the changes to the state machine.

**Result:**

* Partially implements #7675
* Replication jobs can be issued with `-target-deficit-mode=limit` to terminate successfully when all pools available currently have been populated with replicas, but more replicas were requested.

**Target:** master
**Request:** -
**Patch:** https://rb.dcache.org/r/14362/
**Closes:**
**Requires-notes:** yes
**Requires-book:**
**Acked-by:** Tigran Mkrtchyan, Dmitry Litvintsev
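
The difference between `wait` and `limit` described above can be summarized as a small decision function. This is a sketch only, assuming a simplified view of the job: `TargetDeficitPolicy`, its `Mode` and `Outcome` enums and the `decide` method are hypothetical and are not the dCache classes or the actual `Task.sm` transitions.

```java
// Hypothetical sketch: these types are illustrative, not the dCache classes.
class TargetDeficitPolicy {
    enum Mode { WAIT, LIMIT }

    enum Outcome { COPY_MORE, DONE, WAIT_FOR_POOLS, DONE_WITH_NOTE }

    // requested:   replicas asked for (-replicas=COPIES)
    // existing:    replicas that exist already
    // freeTargets: eligible target pools that do not hold a replica yet
    static Outcome decide(int requested, int existing, int freeTargets, Mode mode) {
        if (existing >= requested) {
            return Outcome.DONE;           // request fully satisfied
        }
        if (freeTargets > 0) {
            return Outcome.COPY_MORE;      // more replicas are still possible
        }
        // Deficit: more replicas requested than the available pools can hold.
        return mode == Mode.LIMIT ? Outcome.DONE_WITH_NOTE : Outcome.WAIT_FOR_POOLS;
    }
}
```

For example, with 5 replicas requested, 3 in place and no free target pools, `limit` yields `DONE_WITH_NOTE`, while `wait` yields `WAIT_FOR_POOLS` until a new pool becomes available.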

DmitryLitvintsev added the enhancement label (A request that enhances existing behaviour) Oct 8, 2024

DmitryLitvintsev assigned greenc-FNAL Oct 8, 2024
