Implement hot/popular file replication on the pools #7675

Open
DmitryLitvintsev opened this issue Oct 8, 2024 · 2 comments


@DmitryLitvintsev
Member

Currently, hot pool replication is triggered by PoolManager when the feature is enabled and its thresholds are met. Any request to read a file whose replica is on that pool then triggers a p2p request. This solution does not work well because of a certain "inertia": PoolManager determines that a pool is too hot only when it is already too late, and shuffling all files via p2p creates even more load on the hot pool.

The proposal is to implement hot/popular file replication on the pool itself and to make it proactive. That is, when a pool detects N transfers (active and queued) for a particular file, it issues migration copy -replicas=COPIES to spread that file across multiple pools before receiving new requests to read it.
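
The detection step described above could be sketched roughly as follows. This is only an illustration under assumptions: the class name `HotFileDetector`, its methods, and the callback are hypothetical and are not part of dCache. The callback stands in for whatever would eventually submit the migration job.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/**
 * Hypothetical sketch: tracks active plus queued read transfers per file and
 * invokes a callback when the count for a file rises to the threshold N.
 */
final class HotFileDetector {
    private final int threshold;            // N in the proposal
    private final Consumer<String> onHot;   // stand-in for submitting the migration job
    private final Map<String, Integer> transfers = new ConcurrentHashMap<>();

    HotFileDetector(int threshold, Consumer<String> onHot) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.threshold = threshold;
        this.onHot = onHot;
    }

    /** Call when a read request for the file is queued or started. */
    void transferAdded(String pnfsId) {
        int count = transfers.merge(pnfsId, 1, Integer::sum);
        // Fire only when the count reaches exactly N, not on every later request.
        // It fires again if the count drops below N and climbs back, so a real
        // implementation would also need to remember files already replicated.
        if (count == threshold) {
            onHot.accept(pnfsId);
        }
    }

    /** Call when a transfer finishes or is cancelled. */
    void transferRemoved(String pnfsId) {
        // Returning null removes the entry once the last transfer is gone.
        transfers.computeIfPresent(pnfsId, (id, n) -> n > 1 ? n - 1 : null);
    }

    /** Current number of tracked transfers for the file. */
    int count(String pnfsId) {
        return transfers.getOrDefault(pnfsId, 0);
    }
}
```

Triggering on the exact crossing rather than on every request at or above N keeps one burst of reads from submitting many identical jobs, which matters because the point of the proposal is to avoid adding load to an already hot pool.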

If COPIES > number of available pools, the migration job goes into sleeping mode once it has placed a replica on every pool. This might be OK, or the behavior may need to be modified so that the job completes with a warning in the log file.

So the pool config would get 3 parameters:

  1. one parameter that enables the feature
  2. two parameters, the replication threshold N and COPIES, that control the module
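
As a rough illustration of these three parameters, a small immutable holder might look like the sketch below. The class name `HotFileReplicationConfig` and its accessors are hypothetical, and no actual dCache property names are implied. The command string contains only what is stated above; a real job would also need to select the file and the target pools.

```java
/**
 * Hypothetical holder for the three proposed pool parameters:
 * an on/off switch, the threshold N and the replica count COPIES.
 */
final class HotFileReplicationConfig {
    private final boolean enabled;
    private final int threshold;   // N: active plus queued transfers per file
    private final int copies;      // COPIES: value passed to -replicas

    HotFileReplicationConfig(boolean enabled, int threshold, int copies) {
        if (threshold < 1 || copies < 1) {
            throw new IllegalArgumentException("threshold and copies must be >= 1");
        }
        this.enabled = enabled;
        this.threshold = threshold;
        this.copies = copies;
    }

    boolean isEnabled() { return enabled; }
    int threshold() { return threshold; }
    int copies() { return copies; }

    /** Builds the command fragment named in the proposal (file and target selection omitted). */
    String migrationCommand() {
        return "migration copy -replicas=" + copies;
    }
}
```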

For a possible future MAPE-based implementation: we can envisage a policy-driven "central" service that would just send the necessary parameters to the pools to execute the copy (that is, N and COPIES would be changed dynamically by messages from such a central service). For now we want to lay a foundation for this work on the pools.

@DmitryLitvintsev DmitryLitvintsev added the enhancement A request that enhances existing behaviour label Oct 8, 2024
@greenc-FNAL
Contributor

@DmitryLitvintsev Please explain, "MAPE?"

@DmitryLitvintsev
Member Author

greenc-FNAL added a commit that referenced this issue Dec 20, 2024
**Motivation:** Allow modified replication behavior to support hotspot
mitigation. Specifically, we need replication jobs to complete without
error when the number of replicas is maximized with respect to the
available pools, even if more replicas were requested.

**Modification:**

* In `Job.java`: `Error` -> `Note`.

  The names of all related methods are changed accordingly
  (e.g. `addError()` -> `addNote()`).

* `MigrationModule` gets a new option `-target-deficit-mode=(wait|limit)`
  (default `wait`) that determines the behavior when the number of
  requested replicas exceeds the number of available target pools:

    * `wait`

        The job will wait for new pools to become available (e.g. added
        to specified pool group). This is the default.

    * `limit`

        If more replicas are requested than there are available pools,
        the job will complete successfully (with a note) when the number
        of replicas is maximized for the pools currently available.

* In interface `TaskCompletionHandler`: new method
  `taskCompletedWithNote(Task, String)` with default implementation
  forwarding to `taskCompleted(Task)`.

* New (manually-generated) files `Task_sm.dot`, and `Task_sm.dot.png`
  showing the state machine model encoded by `Task.sm`.

* New method in `Task.java`: `moreReplicasPossible()` for use in state
  transition guards.

* New finite state machine (FSM) action
  `notifyCompletedWithInsufficientReplicas()` in `Task.java`.

* In `Task.sm`, `Done` entry action `notifyCompleted()` removed in favor
  of per-transition actions `notifyCompleted()`, or
  `notifyCompletedWithInsufficientReplicas()` as appropriate. It was
  determined by analysis that moving the post-state-change entry action
  to a pre-state-change transition did not disrupt the intended
  semantics of the state machine.

* New guarded transitions to `Done` when more replicas are requested
  than are possible with the target pools available currently.

* `Task_sm.dot`, and `Task_sm.dot.png` updated manually to reflect the
  changes to the state machine.
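
The forwarding default described for `TaskCompletionHandler` can be illustrated with a reduced sketch. The `Task` class below is a stand-in so that the example compiles, and the interface is cut down to the two methods named above; it is not the real dCache interface.

```java
/** Stand-in for the real Task class; exists only to make the sketch compile. */
final class Task {
    private final long id;

    Task(long id) { this.id = id; }

    long getId() { return id; }
}

/** Reduced, illustrative version of the interface with only the two methods discussed. */
interface TaskCompletionHandler {
    void taskCompleted(Task task);

    /**
     * Existing implementations that do not override this method keep working:
     * the note is dropped and the plain completion callback is invoked.
     */
    default void taskCompletedWithNote(Task task, String note) {
        taskCompleted(task);
    }
}
```

A default method of this kind lets handlers that care about the note override it, while all other handlers continue to see an ordinary completion.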

**Result:**

* Partially implements #7675

* Replication jobs can be issued with `-target-deficit-mode=limit` to
  terminate successfully when all pools available currently have been
  populated with replicas, but more replicas were requested.
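
The difference between the two modes can be summarized in a small decision function. This is a simplified sketch, not the logic in `Task.sm`: the enum and class names are hypothetical, and `eligiblePools` is assumed to count every currently available pool that can hold a replica, including those that already have one.

```java
/** Hypothetical mirror of the two -target-deficit-mode values. */
enum TargetDeficitMode { WAIT, LIMIT }

/** Hypothetical outcomes of one evaluation of a replication job. */
enum Outcome { CONTINUE, WAIT_FOR_POOLS, DONE, DONE_WITH_NOTE }

final class DeficitPolicy {
    private DeficitPolicy() {}

    /**
     * @param requested     replicas requested via -replicas
     * @param existing      replicas that already exist
     * @param eligiblePools available pools that can hold a replica (assumption, see lead-in)
     */
    static Outcome next(int requested, int existing, int eligiblePools, TargetDeficitMode mode) {
        if (existing >= requested) {
            return Outcome.DONE;               // request fully satisfied
        }
        if (existing < eligiblePools) {
            return Outcome.CONTINUE;           // more replicas are still possible
        }
        // Every available pool holds a replica but more were requested.
        return mode == TargetDeficitMode.LIMIT
                ? Outcome.DONE_WITH_NOTE       // finish successfully, with a note
                : Outcome.WAIT_FOR_POOLS;      // default: wait for new pools
    }
}
```

For example, with 5 replicas requested and 3 eligible pools all holding a replica, `limit` yields `DONE_WITH_NOTE` while `wait` yields `WAIT_FOR_POOLS`.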

**Target:** master
**Request:** -
**Patch:** https://rb.dcache.org/r/14362/
**Closes:**
**Requires-notes:** yes
**Requires-book:**
**Acked-by:** Tigran Mkrtchyan, Dmitry Litvintsev