Scheduling: Hierarchies #2

anshumanmohan · 2024-04-07T16:42:03Z

UPDATE. We have mostly abandoned this line, because our setup (Core C can do the work of both A and B...) has begun to feel a little outlandish.

How would a PIFO tree ease scheduling on a smartNIC? In this discussion I suggest that we focus on hierarchies. At the moment I am primarily seeking feedback regarding whether my setup is realistic. First, I present this setup. Then, I explain how a PIFO tree with minor extensions can help in this setting. I sketch those extensions. Finally, I point out some known issues.

Setup

A smartNIC receives incoming traffic in the form of packets of work. Across a PCIe link from the smartNIC are many cores, which can do work after receiving packets. There are three categories of cores, and we name these categories A, B, and C.
- Cores belonging to categories A and B have distinct capabilities (neither can do the work that the other can). These kinds of cores are very performant, but we have few of them.
- Cores belonging to category C can do the work of both A and B. Cores of this kind are less performant, but we have more of them.
A packet of work arrives at the smartNIC and then goes across the PCIe link to a core that can handle it. Packets may come from a variety of different clients, and the clients do not have a way of knowing which category of core is best-suited to handle which packet.
We have two goals:
- Determine which core will handle which packet.
- Schedule packets, i.e., determine the order in which packets will traverse the PCIe link (in the figure, left-to-right) and receive service.

Our Scheduler

PIFO Tree

I propose that our scheduler live on the smartNIC and be in the shape of a binary-branching PIFO tree of height 2. As a reminder, a PIFO tree already has a scheduling transaction, which has two roles:

- Determine which leaf a packet should go into.
- Assign the packet a rank (achieved in reality using an insertion path) that eventually determines the order in which buffered packets are popped from the tree.

Our tree will partition the incoming traffic by core: packets that can be served by cores of category A (resp. B) will go into leaf a (resp. b). We will default to maintaining FIFO order within a leaf.
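To make the shape concrete, here is a minimal sketch of such a tree. The names `Pifo` and `TwoLeafTree` and the `root_rank` parameter are hypothetical, and the sketch assumes the scheduling transaction has already chosen the leaf and the rank at the root; it is an illustration, not our actual implementation.

```python
import heapq
import itertools
from collections import deque


class Pifo:
    """Push-in first-out queue: push with a rank, pop the smallest rank.

    Ties are broken in FIFO order using an insertion counter.
    """

    def __init__(self):
        self._heap = []
        self._tick = itertools.count()

    def push(self, rank, item):
        heapq.heappush(self._heap, (rank, next(self._tick), item))

    def pop(self):
        return heapq.heappop(self._heap)[2]


class TwoLeafTree:
    """Root PIFO holding leaf names, plus one FIFO leaf per core category."""

    def __init__(self):
        self.root = Pifo()
        self.leaves = {"a": deque(), "b": deque()}

    def push(self, packet, leaf, root_rank):
        # Assumption: the scheduling transaction already chose `leaf` and `root_rank`.
        self.leaves[leaf].append(packet)  # FIFO order within the leaf
        self.root.push(root_rank, leaf)   # the root records which leaf to visit

    def pop(self):
        leaf = self.root.pop()            # best-ranked leaf reference
        return self.leaves[leaf].popleft()


tree = TwoLeafTree()
tree.push("P1", leaf="b", root_rank=1)
tree.push("P2", leaf="a", root_rank=2)
tree.push("P3", leaf="a", root_rank=3)
order = [tree.pop() for _ in range(3)]
print(order)
```

The root never stores packets, only references to leaves, so an ordinary pop is one pop at the root followed by one pop at the chosen leaf.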

Popping the Tree

When a core is ready for a task, it pops the tree (via a signal sent right-to-left over the PCIe link). In particular:

- A core of category C, which can handle either kind of task, just pops the tree as usual.
- A core of category A or B executes a parameterized pop, which is a new but relatively straightforward operation: pop a subtree by providing the address of the subtree's root, and housekeep above the subtree to ensure well-formedness. Details about this operation are below.

Relaying Additional Information

A core can also send a right-to-left signal to the tree to communicate information that may only be known to a core, or only revealed once work on a packet has begun. We can put this information into the tree via a dummy push: the push effects a reordering of previously buffered packets and itself enqueues a dummy packet at a leaf with very low priority. Details about this operation are below.

Alternative Strategy

In the strategy above, cores "pull" packets over the PCIe link by (perhaps parametrically) popping the tree. Each core does this when it is ready for more work. The alternative would be for the tree to "push" packets across the link on its own. However, this is challenging:

- Even if the tree has enough information to predetermine the order in which packets should depart, it will not have enough information to predetermine the rate of transmission for each class of packets. A steady rate of transmission may not even exist.
- If the tree sends packets too slowly, the cores will be idle and we will violate work conservation.
- If the tree sends packets across too eagerly, packets will have to wait at the cores in buffers that the cores will have to maintain. Further, "committing" a packet to a busy core too early can mean that, if another core becomes available later on, we cannot easily send the packet to that new core.

We reject this alternative strategy.

Extensions

Parameterized Pop

Consider the tree at the top of the following figure. If popped three times as usual, this tree will yield P1, then P2, then P3.

However, say we perform a new kind of pop, which takes the address of a subtree. Instead of $pop(t)$, we now do $pop(t, [L])$. That is, start at the root of $t$, follow the address given (here, $[L]$) to find a subtree, and pop that subtree as usual. Doing this yields the packet P2. However, doing only this leaves the tree ill-formed. We must also housekeep "above" the subtree by removing an occurrence of $L$ from the root PIFO. In the figure below, I have marked the order of operations in blue.
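A rough sketch of the parameterized pop for a one-step address such as $[L]$ follows. The `Tree` class, the `pop_at` name, and the arrangement of P1, P2, P3 are assumptions made for illustration (the arrangement need not match the figure). The sketch also assumes that housekeeping removes the best-ranked occurrence of the leaf in the root PIFO; the text above only requires that some occurrence be removed.

```python
from collections import deque


class Tree:
    """Two-leaf PIFO tree; the root PIFO is kept as a rank-sorted list."""

    def __init__(self):
        self.root = []  # [(rank, leaf)], smallest rank first
        self.leaves = {"L": deque(), "R": deque()}

    def push(self, packet, leaf, root_rank):
        self.leaves[leaf].append(packet)
        self.root.append((root_rank, leaf))
        self.root.sort(key=lambda entry: entry[0])  # stable, so ties stay FIFO

    def pop(self):
        _, leaf = self.root.pop(0)
        return self.leaves[leaf].popleft()

    def pop_at(self, leaf):
        """Parameterized pop: pop the subtree at `leaf`, then housekeep above it."""
        packet = self.leaves[leaf].popleft()  # 1. pop the subtree as usual
        # 2. Housekeeping: drop one reference to `leaf` from the root PIFO.
        #    Assumption: we drop the best-ranked one. Note that this is a
        #    linear scan of the root, not a pop of its head.
        for i, (_, name) in enumerate(self.root):
            if name == leaf:
                del self.root[i]
                break
        return packet


tree = Tree()
tree.push("P1", "R", root_rank=1)
tree.push("P2", "L", root_rank=2)
tree.push("P3", "L", root_rank=3)
first = tree.pop_at("L")          # P2, although P1 is ranked ahead of it
rest = [tree.pop(), tree.pop()]   # the remaining packets, popped as usual
print(first, rest)
```

After the parameterized pop, the root holds exactly as many references to each leaf as that leaf holds packets, which is the well-formedness condition the housekeeping step restores.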

Dummy Push

Consider the same initial tree as above. If popped three times as usual, this tree will yield P1, then P2, then P3. We can push a dummy value, marked with a hole in red, down some insertion path marked in red. The result is that the tree will now yield P3, then P1, then P2, and then the dummy value.

Note that we cannot always guarantee that the dummy value will come out last. Consider the example below. After the dummy value is pushed, the tree will yield P3, then P1, then P2, then the dummy value, and then P4. To get around this, the pop method will need to be altered a little: if a pop yields a dummy value, discard it and pop again. Each pop removes one element from a finite tree, so this terminates; if only dummy values remain, the final pop reaches an empty tree, which is undefined just as it is for an ordinary pop.
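The dummy push and the altered pop can be sketched as follows. The `DUMMY` sentinel, the `raw_pop` name, and the arrangement of packets are all hypothetical: this sketch assumes a two-leaf tree in which P1 and P2 sit in leaf L, P3 sits in leaf R, and P4 arrives at leaf R after the dummy push. The figures may use a different arrangement.

```python
from collections import deque

DUMMY = object()  # hypothetical sentinel standing in for the dummy value


class Tree:
    """Two-leaf PIFO tree; the root PIFO is kept as a rank-sorted list."""

    def __init__(self):
        self.root = []  # [(rank, leaf)], smallest rank first
        self.leaves = {"L": deque(), "R": deque()}

    def push(self, packet, leaf, root_rank):
        self.leaves[leaf].append(packet)  # FIFO leaf: newest goes to the back
        self.root.append((root_rank, leaf))
        self.root.sort(key=lambda entry: entry[0])

    def raw_pop(self):
        """Ordinary pop; raises IndexError on an empty tree."""
        _, leaf = self.root.pop(0)
        return self.leaves[leaf].popleft()

    def pop(self):
        """Altered pop: discard dummy values and pop again."""
        while True:
            packet = self.raw_pop()
            if packet is not DUMMY:
                return packet


def build():
    t = Tree()
    t.push("P1", "L", root_rank=1)
    t.push("P2", "L", root_rank=2)
    t.push("P3", "R", root_rank=3)
    t.push(DUMMY, "R", root_rank=0)  # dummy push: front of the root, back of leaf R
    t.push("P4", "R", root_rank=4)   # a later arrival lands behind the dummy
    return t


raw = build()
raw_names = ["dummy" if p is DUMMY else p for p in (raw.raw_pop() for _ in range(5))]

filtered = build()
filtered_order = [filtered.pop() for _ in range(4)]
print(raw_names, filtered_order)
```

The dummy's reference at the front of the root is what moves P3 ahead of P1 and P2, while the altered pop keeps the dummy itself from ever reaching a core.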

Known Issues

- Parameterized pops are inefficient! The popping of the subtree itself is efficient ($O(\mathrm{height}(\mathit{subtree}))$), but housekeeping the tree above the subtree can be expensive because we are not just popping the heads of PIFOs.
- Whenever cores associated with leaf nodes (in our example, cores of categories A and B) pull work over to themselves, we are basically not doing scheduling. It's actually worse than not doing scheduling, since we do do some scheduling when inserting the packet into the PIFO tree and must then undo that work via a parameterized pop. Anyway, if the hierarchical arrangement feels awkward and expensive, the natural next option is to just do away with the PIFO tree arrangement and let each category of cores maintain its own queue. These per-category queues may not need to be as sophisticated as PIFO trees.

anshumanmohan · 2024-06-04T12:55:11Z

Just updating this discussion to reflect decisions we've made synchronously: we have mostly abandoned this line, because our setup (Core C can do the work of both A and B...) has begun to feel a little outlandish. I'll also add a note to the top of the discussion.
