Great work with this paper and congratulations!
I had a quick question about how disconnects are handled during the backward pass. From the paper, it seems that timeouts are only triggered on the forward pass. But during the backward pass, you need to return the gradients of a node's inputs to the same node that the batch passed through on the forward pass. I couldn't find an explanation of how this is handled in the paper, and from what I see in the code, a new expert is simply chosen at random, which doesn't seem like a sound solution, since a freshly chosen expert has different parameters and never processed that batch.
I was wondering if I am missing something.
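To make my reading concrete, here is a minimal sketch of how I understand the forward/backward pairing. All names here are hypothetical placeholders rather than identifiers from the actual codebase, and the local stub merely stands in for a remote node:

```python
import torch
import torch.nn as nn


class LocalExpertStub(nn.Module):
    """Stand-in for a remote expert; in the real system this would run on another node."""

    def __init__(self, dim: int = 4):
        super().__init__()
        self.ffn = nn.Linear(dim, dim)

    def forward(self, x):
        return self.ffn(x)


class ExpertCall(torch.autograd.Function):
    """Hypothetical sketch of the forward/backward pairing as I understand it."""

    @staticmethod
    def forward(ctx, inputs, expert):
        # The timeout described in the paper would apply to this (forward) call.
        # Remember which expert actually processed this batch.
        ctx.expert = expert
        ctx.save_for_backward(inputs)
        return expert(inputs)

    @staticmethod
    def backward(ctx, grad_outputs):
        (inputs,) = ctx.saved_tensors
        expert = ctx.expert
        # The gradient w.r.t. the inputs has to be computed by the *same* expert,
        # because only it holds the parameters used in the forward pass. If that
        # node has disconnected by now, re-routing this call to a randomly chosen
        # replacement would yield gradients w.r.t. different parameters.
        inputs = inputs.detach().requires_grad_(True)
        with torch.enable_grad():
            outputs = expert(inputs)
            (grad_inputs,) = torch.autograd.grad(outputs, inputs, grad_outputs)
        return grad_inputs, None


# Usage: route a batch through one "remote" expert and backprop through it.
expert = LocalExpertStub()
x = torch.randn(2, 4, requires_grad=True)
ExpertCall.apply(x, expert).sum().backward()
```

The part I am unsure about is what should happen when the node behind `ctx.expert` is gone by the time backward runs.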
Also, to replicate the results, which commands do you use to run `setup.py`? Both `install` and `build` seem to be insufficient.