-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clarifying how sharding works #6853
base: main
Are you sure you want to change the base?
Conversation
Sorry :)
Thank you for looking it through and fixing my mistakes :) Co-authored-by: Zoddo <[email protected]>
Word choice and formula disambiguation Co-authored-by: Oliver Wilkes <[email protected]> Co-authored-by: Zoddo <[email protected]>
|
||
###### Sharding Formula | ||
|
||
```python | ||
shard_id = (guild_id >> 22) % num_shards | ||
(guild_id >> 22) % num_shards == shard_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not entirely sure why this change is necessary - both say the same thing, and I think the prior version says it a bit more clearly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean whether shard_id
is first or last, or whether it's an assignment =
or comparison ==
?
In the first case, it is to less ambiguously show that it is, in fact, intended to be a comparison and not an assignment.
The whole point of this PR is to rewrite the documentation in terms of which of the guilds that a bot is in to which a gateway "session" with a certain shard_array
will be subscribed to events, as opposed to deciding which "shard_id
a certain event will be sent to" as this is very confusing in the context of multiple "sessions" with the same shard_id
and different num_shards
. See the original discussion that led to this PR and Zoddos reply in Lulalaby's review above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that a = b + c
is much more clear than b + c == a
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this PR, the formula is written in the context of - when a shard is connecting to the gateway - iterating through all guilds that a bot is part of and deciding whether the shard should receive events from that guild based on the guilds guild_id
and the shard
array with shard_id
and num_shards
that the shard provided.
Sure, if you see it as a mathematical equation, a single =
would suffice, but this is documentation that programmers will be reading to understand how sharding behaves, and when programmers see an expression like that with things that are very clearly variable names they would assume that =
is an assignment, which this formula is not.
I agree that writing the single variable on the left-hand side looks cleaner, but as discussed in the thread I mentioned, that would create some ambiguity of whether it was actually meant as a comparison or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this not an assignment? The original wording is much clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although I appreciate the discussion, I would still prefer the original form to remain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, we can keep it as an assignment. It still has to be in the context of a single shard though, as each session has its own (potentially different) value of num_shards
. Then we can just write with words afterwards that "The session will receive events from every guild that, using the formula above, evaluates to the sessions shard_id
using the corresponding num_shards
".
I don't see how adding this level of indirection makes it any clearer though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that simply keeping it as is for the time being is fine. I feel the formula is clear enough and one can grok the relationship.
I've always thought the larger meaning here is: "given a guild, and the number of shards I'm running, which shard does a given guild land on". This answers a practical question, for example if you want to locate the shard that is handling a given guild, so you can look up data from that shard's process. Internally, we do such lookups (albeit with hash rings, but they are logically the same.). This is how the original formula was written.
The, "which guilds precisely will land on this shard" is a less useful question to answer, as it's not something the developer has control over, and is a fact that one can derive from the original formula as well.
@@ -561,27 +561,33 @@ When connecting to the gateway as a bot user, guilds that the bot is a part of w | |||
|
|||
## Sharding | |||
|
|||
As apps grow and are added to an increasing number of guilds, some developers may find it necessary to divide portions of their app's operations across multiple processes. As such, the Gateway implements a method of user-controlled guild sharding which allows apps to split events across a number of Gateway connections. Guild sharding is entirely controlled by an app, and requires no state-sharing between separate connections to operate. While all apps *can* enable sharding, it's not necessary for apps in a smaller number of guilds. | |||
As apps grow and are added to an increasing number of guilds, some developers may find it necessary to divide portions of their app's operations across multiple processes. As such, the Gateway implements a method of user-controlled guild sharding which allows apps to split events across a number of Gateway sessions. Guild sharding is entirely controlled by an app, and requires no state-sharing between separate sessions to operate. While all apps *can* enable sharding, it's not necessary for apps in a smaller number of guilds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't really have something called a "gateway session" but I understand the intention here.
To elaborate on the terminology, a gateway connection refers to a connection to our websocket gateway at gateway.discord.gg, and a gateway connection then spawns a session, or re-establishes a connection to an existing session. The session outlives the gateway connection, since you can re-connect to the gateway when you're disconnected, and RESUME to re-establish the gateway socket's connection to a given session.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it should just be "session" instead of "Gateway session"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that would be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is session more accurate than connection here? IMO connections are more intuitive than sessions, so rewriting this section in terms of sessions makes it harder to understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that session is more accurate yes.
As Jake wrote, a session can be RESUMED in a new connection to the gateway. And, as per the documentation, a connection will be sent all missed events from a session once it resumes it.
Thus, events being sent to a session which "forwards" them over the active connection or stores them if a connection is not currently active is a more accurate way of thinking of it (unless I completely misunderstand how it works).
As an example, if you wanted to split the connection between three shards, you'd use the following values for `shard` for each connection: `[0, 3]`, `[1, 3]`, and `[2, 3]`. Note that only the first shard (`[0, 3]`) would receive DMs. | ||
Every session with `shard_id = 0` will be subscribed to DMs and other non-guild related events. | ||
|
||
As an example, if you wanted to split events equally between three shards, you'd use the following values for `shard` for each session: `[0, 3]`, `[1, 3]`, and `[2, 3]`. DMs would only be sent to the `[0, 3]` shard. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The word "evenly" is not necessarily true, since the volume of events dispatched to a given session is entirely dependent on the guilds on that shard.
E.g. in a pathological case, if your bot is in 2 guilds, and you have one of the guilds has 1 member, and the other 500,000 members, you definitely won't see an "even" distribution of events.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. "As an example, if you wanted to split guilds equally between three shards" would probably be better wording, along with a remark that this of course won't necessarily split the events evenly if some guilds produce more events than others.
I'll include this in a batch once all your comments have been resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no "split equally" either - for example, if you are running 3 shards, and you leave all guilds on shard 0, then your guilds are not and will no longer be split equally, and re-connecting would of course not repair this bias.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The entire basis of this system is that it takes a probabilistic approach to distributing guilds between shards, based on the millisecond that the guild was created in our system. With enough guilds and shards, it should balance out, but there definitely is no guarantee of an even or equal split one way or another.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I understand that very well. There is an implicit assumption in that sentence that the bot is part of many guilds (as it would most likely be when sharding becomes necessary). I could definitely make that assumption explicit.
|
||
As an example, if you wanted to split events equally between three shards, you'd use the following values for `shard` for each session: `[0, 3]`, `[1, 3]`, and `[2, 3]`. DMs would only be sent to the `[0, 3]` shard. | ||
|
||
Note that `num_shards` does not relate to (or limit) the total number of potential sessions, and can be different between multiple sessions existing at the same time. It is only used to decide whether an event will be sent to the associated session using the [Sharding Formula](#DOCS_TOPICS_GATEWAY/sharding-sharding-formula) above. In the simple case like the example above, where every session has the same `num_shards` and the sessions respective `shard_id`'s cover every value from `0` to `num_shards - 1`, the events will be split evenly between the sessions. This is probably how most bots will operate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove the word "evenly" here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, I agree.
I and at least one other user in this Forum Channel thread on the Discord Developers server were confused about how sharding worked with different
num_shards
for different sessions. I have rewritten some of the sharding explanation to hopefully be a bit more clear about how it actually works, based on discussion with @Zoddo in the aforementioned thread. Please correct anything I may have gotten wrong.Here's a screenshot of the original discussion with a transcript: