Clarifying how sharding works #6853
Changes from all commits
17758ba
8f3636b
1e8ea76
e2c6974
9cb9abc
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3,7 +3,7 @@ | |
The Gateway API lets apps open secure WebSocket connections with Discord to receive events about actions that take place in a server/guild, like when a channel is updated or a role is created. There are a few cases where apps will *also* use Gateway connections to update or request resources, like when updating voice state. | ||
|
||
> info | ||
> In *most* cases, performing REST operations on Discord resources can be done using the [HTTP API](#DOCS_REFERENCE/http-api) rather than the Gateway API. | ||
> In *most* cases, performing REST operations on Discord resources can be done using the [HTTP API](#DOCS_REFERENCE/http-api) rather than the Gateway API. | ||
|
||
The Gateway is Discord's form of real-time communication used by clients (including apps), so there are nuances and data passed that simply aren't relevant to apps. Interacting with the Gateway can be tricky, but there are [community-built libraries](#DOCS_TOPICS_COMMUNITY_RESOURCES/libraries) with built-in support that simplify the most complicated bits and pieces. If you're planning on writing a custom implementation, be sure to read the following documentation in its entirety so you understand the sacred secrets of the Gateway (or at least those that matter for apps). | ||
|
||
|
@@ -72,7 +72,7 @@ Gateway connections are persistent WebSockets which introduce more complexity th | |
At a high-level, Gateway connections consist of the following cycle: | ||
|
||
![Flowchart with an overview of Gateway connection lifecycle](gateway-lifecycle.svg) | ||
|
||
1. App establishes a connection with the Gateway after fetching and caching a WSS URL using the [Get Gateway](#DOCS_TOPICS_GATEWAY/get-gateway) or [Get Gateway Bot](#DOCS_TOPICS_GATEWAY/get-gateway-bot) endpoint. | ||
2. Discord sends the app a [Hello (opcode `10`)](#DOCS_TOPICS_GATEWAY/hello-event) event containing a heartbeat interval in milliseconds. **Read the section on [Connecting](#DOCS_TOPICS_GATEWAY/connecting)** | ||
3. Start the Heartbeat interval. App must send a [Heartbeat (opcode `1`)](#DOCS_TOPICS_GATEWAY_EVENTS/heartbeat) event, then continue to send them every heartbeat interval until the connection is closed. **Read the section on [Sending Heartbeats](#DOCS_TOPICS_GATEWAY/sending-heartbeats)** | ||
|
@@ -463,7 +463,7 @@ Apps **without** the intent will receive empty values in fields that contain use | |
- Content in messages that an app sends | ||
- Content in DMs with the app | ||
- Content in which the app is [mentioned](#DOCS_REFERENCE/message-formatting-formats) | ||
- Content of the message a [message context menu command](#DOCS_INTERACTIONS_APPLICATION_COMMANDS/message-commands) is used on | ||
- Content of the message a [message context menu command](#DOCS_INTERACTIONS_APPLICATION_COMMANDS/message-commands) is used on | ||
|
||
## Rate Limiting | ||
|
||
|
@@ -561,27 +561,33 @@ When connecting to the gateway as a bot user, guilds that the bot is a part of w | |
|
||
## Sharding | ||
|
||
As apps grow and are added to an increasing number of guilds, some developers may find it necessary to divide portions of their app's operations across multiple processes. As such, the Gateway implements a method of user-controlled guild sharding which allows apps to split events across a number of Gateway connections. Guild sharding is entirely controlled by an app, and requires no state-sharing between separate connections to operate. While all apps *can* enable sharding, it's not necessary for apps in a smaller number of guilds. | ||
As apps grow and are added to an increasing number of guilds, some developers may find it necessary to divide portions of their app's operations across multiple processes. As such, the Gateway implements a method of user-controlled guild sharding which allows apps to split events across a number of Gateway sessions. Guild sharding is entirely controlled by an app, and requires no state-sharing between separate sessions to operate. While all apps *can* enable sharding, it's not necessary for apps in a smaller number of guilds. | ||
|
||
> warn | ||
> Each shard can only support a maximum of 2500 guilds, and apps that are in 2500+ guilds *must* enable sharding. | ||
> Each shard can only support a maximum of 2500 guilds, and apps that are in 2500+ guilds *must* enable sharding. | ||
|
||
To enable sharding on a connection, the app should send the `shard` array in the [Identify](#DOCS_TOPICS_GATEWAY_EVENTS/identify) payload. The first item in this array should be the zero-based integer value of the current shard, while the second represents the total number of shards. DMs will only be sent to shard 0. | ||
Sessions that would like to only receive events from a subset of guilds should send the `shard` array in the [Identify](#DOCS_TOPICS_GATEWAY_EVENTS/identify) payload. The first item in this array is `shard_id`, the zero-based integer value of the current shard, while the second is `num_shards` and represents the total number of shards. | ||
|
||
> info | ||
> The [Get Gateway Bot](#DOCS_TOPICS_GATEWAY/get-gateway-bot) endpoint provides a recommended number of shards for your app in the `shards` field | ||
|
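A minimal sketch of how the `shard` array could be placed in an Identify payload, one payload per session. The helper name `build_identify_payload`, the token string, the intents value, and the `properties` values are placeholders chosen for illustration; the range check on `shard_id` is an assumption about what counts as a valid shard.

```python
import json

IDENTIFY_OPCODE = 2  # Gateway opcode for Identify


def build_identify_payload(token, intents, shard_id, num_shards):
    """Build an Identify payload for one session (hypothetical helper)."""
    # Assumption: a valid shard_id lies in the range 0 .. num_shards - 1.
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must satisfy 0 <= shard_id < num_shards")
    return {
        "op": IDENTIFY_OPCODE,
        "d": {
            "token": token,  # placeholder, not a real token
            "intents": intents,
            "properties": {  # placeholder values
                "os": "linux",
                "browser": "my_library",
                "device": "my_library",
            },
            # First item is shard_id (zero-based), second is num_shards.
            "shard": [shard_id, num_shards],
        },
    }


# Three sessions that together cover shard IDs 0, 1 and 2.
payloads = [build_identify_payload("my_token", 513, i, 3) for i in range(3)]
print(json.dumps([p["d"]["shard"] for p in payloads]))
```

Each payload would be sent on its own Gateway connection; the `num_shards` value could be taken from the `shards` field returned by Get Gateway Bot.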
||
To calculate which events will be sent to which shard, the following formula can be used: | ||
A certain gateway session is only subscribed to events from guilds with a `guild_id` that satisfies the following formula, using the `shard_id` and `num_shards` that the session provided in the [Identify](#DOCS_TOPICS_GATEWAY_EVENTS/identify) event: | ||
|
||
###### Sharding Formula | ||
|
||
```python | ||
shard_id = (guild_id >> 22) % num_shards | ||
(guild_id >> 22) % num_shards == shard_id | ||
I'm not entirely sure why this change is necessary - both say the same thing, and I think the prior version says it a bit more clearly.

Do you mean whether In the first case, it is to less ambiguously show that it is, in fact, intended to be a comparison and not an assignment. The whole point of this PR is to rewrite the documentation in terms of which of the guilds that a bot is in to which a gateway "session" with a certain

I think that

With this PR, the formula is written in the context of - when a shard is connecting to the gateway - iterating through all guilds that a bot is part of and deciding whether the shard should receive events from that guild based on the guilds

How is this not an assignment? The original wording is much clearer.

Although I appreciate the discussion, I would still prefer the original form to remain.

Sure, we can keep it as an assignment. It still has to be in the context of a single shard though, as each session has its own (potentially different) value of

I think that simply keeping it as is for the time being is fine. I feel the formula is clear enough and one can grok the relationship. I've always thought the larger meaning here is: "given a guild, and the number of shards I'm running, which shard does a given guild land on". This answers a practical question, for example if you want to locate the shard that is handling a given guild, so you can look up data from that shard's process. Internally, we do such lookups (albeit with hash rings, but they are logically the same). This is how the original formula was written. The "which guilds precisely will land on this shard" is a less useful question to answer, as it's not something the developer has control over, and is a fact that one can derive from the original formula as well. |
||
``` | ||
|
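As a quick illustration of the formula in its original "assignment" form, the sketch below answers "given a guild and the number of shards, which shard does the guild land on". The function name `shard_id_for_guild` is hypothetical; the guild ID is the one used as an example elsewhere in this PR.

```python
def shard_id_for_guild(guild_id: int, num_shards: int) -> int:
    """Return the shard a guild lands on for a given num_shards."""
    # Shifting right by 22 bits discards the low bits of the ID, so the
    # result depends on the millisecond the guild was created.
    return (guild_id >> 22) % num_shards


guild_id = 613425648685547541
print(shard_id_for_guild(guild_id, 3))  # 2
print(shard_id_for_guild(guild_id, 5))  # 4
```

A session identified with `[shard_id, num_shards]` receives events for a guild exactly when this function returns its `shard_id`.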
||
As an example, if you wanted to split the connection between three shards, you'd use the following values for `shard` for each connection: `[0, 3]`, `[1, 3]`, and `[2, 3]`. Note that only the first shard (`[0, 3]`) would receive DMs. | ||
Every session with `shard_id = 0` will be subscribed to DMs and other non-guild related events. | ||
|
||
As an example, if you wanted to split events equally between three shards, you'd use the following values for `shard` for each session: `[0, 3]`, `[1, 3]`, and `[2, 3]`. DMs would only be sent to the `[0, 3]` shard. | ||
The word "evenly" is not necessarily true, since the volume of events dispatched to a given session is entirely dependent on the guilds on that shard. E.g. in a pathological case, if your bot is in 2 guilds, and one of the guilds has 1 member and the other 500,000 members, you definitely won't see an "even" distribution of events.

Good point. "As an example, if you wanted to split guilds equally between three shards" would probably be better wording, along with a remark that this of course won't necessarily split the events evenly if some guilds produce more events than others.

There is no "split equally" either - for example, if you are running 3 shards, and you leave all guilds on shard 0, then your guilds are not and will no longer be split equally, and re-connecting would of course not repair this bias.

The entire basis of this system is that it takes a probabilistic approach to distributing guilds between shards, based on the millisecond that the guild was created in our system. With enough guilds and shards, it should balance out, but there definitely is no guarantee of an even or equal split one way or another.

Yes, I understand that very well. There is an implicit assumption in that sentence that the bot is part of many guilds (as it would most likely be when sharding becomes necessary). I could definitely make that assumption explicit. |
||
|
||
Note that `num_shards` does not relate to (or limit) the total number of potential sessions, and can be different between multiple sessions existing at the same time. It is only used to decide whether an event will be sent to the associated session using the [Sharding Formula](#DOCS_TOPICS_GATEWAY/sharding-sharding-formula) above. In the simple case like the example above, where every session has the same `num_shards` and the sessions' respective `shard_id` values cover every value from `0` to `num_shards - 1`, the events will be split evenly between the sessions. This is probably how most bots will operate. | ||
I would remove the word "evenly" here as well.

Same as above, I agree. |
||
|
||
On the other hand, sessions do not have to be identified in an evenly-distributed manner when sharding. You can establish multiple sessions with the same `[shard_id, num_shards]`, or sessions with different `num_shards` values, in which case events may be sent to multiple sessions. For example, two sessions with the respective `shard` arrays `[2, 3]` and `[4, 5]` will both receive events from the guild with id `613425648685547541` (you can open up Python and check for yourself that `2 == (613425648685547541 >> 22) % 3` and `4 == (613425648685547541 >> 22) % 5`). | ||
|
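The overlap described above can be checked with a short sketch. The function name `sessions_receiving` and the list of sessions are made up for illustration; each session is represented only by its `[shard_id, num_shards]` pair.

```python
def sessions_receiving(guild_id, sessions):
    """Return the (shard_id, num_shards) pairs that get this guild's events."""
    return [
        (shard_id, num_shards)
        for shard_id, num_shards in sessions
        if (guild_id >> 22) % num_shards == shard_id
    ]


# Three sessions sharing num_shards = 3, plus one extra session using 5.
sessions = [(0, 3), (1, 3), (2, 3), (4, 5)]
receiving = sessions_receiving(613425648685547541, sessions)
print(receiving)  # [(2, 3), (4, 5)]
```

Because the two matching sessions use different `num_shards` values, both are subscribed to the same guild at the same time.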
||
Note that `num_shards` does not relate to (or limit) the total number of potential sessions. It is only used for *routing* traffic. As such, sessions do not have to be identified in an evenly-distributed manner when sharding. You can establish multiple sessions with the same `[shard_id, num_shards]`, or sessions with different `num_shards` values. This allows you to create sessions that will handle more or less traffic for more fine-tuned load balancing, or to orchestrate "zero-downtime" scaling/updating by handing off traffic to a new deployment of sessions with a higher or lower `num_shards` count that are prepared in parallel. | ||
This allows you to create sessions that will handle more or less traffic for more fine-tuned load balancing, or to orchestrate "zero-downtime" scaling/updating by handing off traffic to a new deployment of sessions with a higher or lower `num_shards` count that are prepared in parallel. | ||
|
||
###### Max Concurrency | ||
|
||
|
We don't really have something called a "gateway session" but I understand the intention here.
To elaborate on the terminology, a gateway connection refers to a connection to our websocket gateway at gateway.discord.gg, and a gateway connection then spawns a session, or re-establishes a connection to an existing session. The session outlives the gateway connection, since you can re-connect to the gateway when you're disconnected, and RESUME to re-establish the gateway socket's connection to a given session.

So it should just be "session" instead of "Gateway session"?

Yeah that would be fine.

Is session more accurate than connection here? IMO connections are more intuitive than sessions, so rewriting this section in terms of sessions makes it harder to understand.

I believe that session is more accurate, yes.
As Jake wrote, a session can be RESUMED in a new connection to the gateway. And, as per the documentation, a connection will be sent all missed events from a session once it resumes it.
Thus, events being sent to a session which "forwards" them over the active connection or stores them if a connection is not currently active is a more accurate way of thinking of it (unless I completely misunderstand how it works).