-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DiscordGatewayClient
allocation and method improvements
#254
Comments
A good writeup. For the |
Thanks! I was under the impression that Remora would explicitly drop netstandard support in favor of .NET 6/7, since .NET Core 3.1 becomes EOL this year. If not, this would definitely need to be taken into consideration. |
Nice writeup!
Yep, I added a 'first priority' send queue, which internal connection management packets were placed on. Anything on this queue was dispatched before any other enqueued packets and if I recall correctly, I ensured all priority packets were dispatched before allowing the send loop to check if it should sleep. I was also toying around with a few allocation improvements in the transport service. I'd managed to cut back on send allocations significantly, not that those are particularly troublesome, and using the |
NetStandard does not EOL with Core 3.1 cause it is still a thing with NET 7.0: But in the end it still depends if Remora wants to support all |
|
I found it. initial question and his initial answer, though everything's kind of up in the air since there hasn't been much further discussion since. |
I have no intention of dropping support for .NET Standard - it will remain a runtime-agnostic library target. |
|
Description
As it stands right now, the gateway client is not only the single biggest, single-purpose class* in the library, but it is also the top allocator as shown by a profiler. It's unlikely that these allocations pose a memory leak, but it is partially unnecessary when it's possible to reduce these allocations.
Allocations were initially my main goal when it came to sprucing up the client, as described below, but as I continued to look for room for improvement I realized just how much of a mess the client really is (and, I don't fault Jax here, as the client is somewhat non-trivial in its operation).
However, there is a lot of code that could be simplified by being broken out, as I'll describe later in this issue.
* This is excluding
____API
classes, which I would personally consider "multi-purpose".Why This is Needed
As described above, the gateway client is responsible for the top five allocations across the entire span of the application lifetime, even in comparison to EF Core and the available Redis caching, which marshals objects when deserializing.
The screenshot below is the measurement over a six-hour period, and while the overall allocations are rather trivial for most, and for the most part will be GC'd, these can still be potentially reduced by trivial means.
Alternatives Considered
Short of rewriting the entire client (e.g. #176), there aren't many alternatives to be had here.
Additional Details
Here is the meat of the document, and I'll point out everything that I think can be reasonably fixed.
Passing around
heartbeat_interval
Call this a nitpick, but I feel like this issue is just large enough to be somewhat of an issue; not necessarily for maintainability or functionality, but for ease of writing code. There's also potential defensive copying at play here, though that's a bit of a micro-optimization.
Remora.Discord/Backend/Remora.Discord.Gateway/DiscordGatewayClient.cs
Line 585 in 7935da1
GatewaySenderAsync
issuesThis is a bundle of issues, but I'll put it under one main section
Heartbeating should be its own loop
As shown below, the gateway prioritizes heartbeats over user-submitted payloads which is fine, but I think this can be slightly improved by adjusting how payload rate-limiting is accounted for.
Instead of having heartbeating and user payloads both be rate-limiting, we could instead reserve a single slot (heartbeating is ~45s, and it's 120 events/min, so there's only ever 1 heartbeat per ratelimit window).
That single slot ensures that heartbeats are always sent (and even if they aren't, this is technically OK because Discord doesn't immediately disconnect you from the socket). The only issue would be that the client automatically disconnects if the expected OP11 (
HEARTBEAT_ACK
) isn't received in time, but otherwise, this is a fine solution.As for prioritization, a
heartbeatLock
field could be introduced; it'd be something along the lines of aSemaphoreSlim
or even aReadWriteLockSlim
, and sending payloads waits on hearbeating to complete for consistency.I'm pretty sure @carlst99 (Falcon) addressed this issue in the common gateway PR, but I'm not entirely certain.
Remora.Discord/Backend/Remora.Discord.Gateway/DiscordGatewayClient.cs
Lines 924 to 963 in 7935da1
Sending payloads shouldn't rely on sleep time
In addendum to the above, this code only exists because of the above and can be safely fixed with a concrete delay. For reasons I'll get into below, using
Task.Delay
isn't the best choice for this either.Remora.Discord/Backend/Remora.Discord.Gateway/DiscordGatewayClient.cs
Lines 967 to 972 in 7935da1
Remora.Discord/Backend/Remora.Discord.Gateway/DiscordGatewayClient.cs
Lines 990 to 1001 in 7935da1
Edit: November 29th, 2022:
Another solution would be omitting the delay entirely, and looking into a different solution such as a
Channel<T>
for receiving payloads. One of Falcon's improvements upon the gateway client in his aforementioned PR was using a channel for this purpose, however.https://github.com/carlst99/Remora.Discord/blob/798fa73a5c663c747b2fcaa30471b166f8a4757e/Backend/Remora.Discord.Gateway/BaseGatewayClient.cs#L535-L541
If heartbeating was seperated entirely, then instead of using the synchronous, and quick-returning
TryRead
, we could useWaitToReadAsync
.From there, we'd make a for loop that looks something along the lines of:
Stop using
Task.Delay
(Please)This isn't a blanket statement; it's a suggestion to use the alternative, given it's appropriate for the context.
With Remora targeting .NET 6/7 now, we're given access to the new APIs introduced in those versions, which allow for simplifying code and reducing allocations with just a few lines of code being changed.
PeriodicTimer
is a new timer class introduced in .NET 6, and is an alternative to previous timer APIs which were unideal for one reason or another.PeriodicTimer
'sWaitForNextTickAsync
method returns aValueTask<bool>
, which is already rather ideal.Its return is indicative of whether it will continue to tick, so much like a
ValueTask<Result>
, gateway termination can be gracefully handled. UnlikeTask.Delay
, cancelling the token passed toWaitForNextTickAsync
simply returnsValueTask<bool>(false)
while the former throws an exception which must be caught (and is typically measurably slower, though async tends to smooth these nuances).This was what I initially set out to do, because one of, if not the top allocation throughout the lifetime of the gateway is
DelayPromise
andTimerQueueTimer
(Task.Delay
allocates every time you call it!). For this reason, it's very unideal to useTask.Delay
in a tight/semi-tight loop, especially when its delay is static, as it'll cause lots of avoidable heap allocations.There's two notable spots where this is an issue.
Remora.Discord/Backend/Remora.Discord.Gateway/DiscordGatewayClient.cs
Line 970 in 7935da1
Remora.Discord/Backend/Remora.Discord.Gateway/DiscordGatewayClient.cs
Line 632 in 7935da1
Inconsistencies on which token is passed
This edges the line between nitpick and major issue, but it is potential for some absolute headaches for anyone who has to refactor/modify the gateway.
As it stands right now, two tokens are held.
stopRequested
and_disconnectRequestedSource
; the latter of which is actually aCancellationTokenSource
, but that's somewhat moot.The issue is that these two are seemingly arbitrarily passed to methods, the latter of which is also cancelled for various reasons, however it's strewn throughout the entire class. The latter of the two is held as a field, and seems to be intended to be passed to external classes (e.g. the payload transport), but then sometimes it's not, and is for some reason passed as a parameter to methods within the gateway, when those methods can access the same field.
For example:
Remora.Discord/Backend/Remora.Discord.Gateway/DiscordGatewayClient.cs
Line 587 in a974038
And then just three lines down:
Remora.Discord/Backend/Remora.Discord.Gateway/DiscordGatewayClient.cs
Line 590 in a974038
This also technically has the effect of potentially holding the callee captive; while a lot of methods throw when their cancellation token is cancelled, which will bubble up to the callee (
RunAsync
), if they don't, the method doesn't return until all the work has been completed.I suggest using a linked cancellation token and consolidating on the usage of the field vs the parameter.
Nitpickings
recommanded_shards
is not just a recommendationMy issue with this section, the client treats
recommended_shards > set_shards
as a warning when the proper log level for this situation isLogLevel.Error
(and potentially even returning a preemptive gateway error). The reason being is that if the bot is in >2500 * shards
guilds, connecting to the gateway will ALWAYS return a 4011 (SHARDING_REQUIRED
) error. See discord's docsRemora.Discord/Backend/Remora.Discord.Gateway/DiscordGatewayClient.cs
Lines 462 to 473 in 7935da1
Re-calculating heartbeat windows
This is a small issue; very small even, but it's something I picked up on regardless. Despite this only being done once per (re-)connection, I find it strange that this is recalculated. Then again shifting this up to the constructor could be considered bad practice, even if it's technically static-ish data.
Remora.Discord/Backend/Remora.Discord.Gateway/DiscordGatewayClient.cs
Lines 894 to 898 in 7935da1
new
ing timeout policies all over the placeI'm not entirely familiar with Polly, so I may be wrong here, but I find it somewhat daft that there are three instances of a timeout being
new
'd up, all with the same arbitrary(?) time limit, only to be used once. Are Polly timeout policies not thread-safe? Or I suppose not intended to be reused, otherwise, these should be merged into a field, IMO.Remora.Discord/Backend/Remora.Discord.Gateway/DiscordGatewayClient.cs
Line 545 in 7935da1
Remora.Discord/Backend/Remora.Discord.Gateway/DiscordGatewayClient.cs
Line 718 in 7935da1
Remora.Discord/Backend/Remora.Discord.Gateway/DiscordGatewayClient.cs
Line 545 in 7935da1
There should be a common
DisconnectAsync
methodAs it stands right now, there's no semblance of a disconnection method that handles the termination of the gateway (graceful or otherwise). Instead, inconsistent logic is sprinkled across call sites that varies based on context and can cause bugs.
Remora.Discord/Backend/Remora.Discord.Gateway/DiscordGatewayClient.cs
Lines 235 to 250 in 7935da1
Remora.Discord/Backend/Remora.Discord.Gateway/DiscordGatewayClient.cs
Lines 637 to 664 in 7935da1
The text was updated successfully, but these errors were encountered: