rfc: smp server cluster #1422

epoberezkin · 2024-12-16T10:53:24Z

No description provided.

spaced4ndy · 2024-12-20T14:59:56Z

rfcs/2024-12-15-server-cluster.md

+
+## Problem
+
+Currently we can only scale servers on a given address vertically, which has 2 problems:


Suggested change

-Currently we can only scale servers on a given address vertically, which has 2 problems:
+Currently we can only scale servers on a given address vertically, which has 3 problems:

spaced4ndy · 2024-12-20T15:14:46Z

rfcs/2024-12-15-server-cluster.md

+
+The second approach makes it easy to migrate parts of the state between servers in the cluster, as message queues are already grouped in folders with the top level having 2 letters in the folder name. This would allow up to 4096 servers in the cluster.
+
+The proxy would then choose a random server from the list of servers to create a queue, and the server would have to be configured to use specific 2 letters in base64 encoding of queue addresses. For existing queues, the server will be chosen based on the queue ID.


... and the server would have to be configured to use specific 2 letters

It should probably be configured to select from a range of 2-letter prefixes, not one specific 2-letter prefix. That way, when a new server is added, the queues transferred from this server can be split more evenly by further splitting ranges. I think Cassandra does something similar:

https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/architecture/archDataDistributeHashing.html

https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/architecture/archDataDistributeVnodesUsing.html
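
To make the range idea concrete, here is a minimal sketch of a range table keyed by the first two base64 letters of a queue ID. It is only an illustration: `RangeMap`, `prefix_slot`, and the server names are hypothetical, and the URL-safe base64 alphabet is an assumption (the RFC only says "base64 encoding").

```python
import bisect

# Assumption: URL-safe base64 alphabet, in index order.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"
SLOTS = 64 * 64  # 4096 possible 2-letter prefixes


def prefix_slot(queue_id_b64: str) -> int:
    """Map the first two base64 letters of a queue ID to a slot in 0..4095."""
    return ALPHABET.index(queue_id_b64[0]) * 64 + ALPHABET.index(queue_id_b64[1])


class RangeMap:
    """Hypothetical range table: servers[i] owns slots [starts[i], starts[i+1])."""

    def __init__(self, first_server: str):
        self.starts = [0]
        self.servers = [first_server]

    def _end(self, i: int) -> int:
        return self.starts[i + 1] if i + 1 < len(self.starts) else SLOTS

    def lookup(self, queue_id_b64: str) -> str:
        # Find the last range whose start is <= the slot.
        i = bisect.bisect_right(self.starts, prefix_slot(queue_id_b64)) - 1
        return self.servers[i]

    def split(self, server: str, new_server: str) -> tuple:
        """Hand the upper half of the widest range of `server` to `new_server`.

        Returns the (start, end) slot range whose queues would need to migrate.
        """
        owned = [i for i, s in enumerate(self.servers) if s == server]
        if not owned:
            raise KeyError(server)
        i = max(owned, key=lambda k: self._end(k) - self.starts[k])
        end = self._end(i)
        mid = (self.starts[i] + end) // 2
        if mid == self.starts[i]:
            raise ValueError("range is a single prefix and cannot be split")
        self.starts.insert(i + 1, mid)
        self.servers.insert(i + 1, new_server)
        return (mid, end)


# Usage: start with one server, then add two more by splitting ranges.
ranges = RangeMap("s1")
moved_to_s2 = ranges.split("s1", "s2")
moved_to_s3 = ranges.split("s1", "s3")
```

With a fixed 2-letter prefix per server, adding a server means moving whole prefixes by hand. With ranges, each split moves only half of one existing range, and the folders to migrate are exactly the prefixes in the returned slot range.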

spaced4ndy · 2024-12-20T15:24:30Z

rfcs/2024-12-15-server-cluster.md

+To group servers into a cluster we need to map requests to specific servers. This can be done in one of two ways:
+- additional server ID in the cluster added to transmissions.
+- map the first two letters in base64 encoding of queue ID to the server ID.


We should consider the advantages and the implementation of the first approach.

spaced4ndy · 2024-12-20T15:25:06Z

rfcs/2024-12-15-server-cluster.md

+- additional protocol commands used only by load-balancing proxy to create references to servers that have the actual queue from other IDs.
+- use the same 2 letters for all IDs.
+
+The latter approach is simpler, but it cannot be used if some of the IDs are generated client-side and some IDs are generated server-side - we would need to generate all IDs in one place. Alternatively, the client can generate some IDs with the same 2 letters in the ID, and the server in the cluster will be chosen to match this ID.


Also, how will existing queues migrate?

spaced4ndy · 2024-12-20T15:28:40Z

rfcs/2024-12-15-server-cluster.md

+
+If we decide on the second approach, and add client-generated IDs, we may already start rejecting IDs that contain different first 2 letters. It would effectively reduce ID entropy from 192 to 180 bits, which could be a better tradeoff than additional protocol commands and requests to find the queue, which would add to the request latency.
+
+The advantage of the first approach is that it is more generic and does not impose any restriction on IDs. Making additional requests within the operator's network would add only a small fraction to the latency, compared with the much larger latency to the end user. The balancing proxy could cache the results of dereferencing requests.


Also, the second approach possibly leaks some metadata. For example, if the sender and the notification server collude, they would know they are referring to the same node in the cluster. Not sure, it seems far-fetched.

Still, I think approach 2 is better for the question "A separate question is how to map other queue IDs (sender, notifier, link) to the recipient ID" (not necessarily for the design overall). It seems reasonable for the cluster to know its state.
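
For reference, the figures in the RFC text (up to 4096 servers, entropy going from 192 to 180 bits) follow from each base64 letter carrying 6 bits. A quick check, with hypothetical helper names:

```python
BITS_PER_BASE64_LETTER = 6  # base64 has 64 symbols, and 64 == 2**6


def prefix_count(letters: int) -> int:
    """Number of distinct prefixes of the given length in base64."""
    return 64 ** letters


def remaining_entropy_bits(id_bits: int, fixed_letters: int) -> int:
    """Entropy left in an ID once its first `fixed_letters` letters are fixed."""
    return id_bits - fixed_letters * BITS_PER_BASE64_LETTER


max_servers = prefix_count(2)                   # one server per 2-letter prefix
entropy_after = remaining_entropy_bits(192, 2)  # 192-bit IDs, 2 letters fixed
```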

spaced4ndy · 2024-12-20T15:32:15Z

rfcs/2024-12-15-server-cluster.md

+- proxy -> server 2 chosen based on sender ID: SET_SREF
+- proxy <- server 2: OK
+- proxy -> server 3 chosen based on notifier ID: SET_NREF
+- proxy <- server 3: OK


Why do servers 2 and 3 need to know anything here?

That is, isn't the proxy responsible for routing and load balancing? Or does this sequence imply the servers aren't yet in the same cluster? I'm not sure I understand.

spaced4ndy · 2024-12-20T15:37:12Z

rfcs/2024-12-15-server-cluster.md

+**The sequence of requests to send the message**:
+
+- client -> proxy: SEND
+- proxy -> mapped server 1 based on sender ID: GET_SREF
+- proxy <- server 1: SREF
+- proxy -> mapped server 2 based on recipient ID (or cluster ID) in SREF: SEND
+- proxy <- server 2: OK
+- client <- proxy: OK


I don't get this one either. Aren't these servers in the same cluster and behind the same proxy address? Why does server 1 (and not the proxy) have to know the sender ID for a queue that's on server 2?
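
To help discuss the sequence in the hunk above, here is a small in-memory sketch of the SEND flow as I read it, including the dereference cache the RFC mentions for the balancing proxy. Everything here is hypothetical (the `Server` and `Proxy` classes, the `pick` routing rule, the IDs), and the SREF is assumed to carry only the recipient ID.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Server:
    name: str
    srefs: Dict[str, str] = field(default_factory=dict)         # sender ID -> recipient ID
    queues: Dict[str, List[str]] = field(default_factory=dict)  # recipient ID -> messages

    def get_sref(self, sender_id: str) -> str:
        # GET_SREF: return the reference stored for this sender ID.
        return self.srefs[sender_id]

    def send(self, recipient_id: str, msg: str) -> str:
        # SEND: append the message to the queue held by this server.
        self.queues[recipient_id].append(msg)
        return "OK"


class Proxy:
    def __init__(self, servers: Dict[str, Server], pick: Callable[[str], str]):
        self.servers = servers
        self.pick = pick                      # maps any queue ID to a server name
        self.sref_cache: Dict[str, str] = {}  # sender ID -> recipient ID
        self.internal_requests = 0            # requests made inside the cluster

    def send(self, sender_id: str, msg: str) -> str:
        recipient_id = self.sref_cache.get(sender_id)
        if recipient_id is None:
            # proxy -> server mapped from sender ID: GET_SREF
            self.internal_requests += 1
            recipient_id = self.servers[self.pick(sender_id)].get_sref(sender_id)
            self.sref_cache[sender_id] = recipient_id
        # proxy -> server mapped from recipient ID in SREF: SEND
        self.internal_requests += 1
        return self.servers[self.pick(recipient_id)].send(recipient_id, msg)


def pick(queue_id: str) -> str:
    # Toy routing rule for the demo only: uppercase first letter -> s1, else s2.
    return "s1" if queue_id[0].isupper() else "s2"


s1 = Server("s1", srefs={"Ab-sender": "xy-recipient"})
s2 = Server("s2", queues={"xy-recipient": []})
proxy = Proxy({"s1": s1, "s2": s2}, pick)

first = proxy.send("Ab-sender", "hello")
requests_after_first = proxy.internal_requests
second = proxy.send("Ab-sender", "again")
requests_after_second = proxy.internal_requests
```

In this sketch the first SEND costs two internal requests and every later SEND for the same sender ID costs one, which is the latency argument for caching. It also shows what my question is about: server 1 holds nothing except the reference, so the same table could live in the proxy.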

rfc: smp server cluster

6d1841e

epoberezkin requested a review from spaced4ndy as a code owner December 16, 2024 10:53

spaced4ndy reviewed Dec 20, 2024
