
Add some useful endpoints to Admin API #17948

Open
wants to merge 46 commits into develop

Conversation

@H-Shay (Contributor) commented Nov 20, 2024

Adds endpoints to the Admin API to:

  • fetch the number of invites the provided user has sent after a given timestamp
  • fetch the number of rooms the provided user has joined after a given timestamp, regardless of whether they have since left or been banned from those rooms
  • fetch the report IDs of event reports where the provided user was the sender of the reported event

@H-Shay H-Shay requested a review from a team as a code owner November 20, 2024 17:17
@H-Shay H-Shay marked this pull request as draft November 20, 2024 18:31
@H-Shay H-Shay marked this pull request as ready for review November 20, 2024 19:31
@H-Shay (Contributor Author) commented Nov 21, 2024

Thanks for the quick review! I've rewritten to expand the endpoints where possible and added an index on events.received_ts. Sorry for the wonky wording in some of the documentation; I was trying to be specific, but it sounds like I was obtuse instead.

@H-Shay H-Shay requested a review from MadLittleMods November 21, 2024 23:17
Comment on lines 337 to 342
self.db_pool.updates.register_background_index_update(
update_name="events_received_ts_index",
index_name="received_ts_idx",
table="events",
columns=("received_ts",),
)
Contributor

Are we okay with the extra disk space and extra CPU cycles dealing with this index on insert/update?

Seems necessary to the use case so the answer is probably "this is ok" but I'm not the one dealing with the database woes on a daily basis to know for sure. There are already 7 other indexes on the events table.

$ psql synapse
> \d+ events
...
Indexes:
    "event_contains_url_index" btree (room_id, topological_ordering, stream_ordering) WHERE contains_url = true AND outlier = false
    "events_event_id_key" UNIQUE CONSTRAINT, btree (event_id)
    "events_jump_to_date_idx" btree (room_id, origin_server_ts) WHERE NOT outlier
    "events_order_room" btree (room_id, topological_ordering, stream_ordering)
    "events_room_stream" btree (room_id, stream_ordering)
    "events_stream_ordering" UNIQUE, btree (stream_ordering)
    "events_ts" btree (origin_server_ts, stream_ordering)

Contributor Author

Honestly, I am unclear on this, so I've consulted the backend team for their thoughts; I'll update once I hear back.

Contributor

There are two angles to the cost of adding an index.

As mentioned above, insert/update/delete performance. The most interesting one for us here is insert — we don't really want to make it take longer to persist new events if we don't have a good justification for doing so. But I can't really give you any numbers, so it's hard to comment on it from a technical perspective.

Update — I don't think we ever change the received_ts of an event, so not applicable.

Delete — we purge events, but not really a problem if this goes a bit slower since it's already a 'background'-ish job.

The other angle is how much space this will take on disk. Looking at events_stream_ordering (which is also an index on a BIGINT, so the same), this takes 247 GB on matrix.org currently. (See with \di+ in psql BTW)

So adding this index will cost us another 247 GB roughly. That feels a bit on the chunky side unless it's very useful.
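
The sizes can also be pulled from the statistics views if you don't have the meta-command handy; a minimal sketch using standard PostgreSQL catalog functions:

    -- List the size of every index on the events table.
    SELECT indexrelname,
           pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
    FROM pg_stat_user_indexes
    WHERE relname = 'events';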

Do we have any alternatives? I can think of a few ideas that would help this.


BRIN

A neat trick that applies in this situation is that, because the values in received_ts are timestamps of (roughly) when we insert the event, they will — excluding clock issues and so on — only increase.

We could therefore create a BRIN (Block-Range INdex) on this, which is a more lightweight type of index that stores, for each page of rows in the table, the minimum and maximum value of the column.

I'm not sure if this type of index might be upset by VACUUM FULL, or restoring a backup, especially since we might be left with gaps in the tables due to purging rooms or events. So in that sense it's a little bit risky.
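
For reference, a minimal sketch of what creating one by hand might look like (plain SQL rather than the Synapse background-update machinery; CONCURRENTLY avoids blocking writes while it builds):

    -- BRIN indexes store per-block-range summaries, so they are tiny
    -- compared to a btree over the same column.
    CREATE INDEX CONCURRENTLY events_received_ts_brin
        ON events USING brin (received_ts);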


Using stream_ordering instead

stream_ordering also (excluding backfilled events) only increases. It's possible you could efficiently convert from received_ts to stream_ordering using a binary search or something like that, then perform your queries using stream_ordering.

However, if the server clock isn't consistently/monotonically increasing, then you can't really use stream_ordering in its place, and so this approach probably reaches its limit.

It'd be suitable for some rough approximation if you were poking around but probably no good for a feature built into Synapse without being able to bypass this problem.


Partial index

Both of your queries only care about m.room.member events!

We can restrict the index to only m.room.member events and then make sure the query planner notices that's all we care about when you come to query.
This is mostly good because it makes the index smaller (which should also provide a minor performance boost when reading, because there will be fewer extraneous index rows to consider).

        self.db_pool.updates.register_background_index_update(
            update_name="events_received_ts_index",
            index_name="received_ts_idx",
            table="events",
            columns=("received_ts",),
            where_clause="type = 'm.room.member'",
        )
  • SELECT COUNT(event_id) FROM events WHERE sender = ? AND type = 'm.room.member' AND received_ts > ? AND state_key != ?

    This query should already make it obvious enough to the query planner that we only care about m.room.member events; see the EXPLAIN sketch after this list.

  • SELECT c.room_id
    FROM current_state_events c
    JOIN events e ON c.event_id = e.event_id
    WHERE c.state_key = ? AND c.membership = 'join' AND e.received_ts > ?

    likely needs to be adapted to:

    SELECT c.room_id
    FROM current_state_events c
    JOIN events e ON c.event_id = e.event_id
    WHERE c.state_key = ? AND c.membership = 'join' AND e.received_ts > ? AND e.type = 'm.room.member'
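
To double-check that the planner actually picks up the partial index, you can run the count query under EXPLAIN; a hedged sketch with placeholder values (look for received_ts_idx in the plan):

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT COUNT(event_id) FROM events
    WHERE sender = '@alice:example.com'
      AND type = 'm.room.member'
      AND received_ts > 1732000000000
      AND state_key != '@alice:example.com';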

Look-up table

Not saying this is my favourite suggestion right now, but you could make Synapse build and maintain a lookup table that converts a received_ts to the earliest stream_ordering, e.g. on a daily basis.

Sketch:
(please note you would need something symmetrical for negative stream orderings, which I won't write here because it's just a sketch)

CREATE TABLE events_received_ts_lookup_table(received_day_ts BIGINT NOT NULL PRIMARY KEY, min_stream_ordering BIGINT NOT NULL);

when inserting a new event:

received_day_ts = received_ts - (received_ts % 86400_000)
INSERT INTO events_received_ts_lookup_table (received_day_ts, min_stream_ordering)
VALUES (?, ?) -- the received_day_ts and stream_ordering of the event
ON CONFLICT (received_day_ts) DO UPDATE
    SET min_stream_ordering = LEAST(events_received_ts_lookup_table.min_stream_ordering, EXCLUDED.min_stream_ordering); -- keep the smallest stream_ordering seen for the day

When querying, you'd first query for the min_stream_ordering for the day (falling back to earlier days if it's not present, so SELECT ... WHERE received_day_ts <= ? ORDER BY .. DESC LIMIT 1) and use it as an extra filter in your existing query, just as a 'hint' to the query planner so it can use the stream_ordering index.
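
A minimal completion of that fallback lookup, assuming the lookup table sketched above:

    SELECT min_stream_ordering
    FROM events_received_ts_lookup_table
    WHERE received_day_ts <= ?  -- the target timestamp, rounded down to the day
    ORDER BY received_day_ts DESC
    LIMIT 1;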

  SELECT c.room_id
  FROM current_state_events c
  JOIN events e ON c.event_id = e.event_id
  WHERE c.state_key = ? AND c.membership = 'join' AND e.received_ts > ?
  AND stream_ordering >= ? -- this `?` is the `min_stream_ordering` for the timestamp (rounded down to the day)

Honestly: this is probably too much effort for the circumstances. But it gives you the freedom to choose how much space to dedicate to the look-up table and is pretty space efficient by default. It does add work into the insertion path, but you could do it in a batch job in the background occasionally if needed (again, this is effort though. It's the kind of thing that could be useful in the right circumstances but I think this probably doesn't justify it yet).


All in all, I'd probably say the partial index is the right mix for now until proven otherwise. However I am a bit concerned that the queries may not be all that fast, even with the index, because the query still has to scan through a substantial number of rows, only to discard the ones with the wrong sender.

Member

stream_ordering also (excluding backfilled events) only increases. It's possible you could efficiently convert from received_ts to stream_ordering using a binary search or something like that, then perform your queries using stream_ordering.

It's worth noting that we already look up the stream ordering for 24 hours ago (in the store, as stream_ordering_day_ago).

Contributor

@H-Shay any decision on this front?

The idea to use stream_ordering from @reivilibre and the helper that @erikjohnston pointed out (_find_first_stream_ordering_after_ts_txn(...)) seems very useful to avoid the extra index and create an efficient query.

Contributor Author

Per @reivilibre's comment ("All in all, I'd probably say the partial index is the right mix for now until proven otherwise"), I added a partial index here.

@MadLittleMods (Contributor) commented Dec 3, 2024

Looking closer, it looks like you might have gone with the partial index approach, 4af0387#diff-28da6922f7ab91da0023fa1d9b4ba5d8fa3d5abe924e9160e0646c852b31f6faR342

Seems workable. My only concern is it might be a bit niche and it's just more to add to the database.

Any reason that made you stray away from the stream_ordering approach? (answered by comment above)

Contributor Author

For total completeness, one might note that AFAICT finding the stream ordering correlated with a timestamp relies on the function _find_first_stream_ordering_after_ts_txn, which searches using events.received_ts and so, without adding an index, is also subject to the original slowness concern re: that column.

Contributor

The database query for _find_first_stream_ordering_after_ts_txn(...) searches by stream_ordering which has an index. From my point of view, _find_first_stream_ordering_after_ts_txn(...) is potentially slow because it takes multiple round-trips from database <-> application until we complete the binary search. I don't think the comment there is accurate on why it's slow.

Relevant query:

sql = """
SELECT received_ts FROM events
WHERE stream_ordering <= ?
ORDER BY stream_ordering DESC
LIMIT 1
"""

There is potential to improve the binary search by using some sort of WITH RECURSIVE query to avoid the round-trips but that's a separate thing to fix.
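
As a rough, untested sketch of that idea: a single statement that bisects stream_ordering to find the first event at or after a target received_ts (the timestamp literal is a placeholder), subject to the same caveat as the existing helper that received_ts is only roughly monotonic in stream_ordering:

    WITH RECURSIVE search(lo, hi) AS (
        SELECT min(stream_ordering), max(stream_ordering) FROM events
      UNION ALL
        SELECT
            CASE WHEN probe.received_ts < 1732000000000
                 THEN (search.lo + search.hi) / 2 + 1 ELSE search.lo END,
            CASE WHEN probe.received_ts < 1732000000000
                 THEN search.hi ELSE (search.lo + search.hi) / 2 END
        FROM search
        CROSS JOIN LATERAL (
            -- one bisection step: the same probe query as the helper
            SELECT received_ts FROM events
            WHERE stream_ordering <= (search.lo + search.hi) / 2
            ORDER BY stream_ordering DESC
            LIMIT 1
        ) AS probe
        WHERE search.lo < search.hi
    )
    SELECT lo AS first_stream_ordering FROM search WHERE lo >= hi;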


If we needed find_first_stream_ordering_after_ts(...) in more scenarios, there's an argument to just add the new index for received_ts directly for all events because it would improve all of these scenarios:

  • History retention purge_history_for_rooms_in_range(...)
  • /purge_history Admin API
  • Removing push notifications that are older than a day ago (synapse/storage/databases/main/event_push_actions.py)
  • The new use case in this PR for some admin APIs to find invites and join membership

Currently, the use cases for find_first_stream_ordering_after_ts(...) seem to apply more to one-shot and occasionally-scheduled tasks which aren't being hammered by normal users. Since these tasks don't require immediate speed, we just want to make sure we're not destroying the database.

It feels like these new admin APIs fall into the same camp, so using find_first_stream_ordering_after_ts(...) seems viable.


However I am a bit concerned that the queries may not be all that fast, even with the index, because the query still has to scan through a substantial number of rows, only to discard the ones with the wrong sender.

@reivilibre's comment about discarding a bunch of rows is still applicable but is harder to avoid unless we utilize/add more indexes for all of the conditions we're trying to filter by. Overall, it doesn't seem like we stress about this on many of the admin endpoints since they are run occasionally on demand.

Contributor

Enlightening! Given this, do you believe I should switch to using find_first_stream_ordering_after_ts rather than adding the partial index? I am learning here, and while I can read about the tradeoffs I don't have an intuition as to which to choose; the tradeoffs seem relatively equal.

@H-Shay I feel like I don't have a hard lean (or convincing/good reason) given you've already made the few changes for the partial index already.

I probably would have chosen the stream_ordering approach given the prior art in Synapse for it 🤷

@reivilibre (Contributor) left a comment

adding comments for database perf as summoned

@@ -0,0 +1,3 @@
Add endpoints to Admin API to fetch the number of invites the provided user has after a given timestamp,
fetch the number of rooms the provided user has joined after a given timestamp, and get report IDs of event
reports against a provided user (ie where the user was the sender of the reported event).
Contributor

Sorry that the review has been a bit stochastic and I wasn't able to pick up on all of these pieces in a single blast.

You might also want to try these queries on matrix.org to see if they run fast enough for you. Having to match on sender without an index might be more detrimental than we expect.

@H-Shay H-Shay requested a review from MadLittleMods December 9, 2024 21:07
@H-Shay (Contributor Author) commented Dec 9, 2024

I suspect the Complement test failures are flakes...
