Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compute: add non-skippable apply config operations. #11344

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

lubennikovaav
Copy link
Contributor

@lubennikovaav lubennikovaav commented Mar 21, 2025

to fix audit_logging configuration.

apply_config() step of compute start is controlled by skip_pg_catalog_updates flag,
this is a performance optimization to decrease compute startup time, but it introduces extra dependency on cplane.
Introduce small subset of operations that we run always, independent from this flag.


For reviewers:

It will slow down compute starts, but most of the times it will be no-op, so operations will be really fast.

The alternative solution is to check if log_level has changed on the cplane side of the operation and set skip_pg_catalog_updates=False
(or rater introduce new flag, because we don't want to run all catalog updates just for this small change).

But this "if log_level has changed" query to cplane database also takes some time, so I'm not sure if compute start time will be better.

…ng (#11308)"

This reverts commit e5aef37.

The logic of this commit was incorrect:
enabling audit requires a restart of the compute,
because audit extensions use shared_preload_libraries.
So it cabnnot be done in the configuration phase,
require endpoint restart instead.
apply_config() step of compute start is controlled by skip_pg_catalog_updates flag,
this is a performance optimization to decrease compute startup time, but it introduces extra dependency on cplane.
Introduce small subset of operations that we run always, independent from this flag.
@lubennikovaav lubennikovaav requested a review from a team as a code owner March 21, 2025 15:30
@lubennikovaav lubennikovaav requested review from knizhnik, tristan957, hlinnaka, ololobus and a team and removed request for a team March 21, 2025 15:30
Copy link
Member

@tristan957 tristan957 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would rather see this is a refactor of self.apply_config() if possible. I'm happy to explain further if that doesn't make sense.

@@ -1619,8 +1640,24 @@ impl ComputeNode {
"updated postgresql.conf to set neon.disable_logical_replication_subscribers=false"
);
}
self.pg_reload_conf()?;
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description makes it sound like we should run this always, but the existence of the else seems to indicate we only run it sometimes. Can you clarify?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't delete the operations from regular audit_config(), so they run as usual when the skip_pg_catalog_updates flag is set to False.

I agree that we can refactor apply_config() further,
but before that I'd like to discuss if this change of logic makes sense.

Pls, check my reasoning:

It will slow down compute starts, but most of the times it will be no-op, so operations will be really fast.

The alternative solution is to check if log_level has changed on the cplane side of the operation and set skip_pg_catalog_updates=False
(or rater introduce new flag, because we don't want to run all catalog updates just for this small change).

But this "if log_level has changed" query to cplane database also takes some time, so I'm not sure if compute start time will be better.

*also added this to PR description.

@tristan957 tristan957 self-requested a review March 21, 2025 15:35
@tristan957
Copy link
Member

Ignore the approval, whoops.

@@ -1497,6 +1497,27 @@ impl ComputeNode {
Ok::<(), anyhow::Error>(())
}

/// Apply config operations that are not covered by `skip_pg_catalog_updates`
#[instrument(skip_all)]
pub fn apply_config_non_skippable(&self, compute_state: &ComputeState) -> Result<()> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alternative solution is to check if log_level has changed on the cplane side of the operation and set skip_pg_catalog_updates=False

I think this is the right approach, let's do not do some tricky exclusions in compute. We do this for first starts after branch creation or when roles/DBs are created via API. So the logic should be:

  1. Trigger apply config without updates skip on all running computes
  2. Marks the rest of the branches with force catalog updates, so the next start will force updates as well

This is a small-to-medium cplane change, so I think someone from cplane may help @clipperhouse and review this logic

Copy link
Contributor Author

@lubennikovaav lubennikovaav Mar 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not following.

Looks like this will be some tricky exclusion on cplane side....

The feature is: allow changing audit_log_level for existing projects.
It requires endpoint restart, because we need to add extensions to shared_preload_libraries.

So the question is who should watch for audit_log_level change to run Create extension`.

Options I see:

  1. don't watch at all, always try to create extension pgaudit if audit logging is enabled.

We already do this thing with neon extension update by the way.
See post_apply_config() So we already have an exception (which we can unite with this code later).
I don't think 2 noop DDLs will impact performance a lot.

  1. start_compute operation need to check previous state of audit_log_level (I don't think we preserve it anywhere, so this will be some strange query..)

  2. setting change should update some flag in cplane for all existing branches, so that start_compute knows that is must force catalog update. + start_compute should reset this flag per branch, when it's done.

Maybe this is not too hard to code, but I don't know if it's worth to introduce complicated logic for this change.

WDYT?

Copy link
Member

@ololobus ololobus Mar 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See post_apply_config() So we already have an exception (which we can unite with this code later).

post_apply_config() is async, I don't think it's a good idea to put any audit-related there as it means that there are zero guarantees that i) it will succeed at all; ii) it won't take a lot of time, so it will race with any user actions, and the latter won't be audited <- this is not what I'd expect from the 'audit' feature. Please, correct me if I miss something

setting change should update some flag in cplane for all existing branches, so that start_compute knows that is must force catalog update. + start_compute should reset this flag per branch, when it's done.

Yes, this is somewhat my proposal, we already do this for branch creation, just in this case we need to update all live branches, so it will be more like

  1. 'Stop the world' -- suspend all computes, we already do this for enabling LR and adding new preload libs (soon)
  2. Mark all branches to do full config at next start
  3. This is likely it, but someone needs to review the current state of the code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

post_apply_config() is async, I don't think it's a good idea to put any audit-related there

Agree.

'Stop the world' -- suspend all computes, we already do this for enabling LR and adding new preload libs (soon)

Wow, I didn't know that this code exists. It makes this change a lot simpler.
And it makes total sense to restart all branches when we enable audit log.

Thank you for the review. I'll create a subtask for cplane issue.

@lubennikovaav lubennikovaav requested a review from ololobus March 21, 2025 16:24
Copy link

2916 tests run: 2723 passed, 65 failed, 128 skipped (full report)


Failures on Postgres 17

Failures on Postgres 15

# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_dropdb[release-pg15] or test_endpoint_crash[release-pg15-\U0001f4a3] or test_multixact[release-pg15] or test_sql_regress[release-pg15-v2-None] or test_sql_regress[release-pg15-v1-4] or test_isolation[release-pg15-v1-None] or test_isolation[release-pg15-v2-None] or test_isolation[release-pg15-v1-4] or test_pg_regress[release-pg15-v2-4] or test_sql_regress[release-pg15-v1-None] or test_pg_regress[release-pg15-v1-None] or test_sql_regress[release-pg15-v2-4] or test_pg_regress[release-pg15-v2-None] or test_isolation[release-pg15-v2-4] or test_pg_regress[release-pg15-v1-4] or test_subxacts[release-pg15-interpreted] or test_subxacts[release-pg15-vanilla] or test_dropdb[release-pg17] or test_dropdb[debug-pg17] or test_dropdb[release-pg17] or test_multixact[release-pg17] or test_multixact[debug-pg17] or test_multixact[release-pg17] or test_sql_regress[release-pg17-v2-None] or test_sql_regress[debug-pg17-v2-None] or test_sql_regress[release-pg17-v2-None] or test_pg_regress[release-pg17-v1-None] or test_pg_regress[release-pg17-v1-None] or test_pg_regress[debug-pg17-v1-None] or test_sql_regress[release-pg17-v1-4] or test_sql_regress[debug-pg17-v1-4] or test_sql_regress[release-pg17-v1-4] or test_pg_regress[release-pg17-v2-4] or test_pg_regress[debug-pg17-v2-4] or test_pg_regress[release-pg17-v2-4] or test_pg_regress[release-pg17-v1-4] or test_pg_regress[release-pg17-v1-4] or test_pg_regress[debug-pg17-v1-4] or test_isolation[release-pg17-v2-None] or test_isolation[release-pg17-v2-None] or test_isolation[debug-pg17-v2-None] or test_isolation[release-pg17-v1-4] or test_isolation[release-pg17-v1-4] or test_isolation[debug-pg17-v1-4] or test_pg_regress[release-pg17-v2-None] or test_pg_regress[release-pg17-v2-None] or test_pg_regress[debug-pg17-v2-None] or test_sql_regress[release-pg17-v2-4] or test_sql_regress[release-pg17-v2-4] or test_sql_regress[debug-pg17-v2-4] or test_sql_regress[release-pg17-v1-None] or test_sql_regress[release-pg17-v1-None] or test_sql_regress[debug-pg17-v1-None] or test_isolation[release-pg17-v1-None] or test_isolation[release-pg17-v1-None] or test_isolation[debug-pg17-v1-None] or test_isolation[release-pg17-v2-4] or test_isolation[debug-pg17-v2-4] or test_isolation[release-pg17-v2-4] or test_subxacts[release-pg17-vanilla] or test_subxacts[release-pg17-vanilla] or test_subxacts[debug-pg17-vanilla] or test_subxacts[release-pg17-interpreted] or test_subxacts[release-pg17-interpreted] or test_subxacts[debug-pg17-interpreted]"
Flaky tests (2)

Postgres 17

Postgres 15

Test coverage report is not available

The comment gets automatically updated with the latest test results
15b06a6 at 2025-03-21T17:07:10.690Z :recycle:

Base automatically changed from revert_audit_log_level_change to main March 21, 2025 18:18
@lubennikovaav lubennikovaav requested a review from a team as a code owner March 21, 2025 18:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants