Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xds: FileWatcherCertificateProvider is leaked #11692

Open
ejona86 opened this issue Nov 14, 2024 · 8 comments · May be fixed by #11909
Open

xds: FileWatcherCertificateProvider is leaked #11692

ejona86 opened this issue Nov 14, 2024 · 8 comments · May be fixed by #11909
Assignees
Milestone

Comments

@ejona86
Copy link
Member

ejona86 commented Nov 14, 2024

In investigating #11678 (comment) , it was discovered that XdsSecurityClientServerTest looks to create two FileWatcherCertificateProviders each test, but only shuts down one. Since it is using the Channel/Server APIs, this means one of the client or server is highly likely leaking the certificate provider.

CC @kannanjgithub

@vinodhabib
Copy link
Contributor

vinodhabib commented Dec 24, 2024

@ejona86 @shivaspeaks Request you to provide the repro steps for the same so that it will be helpful for us to debug and fix the root cause/leakage issue.
in the above linked comments it was not clear on how to repro the same issue which you found earlier as most of the comments are related to flaky test.
Thanks in Advance.

@ejona86
Copy link
Member Author

ejona86 commented Dec 26, 2024

The flakiness was unrelated. Run XdsSecurityClientServerTest, and see that FileWatcherCertificateProvider.start() is called more than close().

It appears to apply to most tests, so you can choose your favorite and run just the one test. For example, in the xds/ directory, run ../gradlew test --tests 'XdsSecurityClientServerTest.tlsClientServer_noClientAuthentication[enableSpiffe=false]'. If you use println debugging, you can find the stdout/stderr in build/reports/tests/test/index.html.

I see SecurityProtocolNegotiators$ClientSecurityHandler in the stack of the non-shutdown instance, so it appears the leak is on client-side. Unfortunately, the client-side is overly complex in the number of classes involved.

@vinodhabib
Copy link
Contributor

vinodhabib commented Dec 31, 2024

build/reports/tests/test/index.html

@ejona86 @shivaspeaks As per your above suggestion I can see the below println (added in start() and close() methods ) statement for 1 of the test (as attached in below snippet) as below with 2 consecutive starts call execution followed by 1 close call.

Here the expected behaviour for this UT should have only 1 start call followed by 1 close call as below? please confirm ?

Executed start()
Executed close()
Executed start()
Executed close()

image

image

image

@ejona86
Copy link
Member Author

ejona86 commented Jan 3, 2025

@vinodhabib The number of start() calls should match the number of close() calls. The number of start() calls per test is likely to be 2, one for server and one for client. But this issue is that some are started but not closed.

@ejona86 ejona86 added the xds label Jan 9, 2025
@vinodhabib
Copy link
Contributor

vinodhabib commented Feb 12, 2025

@ejona86 @shivaspeaks I was going through the issue and try to debug/analyze on how the close() flow is triggering for server side cert (FileWatcherCertificateProvider) release, which is part of XdsServerWrapper.java class calling
XdsServerWrapper-> internalShutdown()-> discoveryState.shutdown()-> shutdown() as mentioned in the below snippet and code reference link

https://github.com/grpc/grpc-java/blob/764a4e3f08ec91eda231e59cbb1d75e80a2be6aa/xds/src/main/java/io/grpc/xds/XdsServerWrapper.java#L447C7-L447C29

Image

Image

I commented the below code in the shutdown() method and try to ran one of test XdsSecurityClientServerTest.tlsClientServer_noClientAuthentication which does not trigger the close() flow for cert release for server as mentioned in the below snippet

Image

Image

As per my above analysis/observations, i feel like a similar implementation for Client ( maybe XdsClientWrapper.java) does not exist.
Also Request you to share the details as you mentioned in your above comments saying you can see "SecurityProtocolNegotiators$ClientSecurityHandler in the stack of the non-shutdown instance" which may helpful for my further Analysis.

Request you to share your inputs on my above analysis/observations also much appreciated your further hints/tips.

Thanks in Advance.

@ejona86
Copy link
Member Author

ejona86 commented Feb 12, 2025

To see ClientSecurityHandler in the stack of the non-shutdown instances I did new Throwable().printStackTrace() on creation and System.out.println("" + this) (or something like it) on creation and shutdown. I then just looked at which instances weren't shut down and could see their creation stack.

@vinodhabib
Copy link
Contributor

To see ClientSecurityHandler in the stack of the non-shutdown instances I did new Throwable().printStackTrace() on creation and System.out.println("" + this) (or something like it) on creation and shutdown. I then just looked at which instances weren't shut down and could see their creation stack.

@ejona86 Thanks for your reply and details I will look into it. Any inputs on my above Analysis related to the similar server side implementation is missing at the client side?

@ejona86
Copy link
Member Author

ejona86 commented Feb 13, 2025

When I filed the bug I said it wasn't getting shut down, i.e., close() wasn't being called. So I don't understand what your analysis was trying to contribute. I'll note that client-side is organized quite differently than server-side.

The server-side organization actually looks like a bad idea, as FilterChain doesn't have a lifetime and is part of XdsListenerResource.LdsUpdate so is multi-consumer. Yet it has SslContextProvideSupplier. But I can believe it would end up having mostly the correct user-visible behavior today. (I see at least one edge case that could leak, but it isn't that severe.)

Client-side is properly not storing the supplier in CdsUpdate. And it looks like the same class closes the supplier that creates it. That's great. I also see code that appears to clean up on shutdown.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants