Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HADOOP-19061 Capture exception from rpcRequestSender.start() in IPC.Connection.run() #6519

Merged
merged 4 commits into from
Feb 3, 2024

Conversation

xinglin
Copy link
Contributor

@xinglin xinglin commented Feb 1, 2024

Description of PR

rpcRequestThread.start() can fail due to OOM. This will immediately crash the Connection thread, without removing itself from the connections pool. Then for all following getConnection(remoteid), we will get this bad connection object and all rpc requests will be hanging, because this is a bad connection object, without threads being properly running (Neither Connection or Connection.rpcRequestSender thread is running due to OOM.).

In this PR, we moved the rpcRequestThread.start() to be within the try{}-catch{} block, to capture OOM from rpcRequestThread.start() and proper cleaning is followed if we hit OOM.

How was this patch tested?

trivial change. let jenkins build.

For code changes:

  • Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@simbadzina
Copy link
Member

simbadzina commented Feb 1, 2024

Why doesn't the OOM cause the client to fail with the existing code on trunk, i.e. where is the OOM suppressed?

After your fix, what error will the client fail with? I'm worried that by suppressing this OOM (due to thread creation) we will end up with an OOM elsewhere and it won't be easily to trace when we have two many open connections.

@xinglin
Copy link
Contributor Author

xinglin commented Feb 2, 2024

Why doesn't the OOM cause the client to fail with the existing code on trunk, i.e. where is the OOM suppressed?

It is not suppressed/captured at all: it caused the Connection thread to crash. That is why we don't see Connection thread in our thread dump.

After your fix, what error will the client fail with? I'm worried that by suppressing this OOM (due to thread creation) we will end up with an OOM elsewhere and it won't be easily to trace when we have two many open connections.

I made slight change to my PR, to capture this exception but also throw the exception after we do some cleanup and remove this Connection object from IPC.client.connections pool.

So, the original code would keep the bad Connection object around when the Connection thread crashes (because it does not call close() method). The new code would remove that bad connection object and a new good one will be created next time.

@xinglin xinglin changed the title HADOOP-19061 Capture exception from rpcRequestSender.start() in IPC.C… HADOOP-19061 Capture exception from rpcRequestSender.start() in IPC.Connection.run() Feb 2, 2024
@hadoop-yetus
Copy link

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 30s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 41m 58s trunk passed
+1 💚 compile 16m 9s trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 compile 14m 32s trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 checkstyle 1m 12s trunk passed
+1 💚 mvnsite 1m 34s trunk passed
+1 💚 javadoc 1m 9s trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 49s trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 spotbugs 2m 29s trunk passed
+1 💚 shadedclient 35m 8s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 52s the patch passed
+1 💚 compile 15m 24s the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 javac 15m 24s the patch passed
+1 💚 compile 14m 41s the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 javac 14m 41s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 1m 7s hadoop-common-project/hadoop-common: The patch generated 0 new + 65 unchanged - 2 fixed = 65 total (was 67)
+1 💚 mvnsite 1m 33s the patch passed
+1 💚 javadoc 1m 5s the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 52s the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 spotbugs 2m 35s the patch passed
+1 💚 shadedclient 34m 33s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 19m 13s hadoop-common in the patch passed.
+1 💚 asflicense 0m 56s The patch does not generate ASF License warnings.
211m 18s
Subsystem Report/Notes
Docker ClientAPI=1.44 ServerAPI=1.44 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6519/1/artifact/out/Dockerfile
GITHUB PR #6519
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 84a1cbe3f6e8 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 50abdfc
Default Java Private Build-1.8.0_392-8u392-ga-1~20.04-b08
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6519/1/testReport/
Max. process+thread count 3152 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common U: hadoop-common-project/hadoop-common
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6519/1/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@simbadzina
Copy link
Member

The new code would remove that bad connection object and a new good one will be created next time.
Do we log the bag connection which was cleaned up?

What happens if there is an OOM when creating the new connection? Wondering is we can get into a live-lock.

Please do excuse the many questions, I'm not familiar with this code so this is to help my understanding.

@hadoop-yetus
Copy link

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 18m 20s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 48m 32s trunk passed
+1 💚 compile 18m 28s trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 compile 16m 45s trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 checkstyle 1m 17s trunk passed
+1 💚 mvnsite 1m 40s trunk passed
+1 💚 javadoc 1m 15s trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 49s trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 spotbugs 2m 32s trunk passed
+1 💚 shadedclient 39m 26s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 53s the patch passed
+1 💚 compile 17m 38s the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 javac 17m 38s the patch passed
+1 💚 compile 16m 25s the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 javac 16m 25s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 1m 11s /results-checkstyle-hadoop-common-project_hadoop-common.txt hadoop-common-project/hadoop-common: The patch generated 3 new + 65 unchanged - 2 fixed = 68 total (was 67)
+1 💚 mvnsite 1m 36s the patch passed
+1 💚 javadoc 1m 7s the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 49s the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 spotbugs 2m 39s the patch passed
+1 💚 shadedclient 39m 26s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 19m 13s hadoop-common in the patch passed.
+1 💚 asflicense 0m 59s The patch does not generate ASF License warnings.
254m 15s
Subsystem Report/Notes
Docker ClientAPI=1.44 ServerAPI=1.44 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6519/2/artifact/out/Dockerfile
GITHUB PR #6519
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 2558aee188e3 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 126523f
Default Java Private Build-1.8.0_392-8u392-ga-1~20.04-b08
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6519/2/testReport/
Max. process+thread count 1259 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common U: hadoop-common-project/hadoop-common
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6519/2/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@xinglin
Copy link
Contributor Author

xinglin commented Feb 2, 2024

The new code would remove that bad connection object and a new good one will be created next time.
Do we log the bad connection which was cleaned up?

What happens if there is an OOM when creating the new connection? Wondering is we can get into a live-lock.

Please do excuse the many questions, I'm not familiar with this code so this is to help my understanding.

made some code change. prefer to have a single try-catch block to avoid code duplication.

Totally valid question and i'd have to chat with Gobblin team to understand more about their use case and confirm whether this PR would indeed mitigate their job hangings.

If we remain low on memory, we will keep getting OOM and yes, we will get into live-lock. However, the difference is in the original implementation, if we ever hit a single OOM, we will end up with a bad connection for this JVM permanently and all RPC requests to that destination will be blocked forever. For gobblin, they'd have to manually delete that JVM (and recreate a new one). With the new code, we don't have that problem and can tolerate certain degree of OOMs. And the expectation is we won't be hitting OOM all the time. If we are hitting OOM all the time, we should look for memory leak or bump memory/heap size.

@hadoop-yetus
Copy link

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 50s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 46m 16s trunk passed
+1 💚 compile 18m 14s trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 compile 16m 38s trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 checkstyle 1m 17s trunk passed
+1 💚 mvnsite 1m 39s trunk passed
+1 💚 javadoc 1m 14s trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 50s trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 spotbugs 2m 32s trunk passed
+1 💚 shadedclient 39m 21s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 54s the patch passed
+1 💚 compile 17m 27s the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 javac 17m 27s the patch passed
+1 💚 compile 16m 37s the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 javac 16m 37s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 1m 11s hadoop-common-project/hadoop-common: The patch generated 0 new + 65 unchanged - 2 fixed = 65 total (was 67)
+1 💚 mvnsite 1m 36s the patch passed
+1 💚 javadoc 1m 8s the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 49s the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 spotbugs 2m 39s the patch passed
+1 💚 shadedclient 39m 27s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 19m 3s hadoop-common in the patch passed.
+1 💚 asflicense 0m 58s The patch does not generate ASF License warnings.
233m 56s
Subsystem Report/Notes
Docker ClientAPI=1.44 ServerAPI=1.44 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6519/3/artifact/out/Dockerfile
GITHUB PR #6519
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux f18ab07d4900 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 4c89beb
Default Java Private Build-1.8.0_392-8u392-ga-1~20.04-b08
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6519/3/testReport/
Max. process+thread count 2620 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common U: hadoop-common-project/hadoop-common
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6519/3/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@xinglin xinglin requested a review from simbadzina February 2, 2024 17:08
@simbadzina
Copy link
Member

Change LGTM.

It there a way to test this, that isn't too complicated.
One path I see to get an exception from Thread.start() is calling it twice. The second call will throw IllegalThreadStateException(). So you can start rpcRequestThread before Connection.run() and then validate that the Connection is immediately marked as closed.

I'm +1 without the test though.

@xinglin
Copy link
Contributor Author

xinglin commented Feb 2, 2024

Thanks Simba for reviewing and approving PR.

Change LGTM.

Is there a way to test this, that isn't too complicated. One path I see to get an exception from Thread.start() is calling it twice. The second call will throw IllegalThreadStateException(). So you can start rpcRequestThread before Connection.run() and then validate that the Connection is immediately marked as closed.

To do this, i would assume we need to create the rpcRequestThread externally and start it and then add a new API setRpcRequestThread() to Connection class, to set it to the new thread. Does this sound right? Or we need to add a new constructor for Connection class, to take rpcRequestThread as a parameter. Doesn't sound worth doing this for testing this 'trivial' change.

I'm +1 without the test though.

@simbadzina
Copy link
Member

To do this, i would assume we need to create the rpcRequestThread externally and start it and then add a new API setRpcRequestThread() to Connection class, to set it to the new thread. Does this sound right? Or we need to add a new constructor for Connection class, to take rpcRequestThread as a parameter. Doesn't sound worth doing this for testing this 'trivial' change.

Yes, we'd need a new API to either set the thread, or start it before Connection.run(). I couldn't easily find where the Connection thread is starting. I agree it doesn't seem worth it.

@xinglin
Copy link
Contributor Author

xinglin commented Feb 2, 2024

Yes, we'd need a new API to either set the thread, or start it before Connection.run(). I couldn't easily find where the Connection thread is starting. I agree it doesn't seem worth it.

Search for start() in setupIOStream(). it is there. took me a while to find that too.

@hadoop-yetus
Copy link

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 50s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 48m 35s trunk passed
+1 💚 compile 18m 9s trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 compile 16m 32s trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 checkstyle 1m 13s trunk passed
+1 💚 mvnsite 1m 36s trunk passed
+1 💚 javadoc 1m 13s trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 49s trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 spotbugs 2m 31s trunk passed
+1 💚 shadedclient 39m 42s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 53s the patch passed
+1 💚 compile 19m 50s the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 javac 19m 50s the patch passed
+1 💚 compile 19m 21s the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 javac 19m 21s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 1m 16s hadoop-common-project/hadoop-common: The patch generated 0 new + 65 unchanged - 2 fixed = 65 total (was 67)
+1 💚 mvnsite 1m 37s the patch passed
+1 💚 javadoc 1m 6s the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 49s the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 spotbugs 2m 41s the patch passed
+1 💚 shadedclient 40m 3s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 19m 6s hadoop-common in the patch passed.
+1 💚 asflicense 0m 58s The patch does not generate ASF License warnings.
242m 6s
Subsystem Report/Notes
Docker ClientAPI=1.44 ServerAPI=1.44 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6519/4/artifact/out/Dockerfile
GITHUB PR #6519
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 74c0a178fa52 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 805cdfa
Default Java Private Build-1.8.0_392-8u392-ga-1~20.04-b08
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6519/4/testReport/
Max. process+thread count 1263 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common U: hadoop-common-project/hadoop-common
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6519/4/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

Copy link
Contributor

@ctrezzo ctrezzo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I don't think the added work for a unit test here is needed. +1 from me. Thank you for the patch @xinglin and thank you for the extra review @simbadzina!

@ctrezzo ctrezzo merged commit d74e516 into apache:trunk Feb 3, 2024
1 of 3 checks passed
jiajunmao pushed a commit to jiajunmao/hadoop-MLEC that referenced this pull request Feb 6, 2024
…onnection.run() (apache#6519)

* HADOOP-19061 - Capture exception from rpcRequestSender.start() in IPC.Connection.run() and proper cleaning is followed if an exception is thrown.

---------

Co-authored-by: Xing Lin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants