HDFS-17368. HA: Standby should exit safemode when resources are available. #6518

Merged
merged 3 commits into from
Mar 26, 2024

Conversation

zhuzilong2013
Contributor

Description of PR

Refer to HDFS-17368.

The NameNodeResourceMonitor (NNRM) automatically enters safemode (SM) when it detects that resources are not sufficient. The NNRM runs only on the active NameNode (ANN). If both the ANN and the standby NameNode (SNN) run low on resources, the ANN enters SM; if the SNN's disk space is later restored, the SNN will become the ANN and the ANN will become the SNN. However, at this point the new SNN will not exit SM, even after its disk has recovered.

Consider the following scenario:

  • Initially, nn-1 is active and nn-2 is standby. Resources in dfs.namenode.name.dir are insufficient on both nn-1 and nn-2; the NameNodeResourceMonitor detects the resource issue and puts nn-1 into safemode.
  • At this point, nn-1 is in safemode (ON) and active, while nn-2 is in safemode (OFF) and standby.
  • After a period of time, the resources in nn-2's dfs.namenode.name.dir recover, triggering a failover.
  • Now, nn-1 is in safemode (ON) and standby, while nn-2 is in safemode (OFF) and active.
  • Afterward, the resources in nn-1's dfs.namenode.name.dir recover.
  • However, since nn-1 is standby but in safemode (ON), it is unable to exit safemode automatically.

With this change, if the SNN is detected to be in SM because of low resources, it exits SM.
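The scenario above can be replayed with a small toy model. This is only a sketch, not Hadoop code: `ToyNameNode`, `checkResources`, `becomeStandby`, and the `fixApplied` switch are hypothetical names invented for illustration, and the model assumes the resource monitor runs only on the active node and that the active node leaves resource-low safemode on its own once resources recover.

```java
/** Toy model of the failover scenario; not Hadoop code, all names are hypothetical. */
final class StandbySafeModeSketch {
  enum HaState { ACTIVE, STANDBY }

  static final class ToyNameNode {
    HaState haState;
    boolean manualSafeMode;       // would be set by an operator; unused in this replay
    boolean resourceLowSafeMode;  // set by the resource monitor

    ToyNameNode(HaState initial) {
      this.haState = initial;
    }

    /** Resource monitor tick. In this model the monitor only runs on the active node. */
    void checkResources(boolean resourcesAvailable) {
      if (haState != HaState.ACTIVE) {
        return; // no monitor on the standby, so its safemode flag is never re-evaluated
      }
      resourceLowSafeMode = !resourcesAvailable;
    }

    /** Transition to standby; fixApplied toggles the check this change introduces. */
    void becomeStandby(boolean fixApplied) {
      haState = HaState.STANDBY;
      if (fixApplied && !manualSafeMode && resourceLowSafeMode) {
        resourceLowSafeMode = false;
      }
    }

    boolean isInSafeMode() {
      return manualSafeMode || resourceLowSafeMode;
    }
  }

  /** Replays the scenario for nn-1 and reports whether it ends up stuck in safemode. */
  static boolean replayNn1(boolean fixApplied) {
    ToyNameNode nn1 = new ToyNameNode(HaState.ACTIVE);
    nn1.checkResources(false);      // low disk: active nn-1 enters safemode
    nn1.becomeStandby(fixApplied);  // failover to nn-2
    nn1.checkResources(true);       // nn-1's disk recovers, but it is standby now
    return nn1.isInSafeMode();
  }

  public static void main(String[] args) {
    System.out.println("stuck without fix: " + replayNn1(false));
    System.out.println("stuck with fix:    " + replayNn1(true));
  }
}
```

In this model the standby leaves resource-low safemode at the moment it becomes standby rather than when its disk recovers, since nothing on the standby watches the disk.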

How was this patch tested?

Tested in a production environment.

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 21s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 31m 55s trunk passed
+1 💚 compile 0m 44s trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 compile 0m 35s trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 checkstyle 0m 38s trunk passed
+1 💚 mvnsite 0m 45s trunk passed
+1 💚 javadoc 0m 41s trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 1m 3s trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 spotbugs 1m 46s trunk passed
+1 💚 shadedclient 20m 34s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 36s the patch passed
+1 💚 compile 0m 39s the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 javac 0m 39s the patch passed
+1 💚 compile 0m 33s the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 javac 0m 33s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 29s the patch passed
+1 💚 mvnsite 0m 35s the patch passed
+1 💚 javadoc 0m 29s the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 57s the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 spotbugs 1m 43s the patch passed
+1 💚 shadedclient 20m 28s patch has no errors when building and testing our client artifacts.
_ Other Tests _
-1 ❌ unit 199m 36s /patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt hadoop-hdfs in the patch passed.
+1 💚 asflicense 0m 28s The patch does not generate ASF License warnings.
286m 13s
Reason Tests
Failed junit tests hadoop.hdfs.server.datanode.TestDirectoryScanner
Subsystem Report/Notes
Docker ClientAPI=1.44 ServerAPI=1.44 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6518/1/artifact/out/Dockerfile
GITHUB PR #6518
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux a7b554722c26 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 13bccda
Default Java Private Build-1.8.0_392-8u392-ga-1~20.04-b08
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6518/1/testReport/
Max. process+thread count 4188 (vs. ulimit of 5500)
modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6518/1/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 6m 31s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
-1 ❌ mvninstall 32m 50s /branch-mvninstall-root.txt root in trunk failed.
+1 💚 compile 0m 40s trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 compile 0m 40s trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 checkstyle 0m 39s trunk passed
+1 💚 mvnsite 0m 43s trunk passed
+1 💚 javadoc 0m 41s trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 56s trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 spotbugs 1m 42s trunk passed
+1 💚 shadedclient 20m 38s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 36s the patch passed
+1 💚 compile 0m 39s the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 javac 0m 39s the patch passed
+1 💚 compile 0m 35s the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 javac 0m 35s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 29s the patch passed
+1 💚 mvnsite 0m 36s the patch passed
+1 💚 javadoc 0m 30s the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 58s the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 spotbugs 1m 42s the patch passed
+1 💚 shadedclient 20m 19s patch has no errors when building and testing our client artifacts.
_ Other Tests _
-1 ❌ unit 200m 56s /patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt hadoop-hdfs in the patch passed.
+1 💚 asflicense 0m 28s The patch does not generate ASF License warnings.
294m 41s
Reason Tests
Failed junit tests hadoop.hdfs.server.datanode.TestDirectoryScanner
Subsystem Report/Notes
Docker ClientAPI=1.44 ServerAPI=1.44 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6518/2/artifact/out/Dockerfile
GITHUB PR #6518
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 256047b7ad38 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 4126317
Default Java Private Build-1.8.0_392-8u392-ga-1~20.04-b08
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6518/2/testReport/
Max. process+thread count 4362 (vs. ulimit of 5500)
modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6518/2/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@zhuzilong2013
Contributor Author

@Hexiaoqiao @tasanuma @aajisaka Hi~ sir. Could you please help me review this PR when you are free? Thanks.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 21s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 31m 49s trunk passed
+1 💚 compile 0m 41s trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 compile 0m 38s trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 checkstyle 0m 38s trunk passed
+1 💚 mvnsite 0m 45s trunk passed
+1 💚 javadoc 0m 42s trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 1m 0s trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 spotbugs 1m 42s trunk passed
+1 💚 shadedclient 20m 25s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 36s the patch passed
+1 💚 compile 0m 38s the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 javac 0m 38s the patch passed
+1 💚 compile 0m 34s the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 javac 0m 34s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 30s the patch passed
+1 💚 mvnsite 0m 37s the patch passed
+1 💚 javadoc 0m 29s the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 58s the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 spotbugs 1m 43s the patch passed
+1 💚 shadedclient 20m 14s patch has no errors when building and testing our client artifacts.
_ Other Tests _
-1 ❌ unit 202m 25s /patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt hadoop-hdfs in the patch passed.
+1 💚 asflicense 0m 28s The patch does not generate ASF License warnings.
288m 39s
Reason Tests
Failed junit tests hadoop.hdfs.TestReconstructStripedFile
Subsystem Report/Notes
Docker ClientAPI=1.44 ServerAPI=1.44 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6518/3/artifact/out/Dockerfile
GITHUB PR #6518
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 9844b3816c24 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / b80669b
Default Java Private Build-1.8.0_392-8u392-ga-1~20.04-b08
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6518/3/testReport/
Max. process+thread count 5006 (vs. ulimit of 5500)
modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6518/3/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@zhuzilong2013
Contributor Author

The failed unit test seems unrelated to the change.

@@ -1582,6 +1582,10 @@ void startStandbyServices(final Configuration conf, boolean isObserver)
       standbyCheckpointer = new StandbyCheckpointer(conf, this);
       standbyCheckpointer.start();
     }
+    if (isNoManualAndResourceLowSafeMode()) {
+      LOG.info("Standby should not enter safe mode when resources are low, exiting safe mode.");
+      leaveSafeMode(false);
Contributor

This looks reasonable at first glance, though I have not thought it through carefully. Are there any cases in which this could make the Standby leave safemode prematurely? Thanks.

Contributor Author

I reused the logic from HDFS-17231, and I believe there is no issue. HDFS-17231 enables the ANN to automatically exit ResourceLowSafeMode.
At the same time, I noticed that the 'leaveSafeMode(false)' method also exits 'StartupSafeMode'. I'm not sure if this is an issue; I mentioned this phenomenon in HDFS-17402.
If necessary, I can fix it.
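To make the concern about also leaving startup safemode concrete, here is a minimal sketch that models the reasons for being in safemode as a set. It is an assumption-laden illustration, not how Hadoop represents safemode internally: the `Reason` enum and both helper methods are hypothetical.

```java
import java.util.EnumSet;

/** Illustrative only: contrasts a blanket safemode exit with a targeted one. */
final class SafeModeReasonsSketch {
  enum Reason { MANUAL, RESOURCE_LOW, STARTUP }

  /** Blanket exit: drops every reason at once, including STARTUP. */
  static EnumSet<Reason> leaveAll(EnumSet<Reason> reasons) {
    return EnumSet.noneOf(Reason.class);
  }

  /** Targeted exit: drops only RESOURCE_LOW and leaves the other reasons in place. */
  static EnumSet<Reason> leaveResourceLowOnly(EnumSet<Reason> reasons) {
    EnumSet<Reason> remaining = EnumSet.copyOf(reasons);
    remaining.remove(Reason.RESOURCE_LOW);
    return remaining;
  }

  public static void main(String[] args) {
    EnumSet<Reason> reasons = EnumSet.of(Reason.RESOURCE_LOW, Reason.STARTUP);
    System.out.println("blanket exit leaves:  " + leaveAll(reasons));
    System.out.println("targeted exit leaves: " + leaveResourceLowOnly(reasons));
  }
}
```

Whether a targeted exit is actually needed depends on how the real safemode states interact, which is the open question raised in HDFS-17402.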

Contributor Author

HDFS-2915 states that the SNN should not enter resource-low safe mode; the NNRM thread was removed from the SNN.

Contributor Author

Consider the following scenario:

Initially, nn-1 is active and nn-2 is standby. Resources in dfs.namenode.name.dir are insufficient on both nn-1 and nn-2; the NameNodeResourceMonitor detects the resource issue and puts nn-1 into safemode.
At this point, nn-1 is in safemode (ON) and active, while nn-2 is in safemode (OFF) and standby.
After a period of time, the resources in nn-2's dfs.namenode.name.dir recover, triggering a failover.
Now, nn-1 is in safemode (ON) and standby, while nn-2 is in safemode (OFF) and active.
Afterward, the resources in nn-1's dfs.namenode.name.dir recover.
However, since nn-1 is standby but in safemode (ON), it is unable to exit safemode automatically.

Contributor

Great. Got it. Please check whether the failed unit test is related to this change. Everything else looks good to me.

Contributor Author

Thanks for your review. The failed unit test passes locally.

@Hexiaoqiao
Contributor

cc @zhangshuyan0 any more suggestions?

Contributor

@Hexiaoqiao Hexiaoqiao left a comment

LGTM. +1. Will check in if there are no more comments within two working days.

@Hexiaoqiao Hexiaoqiao changed the title from "HDFS-17368. HA: Standby should exit safemode when resources are from low available" to "HDFS-17368. HA: Standby should exit safemode when resources are available." on Mar 26, 2024
@Hexiaoqiao Hexiaoqiao merged commit 37f9ccd into apache:trunk Mar 26, 2024
1 of 4 checks passed
@Hexiaoqiao
Contributor

Committed to trunk. Thanks @zhuzilong2013 for your contributions!

@zhuzilong2013
Contributor Author

Thanks @Hexiaoqiao for your review and merge~


3 participants