Node 22 version bump with ABI possibly causing severe timeout problem in buildbot #26078

Open
hnyman opened this issue Mar 2, 2025 · 13 comments


@hnyman
Contributor

hnyman commented Mar 2, 2025

cc @robimarko @Ansuel @nxhack @ianchi

We have a frequent timeout/stall problem in the packages buildbot. The timeout is destroying quite a few builds: the compile step hangs and is reported as failed 'make -j12 ...' (failure) (timed out).

 make[3] -C feeds/packages/admin/zabbix compile
command timed out: 3600 seconds without output running [b'make', b'-j12', b'IGNORE_ERRORS=n m y', b'BUILD_LOG=1', b'CONFIG_AUTOREMOVE=y', b'CONFIG_SIGNED_PACKAGES='], attempting to kill
process killed by signal 9
program finished with exit code -1
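The log above reports "3600 seconds without output", which suggests the limit applies to silence on the output stream rather than to total run time. A minimal sketch of such an idle-output watchdog follows, assuming that reading; it is not buildbot's actual implementation, and the names `run_with_output_timeout` and `idle_limit` are hypothetical.

```python
import subprocess
import threading
import time


def run_with_output_timeout(cmd, idle_limit):
    """Run cmd and kill it if it prints nothing for idle_limit seconds.

    Returns (exit_code, timed_out, lines). Illustrative only: a child
    that leaves grandchildren holding the pipe open is not handled.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    last_output = [time.monotonic()]  # list so the reader thread can update it
    lines = []

    def reader():
        # Every line of output resets the idle clock.
        for line in proc.stdout:
            lines.append(line.rstrip("\n"))
            last_output[0] = time.monotonic()

    thread = threading.Thread(target=reader, daemon=True)
    thread.start()

    timed_out = False
    while proc.poll() is None:
        if time.monotonic() - last_output[0] > idle_limit:
            timed_out = True
            proc.kill()  # SIGKILL on POSIX, like "killed by signal 9"
            break
        time.sleep(0.05)

    proc.wait()
    thread.join()
    return proc.returncode, timed_out, lines
```

Under this model a long build that keeps printing is never killed, while a build that goes silent for longer than the limit is killed regardless of how much work it has already done.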

About 1/3 of the builds in the affected targets end in timeout. As it does not affect all builds of the same target architecture and 2/3 times the build succeeds, it is likely some kind of concurrency/race problem, so that the building order of the packages (or submodules) affects the features detected by a second package, causing a config prompt, or something like that.

That seems to have started approx 3 months ago.
The oldest failures that I have spotted are from around November 23, 2024

The failures happen on aarch64, arm, i386, x86
But not on armeb, arm_xscale, mips, powerpc, loongarch

It is hard to figure out what is happening, as the buildbot compile step logs available to casual users show just the launch of each package's compilation. And due to concurrent building, the packages are built in a slightly different order each time, so the 4000+ line logs cannot be diffed directly.

However, I debugged this by sorting the compile step output and comparing it against a recent successful build of the same target.
I noticed that for both analysed targets (x86_64, arm8vfpv3), exactly the same package lines were missing from the timed-out build:

*** 1164,1183 ****
   make[3] -C feeds/packages/lang/node clean-build
   make[3] -C feeds/packages/lang/node compile
   make[3] -C feeds/packages/lang/node host-compile
-  make[3] -C feeds/packages/lang/node-arduino-firmata clean-build
-  make[3] -C feeds/packages/lang/node-arduino-firmata compile
-  make[3] -C feeds/packages/lang/node-cylon clean-build
-  make[3] -C feeds/packages/lang/node-cylon compile
-  make[3] -C feeds/packages/lang/node-hid clean-build
-  make[3] -C feeds/packages/lang/node-hid compile
-  make[3] -C feeds/packages/lang/node-homebridge clean-build
-  make[3] -C feeds/packages/lang/node-homebridge compile
-  make[3] -C feeds/packages/lang/node-javascript-obfuscator clean-build
-  make[3] -C feeds/packages/lang/node-javascript-obfuscator compile
-  make[3] -C feeds/packages/lang/node-serialport clean-build
-  make[3] -C feeds/packages/lang/node-serialport compile
-  make[3] -C feeds/packages/lang/node-serialport-bindings clean-build
-  make[3] -C feeds/packages/lang/node-serialport-bindings compile
   make[3] -C feeds/packages/lang/node-yarn host-compile
   make[3] -C feeds/packages/lang/perl clean-build
   make[3] -C feeds/packages/lang/perl compile

The node main compilation is started and also node-yarn host-compile gets started (as the first node module?). But then there is no trace that compiling other modules ever starts, until a timeout kills the whole buildbot build round.
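The sort-and-compare approach described above can be sketched as a small script. This is an illustration under assumptions, not the exact procedure used for the attached logs: it assumes each compile step appears as a `make[N] -C <dir> <target>` line, and the function names are hypothetical.

```python
import re

# Matches lines such as "make[3] -C feeds/packages/lang/node compile".
STEP_RE = re.compile(r"make\[\d+\] -C (\S+) (\S+)")


def compile_steps(log_text):
    """Return the set of (package_dir, target) pairs started in a log."""
    steps = set()
    for line in log_text.splitlines():
        match = STEP_RE.search(line)
        if match:
            steps.add((match.group(1), match.group(2)))
    return steps


def missing_steps(ok_log, failed_log):
    """Steps present in the successful log but never started in the failed one."""
    return sorted(compile_steps(ok_log) - compile_steps(failed_log))
```

Comparing sets rather than line positions makes the result independent of the order in which the concurrent build happened to start each package.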

So, my guess for the cause is #25435 (node: upgrade to 22.11.0 LTS), committed on 23 Nov 2024. In addition to the major version bump, that commit also added ABI versioning to node modules.

Node is restricted by DEPENDS:=@HAS_FPU @(i386||x86_64||arm||aarch64) to build only on the affected targets, which points increasingly to node as the reason for the major timeouts.

So for some reason, the node build likely fails 1/3 of the time, but succeeds 2/3 of the time.
Curious.

Sorted logs:
sort x86 ok stdio.txt
sort x86 error stdio.txt

Original:
x86 ok stdio.txt
x86 error stdio.txt

@hnyman
Contributor Author

hnyman commented Mar 2, 2025

@nxhack

Do you have any idea what might make node occasionally react badly to extreme concurrency (-j 12 or 14) in building? Are some node modules dependent on each other without declaring that explicitly?

Is the new ABI versioning you implemented with that version bump really mandatory?

Unless a fix is figured out rather soon, we might need to test my debugging results by either

  • removing the new ABI,
  • disabling parallel builds for node, or
  • disabling node entirely to test the hypothesis that it is the culprit for the timeouts. Maybe remove just some architectures, e.g. the arm ones, to see if that fixes the buildbot runs for those targets.

@hnyman
Contributor Author

hnyman commented Mar 2, 2025

Alternatively, we could disable node subpackages like node-yarn, which seems to be the one that gets built first (and maybe hangs). Looking at its Makefile, it has not been updated along with the main node package. We seem to be using the really ancient yarn version 1.22. It is quite possible that it is out of sync with the much newer main node.

https://github.com/yarnpkg/yarn#readme

This repository holds the sources for Yarn 1.x (latest version at the time of this writing being 1.22). New releases (at this time the 3.2.3, although we're currently working on our next major) are tracked on the yarnpkg/berry repository, this one here being mostly kept for historical purposes and the occasional hotfix we publish to make the migration from 1.x to later releases easier.

If you hit bugs or issues with Yarn 1.x, we strongly suggest you migrate to the latest release

@robimarko
Contributor

@hnyman I tried building node-yarn multiple times with 32 threads locally in the snapshot SDK but I cannot get it to fail

@nxhack
Contributor

nxhack commented Mar 2, 2025

@hnyman

Do you have any idea what might make node occasionally react badly to extreme concurrency (-j 12 or 14) in building? Are some node modules dependent on each other without declaring that explicitly?

In my experience, there is no problem in building node.js itself, but I am aware of an extreme increase in npm cli threads when building node packages.

@nxhack
Contributor

nxhack commented Mar 3, 2025

I also tried testing it on -j32, and it built without any problems. (4 cores, VT-x 8 threads, 32GB memory)

When building node packages, the number of threads increases to over 100, but it built without any problems.

@nxhack
Contributor

nxhack commented Mar 3, 2025

@hnyman
Would you be able to test this?

diff --git a/lang/node-arduino-firmata/Makefile b/lang/node-arduino-firmata/Makefile
index 90c1c5b34..6c0e94eb0 100644
--- a/lang/node-arduino-firmata/Makefile
+++ b/lang/node-arduino-firmata/Makefile
@@ -18,6 +18,7 @@ PKG_HASH:=d7157e02867eae82887cb5e17b90c963fe7489bacd464110bfd20c672b8d5a98
 
 PKG_BUILD_DEPENDS:=node/host
 PKG_BUILD_FLAGS:=no-mips16
+PKG_BUILD_PARALLEL:=0
 
 PKG_MAINTAINER:=Hirokazu MORIKAWA <[email protected]>
 PKG_LICENSE:=MIT
diff --git a/lang/node-cylon/Makefile b/lang/node-cylon/Makefile
index 28b3c635b..3bb1c16d0 100644
--- a/lang/node-cylon/Makefile
+++ b/lang/node-cylon/Makefile
@@ -20,6 +20,7 @@ PKG_SOURCE_SUBDIR:=$(PKG_SRC_NAME)-$(PKG_VERSION)
 
 PKG_BUILD_DEPENDS:=node/host
 PKG_BUILD_FLAGS:=no-mips16
+PKG_BUILD_PARALLEL:=0
 
 PKG_MAINTAINER:=Hirokazu MORIKAWA <[email protected]>
 PKG_LICENSE:=Apache-2.0
diff --git a/lang/node-hid/Makefile b/lang/node-hid/Makefile
index 575f9d579..0437fb63d 100644
--- a/lang/node-hid/Makefile
+++ b/lang/node-hid/Makefile
@@ -18,6 +18,7 @@ PKG_HASH:=6c1f05935215feed4e8d2f4aecf31abbad8fa783d252b0bd6041ed2f2e96e9ba
 
 PKG_BUILD_DEPENDS:=node/host
 PKG_BUILD_FLAGS:=no-mips16
+PKG_BUILD_PARALLEL:=0
 
 PKG_MAINTAINER:=Hirokazu MORIKAWA <[email protected]>
 PKG_LICENSE:=MIT or X11
diff --git a/lang/node-homebridge/Makefile b/lang/node-homebridge/Makefile
index 7c6d124bc..d638a2fdc 100644
--- a/lang/node-homebridge/Makefile
+++ b/lang/node-homebridge/Makefile
@@ -15,6 +15,7 @@ PKG_HASH:=f91ab0058707a0498d97d87f45f19682065f80660fac942e0985caf9bb205f2a
 
 PKG_BUILD_DEPENDS:=node/host
 PKG_BUILD_FLAGS:=no-mips16
+PKG_BUILD_PARALLEL:=0
 
 PKG_MAINTAINER:=Hirokazu MORIKAWA <[email protected]>
 PKG_LICENSE:=ISC Apache-2.0
diff --git a/lang/node-javascript-obfuscator/Makefile b/lang/node-javascript-obfuscator/Makefile
index 281656331..fc2b3c5f4 100644
--- a/lang/node-javascript-obfuscator/Makefile
+++ b/lang/node-javascript-obfuscator/Makefile
@@ -14,10 +14,10 @@ PKG_SOURCE_URL:=https://registry.npmjs.org/$(PKG_NPM_NAME)/-/
 PKG_HASH:=9bc89b04c78277130bc6f699563871d211f6fc85803c874f6114a632d9456f7b
 
 PKG_BUILD_DEPENDS:=node/host
-HOST_BUILD_PARALLEL:=1
+HOST_BUILD_PARALLEL:=0
 
 HOST_BUILD_DEPENDS:=node/host
-PKG_BUILD_PARALLEL:=1
+PKG_BUILD_PARALLEL:=0
 PKG_BUILD_FLAGS:=no-mips16
 
 PKG_MAINTAINER:=Zbynek Kocur <[email protected]>
diff --git a/lang/node-serialport-bindings/Makefile b/lang/node-serialport-bindings/Makefile
index e6352781f..d0daa9b39 100644
--- a/lang/node-serialport-bindings/Makefile
+++ b/lang/node-serialport-bindings/Makefile
@@ -16,6 +16,7 @@ PKG_HASH:=aec200860bd175e4b14b4ab1aa56a5f750172b6c8e20ccb234846206395848d4
 
 PKG_BUILD_DEPENDS:=node/host
 PKG_BUILD_FLAGS:=no-mips16
+PKG_BUILD_PARALLEL:=0
 
 PKG_MAINTAINER:=Hirokazu MORIKAWA <[email protected]>
 PKG_LICENSE:=MIT
diff --git a/lang/node-serialport/Makefile b/lang/node-serialport/Makefile
index 336d4b2e7..4c0f4af02 100644
--- a/lang/node-serialport/Makefile
+++ b/lang/node-serialport/Makefile
@@ -18,6 +18,7 @@ PKG_HASH:=e19fe993ad16ae0e03fc42e24cfe4babf8fd90f8358e1885d5e216277dda1086
 
 PKG_BUILD_DEPENDS:=node/host
 PKG_BUILD_FLAGS:=no-mips16
+PKG_BUILD_PARALLEL:=0
 
 PKG_MAINTAINER:=Hirokazu MORIKAWA <[email protected]>
 PKG_LICENSE:=MIT
diff --git a/lang/node-yarn/Makefile b/lang/node-yarn/Makefile
index 47c7112f2..b5527189f 100644
--- a/lang/node-yarn/Makefile
+++ b/lang/node-yarn/Makefile
@@ -19,7 +19,7 @@ PKG_LICENSE_FILES:=LICENSE
 
 PKG_HOST_ONLY:=1
 HOST_BUILD_DEPENDS:=node/host
-HOST_BUILD_PARALLEL:=1
+HOST_BUILD_PARALLEL:=0
 
 include $(INCLUDE_DIR)/host-build.mk
 include $(INCLUDE_DIR)/package.mk
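After applying a patch like the one above, it can help to confirm that no node package still enables parallel builds. The following is a rough sketch, assuming the flags appear as plain `PKG_BUILD_PARALLEL:=N` or `HOST_BUILD_PARALLEL:=N` assignments at the start of a line; the helper names are hypothetical and the Makefile contents are passed in as strings.

```python
import re

# Assumes simple ":=" or "=" assignments, one per line, as in the diff above.
FLAG_RE = re.compile(
    r"^(PKG_BUILD_PARALLEL|HOST_BUILD_PARALLEL)\s*:?=\s*(\S+)", re.MULTILINE
)


def parallel_flags(makefile_text):
    """Map each parallel-build flag found in a Makefile to its value."""
    return {name: value for name, value in FLAG_RE.findall(makefile_text)}


def still_parallel(makefiles):
    """Given {path: makefile_text}, list paths where any flag is set to 1."""
    return sorted(
        path
        for path, text in makefiles.items()
        if "1" in parallel_flags(text).values()
    )
```

A Makefile that sets neither flag is not reported, because the sketch only looks at explicit assignments.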

hnyman added a commit to hnyman/packages that referenced this issue Mar 3, 2025
Disable parallel builds for node downstream packages, as the
buildbot is showing frequent timeout problems
for aarch64, arm, i386 and x86, and node & node packages
are the primary suspect.

Based on discussion in
openwrt#26078

Signed-off-by: Hannu Nyman <[email protected]>
@hnyman
Contributor Author

hnyman commented Mar 3, 2025

Thanks. I applied that to the master branch to test if that is enough to fix things.
The next few days will show. If timeouts still occur (so that this is not enough), the next test step might be to temporarily disable node for aarch64 and i386 to verify whether node really is the culprit.

Ps.
Note that the node itself still has parallel build enabled.
I wonder how heavy and long the node build itself is? Maybe there is just a genuine timeout if it is among the last packages to be compiled and the compilation takes over an hour.
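How many clean builds are needed before a test change can be called a fix? As a rough sketch, assume each build times out independently with the roughly 1/3 rate observed so far. That independence is an assumption, not something the logs establish, and the function names are hypothetical.

```python
import math


def chance_all_pass(failure_rate, builds):
    """Probability that `builds` consecutive builds all succeed by luck alone."""
    return (1.0 - failure_rate) ** builds


def builds_needed(failure_rate, max_chance):
    """Smallest number of consecutive clean builds whose probability of
    happening by luck is at most max_chance."""
    return math.ceil(math.log(max_chance) / math.log(1.0 - failure_rate))
```

Under these assumptions, three clean builds in a row would still happen by luck with probability (2/3)^3 = 8/27, or about 30%, while eight clean builds in a row would happen by luck less than 5% of the time.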

@robimarko
Contributor

And it timed out again on a couple of archs

@hnyman
Contributor Author

hnyman commented Mar 5, 2025

I think that we should temporarily mark the node package itself as BROKEN just to verify that it really is the reason for the frequent hangups.

@robimarko
Contributor

Sounds fine to me

@ynezz
Member

ynezz commented Mar 5, 2025

@hnyman thanks a lot for looking into this!

Maybe there is just a genuine timeout if it is among the last packages to be compiled and the compilation takes over an hour.

I just bumped it from 1 to 2 hours, let's see.

@hnyman
Contributor Author

hnyman commented Mar 5, 2025

I marked node BROKEN an hour ago, so let's see if that takes care of the timeouts.

Lengthening the timeout period to two hours might help if the node package really is that hard to compile. But that raises the question of whether it is wise to spend that many resources on a package that is probably rarely used. Node.js is not something typically installed into an OpenWrt home router, I think.

@ynezz
Member

ynezz commented Mar 6, 2025

When building node packages, the number of threads increases to over 100, but it built without any problems.

Maybe there is some issue, where node's build system doesn't honor the build concurrency constraints? Or there is some deadlock/race somewhere, being exhibited only on build systems with lower I/O throughput? If it was changed in that update, maybe the diff between those two versions could show the culprit? How was the previous node version (one which built fine on buildbots) behaving?

But that raises the question of whether it is wise to spend that many resources on a package that is probably rarely used.

Indeed, but it seems to be actively maintained, so there are users.

Node.js is not something typically installed into an OpenWrt home router, I think.

We could say this about a lot of other packages as well :)
