FPGA CI feature requests #66

nathanaelhuffman · 2024-11-15T22:17:43Z

Over in FPGA land we have a few somewhat unique challenges (some of our own making) that pushed us to GHA and self-hosted runners, rather than buildomat, to meet our current CI needs. I wanted to document why here, as buildomat may grow enough features to warrant looking at it again in the future.

Large toolchains (40GB+) which aren't well-suited to being installed on-demand. (Solvable with custom VM images, I know.)
A split-brain build system: we're using cobble for our bsv-based designs and buck2 for our vhdl-based designs.
A mono-repo structure where the shared and common IP and the projects are all in the same tree.

Our naive version of CI was "build all the things". That worked OK for one build system and a set of projects that built relatively quickly, but as the designs got larger (looking at you, sidecar main FPGA) build times increased, leading to a lag of an hour or more while CI runs before changes can land. In our current world, where much of the work is going on in the new build system for cosmo, spending 45 minutes or more rebuilding the sidecar main controller, for example, is a waste of both time and compute energy.

Feature need 1:
We'd like a way of conditionally running jobs with some change filtering, à la what https://github.com/dorny/paths-filter does in GitHub Actions. This feature lets us drastically cut down the amount of build time required by (naively) building only the cobble-based things or only the buck2-based things, depending on the change set.
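
As a rough illustration of the kind of change filtering meant here, the sketch below maps a list of changed paths onto named filters. It is not how paths-filter or buildomat works internally; the `FILTERS` table, the directory names, and the `matching_filters` helper are all hypothetical, and Python's `fnmatch` globbing is simpler than the glob syntax paths-filter accepts.

```python
from fnmatch import fnmatchcase

# Hypothetical filter table: filter name -> glob patterns.
# The directory names are placeholders, not the real repository layout.
# Shared IP appears under both filters so that a change there triggers
# both build systems.
FILTERS = {
    "cobble": ["bsv_designs/*", "shared_ip/*"],
    "buck2": ["vhdl_designs/*", "shared_ip/*"],
}


def matching_filters(changed_paths, filters=FILTERS):
    """Return the names of filters with at least one matching changed path.

    Note that fnmatch's "*" also matches "/" characters, so "shared_ip/*"
    covers files at any depth below shared_ip/.
    """
    return {
        name
        for name, patterns in filters.items()
        if any(
            fnmatchcase(path, pattern)
            for path in changed_paths
            for pattern in patterns
        )
    }
```

A CI job for the buck2 designs would then run only when `"buck2"` is in the returned set, and likewise for cobble.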

Feature need 2:
We'd like a way of preserving interim build artifacts. Our build systems (cobble and buck2) are very good at tracking the dep tree and only re-doing the minimal number of steps necessary. For our bluesim cases, this cuts a 17-minute run down to 4 minutes, because we re-hydrate the next build with the previous build's results and so don't have to rebuild a bunch of stuff that didn't change. This looks to be related to #32.
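
A minimal sketch of the re-hydration idea, assuming the interim artifacts live in a single build output directory that can be archived after one run and unpacked before the next. The function names and the tarball approach are illustrative assumptions only; they do not describe what GHA caching, cobble, buck2, or buildomat actually do.

```python
import tarfile
from pathlib import Path


def save_build_dir(build_dir, archive):
    """Pack build_dir into a gzip-compressed tarball after a build finishes.

    The archive should live outside build_dir so it does not include itself.
    """
    with tarfile.open(archive, "w:gz") as tar:
        # arcname="." stores paths relative to the build directory.
        tar.add(Path(build_dir), arcname=".")


def restore_build_dir(archive, build_dir):
    """Unpack a previous tarball before a build.

    Returns False (and does nothing) when no archive exists, i.e. a cold
    start where everything has to be rebuilt.
    """
    archive = Path(archive)
    if not archive.exists():
        return False
    build_dir = Path(build_dir)
    build_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(build_dir)
    return True
```

The build tool would run between the restore and save steps and decide for itself which outputs are still valid.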

Feature need 3:
These things are large and relatively stable, and we'd like to target our own infrastructure rather than AWS, so being able to run these on colo/dogfood/some-other-oxide-rack would be awesome. Related to #13.

That said, if none of these features fit well with buildomat's roadmap, I think that's very understandable. In the very, very long term, I'm interested in moving to a remote execution setup, which is probably also not a great fit for buildomat's core service. buck2 supports the protocol used by https://bazel.build/community/remote-execution-services and we may consider looking in that direction if our FPGA use continues to increase and our designs get much larger.

jclulow · 2024-11-18T19:56:40Z

> Large toolchains (40GB+) which aren't well-suited to being installed on-demand. (Solvable with custom VM images, I know.)

Can you remind me how you're solving this part in GHA runners today?

nathanaelhuffman · 2024-11-18T20:20:43Z

> > Large toolchains (40GB+) which aren't well-suited to being installed on-demand. (Solvable with custom VM images, I know.)
>
> Can you remind me how you're solving this part in GHA runners today?

Self-hosted runner(s), currently on colo, where the toolchain persists. This has certain non-idealities as well, but I have it configured to require approval for external contributors, and it executes on [push] events. Longer term, I'm considering setting up a buildbarn or nativelink instance to do the building, which will be more respectful of the buck2 env and caching, but I need a lot more learning and play there first.

jclulow · 2024-11-18T21:35:53Z

> self hosted runner(s), currently on colo, where the toolchain persists. This has certain non-idealities as well, but have it configured to require approval for external contributors and it executes on [push] events.

Ah! That's essentially the same as what I'm putting together for the Hubris CI environment, FWIW, so if that fits the bill that's good to know.

jclulow · 2024-11-18T21:42:47Z

If we provided a target that used your existing instances in as secure a way as we can manage, I feel like that would sort out (2) & (3).

It occurs to me that all the GHA facility you've linked for (1) is doing is effectively running a program to look at the commit it's been asked to build, and then exiting the job as a success (without building anything) if it feels that nothing relevant has changed. You could do this today in a buildomat job without adding any new facilities; yes, we would spin up the environment for you to ask the question in, but that doesn't take very long. If we need to provide more of the metadata that GitHub gives us in the environment, I can easily do that. Otherwise the essential part of the program appears to be here: https://github.com/dorny/paths-filter/blob/de90cc6fb38fc0963ad72b210f1f284cd68cea36/src/main.ts#L115-L175
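
A minimal sketch of such an early-exit check, written as a script a job could run before its real build step. It assumes the job has a git checkout and is handed a base and head revision (how those would reach a buildomat job is exactly the metadata question above, so the argument layout here is hypothetical); it is not a port of the paths-filter code.

```python
import subprocess
from fnmatch import fnmatchcase


def changed_paths(base, head, repo="."):
    """List files that differ between the merge base of base/head and head."""
    result = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...{head}"],
        cwd=repo,
        check=True,
        capture_output=True,
        text=True,
    )
    return [line for line in result.stdout.splitlines() if line]


def is_relevant(paths, patterns):
    """True if any changed path matches any glob pattern."""
    return any(fnmatchcase(p, pat) for p in paths for pat in patterns)


def main(argv):
    """Hypothetical usage: check.py BASE HEAD PATTERN [PATTERN ...]"""
    base, head, *patterns = argv
    if not is_relevant(changed_paths(base, head), patterns):
        # Nothing this job cares about changed: report success and stop.
        print("no relevant changes; skipping build")
        return 0
    print("relevant changes found; continuing to build")
    # The real build command would be invoked here.
    return 0
```

The script's entry point would call `sys.exit(main(sys.argv[1:]))`, so that a skipped build still shows up as a passing job.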

nathanaelhuffman · 2024-11-18T21:53:26Z

I'm game for getting this on our stuff if we think this is manageable. The GHA stuff proved to be a useful tool for making sure I'm getting what I want here, and for getting me something now. If we'd like to set up buildomat jobs that do the same things, I'm happy to help. I have an Ubuntu 24.04 disk image on colo that has everything we need set up, and we could easily decommission the standalone runners by disabling them and nuking their folders when we're ready.

Aaron-Hartwig mentioned this issue Dec 10, 2024

disable buildomat oxidecomputer/quartz#251

Merged
