No, it's german for "The Bootstrap, the" #1958

Open: wants to merge 12 commits into base: main


Contributor

@moskyb moskyb commented Feb 16, 2023

(joke context)

The core of the buildkite agent (one of its cores, anyway) is a component currently called "The Bootstrap". This is the part of the agent that's actually responsible for running jobs, streaming their logs back to the buildkite mothership, and doing all the business of running hooks, finding plugins, doing git things, etc.

Were it only that simple.

What we call "the bootstrap" is actually three separate components from this repo's point of view:

  • A CLI command, buildkite-agent bootstrap, which is what the agent calls when it gets a new job to run
  • A go package called bootstrap that contains most of the code that gets run to run a job
  • A go struct, bootstrap.Bootstrap which holds the logic for job execution (though there are other peripheral job execution-related bits and bobs hanging around in the bootstrap package mentioned above)

These three things being named the same thing makes talking about them separately a pain; when talking about "the bootstrap", there's a variety of things that could be the subject of discussion.

Furthermore...

"Bootstrap" is a kind of a crappy name for what this thing does

There was a time, long ago, when this name probably fit. Fun fact: prior to v3 of the agent, the bootstrap used to be a bash script that the agent ran. At that point, the bootstrap was mostly responsible for standing up (bootstrapping, one might say) an environment in which a job (at the time a bash script and nothing more) could run.

Times have changed however, and the bootstrap is now a (very) complex piece of go code responsible for orchestrating all of the various tasks that need to happen before, during and after a job run.

Okay, but why change it?

Simply put, the name is confusing and it means that when we talk about the bootstrap (which we usually mean as "the job execution thingy") to our colleagues and to our customers, there's context that's lost in translation.

The bootstrap is an incredibly important part - maybe the most important part - of a job's execution lifecycle, and we fairly regularly have need to talk to customers about it. Knowing what the bootstrap actually is requires knowledge of the agent's history though, and it makes talking about these things, and intuiting how the agent actually works, a lot harder.

Consider: If you were a buildkite customer and a bikkie said "oh that's a bootstrap error", what would you think the problem is? How about if they said (foreshadowing) "I think there's an error in the job executor"?

Cool. What have you done about it?

This PR is basically a big fancy find-and-replace. The gist of it is:

  • The buildkite-agent bootstrap command is deprecated (but not removed) and replaced with buildkite-agent run-job. This new command is functionally identical to the existing one, with the only change being that it doesn't have a deprecation notice
  • The bootstrap package has been renamed to job. This makes a lot of names clearer IMO - consider bootstrap.Shell vs job.Shell
  • The bootstrap.Bootstrap struct has been renamed to job.Executor. This is more in line with what it actually does - it executes a job
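
As a rough illustration of the "deprecated but not removed" approach described above, here is a minimal sketch in which both command names resolve to the same implementation and only the old name carries a notice. The `command` type, `lookup` function, and the notice text are hypothetical and do not reflect the agent's actual CLI framework or wording.

```go
package main

import (
	"fmt"
	"os"
)

// command is a hypothetical stand-in for a CLI subcommand.
type command struct {
	run        func(args []string) error
	deprecated string // non-empty means "print this notice before running"
}

// runJob is a placeholder for the shared job-execution entry point.
func runJob(args []string) error { return nil }

// commands maps both names to the same implementation; only the old
// name carries a deprecation notice (the wording is an assumption).
var commands = map[string]command{
	"run-job":   {run: runJob},
	"bootstrap": {run: runJob, deprecated: "bootstrap is deprecated, use run-job instead"},
}

// lookup resolves a subcommand name or reports that it is unknown.
func lookup(name string) (command, error) {
	cmd, ok := commands[name]
	if !ok {
		return command{}, fmt.Errorf("unknown command %q", name)
	}
	return cmd, nil
}

func main() {
	for _, name := range []string{"run-job", "bootstrap"} {
		cmd, err := lookup(name)
		if err != nil {
			fmt.Fprintln(os.Stderr, "error:", err)
			continue
		}
		if cmd.deprecated != "" {
			fmt.Fprintln(os.Stderr, "warning:", cmd.deprecated)
		}
		if err := cmd.run(nil); err != nil {
			fmt.Fprintln(os.Stderr, "error:", err)
		}
	}
}
```

Keeping a single implementation behind both names means the old command cannot drift from the new one while it is being phased out.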

None of these names are final - i'd love some feedback on them. Two hard things and all that.

Open Questions

  • Is job.Executor too similar semantically to agent.JobRunner? My opinion is no, but it's not particularly strongly held
  • Should we bother scrubbing all mention of the bootstrap from the repo or is it okay to leave some of them in there?

Still to do

  • Update agent/job_runner.go to:
    • Use the new nomenclature
    • Add a hook called pre-exec, identical to pre-bootstrap but with the shiny new name
    • Add a deprecation warning to the pre-bootstrap hook??? should we just continue to allow it?
  • Another round of seek-and-destroy on instances of the text bootstrap. They're pervasive!
  • Local smoke testing to ensure that:
    • The agent uses buildkite-agent exec-job as its job executor by default
    • buildkite-agent bootstrap still works okay, but outputs a deprecation warning
    • The agent's bootstrap can be overridden using both --bootstrap-script and --job-executor-script.
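
One way the last smoke-test item could be supported is sketched below using Go's standard `flag` package. The two flag names come from the list above; the precedence rule (new flag wins), the default value, and the `resolveExecutorScript` and `parse` helpers are assumptions for illustration, not the agent's actual configuration code.

```go
package main

import (
	"flag"
	"fmt"
)

// defaultExecutor is an assumed default, taken from the smoke-test list.
const defaultExecutor = "buildkite-agent exec-job"

// resolveExecutorScript picks the script used to run a job. The new flag
// wins when both are set (an assumption); usedDeprecated reports whether
// the value came from the deprecated flag.
func resolveExecutorScript(jobExecutorScript, bootstrapScript string) (script string, usedDeprecated bool) {
	switch {
	case jobExecutorScript != "":
		return jobExecutorScript, false
	case bootstrapScript != "":
		return bootstrapScript, true
	default:
		return defaultExecutor, false
	}
}

// parse reads both flags from args and resolves them to a single script.
func parse(args []string) (script string, usedDeprecated bool, err error) {
	fs := flag.NewFlagSet("start", flag.ContinueOnError)
	newFlag := fs.String("job-executor-script", "", "script used to run a job")
	oldFlag := fs.String("bootstrap-script", "", "deprecated: use --job-executor-script")
	if err := fs.Parse(args); err != nil {
		return "", false, err
	}
	script, usedDeprecated = resolveExecutorScript(*newFlag, *oldFlag)
	return script, usedDeprecated, nil
}

func main() {
	script, deprecated, err := parse([]string{"--bootstrap-script", "./custom.sh"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if deprecated {
		fmt.Println("warning: --bootstrap-script is deprecated")
	}
	fmt.Println("job executor:", script)
}
```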

@moskyb moskyb force-pushed the s-bootstrap-executor-g branch from 69f0510 to 49d5d3f Compare February 17, 2023 05:17
@pda
Member

pda commented Feb 24, 2023

Naming bike-shedding: I wonder if run-job would match our other terminology closer than exec-job.
e.g. Job state will be running as a result of this not-bootstrap thing happening.
And when you look at it in Test Analytics, it'll be called a Run (I think).
Also, “exec” feels quite low-level syscall-ish, whereas the not-bootstrap does quite a lot of higher-level coordination before executing one-or-more processes/hooks/plugins/containers/things.

Apologies, I haven't looked/thought deeper about the PR more broadly, I only have this bike-shed right now 😅

@pda
Member

pda commented Feb 24, 2023

Also: I was totally baited into looking at this by the excellent PR title 🤡

@moskyb
Contributor Author

moskyb commented Feb 26, 2023

> I wonder if run-job would match our other terminology closer than exec-job

@pda i think i agree with you here - it's terser while also holding more information. how would you feel about renaming the command to run-job while keeping the struct in the job package job.Executor? There's a slight mismatch in naming there, but i think it nicely delineates between the internals and the externals (porcelain and plumbing in git terms, i guess)

> Also: I was totally baited into looking at this by the excellent PR title 🤡

my cunning plan has worked then

@moskyb moskyb force-pushed the s-bootstrap-executor-g branch from e0cd7ca to 351a916 Compare March 2, 2023 06:51
@moskyb moskyb marked this pull request as ready for review March 2, 2023 06:52
@moskyb moskyb force-pushed the s-bootstrap-executor-g branch from 351a916 to 303f0d6 Compare March 2, 2023 21:06
@pda
Member

pda commented Mar 2, 2023

> how would you feel about renaming the command to run-job while keeping the struct in the job package job.Executor? There's a slight mismatch in naming there, but i think it nicely delineates between the internals and the externals (porcelain and plumbing in git terms, i guess)

Interesting question.

I'm a proponent of ubiquitous language; it'd be a shame to have two names for one thing.

One arguable argument against “runner” is that other platforms call their entire agent a “runner” (GitHub Actions, GitLab), and a subset of our customers will confuse it with that.

The other that you touched on is that we already have a component called JobRunner which lives in the agent outside the bootstrap executor/runner/thing.

I don't have the answers 🤷‍♂️

I wonder…

  • buildkite-agent start (some people think of this as the “Buildkite self-hosted runner”)
    • loop: get jobs (specifically: Command Step jobs, aka Command Jobs)
      • internal JobRunner prepares & orchestrates running the Command Job
        • buildkite-agent bootstrap (rename to run-job?) subprocess
          • do the lifecycle of the Command Job; command, plugins, hooks etc

Maybe JobRunner becomes JobOrchestrator and bootstrap.Bootstrap becomes job.Runner? I don't love it.

Taking a step back from specifics…

  • the main process gets a job and wants to run it, but doesn't know how or isn't capable of doing so directly; it delegates to another layer in a subprocess to actually run the job.
  • that subprocess exists to run jobs, and knows how to run jobs.

Through that lens, the subprocess has a much stronger claim to “run job” or “job runner” naming, and the main process should find a different name that means ”knows that a job needs running and knows how to ask a subprocess to run the job”.

@pda
Member

pda commented Mar 2, 2023

Possible alternative names for agent.JobRunner (i.e. the bit that doesn't actually execute the job, it just kicks it off elsewhere)

  • agent.JobManager
  • agent.JobForker
  • agent.JobOrchestrator
  • agent.JobStarter
  • agent.JobSupervisor
  • agent.JobInvoker

None of those feel great. What does it actually do?

  • Starts the subprocess to run the job
    • Collates the correct env to pass to that process
  • Streams stdout / stderr / header times to the API
  • Experimentally knows how to run jobs in k8s/etc instead of as a subprocess

The “k8s/etc” bit means it's not a JobForker.

I'd call it JobDispatcher except that means something different server-side, and the log streaming etc goes a bit beyond just “dispatching”.

agent.JobManager is okayish, to the extent that “manager” is ever a good name for a software component 😬

Maybe it's a JobInvoker but that's just adding yet another synonym for “run” / “execute”.

The fact that it's learning to run jobs in different ways (subprocess / k8s / …) feels important here. Again, “dispatcher” kind of suits that. So does “strategy”.

@moskyb
Contributor Author

moskyb commented Mar 3, 2023

@pda very interesting thoughts 🤔 i agree with you that there remains some confusion about the role of the agent.JobRunner vs job.Executor, but how would you feel about making that change at a later date? my take is that the current setup makes things clearer, though maybe not as clear as they possibly could be, but it's a step in the right direction.

the good thing is that those names (job.Executor and agent.JobRunner) are both completely internal, and can be pretty easily changed

Contributor

@triarius triarius left a comment

Really pumped for this to happen!

For cleaning up all references to bootstrap, it's probably fine to do this when we delete the bootstrap command. We should also change the bootstrap-script config key then. For now, we have to keep this config key.

As for the name of the cli command, I prefer subject-object-verb order to subject-verb-object order (despite being an English speaker and vim user). See https://cosine.blue/2019-09-06-kakoune.html, https://simblob.blogspot.com/2019/10/verb-noun-vs-noun-verb.html

So I prefer

buildkite-agent job run

This has the advantage that if we want to add other actions you can perform on a job, we can nest them under the same job subcommand namespace.

SOV order is also consistent with what we have done for the OIDC. There, the command is buildkite-agent oidc request-token.
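
To make the namespacing argument concrete, here is a minimal sketch of a noun-then-verb command tree. Only the `job run` and `oidc request-token` names come from the comment above; the `tree`, `resolve`, and `verbNames` helpers and the placeholder handlers are hypothetical, and the agent's real CLI framework is not shown.

```go
package main

import (
	"fmt"
	"sort"
)

// action is a hypothetical handler for one verb.
type action func(args []string) error

// tree maps a noun (subcommand namespace) to its verbs. New job actions
// would be added under "job" without touching the top level.
var tree = map[string]map[string]action{
	"job":  {"run": func([]string) error { return nil }},
	"oidc": {"request-token": func([]string) error { return nil }},
}

// verbNames lists a namespace's verbs in sorted order for error messages.
func verbNames(verbs map[string]action) []string {
	names := make([]string, 0, len(verbs))
	for name := range verbs {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// resolve turns "<noun> <verb> [args...]" into a handler plus remaining args.
func resolve(args []string) (action, []string, error) {
	if len(args) < 2 {
		return nil, nil, fmt.Errorf("usage: <noun> <verb> [args...]")
	}
	verbs, ok := tree[args[0]]
	if !ok {
		return nil, nil, fmt.Errorf("unknown namespace %q", args[0])
	}
	act, ok := verbs[args[1]]
	if !ok {
		return nil, nil, fmt.Errorf("unknown action %q for %q (available: %v)", args[1], args[0], verbNames(verbs))
	}
	return act, args[2:], nil
}

func main() {
	act, rest, err := resolve([]string{"job", "run", "--debug"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("remaining args:", rest)
	if err := act(rest); err != nil {
		fmt.Println("error:", err)
	}
}
```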

Contributor

@DrJosh9000 DrJosh9000 left a comment

Firstly, bravo for doing this 👏 it's grungy and fiddly work, so you earn A Kudos from me for taking it on. 🎆

I'm pumped to get this landed! Sorry it's taken me a while to review it, I wanted to give it the review it deserves.

Code looks pretty good! Unfortunately I don't have any particular opinion on the naming.

Comment on lines 369 to 370
hookExit := r.preExecHook(ctx, "pre-bootstrap")
hookExit = r.preExecHook(ctx, "pre-exec")
Contributor

Is the change in behaviour intended here? The hookExit from pre-bootstrap is overwritten by the pre-exec hookExit. So pre-bootstrap would no longer be able to reject the job.

Contributor Author

oops, it totally wasn't! fixed now in 088fa3a
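
For readers following the thread, one possible shape for avoiding the overwritten result is to check each hook's outcome before running the next. This is only a sketch of the general idea: the `hookFunc` type and `runPreExecHooks` function are hypothetical, and this is not necessarily what the fix commit does.

```go
package main

import "fmt"

// hookFunc is a hypothetical stand-in for running one named hook; it
// returns false when the hook rejects the job.
type hookFunc func(name string) (ok bool)

// runPreExecHooks runs each hook in order and stops at the first
// rejection, so a later hook can never mask an earlier failure.
func runPreExecHooks(run hookFunc, names ...string) (ok bool, rejectedBy string) {
	for _, name := range names {
		if !run(name) {
			return false, name
		}
	}
	return true, ""
}

func main() {
	// Simulate pre-bootstrap rejecting the job while pre-exec would accept it.
	run := func(name string) bool { return name != "pre-bootstrap" }
	ok, rejectedBy := runPreExecHooks(run, "pre-bootstrap", "pre-exec")
	fmt.Println(ok, rejectedBy) // prints "false pre-bootstrap"
}
```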

@@ -132,8 +132,9 @@ func (gr *gitRepository) Close() error {
 func (gr *gitRepository) Execute(args ...string) (string, error) {
 	path, err := exec.LookPath("git")
 	if err != nil {
-		return "", err
+		return "", fmt.Errorf("finding git executable on path: %w", err)
Contributor

Yes! 😄
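
A small standalone sketch of why wrapping with `%w` is useful here: the added context shows up in the message, while callers can still match the underlying `exec.ErrNotFound` with `errors.Is`. The `findExecutable` and `isNotFound` helpers and the binary name are made up for illustration; only `exec.LookPath`, `fmt.Errorf`, and `errors.Is` are standard library calls.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// findExecutable wraps the lookup error with context, mirroring the
// style of the change in the diff above.
func findExecutable(name string) (string, error) {
	path, err := exec.LookPath(name)
	if err != nil {
		return "", fmt.Errorf("finding %s executable on path: %w", name, err)
	}
	return path, nil
}

// isNotFound reports whether err wraps exec.ErrNotFound at any depth.
func isNotFound(err error) bool {
	return errors.Is(err, exec.ErrNotFound)
}

func main() {
	// The binary name is deliberately nonsensical so the lookup fails.
	_, err := findExecutable("definitely-not-a-real-binary-7f3a")
	fmt.Println("error:", err)
	fmt.Println("still matches exec.ErrNotFound:", isNotFound(err))
}
```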

@moskyb moskyb force-pushed the s-bootstrap-executor-g branch 3 times, most recently from d43a835 to 93e996a Compare March 15, 2023 02:27
@moskyb moskyb force-pushed the s-bootstrap-executor-g branch from 93e996a to 421f4fc Compare May 16, 2023 00:18
@lox
Contributor

lox commented Jun 1, 2023

I quite like agent.JobSupervisor for the current agent.JobRunner. Prior art from supervisord.

I'd always imagined that we'd add "Executors", which were strategies for executing the bootstrap; what we do now is a LocalShellExecutor or similar. We've built a DockerExecutor at CashApp, and I've built an AmazonECSExecutor in the past.
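
The "executors as strategies" idea might look roughly like the sketch below. The names `JobSupervisor`, `LocalShellExecutor`, and `DockerExecutor` are borrowed from this thread, but the `Executor` interface, its methods, the `Job` type, and the image name are all hypothetical. The sketch only builds the command line each strategy would run and starts nothing.

```go
package main

import (
	"fmt"
	"strings"
)

// Job is a hypothetical, minimal description of a job.
type Job struct{ ID string }

// Executor is a strategy for where and how the job-running process starts.
type Executor interface {
	Name() string
	// Argv returns the command line this strategy would run for the job.
	Argv(job Job) []string
}

// LocalShellExecutor runs the bootstrap as a local subprocess.
type LocalShellExecutor struct{}

func (LocalShellExecutor) Name() string { return "local-shell" }
func (LocalShellExecutor) Argv(Job) []string {
	return []string{"buildkite-agent", "bootstrap"}
}

// DockerExecutor runs the bootstrap inside a container; Image is assumed
// to contain the agent binary.
type DockerExecutor struct{ Image string }

func (DockerExecutor) Name() string { return "docker" }
func (d DockerExecutor) Argv(Job) []string {
	return []string{"docker", "run", "--rm", d.Image, "buildkite-agent", "bootstrap"}
}

// JobSupervisor delegates to whichever strategy it was configured with.
type JobSupervisor struct{ Executor Executor }

// Describe reports what the supervisor would do for a job.
func (s JobSupervisor) Describe(job Job) string {
	return fmt.Sprintf("job %s via %s: %s", job.ID, s.Executor.Name(), strings.Join(s.Executor.Argv(job), " "))
}

func main() {
	job := Job{ID: "example-job"}
	for _, e := range []Executor{LocalShellExecutor{}, DockerExecutor{Image: "example/agent:latest"}} {
		fmt.Println(JobSupervisor{Executor: e}.Describe(job))
	}
}
```

With this split, the supervising side does not need to know whether the job ends up in a subprocess, a container, or on another host.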

Finding the right name for the bootstrap is a real challenge. The architecture we've built at CashApp where we run the buildkite-agent bootstrap in a docker container (the logical extension of https://github.com/buildkite/docker-bootstrap-example) has really exposed the confusing-ness of the name. The bootstrap is almost not even part of the agent anymore, it could even be running on a totally different host depending on the executor.

What if you actually decoupled it from the buildkite-agent binary? What if it was a buildkite-agent-job-runtime? That also plays into the bk cli and wanting to run a job locally (which actually doesn't need an agent).

The other aspect here of what the bootstrap does is it manages phases (I wish I'd called this stages), hooks and plugins. I've frequently wanted more granular access to these things, for instance being able to call buildkite-agent bootstrap default-checkout-phase directly.

If I was pushed to pick a name for a straight sub-command rename, I'd actually aim to extract it out of the job subcommand to leave room for job commands that operate on the active job (the bootstrap does not in the same way that the other commands do). What about buildkite-agent job-runtime or buildkite-agent job-kernel execute? 😅

@DrJosh9000 DrJosh9000 added the cleanup Cleaning up code, refactoring, etc label Jul 19, 2023