feat: allow pod spec to be defined and patched in agent and plugin #262

Conversation

@42atomys (Contributor) commented Feb 27, 2024

Description

This pull request introduces a new feature where the PodSpec can be defined and patched dynamically, allowing for more customized configurations of the Kubernetes pods that run the Buildkite agents.

This includes the ability to specify any pod specification directly through the Buildkite agent's configuration or inside the plugin, using the same strategic merge logic as kubectl.

Execution Logic

  1. The PodSpec is resolved without changes.
  2. The agent resolves the pod-spec-patch from the agent configuration file and applies it to all jobs it executes.
  3. The agent resolves the kubernetes plugin's podSpecPatch field defined in the pipeline step and applies it on top.

Example

This agent configuration changes the service account name, automounts the service account token, and adds an env var without erasing the env vars already present.

# ...
# ssh-credentials-secret is the secret holding the ssh credentials
# configured on the agent with docker-ssh-config
# @see https://github.com/buildkite/docker-ssh-env-config
ssh-credentials-secret: buildkite-ssh-github-creds
pod-spec-patch:
  serviceAccountName: 'buildkite-agent-sa'
  automountServiceAccountToken: true
  containers:
    - name: container-0
      env:
        - name: GITHUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: github-secrets
              key: github-token
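
For illustration only, the strategic merge semantics described above can be sketched with the k8s.io/apimachinery strategicpatch package. This is a minimal sketch, not the controller's actual code; the helper name applyPodSpecPatch is hypothetical:

    package main

    import (
    	"encoding/json"
    	"fmt"

    	corev1 "k8s.io/api/core/v1"
    	"k8s.io/apimachinery/pkg/util/strategicpatch"
    )

    // applyPodSpecPatch (hypothetical helper) merges a patch into a base PodSpec
    // using the same strategic merge semantics as kubectl: lists such as
    // `containers` are merged by their `name` key instead of being replaced.
    func applyPodSpecPatch(base, patch corev1.PodSpec) (corev1.PodSpec, error) {
    	var merged corev1.PodSpec
    	baseJSON, err := json.Marshal(base)
    	if err != nil {
    		return merged, err
    	}
    	patchJSON, err := json.Marshal(patch)
    	if err != nil {
    		return merged, err
    	}
    	mergedJSON, err := strategicpatch.StrategicMergePatch(baseJSON, patchJSON, corev1.PodSpec{})
    	if err != nil {
    		return merged, err
    	}
    	err = json.Unmarshal(mergedJSON, &merged)
    	return merged, err
    }

    func main() {
    	base := corev1.PodSpec{
    		Containers: []corev1.Container{{Name: "container-0", Image: "ruby:3.3"}},
    	}
    	patch := corev1.PodSpec{
    		ServiceAccountName: "buildkite-agent-sa",
    		Containers: []corev1.Container{{
    			Name: "container-0",
    			Env:  []corev1.EnvVar{{Name: "GITHUB_TOKEN", Value: "example"}},
    		}},
    	}
    	// container-0 keeps its image and gains the env var, matching the YAML example above.
    	merged, _ := applyPodSpecPatch(base, patch)
    	fmt.Println(merged.ServiceAccountName, merged.Containers[0].Image, merged.Containers[0].Env)
    }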

Related Issues

Resolve #247

@42atomys changed the title from "feat: allow git-credentials to be set in agent and plugin" to "feat: allow pod spec to be defined and patched in agent and plugin" on Feb 27, 2024
@42atomys (Contributor, Author) commented Feb 27, 2024

Reviewers: let me know if you have questions. I need help on the tests part 🙏

@triarius (Contributor) left a comment

Another great contribution @42atomys!

I talked a bit about some tweaks I have made to #248 in that PR, and I think I want to do some similar tweaks to this too.

To keep things simple, I wonder whether we need the patch at the plugin (aka step) level? Prima facie, almost everything you define in the patch could have been defined in the podSpec. Is it there because the "system containers" may need patching? I'd imagine that whatever patching they need would most likely be the same for all steps run by the controller, so would it not be better handled by the podSpecPatch in the controller config?

So what I suggest is:

  1. remove the step level podSpecPatch
  2. rebase over main once #248 (feat: allow ssh-credentials to be set in agent and plugin) is merged, as it will land as some combination of your commits and my tweaks.

As for the testing, you should be able to start a test controller with a podSpecPatch specified in config. If we eliminate the step level patch, then all you need to do is test that a config level patch is propagated to all the jobs in a pipeline.

Tags stringSlice `mapstructure:"tags" validate:"min=1"`
ProfilerAddress string `mapstructure:"profiler-address" validate:"omitempty,hostname_port"`
ClusterUUID string `mapstructure:"cluster-uuid" validate:"omitempty"`
PodSpecPatch map[string]interface{} `mapstructure:"pod-spec-patch" validate:"omitempty"`
@triarius (Contributor) commented Mar 15, 2024

What happens if this is specified as something k8s would not recognise like:

llamas: true

Is it a no-op, or is there a failure of the patch to apply which causes the Job to fail to start? If it's the latter, then I suggest we should try to detect this as early as possible.

It's most likely too onerous to validate it in the JSON Schema for the helm chart, but I think if some Kubernetes library exports a type for this, we should try to marshal into that so that the helm chart fails to deploy if podSpecPatch would cause an error.
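
A minimal sketch of that kind of early validation, assuming we round-trip the raw pod-spec-patch map through JSON and decode it into corev1.PodSpec with unknown fields disallowed (the helper name validatePodSpecPatch is hypothetical):

    package main

    import (
    	"bytes"
    	"encoding/json"
    	"fmt"

    	corev1 "k8s.io/api/core/v1"
    )

    // validatePodSpecPatch (hypothetical) fails fast on keys that corev1.PodSpec
    // does not recognise, e.g. `llamas: true`, instead of letting the Job fail
    // to start later.
    func validatePodSpecPatch(raw map[string]interface{}) error {
    	data, err := json.Marshal(raw)
    	if err != nil {
    		return err
    	}
    	dec := json.NewDecoder(bytes.NewReader(data))
    	dec.DisallowUnknownFields()
    	var spec corev1.PodSpec
    	if err := dec.Decode(&spec); err != nil {
    		return fmt.Errorf("invalid pod-spec-patch: %w", err)
    	}
    	return nil
    }

    func main() {
    	// "llamas" is not a PodSpec field, so this prints an error.
    	fmt.Println(validatePodSpecPatch(map[string]interface{}{"llamas": true}))
    }

One caveat: a strict decode like this would also reject strategic-merge directives such as `$patch: delete`, so real validation might need to be more lenient.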


Oh yeah, both this and #248 will need updates to the Helm chart's JSONSchema.

@42atomys (Contributor, Author)
Hi @triarius, thank you for your time. I have an example of usage that is implemented internally on a large scale, across thousands of jobs per day. We deploy one agent per cluster using a helm chart, with this configuration (we've been using the fork for 3 weeks now) to bind the service account and define resources at the agent level with our default values.

Note: not defining resources on a Kubernetes pod gives it the lowest (BestEffort) QoS priority and lets the scheduler pack all jobs onto the same node, causing latency and job crashes when many jobs run concurrently.

Agent configuration (helm chart):

    image: ghcr.io/42atomys/helm/agent-stack-k8s/controller:0.8.0
    config:
      tags: [ queue="default" ]
      ssh-credentials-secret: github-credentials
      pod-spec-patch:
        serviceAccountName: buildkite-service-account
        automountServiceAccountToken: true
        initContainers:
          - name: copy-agent
            resources:
              requests:
                cpu: 100m
                memory: 96Mi
              limits:
                memory: 96Mi
        containers:
          - name: container-0
            resources:
              requests:
                cpu: 200m
                memory: 256Mi
              limits:
                memory: 256Mi
          - name: dind
            resources:
              requests:
                cpu: 100m
                memory: 336Mi
              limits:
                memory: 336Mi
          - name: checkout
            resources:
              requests:
                cpu: 100m
                memory: 480Mi
              limits:
                memory: 480Mi
          - name: agent
            resources:
              requests:
                cpu: 50m
                memory: 64Mi
              limits:
                memory: 64Mi

In the step definition, we sometimes need "more power" to execute tasks, like integration tests with a browserless engine. With pod-spec-patch at the step level, we can granularly configure our pipeline without having to redefine all the pod/container definitions. See the example below:

Simple CI Pipeline:

 steps:
  # Execute a step on a Kubernetes pod to run rspec with the BUILDKITE_ANALYTICS_TOKEN
  # secret injected into container-0.
  # NOTE: We define the commands in the `commands` field and don't declare
  # any container in the `podSpecPatch` field, so the plugin will use the default
  # container (container-0) to run the commands.
  - agents:
      queue: default
    commands:
    - bundle exec rspec
    plugins:
    - kubernetes:
        podSpecPatch:
          containers:
          - env:
            - name: BUILDKITE_ANALYTICS_TOKEN
              valueFrom:
                secretKeyRef:
                  key: BUILDKITE_ANALYTICS_TOKEN
                  name: buildkite
            name: container-0

  # Execute a Playwright test on a Kubernetes pod with more resources
  # due to its high memory and CPU usage.
  - agents:
      queue: default
    commands:
    - bundle exec playwright test
    plugins:
    - kubernetes:
        podSpecPatch:
          containers:
          - name: container-0
            resources:
              limits:
                cpu: 2
                memory: 4Gi
              requests:
                cpu: 2
                memory: 4Gi

This is a simple example of how podSpecPatch can be used at the step level to adjust resources, but any field of the pod can be changed as well, without needing to define the entire pod spec each time. We are currently scaling Buildkite to an industrial level in my organisation.

@triarius (Contributor) commented Mar 18, 2024

Thanks for the context @42atomys. I feel like with a PodSpecPatch at the step level, if you're only running one container in a job, you will almost never need to specify PodSpec at the step level. I'm happy to leave them both in for now, but we may decide to deprecate PodSpec later. The controller will always create a base PodSpec, but you can patch it in the plugin.

I've got the tests working on another branch. Are you happy for me to push to this branch and merge this PR? I'll be making it independent of #248 though. We can discuss the merits of step level sshCredentialSecrets there.

@42atomys (Contributor, Author)
@triarius Thanks for your reply. I think deprecating PodSpec in the future should be done, along with a rework of the job creation logic!

FYI: I don't know if you noticed, but I create a default container-0 when no containers are defined, so the commands field of the step definition runs without needing to define the kubernetes plugin 🎉 The idea is to be seamless with the normal behaviour when the kubernetes plugin is not used.

You can push to the branch and merge it for me, all is good 💯

Linked issue that merging this pull request may close: Enhancing PodSpec Customization with JSON Merge Patch Strategy (helm-controller and plugin) (#247)