feat: allow pod spec to be defined and patched in agent and plugin #262

Conversation

@42atomys (Contributor) commented Feb 27, 2024

Description

This pull request introduces a new feature where the PodSpec can be defined and patched dynamically, allowing for more customized configurations of the Kubernetes pods that run the Buildkite agents.

This includes the ability to specify any pod specification directly through the Buildkite agent's configuration or inside the plugin, using the same strategic merge logic as kubectl.

Execution Logic

  1. The PodSpec is resolved without changes.
  2. The agent resolves the pod-spec-patch from the agent configuration file and applies it to all jobs it executes.
  3. The agent resolves the kubernetes plugin's podSpecPatch field defined in the pipeline step and applies it on top.

Example

This agent configuration changes the service account name, automounts the service account token, and adds an env var without erasing the env vars already present.

# ...
# ssh-credentials-secret is the secret holding the ssh credentials
# configured on the agent with docker-ssh-config
# @see https://github.com/buildkite/docker-ssh-env-config
ssh-credentials-secret: buildkite-ssh-github-creds
pod-spec-patch:
  serviceAccountName: 'buildkite-agent-sa'
  automountServiceAccountToken: true
  containers:
    - name: container-0
      env:
        - name: GITHUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: github-secrets
              key: github-token
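
For illustration only, the strategic merge semantics described above can be sketched with the k8s.io/apimachinery strategicpatch package. This is a minimal sketch, not the controller's actual code; the helper name applyPodSpecPatch is hypothetical:

    package main

    import (
    	"encoding/json"
    	"fmt"

    	corev1 "k8s.io/api/core/v1"
    	"k8s.io/apimachinery/pkg/util/strategicpatch"
    )

    // applyPodSpecPatch (hypothetical helper) merges a patch into a base PodSpec
    // using the same strategic merge semantics as kubectl: lists such as
    // `containers` are merged by their `name` key instead of being replaced.
    func applyPodSpecPatch(base, patch corev1.PodSpec) (corev1.PodSpec, error) {
    	var merged corev1.PodSpec
    	baseJSON, err := json.Marshal(base)
    	if err != nil {
    		return merged, err
    	}
    	patchJSON, err := json.Marshal(patch)
    	if err != nil {
    		return merged, err
    	}
    	mergedJSON, err := strategicpatch.StrategicMergePatch(baseJSON, patchJSON, corev1.PodSpec{})
    	if err != nil {
    		return merged, err
    	}
    	err = json.Unmarshal(mergedJSON, &merged)
    	return merged, err
    }

    func main() {
    	base := corev1.PodSpec{
    		Containers: []corev1.Container{{Name: "container-0", Image: "ruby:3.3"}},
    	}
    	patch := corev1.PodSpec{
    		ServiceAccountName: "buildkite-agent-sa",
    		Containers: []corev1.Container{{
    			Name: "container-0",
    			Env:  []corev1.EnvVar{{Name: "GITHUB_TOKEN", Value: "example"}},
    		}},
    	}
    	// container-0 keeps its image and gains the env var, matching the YAML example above.
    	merged, _ := applyPodSpecPatch(base, patch)
    	fmt.Println(merged.ServiceAccountName, merged.Containers[0].Image, merged.Containers[0].Env)
    }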

Related Issues

Resolve #247

@42atomys changed the title from "feat: allow git-credentials to be set in agent and plugin" to "feat: allow pod spec to be defined and patched in agent and plugin" on Feb 27, 2024
@42atomys (Contributor, Author) commented Feb 27, 2024

Reviewers: let me know if you have questions. I need help on the tests part 🙏

@triarius (Contributor) left a comment

Another great contribution @42atomys!

I talked a bit about some tweaks I have made to #248 in that PR, and I think I want to do some similar tweaks to this too.

To keep things simple, I wonder whether we need the patch at the plugin (aka step) level? Prima facie, almost everything you define in the patch could have been defined in the podSpec. Is it there because the "system containers" may need patching? I'd imagine that whatever patching they need would most likely be the same for all steps run by the controller, so would it not be better handled by the podSpecPatch in the controller config?

So what I suggest is:

  1. remove the step level podSpecPatch
  2. rebase over main once #248 (feat: allow ssh-credentials to be set in agent and plugin) is merged, as it will land as some combination of your commits and my tweaks.

As for the testing, you should be able to start a test controller with a podSpecPatch specified in config. If we eliminate the step level patch, then all you need to do is test that a config level patch is propagated to all the jobs in a pipeline.

Tags stringSlice `mapstructure:"tags" validate:"min=1"`
ProfilerAddress string `mapstructure:"profiler-address" validate:"omitempty,hostname_port"`
ClusterUUID string `mapstructure:"cluster-uuid" validate:"omitempty"`
PodSpecPatch map[string]interface{} `mapstructure:"pod-spec-patch" validate:"omitempty"`
@triarius (Contributor) commented Mar 15, 2024

What happens if this is specified as something k8s would not recognise like:

llamas: true

Is it a no-op, or is there a failure of the patch to apply which causes the Job to fail to start? If it's the latter, then I suggest we should try to detect this as early as possible.

It's most likely too onerous to validate it in the JSON Schema for the helm chart, but I think if some Kubernetes library exports a type for this, we should try to marshal into that so that the helm chart fails to deploy if podSpecPatch would cause an error.
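
A minimal sketch of that kind of early validation, assuming we round-trip the raw pod-spec-patch map through JSON and decode it into corev1.PodSpec with unknown fields disallowed (the helper name validatePodSpecPatch is hypothetical):

    package main

    import (
    	"bytes"
    	"encoding/json"
    	"fmt"

    	corev1 "k8s.io/api/core/v1"
    )

    // validatePodSpecPatch (hypothetical) fails fast on keys that corev1.PodSpec
    // does not recognise, e.g. `llamas: true`, instead of letting the Job fail
    // to start later.
    func validatePodSpecPatch(raw map[string]interface{}) error {
    	data, err := json.Marshal(raw)
    	if err != nil {
    		return err
    	}
    	dec := json.NewDecoder(bytes.NewReader(data))
    	dec.DisallowUnknownFields()
    	var spec corev1.PodSpec
    	if err := dec.Decode(&spec); err != nil {
    		return fmt.Errorf("invalid pod-spec-patch: %w", err)
    	}
    	return nil
    }

    func main() {
    	// "llamas" is not a PodSpec field, so this prints an error.
    	fmt.Println(validatePodSpecPatch(map[string]interface{}{"llamas": true}))
    }

One caveat: a strict decode like this would also reject strategic-merge directives such as `$patch: delete`, so real validation might need to be more lenient.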


Oh yeah, both this and #248 will need updates to the Helm chart's JSONSchema.

@42atomys (Contributor, Author)
Hi @triarius, thank you for your time. I have an example of usage that is implemented internally on a large scale, across thousands of jobs per day. We deploy one agent per cluster using a helm chart, with this configuration (we've been using the fork for 3 weeks now) to bind the service account and define resources at the agent level with our default values.

Note: not defining resources on a Kubernetes pod gives it the lowest (BestEffort) QoS priority and lets the scheduler pack all jobs onto the same node, causing latency and job crashes when many jobs run concurrently.

Agent configuration (helm chart):

    image: ghcr.io/42atomys/helm/agent-stack-k8s/controller:0.8.0
    config:
      tags: [ queue="default" ]
      ssh-credentials-secret: github-credentials
      pod-spec-patch:
        serviceAccountName: buildkite-service-account
        automountServiceAccountToken: true
        initContainers:
          - name: copy-agent
            resources:
              requests:
                cpu: 100m
                memory: 96Mi
              limits:
                memory: 96Mi
        containers:
          - name: container-0
            resources:
              requests:
                cpu: 200m
                memory: 256Mi
              limits:
                memory: 256Mi
          - name: dind
            resources:
              requests:
                cpu: 100m
                memory: 336Mi
              limits:
                memory: 336Mi
          - name: checkout
            resources:
              requests:
                cpu: 100m
                memory: 480Mi
              limits:
                memory: 480Mi
          - name: agent
            resources:
              requests:
                cpu: 50m
                memory: 64Mi
              limits:
                memory: 64Mi

In the step definition, we sometimes need "more power" to execute tasks, like integration tests with a browserless engine. With pod-spec-patch at the step level, we can granularly configure our pipeline without having to redefine all the pod/container definitions. See the example below:

Simple CI Pipeline:

 steps:
  # Execute a step on a Kubernetes pod to run rspec with the BUILDKITE_ANALYTICS_TOKEN
  # secret injected into container-0.
  # NOTE: We define the commands in the `commands` field and don't declare
  # any container in the `podSpecPatch` field, so the plugin will use the default
  # container (container-0) to run the commands.
  - agents:
      queue: default
    commands:
    - bundle exec rspec
    plugins:
    - kubernetes:
        podSpecPatch:
          containers:
          - env:
            - name: BUILDKITE_ANALYTICS_TOKEN
              valueFrom:
                secretKeyRef:
                  key: BUILDKITE_ANALYTICS_TOKEN
                  name: buildkite
            name: container-0

  # Execute a Playwright test on a Kubernetes pod with more resources
  # due to its high memory and CPU usage.
  - agents:
      queue: default
    commands:
    - bundle exec playwright test
    plugins:
    - kubernetes:
        podSpecPatch:
          containers:
          - name: container-0
            resources:
              limits:
                cpu: 2
                memory: 4Gi
              requests:
                cpu: 2
                memory: 4Gi

This is a simple example of how podSpecPatch can be used at the step level to adjust resources, but any field of the pod can be changed as well, without needing to define the entire pod spec each time. We are currently scaling Buildkite to an industrial level in my organisation.

@triarius (Contributor) commented Mar 18, 2024

Thanks for the context @42atomys. I feel like with a PodSpecPatch at the step level, if you're only running one container in a job, you will almost never need to specify PodSpec at the step level. I'm happy to leave them both in for now, but we may decide to deprecate PodSpec later. The controller will always create a base PodSpec, but you can patch it in the plugin.

I've got the tests working on another branch. Are you happy for me to push to this branch and merge this PR? I'll be making it independent of #248 though. We can discuss the merits of step level sshCredentialSecrets there.

@42atomys (Contributor, Author)
@triarius Thanks for your reply. I think deprecating PodSpec in the future should be done, along with a rework of the job creation logic!

FYI: I don't know if you noticed, but I create a default container-0 when no containers are defined, so the commands field of the step definition runs without needing to define the kubernetes plugin 🎉 The idea is to be seamless with the normal behaviour when the kubernetes plugin is not used.

You can push to the branch and merge it for me, all is good 💯

Linked issue that merging this pull request may close: Enhancing PodSpec Customization with JSON Merge Patch Strategy (helm-controller and plugin) (#247)