[RFC-0010] Implement managed identity support for Azure Event Hub provider #1106

Open
wants to merge 1 commit into
base: main

dipti-pai
Member

@dipti-pai dipti-pai commented Apr 29, 2025

Depends on: fluxcd/pkg#917

Part of: fluxcd/flux2#5022

Fixes: #1047

Changes include:

  • If no authentication token is specified in the provider, attempt to obtain one using workload identity.
  • Add a new field, .spec.serviceAccountName, to support multi-tenant workload identity as defined in RFC-0010, so that a provider can use an identity bound to a service account other than notification-controller's.
  • Use the proxy to get the token if one is specified in the provider spec.
  • Cache the tokens if caching is enabled in the notification-controller options.
  • Add unit tests for the 3 authentication mechanisms (SAS, JWT, managed identity).
  • Add documentation for the single-tenant and multi-tenant approaches to workload identity with the azureeventhub provider.
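
The commit message for this PR states that a SAS connection string in the address takes priority over token authentication, and that a static JWT from the secret takes priority over a workload-identity token. A minimal sketch of that ordering follows. The helper `selectAuthMode`, the `authMode` type, and the rule that a SAS connection string is recognized by the substring `SharedAccessKey=` are assumptions made for illustration; they are not taken from the controller's code.

```go
package main

import (
	"fmt"
	"strings"
)

// authMode names the three authentication mechanisms listed in this PR.
type authMode string

const (
	authSAS              authMode = "sas"
	authStaticJWT        authMode = "static-jwt"
	authWorkloadIdentity authMode = "workload-identity"
)

// selectAuthMode is a hypothetical helper, not the controller's function.
// Assumption: a SAS connection string is detected by "SharedAccessKey=".
func selectAuthMode(address, secretToken string) authMode {
	switch {
	case strings.Contains(address, "SharedAccessKey="):
		// SAS wins over any token-based mechanism.
		return authSAS
	case secretToken != "":
		// A static JWT from the secret wins over workload identity.
		return authStaticJWT
	default:
		// No credentials supplied: fall back to workload identity.
		return authWorkloadIdentity
	}
}

func main() {
	sas := "Endpoint=sb://example.servicebus.windows.net/;SharedAccessKeyName=k;SharedAccessKey=v;EntityPath=hub"
	fmt.Println(selectAuthMode(sas, "jwt"))                             // sas
	fmt.Println(selectAuthMode("example.servicebus.windows.net", "jwt")) // static-jwt
	fmt.Println(selectAuthMode("example.servicebus.windows.net", ""))    // workload-identity
}
```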

Tested the feature with the notification-controller service account (single tenant) and with a standalone service account, using the proxy and the token cache. Also tested with the existing auth mechanisms (JWT token in a secret, SAS connection string).

Sharing test results below:

Notification controller logs showing events being sent to the event hub:

{"level":"info","ts":"2025-04-29T21:20:59.268Z","logger":"event-server","msg":"dispatching event","eventInvolvedObject":{"kind":"Kustomization","namespace":"default","name":"testconfig-kustomization-1","uid":"9917e19b-fdc9-466a-8a73-6fb819681f7f","apiVersion":"kustomize.toolkit.fluxcd.io/v1","resourceVersion":"32119512"},"message":"ConfigMap/game-demo namespace not specified: the server could not find the requested resource\n"}
{"level":"info","ts":"2025-04-29T21:26:00.024Z","logger":"event-server","msg":"dispatching event","eventInvolvedObject":{"kind":"Kustomization","namespace":"default","name":"testconfig-kustomization-1","uid":"9917e19b-fdc9-466a-8a73-6fb819681f7f","apiVersion":"kustomize.toolkit.fluxcd.io/v1","resourceVersion":"32121054"},"message":"ConfigMap/game-demo namespace not specified: the server could not find the requested resource\n"}
{"level":"info","ts":"2025-04-29T21:31:00.932Z","logger":"event-server","msg":"dispatching event","eventInvolvedObject":{"kind":"Kustomization","namespace":"default","name":"testconfig-kustomization-1","uid":"9917e19b-fdc9-466a-8a73-6fb819681f7f","apiVersion":"kustomize.toolkit.fluxcd.io/v1","resourceVersion":"32122584"},"message":"ConfigMap/game-demo namespace not specified: the server could not find the requested resource\n"}
{"level":"info","ts":"2025-04-29T21:36:01.805Z","logger":"event-server","msg":"dispatching event","eventInvolvedObject":{"kind":"Kustomization","namespace":"default","name":"testconfig-kustomization-1","uid":"9917e19b-fdc9-466a-8a73-6fb819681f7f","apiVersion":"kustomize.toolkit.fluxcd.io/v1","resourceVersion":"32124121"},"message":"ConfigMap/game-demo namespace not specified: the server could not find the requested resource\n"}
{"level":"info","ts":"2025-04-29T21:41:02.685Z","logger":"event-server","msg":"dispatching event","eventInvolvedObject":{"kind":"Kustomization","namespace":"default","name":"testconfig-kustomization-1","uid":"9917e19b-fdc9-466a-8a73-6fb819681f7f","apiVersion":"kustomize.toolkit.fluxcd.io/v1","resourceVersion":"32125656"},"message":"ConfigMap/game-demo namespace not specified: the server could not find the requested resource\n"}

Cache metrics:

curl localhost:5000/metrics | grep gotk_token_cache
# HELP gotk_token_cache_events_total Total number of cache retrieval events for a Gitops Toolkit resource reconciliation.
# TYPE gotk_token_cache_events_total counter
gotk_token_cache_events_total{event_type="cache_hit",kind="Provider",name="azure",namespace="default"} 272
gotk_token_cache_events_total{event_type="cache_miss",kind="Provider",name="azure",namespace="default"} 25
# HELP gotk_token_cache_evictions_total Total number of cache evictions.
# TYPE gotk_token_cache_evictions_total counter
gotk_token_cache_evictions_total 0
# HELP gotk_token_cache_requests_total Total number of cache requests partioned by success or failure.
# TYPE gotk_token_cache_requests_total counter
gotk_token_cache_requests_total{status="success"} 297
# HELP gotk_token_cached_items Total number of items in the cache.
# TYPE gotk_token_cached_items gauge
gotk_token_cached_items 1

Proxy logs showing a token being fetched about once per hour, since that is how long a token is valid:

kubectl logs proxy-server-f4c5b56db-bqwsw
2025/04/28 17:56:43 [001] INFO: Running 0 CONNECT handlers
2025/04/28 17:56:43 [001] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/28 18:56:54 [002] INFO: Running 0 CONNECT handlers
2025/04/28 18:56:54 [002] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/28 19:57:05 [003] INFO: Running 0 CONNECT handlers
2025/04/28 19:57:05 [003] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/28 20:55:16 [005] INFO: Running 0 CONNECT handlers
2025/04/28 20:55:16 [005] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/28 21:55:28 [006] INFO: Running 0 CONNECT handlers
2025/04/28 21:55:28 [006] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/28 22:55:40 [007] INFO: Running 0 CONNECT handlers
2025/04/28 22:55:40 [007] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/28 23:55:52 [008] INFO: Running 0 CONNECT handlers
2025/04/28 23:55:52 [008] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 00:56:02 [009] INFO: Running 0 CONNECT handlers
2025/04/29 00:56:02 [009] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 01:56:13 [010] INFO: Running 0 CONNECT handlers
2025/04/29 01:56:13 [010] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 03:00:25 [011] INFO: Running 0 CONNECT handlers
2025/04/29 03:00:25 [011] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 04:00:41 [012] INFO: Running 0 CONNECT handlers
2025/04/29 04:00:41 [012] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 05:00:53 [013] INFO: Running 0 CONNECT handlers
2025/04/29 05:00:53 [013] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 06:01:05 [014] INFO: Running 0 CONNECT handlers
2025/04/29 06:01:05 [014] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 07:05:17 [015] INFO: Running 0 CONNECT handlers
2025/04/29 07:05:17 [015] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 08:05:29 [016] INFO: Running 0 CONNECT handlers
2025/04/29 08:05:29 [016] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 09:05:41 [017] INFO: Running 0 CONNECT handlers
2025/04/29 09:05:41 [017] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 10:05:52 [018] INFO: Running 0 CONNECT handlers
2025/04/29 10:05:52 [018] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 11:06:02 [019] INFO: Running 0 CONNECT handlers
2025/04/29 11:06:02 [019] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 12:06:13 [020] INFO: Running 0 CONNECT handlers
2025/04/29 12:06:13 [020] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 13:10:25 [021] INFO: Running 0 CONNECT handlers
2025/04/29 13:10:25 [021] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 14:10:36 [022] INFO: Running 0 CONNECT handlers
2025/04/29 14:10:36 [022] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 15:10:48 [023] INFO: Running 0 CONNECT handlers
2025/04/29 15:10:48 [023] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 16:10:58 [024] INFO: Running 0 CONNECT handlers
2025/04/29 16:10:58 [024] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 17:11:10 [025] INFO: Running 0 CONNECT handlers
2025/04/29 17:11:10 [025] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 18:15:22 [026] INFO: Running 0 CONNECT handlers
2025/04/29 18:15:22 [026] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 19:15:34 [027] INFO: Running 0 CONNECT handlers
2025/04/29 19:15:34 [027] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 20:15:46 [028] INFO: Running 0 CONNECT handlers
2025/04/29 20:15:46 [028] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 21:15:57 [029] INFO: Running 0 CONNECT handlers
2025/04/29 21:15:57 [029] INFO: Accepting CONNECT to login.microsoftonline.com:443

@dipti-pai dipti-pai marked this pull request as draft April 29, 2025 22:31
@matheuscscp matheuscscp changed the title Implement managed identity support for Azure Event Hub provider [RFC-0010] Implement managed identity support for Azure Event Hub provider Apr 30, 2025
@stefanprodan stefanprodan added the area/alerting Alerting related issues and PRs label Apr 30, 2025
@dipti-pai dipti-pai force-pushed the azeventhub-mi-support branch from 2bbc7be to dc2fd41 Compare April 30, 2025 19:21
@dipti-pai dipti-pai force-pushed the azeventhub-mi-support branch 2 times, most recently from d30e98a to 54ceab7 Compare May 2, 2025 21:52
@dipti-pai dipti-pai marked this pull request as ready for review May 2, 2025 22:04
Member

@matheuscscp matheuscscp left a comment

Almost there 👌

I'd like to ask you to please investigate something. Here we are introducing a new operation performed by notification-controller against the Kubernetes API: we are now calling the TokenRequest API to issue a Kubernetes ServiceAccount token for the cloud provider STS exchange. This requires the create verb for the (sub)resource serviceaccounts/token, granted through a ClusterRoleBinding (i.e. for all namespaces). Please check what RBAC permissions notification-controller has in order to be able to perform this operation. I suspect the obvious: it has cluster-admin, like kustomize-controller.

@@ -108,6 +108,11 @@ type ProviderSpec struct {
// +optional
SecretRef *meta.LocalObjectReference `json:"secretRef,omitempty"`

// ServiceAccountName is the name of the service account used to
// authenticate with services from cloud providers.
Member

Let's add the behavior we discussed here. I like to write good docs for the API struct fields because they show up in the output of kubectl explain provider.spec.serviceAccountName 🙏

@dipti-pai
Member Author

Here we are introducing a new operation performed by notification-controller against the Kubernetes API: we are now calling the TokenRequest API to issue a Kubernetes ServiceAccount token for the cloud provider STS exchange. This requires the create verb for the (sub)resource serviceaccounts/token, granted through a ClusterRoleBinding (i.e. for all namespaces). Please check what RBAC permissions notification-controller has in order to be able to perform this operation. I suspect the obvious: it has cluster-admin, like kustomize-controller.

I meant to include a comment about this. In my install, I had to add a new rule to notification-controller's ClusterRole for this permission. I used the ARC extension for Flux to test this end-to-end and had to extend the permissions there. To work with Flux bootstrap, do we need to extend the RBAC permissions somewhere, perhaps here? Thanks.

- apiGroups:
  - ""
  resources:
  - serviceaccounts/token
  verbs:
  - create
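
To make the effect of this rule concrete, the sketch below checks whether a set of rules permits `create` on `serviceaccounts/token`. It is an illustration only: `policyRule` is a local stand-in for the Kubernetes RBAC rule type, `allows` does exact matching and ignores wildcards, resource names and the other details of real RBAC evaluation, and the "before" rules are copied from the kubebuilder markers visible in this PR's diff rather than from the full ClusterRole.

```go
package main

import "fmt"

// policyRule is a local stand-in mirroring the three RBAC rule fields
// used in the YAML above. It is not the Kubernetes API type.
type policyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// contains reports whether want appears in list (exact match only).
func contains(list []string, want string) bool {
	for _, item := range list {
		if item == want {
			return true
		}
	}
	return false
}

// allows reports whether any rule grants verb on group/resource.
// Simplification: no wildcard or resourceNames handling.
func allows(rules []policyRule, group, resource, verb string) bool {
	for _, r := range rules {
		if contains(r.APIGroups, group) && contains(r.Resources, resource) && contains(r.Verbs, verb) {
			return true
		}
	}
	return false
}

func main() {
	before := []policyRule{
		{APIGroups: []string{""}, Resources: []string{"secrets"}, Verbs: []string{"get", "list", "watch"}},
		{APIGroups: []string{""}, Resources: []string{"events"}, Verbs: []string{"create", "patch"}},
	}
	// The rule proposed in the comment above.
	after := append(before, policyRule{
		APIGroups: []string{""},
		Resources: []string{"serviceaccounts/token"},
		Verbs:     []string{"create"},
	})
	fmt.Println(allows(before, "", "serviceaccounts/token", "create")) // false
	fmt.Println(allows(after, "", "serviceaccounts/token", "create"))  // true
}
```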

- If no authentication token is specified in the provider, attempt to obtain one using workload identity.
- Add a new field, .spec.serviceAccountName, to support multi-tenant workload identity as defined in RFC-0010, so that a provider can use an identity bound to a service account other than notification-controller's.
- Use the proxy to get the token if one is specified in the provider spec.
- Cache the tokens if caching is enabled in the notification-controller options.
- If the address contains a SAS connection string, use it for authentication; this takes priority over token authentication.
- If a static JWT token is specified in the secret reference, use it for authentication; this takes priority over a token acquired through workload identity.
- Add unit tests for the 3 authentication mechanisms (SAS, JWT, managed identity).
- Add documentation for the single-tenant and multi-tenant approaches to workload identity with the azureeventhub provider.
- Add operation post to github helpers and provider controller for cache event metrics

Signed-off-by: Dipti Pai <[email protected]>
@dipti-pai dipti-pai force-pushed the azeventhub-mi-support branch from 54ceab7 to 6d514d6 Compare May 2, 2025 23:20
@matheuscscp
Member

matheuscscp commented May 2, 2025

Here we are introducing a new operation performed by notification-controller against the Kubernetes API: we are now calling the TokenRequest API to issue a Kubernetes ServiceAccount token for the cloud provider STS exchange. This requires the create verb for the (sub)resource serviceaccounts/token, granted through a ClusterRoleBinding (i.e. for all namespaces). Please check what RBAC permissions notification-controller has in order to be able to perform this operation. I suspect the obvious: it has cluster-admin, like kustomize-controller.

I meant to include a comment about this. In my install, I had to add a new rule to notification-controller's ClusterRole for this permission. I used the ARC extension for Flux to test this end-to-end and had to extend the permissions there. To work with Flux bootstrap, do we need to extend the RBAC permissions somewhere, perhaps here? Thanks.

- apiGroups:
  - ""
  resources:
  - serviceaccounts/token
  verbs:
  - create

Yes, that looks like the right place, but I don't see how it binds to the notification-controller ServiceAccount 🤔 I think it's through the ClusterRoleBinding here with a bit of magic, but I'm not sure. @stefanprodan can you please confirm?

Edit: I believe config/rbac/role.yaml is indeed the right place but we need to use controller-gen to add it, just add this line in the provider controller and run make manifests:

diff --git a/internal/controller/provider_controller.go b/internal/controller/provider_controller.go
index 1f7d0f9..5bca247 100644
--- a/internal/controller/provider_controller.go
+++ b/internal/controller/provider_controller.go
@@ -35,6 +35,7 @@ import (
 // +kubebuilder:rbac:groups=notification.toolkit.fluxcd.io,resources=providers,verbs=get;list;watch;create;update;patch;delete
 // +kubebuilder:rbac:groups="",resources=secrets,verbs=get;list;watch
 // +kubebuilder:rbac:groups="",resources=events,verbs=create;patch
+// +kubebuilder:rbac:groups="",resources=serviceaccounts/token,verbs=create
 
 // ProviderReconciler reconciles a Provider object to migrate it to static
 // Provider.
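
To show what the added marker encodes, here is a toy parser that turns an rbac marker line into its groups, resources and verbs, which are the same three lists that appear in the YAML rule quoted earlier. This is not how controller-gen is implemented; `rbacMarker` and `parseRBACMarker` are hypothetical names, and the parser handles only the simple `key=value` form seen in this diff.

```go
package main

import (
	"fmt"
	"strings"
)

// rbacMarker holds the three fields used by the markers in this diff.
type rbacMarker struct {
	Groups    []string
	Resources []string
	Verbs     []string
}

// parseRBACMarker is a simplified, hypothetical parser: fields are
// separated by "," and multiple values within a field by ";".
func parseRBACMarker(line string) (rbacMarker, error) {
	const prefix = "+kubebuilder:rbac:"
	idx := strings.Index(line, prefix)
	if idx < 0 {
		return rbacMarker{}, fmt.Errorf("not an rbac marker: %q", line)
	}
	var m rbacMarker
	for _, field := range strings.Split(line[idx+len(prefix):], ",") {
		key, value, ok := strings.Cut(field, "=")
		if !ok {
			return rbacMarker{}, fmt.Errorf("malformed field %q", field)
		}
		// groups="" denotes the core API group, i.e. the empty string.
		values := strings.Split(strings.Trim(value, `"`), ";")
		switch strings.TrimSpace(key) {
		case "groups":
			m.Groups = values
		case "resources":
			m.Resources = values
		case "verbs":
			m.Verbs = values
		}
	}
	return m, nil
}

func main() {
	m, err := parseRBACMarker(`// +kubebuilder:rbac:groups="",resources=serviceaccounts/token,verbs=create`)
	if err != nil {
		panic(err)
	}
	fmt.Printf("groups=%q resources=%q verbs=%q\n", m.Groups, m.Resources, m.Verbs)
}
```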

Edit 2: Actually I don't see the rules from role.yaml in the output of flux install --export, @stefanprodan how does this work?

Edit 3: Looking at the output of flux install --export there are only two ClusterRoleBinding objects: one that gives cluster-admin to kustomize-controller and helm-controller, and one that gives crd-controller-flux-system to all controllers. I opened a PR to add the required permission here, but for consistency I think we should also add the +kubebuilder directive above and run make manifests anyway.

@@ -88,6 +94,13 @@ type Factory struct {
// Option represents a functional option for configuring a notifier.
type Option func(*notifierOptions)

// WithContext sets the context for the notifier.
func WithContext(ctx context.Context) Option {
Member

Let's remove this function and pass the ctx in the NewFactory constructor

@matheuscscp
Member

Hi Dipti 👋

We have finally released fluxcd/pkg/[email protected] and fluxcd/pkg/[email protected]:

https://github.com/fluxcd/kustomize-controller/compare/c413d479c373425ed46ea8704cf38b0afd42c066..f4c2d12eb3e3ea6986257f6f03b59b540a3baf7e

Please update this PR accordingly 🙏

Please also enable the token cache by default, see this comment from Stefan:

fluxcd/kustomize-controller#1426 (comment)

Labels
area/alerting Alerting related issues and PRs

Successfully merging this pull request may close these issues.

Add managed identity support of Azure Event Hub provider in notification-controller
3 participants