Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Controling resource deletions #922

Open
vavdoshka opened this issue Dec 13, 2024 · 9 comments
Open

Controling resource deletions #922

vavdoshka opened this issue Dec 13, 2024 · 9 comments

Comments

@vavdoshka
Copy link

vavdoshka commented Dec 13, 2024

Is your feature request related to a problem? Please describe.

We are at early stage of adopting the messaging-topology-operator and graduating it's usage to production. Currently removal of a k8s resource triggers the removal of that object in RMQ, we don't feel it safe for our production workloads and would like to control the removal behavior.

Describe the solution you'd like

As a flag provided in runtime controlling resource deletion behavior (and/or) on the resource level.

Describe alternatives you've considered

Using admission controller to prevent resource removal

Additional context

topology-operator can reconcile resources which were requested previously, so it seems the missing piece is logic to ignore delete.
relevant request on rmq-server side rabbitmq/rabbitmq-server#12772

@Zerpet
Copy link
Contributor

Zerpet commented Dec 16, 2024

A flag to skip deletion of topology objects defeats the point of having an Operator in the first place. The Operator is meant to abstract the management and lifecycle of topology objects in RabbitMQ. Skipping deletion removes a fundamental part of the Operator functionality, and would result in a mismatch between the Kubernetes state and RabbitMQ state; in other words, it would result in objects not present in Kubernetes, and present in RabbitMQ.

@vavdoshka
Copy link
Author

Introducing a flag to control resource deletion behavior enhances the safety and flexibility of the Messaging Topology Operator without undermining its purpose. Many successful operators, like Crossplane, ArgoCD, Cert-Manager and others provide similar mechanisms to address real-world production needs, ensuring critical resources are not inadvertently deleted. While mismatches between Kubernetes and RabbitMQ states are a concern, they can be mitigated through logging and reconciliation mechanisms. Additionally, we already tested the reconciliation of resources created by the Operator after simulating the orphaned deletion and the existing implementation successfully detected and managed the orphaned resource.

Being used wisely this feature should allow easier and broader adoption of the operator for the both teams being early in their k8s journey and teams managing very sensitive RMQ workloads.

@ansd
Copy link
Member

ansd commented Dec 20, 2024

@vavdoshka Why can't you simply disallow deletion of these K8s objects via RBAC (Role Based Access Control)?

@ansd
Copy link
Member

ansd commented Dec 20, 2024

Also, please reference where these other projects support this feature.

A quick Google search suggests that this is an alpha feature in Crossplane:
https://docs.crossplane.io/latest/concepts/usages/

It's an open issue in ArgoCD:
argoproj/argo-cd#7301
argoproj/argo-cd#16063

@Zerpet
Copy link
Contributor

Zerpet commented Dec 20, 2024

Introducing a flag to control resource deletion behavior enhances the safety and flexibility of the Messaging Topology Operator without undermining its purpose.

I disagree. The mission of the Operator is to reconcile Kubernetes state into RabbitMQ. This flag is completely opposed to that mission.

ensuring critical resources are not inadvertently deleted

How can you inadvertently delete a resource when a deletion requires a explicit kubectl delete action, or an API DELETE request? It sounds like we are trying to fix human carelessness, and inherently unfixable problem.

Being used wisely this feature should allow easier and broader adoption of the operator for the both teams being early in their k8s journey and teams managing very sensitive RMQ workloads.

Given that this is the first time this feature has been requested in the ~4 years of existence of this Operator, I'm highly skeptical that it will help to spread adoption.

@vavdoshka
Copy link
Author

Thanks for the feedback. As you rightfully mentioned, there is a logical mistake in mocking the delete; however, it is not exactly what I propose. I propose to let users decide whether to optionally retain the resources that might contain a state influenced by factors outside the operator's responsibility or concern, i.e., queues or streams with data. Optional retain behavior, similar to what K8s implements for PVC Reclaim Policy. Because there might be use cases where deletion is appropriate and, in some cases, not so much. In addition one use-case which we did not face yet but fesable is migration (split/merge) of resources from one namespace to another.

In terms of what currently supports these use cases, there are two schools indeed, for example, the Kafka Strimzi operator does not support this, while at the same time:

We are trying to understand if maintainers would be opened to such contribution at current moment in order to plan accordingly the actions on our side. Thanks for this great Operator and your feedback.

@Zerpet
Copy link
Contributor

Zerpet commented Jan 3, 2025

I propose to let users decide whether to optionally retain the resources that might contain a state influenced by factors outside the operator's responsibility or concern, i.e., queues or streams with data. Optional retain behavior, similar to what K8s implements for PVC Reclaim Policy.

This sounds perfectly reasonable. I'm compelled by the reclaim policy analogy. I agree it makes sense to tweak the "reclaim" behaviour for resources that contain data, in other words: queues, streams and vhosts. A vhost does not contain data per-se, but its sub-resources (queues) do. Other resources that do not contain data e.g. policies, I'm not sure about implementing a reclaim policy.

We are trying to understand if maintainers would be opened to such contribution at current moment in order to plan accordingly the actions on our side.

I'm happy to go forth with a reclaim policy feature for Queue and VHost objects.

Any thoughts @mkuratczyk @ansd ?

@michaelklishin
Copy link
Member

Policies, runtime parameters, virtual host limits do not need a reclaim policy. It is extremely rare to see those things consuming a non-trivial amount of either memory or disk space.

Shovels and Federation upstreams, both based on runtime parameters, might need it because a shovel or a federation link can be configured to have a very high prefetch.

Users and permissions need to be treated differently because their rotation is typically not driven by how much memory or compute resources they take but rather by the credential rotation policy, which can vary greatly from deployment to deployment. So I'd leave those two resources out, at first or until/if something more suitable for them is designed and proposed.

@michaelklishin
Copy link
Member

I'm happy to go forth with a reclaim policy feature for Queue and VHost objects

I assume "queues' implicitly includes streams because they are considered a queue type, not a separate entity entirely, even though in most ways they are.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants