Commit 72799a5

DebakelOrakel, bastjan, haasad, and simu authored
Architecture: Generic, template-able resource reconciler (Espejote) (#368)
Co-authored-by: Sebastian Widmer <[email protected]>
Co-authored-by: Adrian Haas <[email protected]>
Co-authored-by: Simon Gerber <[email protected]>
1 parent 582f11c commit 72799a5

3 files changed

+600
-0
lines changed

@@ -0,0 +1,84 @@
= Manage Operator Managed PrometheusRules

== Problem

OpenShift operators manage their own PrometheusRules (and the alerts defined in them), so we can't alter their definitions.
The current solution is to find the upstream rules in the source repositories, copy those rules, label the alerts we consider useful with `syn=true`, and silence alerts without that label.
This is a manual process, as the source location may change with every change in the upstream repository.
Some rules exist only embedded in Go code.
Rollout of new PrometheusRules must be coordinated with the corresponding change in the operator.

=== Goals

* Automatically copy and label the operator-managed PrometheusRules

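To make the goal concrete, the copy-and-label step could look roughly like the following Go sketch.
The `Rule` type and the `copyAndLabel` function are hypothetical, simplified stand-ins for illustration only; they aren't the prometheus-operator API types, and which alerts receive the label would be a configuration decision.

```go
package main

// Rule is a simplified, hypothetical stand-in for one alerting rule of a
// PrometheusRule. It is NOT the prometheus-operator API type.
type Rule struct {
	Alert  string
	Expr   string
	Labels map[string]string
}

// copyAndLabel returns a copy of the given rules. Alerts whose name is listed
// in keep additionally get the label syn=true. The input is never modified.
func copyAndLabel(rules []Rule, keep map[string]bool) []Rule {
	out := make([]Rule, 0, len(rules))
	for _, r := range rules {
		c := Rule{
			Alert:  r.Alert,
			Expr:   r.Expr,
			Labels: make(map[string]string, len(r.Labels)+1),
		}
		// Copy the existing labels so the source rule stays untouched.
		for k, v := range r.Labels {
			c.Labels[k] = v
		}
		if keep[r.Alert] {
			c.Labels["syn"] = "true"
		}
		out = append(out, c)
	}
	return out
}
```

Alerts that end up without the `syn` label would then still be silenced, as in the current manual process.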

== Proposals

=== Option 1: Use a policy tool

We could evaluate a policy tool that helps us meet our requirements.
Such a tool could also help with other tasks we may want to automate.

The policy tools we've evaluated in the past, like Kyverno, have a lot of features that we don't need.
Those features make the tool more complex to use and run than necessary.

=== Option 2: Create our own dedicated controller

We can create our own dedicated operator that watches for changes in OpenShift operator-managed PrometheusRules and dynamically copies, updates, and labels these alerts.

Implementing a dedicated operator for managing these PrometheusRules would be straightforward.
We have already implemented other controllers and operators in situations where we ran into limitations of existing tools.

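The core decision such a dedicated controller would make on every change can be sketched in a few lines of Go.
This is only an illustration under assumptions: `Object`, `Action`, and `reconcile` are hypothetical names, the sketch works on in-memory values, and a real controller would use a Kubernetes client and watch machinery instead.

```go
package main

import "reflect"

// Action is what one reconcile pass decides to do with the managed copy.
type Action string

const (
	ActionNone   Action = "none"
	ActionCreate Action = "create"
	ActionUpdate Action = "update"
	ActionDelete Action = "delete"
)

// Object is a hypothetical, minimal stand-in for a PrometheusRule resource.
type Object struct {
	Name   string
	Labels map[string]string
	Spec   map[string]any
}

// reconcile derives the desired copy from the operator-managed source and
// compares it with the copy that currently exists (nil if there is none).
// It returns the action to take and, for create/update, the desired object.
func reconcile(source, existing *Object, derive func(Object) Object) (Action, *Object) {
	if source == nil {
		// The source is gone: clean up the copy if one exists.
		if existing == nil {
			return ActionNone, nil
		}
		return ActionDelete, nil
	}
	desired := derive(*source)
	if existing == nil {
		return ActionCreate, &desired
	}
	// Note: DeepEqual treats a nil map and an empty map as different.
	if reflect.DeepEqual(desired, *existing) {
		return ActionNone, nil
	}
	return ActionUpdate, &desired
}
```

Keeping the decision a pure function of source and existing copy makes it easy to test without a cluster.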

=== Option 3: Create a more generalized copy/patch operator

We've got quite a few other edge cases where we need to copy or patch resources based on other resources.
We use a mix of custom scripts, cron jobs, controllers, and other tools to solve those problems.
We could implement a more generalized copy/patch operator that could be used for other resources as well.

This would allow us to replace multiple tools and lower the operational overhead of tracking and rolling out upstream changes to those tools.

By using Jsonnet as a templating engine we can create a very powerful and flexible tool that can be used for many different use cases.

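To illustrate what "patching resources based on other resources" means in a resource-agnostic way, here is a minimal Go sketch of a generic patch function.
It isn't the proposed Jsonnet templating; it swaps in a simple merge with JSON Merge Patch (RFC 7386) semantics on untyped maps, and the name `mergePatch` is hypothetical.

```go
package main

// mergePatch applies patch to target with JSON Merge Patch (RFC 7386)
// semantics: nested objects are merged recursively, a nil value removes the
// key, and any other value replaces the existing one.
// The target is not modified. Nested values that the patch does not touch are
// shared with the target rather than deep-copied.
func mergePatch(target, patch map[string]any) map[string]any {
	out := make(map[string]any, len(target))
	for k, v := range target {
		out[k] = v
	}
	for k, pv := range patch {
		if pv == nil {
			delete(out, k)
			continue
		}
		pm, patchIsMap := pv.(map[string]any)
		if !patchIsMap {
			out[k] = pv
			continue
		}
		// tm is nil if the key is missing or does not hold an object; the
		// patch object then replaces whatever was there.
		tm, _ := out[k].(map[string]any)
		out[k] = mergePatch(tm, pm)
	}
	return out
}
```

Because the function works on `map[string]any`, the same code could add a label to a PrometheusRule or change a field on any other resource; a templating engine would generate the patch instead of hard-coding it.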

=== Option 4: Use Crossplane Compositions

[quote, 'https://docs.crossplane.io/v1.19/concepts/compositions/[Crossplane documentation]']
____
Compositions are a template for creating multiple managed resources as a single object.
____

We could use Crossplane Compositions to create a template for creating PrometheusRules.

Composition functions allow Go code to be executed to generate resources.
This would allow us to template resources in a fully fledged programming language.
Crossplane was primarily designed to manage external resources.
It's a CNCF project and moved to `Incubating` status in 2021.

Using Crossplane comes with a huge overhead in both learning and operational costs.
We would need to learn a complex new framework and tooling.
Since functions need to be compiled and deployed, the iteration cycle is much slower and problems are harder to debug.
Composition functions don't always seem to be enough, and `provider-kubernetes` is also required.
We're not sure how well Crossplane handles resources primarily managed by an external party and how well server-side apply works.

While we'd use Go for functions, there's still a fair amount of YAML that needs to be written.
This undermines the biggest benefit, namely having the full Go testing and linting toolchain available.

VSHN's flagship project, Servala, also uses Crossplane behind the scenes.
Servala is installed on almost every cluster, and we'd most likely need to solve interdependency issues between the two projects.

Crossplane constantly struggles with performance issues and the complexity of the project.
See https://github.com/crossplane-contrib/provider-kubernetes/issues/316[provider-kubernetes issue 316] for an example.

https://vshnwiki.atlassian.net/wiki/spaces/VST/pages/757635/Crossplane+Review[Internal reviews] of Crossplane also note the complexity of compositions, the steep learning curve, and the issues with debugging.
It's a https://kb.vshn.ch/app-catalog/adr/0021-composition-function-error-handling.html[footgun] that's loaded and has the safety off.

== Decision

We decided to implement our own generalized copy/patch operator.

== Rationale

By implementing our own generalized copy/patch operator we can adapt better to changes in the upstream PrometheusRules.

Creating or patching resources based on other resources is an issue we encounter constantly.
We already have tools in place to solve those problems, but each of them addresses a special case, and these cases could be unified under a more general approach.
This would allow us to replace multiple tools and lower the operational overhead of tracking and rolling out upstream changes to those tools.
