Slow refresh of informer cache results in delayed processing of Etcd resources #898
Labels
area/control-plane
Control plane related
kind/bug
Bug
status/accepted
Issue was accepted as something we need to work on
Comments
This issue was first observed in the g/g (gardener/gardener) e2e tests. See issue: gardener/gardener#10739
unmarshall added a commit to unmarshall/etcd-druid that referenced this issue on Oct 29, 2024
anveshreddy18 pushed a commit to anveshreddy18/etcd-druid that referenced this issue on Oct 29, 2024: …en (gardener#900) reconciled before (gardener#898)
unmarshall added a commit that referenced this issue on Oct 29, 2024: …en (#900) (#904) reconciled before (#898) Co-authored-by: Madhav Bhargava <[email protected]>
How to categorize this issue?
/area control-plane
/kind bug
What happened:
A new Etcd resource is created. Since etcd-reconciler is watching for Etcd events, it gets a Create event. This event is allowed in. During the reconciliation loop an attempt is made to get the resource:
etcd-druid/internal/controller/etcd/reconciler.go
Lines 135 to 137 in df3ff21
It is possible that the informer caches are not yet updated, in which case client.Get returns a NotFound error. This results in the following:
etcd-druid/internal/controller/utils/reconciler.go
Lines 42 to 44 in df3ff21
The reconciler is short-circuited and no further processing is done.
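The short circuit described above can be illustrated with a minimal, dependency-free sketch. It is not the etcd-druid code: staleCache, reconcileResult and reconcile are hypothetical stand-ins for the informer cache, the reconcile outcome and the reconcile loop, and errNotFound stands in for the NotFound API error. The assumption is that a NotFound from the cache-backed Get is treated like a deleted object, so the pass ends without a requeue.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the NotFound error returned by a cache-backed client (hypothetical).
var errNotFound = errors.New("not found")

// staleCache is a hypothetical stand-in for the informer cache: it only
// knows about objects that have already been synced into it.
type staleCache struct {
	objects map[string]bool
}

// get mimics a cache-backed Get: objects not yet synced look like they do not exist.
func (c *staleCache) get(name string) error {
	if !c.objects[name] {
		return errNotFound
	}
	return nil
}

// reconcileResult describes what a single reconcile pass did.
type reconcileResult struct {
	processed bool // true if the pass got beyond the initial Get
	requeue   bool // true if the request is scheduled for a retry
}

// reconcile mirrors the behaviour described in this issue (an assumption,
// not a copy of the real code): NotFound ends the pass without a requeue.
func reconcile(c *staleCache, name string) (reconcileResult, error) {
	if err := c.get(name); err != nil {
		if errors.Is(err, errNotFound) {
			// Treated as "the object is gone": stop and do not retry.
			return reconcileResult{processed: false, requeue: false}, nil
		}
		return reconcileResult{requeue: true}, err
	}
	return reconcileResult{processed: true}, nil
}

func main() {
	cache := &staleCache{objects: map[string]bool{}}

	// The Create event arrives before the cache contains the object.
	first, _ := reconcile(cache, "etcd-main")
	fmt.Printf("before cache sync: %+v\n", first)

	// The cache catches up, but only a new event would trigger this second pass.
	cache.objects["etcd-main"] = true
	second, _ := reconcile(cache, "etcd-main")
	fmt.Printf("after cache sync: %+v\n", second)
}
```

The first pass neither processes nor requeues the request, so the resource stays unreconciled until some later event passes the predicates.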
The default cache resync period is 10 hours, but in the case of gardener, it reconciles again, and with every reconcile it adds the reconcile annotation to the Etcd resource.
See here.
This generates another event much sooner than the default cache resync period of 10 hours, giving etcd-druid another chance to reconcile the resource. However, this event gets filtered out and is not processed. See:
etcd-druid/internal/controller/etcd/register.go
Lines 53 to 75 in df3ff21
r.hasReconcileAnnotation() is true since gardener adds the reconcile annotation.
specUpdated() is false as there is no change to the spec in this event.
lastReconcileHasFinished() is false since the first time around the event was not processed, so no status is present yet.
r.autoReconcileEnabled() is false as it is not auto-reconciled.
As a consequence, the onReconcileAnnotationSetPredicate predicate will evaluate to false and the autoReconcileOnSpecChangePredicate predicate will evaluate to false, thus rejecting the event.
The result is that for a long time after the Etcd resource is created, it does not get reconciled. This is time-sensitive: it all depends on how fast the informer cache is updated, how late the create event arrives, and whether the first create event gets processed.
What you expected to happen:
The predicate should be improved to allow subsequent update events even if the spec has not changed, especially when there is no status (indicating that the resource was never reconciled). For the gardener use case an update event will be received much sooner, but we also need to solve this for non-gardener use cases, which depend on cache.SyncPeriod, set to 10 hours by default.
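To make the predicate evaluation and the proposed relaxation concrete, here is a minimal sketch using plain booleans instead of controller-runtime predicates. The way the four checks are combined in currentPredicate is an assumption inferred from the description in this issue, not a copy of register.go. The hasStatus field and relaxedPredicate are hypothetical and show only one possible shape of the improvement.

```go
package main

import "fmt"

// updateEvent captures the facts this issue lists for an update event.
type updateEvent struct {
	hasReconcileAnnotation   bool
	specUpdated              bool
	lastReconcileHasFinished bool
	autoReconcileEnabled     bool
	hasStatus                bool // hypothetical: false if the resource was never reconciled
}

// currentPredicate models the existing filtering. The combination of the
// checks is ASSUMED from the issue text, not taken from the real code.
func currentPredicate(e updateEvent) bool {
	onReconcileAnnotationSet := e.hasReconcileAnnotation &&
		(e.lastReconcileHasFinished || e.specUpdated)
	autoReconcileOnSpecChange := e.autoReconcileEnabled && e.specUpdated
	return onReconcileAnnotationSet || autoReconcileOnSpecChange
}

// relaxedPredicate is a hypothetical improvement: it additionally admits
// update events for resources that have no status yet.
func relaxedPredicate(e updateEvent) bool {
	neverReconciled := !e.hasStatus
	return currentPredicate(e) || neverReconciled
}

func main() {
	// The situation from this issue: annotation set, no spec change,
	// no finished reconcile, auto-reconcile off, no status.
	stuck := updateEvent{hasReconcileAnnotation: true}
	fmt.Println("current predicate admits event:", currentPredicate(stuck))
	fmt.Println("relaxed predicate admits event:", relaxedPredicate(stuck))
}
```

With the relaxed variant, the update event for a never-reconciled resource is admitted, while events for resources that already have a status are filtered exactly as before.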
How to reproduce it (as minimally and precisely as possible):
It is not always possible to reproduce. Create multiple etcd clusters via local gardener; for one or more of them you will see that the cluster is not reconciled until a long time has passed.