Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Conditions and Terminal errors #2379

Open
EmilienM opened this issue Jan 20, 2025 · 5 comments
Open

Improve Conditions and Terminal errors #2379

EmilienM opened this issue Jan 20, 2025 · 5 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@EmilienM
Copy link
Contributor

EmilienM commented Jan 20, 2025

/kind feature

Context and Background

As part of the initiative to improve status reporting in Cluster API (CAPI) resources, significant changes will be introduced to how resource statuses are handled in the Cluster API Provider for OpenStack (CAPO).

One major change involves phasing out the FailureReason and FailureMessage fields in favor of leveraging Kubernetes Conditions to encapsulate terminal failures and lifecycle statuses. Terminal failures, though unique to CAPI, can be effectively communicated through well-defined conditions, using explicit type and reason values to represent fatal issues. This shift aligns CAPO with Kubernetes conventions and ensures that error states are consistently and clearly conveyed.

Key Updates and Behavior Changes

  1. Handling Errors with Conditions
  • Transient Errors: Errors caused by temporary issues (e.g., Neutron API unavailability) will update the Progressing condition to True with a Reason such as TransientError and a clear Message. These errors will trigger reconciliation retries using exponential backoff, allowing the system to self-recover without manual intervention.
  • Terminal Errors: Errors caused by invalid requests (e.g., HTTP 400 responses) will set Progressing=False and a Reason such as TerminalError. These errors will stop reconciliation, and users will be notified via a human-readable condition message.
  1. Lifecycle Management via Conditions
  • Non-Recoverable Conditions: Objects in a terminal state (e.g., due to unrecoverable infrastructure issues) will not be reconciled further.
  • Temporary Conditions: Objects with transient issues will continue reconciliation until resolved or escalated to a terminal state.
  1. Immutable vs. Mutable Resource Behavior
  • Immutable Resources (OpenStackMachine, OpenStackServer): Readiness will be set to Provisioned once all Conditions are met with no failures, it won't be able to change anymore. However, Conditions will reflect key events such as deletion failures.
  • Mutable Resources (OpenStackCluster): These resources may experience condition changes, reflecting updates or failures after modification (e.g., issues arising from adding a security group while the Neutron API is unresponsive).

Known Issues and Areas for Improvement

Several existing issues highlight gaps in handling terminal failures or reflect inconsistent status behavior. This enhancement will address the following key issues:

  • Issue #2146: Terminal failures are either not identified or incorrectly reported.
  • Issue #2185: Missing conditions in critical resource workflows.
  • Issue #2264: Inconsistent handling of fatal errors in OpenStackMachine.
  • Issue #2265: Status fields are not aligned with the proposed lifecycle management.
  • Issue #2404: Panic if instance was deleted in openstack manually.

Summary

By aligning CAPO with CAPI’s improved status reporting and transitioning to a condition-driven model, this enhancement will:

  • Provide clearer, more actionable resource statuses.
  • Reduce ambiguity in handling terminal failures.
  • Improve lifecycle management for immutable and mutable resources.
  • Address existing gaps and inconsistencies in error reporting.
@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 20, 2025
@EmilienM EmilienM changed the title OpenStackCluster: improve Conditions OpenStackCluster: improve Conditions and Terminal errors Jan 21, 2025
@EmilienM EmilienM changed the title OpenStackCluster: improve Conditions and Terminal errors Improve Conditions and Terminal errors Jan 21, 2025
@lentzi90
Copy link
Contributor

Terminal Errors: Errors caused by invalid requests (e.g., HTTP 400 responses) will set Progressing=False and a Reason such as TerminalError. These errors will stop reconciliation, and users will be notified via a human-readable condition message.

Would/should there be a way for users to indicate that a retry should be made? Personally I get quite annoyed when I forget to create the identity ref secret or make a typo in some image name if that requires a full re-creation of the cluster.

@EmilienM
Copy link
Contributor Author

Terminal Errors: Errors caused by invalid requests (e.g., HTTP 400 responses) will set Progressing=False and a Reason such as TerminalError. These errors will stop reconciliation, and users will be notified via a human-readable condition message.

Would/should there be a way for users to indicate that a retry should be made? Personally I get quite annoyed when I forget to create the identity ref secret or make a typo in some image name if that requires a full re-creation of the cluster.

We could create a condition for that? Retriable (default not set): True or False ?

@lentzi90
Copy link
Contributor

Sure! I guess my question is that is it better to have the terminal error and then potentially need a way to retry, or is it better to just rely on the exponential backoff and make all errors transient?

@mdbooth
Copy link
Contributor

mdbooth commented Jan 23, 2025

The difference between TerminalError and old CAPI Failure conditions is that TerminalError is ephemeral. Throwing a TerminalError just means that at the top level of the reconciler we won't schedule another reconcile, so reconciliation will stop.

However, if anything happens to trigger a reconcile anyway, we will still reconcile the object. e.g. If you messed up the credentials and update them, that should trigger another reconcile.

@lentzi90
Copy link
Contributor

Ah that is excellent!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature.
Projects
Status: Inbox
Development

No branches or pull requests

4 participants