Improve Conditions and Terminal errors #2379

EmilienM · 2025-01-20T23:48:52Z

/kind feature

Context and Background

As part of the initiative to improve status reporting in Cluster API (CAPI) resources, significant changes will be introduced to how resource statuses are handled in the Cluster API Provider for OpenStack (CAPO).

One major change involves phasing out the FailureReason and FailureMessage fields in favor of leveraging Kubernetes Conditions to encapsulate terminal failures and lifecycle statuses. Terminal failures, though unique to CAPI, can be effectively communicated through well-defined conditions, using explicit type and reason values to represent fatal issues. This shift aligns CAPO with Kubernetes conventions and ensures that error states are consistently and clearly conveyed.

Key Updates and Behavior Changes

Handling Errors with Conditions

Transient Errors: Errors caused by temporary issues (e.g., Neutron API unavailability) will set the Progressing condition to True with a Reason such as TransientError and a clear Message. These errors will trigger reconciliation retries using exponential backoff, allowing the system to self-recover without manual intervention.
Terminal Errors: Errors caused by invalid requests (e.g., HTTP 400 responses) will set Progressing=False and a Reason such as TerminalError. These errors will stop reconciliation, and users will be notified via a human-readable condition message.

Lifecycle Management via Conditions

Non-Recoverable Conditions: Objects in a terminal state (e.g., due to unrecoverable infrastructure issues) will not be reconciled further.
Temporary Conditions: Objects with transient issues will continue reconciliation until resolved or escalated to a terminal state.

Immutable vs. Mutable Resource Behavior

Immutable Resources (OpenStackMachine, OpenStackServer): Readiness will be set to Provisioned once all Conditions are met with no failures; after that, it can no longer change. However, Conditions will still reflect key events such as deletion failures.
Mutable Resources (OpenStackCluster): These resources may experience condition changes, reflecting updates or failures after modification (e.g., issues arising from adding a security group while the Neutron API is unresponsive).

Known Issues and Areas for Improvement

Several existing issues highlight gaps in handling terminal failures or reflect inconsistent status behavior. This enhancement will address the following key issues:

Issue #2146: Terminal failures are either not identified or incorrectly reported.
Issue #2185: Missing conditions in critical resource workflows.
Issue #2264: Inconsistent handling of fatal errors in OpenStackMachine.
Issue #2265: Status fields are not aligned with the proposed lifecycle management.
Issue #2404: Panic if instance was deleted in openstack manually.

Summary

By aligning CAPO with CAPI’s improved status reporting and transitioning to a condition-driven model, this enhancement will:

Provide clearer, more actionable resource statuses.
Reduce ambiguity in handling terminal failures.
Improve lifecycle management for immutable and mutable resources.
Address existing gaps and inconsistencies in error reporting.

lentzi90 · 2025-01-23T13:23:04Z

Terminal Errors: Errors caused by invalid requests (e.g., HTTP 400 responses) will set Progressing=False and a Reason such as TerminalError. These errors will stop reconciliation, and users will be notified via a human-readable condition message.

Would/should there be a way for users to indicate that a retry should be made? Personally I get quite annoyed when I forget to create the identity ref secret or make a typo in some image name if that requires a full re-creation of the cluster.

EmilienM · 2025-01-23T13:28:13Z

We could create a condition for that? Retriable (default not set): True or False?

lentzi90 · 2025-01-23T14:24:57Z

Sure! I guess my question is: is it better to have the terminal error and then potentially need a way to retry, or is it better to just rely on the exponential backoff and make all errors transient?

mdbooth · 2025-01-23T15:44:34Z

The difference between TerminalError and old CAPI Failure conditions is that TerminalError is ephemeral. Throwing a TerminalError just means that at the top level of the reconciler we won't schedule another reconcile, so reconciliation will stop.

However, if anything happens to trigger a reconcile anyway, we will still reconcile the object. For example, if you messed up the credentials and then update them, that should trigger another reconcile.

lentzi90 · 2025-01-24T06:08:02Z

Ah that is excellent!

github-project-automation bot added this to CAPO Roadmap Jan 20, 2025

k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 20, 2025

github-project-automation bot moved this to Inbox in CAPO Roadmap Jan 20, 2025

EmilienM changed the title ~~OpenStackCluster: improve Conditions~~ OpenStackCluster: improve Conditions and Terminal errors Jan 21, 2025

EmilienM changed the title ~~OpenStackCluster: improve Conditions and Terminal errors~~ Improve Conditions and Terminal errors Jan 21, 2025

EmilienM mentioned this issue Jan 29, 2025

Panic if instance was deleted in openstack manually #2404

Open
