
azure-events-az: Node Health Attribute (-1000000) Not Automatically Reset After Critical Failure in Resource-Agent Update #2025

Open
rnirek opened this issue Feb 20, 2025 · 3 comments


rnirek commented Feb 20, 2025

We have observed that with the recent update to the resource-agent configuration, as outlined in ClusterLabs resource-agents, once a node's health attribute is set to -1000000 (indicating a critical failure or an unhealthy state), the attribute remains unchanged until it is modified manually. Consequently, cluster services will not restart until that manual intervention occurs.

Could you clarify the rationale behind why the value is not automatically reset to 0?
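Until the agent clears the attribute itself, the only way to get resources running again is a manual reset of the health score. A minimal sketch of that reset, assuming the attribute name `#health-azure` (the actual name is whatever `attr_healthstate` resolves to in the installed agent):

```python
# Hedged sketch: build the crm_attribute invocation that clears the
# node health score so resources may start again. The attribute name
# "#health-azure" is an assumption; check attr_healthstate in the agent.

def build_reset_cmd(node, attr_healthstate="#health-azure"):
    """Return the argv that resets the node health attribute to 0."""
    return [
        "crm_attribute",
        "--node", node,
        "--name", attr_healthstate,
        "--update", "0",
        "--lifetime=forever",
    ]

if __name__ == "__main__":
    # Prints the command you would run against the affected node.
    print(" ".join(build_reset_cmd("node1")))
```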

@oalbrigt oalbrigt changed the title Node Health Attribute (-1000000) Not Automatically Reset After Critical Failure in Resource-Agent Update azure-events-az: Node Health Attribute (-1000000) Not Automatically Reset After Critical Failure in Resource-Agent Update Feb 24, 2025
@oalbrigt (Contributor) commented:

@MSSedusch Can you explain why the value was changed from what it was in the old azure-events agent?

@MSSedusch (Contributor) commented:

@HappyTobi @msftrobiro @rdeltcheva

@oalbrigt @rnirek the node should be put back online here:

```python
def putNodeOnline(self, node=None):
```

@HappyTobi can you debug why/if the node is not put back online? Do we have to also put it in the else branch here?


rnirek commented Feb 25, 2025

@MSSedusch While that was our assumption as well, we have observed that whenever a node is shut down, it is marked unhealthy with a value of -1000000 and cannot be brought back online until the value is updated manually. This behavior has been consistent across several clusters.

```python
clusterHelper._exec("crm_attribute", "--node", node, "--name", attr_healthstate, "--update", "-1000000", "--lifetime=forever")
```
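For comparison, the reset that `putNodeOnline` would be expected to perform is the same call with `0` in place of `-1000000`. This is a hedged sketch, not the agent's actual code: `clusterHelper` is a stand-in for the agent's subprocess wrapper, and the attribute name is an assumption.

```python
# Sketch of the expected agent-side reset, mirroring the call above.
# clusterHelper here only records the argv it would have executed;
# "#health-azure" is an assumed attribute name.

class clusterHelper:
    calls = []

    @classmethod
    def _exec(cls, *argv):
        # The real helper shells out to the cluster CLI; we just record.
        cls.calls.append(list(argv))

attr_healthstate = "#health-azure"

def putNodeOnline(node):
    """Clear the node health score so Pacemaker can start resources."""
    clusterHelper._exec(
        "crm_attribute", "--node", node,
        "--name", attr_healthstate,
        "--update", "0", "--lifetime=forever",
    )

putNodeOnline("node1")
```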
