About HA #9988

leo79901 · 2024-11-27T12:47:07Z

To keep virtual machines available, I want the VMs on a physical host to be recovered automatically on other hosts if that host fails. I ran the following test:

1. compute offer config

Enabled HA in the compute offering; "Offer HA" is now set to true.

2. cluster config

Enabled HA in the cluster, and the result is as shown.

3. host config

For the host, I configured out-of-band management and confirmed that the power state shown in the web UI is On. Then I opened "Configure HA", selected "KVMHAProvider" as the provider, and enabled HA. The result is as follows.
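
For anyone who wants to script steps 2 and 3 rather than click through the UI, here is a minimal sketch of the equivalent API calls in order. It assumes the CloudStack API commands `enableHAForCluster`, `configureHAForHost`, and `enableHAForHost` with the parameter names shown; please verify them against the API reference for your version. The function `ha_setup_calls` and the UUID placeholders are hypothetical. The sketch only builds the calls, it does not send them, and out-of-band management is assumed to be configured already.

```python
from typing import Dict, List, Tuple

# One API call: (command name, request parameters).
Call = Tuple[str, Dict[str, str]]


def ha_setup_calls(cluster_id: str, host_id: str,
                   provider: str = "kvmhaprovider") -> List[Call]:
    """Return the host-HA setup calls in the order described in the post.

    Command and parameter names are assumptions to check against the
    API reference; the IDs are placeholders supplied by the caller.
    """
    return [
        # Step 2: enable HA on the cluster.
        ("enableHAForCluster", {"clusterid": cluster_id}),
        # Step 3a: pick the HA provider for the host.
        ("configureHAForHost", {"hostid": host_id, "provider": provider}),
        # Step 3b: enable HA on the host itself.
        ("enableHAForHost", {"hostid": host_id}),
    ]


if __name__ == "__main__":
    for command, params in ha_setup_calls("CLUSTER-UUID", "HOST-UUID"):
        print(command, params)
```

Keeping the calls as plain data makes it easy to review the order before handing them to whatever API client you use.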

4. do the test

I shut down the host by powering it off via iDRAC. After that, I could no longer log in to the instance on that host, but its status remained "Running". About 5 minutes later, the status was still "Running" and I still could not log in. The instance page shows the message: "The Control Plane Status of this Instance is Offline. Some actions on this Instance will fail, if so please wait a while and retry."
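
To put a number on how long recovery takes, the check can be scripted instead of watched in the UI. A minimal sketch: poll which host the instance reports and return the elapsed time once it changes. `wait_for_host_change` and `get_host` are hypothetical names; in practice `get_host` would wrap a `listVirtualMachines` call and read the host field from the response, which is an assumption here and not something shown in this thread.

```python
import time
from typing import Callable, Optional


def wait_for_host_change(get_host: Callable[[], str],
                         start_host: str,
                         timeout_s: float = 900.0,
                         interval_s: float = 15.0,
                         sleep: Callable[[float], None] = time.sleep,
                         clock: Callable[[], float] = time.monotonic
                         ) -> Optional[float]:
    """Poll until the instance reports a host other than start_host.

    Returns the elapsed seconds when the host changes, or None if
    timeout_s passes first. sleep and clock are injectable so the
    loop can be exercised without real waiting.
    """
    started = clock()
    while True:
        elapsed = clock() - started
        if get_host() != start_host:
            return elapsed
        if elapsed >= timeout_s:
            return None
        sleep(interval_s)
```

The default 15-minute timeout is only an illustrative value, not a CloudStack setting.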

result

In my opinion, this instance should have been restarted on another host, but it was not.

Are there any issues with my configuration?

Thanks a lot.

tdtmusic2 · 2024-11-27T21:19:35Z

Been there, and this is not the way to do it. What you enabled is host HA and out-of-band management, not VM HA, and host HA, to my knowledge, does not work as intended. What you want is for the VMs to start on a different host when the original host fails. For that, you need the offering with HA enabled and an NFS primary storage in disabled mode named HA: a simple folder with this name shared via NFS. That's all. In the event of a host failure, whether it is power related, network related, etc., the VMs on that host that are HA-enabled will power up on other hosts after some time. I have no idea where to configure these timings; for me the HA process starts a good 10-12 minutes after the host failure.

leo79901 Nov 28, 2024
Author

‘a nfs primary storage in disabled mode named HA’ Sorry, I can't understand this.

If the storage is in disabled mode, how can ACS use it?
Does 'named HA' imply that the name is significant?

tdtmusic2 Nov 28, 2024

I am talking about an additional storage, not the one used by ACS: a second primary storage, kept disabled.
Yes, that must be the name.

sbrueseke · 2024-11-28T11:37:13Z

We ran into the same situation and did not understand HA in CloudStack correctly either. It is really hard for CS to be sure that a host is really down. There are many situations where the management server is unable to connect to the host, but all VMs are still running. If CS then tries to start the VMs on other hosts, it will end in a mess.
Here is how we handle host failures. It is a manual process:

1. Our monitoring informs us that a host is down.
2. We take a look, and a technician decides whether the host is really down and will not come back up.
3. If the host is really down, we do a Force Reconnect on the host's UI page.
4. After that, we do a Declare Host as Degraded via the UI.
5. Once the host is declared degraded, HA (of the service offering) kicks in and restarts all VMs on other hosts.
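
The manual procedure above can be sketched as a pair of API calls gated on the technician's decision. This assumes `reconnectHost` and `declareHostAsDegraded` are the API commands behind the two UI actions and that both take the host `id`; please check the API reference for your version. The function name `degrade_host_calls` is hypothetical, and nothing is sent to the management server.

```python
from typing import Dict, List, Tuple

# One API call: (command name, request parameters).
Call = Tuple[str, Dict[str, str]]


def degrade_host_calls(host_id: str, confirmed_down: bool) -> List[Call]:
    """Return the calls for the manual failover, or nothing at all.

    confirmed_down stands for the technician's decision that the host
    is really down; without it no call is produced, so VMs are never
    restarted elsewhere on an unverified alarm.
    """
    if not confirmed_down:
        return []
    return [
        # UI: Force Reconnect on the host page (assumed API mapping).
        ("reconnectHost", {"id": host_id}),
        # UI: Declare Host as Degraded (assumed API mapping).
        ("declareHostAsDegraded", {"id": host_id}),
    ]
```

The human confirmation stays an explicit input here, which matches the wish to keep control over the process.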

Even if it is possible, we are not going to automate this. We want control over it, and one big reason is that we also run SDS (LINSTOR) on all hosts, so it will impact our primary storage, too.

Hope that helps!

leo79901 Nov 28, 2024
Author

Thanks a lot.
We are using NFS as the primary storage, so the storage is safe.
Most issues happen at night, and manual intervention takes too long.
We want the instances to restart automatically. Because of this, I have asked everyone to make sure their applications start on boot.

leo79901 Nov 28, 2024
Author

"the management server is unable to connect to the host, but all VMs are still running"
I'm not an expert, but I see there is a directory named KVMHA in the primary storage, and every host periodically writes something into it, like a timestamp?
Maybe this is useful for identifying whether a host is healthy or not.
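
As a rough illustration of that idea, here is a sketch that treats a host as suspect only when its heartbeat on shared storage is older than a timeout and the management server cannot reach it either. It assumes the heartbeat file holds a single Unix timestamp in seconds, which fits the description above but is not confirmed here. The 60-second timeout and all function names are hypothetical.

```python
import time
from typing import Optional


def parse_heartbeat(text: str) -> int:
    """Parse heartbeat file content, assumed to be epoch seconds."""
    return int(text.strip())


def heartbeat_is_stale(last_heartbeat: int,
                       now: Optional[int] = None,
                       timeout_s: int = 60) -> bool:
    """True if the last heartbeat is older than timeout_s seconds."""
    current = int(time.time()) if now is None else now
    return current - last_heartbeat > timeout_s


def host_probably_down(agent_reachable: bool, stale: bool) -> bool:
    """Suspect the host only if it is unreachable AND its heartbeat is stale.

    An unreachable host with a fresh heartbeat is still writing to
    shared storage, so its VMs may well be running.
    """
    return (not agent_reachable) and stale
```

This covers the case mentioned earlier where the management server loses contact but the VMs keep running: the fresh heartbeat keeps the host from being declared down.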
