Description of problem:

When the hosted storage is disconnected and then reconnected, the hosted engine VM starts on the second host but remains in paused status on the first host and does not resume. The hosted engine VM then needs to be powered off manually on the first host; it powers off successfully without affecting the running HE VM on the second host.

Version-Release number of selected component (if applicable):
rhv 4.0.4

Steps used by the customer to reproduce this issue:
- The hosted storage (iSCSI) was disconnected for around 2 minutes.
- Approx. 3 minutes later the hosted storage was reconnected to the hosts. The hosted engine VM status showed "unknown stale data" for about a minute.
- The hosted engine VM status was then seen as paused on the first host, and about 30 seconds later the hosted engine VM was started on the second host.
- The hosted engine VM status remained paused on the first host even after waiting for about 2 hours.
- The hosted engine VM was powered off on the first host using the command below. The hosted engine VM running on the second host was not affected by this.
  # hosted-engine --vm-poweroff

Actual results:
The status of the hosted engine VM did not change on the first host and remained "paused".

Expected results:
The hosted engine VM status should be "running" on one host, and the other hosts should show status "down".
Possibly a broker issue: restarting the HE VM without making sure it's gone first. Please attach relevant logs from both hosts.
This is not a broker "issue", actually: we do not touch paused VMs by design at the moment. Paused is also used during migration, and we currently do not have a good enough rule to determine whether a paused VM can be cleaned up or not.
There's a Pause Reason code which you can use to differentiate. And if you observe the VM state via the vdsm layer, you would see a MIGRATION_DESTINATION status instead of PAUSED.
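The pause reason can also be inspected directly at the libvirt layer. A minimal sketch, assuming the engine VM is named HostedEngine (the usual name on an HE host):

```shell
# Show the domain state together with the reason libvirt recorded for it.
# A storage outage typically reports "paused (I/O error)", whereas a
# migration target reports a migration-related reason instead.
# Read-only connection (-r) works without credentials.
virsh -r domstate HostedEngine --reason
```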
As you can see in bug 1278481 this is currently by design due to the VM status life cycle. However we need to be able to handle this in light of upcoming changes with migration improvements. Michal, can you suggest an indication that will tell us when it's safe to destroy the VM?
I guess it depends on what you want to do. The information is available. In general you should be able to kill the leftover VM as soon as you see the VM running on the other side. Normally libvirt does that automatically; when it doesn't work, vdsm tries to do it (for vdsm-initiated migrations), and if that fails, the engine tries (for engine-initiated migrations).
This became more important yesterday. It's more than cosmetic. Consider this sequence of events:

- HE and the environment are running steady-state. Life is good.
- The HE hypervisor dies; HE restarts on all HE-eligible hypervisors and pauses on all but one. (That's this known bug.) Let's say the running HE is now on the SPM host.
- A while later, that SPM host dies, taking down HE with it.
- There's no power fencing.
- The surviving hosts do *not* elect another SPM because they don't know what happened to the dead SPM because there's no power fencing. This is proper.
- Somebody needs to check "Confirm host has been rebooted" for a new SPM election to happen.
- But nobody can check that checkbox because there's no GUI because there's no manager because its hypervisor host died, and managers are already running but paused on the other HE-eligible hypervisors.
- And just like that, the whole environment is tied up in knots, leading to a heavy-duty support case.

This sequence of events apparently really did happen. We break the cycle by making sure we restart one and only one RHVM instance if it dies, so if it dies again, the next RHVM failover will still work.

thanks - Greg
(In reply to Greg Scott from comment #12)
> - The HE hypervisor dies; HE restarts on all HE-eligible hypervisors and
> pauses on all but one. (That's this known bug.) Let's say the running HE is
> now on the SPM host.

Can you point us to the known bug? Is it this bug?

Hosted engine may try to start a new engine on several hosts at the same time, but only one engine will start. The other engines should fail to start, not pause.

> - A while later, that SPM host dies, taking down HE with it.

Hosted engine should start a new engine at this point.

> - There's no power fencing.

Without power fencing your system cannot be highly available.

> - The surviving hosts do *not* elect another SPM because they don't know
> what happened to the dead SPM because there's no power fencing. This is
> proper.

The hosts do not select a new SPM since we don't have such a feature. Only the engine selects a new SPM, and only if it can ensure that the old SPM is not running.

> - Somebody needs to check "Confirm host has been rebooted" for a new SPM
> election to happen.

If the engine cannot access the old SPM host, yes, this is the only way to get a new SPM.

The way to fix such a system with multiple failures is to destroy the paused hosted engine VMs using virsh. The hosted engine agent will start a new engine, and then if needed you can get a new SPM.
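The virsh recovery described above can be sketched as follows. This is a hedged example, assuming the stuck VM is named HostedEngine; note that on RHV hosts, write operations through virsh may require SASL credentials configured for libvirt:

```shell
# On each host where the engine VM is stuck in paused state:

# List all domains; the read-only connection works without credentials.
virsh -r list --all

# Force off the paused engine VM so ovirt-ha-agent can start a fresh one.
# (This may prompt for SASL credentials on RHV hosts.)
virsh destroy HostedEngine
```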
Thanks Nir

>> - The HE hypervisor dies; HE restarts on all HE-eligible hypervisors and
>> pauses on all but one. (That's this known bug.) Let's say the running HE is
>> now on the SPM host.
>
> Can you point us to the known bug? is it this bug?

Yes - this bug right here in this BZ.

> Hosted engine may try to start new engine on several hosts in the same time,
> but only one engine will start. The other engine should fail to start, not
> pause.
>
>> - A while later, that SPM host dies, taking down HE with it.
>
> Hosted engine should start new engine at this point.

Except it doesn't. Every HE-eligible host already had a paused HE running, so nobody starts a new one.

And I'll update the support case with what you said about the manager choosing an SPM. I always thought the hypervisors elected one, so thanks for clarifying that.

> The way to fix such system with multiple failures is to destroy the paused
> hosted engine vms using virsh. Hosted engine agent will start a new engine,
> and then if needed you can get a new SPM.

The problem here is, you're blind. There's no manager and no SPM, and so all you can do is ssh into each host and look around. If you don't know about this bug, it's a 24x7 severity 1 support case.
(In reply to Greg Scott from comment #15)
> > Hosted engine may try to start new engine on several hosts in the same time,
> > but only one engine will start. The other engine should fail to start, not
> > pause.
> >
> >> - A while later, that SPM host dies, taking down HE with it.
> >
> > Hosted engine should start new engine at this point.
>
> Except it doesn't. Every HE-eligible host already had a paused HE running,
> so nobody starts a new one.

This is a bit unclear to me: more than one of the remaining hosts could try to start the engine VM at the same time, but sanlock will ensure that only one host will really start it. The hosts where sanlock prevented the engine VM from starting shouldn't mark it as paused.

I don't understand how every HE-eligible host already had a paused HE running. Is it just because you repeated the storage domain disconnection a few times for testing?

> > The way to fix such system with multiple failures is to destroy the paused
> > hosted engine vms using virsh. Hosted engine agent will start a new engine,
> > and then if needed you can get a new SPM.
>
> The problem here is, you're blind. There's no manager and no SPM, and so all
> you can do is ssh into each host and look around. If you don't know about
> this bug, it's a 24x7 severity 1 support case.

Technically, ovirt-ha-agent doesn't need an SPM host to start the engine VM back up. As soon as you have a running manager, the death of the SPM host will be no different from a case where the engine is on a physical machine.
(In reply to Simone Tiraboschi from comment #16)
> This is a bit unclear to me: more than one of the remaining hosts could try
> to start the engine VM at the same time but sanlock will ensure that only
> one host will really start it.
> The hosts where sanlock prevented the engine VM from starting shouldn't mark
> it as paused.
>
> I don't understand how every HE-eligible host already had a paused HE
> running.
> Just because you repeat the storage domain disconnection a few time just for
> testing?

This bug - the one we're commenting on - is that HE starts on *all* HE-eligible hosts, and then pauses on all but one. The bug is, we have all these HE-eligible hosts with a paused HE. The workaround is to kill those paused HE instances by hand. Nobody thought this was a big deal until recently.

> Technically ovirt-ha-agent doesn't need an SPM host to start back the engine
> VM.
> As soon as you have a running manager, the death of the SPM host will not be
> different from a case where the engine is on a physical machine.

You're right, ovirt-ha-agent does not depend on an SPM. And now we're into a consequence of this bug. Consider this scenario. Everything is steady-state. HE is fine on one host and paused on the other HE-eligible hosts. (That's this bug: HE should not be alive at all, not even paused.) The HE host dies. HE does not start up anywhere else, apparently because it's already started but paused. Now we have no manager. Let's say that HE host is also the SPM - now we have no manager and no SPM.
And no way to know what's going on because there's no manager. The workaround is easy; just kill the paused HE instances and fire up a new one. But figuring out that workaround is hard because we're blind. - Greg
I had another talk with my large TAM customer on this. Apparently, transient storage failures can also trigger this behavior. - Greg
The agent starts the engine VM on all nodes, but all but one should die immediately (sanlock protection). Nothing should stay in paused mode. The VM should be configured to not allow the paused state at all (we want it to die and restart); let me see what we can do about that.
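The sanlock protection described above can be checked on each host. A hedged illustration (the exact output format varies by sanlock version):

```shell
# List sanlock client connections and the resource leases held on this
# host. Only the host actually running the engine VM should hold its VM
# lease; a host where the VM merely sits paused should not have acquired it.
sanlock client status
```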
After discussing this with Michal Skrivanek, we can do one of two things:
- enable resume policy (if it works with the lock HE uses)
- duplicate the VDSM resume policy kill mode (paused on IO + running elsewhere -> kill)
It's still not clear how exactly this happened. The code is supposedly able to handle that situation, so there might be some unknown factor. We need to reproduce that locally. Nikolai/Koutuk, can you please try to reproduce that behavior?
I have reproduced this bug using these steps:
1. Deploy HE on 2 hosts with iSCSI storage
2. Block the iSCSI ports in the firewall on the host where the HE VM is running:
   # iptables -A INPUT -p tcp --dport 860 -j REJECT
   # iptables -A INPUT -p tcp --dport 3260 -j REJECT
   # iptables -A OUTPUT -p tcp --dport 860 -j REJECT
   # iptables -A OUTPUT -p tcp --dport 3260 -j REJECT
3. As expected, the VM is paused on that host and started on the other host
4. Then I disabled the firewall. The agent and broker started, but the VM remains paused.
(In reply to Michal Skrivanek from comment #25)
> it's still not clear how exactly this happened. The code is supposedly able
> to handle that situation, so there might be some unknown factor. We need to
> reproduce that locally. Nikolai/Koutuk can you please try to reproduce that
> behavior

I think that comment #27 already explains the reproduction steps and the results. Removing the needinfo from myself.
Tested on these components on the hosts:
ovirt-hosted-engine-ha-2.2.15-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.23-1.el7ev.noarch
rhvm-appliance-4.2-20180620.0.el7.noarch
Linux 3.10.0-862.6.3.el7.x86_64 #1 SMP Fri Jun 15 17:57:37 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.5 (Maipo)

On the engine:
ovirt-engine-setup-4.2.5-0.1.el7ev.noarch
Linux 3.10.0-862.6.3.el7.x86_64 #1 SMP Fri Jun 15 17:57:37 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.5 (Maipo)

Works for me as expected. I deployed SHE over iSCSI on a pair of ha-hosts. On the second host "B", which was the SPM with the HE VM running, I blocked the iSCSI target using iptables, e.g.:
# iptables -A OUTPUT -p tcp --destination-port 3260 -d IPaddressofthetarget -j DROP
I waited a few minutes for the engine to get started on the first host "A", then removed the iptables rule on host "B":
# iptables -D OUTPUT -p tcp --destination-port 3260 -d IPaddressofthetarget -j DROP
After waiting a few more minutes, I saw that host "B" removed the paused VM from itself and the HE VM continued to run on host "A" uninterrupted. Moving to verified.
*** Bug 1460513 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2323