Bug 1393839 - Hosted engine vm status remains paused on 1st host and starts on 2nd Host during hosted-storage disconnect and reconnect
Summary: Hosted engine vm status remains paused on 1st host and starts on 2nd Host dur...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-ha
Version: 4.0.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.2.5
Assignee: Andrej Krejcir
QA Contact: Nikolai Sednev
URL:
Whiteboard: hosted-engine
Duplicates: 1460513
Depends On: 1460513
Blocks: RHV_DR 1520566 1534978 1596331
 
Reported: 2016-11-10 12:42 UTC by Koutuk Shukla
Modified: 2022-03-13 14:08 UTC
CC List: 22 users

Fixed In Version: ovirt-hosted-engine-ha-2.2.15-1.el7ev
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-07-31 17:50:42 UTC
oVirt Team: SLA
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1275351 0 high CLOSED [RFE][vdsm] [hosted-engine] In case the qemu process is terminated ungracefully by libvirt during VM migration, vdsm doe... 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1278481 0 low CLOSED After problem with connection to storage domain one of hosts have paused vm 2021-02-22 00:41:40 UTC
Red Hat Knowledge Base (Solution) 3230061 0 None None None 2017-11-01 23:38:49 UTC
Red Hat Product Errata RHBA-2018:2323 0 None None None 2018-07-31 17:51:08 UTC
oVirt gerrit 92169 0 master ABANDONED WIP Delete paused VM when a good engine runs elsewhere 2020-12-14 09:25:37 UTC
oVirt gerrit 92446 0 master MERGED agent: Stop paused VM if it is running elsewhere or is paused a long time 2020-12-14 09:25:37 UTC
oVirt gerrit 92488 0 master MERGED agent: Cleanup EngineUp state 2020-12-14 09:25:38 UTC
oVirt gerrit 92603 0 v2.2.z MERGED agent: Cleanup EngineUp state 2020-12-14 09:26:07 UTC
oVirt gerrit 92604 0 v2.2.z MERGED agent: Stop paused VM if it is running elsewhere or is paused a long time 2020-12-14 09:25:38 UTC

Internal Links: 1275351 1278481

Description Koutuk Shukla 2016-11-10 12:42:20 UTC
Description of problem:

-- When the hosted-engine storage is disconnected and then reconnected, the hosted engine VM starts on the second host but remains in a paused state on the first host and does not resume.
-- The hosted engine VM needs to be powered off manually on the first host. It gets powered off successfully without affecting the running HE VM on the second host.


Version-Release number of selected component (if applicable):
rhv 4.0.4


-- Steps used by the customer to reproduce this issue:

- The hosted storage (iSCSI) was disconnected for around 2 minutes.
- Approximately 3 minutes later the hosted storage was reconnected to the hosts. The hosted engine VM status showed "unknown stale data" for about a minute.
- The hosted engine VM was then seen paused on the first host, and about 30 seconds later the hosted engine VM was started on the second host.
- The hosted engine VM remained paused on the first host even after waiting for about 2 hours.
- The hosted engine VM was powered off on the first host using the command below; the running HE VM on the second host was not affected by this (per-host status can be checked as shown after the command).
# hosted-engine --vm-poweroff
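
For reference (a hedged note, not part of the original report): the per-host engine VM status described above can usually be checked on any HA host with:

# hosted-engine --vm-status

which prints each host's view of the engine VM as published on the shared storage, including markers such as "unknown stale-data" when a host's metadata is out of date.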

Actual results:

-- The status of the hosted engine VM on the first host did not change and remained "paused".

Expected results:

-- The hosted engine VM should be running on one host, and the other hosts should report its status as "down".

Comment 1 Michal Skrivanek 2016-11-11 08:30:59 UTC
Possibly a broker issue restarting the HE VM without making sure it's gone first.
Please attach the relevant logs from both hosts.

Comment 2 Martin Sivák 2016-11-11 08:50:18 UTC
This is not a broker "issue" actually; we do not touch paused VMs by design at the moment. Paused is also used for migration, and we currently do not have a good enough rule to determine whether a paused VM can be cleaned up or not.

Comment 3 Michal Skrivanek 2016-11-11 09:39:59 UTC
There's a pause reason code which you can use to differentiate. And if you observe the VM state via the vdsm layer you would see a MIGRATION_DESTINATION status instead, not PAUSED.
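
For illustration (an assumption about tooling, not part of the comment above): on hosts with a reasonably recent vdsm, the VM state and pause reason can be inspected directly, for example:

# vdsm-client Host getVMList
# vdsm-client VM getStats vmID=<vm-uuid>

where <vm-uuid> is the hosted engine VM's ID; the stats of a paused VM typically include a pause code (e.g. EIO for an I/O error), while a migration target reports a "Migration Destination" status. Older releases shipped vdsClient instead of vdsm-client.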

Comment 5 Doron Fediuck 2016-11-28 12:25:45 UTC
As you can see in bug 1278481 this is currently by design due to the VM status life cycle. However we need to be able to handle this in light of upcoming changes with migration improvements.

Michal, can you suggest an indication that will tell us when it's safe to destroy the VM?

Comment 7 Michal Skrivanek 2016-11-28 13:00:46 UTC
I guess it depends on what you want to do. The information is available. In general you should be able to kill the leftover VM as soon as you see the VM running on the other side. Normally it is libvirt doing that automatically; when that doesn't work, vdsm tries to do it (for vdsm-initiated migrations), and if that fails, the engine tries to do it (for engine-initiated migrations).

Comment 12 Greg Scott 2018-05-10 20:03:59 UTC
This became more important yesterday. It's more than cosmetic. Consider this sequence of events:

- HE and the environment are running steady-state.  Life is good.
- The HE hypervisor dies; HE restarts on all HE-eligible hypervisors and pauses on all but one. (That's this known bug.) Let's say the running HE is now on the SPM host.
- A while later, that SPM host dies, taking down HE with it.
- There's no power fencing.
- The surviving hosts do *not* elect another SPM because they don't know what happened to the dead SPM because there's no power fencing.  This is proper.
- Somebody needs to check "Confirm host has been rebooted" for a new SPM election to happen.
- But nobody can check that checkbox because there's no GUI because there's no manager because its hypervisor host died and managers are already running but paused on the other HE-eligible hypervisors.
- And just like that, the whole environment is tied up in knots, leading to a heavy-duty support case.

This sequence of events apparently really did happen.  We break the cycle by making sure we restart one and only one RHVM instance if it dies, so if it dies again, the next RHVM failover will still work.

thanks

- Greg

Comment 14 Nir Soffer 2018-05-19 16:55:24 UTC
(In reply to Greg Scott from comment #12)
> - The HE hypervisor dies; HE restarts on all HE-eligible hypervisors and
> pauses on all but one. (That's this known bug.) Let's say the running HE is
> now on the SPM host.

Can you point us to the known bug? Is it this bug?

Hosted engine may try to start a new engine on several hosts at the same time,
but only one engine will start. The other engines should fail to start, not pause.

> - A while later, that SPM host dies, taking down HE with it.

Hosted engine should start a new engine at this point.

> - There's no power fencing.

Without power fencing your system cannot be highly available.

> - The surviving hosts do *not* elect another SPM because they don't know
> what happened to the dead SPM because there's no power fencing.  This is
> proper.

The hosts do not select a new SPM since we don't have such a feature. Only the engine
selects a new SPM, and only if it can ensure that the old SPM is not running.

> - Somebody needs to check "Confirm host has been rebooted" for a new SPM
> election to happen.

If the engine cannot access the old SPM host, then yes, this is the only way to get a new
SPM.

The way to fix such a system with multiple failures is to destroy the paused hosted
engine VMs using virsh. The hosted engine agent will start a new engine, and then,
if needed, you can get a new SPM.
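
A hedged sketch of that recovery step (standard libvirt and hosted-engine tooling; exact commands and authentication depend on the host setup): on each host that still shows a paused engine VM, list the domains read-only, destroy the leftover one, and then watch the agent take over:

# virsh -r list --all
# virsh destroy HostedEngine
# hosted-engine --vm-status

"HostedEngine" is the usual libvirt domain name for the hosted engine VM, but it should be confirmed against the list output; on RHV hosts, non-read-only virsh operations normally require SASL credentials.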

Comment 15 Greg Scott 2018-05-20 13:38:29 UTC
Thanks Nir

>> - The HE hypervisor dies; HE restarts on all HE-eligible hypervisors and
>> pauses on all but one. (That's this known bug.) Let's say the running HE is
>> now on the SPM host.
>
> Can you point us to the known bug? is it this bug?

Yes - this bug right here in this BZ.

>
> Hosted engine may try to start new engine on several hosts in the same time,
> but only one engine will start. The other engine should fail to start, not
> pause.
> 
>> - A while later, that SPM host dies, taking down HE with it.
>
> Hosted engine should start new engine at this point.

Except it doesn't. Every HE-eligible host already had a paused HE running, so nobody starts a new one.

And I'll update the support case with what you said about the manager choosing an SPM. I always thought the hypervisors elected one, so thanks for clarifying that.

> The way to fix such system with multiple failures is to destroy the paused 
> hosted engine vms using virsh. Hosted engine agent will start a new engine,
> and then if needed you can get a new SPM.

The problem here is, you're blind. There's no manager and no SPM, and so all you can do is ssh into each host and look around. If you don't know about this bug, it's a 24x7 severity 1 support case.

Comment 16 Simone Tiraboschi 2018-05-21 07:45:47 UTC
(In reply to Greg Scott from comment #15)
> > Hosted engine may try to start new engine on several hosts in the same time,
> > but only one engine will start. The other engine should fail to start, not
> > pause.
> > 
> >> - A while later, that SPM host dies, taking down HE with it.
> >
> > Hosted engine should start new engine at this point.
> 
> Except it doesn't. Every HE-eligible host already had a paused HE running,
> so nobody starts a new one.

This is a bit unclear to me: more than one of the remaining hosts could try to start the engine VM at the same time but sanlock will ensure that only one host will really start it.
The hosts where sanlock prevented the engine VM from starting shouldn't mark it as paused.

I don't understand how every HE-eligible host already had a paused HE running.
Is it just because you repeated the storage domain disconnection a few times for testing?

> > The way to fix such system with multiple failures is to destroy the paused 
> > hosted engine vms using virsh. Hosted engine agent will start a new engine,
> > and then if needed you can get a new SPM.
> 
> The problem here is, you're blind. There's no manager and no SPM, and so all
> you can do is ssh into each host and look around. If you don't know about
> this bug, it's a 24x7 severity 1 support case.

Technically ovirt-ha-agent doesn't need an SPM host to bring the engine VM back up.
As soon as you have a running manager, the death of the SPM host will be no different from a case where the engine is on a physical machine.

Comment 17 Greg Scott 2018-05-21 14:29:34 UTC
(In reply to Simone Tiraboschi from comment #16)

> This is a bit unclear to me: more than one of the remaining hosts could try
> to start the engine VM at the same time but sanlock will ensure that only
> one host will really start it.
> The hosts where sanlock prevented the engine VM from starting shouldn't mark
> it as paused.
> 
> I don't understand how every HE-eligible host already had a paused HE
> running.
> Just because you repeat the storage domain disconnection a few time just for
> testing?

This bug - the one we're commenting on - is that HE starts on *all* HE-eligible hosts, and then pauses on all but one. The bug is, we have all these HE-eligible hosts with a paused HE. The workaround is, kill those paused HE instances by hand. Nobody thought this was a big deal until recently.

> 
> > > The way to fix such system with multiple failures is to destroy the paused 
> > > hosted engine vms using virsh. Hosted engine agent will start a new engine,
> > > and then if needed you can get a new SPM.
> > 
> > The problem here is, you're blind. There's no manager and no SPM, and so all
> > you can do is ssh into each host and look around. If you don't know about
> > this bug, it's a 24x7 severity 1 support case.
> 
> Technically ovirt-ha-agent doesn't need an SPM host to start back the engine
> VM.
> As soon as you have a running manager, the death of the SPM host will not be
> different from a case where the engine is on a physical machine.

You're right, ovirt-ha-agent does not depend on an SPM. And now we're into a consequence of this bug. Consider this scenario. Everything is steady-state. HE is fine on one host and paused on the other HE-eligible hosts. (That's this bug - HE should not be alive on those hosts at all, not even paused.) The HE host dies. HE does not start up anywhere else, apparently because it's already started but paused. Now we have no manager. Let's say that HE host is also the SPM - now we have no manager and no SPM. And no way to know what's going on because there's no manager. The workaround is easy; just kill the paused HE instances and fire up a new one. But figuring out that workaround is hard because we're blind.

- Greg

Comment 18 Greg Scott 2018-05-24 20:58:23 UTC
I had another talk with my large TAM customer on this. Apparently, transient storage failures can also trigger this behavior.

- Greg

Comment 21 Martin Sivák 2018-06-05 10:16:05 UTC
The agent starts the engine VM on all nodes, but all but one should die immediately (sanlock protection). Nothing should stay in paused mode.

The VM should be configured to not allow the paused state at all (we want it to die and restart); let me see what we can do about that.

Comment 23 Martin Sivák 2018-06-08 11:30:43 UTC
After discussing this with Michal Skrivanek, we can do one of two things:

- enable resume policy (if it works with the lock HE uses)
- duplicate the VDSM resume policy kill mode (paused on IO + running elsewhere -> kill)
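
A minimal sketch of the second option (an illustration only, not the patch that was merged; the function name, parameters, and timeout value here are hypothetical):

  # Illustration only (hypothetical names): a vdsm-style "kill" resume policy
  # applied by the HA agent to its own paused engine VM.

  PAUSED_VM_TIMEOUT = 600  # seconds a VM may stay paused before giving up on it

  def should_destroy_paused_vm(local_vm_paused, paused_seconds, engine_up_elsewhere):
      """Return True if the locally paused engine VM should be destroyed."""
      if not local_vm_paused:
          return False
      if engine_up_elsewhere:
          # Another host holds the sanlock lease and runs a healthy engine;
          # the local paused copy can never resume, so clean it up.
          return True
      # No healthy engine anywhere: only give up after a long timeout, so a
      # short storage hiccup still lets the local VM resume on its own.
      return paused_seconds > PAUSED_VM_TIMEOUT

This matches the subject of the merged change ("agent: Stop paused VM if it is running elsewhere or is paused a long time"); the actual implementation lives in the agent's state machine (see the gerrit links above).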

Comment 25 Michal Skrivanek 2018-06-21 07:58:49 UTC
It's still not clear how exactly this happened. The code is supposedly able to handle that situation, so there might be some unknown factor. We need to reproduce that locally. Nikolai/Koutuk, can you please try to reproduce that behavior?

Comment 27 Andrej Krejcir 2018-06-21 14:44:23 UTC
I have reproduced this bug using these steps:

1. Deploy HE on 2 hosts with iSCSI storage
2. Block the iSCSI ports in the firewall on the host where the HE VM is running:

  # iptables -A INPUT -p tcp --dport 860 -j REJECT
  # iptables -A INPUT -p tcp --dport 3260 -j REJECT
  # iptables -A OUTPUT -p tcp --dport 860 -j REJECT
  # iptables -A OUTPUT -p tcp --dport 3260 -j REJECT

3. As expected, the VM was paused on this host and started on the other host.
4. Then I disabled the firewall (one way to remove the rules from step 2 is shown below). The agent and broker started, but the VM remained paused.
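
If the rules were added exactly as in step 2, one hedged way to remove them again (the comment only says the firewall was disabled, so this is an assumption about how that was done) is:

  # iptables -D INPUT -p tcp --dport 860 -j REJECT
  # iptables -D INPUT -p tcp --dport 3260 -j REJECT
  # iptables -D OUTPUT -p tcp --dport 860 -j REJECT
  # iptables -D OUTPUT -p tcp --dport 3260 -j REJECT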

Comment 28 Nikolai Sednev 2018-06-24 10:58:01 UTC
(In reply to Michal Skrivanek from comment #25)
> it's still not clear how exactly this happened. The code is supposedly able
> to handle that situation, so there might be some unknown factor. We need to
> reproduce that locally. Nikolai/Koutuk can you pease try to reproduce that
> behavior

I think that comment #27 already explains the reproduction steps and the results.
Removing the needinfo from myself.

Comment 30 Nikolai Sednev 2018-07-03 13:52:56 UTC
Tested on these components on hosts:
ovirt-hosted-engine-ha-2.2.15-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.23-1.el7ev.noarch
rhvm-appliance-4.2-20180620.0.el7.noarch
Linux 3.10.0-862.6.3.el7.x86_64 #1 SMP Fri Jun 15 17:57:37 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.5 (Maipo)

On engine:
ovirt-engine-setup-4.2.5-0.1.el7ev.noarch
Linux 3.10.0-862.6.3.el7.x86_64 #1 SMP Fri Jun 15 17:57:37 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.5 (Maipo)


Works for me as expected.

I've deployed SHE over iSCSI on a pair of ha-hosts.
On the second host "B", which was the SPM and was running the HE VM, I blocked the iSCSI target using iptables, e.g.
"iptables -A OUTPUT -p tcp --destination-port 3260 -d IPaddressofthetarget -j DROP".
Waited a few minutes for the engine to get started on the first host "A".
Removed the iptables rule on host "B", e.g. "iptables -D OUTPUT -p tcp --destination-port 3260 -d IPaddressofthetarget -j DROP".
Waited a few minutes to see what would happen, and then saw that host "B" removed the paused VM from itself and the HE VM continued to run on host "A" uninterrupted.

Moving to verified.

Comment 31 Andrej Krejcir 2018-07-04 12:44:21 UTC
*** Bug 1460513 has been marked as a duplicate of this bug. ***

Comment 33 errata-xmlrpc 2018-07-31 17:50:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2323

