Bug 1527249 - [DR] - HA VM with lease will not work, if SPM is down and power management is not available.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.1.8
Hardware: Unspecified
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.3.0
Target Release: 4.3.0
Assignee: Eyal Shenitzky
QA Contact: Yosi Ben Shimon
URL:
Whiteboard:
Depends On:
Blocks: 1639269
 
Reported: 2017-12-19 01:39 UTC by Germano Veit Michel
Modified: 2021-09-09 12:57 UTC (History)
CC List: 17 users

Fixed In Version: ovirt-engine-4.3.0_alpha
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1639269 (view as bug list)
Environment:
Last Closed: 2019-05-08 12:36:59 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:
izuckerm: testing_plan_complete+


Attachments
Wasnt able to see the fix on HE env rhv-release-4.4.1-12-001 (510.79 KB, application/zip)
2020-08-05 08:20 UTC, Ilan Zuckerman


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3469861 0 None None None 2018-07-25 12:49:40 UTC
Red Hat Product Errata RHEA-2019:1085 0 None None None 2019-05-08 12:37:22 UTC
oVirt gerrit 94864 0 master MERGED core: allow HA VM with a lease on a non-active storage domain to run 2020-12-22 15:21:24 UTC
oVirt gerrit 94903 0 master MERGED Revert "core: allow HA VM with a lease on a non-active storage domain to run" 2020-12-22 15:21:24 UTC
oVirt gerrit 94904 0 master MERGED core: allow HA VM with a lease on a non-active storage domain to run 2020-12-22 15:21:24 UTC
oVirt gerrit 94926 0 ovirt-engine-4.2 MERGED core: allow HA VM with a lease on a non-active storage domain to run 2020-12-22 15:21:55 UTC

Description Germano Veit Michel 2017-12-19 01:39:20 UTC
Description of problem:

When the SPM host has power management configuration enabled and loses power, the SPM role does not failover to another host. The engine keeps trying to fence the SPM host and failing in a loop because the fence agent also lost power.

If power management is disabled, it works fine and the SPM role fails over normally.

Without Power Management:
1. SPM Host is A
2. A has its power cut
3. Host B assumes SPM role as soon as the ids lease for the SDM resource expires

With Power Management:
1. SPM Host is A
2. A has its power cut
3. Engine keeps trying to fence Host A, fencing keeps failing as fence agent also has no power
4. No commands are sent to other hosts to grab the SDM lease and become SPM.
5. Engine is stuck in a loop trying to fence A.
6. The SDM resource is available for any other host to acquire and move the Data Center back to Up status, but nothing happens.
7. As DC is down, VM HA also fails.

So basically the DC is down because there is no SPM and the HA VMs that were running on the SPM are also down.

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.8.2-0.1.el7.noarch

How reproducible:
100%

Steps to Reproduce:
1. Enable Power Management for host A, with wrong configuration
2. Set A as SPM
3. In A, block communication with engine and stop renewing sanlock leases:
   # iptables -A INPUT -s <RHV-M IP> -j DROP
   # systemctl stop vdsmd (to release the SDM lease)

* If you skip step 1, everything works.
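
As a quick sanity check (not part of the original steps, and assuming sanlock's client tooling is present on the host), the lease state on host A can be inspected before and after step 3:

   # sanlock client status

Before stopping vdsmd this should list the lockspace and the SDM resource held by vdsm; after stopping vdsmd the SDM resource entry should be gone, confirming the lease was released rather than left to expire.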

Actual results:
Data Center not responsive, HA VMs not restarted (even with lease option).

Expected results:
The SPM role fails over to another host; the Data Center comes back up and the VM HA mechanism works.

Comment 3 Yaniv Kaul 2017-12-19 07:49:08 UTC
Why is it medium severity? Sounds like basic functionality that should work?
Raz - I assume this is tested all the time?

Comment 5 Raz Tamir 2017-12-19 08:41:56 UTC
(In reply to Yaniv Kaul from comment #3)
> Why is it medium severity? Sounds like basic functionality that should work?
> Raz - I assume this is tested all the time?

This should be tested by coresystem team.

Lukas,

Can you provide more info about the testing frequency of this issue?

Comment 7 Eli Mesika 2017-12-19 14:21:13 UTC
(In reply to Germano Veit Michel from comment #0)

> How reproducible:
> 100%
> 
> Steps to Reproduce:
> 1. Enable Power Management for host A, with wrong configuration
> 2. Set A as SPM
> 3. In A, block communication with engine and stop renewing sanlock leases:
>    # iptables -A INPUT -s <RHV-M IP> -j DROP
>    # systemctl stop vdsmd (to release the SDM lease)
> 
> * If you skip step 1, everything works.
> 

Well, this is not reproducible with the scenario you wrote, since the soft-fencing procedure occurs before the hard fencing (reboot); the soft-fencing procedure restarts the vdsmd service and another host becomes SPM...

Following your procedure, the SPM lease was released and the other host that I had (let's call it host B) became SPM.

Please also add the following details:

1) How many VMs are running on Host A before the scenario starts?
2) How many VMs from 1) are HA?
3) Please specify what happened to the HA VMs in your scenario

Comment 8 Steffen Froemer 2017-12-19 14:58:19 UTC
Hi Eli,

I will try to answer these questions.
How would soft fencing be possible when any connection from RHV-M is dropped on host A?
If it is possible in your test, it does not cover the real situation on the customer side.

1) It does not matter how many VMs are running on host A, as long as at least 1 HA-flagged VM is among them.
2) see 1)
3) Nothing. The SPM did not switch over and the HA VM was not started. The latter could be expected while the VM is still running and updating its VM lease, but at the latest once the qemu process is killed, it should be started on host B.

Comment 9 Germano Veit Michel 2017-12-19 23:15:21 UTC
(In reply to Eli Mesika from comment #7)
> (In reply to Germano Veit Michel from comment #0)
> 
> > How reproducible:
> > 100%
> > 
> > Steps to Reproduce:
> > 1. Enable Power Management for host A, with wrong configuration
> > 2. Set A as SPM
> > 3. In A, block communication with engine and stop renewing sanlock leases:
> >    # iptables -A INPUT -s <RHV-M IP> -j DROP
> >    # systemctl stop vdsmd (to release the SDM lease)
> > 
> > * If you skip step 1, everything works.
> > 
> 
> Well, this is not reproducible by the scenario you wrote since the
> soft-fencing procedure occurs before the hard-fencing(reboot) , the
> soft-fencing procedure restarts the vdsmd service and another host become
> SPM ...

How can soft-fencing work if iptables blocks the traffic? If you look at the logs, you will see that soft fencing fails in the reproducer.

The steps reproduce the bug the customer hit. I tried 3 times, and my engine looped all 3 times. No SPM failover. DC is down. Disable Power Management and everything works.

> 
> Following your procedure, the SPM lease was released and the other host that
> I had (lets call it host B) became SPM
> 
> Please also add the following details :
> 
> 1) How many VMs are running on Host A before the scenario starts?

In the customer logs, you can see 2 or 3 VMs going to an unknown state when the host powers off.

> 2) How many VMs from 1) are HA ?

All of them.

> 3) Please specify what happened to the HA VMs in your scenario

Nothing; in my reproducer I did not have HA VMs. I just reproduced the failure to switch the SPM role. As the DC is down, nothing else works; I don't think we need to worry about VM HA now, since without an SPM the DC is down.

Comment 10 Eli Mesika 2017-12-21 14:48:42 UTC
Well, first of all the real scenario that I tested is :

Host A and Host B are on the same cluster (4.1).

1) Host A is powered off while it has the SPM lease
2) PM is configured wrongly on A

Then I repeated the scenario with no PM configuration on host A.

In both cases the SPM lease was not released from host A, which actually held it until the SPM role was transferred manually.

The reporter wrote in the scenario that we should manually stop vdsmd; I did not follow that, since the scenario is a power-off of the host, not a manual stop of the vdsmd service. So we have to be aligned in the reproduction with the reported scenario...

From looking at the code in SpmStopVDSCommand::executeVdsBrokerCommand(), it seems that if the host is unreachable, we cannot transfer the SPM from it automatically...

Can't this be handled by sanlock? If the host is shut down for a while, its sanlock lease is not updated, and maybe after a while sanlock on a different host would be able to acquire its lease?

Comment 11 Germano Veit Michel 2017-12-21 23:46:00 UTC
Ohh, sorry!

You are right Eli. It's indeed slightly different. Somehow gracefully releasing the lease makes it work, and letting it expire doesn't. Wonder why.

So we have 2 scenarios to fix?

1) SPM failover when host powered off regardless of power management
2) SPM failover when vdsm shutdown gracefully and power management on

And I'm afraid 2 can only be hit on purpose.

Comment 13 Eli Mesika 2018-01-03 12:55:31 UTC
Allon 

for scenario 1 in comment 11:

If vdsmd is stopped unexpectedly as a result of a power-off and the PM agent associated with the host is not reachable, then the engine cannot release the SPM. Is there any expiration time after which the SPM election will start again? From my check it seems that the answer is no.

However, if the answer to that is 'No', I think that this should be closed as NOT A BUG, since the administrator in that case should anyway confirm that the host was rebooted, and he can also transfer the SPM to any other host manually.

Comment 15 Steffen Froemer 2018-01-03 13:09:47 UTC
I do not agree to close this as NOT A BUG, as the mentioned scenario is a valid disaster scenario in a production environment.

In that case, I have to assume the HA feature of RHV is simply NOT working.
If an administrator needs to confirm the outage of an environment, there is NO automatic failover mechanism.

Comment 16 Eli Mesika 2018-01-03 13:28:04 UTC
(In reply to Steffen Froemer from comment #15)
> I do not agree to close this as NOT A BUG, as the mentioned scenario is a
> valid disaster scenario in production environment. 
> 
> In this way, I need to assume, the HA feature of RHV is simply NOT working.
> If an administrator need to confirm the outage of an environment, there is
> NO automatic failover mechanism existent.

Therefore there is a NEEDINFO on Allon M, who leads the storage team and can check whether we can handle this scenario...

Comment 17 Martin Perina 2018-01-03 14:35:23 UTC
(In reply to Steffen Froemer from comment #15)
> I do not agree to close this as NOT A BUG, as the mentioned scenario is a
> valid disaster scenario in production environment. 
> 
> In this way, I need to assume, the HA feature of RHV is simply NOT working.
> If an administrator need to confirm the outage of an environment, there is
> NO automatic failover mechanism existent.

Allon, can the SPM lock expire after some time (similar to sanlock lease), so another host can be elected as SPM even without prior SPM being fenced? If not, then everything is working as expected.

Comment 18 Martin Perina 2018-01-03 15:12:06 UTC
(In reply to Martin Perina from comment #17)
> (In reply to Steffen Froemer from comment #15)
> > I do not agree to close this as NOT A BUG, as the mentioned scenario is a
> > valid disaster scenario in production environment. 
> > 
> > In this way, I need to assume, the HA feature of RHV is simply NOT working.
> > If an administrator need to confirm the outage of an environment, there is
> > NO automatic failover mechanism existent.

Sorry, there was a conflict when submitting my comment and I lost some parts during resubmission; here's the original post:

A working power management setup is a prerequisite for the HA VMs feature. When you are not able to fence the host, you don't know whether the VM is running on it or not, so you risk a split-brain if you try to start it on a different host. The same applies to the SPM: if the host cannot be fenced, we don't know whether it still has access to the storage, so we cannot delegate the SPM role to a different host.

So the only remaining question is whether the SPM lock expires when the host is physically off. Allon, can the SPM lock expire after some time (similar to a sanlock lease), so another host can be elected as SPM even without the prior SPM being fenced? If not, then everything is working as expected.

Comment 19 Allon Mureinik 2018-01-07 09:55:35 UTC
Hi guys,

Sorry for the late response. I was on PTO and am digging through my backlog.

Engine currently does not issue an SPMStart command to any other host until it's sure the SPM is down (e.g., successfully fenced, admin has clicked "confirm host has been rebooted", etc).

I guess we could add a timeout (for argument's sake, Sanlock's timeout, starting to count after power fencing was attempted and failed).

Nir/Tal - am I missing anything here?

Comment 20 Nir Soffer 2018-01-07 13:07:51 UTC
We cannot use timeouts or any other heuristics, because of the master mount.

On block storage, the SPM is mounting the master lv, so we cannot start
the SPM on another host before unmounting the master lv.

On file storage, we can start the SPM once vdsm was killed on the old SPM.

So what we can do is:

- master domain on block storage - nothing, the admin must manually reboot the SPM
  host.

- master domain on file storage - we can check if the SPM holds the SPM lease, in
  the same way we check host liveliness during fencing. If the SPM does not have
  a lease, it is safe to start the SPM on another host

If we want a solution for block storage, we can use a killpath program to kill
vdsm when the SPM lost the lease.

This program will:
- terminate vdsm
- check if the master mount was unmounted, and unmount it if needed
- if the master mount could not be unmounted, fail. This will cause sanlock
  to reboot the host

When we have a killpath program, we can use the sanlock_request API from another host
to take the SPM lease away from the current SPM.

    sanlock client request -r RESOURCE -f force_mode

    Request the owner of a resource do something specified by force_mode.
    A versioned RESOURCE:lver string must be used with a greater version than
    is presently held.  Zero lver and force_mode clears the request.
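
As a rough sketch only (the master mount detection and paths below are assumptions, not an existing vdsm script), such a killpath hook could look like:

    #!/bin/sh
    # Hypothetical killpath hook, invoked when the SPM lease is lost.
    # 1. terminate vdsm
    systemctl stop vdsmd || systemctl kill vdsmd
    # 2. find the master mount (path pattern is an assumption) and unmount it
    master=$(awk '$2 ~ /^\/rhev\/data-center\// && $2 ~ /\/master$/ {print $2; exit}' /proc/mounts)
    if [ -n "$master" ]; then
        # 3. if unmounting fails, exit non-zero so sanlock can reboot the host
        umount "$master" || exit 1
    fi
    exit 0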

David, what do you think?

Comment 21 Eli Mesika 2018-01-08 08:25:33 UTC
(In reply to Nir Soffer from comment #20)

> If we want a solution for block storage, we can use a killpath program to
> kill vdsm when the SPM lost the lease.

Will this work when the host is shut down unexpectedly (as a result of an outage, for example)? From what I saw in this scenario, the VDSM service is killed (so you need no killpath program to kill it, since the host is dead) and the lease is not released (this is the scenario described in this BZ).

Comment 22 Nir Soffer 2018-01-08 09:30:26 UTC
(In reply to Eli Mesika from comment #21)
> (In reply to Nir Soffer from comment #20)
> 
> Will this work when the host is shutdown unexpectedly

If vdsm is killed, nobody can ensure that the mount is unmounted; you will have
to wait until the host is up again, or the user can manually use "confirm host
was rebooted".

If the host is shut down, there is no mount, but we don't have a way to tell that a
host was shut down.

We can check if a host is maintaining a lease on storage, but I don't know if we
have a way to detect that a host was rebooted.

David, can we use the delta lease to track host reboots?

Comment 23 David Teigland 2018-01-08 17:26:24 UTC
Each time a host joins a lockspace by acquiring the host_id lease, the generation number for that host_id lease is incremented.  So the host_id generation number may be what you're looking for.
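
As an illustration (the paths below are examples only, not taken from a specific environment), the per-host generation and timestamp recorded in the delta leases can be read directly from a domain's ids volume with sanlock:

    # block storage domain (the VG is named after the SD UUID)
    sanlock direct dump /dev/<sd-uuid>/ids
    # file storage domain
    sanlock direct dump /rhev/data-center/mnt/<server>/<sd-uuid>/dom_md/ids

Each dumped row includes the owning host_id, its last timestamp and its generation, so a host that rejoined the lockspace after a reboot should show an incremented generation there.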

Comment 25 Nir Soffer 2018-01-10 15:02:13 UTC
This is not 4.1.9 material, please move to 4.3.

Comment 26 Allon Mureinik 2018-01-15 11:41:48 UTC
(In reply to Nir Soffer from comment #25)
> This is not 4.1.9 material, please move to 4.3.

I agree this probably isn't 4.1.9 material at this point, but this needs further discussion. We seem to be missing something - I get the reasoning in comment 20, but we've reached the odd situation where offhand it seems as though a system without power-fencing configured may, in fact, be more stable than a system with power-fencing.

Pushing out to 4.2.2 for the meanwhile until we have a clear action plan. At that point, we can defer to 4.3 or even backport to 4.1.10 if there's such a release.

Comment 27 Nir Soffer 2018-01-16 17:38:03 UTC
I think comment 0 is wrong. When you don't have power management, we cannot
start the SPM on another host unless the user confirmed that the host was rebooted.
If we did this, we would corrupt the master mount on block storage.

Next step: try to reproduce what comment 0 describes.

Comment 28 Germano Veit Michel 2018-01-17 00:32:49 UTC
(In reply to Nir Soffer from comment #27)
> I think comment 0 is wrong. When you don't have power management, we cannot
> start the SPM on another host unless the user confirmed that the host was
> rebooted.
> If we did this, we would corrupt the master mount on block storage.
> 
> Next step: try to reproduce what comment 0 describe.

Nir, you are right. I missed the fact that stopping vdsm gracefully releases the SPM role; we realized that later in the bug. See comment 11 for the actual problems here. No need to test it again.

The SPM role is not started on a different host in case of power failure (with fencing or not).

Comment 39 Germano Veit Michel 2018-02-07 04:46:32 UTC
I upgraded to 4.1.9 and re-did the tests.

To my surprise, HA now works without an SPM, regardless of the SPM power management settings (whereas in older versions the engine would loop on SPM PM attempts and do nothing else).

So not having an SPM doesn't kill HA anymore. Any idea what fixed it?

Since this is working, I assume it's ok to lower the severity of this bug.

Comment 40 Germano Veit Michel 2018-06-05 00:20:52 UTC
(In reply to Germano Veit Michel from comment #39)
> I upgraded to 4.1.9 and re-did the tests.
> 
> To my surprise HA now works without a SPM, and doesn't matter the SPM power
> management settings (wheres in older versions engine would loop on SPM PM
> attempts and do nothing else).
> 
> So not having a SPM doesn't kill HA anymore. Any idea what fixed it?
> 
> Since this is working, I assume it's ok to lower the severity of this bug.

We just had a customer hit this on 4.2. We have re-tested it, and it looks like in 4.2 there is no HA (leases) functionality without an SPM.

It hits ACTION_TYPE_FAILED_INVALID_VM_LEASE_STORAGE_DOMAIN_STATUS in RunVmValidator.

Not sure if it worked on 4.1.9 due to some luck we had, but it seems we have problems again:

In summary:
1) The SPM does not fail over without power management or manual intervention (SPOF)
2) If the HA lease functionality depends on the SPM being up (lease SD in Up status), then the problem is more severe.

Comment 42 Nir Soffer 2018-06-10 12:35:47 UTC
(In reply to Germano Veit Michel from comment #40)
> In summary:
> 1) SPM does not fail-over without power management or manual intervention
> (SPOF)
> 2) If HA leases functionality depends on SPM up (Lease SD in Up status),
> then the problem is more severe.

Creating a lease depends on the SPM, but once you have a VM with a lease, it does
not need the SPM to start.

I think the issue is that once the SPM is down, the engine marks all storage domains
as down, and this prevents using a VM with a lease on any domain.

This should be fixed in the engine; not having an SPM should not move storage domains to
the down state. We are monitoring them successfully from all hosts, and they should not
depend on having an SPM. I guess this will not be an easy fix; this has been the basic
design of the system for ages.

Tal, what do you think?

Comment 48 Yaniv Lavi 2018-10-11 09:35:03 UTC
We found out this is a regression introduced in RHV 4.2.
I want this fixed in the current z-stream and am adding a blocker flag, due to the impact.

Comment 49 Nir Soffer 2018-10-11 11:08:51 UTC
Eyal, can you explain why this is a regression? Do you think the validation added
for bug 1561006 is the root cause?

Comment 50 Eyal Shenitzky 2018-10-11 11:16:34 UTC
Yes,
this validation prevents an HA VM with a lease from running if the lease storage domain is not active.
In the scenario above, all the storage domains become 'non-active' and the engine fails the restoration of the VM.

Comment 51 Eyal Shenitzky 2018-10-15 11:56:46 UTC
Steps to reproduce:

In an environment with 2 hosts 'h1' SPM and 'h2' HSM:

1. Set the HSM host ('h2') SPM priority to 'never'
2. Create a VM with a disk and a lease
3. Run the VM on the SPM
4. Block the connection from the engine to the SPM.

Before this fix:
The VM failed to start on the HSM ('h2') after the SPM ('h1') went down and all the storage domains were deactivated.

After this fix:
The VM managed to run on the HSM ('h2') even if all the storage domains are down.

Comment 53 Eyal Shenitzky 2018-10-21 07:19:33 UTC
Forgot a step (4).

Updated steps to reproduce are:

In an environment with 2 hosts 'h1' SPM and 'h2' HSM:

1. Set the HSM host ('h2') SPM priority to 'never'
2. Create a VM with a disk and a lease
3. Run the VM on the SPM
4. Block the connection from the SPM to the storage.
5. Block the connection from the engine to the SPM. ==> simulating crashed SPM

Comment 54 Yosi Ben Shimon 2018-10-29 15:50:16 UTC
Tested using:
ovirt-engine-setup-4.3.0-0.0.master.20181016132820.gite60d148.el7.noarch
vdsm-4.30.1-25.gitce9e416.el7.x86_64

Actual result (according to the steps in comment #53):
After the connection was blocked, all the SDs and the DC went down, and the VM failed over to the 2nd host (with SPM=never) as expected.

VERIFIED

Comment 55 Eyal Shenitzky 2018-10-29 18:25:22 UTC
Yossi,

Can you please change the bug status?

Comment 58 errata-xmlrpc 2019-05-08 12:36:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:1085

Comment 59 Daniel Gur 2019-08-28 13:13:29 UTC
sync2jira

Comment 60 Daniel Gur 2019-08-28 13:17:42 UTC
sync2jira

Comment 63 Ilan Zuckerman 2020-08-05 08:14:30 UTC
(In reply to Eyal Shenitzky from comment #51)
> Steps to reproduce:
> 
> In an environment with 2 hosts 'h1' SPM and 'h2' HSM:
> 
> 1. Set the HSM host ('h2') SPM priority to 'never'
> 2. Create a VM with a disk and a lease
> 3. Run the VM on the SPM
> 4. Block the connection from the SPM to the storage.
> 5. Block the connection from the engine to the SPM. ==> simulating crashed SPM

> 
> Before this fix:
> VM failed to start on the HSM ('h2') after the SPM ('h1') went down and all
> the storage domain deactivated
> 
> After this fix:
> VM managed to run on the HSM ('h2') even if all the storage domain are down

Eyal, I wasn't able to get the expected behavior on an HE env, rhv-release-4.4.1-12-001.noarch.
I assume this should be working on an HE env as well, right?

This is my setup:
host1: set as SPM. no vms
host2: SPM priority set to 'never'. no vms
host3: SPM priority set to 'never'. running the HE vm

Steps:
1. Create template vm with lease on iscsi
2. Start it on host1
3. Block the connection from the SPM to the storage:
[root@caracal04 ~]# iptables -A OUTPUT -d 3par-iscsi-1.scl.lab.tlv.redhat.com -j DROP

4. Block the connection from the engine to the SPM. ==> simulating crashed SPM
[root@hosted-engine-09 ~]# iptables -A OUTPUT -d 10.46.30.4 -j DROP
(the ip is for caracal04)

Now, all of the SDs went down.
The VM went to status 'unknown' but is still showing on host1.
The HA VM is NOT migrating to host2 / host3 as expected.

Attaching vdsm and engine logs

Comment 64 Ilan Zuckerman 2020-08-05 08:20:44 UTC
Created attachment 1710479 [details]
Wasnt able to see the fix on HE env rhv-release-4.4.1-12-001

Comment 65 Ilan Zuckerman 2020-08-06 06:06:29 UTC
Also tried the same scenario from comment #63 on a regular (NOT HE) env.
The result is AS EXPECTED.
After a few minutes of the SDs being inactive, the HA VM migrates to host2.

Comment 66 Eyal Shenitzky 2020-08-10 10:44:40 UTC
This bug is already closed and verified.

If you think that there is a bug please file a new bug with all the details.

Comment 67 Ilan Zuckerman 2020-08-17 06:21:50 UTC
(In reply to Eyal Shenitzky from comment #66)
> This bug is already closed and verified.
> 
> If you think that there is a bug please file a new bug with all the details.

New BZ opened as you suggested:
https://bugzilla.redhat.com/show_bug.cgi?id=1869162

