Bug 1639269
| Summary: | [downstream clone - 4.2.7] [DR] - HA VM with lease will not work, if SPM is down and power management is not available. | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | RHV bug bot <rhv-bugzilla-bot> |
| Component: | ovirt-engine | Assignee: | Eyal Shenitzky <eshenitz> |
| Status: | CLOSED ERRATA | QA Contact: | Yosi Ben Shimon <ybenshim> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 4.1.8 | CC: | apinnick, emesika, eshenitz, gveitmic, kshukla, lsurette, lsvaty, lveyde, mgoldboi, michal.skrivanek, mlipchuk, mperina, nsoffer, ratamir, Rhev-m-bugs, sfroemer, srevivo, teigland, tnisan, ylavi |
| Target Milestone: | ovirt-4.2.7 | Keywords: | Regression, ZStream |
| Target Release: | --- | Flags: | lsvaty: testing_plan_complete- |
| Hardware: | Unspecified | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | ovirt-engine-4.2.7.3 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | 1527249 | Environment: | |
| Last Closed: | 2018-11-05 15:03:18 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 1527249 | ||
| Bug Blocks: | |
Description
RHV bug bot, 2018-10-15 12:38:47 UTC
Why is it medium severity? Sounds like basic functionality that should work? Raz - I assume this is tested all the time?

(Originally by Yaniv Kaul)

(In reply to Yaniv Kaul from comment #3)
> Why is it medium severity? Sounds like basic functionality that should work?
> Raz - I assume this is tested all the time?

This should be tested by the coresystem team. Lukas, can you provide more info about the testing frequency of this issue?

(Originally by Raz Tamir)

(In reply to Germano Veit Michel from comment #0)
> How reproducible:
> 100%
>
> Steps to Reproduce:
> 1. Enable Power Management for host A, with wrong configuration
> 2. Set A as SPM
> 3. In A, block communication with engine and stop renewing sanlock leases:
>    # iptables -A INPUT -s <RHV-M IP> -j DROP
>    # systemctl stop vdsmd (to release the SDM lease)
>
> * If you skip step 1, everything works.

Well, this is not reproducible by the scenario you wrote, since the soft-fencing procedure occurs before the hard-fencing (reboot). The soft-fencing procedure restarts the vdsmd service and another host becomes SPM. Following your procedure, the SPM lease was released and the other host that I had (let's call it host B) became SPM.

Please also add the following details:
1) How many VMs are running on host A before the scenario starts?
2) How many VMs from 1) are HA?
3) Please specify what happened to the HA VMs in your scenario.

(Originally by Eli Mesika)

Hi Eli, I'll try to answer these questions. How would soft-fencing be possible when every connection from RHV-M is dropped on host A? If it is possible in your test, it does not cover the real situation on the customer side.

1) It does not matter how many VMs are running on host A, as long as there is at least 1 HA-flagged VM.
2) See 1).
3) Nothing. The SPM did not switch and the HA VM did not start. The latter could be expected while the VM is still running and updating its VM lease, but after killing the qemu process it should be started on host B.

(Originally by Steffen Froemer)

(In reply to Eli Mesika from comment #7)
> Well, this is not reproducible by the scenario you wrote, since the
> soft-fencing procedure occurs before the hard-fencing (reboot). The
> soft-fencing procedure restarts the vdsmd service and another host becomes
> SPM ...

How can soft-fencing work if iptables blocks the traffic? If you look at the logs, you will see that soft fencing fails in the reproducer. The steps reproduce the bug the customer hit. I tried 3 times, and my engine is looping 3 times. No SPM failover. DC is down. Disable Power Management and everything works.

> Following your procedure, the SPM lease was released and the other host that
> I had (let's call it host B) became SPM
>
> Please also add the following details:
>
> 1) How many VMs are running on Host A before the scenario starts?

In the customer logs you can see 2 or 3 VMs going to unknown state when the host powers off.

> 2) How many VMs from 1) are HA?

All of them.

> 3) Please specify what happened to the HA VMs in your scenario

Nothing, in my reproducer I did not have HA VMs. I just reproduced the failure to switch the SPM role.
As the DC is down, nothing else works. I don't think we need to worry about VM HA now; without an SPM the DC is down.

(Originally by Germano Veit Michel)

Well, first of all, the real scenario that I tested is: host A and host B on the same cluster (4.1).
1) Host A is powered off while it has the SPM lease.
2) PM is configured wrongly on A.
Then I repeated the scenario with no PM configuration on host A. In both cases the SPM lease was not released from host A, and it actually held it until the SPM lease was transferred manually.

The reporter wrote in the scenario that we should manually stop vdsmd. I did not follow that; the scenario is a power-off of the host, not a manual stop of the vdsmd service. So we have to be aligned in the reproduction with the reported scenario.

From looking at the code in SpmStopVDSCommand::executeVdsBrokerCommand(), it seems that if the host is unreachable, we cannot transfer the SPM from it automatically.

Can't this be handled by sanlock? If the host is shut down for a while, its sanlock lease is not updated, and maybe after a while sanlock on a different host is able to acquire its lease?

(Originally by Eli Mesika)

Ohh, sorry! You are right Eli. It's indeed slightly different. Somehow gracefully releasing the lease makes it work, and letting it expire doesn't. Wonder why.

So we have 2 scenarios to fix?
1) SPM failover when the host is powered off, regardless of power management
2) SPM failover when vdsm is shut down gracefully and power management is on
And I'm afraid 2 can only be hit on purpose.

(Originally by Germano Veit Michel)

Allon, for scenario 1 in comment 11: if vdsmd is stopped unexpectedly as a result of a power-off and the PM agent associated with the host is not reachable, then the engine cannot release the SPM. Is there any expiration time after which the SPM election will start again? From my check, it seems that the answer is no.

However, if the answer is 'No', I think that this should be closed as NOT A BUG, since the administrator in that case should anyway confirm that the host was rebooted, and he can also transfer the SPM to any other host manually.

(Originally by Eli Mesika)

I do not agree to close this as NOT A BUG, as the mentioned scenario is a valid disaster scenario in a production environment. In that case I have to assume that the HA feature of RHV is simply NOT working. If an administrator needs to confirm the outage of an environment, there is NO automatic failover mechanism.

(Originally by Steffen Froemer)

(In reply to Steffen Froemer from comment #15)
> I do not agree to close this as NOT A BUG, as the mentioned scenario is a
> valid disaster scenario in a production environment. In that case I have to
> assume that the HA feature of RHV is simply NOT working. If an administrator
> needs to confirm the outage of an environment, there is NO automatic
> failover mechanism.

Therefore there is a NEEDINFO on Allon M., who leads the storage team and can check if we can handle this scenario.

(Originally by Eli Mesika)

(In reply to Steffen Froemer from comment #15)
> I do not agree to close this as NOT A BUG, as the mentioned scenario is a
> valid disaster scenario in a production environment. In that case I have to
> assume that the HA feature of RHV is simply NOT working. If an administrator
> needs to confirm the outage of an environment, there is NO automatic
> failover mechanism.

Allon, can the SPM lock expire after some time (similar to a sanlock lease), so another host can be elected as SPM even without the prior SPM being fenced? If not, then everything is working as expected.

(Originally by Martin Perina)

(In reply to Martin Perina from comment #17)
Sorry, there was a conflict when submitting my comment and I lost some parts during resubmission; here's the original post:

A working power management setup is a prerequisite for the HA VMs feature. When you are not able to fence the host, you don't know if the VM is running on it or not, so you are risking a split brain if you try to execute it on a different host. The same is true for the SPM: if the host cannot be fenced, we don't know whether the host has access to the storage or not, so we cannot delegate the SPM role to a different host.

So the only remaining question is whether the SPM lock expires when the host is physically off. Allon, can the SPM lock expire after some time (similar to a sanlock lease), so another host can be elected as SPM even without the prior SPM being fenced? If not, then everything is working as expected.

(Originally by Martin Perina)

Hi guys, sorry for the late response. I was on PTO and am digging through my backlog.

The engine currently does not issue an SPMStart command to any other host until it's sure the SPM is down (e.g., successfully fenced, admin has clicked "confirm host has been rebooted", etc). I guess we could add a timeout (for argument's sake, sanlock's timeout, counting from the moment power fencing was attempted and failed). Nir/Tal - am I missing anything here?

(Originally by amureini)

We cannot use timeouts or any other heuristics, because of the master mount.
On block storage, the SPM is mounting the master lv, so we cannot start
the SPM on another host before unmounting the master lv.
On file storage, we can start the SPM once vdsm was killed on the old SPM.
So what we can do is:
- master domain on block storage - nothing, the admin must manually reboot the SPM
host.
- master domain on file storage - we can check if the SPM holds the SPM lease, in
the same way we check host liveliness during fencing. If the SPM does not have
a lease, it is safe to start the SPM on another host
If we want a solution for block storage, we can use a killpath program to kill
vdsm when the SPM loses the lease.
This program will:
- terminate vdsm
- check if the master mount was unmounted, and unmount it if needed
- if the master mount could not be unmounted, fail. This will cause sanlock
to reboot the host
When we have a killpath program, we can use the sanlock_request API to take the SPM
lease from the current SPM, from another host:
sanlock client request -r RESOURCE -f force_mode
Request the owner of a resource do something specified by force_mode.
A versioned RESOURCE:lver string must be used with a greater version than
is presently held. Zero lver and force_mode clears the request.
David, what do you think?
(Originally by Nir Soffer)
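For illustration, a minimal sketch of such a killpath program following the outline above. The master mount path, the service name and the exit convention are assumptions made for the sketch; this is not shipped RHV code.

    #!/bin/bash
    # Hypothetical killpath sketch (not RHV code): terminate vdsm, then make sure
    # the master mount is gone; a non-zero exit leaves sanlock to reboot the host.

    MASTER_MNT="/rhev/data-center/mnt/blockSD/<master_sd_uuid>/master"   # placeholder path

    # 1. terminate vdsm
    systemctl kill --signal=SIGKILL vdsmd

    # 2. unmount the master mount if it is still mounted
    if mountpoint -q "$MASTER_MNT"; then
        # 3. if it cannot be unmounted, fail so that sanlock reboots the host
        umount "$MASTER_MNT" || exit 1
    fi

    exit 0

With such a program registered, another host could then issue a request of the form "sanlock client request -r <lockspace>:<resource>:<path>:<offset>:<lver> -f <force_mode>" as quoted above; the resource string format and force_mode values are the ones documented in sanlock(8), and the placeholders here are not real RHV values.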
(In reply to Nir Soffer from comment #20)
> If we want a solution for block storage, we can use a killpath program to
> kill vdsm when the SPM loses the lease.

Will this work when the host is shut down unexpectedly (as a result of an outage, for example)? From what I saw in this scenario, the VDSM service is killed (so you need no killpath program to kill it, since the host is dead) and the lease is not released (this is the scenario described in this BZ).

(Originally by Eli Mesika)

(In reply to Eli Mesika from comment #21)
> Will this work when the host is shut down unexpectedly

If vdsm is killed, nobody can ensure that the mount is unmounted; you will have to wait until the host is up again, or the user can manually use "confirm host was rebooted". If the host is shut down, there is no mount, but we don't have a way to tell that a host was shut down. We can check if a host is maintaining a lease on storage, but I don't know if we have a way to detect that a host was rebooted.

David, can we use the delta lease to track host reboots?

(Originally by Nir Soffer)

Each time a host joins a lockspace by acquiring the host_id lease, the generation number for that host_id lease is incremented. So the host_id generation number may be what you're looking for.

(Originally by David Teigland)

This is not 4.1.9 material, please move to 4.3.

(Originally by Nir Soffer)

(In reply to Nir Soffer from comment #25)
> This is not 4.1.9 material, please move to 4.3.

I agree this probably isn't 4.1.9 material at this point, but this needs further discussion. We seem to be missing something - I get the reasoning in comment 20, but we've reached the odd situation where, offhand, it seems as though a system without power-fencing configured may in fact be more stable than a system with power-fencing.

Pushing out to 4.2.2 for the meanwhile until we have a clear action plan. At that point, we can defer to 4.3 or even backport to 4.1.10 if there's such a release.

(Originally by amureini)

I think comment 0 is wrong. When you don't have power management, we cannot start the SPM on another host unless the user confirmed that the host was rebooted. If we did this, we would corrupt the master mount on block storage.

Next step: try to reproduce what comment 0 describes.

(Originally by Nir Soffer)

(In reply to Nir Soffer from comment #27)
> I think comment 0 is wrong. When you don't have power management, we cannot
> start the SPM on another host unless the user confirmed that the host was
> rebooted. If we did this, we would corrupt the master mount on block storage.
>
> Next step: try to reproduce what comment 0 describes.

Nir, you are right, I missed the fact that stopping vdsm gracefully releases the SPM role. We realized that further into the bug. See comment 11 for the actual problems here. No need to test it again. The SPM role is not started on a different host in case of power failure (with fencing or not).

(Originally by Germano Veit Michel)

I upgraded to 4.1.9 and re-did the tests.

To my surprise, HA now works without an SPM, and the SPM power management settings don't matter (whereas in older versions the engine would loop on SPM PM attempts and do nothing else).

So not having an SPM doesn't kill HA anymore. Any idea what fixed it?

Since this is working, I assume it's ok to lower the severity of this bug.

(Originally by Germano Veit Michel)

(In reply to Germano Veit Michel from comment #39)
> I upgraded to 4.1.9 and re-did the tests.
> To my surprise, HA now works without an SPM, and the SPM power management
> settings don't matter (whereas in older versions the engine would loop on
> SPM PM attempts and do nothing else).
>
> So not having an SPM doesn't kill HA anymore. Any idea what fixed it?
>
> Since this is working, I assume it's ok to lower the severity of this bug.

We just had a customer hitting this on 4.2. We have re-tested it, and it looks like in 4.2 there is no HA (leases) functionality without an SPM. It hits ACTION_TYPE_FAILED_INVALID_VM_LEASE_STORAGE_DOMAIN_STATUS in RunVmValidator. Not sure if it worked on 4.1.9 due to some luck we had, but it seems we have problems again.

In summary:
1) The SPM does not fail over without power management or manual intervention (SPOF).
2) If the HA leases functionality depends on the SPM being up (lease SD in Up status), then the problem is more severe.

(Originally by Germano Veit Michel)

(In reply to Germano Veit Michel from comment #40)
> In summary:
> 1) The SPM does not fail over without power management or manual
> intervention (SPOF).
> 2) If the HA leases functionality depends on the SPM being up (lease SD in
> Up status), then the problem is more severe.

Creating a lease depends on the SPM, but when you have a VM with a lease, it does not need the SPM to start. I think the issue is that once the SPM is down, the engine marks all storage domains as down, and this prevents using a VM with a lease on any domain.

This should be fixed in the engine; not having an SPM should not move storage domains to the down state. We are monitoring them successfully from all hosts, and they should not depend on having an SPM. I guess this will not be an easy fix; this has been the basic design of the system for ages.

Tal, what do you think?

(Originally by Nir Soffer)

We found out this is a regression added in RHV 4.2. I want this fixed in the current z-stream and am adding a blocker flag, due to the impact.

(Originally by ylavi)

Eyal, can you explain why this is a regression? Do you think the validation added in bug 1561006 is the root cause?

(Originally by Nir Soffer)

Yes, this validation prevents an HA VM with a lease from running if the lease storage domain is not active. In the scenario above, all the storage domains become 'non-active' and the engine fails to restart the VM.

(Originally by Eyal Shenitzky)

Steps to reproduce:
In an environment with 2 hosts 'h1' SPM and 'h2' HSM:
1. Set the HSM host ('h2') SPM priority to 'never'
2. Create a VM with a disk and a lease
3. Run the VM on the SPM
4. Block the connection from the engine to the SPM.
Before this fix:
The VM failed to start on the HSM ('h2') after the SPM ('h1') went down and all the storage domains were deactivated.
After this fix:
The VM manages to run on the HSM ('h2') even though all the storage domains are down.
(Originally by Eyal Shenitzky)
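As a side note (not part of the original comment), one way to confirm on 'h2' that the restarted VM actually re-acquired its lease is to look at sanlock's view on that host. This is only a sketch; it assumes the VM lease lives on the lease storage domain's xleases volume:

    # on 'h2', after the HA VM has restarted
    sanlock client status
    # expect a held resource entry referencing the lease storage domain's
    # xleases volume for the restarted VM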
Tested using the steps in comment #53 on:
ovirt-engine-4.2.7.3-0.1.el7ev.noarch
vdsm-4.20.43-1.el7ev.x86_64

Actual result:
The VM failed to start on the 2nd host (HSM) for a very long time (waited ~50 minutes and it still didn't start) after 4 tries.

In this env:
- host_mixed_1 -> SPM
- host_mixed_2 -> HSM (SPM = Never)
- host_mixed_3 -> maintenance
- PM disabled
- The DC and all SDs are down, the VM is in "Unknown" state, the SPM is in "NonResponsive" state.

From the engine log:

2018-10-18 16:24:52,872+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-567) [155259fe] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to restart VM test_VM_HA on Host host_mixed_2
2018-10-18 16:24:55,380+03 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to storage-ge17-vdsm1.scl.lab.tlv.redhat.com/10.35.83.183
2018-10-18 16:24:56,431+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmsStatisticsFetcher] (EE-ManagedThreadFactory-engineScheduled-Thread-65) [] Fetched 1 VMs from VDS 'db1fc49d-20ee-49b0-b22d-b60dfca00df4'
2018-10-18 16:24:56,433+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-65) [] VM 'a7a514e4-11ad-41ac-9acf-5d42eff38ef5'(test_VM_HA) was unexpectedly detected as 'WaitForLaunch' on VDS 'db1fc49d-20ee-49b0-b22d-b60dfca00df4'(host_mixed_2) (expected on 'ef7f30b8-eed8-4b5e-a839-083f1d2e1840')
2018-10-18 16:25:03,741+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-9) [] VM 'a7a514e4-11ad-41ac-9acf-5d42eff38ef5' was reported as Down on VDS 'db1fc49d-20ee-49b0-b22d-b60dfca00df4'(host_mixed_2)
2018-10-18 16:25:03,743+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-9) [] START, DestroyVDSCommand(HostName = host_mixed_2, DestroyVmVDSCommandParameters:{hostId='db1fc49d-20ee-49b0-b22d-b60dfca00df4', vmId='a7a514e4-11ad-41ac-9acf-5d42eff38ef5', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='true'}), log id: 703e8e65

Moving to ASSIGNED

Forgot a step (4).
Updated steps to reproduce are:
In an environment with 2 hosts 'h1' SPM and 'h2' HSM:
1. Set the HSM host ('h2') SPM priority to 'never'
2. Create a VM with a disk and a lease
3. Run the VM on the SPM
4. Block the connection from the SPM to the storage.
5. Block the connection from the engine to the SPM. ==> simulating a crashed SPM (see the sketch below)
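A sketch of how steps 4 and 5 can be simulated on the SPM host with iptables, in the spirit of the rule from comment #0. The addresses are placeholders, and for FC storage step 4 would need to be done at the fabric or LUN level instead:

    # on the SPM host ('h1'); addresses are placeholders
    iptables -A OUTPUT -d <storage server IP> -j DROP   # step 4: block SPM -> storage
    iptables -A INPUT  -s <RHV-M IP>          -j DROP   # step 5: block engine -> SPM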
Thank you, Eyal.

Tested according to Eyal's steps (comment #56) using:
ovirt-engine-4.2.7.3-0.1.el7ev.noarch
vdsm-4.20.43-1.el7ev.x86_64

Actual result:
After blocking the connections SPM -> Storage and Engine -> SPM, the SPM status changed to "NonResponsive", and all storage domains and the datacenter went down. The HA VM managed to restart and run on the other host (not the SPM, as SPM = never).

Moving to VERIFIED

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3480