Description of problem:

When the SPM host has power management configured and loses power, the SPM role does not fail over to another host. The engine keeps trying to fence the SPM host in a loop, failing each time because the fence agent also lost power. With power management disabled, the SPM role fails over normally.

Without Power Management:
1. SPM host is A
2. A has its power cut
3. Host B assumes the SPM role as soon as the ids lease for the SDM resource expires

With Power Management:
1. SPM host is A
2. A has its power cut
3. The engine keeps trying to fence host A; fencing keeps failing because the fence agent also has no power
4. No commands are sent for other hosts to grab the SDM lease and become SPM
5. The engine is stuck in a loop trying to fence A
6. The SDM resource is available for any other host to acquire and move the Data Center back to Up status, but nothing happens
7. As the DC is down, VM HA also fails

So basically the DC is down because there is no SPM, and the HA VMs that were running on the SPM are also down.

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.8.2-0.1.el7.noarch

How reproducible:
100%

Steps to Reproduce:
1. Enable Power Management for host A, with wrong configuration
2. Set A as SPM
3. In A, block communication with the engine and stop renewing sanlock leases:
   # iptables -A INPUT -s <RHV-M IP> -j DROP
   # systemctl stop vdsmd (to release the SDM lease)

* If you skip step 1, everything works.

Actual results:
Data Center not responsive, HA VMs not restarted (even with the lease option).

Expected results:
SPM role failing over to another host. Data Center up and the VM HA mechanism working.
Why is it medium severity? Sounds like basic functionality that should work? Raz - I assume this is tested all the time?
(In reply to Yaniv Kaul from comment #3)
> Why is it medium severity? Sounds like basic functionality that should work?
> Raz - I assume this is tested all the time?

This should be tested by the coresystem team. Lukas, can you provide more info on how frequently this is tested?
(In reply to Germano Veit Michel from comment #0)
> How reproducible:
> 100%
>
> Steps to Reproduce:
> 1. Enable Power Management for host A, with wrong configuration
> 2. Set A as SPM
> 3. In A, block communication with engine and stop renewing sanlock leases:
> # iptables -A INPUT -s <RHV-M IP> -j DROP
> # systemctl stop vdsmd (to release the SDM lease)
>
> * If you skip step 1, everything works.

Well, this is not reproducible with the scenario you wrote, since the soft-fencing procedure occurs before hard-fencing (reboot). Soft-fencing restarts the vdsmd service and another host becomes SPM.

Following your procedure, the SPM lease was released and the other host that I had (let's call it host B) became SPM.

Please also add the following details:
1) How many VMs are running on host A before the scenario starts?
2) How many VMs from 1) are HA?
3) Please specify what happened to the HA VMs in your scenario.
Hi Eli,

I'll try to answer these questions. How would soft-fencing be possible when any connection from RHV-M is dropped on host A? If it is possible in your test, the test does not cover the real situation on the customer side.

1) It does not matter how many VMs are running on host A, as long as there is at least 1 HA-flagged VM.
2) See 1).
3) Nothing. The SPM did not switch and the HA VMs did not start either. The latter could be expected while a VM is still running and updating its VM lease, but after killing the qemu process it should be started on host B.
(In reply to Eli Mesika from comment #7)
> Well, this is not reproducible by the scenario you wrote since the
> soft-fencing procedure occurs before the hard-fencing (reboot), the
> soft-fencing procedure restarts the vdsmd service and another host
> becomes SPM ...

How can soft-fencing work if iptables blocks the traffic? If you look at the logs, you will see that soft fencing fails in the reproducer. The steps reproduce the bug the customer hit. I tried 3 times, and my engine looped all 3 times: no SPM failover, DC down. Disable Power Management and everything works.

> 1) How many VMs are running on Host A before the scenario starts?

In the customer logs, you can see 2 or 3 VMs going to the unknown state when the host powers off.

> 2) How many VMs from 1) are HA ?

All of them.

> 3) Please specify what happened to the HA VMs in your scenario

Nothing; in my reproducer I did not have HA VMs, I only reproduced the failure to switch the SPM role. As the DC is down, nothing else works; I don't think we need to worry about VM HA now, since without an SPM the DC is down.
Well, first of all, the real scenario that I tested is:

Host A and host B in the same cluster (4.1)
1) Host A is powered off while it holds the SPM lease
2) PM is configured wrongly on A

Then I repeated the scenario with no PM configuration on host A.

In both cases the SPM lease was not released from host A; it actually held the lease until the SPM role was transferred manually. The reporter wrote in the scenario that we should manually stop vdsmd; I did not follow that, since the scenario is a power-off of the host, not a manual stop of the vdsmd service. So we have to be aligned in the reproduction with the reported scenario...

From looking at the code in SpmStopVDSCommand::executeVdsBrokerCommand(), it seems that if the host is unreachable, we cannot transfer the SPM role from it automatically.

Can't this be handled by sanlock? If the host is shut down for a while, its sanlock lease is not updated; maybe after a while sanlock on a different host is able to acquire the lease?
Ohh, sorry! You are right, Eli. It's indeed slightly different: somehow gracefully releasing the lease makes it work, while letting it expire doesn't. I wonder why.

So we have 2 scenarios to fix:
1) SPM failover when the host is powered off, regardless of power management
2) SPM failover when vdsm is shut down gracefully and power management is on

And I'm afraid 2) can only be hit on purpose.
Allon, for scenario 1 in comment 11: if vdsmd is stopped unexpectedly as the result of a power-off and the PM agent associated with the host is not reachable, then the engine cannot release the SPM role. Is there any expiration time after which the SPM election will start again? From my check, it seems the answer is no.

However, if the answer is indeed 'no', I think this should be closed as NOT A BUG, since in that case the administrator should anyway confirm that the host was rebooted, and can also transfer the SPM role to any other host manually.
I do not agree to close this as NOT A BUG, as the mentioned scenario is a valid disaster scenario in a production environment.

In that case, I have to assume the HA feature of RHV is simply NOT working: if an administrator needs to confirm the outage of an environment, then no automatic failover mechanism exists.
(In reply to Steffen Froemer from comment #15)
> I do not agree to close this as NOT A BUG, as the mentioned scenario is a
> valid disaster scenario in production environment.

Therefore there is a NEEDINFO on Allon M., who leads the storage team and can check whether we can handle this scenario...
(In reply to Steffen Froemer from comment #15)
> I do not agree to close this as NOT A BUG, as the mentioned scenario is a
> valid disaster scenario in production environment.

Allon, can the SPM lock expire after some time (similar to a sanlock lease), so another host can be elected as SPM even without the prior SPM being fenced? If not, then everything is working as expected.
(In reply to Martin Perina from comment #17)
> (In reply to Steffen Froemer from comment #15)
> > I do not agree to close this as NOT A BUG, as the mentioned scenario is a
> > valid disaster scenario in production environment.

Sorry, there was a conflict when submitting my comment and I lost some parts during resubmission; here's the original post:

A working power management setup is a prerequisite for the HA VMs feature. When you are not able to fence a host, you don't know whether a VM is running on it or not, so you risk a split-brain if you start it on a different host. The same applies to the SPM: if the host cannot be fenced, we don't know whether it still has access to the storage, so we cannot delegate the SPM role to a different host.

So the only remaining question is whether the SPM lock expires when the host is physically off. Allon, can the SPM lock expire after some time (similar to a sanlock lease), so another host can be elected as SPM even without the prior SPM being fenced? If not, then everything is working as expected.
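The fencing prerequisite described above can be sketched as a tiny decision function (a sketch with hypothetical names, not ovirt-engine code): the engine may only hand the SPM role to another host, or restart an HA VM elsewhere, once it knows the old host is really down.

```python
# Hypothetical sketch of the fencing prerequisite described above --
# not actual ovirt-engine code. Failover is safe only when the old
# host is known to be off, otherwise a split-brain is possible.

def may_failover(host_confirmed_down: bool, fence_succeeded: bool) -> bool:
    """True when it is safe to move the SPM role or restart an HA VM.

    host_confirmed_down: admin clicked "confirm host has been rebooted"
    fence_succeeded:     the power management agent reported a successful fence
    """
    return host_confirmed_down or fence_succeeded


# With a wrongly configured (unreachable) fence agent, neither condition
# holds, so the engine loops on fencing and never elects a new SPM:
print(may_failover(host_confirmed_down=False, fence_succeeded=False))  # False
print(may_failover(host_confirmed_down=True, fence_succeeded=False))   # True
```

This is why the "wrong PM configuration" reproducer in comment #0 wedges the Data Center: both exit conditions of the fencing loop are unreachable.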
Hi guys, Sorry for the late response. I was on PTO and am digging through my backlog. Engine currently does not issue an SPMStart command to any other host until it's sure the SPM is down (e.g., successfully fenced, admin has clicked "confirm host has been rebooted", etc). I guess we could add a timeout (for argument's sake, Sanlock's timeout, counting starting after power fencing was attempted and failed). Nir/Tal - am I missing anything here?
We cannot use timeouts or any other heuristics, because of the master mount. On block storage, the SPM mounts the master LV, so we cannot start the SPM on another host before unmounting the master LV. On file storage, we can start the SPM once vdsm has been killed on the old SPM.

So what we can do is:

- Master domain on block storage: nothing, the admin must manually reboot the SPM host.
- Master domain on file storage: we can check whether the SPM still holds the SPM lease, in the same way we check host liveness during fencing. If the SPM does not hold the lease, it is safe to start the SPM on another host.

If we want a solution for block storage, we can use a killpath program to kill vdsm when the SPM loses the lease. This program will:

- terminate vdsm
- check whether the master mount was unmounted, and unmount it if needed
- if the master mount cannot be unmounted, fail. This will cause sanlock to reboot the host.

Once we have a killpath program, we can use the sanlock request API to take the SPM lease from the current SPM from another host:

    sanlock client request -r RESOURCE -f force_mode
        Request the owner of a resource do something specified by
        force_mode. A versioned RESOURCE:lver string must be used with a
        greater version than is presently held. Zero lver and force_mode
        clears the request.

David, what do you think?
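The killpath program proposed above could look roughly like this. This is only a sketch: no such program exists in vdsm, and the mount path is illustrative. The `run` parameter is injectable so the logic can be exercised without touching a real system.

```python
# Hypothetical sketch of the killpath program proposed above -- not an
# existing vdsm component. sanlock would run this when the SPM loses
# its lease; a non-zero exit asks sanlock to escalate (reboot the host).
import subprocess

# Illustrative path -- the real master mount location depends on the setup.
MASTER_MOUNT = "/rhev/data-center/mnt/blockSD/master"


def release_spm(run=subprocess.run):
    """Kill vdsm and make sure the master LV is unmounted.

    Returns 0 on success; non-zero means the mount could not be
    released, so sanlock should reboot the host.
    """
    # 1. Terminate vdsm so it cannot keep using the master mount.
    run(["systemctl", "kill", "vdsmd"], check=False)

    # 2. If the master mount is still mounted, try to unmount it.
    mounted = run(["mountpoint", "-q", MASTER_MOUNT], check=False)
    if mounted.returncode == 0:  # rc 0 -> still mounted
        umount = run(["umount", MASTER_MOUNT], check=False)
        if umount.returncode != 0:
            return 1  # cannot release the mount -> let sanlock reboot us
    return 0
```

With such a program in place, another host could safely use the sanlock request mechanism quoted above to take over the SPM lease, because a failed release always ends in a reboot of the old SPM.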
(In reply to Nir Soffer from comment #20)
> If we want a solution for block storage, we can use a killpath program to
> kill vdsm when the SPM lost the lease.

Will this work when the host is shut down unexpectedly (as a result of an outage, for example)? From what I saw in this scenario, the vdsmd service is killed (so you need no killpath program to kill it, since the host is dead) and the lease is not released. This is the scenario described in this BZ.
(In reply to Eli Mesika from comment #21)
> (In reply to Nir Soffer from comment #20)
> > Will this work when the host is shutdown unexpectedly

If vdsm is killed, nobody can ensure that the master mount is unmounted; you have to wait until the host is up again, or the user can manually use "confirm host was rebooted". If the host is shut down there is no mount, but we don't have a way to tell that a host was shut down. We can check whether a host is maintaining a lease on storage, but I don't know if we have a way to detect that a host was rebooted.

David, can we use the delta lease to track host reboots?
Each time a host joins a lockspace by acquiring the host_id lease, the generation number for that host_id lease is incremented. So the host_id generation number may be what you're looking for.
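A sketch of how the generation number described above could be used (hypothetical function names, not existing engine or sanlock code): record the generation while the host holds the SPM lease, and treat a higher generation later as proof that the host left and rejoined the lockspace, i.e. the old SPM process is certainly gone.

```python
# Hypothetical sketch of reboot detection via the host_id lease
# generation number, as suggested above. The generation is incremented
# each time a host (re)joins the lockspace, so a generation higher than
# the one recorded while the host was SPM means it went down and came
# back -- the old SPM can no longer be using the master mount.

def host_was_rebooted(recorded_generation: int, current_generation: int) -> bool:
    """True when the host has rejoined the lockspace since we last saw it."""
    return current_generation > recorded_generation


# Example: generation 3 was recorded while host A held the SPM lease;
# after the outage, the lockspace reports generation 4 for that host_id.
print(host_was_rebooted(3, 4))  # True  -> safe to elect a new SPM
print(host_was_rebooted(3, 3))  # False -> host state still unknown
```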
This is not 4.1.9 material, please move to 4.3.
(In reply to Nir Soffer from comment #25)
> This is not 4.1.9 material, please move to 4.3.

I agree this probably isn't 4.1.9 material at this point, but it needs further discussion. We seem to be missing something: I get the reasoning in comment 20, but we've reached the odd situation where, offhand, a system without power fencing configured may in fact be more stable than a system with power fencing.

Pushing out to 4.2.2 for the meanwhile, until we have a clear action plan. At that point we can defer to 4.3, or even backport to 4.1.10 if there is such a release.
I think comment 0 is wrong. When you don't have power management, we cannot start the SPM on another host unless the user confirmed that the host was rebooted. If we did this, we would corrupt the master mount on block storage.

Next step: try to reproduce what comment 0 describes.
(In reply to Nir Soffer from comment #27)
> I think comment 0 is wrong. When you don't have power management, we cannot
> start the SPM on another host unless the user confirmed that the host was
> rebooted. If we did this, we would corrupt the master mount on block storage.
>
> Next step: try to reproduce what comment 0 describe.

Nir, you are right. I missed the fact that stopping vdsm gracefully releases the SPM role; we realized that further into the bug. See comment 11 for the actual problems here, no need to test it again. The SPM role is not started on a different host in case of power failure (with fencing or not).
I upgraded to 4.1.9 and re-did the tests.

To my surprise, HA now works without an SPM, regardless of the SPM power management settings (whereas in older versions the engine would loop on SPM PM attempts and do nothing else).

So not having an SPM doesn't kill HA anymore. Any idea what fixed it?

Since this is working, I assume it's OK to lower the severity of this bug.
(In reply to Germano Veit Michel from comment #39)
> So not having a SPM doesn't kill HA anymore. Any idea what fixed it?
>
> Since this is working, I assume it's ok to lower the severity of this bug.

We just had a customer hitting this on 4.2, and we have re-tested it: it looks like in 4.2 there is no HA (leases) functionality without an SPM. It hits ACTION_TYPE_FAILED_INVALID_VM_LEASE_STORAGE_DOMAIN_STATUS in RunVmValidator. Not sure if it worked on 4.1.9 due to some luck we had, but it seems we have problems again.

In summary:
1) The SPM role does not fail over without power management or manual intervention (SPOF)
2) If the HA leases functionality depends on the SPM being up (lease SD in Up status), then the problem is more severe.
(In reply to Germano Veit Michel from comment #40)
> In summary:
> 1) SPM does not fail-over without power management or manual intervention
> (SPOF)
> 2) If HA leases functionality depends on SPM up (Lease SD in Up status),
> then the problem is more severe.

Creating a lease depends on the SPM, but when you have a VM with a lease, it does not need the SPM to start.

I think the issue is that once the SPM is down, the engine marks all storage domains as down, and this prevents using a VM with a lease on any domain.

This should be fixed in the engine; not having an SPM should not move storage domains to the down state. We are monitoring them successfully from all hosts, and they should not depend on having an SPM.

I guess this will not be an easy fix; this has been the basic design of the system for ages. Tal, what do you think?
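The direction described above can be sketched as a change to the start-VM validation (hypothetical names; not the actual RunVmValidator code): instead of refusing to start an HA VM whenever the engine shows the lease's storage domain as down, rely on whether the target host can actually see the domain.

```python
# Hypothetical sketch of the validation change discussed above -- not
# the actual RunVmValidator code. Before the fix, starting a VM with a
# lease required the lease's storage domain to be Active on the engine
# side; once the SPM went down, all domains were marked down and the
# HA restart was refused with
# ACTION_TYPE_FAILED_INVALID_VM_LEASE_STORAGE_DOMAIN_STATUS.

def can_start_vm_with_lease(engine_domain_status: str,
                            host_monitors_domain: bool) -> bool:
    """Allow the start when the target host still monitors the lease
    domain, even if the engine shows the domain as down because the
    SPM is gone."""
    if engine_domain_status == "Active":
        return True
    # Engine-side status is stale without an SPM; trust host monitoring.
    return host_monitors_domain


# SPM down: the engine marks the lease domain Inactive, but host B
# still has valid access to it, so the HA restart may proceed:
print(can_start_vm_with_lease("Inactive", host_monitors_domain=True))  # True
```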
We found out this is a regression introduced in RHV 4.2. I want this fixed in the current z-stream and am adding a blocker flag, due to the impact.
Eyal, can you explain why this is a regression? Do you think the validation added of bug 1561006 is the root cause?
Yes, this validation prevents an HA VM with a lease from running if the lease storage domain is not active. In the scenario above, all the storage domains become 'non-active' and the engine fails the restart of the VM.
Steps to reproduce:

In an environment with 2 hosts, 'h1' (SPM) and 'h2' (HSM):
1. Set the HSM host ('h2') SPM priority to 'never'
2. Create a VM with a disk and a lease
3. Run the VM on the SPM
4. Block the connection from the engine to the SPM

Before this fix: the VM failed to start on the HSM ('h2') after the SPM ('h1') went down and all the storage domains were deactivated.

After this fix: the VM manages to run on the HSM ('h2') even if all the storage domains are down.
Forgot a step (4). Updated steps to reproduce are:

In an environment with 2 hosts, 'h1' (SPM) and 'h2' (HSM):
1. Set the HSM host ('h2') SPM priority to 'never'
2. Create a VM with a disk and a lease
3. Run the VM on the SPM
4. Block the connection from the SPM to the storage
5. Block the connection from the engine to the SPM ==> simulating a crashed SPM
Tested using:
ovirt-engine-setup-4.3.0-0.0.master.20181016132820.gite60d148.el7.noarch
vdsm-4.30.1-25.gitce9e416.el7.x86_64

Actual result (according to the steps in comment #53): after the connection was blocked, all the SDs and the DC went down, and the VM failed over to the 2nd host (with SPM=never) as expected.

VERIFIED
Yossi, Can you please change the bug status?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:1085
(In reply to Eyal Shenitzky from comment #51)
> Steps to reproduce:
>
> In an environment with 2 hosts 'h1' SPM and 'h2' HSM:
>
> 1. Set the HSM host ('h2') SPM priority to 'never'
> 2. Create a VM with a disk and a lease
> 3. Run the VM on the SPM
> 4. Block the connection from the SPM to the storage.
> 5. Block the connection from the engine to the SPM. ==> simulating crashed SPM
>
> Before this fix:
> VM failed to start on the HSM ('h2') after the SPM ('h1') went down and all
> the storage domain deactivated
>
> After this fix:
> VM managed to run on the HSM ('h2') even if all the storage domain are down

Eyal, I wasn't able to get the expected behavior on an HE env, rhv-release-4.4.1-12-001.noarch. I assume this should work on an HE env as well, right?

This is my setup:
host1: set as SPM, no VMs
host2: SPM priority set to 'never', no VMs
host3: SPM priority set to 'never', running the HE VM

Steps:
1. Create a template VM with a lease on iSCSI
2. Start it on host1
3. Block the connection from the SPM to the storage:
   [root@caracal04 ~]# iptables -A OUTPUT -d 3par-iscsi-1.scl.lab.tlv.redhat.com -j DROP
4. Block the connection from the engine to the SPM ==> simulating a crashed SPM:
   [root@hosted-engine-09 ~]# iptables -A OUTPUT -d 10.46.30.4 -j DROP
   (the IP is for caracal04)

Now all of the SDs went down. The VM went to status 'unknown' but is still shown on host1. The HA VM is NOT migrating to host2/host3 as expected.

Attaching vdsm and engine logs.
Created attachment 1710479 [details] Wasnt able to see the fix on HE env rhv-release-4.4.1-12-001
Also tried the same scenario from comment #63 on a regular (NOT HE) env. The result is AS EXPECTED: after a few minutes of the SDs being inactive, the HA VM migrates to host2.
This bug is already closed and verified. If you think that there is a bug please file a new bug with all the details.
(In reply to Eyal Shenitzky from comment #66) > This bug is already closed and verified. > > If you think that there is a bug please file a new bug with all the details. New BZ opened as you suggested: https://bugzilla.redhat.com/show_bug.cgi?id=1869162