Created attachment 1146014 [details]
engine and vdsm logs

Description of problem:
In an environment with more than one host and a few storage domains, when blocking the connection between the SPM host and the Master Storage Domain via iptables, the engine fails to connect to the host and the host becomes Non Responsive rather than Non Operational as expected.

Version-Release number of selected component (if applicable):
engine version - 3.6.5.1-0.1.el6
vdsm version - 4.17.25-0.el7ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. Block the connection between the SPM host and the Master Storage Domain via iptables

Actual results:
The host becomes Non Responsive

Expected results:
The host should become Non Operational and release the SPM role

Additional info:
Attached are the following exception and the vdsm + engine logs:

2016-04-11 17:36:48,240 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-46) [53ba98a5] Failure to refresh Vds runtime info: org.ovirt.vdsm.jsonrpc.client.ClientConnectionException: Connection timeout
2016-04-11 17:36:48,240 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-46) [53ba98a5] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: org.ovirt.vdsm.jsonrpc.client.ClientConnectionException: Connection timeout
    at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.createNetworkException(VdsBrokerCommand.java:157) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:120) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:65) [vdsbroker.jar:]
    at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:]
    at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.VdsManager.refreshCapabilities(VdsManager.java:652) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:119) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.HostMonitoring.refresh(HostMonitoring.java:84) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:227) [vdsbroker.jar:]
    at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source) [:1.7.0_99]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_99]
    at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_99]
    at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:]
    at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:]
    at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]
Caused by: org.ovirt.vdsm.jsonrpc.client.ClientConnectionException: Connection timeout
    at org.ovirt.vdsm.jsonrpc.client.reactors.stomp.SSLStompClient.waitForConnect(SSLStompClient.java:107) [vdsm-jsonrpc-java-client.jar:]
    at org.ovirt.vdsm.jsonrpc.client.reactors.stomp.SSLStompClient.sendMessage(SSLStompClient.java:77) [vdsm-jsonrpc-java-client.jar:]
    at org.ovirt.vdsm.jsonrpc.client.JsonRpcClient.call(JsonRpcClient.java:73) [vdsm-jsonrpc-java-client.jar:]
    at org.ovirt.engine.core.vdsbroker.jsonrpc.FutureMap.<init>(FutureMap.java:68) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.jsonrpc.JsonRpcVdsServer.getCapabilities(JsonRpcVdsServer.java:268) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand.executeVdsBrokerCommand(GetCapabilitiesVDSCommand.java:15) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:110) [vdsbroker.jar:]
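The blocking step from "Steps to Reproduce" can be sketched as below. This is a hypothetical example - MASTER_SD_IP is a placeholder (the real storage address is not taken from this report) - shown only to clarify that the rules target the storage server while leaving engine <-> vdsm traffic untouched:

```shell
# Hypothetical sketch, run as root on the SPM host: drop only the traffic
# to/from the master storage domain's server. MASTER_SD_IP is a placeholder
# from the documentation address range, not an address from this bug.
MASTER_SD_IP=192.0.2.10

# Engine <-> vdsm communication is unaffected: these rules match only the
# storage server's address.
iptables -A OUTPUT -d "$MASTER_SD_IP" -j DROP
iptables -A INPUT  -s "$MASTER_SD_IP" -j DROP

# To restore connectivity, delete the same rules:
# iptables -D OUTPUT -d "$MASTER_SD_IP" -j DROP
# iptables -D INPUT  -s "$MASTER_SD_IP" -j DROP
```
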
From the logs, it seems as though the connection between the HOST and ENGINE is timing out, which would explain the Non-Responsive status. I don't see, offhand, any evidence this has anything to do with blocking the master domain connection. Oved/Dan - any insight? Is this related to the recent issues we've seen with networking timing out when adding a host (I know the two operations aren't related - but perhaps there's a common root cause)?
Piotr - can you take a look?
It looks like we may have saturated the network, because the engine disconnected due to:

2016-04-11 17:11:47,193 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-37) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM host_mixed_1 command failed: Heartbeat exeeded

and on the vdsm side I can see:

d/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None)
mailbox.SPMMonitor::DEBUG::2016-04-11 17:11:42,082::storage_mailbox::735::Storage.Misc.excCmd::(_checkForMail) SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.0222868 s, 45.9 MB/s\n'; <rc> = 0
Thread-37::DEBUG::2016-04-11 17:11:42,136::fileSD::173::Storage.Misc.excCmd::(getReadDelay) /usr/bin/taskset --cpu-list 0-1 /usr/bin/dd if=/rhev/data-center/mnt/glusterSD/10.35.65.25:_storage__local__ge3__volume__0/b5c69862-2087-43df-a162-502ce5e331e4/dom_md/metadata iflag=direct of=/dev/null bs=4096 count=1 (cwd None)
Thread-37::DEBUG::2016-04-11 17:11:42,143::fileSD::173::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n357 bytes (357 B) copied, 0.00012556 s, 2.8 MB/s\n'; <rc> = 0
Thread-62::DEBUG::2016-04-11 17:11:42,947::blockSD::422::Storage.Misc.excCmd::(getReadDelay) /usr/bin/taskset --cpu-list 0-1 /usr/bin/dd if=/dev/cb7ccb5b-5684-465e-98b2-31b0c4c3924c/metadata iflag=direct of=/dev/null bs=4096 count=1 (cwd None)
Thread-62::DEBUG::2016-04-11 17:11:42,954::blockSD::422::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '1+0 records in\n1+0 records out\n4096 bytes (4.1 kB) copied, 0.000662718 s, 6.2 MB/s\n'; <rc> = 0
mailbox.SPMMonitor::DEBUG::2016-04-11 17:11:44,095::storage_mailbox::735::Storage.Misc.excCmd::(_checkForMail) /usr/bin/taskset --cpu-list 0-1 dd if=/rhev/data-center/48028c56-65ab-4bd4-87c8-2003444251b7/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None)
mailbox.SPMMonitor::DEBUG::2016-04-11 17:11:44,121::storage_mailbox::735::Storage.Misc.excCmd::(_checkForMail) SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.0182608 s, 56.1 MB/s\n'; <rc> = 0
Thread-64::DEBUG::2016-04-11 17:11:44,807::fileSD::173::Storage.Misc.excCmd::(getReadDelay) /usr/bin/taskset --cpu-list 0-1 /usr/bin/dd if=/rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Storage_export__domain__storage__local__ge3/cfded165-c52f-456e-8108-3dd3a91f27b1/dom_md/metadata iflag=direct of=/dev/null bs=4096 count=1 (cwd None)
Thread-64::DEBUG::2016-04-11 17:11:44,815::fileSD::173::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n374 bytes (374 B) copied, 0.000586225 s, 638 kB/s\n'; <rc> = 0

and since:

JsonRpc (StompReactor)::ERROR::2016-04-11 17:11:47,090::betterAsyncore::124::vds.dispatcher::(recv) SSL error during reading data: unexpected eof

we were unable to connect to vdsm, and there were no more connection attempts in the vdsm log. From the engine's perspective, we simply fail to connect.
Eyal - are you sure that you only blocked this and not also the communication with the host? Can you reproduce and give us access to examine this?
1. How come there are no more connection attempts?
2. Is there some weird connection between blocking the master domain and the error? It seems the error only appeared when there was supposed to be an error in performing the requested operation.
3. Eyal, are you sure that you didn't block the engine/host? Although it doesn't seem so from the logs.
4. Eyal, did this issue happen only once, or does it reproduce consistently?
(In reply to Liron Aravot from comment #5)
> 1. How come there are no more connection attempts?

From the engine perspective there are numerous attempts, but none of them reach vdsm. All of them end up with org.ovirt.vdsm.jsonrpc.client.ClientConnectionException: Connection failed.

> 2. Is there some weird connection between blocking the master domain and the
> error? as it seems the error only appeared when there was supposed to be an
> error in performing the requested operation.
> 3. Eyal, are you sure that you didn't block the engine/host? although it
> doesn't seem so from the logs
> 4. Eyal, did this issue happen only once, or does it reproduce consistently?
Restoring needinfo.
(In reply to Liron Aravot from comment #5)
> 1. How come there are no more connection attempts?
> 2. Is there some weird connection between blocking the master domain and the
> error? as it seems the error only appeared when there was supposed to be an
> error in performing the requested operation.
> 3. Eyal, are you sure that you didn't block the engine/host? although it
> doesn't seem so from the logs
> 4. Eyal, did this issue happen only once, or does it reproduce consistently?

Yes, I only blocked the connection between the SPM and the master storage domain. And yes, this issue reproduces consistently, as written above.
The same behavior also happens when blocking the connection between the SPM and a non-master domain.
Created attachment 1146732 [details]
block connection to non-master domain - engine and vdsm log
From the old logs I can see that the engine lost its connection to vdsm at some point and was never able to talk to it again.

How did you block the connection?
(In reply to Piotr Kliczewski from comment #11)
> From the old logs I can see that engine lost connection with vdsm at some
> point and it never was able to talk to it again.
>
> How did you block the connection?

Via iptables.
Please make sure that you do not remove the rule which is added during vdsm installation, and retest, checking whether the engine <-> vdsm connection still goes away. Please post the iptables config here, specifying which rules are for which host, without providing any IPs.
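One way to capture the iptables state without leaking addresses is to mask them before posting; this is a sketch - the `mask_ips` helper and its regex are illustrative, and it handles IPv4 addresses only:

```shell
# mask_ips: replace IPv4 addresses with x.x.x.x so rule dumps can be
# posted publicly. The sed pattern is a simple illustrative regex.
mask_ips() {
    sed -E 's/([0-9]{1,3}\.){3}[0-9]{1,3}/x.x.x.x/g'
}

# Usage on each host (iptables-save requires root):
# iptables-save | mask_ips
```
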
Nir,

I wonder whether storage code could make the vdsm process get stuck. I talked to Eyal offline, and it seems that we do not block engine <-> vdsm communication, but vdsm still does not send any data, and the missed heartbeat causes the engine to drop the connection.
Nir - can you answer?
(In reply to Piotr Kliczewski from comment #14)
> Nir,
>
> I wonder whether storage code could make vdsm process to stuck. I talked to
> Eyal offline and it seems that we do not block engine <-> vdsm communication
> but still vdsm do not sends any data and heartbeat causes engine to drop the
> connection.

No, I don't think any storage operation can cause vdsm to stop sending heartbeats.

I would add detailed logs to understand what the reactor thread is doing: when it wakes up, and how much time each of its operations takes (e.g. writing a message).

Also, can we reproduce it with standard ssl, or only with m2crypto?
It looks like the missing heartbeats and the blocked storage domain are related. Eyal, please reproduce the issue and let us access the environment.
Any progress reproducing the issue? Thanks
Yes, I reproduced the bug again and attached the relevant logs (vdsm and engine).
Created attachment 1154112 [details]
05/05/16 - engine and vdsm logs
Can we access, as Piotr requested?
(In reply to Oved Ourfali from comment #21)
> Can we access, as Piotr requested?

What do you mean by access?
(In reply to Eyal Shenitzky from comment #22)
> (In reply to Oved Ourfali from comment #21)
> > Can we access, as Piotr requested?
>
> What do you mean by access?

Access the environment.
Checking the logs, I see that vdsm was stopped:

MainThread::DEBUG::2016-05-05 10:14:47,395::vdsm::72::vds::(sigtermHandler) Received signal 15

and as a result the engine lost the connection:

2016-05-05 10:15:13,423 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler_Worker-52) [] Command 'GetAllVmStatsVDSCommand(HostName = host_mixed_3, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='da92e664-deeb-4111-9d42-98a56c7a594b', vds='Host[host_mixed_3,da92e664-deeb-4111-9d42-98a56c7a594b]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded

In the logs I can see that vdsm was stopped 3 times. Communication was restored when vdsm was up again. Why was vdsm restarted?
(In reply to Piotr Kliczewski from comment #24)
> Checking the logs I see that vdsm was stopped:
>
> MainThread::DEBUG::2016-05-05 10:14:47,395::vdsm::72::vds::(sigtermHandler)
> Received signal 15
>
> and as a result engine lost connection:
>
> 2016-05-05 10:15:13,423 ERROR
> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand]
> execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded
>
> In the logs I can see that vdsm was stopped 3 times. Communication was
> restored when vdsm was Up again. Why we restarted vdsm

vdsm restarted 3 times before the engine set the host to Non Responsive.
Please see the times when it occurred:

Vdsm restart: 2016-05-05 10:14:47,395
Engine lost connectivity: 2016-05-05 10:15:13,423

The restarts were the direct cause of the issue, at least according to the logs.
Please retest without restarting vdsm
(In reply to Piotr Kliczewski from comment #27)
> Please retest without restarting vdsm

I'm not restarting vdsm; I only block the connection between the SPM host and the Master Storage Domain. The restart of VDSM is caused by the engine, not done manually by me.
Before deep-diving into the logs: are you sure that the connection between the engine and VDSM on the SPM host was not affected by blocking the connection between VDSM and the master storage domain? From the description, that seems to me like the most probable cause for the SPM host becoming NonResponsive ...
(In reply to Martin Perina from comment #29)
> Before deep diving into the logs, are you sure that connection between
> engine and VDSM on SPM host was not affected by blocking connection between
> VDSM and master storage domain? Because from the description it seems to me
> like most probable cause for the SPM host to became NonResponsive ...

The opposite way - that is the bug. What I said in my response is that I didn't block or interfere with the connection between the host and the engine; I blocked only the connection between the SPM and the master storage domain.
Created attachment 1154614 [details]
Reproducing logs
This is strange, because I see:

Vdsm restart:

MainThread::DEBUG::2016-05-06 12:31:14,597::vdsm::72::vds::(sigtermHandler) Received signal 15

Engine lost connection:

2016-05-06 12:31:39,857 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-79) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM command failed: Heartbeat exeeded

The connection was recovered at:

2016-05-06 12:32:30,903 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to /10.34.63.152

and later I can see:

2016-05-06 12:34:57,774 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-10) [7fda609e] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Status of host dev-21 was set to Up.

The question is why vdsm was restarted.
So I was able to reproduce it (logs attached) on the latest 3.6.6 (including the changes using vdsm-jsonrpc-java 1.1.12), using 2 hosts (dev-18, dev-21) in a cluster with shared NFS storage on a 3rd server (str-02). Flow:

1. Make dev-21 the SPM and block access to str-02 only (the engine can connect to the dev-21 host without any issues).

2. Shortly afterwards, the following message appeared in the log (IMO there's no issue so far):

2016-05-06 12:30:38,548 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-34) [] domain '4b204a94-88ae-4bae-80ba-32e5213023ff:nfs01' in problem. vds: 'dev-21'

3. Unfortunately, circa 1 minute later a heartbeat exceeded error was raised during GetStatsVDSCommand:

2016-05-06 12:31:39,891 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-45) [] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded

4. Around the same time, another heartbeat exceeded error was raised during GetStoragePoolInfoVDSCommand.

5. After these 2 errors we changed the dev-21 host status to Connecting.

6. Without any intervention (access to the storage domains is still blocked), host dev-21 became Up again 3 minutes later:

2016-05-06 12:34:57,774 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-10) [7fda609e] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Status of host dev-21 was set to Up.

So it seems that blocking access to the master storage domain is somehow connected to VDSM no longer responding to communication from the engine. Piotr, could you please take a look?
We need to understand the cause of vdsm service restart at 2016-05-06 12:31:14,597.
Created attachment 1154647 [details]
journalctl from dev-21

So from the journalctl content it seems that VDSM was killed by systemd:

May 06 12:31:54 dev-21.rhev.lab.eng.brq.redhat.com systemd[1]: vdsmd.service: main process exited, code=killed, status=9/KILL
The signal was received at 12:31:14, and systemd could just be logging the process being down.

Here is an interesting entry that I found in journalctl:

May 06 12:31:14 dev-21.rhev.lab.eng.brq.redhat.com sanlock[19820]: 2016-05-06 12:31:14+0200 5630 [19820]: s2 check_our_lease failed 80
May 06 12:31:14 dev-21.rhev.lab.eng.brq.redhat.com sanlock[19820]: 2016-05-06 12:31:14+0200 5630 [19820]: s2 kill 20012 sig 15 count 1
May 06 12:31:14 dev-21.rhev.lab.eng.brq.redhat.com sanlock[19820]: 2016-05-06 12:31:14+0200 5631 [19820]: s2 kill 20012 sig 15 count 2

Nir, is there any functionality in sanlock which could send the signal?
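To confirm from the journal which process sanlock signalled, entries like the above can be filtered as follows; this is a sketch - the sample lines are copied from this comment, and on a live host you would pipe `journalctl` output (e.g. `journalctl -u sanlock`) into the same filter:

```shell
# Sample sanlock journal lines from this bug (hostnames shortened).
journal_sample='May 06 12:31:14 dev-21 sanlock[19820]: 2016-05-06 12:31:14+0200 5630 [19820]: s2 check_our_lease failed 80
May 06 12:31:14 dev-21 sanlock[19820]: 2016-05-06 12:31:14+0200 5630 [19820]: s2 kill 20012 sig 15 count 1
May 06 12:31:14 dev-21 sanlock[19820]: 2016-05-06 12:31:14+0200 5631 [19820]: s2 kill 20012 sig 15 count 2'

# Extract the PID that sanlock sent SIGTERM to; on the reproduced host this
# is 20012, which matches the vdsm process that received signal 15.
printf '%s\n' "$journal_sample" | awk '/ kill [0-9]+ sig 15 /{print $(NF-4)}' | sort -u
```
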
(In reply to Piotr Kliczewski from comment #37)
> Here is interesting entry that i found in journalctl:
>
> May 06 12:31:14 dev-21.rhev.lab.eng.brq.redhat.com sanlock[19820]:
> 2016-05-06 12:31:14+0200 5630 [19820]: s2 check_our_lease failed 80
> May 06 12:31:14 dev-21.rhev.lab.eng.brq.redhat.com sanlock[19820]:
> 2016-05-06 12:31:14+0200 5630 [19820]: s2 kill 20012 sig 15 count 1
> May 06 12:31:14 dev-21.rhev.lab.eng.brq.redhat.com sanlock[19820]:
> 2016-05-06 12:31:14+0200 5631 [19820]: s2 kill 20012 sig 15 count 2
>
> Nir, is there any functionality in sanlock which could send the signal?

Sure - if you block the master storage domain on the SPM host, sanlock cannot renew the host lease. If the lease cannot be renewed for 80 seconds, sanlock kills vdsm, since vdsm is the owner of the SPM resource.

Becoming unresponsive is expected when blocking the master domain on the SPM host. When vdsm is restarted, the host should become Up again, but it will not be able to become the SPM again, since the master domain is blocked.

I think the correct test would be to block access to some storage domain on the other (non-SPM) host, which should make that host Non Operational.
I just confirmed the behaviour Nir described in comment 38:

1. If you block access to a storage domain on a non-SPM host, then the host is moved to NonOperational after some time (without any communication issues raised).

2. If you block access to a storage domain for the SPM host, then VDSM on the host is killed by sanlock (which causes communication issues on the engine side, and the host may change status to Connecting/NonResponsive), but afterwards the host is marked as NonOperational and a different host is selected as SPM.

I'm going to close this as NOTABUG.
Martin, the problem here, as I understand it, is that the host stays unresponsive even though vdsm is restarted (and therefore engine-host communication should be restored); the host should thus become Non Operational rather than remain Non Responsive.

I might have a lead on this issue: on my setup (engine - master), I restarted vdsm during the synchronous part of an operation's execution. In that scenario the communication was not restored either, until I restarted the engine - if sanlock fences vdsm during the synchronous part of an operation in this bug, it might be the same.
Ignore the second part of my comment - a different problem led to that behavior in that case.
Liron,

As far as I understand, Martin's reproduction of the issue proved exactly what you stated in the first part of your comment.
> Martin,
> the problem here iiuc is that the host stays unresponsive although vdsm is
> restarted (and therefore the engine-host communication should be restored),
> therefore the host should become non-operational and not remain non responsive.

Exactly as Piotr mentioned above: when VDSM was killed by sanlock, we lost the connection to the host, so the host status was changed to Connecting. When VDSM was restarted, the host status was changed to Up, InitVdsOnUp was called, and during it we found an issue connecting to a storage domain, so we changed the status to NonOperational.

If the VDSM restart takes more time, the engine can change the host status from Connecting to NotResponding and even start fencing on the host, but that's how it is when we lose the connection to VDSM.
Thanks Piotr/Martin - this bug is about the host remaining permanently in Non Responsive status. If you and Eyal failed to reproduce that, and the log doesn't provide any more info, I guess it can be closed as WORKSFORME.
(In reply to Martin Perina from comment #39)
> I just confirmed behaviour Nir described in Comment 38:
>
> 1. If you block access to storage domain on non-SPM host, then the host is
> moved to NonOperational after some time (without any communication issues
> raised).
>
> 2. If you block access to storage domain for SPM host, then VDSM on the host
> is killed by sanlock (which causes communication issues on engine side and
> host may change status to Connecting/NonResponsive), but afterwards host is
> marked as NonOperational and different host is selected as SPM
>
> I'm going to close it as NOTABUG.

(In reply to Liron Aravot from comment #44)
> Thanks Piotr/Martin-
> this bug is about the host remaining permanently in Non Responsive status.
> If you and Eyal failed to reproduce and the log doesn't provide any more
> info - i guess it can be closed as WORKSFORME.

The bug reproduces each time. The question here is whether the SPM host should become Non Responsive when the connection to the Master Storage Domain is blocked, or move to Non Operational - which I think is the expected result.
The host should end up in Non Operational status. If it is Non Responsive in some phase, that's fine - but if it *remains* Non Responsive, there is an issue.
According to all of the above, this is definitely a bug. It reproduces each time. I'm not sure why this issue was closed as NOTABUG.
I'm unable to reproduce your error. In my environment with 2 hosts and NFS storage, if I block access from the SPM (host1) to the NFS storage, then after a while host1 always ends up in the NonOperational state and host2 is selected as SPM.

Eyal, could you please provide access to your environment where you can reproduce it? If so, Eli, could you please investigate? Thanks
Hi,

Talking with Eyal: the host is moved to Non Responding for 20 minutes, then it moves to Non Operational, and then the SPM election occurs and moves the SPM role to the other host.

So what we have to investigate here is the long (20 minute) stay in the Non Responding state.
Created attachment 1163576 [details]
reattached engine and vdsm logs
As far as I can see from the log, the reason this took 20 minutes is that the host keeps switching back and forth between the Non Responding and Connecting states while the engine tries to stop the host from being SPM, but when the state is 'Connecting' this operation is ignored. There are no fencing scenarios here, since the host's power management option is disabled according to the logs.
Following comment 50 and comment 52, this seems to be NOTABUG. Please let us know if you want us to reproduce the scenario with power management configured correctly on the host. Otherwise, this bug will be closed as NOTABUG.
*** Bug 1343011 has been marked as a duplicate of this bug. ***
Fine by me, close as NOTABUG
If power management is correctly configured on the hosts, then a host which is the SPM and whose connection to the master storage domain is blocked is properly fenced, and after that a new SPM is selected as expected.