Bug 829174

Summary: [Error handling] [scale] new SPM Selection fails in DC where SPM host with running VMs is Non Responsive
Product: Red Hat Enterprise Virtualization Manager
Reporter: Rami Vaknin <rvaknin>
Component: ovirt-engine
Assignee: Ayal Baron <abaron>
Status: CLOSED WONTFIX
QA Contact: Haim <hateya>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.1.0
CC: abaron, amureini, dyasny, hateya, iheim, lpeer, Rhev-m-bugs, yeylon, ykaul
Target Milestone: ---
Target Release: 3.1.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: storage
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-02-03 12:24:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description: engine logs
Flags: none

Description Rami Vaknin 2012-06-06 07:19:33 UTC
Created attachment 589750: engine logs

Version:
RHEVM SI4

Scenario:
1. A Data Center has 2 clusters: the first cluster has 20 hosts, the second cluster has 4 hosts. The SPM is a host from the first cluster.
2. All 20 hosts in the first cluster became Non Responsive due to a power issue in their lab rack.

Results:
New SPM selection failed. One hour after the power failure there is still no SPM, and the old SPM (which is in Non Responsive status) cannot even be moved to Maintenance because it has running VMs on it.

Expected Results:
A new SPM is selected automatically and successfully from one of the Up hosts.


From engine.log, filtered with "grep -i spm" because the log contains many entries related to network failures. puma05 is the old SPM, the puma hosts are down, and the tigris hosts are up:
2012-06-06 09:38:40,654 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-66) SPM selection - vds seems as spm puma05
2012-06-06 09:38:40,655 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-66) spm vds is non responsive, stopping spm selection.
2012-06-06 09:38:50,631 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-90) hostFromVds::selectedVds - tigris01, spmStatus Free, storage pool iscsi_dc
2012-06-06 09:38:50,654 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-90) SPM Init: could not find reported vds or not up - pool:iscsi_dc vds_spm_id: 20
2012-06-06 09:38:50,657 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-90) SPM selection - vds seems as spm puma05
2012-06-06 09:38:50,658 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-90) spm vds is non responsive, stopping spm selection.
2012-06-06 09:38:50,686 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-52) hostFromVds::selectedVds - tigris02, spmStatus Free, storage pool iscsi_dc
2012-06-06 09:38:50,710 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-52) SPM Init: could not find reported vds or not up - pool:iscsi_dc vds_spm_id: 20
2012-06-06 09:38:50,714 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-52) SPM selection - vds seems as spm puma05
2012-06-06 09:38:50,714 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-52) spm vds is non responsive, stopping spm selection.
2012-06-06 09:39:00,693 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-52) hostFromVds::selectedVds - tigris04, spmStatus Free, storage pool iscsi_dc
2012-06-06 09:39:00,713 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-52) SPM Init: could not find reported vds or not up - pool:iscsi_dc vds_spm_id: 20
2012-06-06 09:39:00,717 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-52) SPM selection - vds seems as spm puma05
2012-06-06 09:39:00,717 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-52) spm vds is non responsive, stopping spm selection.
2012-06-06 09:39:00,749 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-54) hostFromVds::selectedVds - tigris03, spmStatus Free, storage pool iscsi_dc
2012-06-06 09:39:00,771 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-54) SPM Init: could not find reported vds or not up - pool:iscsi_dc vds_spm_id: 20
2012-06-06 09:39:00,775 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-54) SPM selection - vds seems as spm puma05
2012-06-06 09:39:00,776 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-54) spm vds is non responsive, stopping spm selection.
2012-06-06 09:39:10,758 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-52) hostFromVds::selectedVds - tigris03, spmStatus Free, storage pool iscsi_dc
2012-06-06 09:39:10,782 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-52) SPM Init: could not find reported vds or not up - pool:iscsi_dc vds_spm_id: 20
2012-06-06 09:39:10,786 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-52) SPM selection - vds seems as spm puma05
2012-06-06 09:39:10,787 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-52) spm vds is non responsive, stopping spm selection.
2012-06-06 09:39:10,813 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-54) hostFromVds::selectedVds - tigris02, spmStatus Free, storage pool iscsi_dc
2012-06-06 09:39:10,832 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-54) SPM Init: could not find reported vds or not up - pool:iscsi_dc vds_spm_id: 20
2012-06-06 09:39:10,835 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-54) SPM selection - vds seems as spm puma05
2012-06-06 09:39:10,836 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-54) spm vds is non responsive, stopping spm selection.
2012-06-06 09:39:20,818 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-24) [250fc185] hostFromVds::selectedVds - tigris03, spmStatus Free, storage pool iscsi_dc
2012-06-06 09:39:20,841 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-24) [250fc185] SPM Init: could not find reported vds or not up - pool:iscsi_dc vds_spm_id: 20
2012-06-06 09:39:20,845 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-24) [250fc185] SPM selection - vds seems as spm puma05
2012-06-06 09:39:20,846 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-24) [250fc185] spm vds is non responsive, stopping spm selection.
2012-06-06 09:39:20,873 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-20) [1262c27b] hostFromVds::selectedVds - tigris01, spmStatus Free, storage pool iscsi_dc
2012-06-06 09:39:20,899 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-20) [1262c27b] SPM Init: could not find reported vds or not up - pool:iscsi_dc vds_spm_id: 20
2012-06-06 09:39:20,903 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-20) [1262c27b] SPM selection - vds seems as spm puma05
2012-06-06 09:39:20,904 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-20) [1262c27b] spm vds is non responsive, stopping spm selection.
2012-06-06 09:39:30,891 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-16) [23a528bd] hostFromVds::selectedVds - tigris02, spmStatus Free, storage pool iscsi_dc
2012-06-06 09:39:30,912 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-16) [23a528bd] SPM Init: could not find reported vds or not up - pool:iscsi_dc vds_spm_id: 20
2012-06-06 09:39:30,915 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-16) [23a528bd] SPM selection - vds seems as spm puma05
2012-06-06 09:39:30,916 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-16) [23a528bd] spm vds is non responsive, stopping spm selection.
2012-06-06 09:39:30,951 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-55) hostFromVds::selectedVds - tigris04, spmStatus Free, storage pool iscsi_dc
2012-06-06 09:39:30,971 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-55) SPM Init: could not find reported vds or not up - pool:iscsi_dc vds_spm_id: 20
2012-06-06 09:39:30,974 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-55) SPM selection - vds seems as spm puma05
2012-06-06 09:39:30,975 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-55) spm vds is non responsive, stopping spm selection.
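
The excerpt above repeats the same pattern roughly every 10 seconds: an Up host is picked as a candidate, and the attempt is then abandoned because puma05 is still considered the SPM. A minimal sketch of how such an excerpt could be summarized is shown below. The regular expressions are derived only from the lines in this report, and the function and variable names (summarize, LINE_RE, CANDIDATE_RE, STOPPED_MSG) are made up for illustration; they are not part of ovirt-engine.

```python
import re
from collections import Counter

# Line layout as seen in the excerpt above (an assumption, not a format spec):
# "<date> <time>,<ms> LEVEL [logger] (thread) [optional correlation id] message"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<logger>[^\]]+)\]\s+"
    r"\((?P<thread>[^)]+)\)\s+"
    r"(?:\[(?P<corr>[0-9a-f]+)\]\s+)?"
    r"(?P<msg>.*)$"
)
# Message emitted when a candidate host is picked for SPM selection.
CANDIDATE_RE = re.compile(
    r"hostFromVds::selectedVds - (?P<host>\S+), spmStatus (?P<status>\w+)"
)
# Message emitted when selection is abandoned.
STOPPED_MSG = "spm vds is non responsive, stopping spm selection"


def summarize(lines):
    """Return (candidate host -> times picked, number of abandoned attempts)."""
    candidates = Counter()
    aborted = 0
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not follow the expected layout
        msg = match.group("msg")
        candidate = CANDIDATE_RE.search(msg)
        if candidate:
            candidates[candidate.group("host")] += 1
        elif STOPPED_MSG in msg:
            aborted += 1
    return candidates, aborted


if __name__ == "__main__":
    import sys

    # Usage: grep -i spm engine.log | python summarize_spm.py
    picked, stopped = summarize(sys.stdin)
    for host, count in sorted(picked.items()):
        print(f"{host}: picked {count} time(s)")
    print(f"selection abandoned {stopped} time(s)")
```

Run against the filtered log, this gives a per-host count of candidates next to the number of abandoned attempts, which makes it easy to see that every candidate pick ends in an abandoned selection.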

Comment 1 Ayal Baron 2012-08-01 12:45:59 UTC
The only solution is to fence the host. We cannot do that automatically because, as far as we know, there are still VMs running on that node.
We cannot solve this in the current version, so we are pushing it to a future one.
A fix requires sanlock, storage monitoring, and a request to stop the SPM through storage.
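
The behaviour described in this comment and visible in the log can be modelled roughly as follows. This is only an illustrative sketch, not the engine's actual code: the Host and HostStatus types, the fenced and may_have_running_vms fields, and both function names are hypothetical, and the rule that a fenced host no longer blocks selection is an assumption drawn from the statement that fencing is the only solution.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class HostStatus(Enum):
    UP = "Up"
    NON_RESPONSIVE = "Non Responsive"


@dataclass
class Host:
    name: str
    status: HostStatus
    may_have_running_vms: bool = False  # hypothetical: engine's last known VM state
    fenced: bool = False                # hypothetical: host was confirmed powered off


def can_auto_fence(host: Host) -> bool:
    """Automatic fencing is refused while VMs may still be running on the host."""
    return not host.may_have_running_vms


def select_spm(current_spm: Optional[Host], candidates: Iterable[Host]) -> Optional[Host]:
    """Return the host that should act as SPM, or None if selection must stop."""
    if current_spm is not None:
        if current_spm.status is HostStatus.UP:
            return current_spm  # the existing SPM is healthy, keep it
        if not current_spm.fenced:
            # Mirrors the log line "spm vds is non responsive, stopping spm selection".
            return None
    for host in candidates:
        if host.status is HostStatus.UP:
            return host
    return None


if __name__ == "__main__":
    puma05 = Host("puma05", HostStatus.NON_RESPONSIVE, may_have_running_vms=True)
    tigris = [Host(f"tigris0{i}", HostStatus.UP) for i in range(1, 5)]

    print(can_auto_fence(puma05))      # automatic fencing is not allowed
    print(select_spm(puma05, tigris))  # selection stops while puma05 is unfenced

    puma05.fenced = True               # e.g. after an administrator fences the host
    print(select_spm(puma05, tigris).name)
```

In this model the deadlock from the report is the combination of the two checks: selection waits for the old SPM to be fenced, and automatic fencing is refused because the host may still have running VMs.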

Comment 2 Itamar Heim 2013-02-03 12:24:56 UTC
Closing old bugs. If this issue is still relevant or important in the current version, please re-open the bug.