Bug 869309

Summary: PRD32 - engine: auto recovery cannot recover hosts/storage when last host in setup still has no access to storage and not in maintenance
Product: Red Hat Enterprise Virtualization Manager Reporter: Dafna Ron <dron>
Component: ovirt-engine Assignee: Barak <bazulay>
Status: CLOSED CURRENTRELEASE QA Contact: Leonid Natapov <lnatapov>
Severity: high Docs Contact:
Priority: high    
Version: 3.1.0 CC: bazulay, dyasny, emesika, hateya, iheim, lpeer, mkublin, Rhev-m-bugs, sgordon, sgrinber, yeylon, ykaul, yzaslavs
Target Milestone: --- Keywords: Tracking
Target Release: 3.2.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: infra
Fixed In Version: sf3 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Bug Depends On: 876235, 882832, 882837    
Bug Blocks: 915537    
Attachments:
Description Flags
logs none

Description Dafna Ron 2012-10-23 10:32:02 EDT
Created attachment 632143
logs

Description of problem:

I have two hosts in the setup.
After blocking storage from both hosts, the SPM becomes Non-Operational, and the HSM fails to acquire the lease and remains in Up state but not as SPM.
If I remove the iptables block from the Non-Operational host only, auto recovery fails to activate the storage and hosts.

Although ConnectStorageServerVDSCommand succeeds on the Non-Operational host, ConnectStoragePoolVDSCommand fails with "Cannot find master domain".
In addition, we get "Wrong Master domain or its version", because the failed connect to pool by Auto Recovery increments the master version.
 
Version-Release number of selected component (if applicable):

si21.1 

How reproducible:

100%

Steps to Reproduce:
1. In a two-host cluster, block connectivity to the storage from both hosts.
2. After the SPM becomes Non-Operational and the second host releases SPM, restore connectivity to the storage from the Non-Operational host only.
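For reference, the block and unblock used in the steps above can be sketched as the iptables commands below. This is only a sketch: the report does not give the exact rules that were used, so the OUTPUT chain, the DROP target, and the helper names `block_cmd` / `unblock_cmd` are assumptions. The storage address 10.35.64.10 is taken from the connection list in the engine log. The helpers only build the command lines; they do not run them.

```python
import shlex

# Storage server address as it appears in the connectionList of the engine log.
STORAGE_ADDR = "10.35.64.10"


def block_cmd(addr: str = STORAGE_ADDR) -> list[str]:
    """Build a command that appends a rule dropping outgoing traffic to addr.

    The chain (OUTPUT) and target (DROP) are assumptions, not taken from the report.
    """
    return ["iptables", "-A", "OUTPUT", "-d", addr, "-j", "DROP"]


def unblock_cmd(addr: str = STORAGE_ADDR) -> list[str]:
    """Build the command that deletes the same rule, restoring connectivity."""
    return ["iptables", "-D", "OUTPUT", "-d", addr, "-j", "DROP"]


if __name__ == "__main__":
    # Step 1: to be run as root on both hosts.
    print(shlex.join(block_cmd()))
    # Step 2: to be run as root on the Non-Operational host only.
    print(shlex.join(unblock_cmd()))
```

The commands are returned as argument lists so they can be passed to `subprocess.run` on the host or printed for copy and paste.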
  
Actual results:

1. Auto Recovery fails to recover the storage/hosts (the Non-Operational host becomes Unassigned and then goes back to Non-Operational).
2. Since the master version has been incremented, we have to put the Up host into maintenance so that recovery can happen.


Expected results:

The engine requires leaving one host in Up state to allow recovery, while Auto Recovery cannot recover other hosts as long as there is a host in Up state in the setup. I would therefore suggest that only one of these flows be active (so if Auto Recovery is activated, engine recovery is disabled).

Additional info: full logs


2012-10-23 15:15:01,019 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-77) [66613508] Checking autorecoverable hosts done
2012-10-23 15:15:01,019 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-77) [66613508] Checking autorecoverable storage domains
2012-10-23 15:15:01,021 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-77) [66613508] Autorecovering 0 storage domains
2012-10-23 15:15:01,021 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-77) [66613508] Checking autorecoverable storage domains done
2012-10-23 15:15:01,227 INFO  [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (QuartzScheduler_Worker-80) [64f7ab0c] Running command: InitVdsOnUpCommand internal: true.
2012-10-23 15:15:01,316 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ValidateStorageServerConnectionVDSCommand] (QuartzScheduler_Worker-80) [79f9af98] START, ValidateStorageServerConnectionVDSCommand(HostName = gold-vdsd, HostId = 0e8479de-1c56-11e2-b621-001a4a169741, storagePoolId = 1167fe48-4788-486d-876b-f8261ede6c23, storageType = ISCSI, connectionList = [{ id: b5a56dcc-ef37-48eb-b83a-92db3b366aaa, connection: 10.35.64.10, iqn: Dafna-Upgrade-03, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };{ id: 600b6044-c53b-4309-8f85-fbd1558dbcc0, connection: 10.35.64.10, iqn: Dafna-Upgrade-04, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };{ id: 17aa00f8-63cb-4926-8763-bac1a4e251bf, connection: 10.35.64.10, iqn: Dafna-upgrade-02, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };{ id: 2030dacd-2069-4488-a7ca-abd07dbbb558, connection: 10.35.64.10, iqn: Dafna-upgrade-01, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };]), log id: 2798240a
2012-10-23 15:15:01,327 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ValidateStorageServerConnectionVDSCommand] (QuartzScheduler_Worker-80) [79f9af98] FINISH, ValidateStorageServerConnectionVDSCommand, return: {b5a56dcc-ef37-48eb-b83a-92db3b366aaa=0, 600b6044-c53b-4309-8f85-fbd1558dbcc0=0, 17aa00f8-63cb-4926-8763-bac1a4e251bf=0, 2030dacd-2069-4488-a7ca-abd07dbbb558=0}, log id: 2798240a
2012-10-23 15:15:01,328 INFO  [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersCommand] (QuartzScheduler_Worker-80) [79f9af98] Running command: ConnectHostToStoragePoolServersCommand internal: true. Entities affected :  ID: 1167fe48-4788-486d-876b-f8261ede6c23 Type: StoragePool
2012-10-23 15:15:01,329 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (QuartzScheduler_Worker-80) [79f9af98] START, ConnectStorageServerVDSCommand(HostName = gold-vdsd, HostId = 0e8479de-1c56-11e2-b621-001a4a169741, storagePoolId = 1167fe48-4788-486d-876b-f8261ede6c23, storageType = ISCSI, connectionList = [{ id: b5a56dcc-ef37-48eb-b83a-92db3b366aaa, connection: 10.35.64.10, iqn: Dafna-Upgrade-03, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };{ id: 600b6044-c53b-4309-8f85-fbd1558dbcc0, connection: 10.35.64.10, iqn: Dafna-Upgrade-04, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };{ id: 17aa00f8-63cb-4926-8763-bac1a4e251bf, connection: 10.35.64.10, iqn: Dafna-upgrade-02, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };{ id: 2030dacd-2069-4488-a7ca-abd07dbbb558, connection: 10.35.64.10, iqn: Dafna-upgrade-01, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };]), log id: 3e6ace41
2012-10-23 15:15:01,648 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-76) [dc04fe] domain 7633b7eb-62d0-498d-a762-c1da4f3b505f:Dafna-Upgrade-03 in problem. vds: gold-vdsc
2012-10-23 15:15:01,649 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-76) [dc04fe] domain 7bdb9b94-729f-409b-94d8-bad3fe0d4d6f:Dafna-Upgrade-04 in problem. vds: gold-vdsc
2012-10-23 15:15:01,652 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-76) [dc04fe] domain f844782b-dc73-4c35-b776-92ef809ab6f5:Dafna-Upgrade-02 in problem. vds: gold-vdsc
2012-10-23 15:15:01,653 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-76) [dc04fe] domain 6faf7684-e22a-4332-8ad9-0ad89dbd6172:Dafna-Upgrade-01 in problem. vds: gold-vdsc
2012-10-23 15:15:02,019 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (QuartzScheduler_Worker-80) [79f9af98] FINISH, ConnectStorageServerVDSCommand, return: {b5a56dcc-ef37-48eb-b83a-92db3b366aaa=0, 600b6044-c53b-4309-8f85-fbd1558dbcc0=0, 17aa00f8-63cb-4926-8763-bac1a4e251bf=0, 2030dacd-2069-4488-a7ca-abd07dbbb558=0}, log id: 3e6ace41
2012-10-23 15:15:02,019 INFO  [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersCommand] (QuartzScheduler_Worker-80) [79f9af98] Host gold-vdsd storage connection was succeeded 
2012-10-23 15:15:02,121 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (QuartzScheduler_Worker-80) [79f9af98] START, ConnectStoragePoolVDSCommand(HostName = gold-vdsd, HostId = 0e8479de-1c56-11e2-b621-001a4a169741, storagePoolId = 1167fe48-4788-486d-876b-f8261ede6c23, vds_spm_id = 2, masterDomainId = 7633b7eb-62d0-498d-a762-c1da4f3b505f, masterVersion = 45), log id: 6879a6e
2012-10-23 15:15:02,303 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-85) hostFromVds::selectedVds - gold-vdsc, spmStatus Unknown_Pool, storage pool iSCSI
2012-10-23 15:15:02,324 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (QuartzScheduler_Worker-85) START, ConnectStoragePoolVDSCommand(HostName = gold-vdsc, HostId = 0419c81e-1c56-11e2-9707-001a4a169741, storagePoolId = 1167fe48-4788-486d-876b-f8261ede6c23, vds_spm_id = 1, masterDomainId = 7633b7eb-62d0-498d-a762-c1da4f3b505f, masterVersion = 45), log id: 20beb48d
2012-10-23 15:15:02,634 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (QuartzScheduler_Worker-85) Command org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand return value 
 Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusOnlyReturnForXmlRpc
mStatus                       Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusForXmlRpc
mCode                         304
mMessage                      Cannot find master domain: 'spUUID=1167fe48-4788-486d-876b-f8261ede6c23, msdUUID=7633b7eb-62d0-498d-a762-c1da4f3b505f'
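
To pull the failing call out of a log like the one above, a small script along these lines may help. It is a sketch and not part of the engine: the regular expressions and the names `LINE_RE`, `FIELD_RE` and `find_failures` are hypothetical and assume the layout seen in this excerpt (timestamp, level, logger in brackets, thread in parentheses), with the `mCode` / `mMessage` lines following the command's return value.

```python
import re

# Header of a log entry: timestamp, level, [logger], (thread), message.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+)\s+\[(?P<logger>[^\]]+)\] \((?P<thread>[^)]+)\) (?P<msg>.*)$"
)
# Status fields printed on the continuation lines of a command's return value.
FIELD_RE = re.compile(r"^(?P<key>mCode|mMessage)\s+(?P<value>.*)$")


def find_failures(lines):
    """Return (thread, code, message) for each return value with a non-zero mCode."""
    failures = []
    thread = None
    code = None
    for line in lines:
        header = LINE_RE.match(line)
        if header:
            # A new log entry starts: remember its thread, reset the status code.
            thread, code = header["thread"], None
            continue
        field = FIELD_RE.match(line)
        if not field:
            continue
        if field["key"] == "mCode":
            code = int(field["value"])
        elif code:  # mMessage that follows a non-zero mCode
            failures.append((thread, code, field["value"]))
    return failures


SAMPLE = """\
2012-10-23 15:15:02,634 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (QuartzScheduler_Worker-85) Command org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand return value
 Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusOnlyReturnForXmlRpc
mStatus                       Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusForXmlRpc
mCode                         304
mMessage                      Cannot find master domain: 'spUUID=1167fe48-4788-486d-876b-f8261ede6c23, msdUUID=7633b7eb-62d0-498d-a762-c1da4f3b505f'
""".splitlines()

if __name__ == "__main__":
    for thread, code, message in find_failures(SAMPLE):
        print(thread, code, message)
```

On the excerpt above this reports a single failure, on QuartzScheduler_Worker-85 (the ConnectStoragePoolVDSCommand sent to gold-vdsc), with mCode 304.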
Comment 1 mkublin 2012-10-25 04:56:31 EDT
This bug is not related to auto-recovery; similar behaviour will occur if someone tries to Activate such a host.
Comment 3 Barak 2012-12-06 09:00:12 EST
We have opened three bugs that together should handle the above scenario.
Comment 4 mkublin 2012-12-16 05:44:39 EST
http://gerrit.ovirt.org/#/c/10103/
Comment 5 Barak 2013-01-02 03:45:21 EST
(In reply to comment #4)
> http://gerrit.ovirt.org/#/c/10103/

The above patch belongs to bug 882837 but should handle this scenario as well.

Hence moving to MODIFIED and later to ON_QA as the above scenario should be verified as well.
Comment 6 Stephen Gordon 2013-01-04 16:16:25 EST
Setting docs_scoped- as this looks like a series of bug fixes to provide the behaviour users already expect rather than a new feature (again from user POV).
Comment 7 Leonid Natapov 2013-03-17 12:23:46 EDT
Verified on sf10 as part of 882837 and 874019.
Comment 8 Itamar Heim 2013-06-11 05:52:05 EDT
3.2 has been released