Created attachment 1355099: engine log

Description of problem:
Under ovirt-system-tests, storage domains are added while the 2nd host (host-1) is added to the system. For some reason, it causes a race - where not all secondary domains are added to it (no ConnectStorageServer is even sent) and it ends up as Non-Operational.

Timeline:
Host-0 (which is the 1st host connected and is connected first to the iSCSI master SD and then to the secondary, NFS based, storage domains):

---- host-0
2017-11-17 10:21:52,098-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (default task-17) [7b4e6607-5a01-45a3-8b6a-87cc4b5795ce] START, ConnectStorageServerVDSCommand(HostName = lago-basic-suite-master-host-0, StorageServerConnectionManagementVDSParameters:{hostId='e70aabfb-f76b-4a4e-97ca-b68f6d234360', storagePoolId='00000000-0000-0000-0000-000000000000', storageType='NFS', connectionList='[StorageServerConnections:{id='null', connection='192.168.202.4:/exports/nfs/share1', iqn='null', vfsType='null', mountOptions='null', nfsVersion='V4_2', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]'}), log id: 2d51b079

2017-11-17 10:23:13,727-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (EE-ManagedThreadFactory-engine-Thread-56) [5bf9a05e] START, ConnectStorageServerVDSCommand(HostName = lago-basic-suite-master-host-0, StorageServerConnectionManagementVDSParameters:{hostId='e70aabfb-f76b-4a4e-97ca-b68f6d234360', storagePoolId='00000000-0000-0000-0000-000000000000', storageType='NFS', connectionList='[StorageServerConnections:{id='9c1af421-331a-4483-9e1a-bb8864c3afac', connection='192.168.202.4:/exports/nfs/share2', iqn='null', vfsType='null', mountOptions='null', nfsVersion='V4_1', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]'}), log id: 6bfb50bc

And host-1 (which ends up as problematic):

2017-11-17 10:22:39,829-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-97) [54c41c3f] START, ConnectStorageServerVDSCommand(HostName = lago-basic-suite-master-host-1, StorageServerConnectionManagementVDSParameters:{hostId='aa142561-521f-4d1f-ab5c-0fb440e5a0e1', storagePoolId='cded8424-99cb-4fb2-a9d4-999e3f46f558', storageType='ISCSI', connectionList='[StorageServerConnections:{id='adf99be9-5e6e-4bbd-a0ef-7aed8db79586', connection='192.168.201.4', iqn='iqn.2014-07.org.ovirt:storage', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='5fef0968-a84b-40a6-bfa2-784ff633c992', connection='192.168.202.4', iqn='iqn.2014-07.org.ovirt:storage', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]'}), log id: 77a2062a
...
2017-11-17 10:22:41,866-05 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (EE-ManagedThreadFactory-engineScheduled-Thread-97) [54c41c3f] method: runVdsCommand, params: [ConnectStorageServer, StorageServerConnectionManagementVDSParameters:{hostId='aa142561-521f-4d1f-ab5c-0fb440e5a0e1', storagePoolId='cded8424-99cb-4fb2-a9d4-999e3f46f558', storageType='ISCSI', connectionList='[StorageServerConnections:{id='adf99be9-5e6e-4bbd-a0ef-7aed8db79586', connection='192.168.201.4', iqn='iqn.2014-07.org.ovirt:storage', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='5fef0968-a84b-40a6-bfa2-784ff633c992', connection='192.168.202.4', iqn='iqn.2014-07.org.ovirt:storage', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]'}], timeElapsed: 2042ms

2017-11-17 10:22:41,877-05 INFO [org.ovirt.engine.core.bll.storage.pool.ConnectHostToStoragePoolServersCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-97) [54c41c3f] Host 'lago-basic-suite-master-host-1' storage connection was succeeded
...
2017-11-17 10:22:41,893-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (EE-ManagedThreadFactory-engine-Thread-38) [54c41c3f] START, ConnectStoragePoolVDSCommand(HostName = lago-basic-suite-master-host-1, ConnectStoragePoolVDSCommandParameters:{hostId='aa142561-521f-4d1f-ab5c-0fb440e5a0e1', vdsId='aa142561-521f-4d1f-ab5c-0fb440e5a0e1', storagePoolId='cded8424-99cb-4fb2-a9d4-999e3f46f558', masterVersion='1'}), log id: 2ae9a4dc

2017-11-17 10:22:43,390-05 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-38) [54c41c3f] Domain 'e3cc9f2e-0833-4a79-b452-b1487308dbfd:templates' was reported with error code '358'

2017-11-17 10:22:43,391-05 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-38) [54c41c3f] Domain '61efd44a-948f-4c17-a1e6-7cef6e422c6a:nfs' was reported with error code '358'

2017-11-17 10:22:43,391-05 ERROR [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (EE-ManagedThreadFactory-engine-Thread-38) [54c41c3f] Storage Domain 'templates' of pool 'test-dc' is in problem in host 'lago-basic-suite-master-host-1'

2017-11-17 10:22:43,392-05 ERROR [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (EE-ManagedThreadFactory-engine-Thread-38) [54c41c3f] Storage Domain 'nfs' of pool 'test-dc' is in problem in host 'lago-basic-suite-master-host-1'

2017-11-17 10:22:43,396-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-38) [54c41c3f] EVENT_ID: VDS_STORAGE_VDS_STATS_FAILED(189), Host lago-basic-suite-master-host-1 reports about one of the Active Storage Domains as Problematic.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.0-0.0.master.20171116212005.git61ffb5f.el7.centos.noarch

How reproducible:
Sometimes.

Additional info:
All logs: http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2252/artifact/exported-artifacts/basic-suite-master__logs/test_logs/basic-suite-master/post-002_bootstrap.py/
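For anyone checking the attached engine log for the same symptom, here is a rough sketch of how the missing ConnectStorageServer calls could be spotted. It is not part of ovirt-system-tests; the function names (storage_types_by_host, hosts_missing) are made up for illustration, and the regular expression assumes the START line format shown in the excerpts above.

```python
import re
from collections import defaultdict

# Assumption: START lines look like the engine.log excerpts in this report,
# i.e. "START, ConnectStorageServerVDSCommand(HostName = <host>, ... storageType='<TYPE>' ...".
START_RE = re.compile(
    r"START, ConnectStorageServerVDSCommand\(HostName = (?P<host>[^,]+),"
    r".*?storageType='(?P<stype>[^']+)'"
)


def storage_types_by_host(log_lines):
    """Return {host name: set of storage types} seen in ConnectStorageServer START lines."""
    seen = defaultdict(set)
    for line in log_lines:
        match = START_RE.search(line)
        if match:
            seen[match.group("host").strip()].add(match.group("stype"))
    return dict(seen)


def hosts_missing(seen, required):
    """Return {host name: storage types never requested} for hosts lacking any required type.

    A host with no ConnectStorageServer line at all does not appear in `seen`
    and therefore is not reported here.
    """
    return {
        host: required - types
        for host, types in seen.items()
        if required - types
    }
```

Run over the full log with required={"ISCSI", "NFS"}, a host that was only asked to connect to the iSCSI targets would be reported with {"NFS"} missing, which is what the host-1 excerpt above suggests.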
A host must be connected to all storage domains, so adding a host and a storage domain at the same time is likely to cause trouble. The host may not have access to the storage domain because it was not asked to connect to it yet (the host did not exist when the engine added the storage domain).

For system tests, it is best to add the storage domains only after the hosts have been added, or to add all the storage domains before adding the extra host, but not to mix the two flows.

In a real setup, a new host that becomes non-operational will recover automatically after several minutes, so this may not be a real issue. In the tests, we don't want to wait several minutes until a host recovers. System tests should not test esoteric edge cases but the normal flow.
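The ordering suggested above (all hosts up first, then the secondary domains) could look roughly like the sketch below. This is only an illustration and does not use the oVirt SDK: host_statuses and add_domain are hypothetical callables that the test suite would have to supply, and the "up" status string, the timeout and the polling interval are assumed values.

```python
import time


def wait_until(predicate, timeout=600.0, interval=3.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll predicate() until it returns True or `timeout` seconds pass."""
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def add_secondary_domains(host_statuses, add_domain, domain_names,
                          timeout=600.0, sleep=time.sleep):
    """Add the secondary storage domains only once every host is up.

    host_statuses: hypothetical callable returning {host name: status string}.
    add_domain:    hypothetical callable that adds one storage domain by name.
    """
    def all_hosts_up():
        statuses = host_statuses()
        # An empty mapping is treated as "not ready" rather than trivially up.
        return bool(statuses) and all(s == "up" for s in statuses.values())

    if not wait_until(all_hosts_up, timeout=timeout, sleep=sleep):
        raise RuntimeError("not all hosts reached 'up' before the timeout")

    for name in domain_names:
        add_domain(name)
```

The point of the sketch is only that host addition and domain addition do not overlap; the opposite order (all domains first, then the extra host) would avoid mixing the two flows as well.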
(In reply to Nir Soffer from comment #1)
> A host must be connected to all storage domains, so adding a host and storage
> domain in the same time is likely to cause trouble. The host may not have access
> to the storage domain since it was not asked to connect to it yet (the host did
> not exist when engine added the storage domain).
>
> For system tests, it is best to add the storage domains only after the hosts were
> added, or add all the storage domains before adding the extra host, but not mix
> the two flows.
>
> In a real setup, new host becoming non-operational will recover automatically
> after several minutes, so this may not be a real issue. In the tests, we don't want
> to wait for several minutes until a host recovers. System tests should not test
> esoteric edge cases but the normal flow.

You are probably right - I was trying to make the suite run faster - so I wait for the 1st host to be up and then use it to create the master storage domain and then the other storage domains.

What bothers me is that it did not fail until last week or so - it worked well for quite some time.
Posted https://gerrit.ovirt.org/84397 to ensure all hosts are added before secondary domains are added.
(In reply to Yaniv Kaul from comment #2)
> (In reply to Nir Soffer from comment #1)
> > A host must be connected to all storage domains, so adding a host and storage
> > domain in the same time is likely to cause trouble. The host may not have access
> > to the storage domain since it was not asked to connect to it yet (the host did
> > not exist when engine added the storage domain).
> >
> > For system tests, it is best to add the storage domains only after the hosts were
> > added, or add all the storage domains before adding the extra host, but not mix
> > the two flows.
> >
> > In a real setup, new host becoming non-operational will recover automatically
> > after several minutes, so this may not be a real issue. In the tests, we don't want
> > to wait for several minutes until a host recovers. System tests should not test
> > esoteric edge cases but the normal flow.
>
> You are probably right - I was trying to make the suite run faster - so I
> wait for the 1st host to be up and then use it to create the master storage
> domain and the then the other storage domains.
>
> What bothers me is that it did not fail until last week or so - it worked
> well for quite some time.

Perhaps you made it work too fast :-)

Targeting to 4.2.1 - this doesn't seem to be an oVirt GA blocker. Having said that - QE/PM - please weigh in here.
(In reply to Yaniv Kaul from comment #3)
> Posted https://gerrit.ovirt.org/84397 to ensure all hosts are added before
> secondary domains are added.

Closing as it doesn't seem a very interesting scenario and the above eliminated it from ovirt-system-tests.