Bug 1514906 - Race between adding storage domains and a host leaves host as non-operational
Summary: Race between adding storage domains and a host leaves host as non-operational
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Allon Mureinik
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-11-19 11:48 UTC by Yaniv Kaul
Modified: 2022-06-27 12:13 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2017-11-27 11:08:02 UTC
oVirt Team: Storage
Embargoed:
sbonazzo: ovirt-4.2-


Attachments
engine log (2.65 MB, text/plain)
2017-11-19 11:48 UTC, Yaniv Kaul


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-46706 0 None None None 2022-06-27 12:13:51 UTC

Description Yaniv Kaul 2017-11-19 11:48:20 UTC
Created attachment 1355099
engine log

Description of problem:
In ovirt-system-tests, storage domains are added while the second host (host-1) is still being added to the system. This appears to cause a race: host-1 is not connected to all of the secondary domains (no ConnectStorageServer is even sent for them) and it ends up as Non-Operational.

Timeline:

Host-0 (the first host added; it is connected first to the iSCSI master SD and then to the secondary, NFS-based storage domains):
---- host-0

2017-11-17 10:21:52,098-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (default task-17) [7b4e6607-5a01-45a3-8b6a-87cc4b5795ce] START, ConnectStorageServerVDSCommand(HostName = lago-basic-suite-master-host-0, StorageServerConnectionManagementVDSParameters:{hostId='e70aabfb-f76b-4a4e-97ca-b68f6d234360', storagePoolId='00000000-0000-0000-0000-000000000000', storageType='NFS', connectionList='[StorageServerConnections:{id='null', connection='192.168.202.4:/exports/nfs/share1', iqn='null', vfsType='null', mountOptions='null', nfsVersion='V4_2', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]'}), log id: 2d51b079

2017-11-17 10:23:13,727-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (EE-ManagedThreadFactory-engine-Thread-56) [5bf9a05e] START, ConnectStorageServerVDSCommand(HostName = lago-basic-suite-master-host-0, StorageServerConnectionManagementVDSParameters:{hostId='e70aabfb-f76b-4a4e-97ca-b68f6d234360', storagePoolId='00000000-0000-0000-0000-000000000000', storageType='NFS', connectionList='[StorageServerConnections:{id='9c1af421-331a-4483-9e1a-bb8864c3afac', connection='192.168.202.4:/exports/nfs/share2', iqn='null', vfsType='null', mountOptions='null', nfsVersion='V4_1', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]'}), log id: 6bfb50bc


And host-1 (which ends up as problematic):
2017-11-17 10:22:39,829-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-97) [54c41c3f] START, ConnectStorageServerVDSCommand(HostName = lago-basic-suite-master-host-1, StorageServerConnectionManagementVDSParameters:{hostId='aa142561-521f-4d1f-ab5c-0fb440e5a0e1', storagePoolId='cded8424-99cb-4fb2-a9d4-999e3f46f558', storageType='ISCSI', connectionList='[StorageServerConnections:{id='adf99be9-5e6e-4bbd-a0ef-7aed8db79586', connection='192.168.201.4', iqn='iqn.2014-07.org.ovirt:storage', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='5fef0968-a84b-40a6-bfa2-784ff633c992', connection='192.168.202.4', iqn='iqn.2014-07.org.ovirt:storage', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]'}), log id: 77a2062a
...
2017-11-17 10:22:41,866-05 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (EE-ManagedThreadFactory-engineScheduled-Thread-97) [54c41c3f] method: runVdsCommand, params: [ConnectStorageServer, StorageServerConnectionManagementVDSParameters:{hostId='aa142561-521f-4d1f-ab5c-0fb440e5a0e1', storagePoolId='cded8424-99cb-4fb2-a9d4-999e3f46f558', storageType='ISCSI', connectionList='[StorageServerConnections:{id='adf99be9-5e6e-4bbd-a0ef-7aed8db79586', connection='192.168.201.4', iqn='iqn.2014-07.org.ovirt:storage', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='5fef0968-a84b-40a6-bfa2-784ff633c992', connection='192.168.202.4', iqn='iqn.2014-07.org.ovirt:storage', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]'}], timeElapsed: 2042ms
2017-11-17 10:22:41,877-05 INFO  [org.ovirt.engine.core.bll.storage.pool.ConnectHostToStoragePoolServersCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-97) [54c41c3f] Host 'lago-basic-suite-master-host-1' storage connection was succeeded
...
2017-11-17 10:22:41,893-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (EE-ManagedThreadFactory-engine-Thread-38) [54c41c3f] START, ConnectStoragePoolVDSCommand(HostName = lago-basic-suite-master-host-1, ConnectStoragePoolVDSCommandParameters:{hostId='aa142561-521f-4d1f-ab5c-0fb440e5a0e1', vdsId='aa142561-521f-4d1f-ab5c-0fb440e5a0e1', storagePoolId='cded8424-99cb-4fb2-a9d4-999e3f46f558', masterVersion='1'}), log id: 2ae9a4dc



2017-11-17 10:22:43,390-05 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-38) [54c41c3f] Domain 'e3cc9f2e-0833-4a79-b452-b1487308dbfd:templates' was reported with error code '358'
2017-11-17 10:22:43,391-05 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-38) [54c41c3f] Domain '61efd44a-948f-4c17-a1e6-7cef6e422c6a:nfs' was reported with error code '358'
2017-11-17 10:22:43,391-05 ERROR [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (EE-ManagedThreadFactory-engine-Thread-38) [54c41c3f] Storage Domain 'templates' of pool 'test-dc' is in problem in host 'lago-basic-suite-master-host-1'
2017-11-17 10:22:43,392-05 ERROR [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (EE-ManagedThreadFactory-engine-Thread-38) [54c41c3f] Storage Domain 'nfs' of pool 'test-dc' is in problem in host 'lago-basic-suite-master-host-1'
2017-11-17 10:22:43,396-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-38) [54c41c3f] EVENT_ID: VDS_STORAGE_VDS_STATS_FAILED(189), Host lago-basic-suite-master-host-1 reports about one of the Active Storage Domains as Problematic.
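
A rough way to check the timeline above against the full engine log is to list which storage types each host was asked to connect to. The sketch below is only an illustration: the function name `storage_types_by_host` and the regular expression are assumptions written against the START lines quoted in this report, not part of any oVirt tool, and they may need adjusting for other log formats.

```python
import re
from collections import defaultdict

# Matches the START line of ConnectStorageServerVDSCommand as quoted above.
# Assumption: host name and storageType appear in this order on one line.
START_RE = re.compile(
    r"INFO\s+\[[\w.]+\.ConnectStorageServerVDSCommand\].*?"
    r"START, ConnectStorageServerVDSCommand\(HostName = (?P<host>[^,]+),.*?"
    r"storageType='(?P<type>\w+)'"
)


def storage_types_by_host(lines):
    """Map each host name to the set of storage types it was asked to connect to."""
    seen = defaultdict(set)
    for line in lines:
        match = START_RE.search(line)
        if match:
            seen[match.group("host")].add(match.group("type"))
    return dict(seen)
```

Feeding the lines of the attached engine log to `storage_types_by_host` should make it easy to see whether a host that later reports NFS domains as problematic was ever sent an NFS ConnectStorageServer request.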


Version-Release number of selected component (if applicable):
ovirt-engine-4.2.0-0.0.master.20171116212005.git61ffb5f.el7.centos.noarch

How reproducible:
Sometimes.


Additional info:
All logs: http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2252/artifact/exported-artifacts/basic-suite-master__logs/test_logs/basic-suite-master/post-002_bootstrap.py/

Comment 1 Nir Soffer 2017-11-20 13:49:21 UTC
A host must be connected to all storage domains, so adding a host and a storage
domain at the same time is likely to cause trouble. The host may not have access
to the storage domain since it was not asked to connect to it yet (the host did
not exist when engine added the storage domain).

For system tests, it is best to add the storage domains only after the hosts were
added, or add all the storage domains before adding the extra host, but not mix
the two flows.

In a real setup, a new host that becomes non-operational will recover automatically
after several minutes, so this may not be a real issue. In the tests, we don't want
to wait for several minutes until a host recovers. System tests should test the
normal flow, not esoteric edge cases.
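
The ordering suggested above (all hosts first, storage domains afterwards) could be sketched roughly as follows. This is a minimal illustration, not the ovirt-system-tests code: `bootstrap`, `wait_until`, and the callables `add_host`, `host_is_up`, and `add_domain` are hypothetical names standing in for whatever the suite actually uses to talk to the engine.

```python
import time


def wait_until(condition, timeout, interval=1.0):
    """Poll condition() until it returns True, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while not condition():
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.0f seconds" % timeout)
        time.sleep(interval)


def bootstrap(hosts, domains, add_host, host_is_up, add_domain,
              timeout=600.0, interval=3.0):
    """Add all hosts, wait for every one to be up, then add storage domains.

    add_host, host_is_up and add_domain are hypothetical callables supplied
    by the test suite; they are assumptions of this sketch.
    """
    # Phase 1: add every host and wait until all of them report up.
    for host in hosts:
        add_host(host)
    wait_until(lambda: all(host_is_up(h) for h in hosts), timeout, interval)

    # Phase 2: only now add storage domains, so the two flows never overlap.
    for domain in domains:
        add_domain(domain)
```

The point of the barrier between the two phases is that no storage domain is created while a host is still being activated, which trades some suite run time for a deterministic order.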

Comment 2 Yaniv Kaul 2017-11-20 13:53:15 UTC
(In reply to Nir Soffer from comment #1)

You are probably right - I was trying to make the suite run faster - so I wait for the 1st host to be up and then use it to create the master storage domain and then the other storage domains.

What bothers me is that it did not fail until last week or so - it worked well for quite some time.

Comment 3 Yaniv Kaul 2017-11-20 19:56:46 UTC
Posted https://gerrit.ovirt.org/84397 to ensure all hosts are added before secondary domains are added.

Comment 4 Allon Mureinik 2017-11-22 12:44:25 UTC
(In reply to Yaniv Kaul from comment #2)
> What bothers me is that it did not fail until last week or so - it worked
> well for quite some time.
Perhaps you made it work too fast :-)

Targeting to 4.2.1 - this doesn't seem to be an oVirt GA blocker.
Having said that - QE/PM - please weigh in here.

Comment 5 Yaniv Kaul 2017-11-27 11:08:02 UTC
(In reply to Yaniv Kaul from comment #3)
> Posted https://gerrit.ovirt.org/84397 to ensure all hosts are added before
> secondary domains are added.

Closing, as this doesn't seem to be a very interesting scenario and the patch above eliminated it from ovirt-system-tests.

