Description of problem:

Similarly to BZ1499056 (for ISO Domains), if vdsm takes more than vdsTimeout seconds to respond to connectStorageServer for a data storage domain such as iSCSI because of an outage, the host keeps looping between the Connecting and Not Responding states. This is difficult to stop (clicking Maintenance did not work for me); I had to stop the engine and set the host to Maintenance in the DB. The admin gets no clear indication of what is going on: there is just a host timing out and going non-responsive in a loop. The DC goes up and down too.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.3.5-1.el7.centos.noarch
vdsm-4.20.27.1-1.el7.centos.x86_64

How reproducible:
100%

Steps to Reproduce:
Make the storage unreachable so that the iSCSI login takes 180 seconds or more. Hitting the TCP timeout is easiest with a DROP iptables rule.

Actual results:
Host goes Not Responding.

Expected results:
User is warned that a storage domain is unreachable.

Additional info:

2018-05-21 10:43:33,055+10 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-39) [73c7e015] START, ConnectStorageServerVDSCommand(HostName = ovirt-h1, StorageServerConnectionManagementVDSParameters:{hostId='1628bc78-276c-45ca-b7e6-e6bcbfe77273', storagePoolId='6a8adb90-36e7-11e8-872c-001a4a61183e', storageType='ISCSI', connectionList='[StorageServerConnections:{id='7b82e212-40bd-4f0a-9ed3-fa9ab274f122', connection='10.64.24.33', iqn='iqn.2003-01.org.linux-iscsi.gsslab-24-33.x8664:sn.c2d0f339a739', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='f00b96bd-edbe-4bc2-9b18-87e97ddc43c2', connection='10.64.24.45', iqn='iqn.2003-01.org.linux-iscsi.gsslab-24-33.x8664:sn.c2d0f339a738', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]'}), log id: 7e750511

2018-05-21 10:46:35,123+10 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-39) [73c7e015] Command 'ConnectStorageServerVDSCommand(HostName = ovirt-h1, StorageServerConnectionManagementVDSParameters:{hostId='1628bc78-276c-45ca-b7e6-e6bcbfe77273', storagePoolId='6a8adb90-36e7-11e8-872c-001a4a61183e', storageType='ISCSI', connectionList='[StorageServerConnections:{id='7b82e212-40bd-4f0a-9ed3-fa9ab274f122', connection='10.64.24.33', iqn='iqn.2003-01.org.linux-iscsi.gsslab-24-33.x8664:sn.c2d0f339a739', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='f00b96bd-edbe-4bc2-9b18-87e97ddc43c2', connection='10.64.24.45', iqn='iqn.2003-01.org.linux-iscsi.gsslab-24-33.x8664:sn.c2d0f339a738', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues

VDSM took 256s to respond:

2018-05-21 10:47:49,356+1000 INFO (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call StoragePool.connectStorageServer succeeded in 256.30 seconds (__init__:573)

Blocked worker:

2018-05-21 10:47:33,058+1000 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/5 running <Task <JsonRpcTask {'params': {u'connectionParams': [{u'id': u'7b82e212-40bd-4f0a-9ed3-fa9ab274f122',
u'connection': u'10.64.24.33', u'iqn': u'iqn.2003-01.org.linux-iscsi.gsslab-24-33.x8664:sn.c2d0f339a739', u'user': u'', u'tpgt': u'1', u'password': '********', u'port': u'3260'}, {u'id': u'f00b96bd-edbe-4bc2-9b18-87e97ddc43c2', u'connection': u'10.64.24.45', u'iqn': u'iqn.2003-01.org.linux-iscsi.gsslab-24-33.x8664:sn.c2d0f339a738', u'user': u'', u'tpgt': u'1', u'password': '********', u'port': u'3260'}], u'storagepoolID': u'6a8adb90-36e7-11e8-872c-001a4a61183e', u'domainType': 3}, 'jsonrpc': '2.0', 'method': u'StoragePool.connectStorageServer', 'id': u'6a36ae3a-3338-4175-af28-a07f7f371609'} at 0x7f2a7408aad0> timeout=60, duration=240 at 0x7f2a7408ae10> task#=16 at 0x7f2a7416a710>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__  self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 523, in __call__  self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 566, in _serveRequest  response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request  res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 201, in _dynamicMethod  result = fn(*methodArgs)
File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1070, in connectStorageServer  connectionParams)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/dispatcher.py", line 73, in wrapper  result = ctask.prepare(func, *args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 108, in wrapper  return m(self, *a, **kw)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1179, in prepare  result = self._run(func, *args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run  return fn(*args, **kargs)
File: "<string>", line 2, in connectStorageServer
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2395, in connectStorageServer  conObj.connect()
File: "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 487, in connect  iscsi.addIscsiNode(self._iface, self._target, self._cred)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/iscsi.py", line 217, in addIscsiNode  iscsiadm.node_login(iface.name, target.address, target.iqn)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/iscsiadm.py", line 327, in node_login  portal, "-l"])
File: "/usr/lib/python2.7/site-packages/vdsm/storage/iscsiadm.py", line 122, in _runCmd  return misc.execCmd(cmd, printable=printCmd, sudo=True, sync=sync)
File: "/usr/lib/python2.7/site-packages/vdsm/common/commands.py", line 80, in execCmd  (out, err) = p.communicate(data)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 924, in communicate  stdout, stderr = self._communicate(input, endtime, timeout)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 1706, in _communicate  orig_timeout)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 1779, in _communicate_with_poll  ready = poller.poll(self._remaining_time(endtime))
(executor:363)
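For reference, a minimal sketch of how the outage in the steps to reproduce can be simulated. This is not part of vdsm or the engine; the portal addresses are taken from the ConnectStorageServerVDSCommand log above, and the use of iptables DROP rules on the iSCSI port is simply the "easiest TCP timeout" approach mentioned in the description.

# Hypothetical reproduction helper: drop outgoing traffic to the iSCSI
# portals so that the iscsiadm login hangs until the TCP timeout, pushing
# connectStorageServer past vdsTimeout.
import subprocess

# Portal addresses taken from the ConnectStorageServerVDSCommand log above.
PORTALS = ["10.64.24.33", "10.64.24.45"]


def set_block(enabled):
    """Insert (enabled=True) or remove (enabled=False) a DROP rule per portal."""
    action = "-I" if enabled else "-D"
    for ip in PORTALS:
        subprocess.check_call([
            "iptables", action, "OUTPUT",
            "-d", ip, "-p", "tcp", "--dport", "3260",
            "-j", "DROP",
        ])


if __name__ == "__main__":
    set_block(True)     # run as root on the host before (re)activating it
    # ... activate the host and watch connectStorageServer time out ...
    # set_block(False)  # restore connectivity afterwards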
Fred, anything we can do about it?
One solution could be to move the connectStorageServer verb to be asynchronous, as the NFS connect can take a long time before failing. This needs some design and input from Nir.
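To make the suggestion concrete, here is a rough sketch of the asynchronous pattern being proposed, in plain Python with hypothetical names (start_connect/poll_connect). This is not vdsm's actual API; it only illustrates returning immediately and letting the engine poll for the outcome instead of blocking a JSON-RPC worker for minutes.

# Illustration only: a hypothetical asynchronous variant of connectStorageServer.
import threading
import time
import uuid

_results = {}              # task id -> result dict
_lock = threading.Lock()


def _slow_connect(connection_params):
    """Stand-in for the slow part (iscsiadm node login, NFS mount, ...)."""
    time.sleep(5)          # pretend the login takes a while
    return 0               # 0 ~ success, loosely mirroring vdsm status codes


def _run_connect(task_id, connection_params):
    status = _slow_connect(connection_params)
    with _lock:
        _results[task_id] = {"state": "done", "status": status}


def start_connect(connection_params):
    """Return a task id right away instead of blocking the RPC worker."""
    task_id = str(uuid.uuid4())
    with _lock:
        _results[task_id] = {"state": "running"}
    threading.Thread(target=_run_connect,
                     args=(task_id, connection_params),
                     daemon=True).start()
    return task_id


def poll_connect(task_id):
    """Caller (the engine) polls until state == 'done', or gives up cleanly."""
    with _lock:
        return dict(_results.get(task_id, {"state": "unknown"}))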
Unlike ISO domains, data domains are critical to our operation, and we don't want a host to be up if it's not seeing all data domains. That being said, it should not be in a loop but in a non-operational state.
Right, the expected outcome is 'Non Operational'. But it never reaches that stage, since connectStorageServer times out, making the engine enter a loop of Connecting -> connectStorageServer -> timeout -> Not Responding -> Connecting...
(In reply to Fred Rolland from comment #2)
> One solution could be to move the connectStorageServer verb to be
> asynchronous, as the NFS connect can take a long time before failing.
>
> This needs some design and input from Nir.

Nir, what do you think?
I agree with Fred. If something can take a lot of time, it should be async.
*** This bug has been marked as a duplicate of bug 1561522 ***
Inverting the DUP bug as this has more info and some discussions under way.
*** Bug 1561522 has been marked as a duplicate of this bug. ***
We just saw this during a partial storage network outage, where one or more paths fail to connect. That delays the command and causes the same issue, so this is important for resiliency against storage/network failures.
This bug has not been marked as blocker for oVirt 4.3.0. Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.
Fixed in https://bugzilla.redhat.com/show_bug.cgi?id=1684554 by patch https://gerrit.ovirt.org/#/c/98633/

The host will move to Non-Operational if the connection to the storage server fails while activating the host; it will no longer loop on the status.
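In engine terms, the change amounts to treating a failed or timed-out connectStorageServer during activation as a reason to park the host as Non-Operational instead of retrying the activation. A loose Python sketch of that decision follows; all names here are hypothetical stand-ins, and the real change lives in the engine's Java activation/monitoring code referenced by the patch above.

# Loose illustration of the fixed activation flow (hypothetical names).
VDS_TIMEOUT = 180  # seconds; assumed engine-side vdsTimeout for this setup


class StorageConnectionError(Exception):
    pass


def connect_storage_server(host, timeout):
    """Stand-in for the engine's ConnectStorageServerVDSCommand call."""
    raise StorageConnectionError("simulated: iSCSI portal unreachable")


def set_host_status(host, status, reason=None):
    print("%s -> %s%s" % (host, status, " (%s)" % reason if reason else ""))


def activate_host(host):
    try:
        connect_storage_server(host, timeout=VDS_TIMEOUT)
    except (TimeoutError, StorageConnectionError):
        # Before the fix, this failure bounced the host back to Connecting and
        # the activation was retried, producing the Connecting/NotResponding
        # loop. With the fix, the host is parked as NonOperational and
        # HostMonitoring retries the storage connection on its next cycle.
        set_host_status(host, "NonOperational",
                        reason="STORAGE_DOMAIN_UNREACHABLE")
        return
    set_host_status(host, "Up")


activate_host("host_mixed_2")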
Tested on: ovirt-engine-4.3.3.1-0.1.el7.noarch

I blocked the connection between a host and the iSCSI SD using iptables.

Actual result:
The host status switched from Up to Non-Operational without looping through Connecting and Non-Responsive, as expected. But since then, every 5 minutes the host status changes from Non-Operational -> Unassigned and back to Non-Operational.

2019-04-04 11:55:00,048+03 INFO [org.ovirt.engine.core.bll.ActivateVdsCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-56) [659d89e6] Before acquiring lock in order to prevent monitoring for host 'host_mixed_2' from data-center 'golden_env_mixed'
2019-04-04 11:55:00,049+03 INFO [org.ovirt.engine.core.bll.ActivateVdsCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-56) [659d89e6] Lock acquired, from now a monitoring of host will be skipped for host 'host_mixed_2' from data-center 'golden_env_mixed'
2019-04-04 11:55:00,062+03 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-56) [659d89e6] START, SetVdsStatusVDSCommand(HostName = host_mixed_2, SetVdsStatusVDSCommandParameters:{hostId='c54a1613-9828-4e95-a0ec-e4372012bdc9', status='Unassigned', nonOperationalReason='NONE', stopSpmFailureLogged='false', maintenanceReason='null'}), log id: 3f0d5e8a
2019-04-04 11:55:00,070+03 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-56) [659d89e6] FINISH, SetVdsStatusVDSCommand, return: , log id: 3f0d5e8a
2019-04-04 11:55:00,080+03 INFO [org.ovirt.engine.core.bll.ActivateVdsCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-56) [659d89e6] Activate host finished. Lock released. Monitoring can run now for host 'host_mixed_2' from data-center 'golden_env_mixed'
.....
2019-04-04 11:55:10,658+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-62068) [324c1f52] EVENT_ID: VDS_STORAGE_VDS_STATS_FAILED(189), Host host_mixed_2 reports about one of the Active Storage Domains as Problematic.
2019-04-04 11:55:10,703+03 INFO [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-60) [12078c3f] Running command: SetNonOperationalVdsCommand internal: true. Entities affected : ID: c54a1613-9828-4e95-a0ec-e4372012bdc9 Type: VDS
2019-04-04 11:55:10,711+03 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-60) [12078c3f] START, SetVdsStatusVDSCommand(HostName = host_mixed_2, SetVdsStatusVDSCommandParameters:{hostId='c54a1613-9828-4e95-a0ec-e4372012bdc9', status='NonOperational', nonOperationalReason='STORAGE_DOMAIN_UNREACHABLE', stopSpmFailureLogged='false', maintenanceReason='null'}), log id: 2309f9ab
2019-04-04 11:55:10,719+03 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-60) [12078c3f] FINISH, SetVdsStatusVDSCommand, return: , log id: 2309f9ab
2019-04-04 11:55:10,860+03 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-60) [12078c3f] EVENT_ID: VDS_SET_NONOPERATIONAL_DOMAIN(522), Host host_mixed_2 cannot access the Storage Domain(s) iscsi_1, iscsi_0, iscsi_2 attached to the Data Center golden_env_mixed. Setting Host state to Non-Operational.

Is this the expected behavior in terms of status changes?
This is the scenario to test:

Have a DC with 2 hosts and at least one SD.
Put Host 2 in Maintenance.
Block access to the SD on Host 2.
Move Host 2 to Up.

Without the fix the host never moves to NonOperational; it keeps moving between Connecting and Up.
With the fix the host moves to NonOperational, and HostMonitoring tries to connect to the storage domain again on the next cycle.
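One possible way to automate this scenario is with the oVirt Python SDK (ovirtsdk4). The engine URL, credentials and host name below are placeholders, and the "block access to the SD" step is assumed to be done on the host itself (e.g. with the iptables DROP rule from the description).

# Sketch of automating the verification scenario with ovirtsdk4.
import time
import ovirtsdk4 as sdk

conn = sdk.Connection(
    url="https://engine.example.com/ovirt-engine/api",
    username="admin@internal",
    password="secret",
    insecure=True,
)

hosts_service = conn.system_service().hosts_service()
host = hosts_service.list(search="name=host_mixed_2")[0]
host_service = hosts_service.host_service(host.id)

host_service.deactivate()        # put Host 2 in Maintenance
time.sleep(30)                   # crude wait for Maintenance; poll in real code
# ... block access to the SD on the host here ...
host_service.activate()          # move Host 2 back to Up

# With the fix we expect NonOperational instead of a Connecting/Up loop.
for _ in range(30):
    print(hosts_service.host_service(host.id).get().status)
    time.sleep(10)

conn.close()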
(In reply to Fred Rolland from comment #16)
> This is the scenario to test:
>
> Have a DC with 2 hosts and at least one SD.
> Put Host 2 in Maintenance.
> Block access to the SD on Host 2.
> Move Host 2 to Up.
>
> Without the fix the host never moves to NonOperational; it keeps moving
> between Connecting and Up.
> With the fix the host moves to NonOperational, and HostMonitoring tries to
> connect to the storage domain again on the next cycle.

Thanks for the update, Fred! Yosi will verify using this scenario.

What about the issue seen in the last comment (c#15) from Yosi, where the host state moves from 'Non-Operational' -> 'Unassigned' and back every 5 minutes? A few questions we need your help with, to clarify and maybe open a new bug:
1) Does the host moving from 'Non-Operational' -> 'Unassigned' and back mean the host is treated as down and not active in any way? Is that the case?
2) What is the reason for this move? Is there any impact on the customer?
3) Is this a regression?
The host is not moving to Unassigned in this flow.
Tested using: ovirt-engine-4.3.3.2-0.1.el7.noarch

Actual result:
The blocked host doesn't loop from Connecting -> Up and goes straight to NonOperational after the timeout, as expected.

Moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:1085