Bug 1580243 - A host with Unreachable Data SD is not moving to non-operational when brought up (but remains in a loop between non-responsive and connecting)
Summary: A host with Unreachable Data SD is not moving to non-operational when brought...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.2.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: ovirt-4.3.3
Target Release: 4.3.0
Assignee: Fred Rolland
QA Contact: Yosi Ben Shimon
URL:
Whiteboard:
Duplicates: 1561522 (view as bug list)
Depends On:
Blocks:
 
Reported: 2018-05-21 05:24 UTC by Germano Veit Michel
Modified: 2022-03-13 15:12 UTC (History)
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-05-08 12:37:35 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1499056 0 high CLOSED Unreachable ISO/Export SD prevents hosts from activating 2021-12-10 15:20:03 UTC
Red Hat Issue Tracker RHV-45186 0 None None None 2022-03-13 15:12:36 UTC
Red Hat Knowledge Base (Solution) 3208571 0 None None None 2018-05-21 05:25:38 UTC
Red Hat Product Errata RHEA-2019:1085 0 None None None 2019-05-08 12:37:56 UTC

Internal Links: 1499056

Description Germano Veit Michel 2018-05-21 05:24:57 UTC
Description of problem:

Similar to BZ1499056 (for ISO domains): if VDSM takes more than vdsTimeout seconds to respond to connectStorageServer for a Data Storage Domain such as iSCSI, for example due to an outage, the host keeps looping between the Connecting and Not Responding states.

This is difficult to stop (clicking Maintenance did not work for me); I had to stop the engine and set the host to Maintenance in the DB. The admin gets no clear indication of what is going on; there is just a host timing out and going Not Responding in a loop. The DC goes up and down too.
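
For reference, a minimal sketch of that last-resort DB workaround, assuming the default 'engine' database and that status value 2 maps to Maintenance in this version's VDSStatus enum (not a supported procedure; verify on your version before use):

# Stop ovirt-engine first, then force the host to Maintenance directly in the DB.
# Status 2 is assumed to be Maintenance here; 'ovirt-h1' is the host from this report.
su - postgres -c "psql engine -c \"UPDATE vds_dynamic SET status = 2 WHERE vds_id = (SELECT vds_id FROM vds_static WHERE vds_name = 'ovirt-h1');\""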

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.3.5-1.el7.centos.noarch
vdsm-4.20.27.1-1.el7.centos.x86_64

How reproducible:
100%

Steps to Reproduce:
Make the storage unreachable so that the iSCSI login takes 180 seconds or more. Hitting the TCP timeout is easiest with a DROP iptables rule, as shown below.
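
For example, a sketch of such rules on the host (portal IPs and port taken from the log excerpt in "Additional info"; adjust to your environment):

# Silently drop traffic to the iSCSI portals so the login hangs until the TCP timeout:
iptables -A OUTPUT -d 10.64.24.33 -p tcp --dport 3260 -j DROP
iptables -A OUTPUT -d 10.64.24.45 -p tcp --dport 3260 -j DROP
# Afterwards, remove the rules with the same commands, using -D instead of -A.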

Actual results:
Host keeps looping between Connecting and Not Responding.

Expected results:
User is warned that an SD is unreachable.

Additional info:
2018-05-21 10:43:33,055+10 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-39) [73c7e015] START, ConnectStorageServerVDSCommand(HostName = ovirt-h1, StorageServerConnectionManagementVDSParameters:{hostId='1628bc78-276c-45ca-b7e6-e6bcbfe77273', storagePoolId='6a8adb90-36e7-11e8-872c-001a4a61183e', storageType='ISCSI', connectionList='[StorageServerConnections:{id='7b82e212-40bd-4f0a-9ed3-fa9ab274f122', connection='10.64.24.33', iqn='iqn.2003-01.org.linux-iscsi.gsslab-24-33.x8664:sn.c2d0f339a739', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='f00b96bd-edbe-4bc2-9b18-87e97ddc43c2', connection='10.64.24.45', iqn='iqn.2003-01.org.linux-iscsi.gsslab-24-33.x8664:sn.c2d0f339a738', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]'}), log id: 7e750511

2018-05-21 10:46:35,123+10 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-39) [73c7e015] Command 'ConnectStorageServerVDSCommand(HostName = ovirt-h1, StorageServerConnectionManagementVDSParameters:{hostId='1628bc78-276c-45ca-b7e6-e6bcbfe77273', storagePoolId='6a8adb90-36e7-11e8-872c-001a4a61183e', storageType='ISCSI', connectionList='[StorageServerConnections:{id='7b82e212-40bd-4f0a-9ed3-fa9ab274f122', connection='10.64.24.33', iqn='iqn.2003-01.org.linux-iscsi.gsslab-24-33.x8664:sn.c2d0f339a739', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='f00b96bd-edbe-4bc2-9b18-87e97ddc43c2', connection='10.64.24.45', iqn='iqn.2003-01.org.linux-iscsi.gsslab-24-33.x8664:sn.c2d0f339a738', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues

VDSM took 256s to respond:
2018-05-21 10:47:49,356+1000 INFO  (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call StoragePool.connectStorageServer succeeded in 256.30 seconds (__init__:573)

Blocked worker:
2018-05-21 10:47:33,058+1000 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/5 running <Task <JsonRpcTask {'params': {u'connectionParams': [{u'id': u'7b82e212-40bd-4f0a-9ed3-fa9ab274f122', u'connection': u'10.64.24.33', u'iqn': u'iqn.2003-01.org.linux-iscsi.gsslab-24-33.x8664:sn.c2d0f339a739', u'user': u'', u'tpgt': u'1', u'password': '********', u'port': u'3260'}, {u'id': u'f00b96bd-edbe-4bc2-9b18-87e97ddc43c2', u'connection': u'10.64.24.45', u'iqn': u'iqn.2003-01.org.linux-iscsi.gsslab-24-33.x8664:sn.c2d0f339a738', u'user': u'', u'tpgt': u'1', u'password': '********', u'port': u'3260'}], u'storagepoolID': u'6a8adb90-36e7-11e8-872c-001a4a61183e', u'domainType': 3}, 'jsonrpc': '2.0', 'method': u'StoragePool.connectStorageServer', 'id': u'6a36ae3a-3338-4175-af28-a07f7f371609'} at 0x7f2a7408aad0> timeout=60, duration=240 at 0x7f2a7408ae10> task#=16 at 0x7f2a7416a710>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 523, in __call__
  self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 566, in _serveRequest
  response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
  res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 201, in _dynamicMethod
  result = fn(*methodArgs)
File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1070, in connectStorageServer
  connectionParams)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/dispatcher.py", line 73, in wrapper
  result = ctask.prepare(func, *args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 108, in wrapperFile: "<string>", line 2, in connectStorageServer
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2395, in connectStorageServer
  conObj.connect()
File: "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 487, in connect
  iscsi.addIscsiNode(self._iface, self._target, self._cred)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/iscsi.py", line 217, in addIscsiNode
  iscsiadm.node_login(iface.name, target.address, target.iqn)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/iscsiadm.py", line 327, in node_login
  portal, "-l"])
File: "/usr/lib/python2.7/site-packages/vdsm/storage/iscsiadm.py", line 122, in _runCmd
  return misc.execCmd(cmd, printable=printCmd, sudo=True, sync=sync)
File: "/usr/lib/python2.7/site-packages/vdsm/common/commands.py", line 80, in execCmd
  (out, err) = p.communicate(data)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 924, in communicate
  stdout, stderr = self._communicate(input, endtime, timeout)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 1706, in _communicate
  orig_timeout)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 1779, in _communicate_with_poll
  ready = poller.poll(self._remaining_time(endtime)) (executor:363)
  return m(self, *a, **kw)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1179, in prepare
  result = self._run(func, *args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
  return fn(*args, **kargs)
File: "<string>", line 2, in connectStorageServer
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2395, in connectStorageServer
  conObj.connect()
File: "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 487, in connect
  iscsi.addIscsiNode(self._iface, self._target, self._cred)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/iscsi.py", line 217, in addIscsiNode
  iscsiadm.node_login(iface.name, target.address, target.iqn)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/iscsiadm.py", line 327, in node_login
  portal, "-l"])
File: "/usr/lib/python2.7/site-packages/vdsm/storage/iscsiadm.py", line 122, in _runCmd
  return misc.execCmd(cmd, printable=printCmd, sudo=True, sync=sync)
File: "/usr/lib/python2.7/site-packages/vdsm/common/commands.py", line 80, in execCmd
  (out, err) = p.communicate(data)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 924, in communicate
  stdout, stderr = self._communicate(input, endtime, timeout)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 1706, in _communicate
  orig_timeout)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 1779, in _communicate_with_poll
  ready = poller.poll(self._remaining_time(endtime)) (executor:363)

Comment 1 Tal Nisan 2018-05-21 14:17:34 UTC
Fred, anything we can do about it?

Comment 2 Fred Rolland 2018-05-21 15:32:51 UTC
One solution could be to make the connectStorageServer verb asynchronous, as the NFS connect can take a long time before failing.

This needs some design and input from Nir.

Comment 3 Yaniv Kaul 2018-05-22 05:33:29 UTC
Unlike ISO domains, data domains are critical to our operation, and we don't want a host to be up if it's not seeing all data domains. That being said, it should not be in a loop but in a non-operational state.

Comment 4 Germano Veit Michel 2018-05-22 05:37:28 UTC
Right, the expected outcome is 'Non Operational'. But it never reaches that state, since connectStorageServer times out, making the engine enter a loop of Connecting -> connectStorageServer -> timeout -> Not Responding -> Connecting...

Comment 5 Tal Nisan 2018-05-27 10:22:43 UTC
(In reply to Fred Rolland from comment #2)
> One solution could be to move the connectStorageServer verb to be asynch, as
> the NFS connect can take long before failing.
> 
> This needs some design and input from Nir.

Nir, what do you think?

Comment 6 Nir Soffer 2018-06-04 15:56:27 UTC
I agree with Fred. If something can take a lot of time, it should be async.

Comment 8 Yaniv Lavi 2018-08-13 08:57:20 UTC

*** This bug has been marked as a duplicate of bug 1561522 ***

Comment 9 Germano Veit Michel 2018-12-19 00:46:27 UTC
Inverting the duplicate relationship, as this bug has more info and some discussion under way.

Comment 10 Germano Veit Michel 2018-12-19 00:47:20 UTC
*** Bug 1561522 has been marked as a duplicate of this bug. ***

Comment 11 Germano Veit Michel 2018-12-20 06:50:11 UTC
We just saw this in a partial storage network outage, where one or more paths fail to connect. That delays the command and causes the same issue, so this is important for resiliency against storage/network failures.

Comment 12 Sandro Bonazzola 2019-01-28 09:41:55 UTC
This bug has not been marked as blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Comment 14 Fred Rolland 2019-03-24 13:19:59 UTC
Fixed in https://bugzilla.redhat.com/show_bug.cgi?id=1684554 in patch https://gerrit.ovirt.org/#/c/98633/

The host will move to Non-Operational if the connection to the storage server fails when activating the host. It will no longer loop between statuses.

Comment 15 Yosi Ben Shimon 2019-04-04 09:18:33 UTC
Tested on:
ovirt-engine-4.3.3.1-0.1.el7.noarch

I blocked the connection between a host and the iSCSI SD using iptables.

Actual result:
The host status switched from Up to Non-Operational without looping through Connecting and Non-Responsive, as expected.


But since then, every 5 minutes, the host status changes from Non-Operational -> Unassigned and back to Non-Operational.

2019-04-04 11:55:00,048+03 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-56) [659d89e6] Before acquiring lock in order to prevent monitoring for host 'host_mixed_2' from data-center 'golden_env_mixed'
2019-04-04 11:55:00,049+03 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-56) [659d89e6] Lock acquired, from now a monitoring of host will be skipped for host 'host_mixed_2' from data-center 'golden_env_mixed'
2019-04-04 11:55:00,062+03 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-56) [659d89e6] START, SetVdsStatusVDSCommand(HostName = host_mixed_2, SetVdsStatusVDSCommandParameters:{hostId='c54a1613-9828-4e95-a0ec-e4372012bdc9', status='Unassigned', nonOperationalReason='NONE', stopSpmFailureLogged='false', maintenanceReason='null'}), log id: 3f0d5e8a
2019-04-04 11:55:00,070+03 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-56) [659d89e6] FINISH, SetVdsStatusVDSCommand, return: , log id: 3f0d5e8a
2019-04-04 11:55:00,080+03 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-56) [659d89e6] Activate host finished. Lock released. Monitoring can run now for host 'host_mixed_2' from data-center 'golden_env_mixed'


.....


2019-04-04 11:55:10,658+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-62068) [324c1f52] EVENT_ID: VDS_STORAGE_VDS_STATS_FAILED(189), Host host_mixed_2 reports about one of the Active Storage Domains as Problematic.
2019-04-04 11:55:10,703+03 INFO  [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-60) [12078c3f] Running command: SetNonOperationalVdsCommand internal: true. Entities affected :  ID: c54a1613-9828-4e95-a0ec-e4372012bdc9 Type: VDS
2019-04-04 11:55:10,711+03 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-60) [12078c3f] START, SetVdsStatusVDSCommand(HostName = host_mixed_2, SetVdsStatusVDSCommandParameters:{hostId='c54a1613-9828-4e95-a0ec-e4372012bdc9', status='NonOperational', nonOperationalReason='STORAGE_DOMAIN_UNREACHABLE', stopSpmFailureLogged='false', maintenanceReason='null'}), log id: 2309f9ab
2019-04-04 11:55:10,719+03 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-60) [12078c3f] FINISH, SetVdsStatusVDSCommand, return: , log id: 2309f9ab
2019-04-04 11:55:10,860+03 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-60) [12078c3f] EVENT_ID: VDS_SET_NONOPERATIONAL_DOMAIN(522), Host host_mixed_2 cannot access the Storage Domain(s) iscsi_1, iscsi_0, iscsi_2 attached to the Data Center golden_env_mixed. Setting Host state to Non-Operational.



Is this the expected behavior in terms of status changes?

Comment 16 Fred Rolland 2019-04-04 12:00:16 UTC
This is the scenario to test:

Have a DC with 2 hosts and at least one SD.
Put Host 2 in maintenance.
Block access to SD in Host2
Move Host2 to UP.

Without the fix the host never moves to NonOperational, it keeps moving between connecting and Up.
With the fix the host moves to NonOperational. HostMonitoring on the next cycle tries to connect to storage domain again.
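
A quick way to watch the host status while running this scenario is to poll the REST API (a sketch only; the engine URL, credentials and host name below are placeholders to adapt):

watch -n 10 'curl -s -k -u admin@internal:PASSWORD \
  "https://ENGINE_FQDN/ovirt-engine/api/hosts?search=name%3Dhost_mixed_2" \
  | grep -o "<status>[^<]*</status>"'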

Comment 17 Avihai 2019-04-04 12:44:52 UTC
(In reply to Fred Rolland from comment #16)
> This is the scenario to test:
> 
> Have a DC with 2 hosts and at least one SD.
> Put Host 2 in maintenance.
> Block access to SD in Host2
> Move Host2 to UP.
> 
> Without the fix the host never moves to NonOperational, it keeps moving
> between connecting and Up.
> With the fix the host moves to NonOperational. HostMonitoring on the next
> cycle tries to connect to storage domain again.

Thanks for the update, Fred!
Yosi will test this scenario to verify.

What about the issue seen in the last comment (c#15) from Yosi, where the host state moves from 'Non-operational' -> 'Unassigned' and back every 5 minutes?
A few questions we need your help clarifying, and maybe a new bug to open:

1) Does the host state moving from 'Non-operational' -> 'Unassigned' and back mean the host is treated as down and not in any way active? Is this the case?

2) What is the reason for this move? Is there any impact on the customer?

3) Is this a regression?

Comment 18 Fred Rolland 2019-04-04 14:24:14 UTC
The host does not move to Unassigned in this flow.

Comment 19 Yosi Ben Shimon 2019-04-07 12:20:04 UTC
Tested using:
ovirt-engine-4.3.3.2-0.1.el7.noarch

Actual result:
The blocked host doesn't loop between Connecting and Up; it goes straight to NonOperational after the timeout, as expected.


Moving to VERIFIED.

Comment 21 errata-xmlrpc 2019-05-08 12:37:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:1085

Comment 22 Daniel Gur 2019-08-28 13:12:36 UTC
sync2jira

Comment 23 Daniel Gur 2019-08-28 13:16:48 UTC
sync2jira

