+++ This bug was initially created as a clone of Bug #1609792 +++

Description of problem:
vdsm-gluster tries to run heal operations on all volumes. It fails to run on a single-node volume, which causes vdsm to time out, stop communicating with the engine, and go offline.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Create a single-brick/single-node volume in Gluster that is part of a RHEV cluster, like the following:

Volume Name: rhev_export
Type: Distribute
Volume ID: ebd22164-41cf-45be-8fb4-c8f684ada43e
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: rdma
Bricks:
Brick1: 100.64.78.11:/gluster/brick/rhev_export
Options Reconfigured:
network.ping-timeout: 30
server.allow-insecure: on
storage.owner-gid: 36
storage.owner-uid: 36
network.remote-dio: enable
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
auth.allow: *
user.cifs: off
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

2. Wait for VDSM to go offline.
Actual results:

==> /var/log/vdsm/mom.log <==
2018-07-30 07:55:26,729 - mom.VdsmRpcBase - ERROR - Command Host.getAllVmStats with args {} failed: (code=1100, message=Not enough resources: {'reason': 'Too many tasks', 'resource': 'jsonrpc', 'current_tasks': 80})

==> /var/log/vdsm/vdsm.log <==
2018-07-30 07:55:32,647-0400 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/5 running <Task <JsonRpcTask {'params': {u'volumeName': u'rhev_export'}, 'jsonrpc': '2.0', 'method': u'GlusterVolume.healInfo', 'id': u'9daf4941-c9fb-4387-a51e-cadcae22272f'} at 0x7f3148341390> timeout=60, duration=41880 at 0x7f3148341450> task#=29967 at 0x7f316c070950>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 523, in __call__
  self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 566, in _serveRequest
  response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
  res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, in _dynamicMethod
  result = fn(*methodArgs)
File: "/usr/lib/python2.7/site-packages/vdsm/gluster/apiwrapper.py", line 129, in healInfo
  return self._gluster.volumeHealInfo(volumeName)
File: "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 90, in wrapper
  rv = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 776, in volumeHealInfo
  return {'healInfo': self.svdsmProxy.glusterVolumeHealInfo(volumeName)}
File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__
  return callMethod()
File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 53, in <lambda>
  **kwargs)
File: "<string>", line 2, in glusterVolumeHealInfo
File: "/usr/lib64/python2.7/multiprocessing/managers.py", line 759, in _callmethod
  kind, result = conn.recv() (executor:363)

2018-07-30 07:55:34,025-0400 INFO (periodic/0) [vdsm.api] START repoStats(domains=()) from=internal, task_id=7a0d264c-dd13-4b77-a1ef-bed4637e7017 (api:46)
2018-07-30 07:55:34,025-0400 INFO (periodic/0) [vdsm.api] FINISH repoStats return={} from=internal, task_id=7a0d264c-dd13-4b77-a1ef-bed4637e7017 (api:52)
2018-07-30 07:55:34,025-0400 INFO (periodic/0) [vdsm.api] START multipath_health() from=internal, task_id=0d984935-c93d-4ef5-98ee-d3f5a5109c8c (api:46)
2018-07-30 07:55:34,025-0400 INFO (periodic/0) [vdsm.api] FINISH multipath_health return={} from=internal, task_id=0d984935-c93d-4ef5-98ee-d3f5a5109c8c (api:52)

==> /var/log/vdsm/mom.log <==
2018-07-30 07:55:34,026 - mom.RPCServer - INFO - ping()
2018-07-30 07:55:34,027 - mom.RPCServer - INFO - getStatistics()
2018-07-30 07:55:41,745 - mom.VdsmRpcBase - ERROR - Command Host.getAllVmStats with args {} failed: (code=1100, message=Not enough resources: {'reason': 'Too many tasks', 'resource': 'jsonrpc', 'current_tasks': 80})

Expected results:
Heal operations are not run (or are skipped) and the host remains online.

Additional info:

--- Additional comment from Denis Chaplygin on 2018-08-07 10:16:57 EDT ---

According to additional info provided by Dan, gluster commands may get stuck, and VDSM then waits for them indefinitely, exhausting its worker threads.
I think we need to wait for the gluster command result with a timeout, using AsyncProc.
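To illustrate the proposed fix, here is a minimal sketch of bounding an external gluster CLI call with a hard timeout. This is not vdsm's actual AsyncProc API (and vdsm at the time ran on Python 2.7); it uses the Python 3 `subprocess` module, and the `run_with_timeout` helper name and the example command are hypothetical:

```python
import subprocess

def run_with_timeout(cmd, timeout=60):
    """Run an external command, killing it if it exceeds `timeout`
    seconds, so a hung process cannot block the caller forever.
    Returns stdout; raises RuntimeError on timeout or non-zero exit."""
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        raise RuntimeError("command %r timed out after %ss" % (cmd, timeout))
    if proc.returncode != 0:
        raise RuntimeError("command %r failed: %s"
                          % (cmd, proc.stderr.decode()))
    return proc.stdout.decode()

# Hypothetical usage: bound the heal-info query so a stuck volume
# cannot pin a vdsm worker thread indefinitely.
# xml_out = run_with_timeout(
#     ["gluster", "volume", "heal", "rhev_export", "info", "--xml"],
#     timeout=60)
```

With a bound like this, a stuck `gluster` process is killed and the worker thread returns an error instead of blocking until the executor reports "Worker blocked" and the task queue fills up.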
Cloned this bug to vdsm to investigate the issue mentioned by Denis about stuck gluster commands exhausting vdsm worker threads.
Moving to 4.3.0 as it's not a blocker
This bug has not been marked as blocker for oVirt 4.3.0. Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.
The patches are merged in master and available in 4.3.0
Kaustav, please provide doc_text
Tested with RHV 4.3.3-3. With a mix of distribute and replicate volumes managed by RHV Manager, no issues were found.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:1077
*** Bug 1712654 has been marked as a duplicate of this bug. ***