Description of problem:

vdsm-gluster tries to run heal operations on all volumes. It fails on a single-node volume, which causes vdsm to time out, stop communicating with the engine, and eventually go offline (see the sketch under Additional info below).

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Create a single-brick (single-node) volume in Gluster that is part of a RHEV cluster, like the following:

Volume Name: rhev_export
Type: Distribute
Volume ID: ebd22164-41cf-45be-8fb4-c8f684ada43e
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: rdma
Bricks:
Brick1: 100.64.78.11:/gluster/brick/rhev_export
Options Reconfigured:
network.ping-timeout: 30
server.allow-insecure: on
storage.owner-gid: 36
storage.owner-uid: 36
network.remote-dio: enable
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
auth.allow: *
user.cifs: off
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

2. Wait for VDSM to go offline.

Actual results:

==> /var/log/vdsm/mom.log <==
2018-07-30 07:55:26,729 - mom.VdsmRpcBase - ERROR - Command Host.getAllVmStats with args {} failed: (code=1100, message=Not enough resources: {'reason': 'Too many tasks', 'resource': 'jsonrpc', 'current_tasks': 80})

==> /var/log/vdsm/vdsm.log <==
2018-07-30 07:55:32,647-0400 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/5 running <Task <JsonRpcTask {'params': {u'volumeName': u'rhev_export'}, 'jsonrpc': '2.0', 'method': u'GlusterVolume.healInfo', 'id': u'9daf4941-c9fb-4387-a51e-cadcae22272f'} at 0x7f3148341390> timeout=60, duration=41880 at 0x7f3148341450> task#=29967 at 0x7f316c070950>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 523, in __call__
  self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 566, in _serveRequest
  response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
  res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, in _dynamicMethod
  result = fn(*methodArgs)
File: "/usr/lib/python2.7/site-packages/vdsm/gluster/apiwrapper.py", line 129, in healInfo
  return self._gluster.volumeHealInfo(volumeName)
File: "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 90, in wrapper
  rv = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 776, in volumeHealInfo
  return {'healInfo': self.svdsmProxy.glusterVolumeHealInfo(volumeName)}
File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__
  return callMethod()
File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 53, in <lambda>
  **kwargs)
File: "<string>", line 2, in glusterVolumeHealInfo
File: "/usr/lib64/python2.7/multiprocessing/managers.py", line 759, in _callmethod
  kind, result = conn.recv() (executor:363)
2018-07-30 07:55:34,025-0400 INFO (periodic/0) [vdsm.api] START repoStats(domains=()) from=internal, task_id=7a0d264c-dd13-4b77-a1ef-bed4637e7017 (api:46)
2018-07-30 07:55:34,025-0400 INFO (periodic/0) [vdsm.api] FINISH repoStats return={} from=internal, task_id=7a0d264c-dd13-4b77-a1ef-bed4637e7017 (api:52)
2018-07-30 07:55:34,025-0400 INFO (periodic/0) [vdsm.api] START multipath_health() from=internal, task_id=0d984935-c93d-4ef5-98ee-d3f5a5109c8c (api:46)
2018-07-30 07:55:34,025-0400 INFO (periodic/0) [vdsm.api] FINISH multipath_health return={} from=internal, task_id=0d984935-c93d-4ef5-98ee-d3f5a5109c8c (api:52)

==> /var/log/vdsm/mom.log <==
2018-07-30 07:55:34,026 - mom.RPCServer - INFO - ping()
2018-07-30 07:55:34,027 - mom.RPCServer - INFO - getStatistics()
2018-07-30 07:55:41,745 - mom.VdsmRpcBase - ERROR - Command Host.getAllVmStats with args {} failed: (code=1100, message=Not enough resources: {'reason': 'Too many tasks', 'resource': 'jsonrpc', 'current_tasks': 80})

Expected results:
Heal operations are not run (or are skipped) on such volumes and the host remains online.

Additional info:
According to additional info provided by Dan, gluster commands may get stuck, and then VDSM waits for them indefinitely, exhausting its worker threads. I think we need to wait for the gluster command result with a timeout, using AsyncProc.
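A rough sketch of that approach, using plain subprocess for illustration only; the real fix would go through vdsm's own process helpers such as AsyncProc, whose API is not shown here. The 120-second timeout and the use of Python 3's communicate(timeout=...) are assumptions, not vdsm defaults.

import subprocess

class GlusterCmdTimeout(Exception):
    pass

def heal_info(volume, timeout=120):
    # Run the heal-info CLI, but give up after `timeout` seconds instead of
    # blocking a jsonrpc worker forever.
    cmd = ["gluster", "volume", "heal", volume, "info", "--xml"]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    try:
        out, err = proc.communicate(timeout=timeout)   # timeout kwarg is Python 3 only
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.communicate()
        raise GlusterCmdTimeout("heal info on %r did not finish within %s seconds"
                                % (volume, timeout))
    if proc.returncode != 0:
        raise RuntimeError(err.decode("utf-8", "replace"))
    return out

With something like this, a hung gluster command would surface as a timeout error for that one request instead of silently pinning a worker until the executor reports "Too many tasks".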
Changing component, as the heal info command is triggered from the engine.
Tested with a single-brick distribute gluster volume on RHV 4.2.6-4, following the steps from comment 0.