Bug 1609792
| Summary: | Heal operations called on a single node volume forcing vdsm to stop working | | |
|---|---|---|---|
| Product: | [oVirt] ovirt-engine | Reporter: | Dan Lavu <dlavu> |
| Component: | BLL.Gluster | Assignee: | Sahina Bose <sabose> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | SATHEESARAN <sasundar> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | --- | CC: | bugs, dchaplyg, sabose |
| Target Milestone: | ovirt-4.2.6 | Flags: | rule-engine: ovirt-4.2+ |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | ovirt-engine-4.2.6.4 | Doc Type: | Bug Fix |
| Doc Text: | Cause: Heal info commands are called for non-replicate volume types. Consequence: vdsm commands that invoke the gluster CLI hang, and eventually vdsm runs out of worker threads. Fix: Heal commands are invoked only for replicate volume types (see the sketch after this table). | Story Points: | --- |
| Clone Of: | | | |
| : | 1614430 1619639 (view as bug list) | Environment: | |
| Last Closed: | 2018-09-13 07:41:27 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Gluster | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1619639 | | |
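A minimal sketch of the fix described in the Doc Text above, assuming hypothetical volume records and helper names; it is not the actual ovirt-engine/vdsm code, which is not reproduced here:

```python
# Sketch only, not ovirt-engine/vdsm code. Assumes Python 3 and the gluster
# CLI on PATH; the volume dicts, type strings, and refresh_heal_info() helper
# are hypothetical.
import subprocess

# Heal info is only meaningful for replica-based volume types.
REPLICATE_TYPES = {"REPLICATE", "DISTRIBUTED_REPLICATE"}

def heal_info(volume_name, timeout=120):
    """Run 'gluster volume heal <vol> info' with a timeout and return its output."""
    return subprocess.check_output(
        ["gluster", "volume", "heal", volume_name, "info"],
        timeout=timeout,
    ).decode()

def refresh_heal_info(volumes):
    """Invoke heal info only for replicate volume types; a single-brick
    Distribute volume such as rhev_export is simply skipped."""
    results = {}
    for volume in volumes:
        if volume["type"] not in REPLICATE_TYPES:
            continue  # Distribute/Disperse-only volumes have nothing to heal
        results[volume["name"]] = heal_info(volume["name"])
    return results

# Example: refresh_heal_info([{"name": "rhev_export", "type": "DISTRIBUTE"},
#                             {"name": "vmstore", "type": "REPLICATE"}])
# would only run a heal info call for 'vmstore'.
```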
According to the additional info provided by Dan, gluster commands may get stuck, and VDSM then waits for them indefinitely, eventually exhausting its worker threads. I think we need to wait for the gluster command result with a timeout, using AsyncProc.

Changing component, as the heal info command is triggered from the engine.
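A stdlib-only sketch of the timeout idea suggested above; vdsm's AsyncProc is not shown here, and the wrapper name and 120-second default are assumptions:

```python
# Sketch only (Python 3): bound the wait on a gluster CLI call so a stuck
# command cannot occupy a vdsm worker thread forever.
import subprocess

def run_gluster(args, timeout=120):
    """Run a gluster CLI command, giving up after `timeout` seconds."""
    proc = subprocess.Popen(
        ["gluster"] + list(args),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    try:
        out, err = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()              # reap the stuck CLI process
        proc.communicate()
        raise RuntimeError("gluster command timed out: %r" % (args,))
    if proc.returncode != 0:
        raise RuntimeError("gluster command failed: %s" % err.decode())
    return out.decode()

# e.g. run_gluster(["volume", "heal", "rhev_export", "info"])
```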
Description of problem:
vdsm-gluster tries to run heal operations on all volumes. It fails to run on a single node volume, which causes vdsm to time out, stop communicating with the engine, and go offline.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Create a single brick/node volume in Gluster that is part of a RHEV cluster, like the following:

        Volume Name: rhev_export
        Type: Distribute
        Volume ID: ebd22164-41cf-45be-8fb4-c8f684ada43e
        Status: Started
        Snapshot Count: 0
        Number of Bricks: 1
        Transport-type: rdma
        Bricks:
        Brick1: 100.64.78.11:/gluster/brick/rhev_export
        Options Reconfigured:
        network.ping-timeout: 30
        server.allow-insecure: on
        storage.owner-gid: 36
        storage.owner-uid: 36
        network.remote-dio: enable
        performance.low-prio-threads: 32
        performance.io-cache: off
        performance.read-ahead: off
        performance.quick-read: off
        auth.allow: *
        user.cifs: off
        nfs.disable: on
        nfs-ganesha: enable
        cluster.enable-shared-storage: enable

2. Wait for VDSM to go offline.

Actual results:

    ==> /var/log/vdsm/mom.log <==
    2018-07-30 07:55:26,729 - mom.VdsmRpcBase - ERROR - Command Host.getAllVmStats with args {} failed: (code=1100, message=Not enough resources: {'reason': 'Too many tasks', 'resource': 'jsonrpc', 'current_tasks': 80})

    ==> /var/log/vdsm/vdsm.log <==
    2018-07-30 07:55:32,647-0400 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/5 running <Task <JsonRpcTask {'params': {u'volumeName': u'rhev_export'}, 'jsonrpc': '2.0', 'method': u'GlusterVolume.healInfo', 'id': u'9daf4941-c9fb-4387-a51e-cadcae22272f'} at 0x7f3148341390> timeout=60, duration=41880 at 0x7f3148341450> task#=29967 at 0x7f316c070950>, traceback:
    File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap self.__bootstrap_inner()
    File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner self.run()
    File: "/usr/lib64/python2.7/threading.py", line 765, in run self.__target(*self.__args, **self.__kwargs)
    File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run ret = func(*args, **kwargs)
    File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run self._execute_task()
    File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task task()
    File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__ self._callable()
    File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 523, in __call__ self._handler(self._ctx, self._req)
    File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 566, in _serveRequest response = self._handle_request(req, ctx)
    File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request res = method(**params)
    File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, in _dynamicMethod result = fn(*methodArgs)
    File: "/usr/lib/python2.7/site-packages/vdsm/gluster/apiwrapper.py", line 129, in healInfo return self._gluster.volumeHealInfo(volumeName)
    File: "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 90, in wrapper rv = func(*args, **kwargs)
    File: "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 776, in volumeHealInfo return {'healInfo': self.svdsmProxy.glusterVolumeHealInfo(volumeName)}
    File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__ return callMethod()
    File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 53, in <lambda> **kwargs)
    File: "<string>", line 2, in glusterVolumeHealInfo
    File: "/usr/lib64/python2.7/multiprocessing/managers.py", line 759, in _callmethod kind, result = conn.recv() (executor:363)
    2018-07-30 07:55:34,025-0400 INFO (periodic/0) [vdsm.api] START repoStats(domains=()) from=internal, task_id=7a0d264c-dd13-4b77-a1ef-bed4637e7017 (api:46)
    2018-07-30 07:55:34,025-0400 INFO (periodic/0) [vdsm.api] FINISH repoStats return={} from=internal, task_id=7a0d264c-dd13-4b77-a1ef-bed4637e7017 (api:52)
    2018-07-30 07:55:34,025-0400 INFO (periodic/0) [vdsm.api] START multipath_health() from=internal, task_id=0d984935-c93d-4ef5-98ee-d3f5a5109c8c (api:46)
    2018-07-30 07:55:34,025-0400 INFO (periodic/0) [vdsm.api] FINISH multipath_health return={} from=internal, task_id=0d984935-c93d-4ef5-98ee-d3f5a5109c8c (api:52)

    ==> /var/log/vdsm/mom.log <==
    2018-07-30 07:55:34,026 - mom.RPCServer - INFO - ping()
    2018-07-30 07:55:34,027 - mom.RPCServer - INFO - getStatistics()
    2018-07-30 07:55:41,745 - mom.VdsmRpcBase - ERROR - Command Host.getAllVmStats with args {} failed: (code=1100, message=Not enough resources: {'reason': 'Too many tasks', 'resource': 'jsonrpc', 'current_tasks': 80})

Expected results:
Heal operations do not run or are skipped, and the host remains online.

Additional info: