Bug 1614430 - vdsm-gluster stuck on operations forcing vdsm to stop working
Summary: vdsm-gluster stuck on operations forcing vdsm to stop working
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.3.3
Target Release: 4.3.0
Assignee: Kaustav Majumder
QA Contact: SATHEESARAN
Docs Contact: Rolfe Dlugy-Hegwer
URL:
Whiteboard:
Duplicates: 1712654 (view as bug list)
Depends On:
Blocks: 1711830
 
Reported: 2018-08-09 14:49 UTC by Sahina Bose
Modified: 2020-04-29 14:51 UTC
CC List: 20 users

Fixed In Version: vdsm-4.30.6
Doc Type: Bug Fix
Doc Text:
vdsm-gluster runs heal operations on all volumes. Previously, if a gluster command became stuck, VDSM waited on it indefinitely, exhausting its worker threads until it stopped communicating with the Manager and went offline. The current release adds a timeout to the gluster heal info command so that the command terminates within a set interval and worker threads are not exhausted. On timeout, the system raises a GlusterCommandTimeoutException, which terminates the command and notifies the Manager. As a result, VDSM threads no longer get stuck and VDSM does not go offline.
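The fix described in the Doc Text can be sketched as follows. This is a minimal illustration, not the actual VDSM code: the `run_with_timeout` and `heal_info` helpers and the 120-second default are assumptions made for the example.

```python
import subprocess


class GlusterCommandTimeoutException(Exception):
    """Raised when a gluster command does not finish within the timeout."""


def run_with_timeout(cmd, timeout):
    # Enforce a hard deadline on the child process instead of letting a
    # worker thread block on it indefinitely.
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout)
        return proc.stdout.decode()
    except subprocess.TimeoutExpired:
        raise GlusterCommandTimeoutException(
            "command %r did not finish within %s seconds" % (cmd, timeout))


def heal_info(volume, timeout=120):
    # Hypothetical wrapper around "gluster volume heal <vol> info".
    return run_with_timeout(["gluster", "volume", "heal", volume, "info"],
                            timeout)
```

On timeout the stuck child is terminated and the exception propagates to the caller, which can report the failure to the Manager rather than silently consuming a thread.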
Clone Of: 1609792
Environment:
Last Closed: 2019-05-08 12:36:02 UTC
oVirt Team: Gluster
Target Upstream Version:



Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1609792 0 high CLOSED Heal operations called on a single node volume forcing vdsm to stop working 2021-02-22 00:41:40 UTC
Red Hat Knowledge Base (Solution) 3888291 0 None None None 2019-07-08 20:36:21 UTC
Red Hat Product Errata RHBA-2019:1077 0 None None None 2019-05-08 12:36:24 UTC
oVirt gerrit 94269 0 None MERGED gluster: Configured gluster heal info command to run in async 2020-12-31 22:57:28 UTC
oVirt gerrit 94481 0 None MERGED gluster: Configured gluster heal info command to run in async 2020-12-31 22:57:28 UTC
oVirt gerrit 94784 0 None MERGED gluster: Fix errors related to heal info command 2020-12-31 22:57:28 UTC
oVirt gerrit 96487 0 None ABANDONED gluster: Fix errors related to heal info command 2020-12-31 22:57:28 UTC

Internal Links: 1609792

Description Sahina Bose 2018-08-09 14:49:50 UTC
+++ This bug was initially created as a clone of Bug #1609792 +++

Description of problem:

vdsm-gluster tries to run heal operations on all volumes. The operation fails on a single-node volume, which causes VDSM to time out, stop communicating with the engine, and go offline.

Version-Release number of selected component (if applicable):


How reproducible:

Always


Steps to Reproduce:
1. Create a single-brick (single-node) volume in Gluster that is part of a RHEV cluster.

For example:

Volume Name: rhev_export
Type: Distribute
Volume ID: ebd22164-41cf-45be-8fb4-c8f684ada43e
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: rdma
Bricks:
Brick1: 100.64.78.11:/gluster/brick/rhev_export
Options Reconfigured:
network.ping-timeout: 30
server.allow-insecure: on
storage.owner-gid: 36
storage.owner-uid: 36
network.remote-dio: enable
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
auth.allow: *
user.cifs: off
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

2. Wait for VDSM to go offline.


Actual results:

==> /var/log/vdsm/mom.log <==
2018-07-30 07:55:26,729 - mom.VdsmRpcBase - ERROR - Command Host.getAllVmStats with args {} failed:
(code=1100, message=Not enough resources: {'reason': 'Too many tasks', 'resource': 'jsonrpc', 'current_tasks': 80})

==> /var/log/vdsm/vdsm.log <==
2018-07-30 07:55:32,647-0400 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/5 running <Task <JsonRpcTask {'params': {u'volumeName': u'rhev_export'}, 'jsonrpc': '2.0', 'method': u'GlusterVolume.healInfo', 'id': u'9daf4941-c9fb-4387-a51e-cadcae22272f'} at 0x7f3148341390> timeout=60, duration=41880 at 0x7f3148341450> task#=29967 at 0x7f316c070950>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 523, in __call__
  self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 566, in _serveRequest
  response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
  res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, in _dynamicMethod
  result = fn(*methodArgs)
File: "/usr/lib/python2.7/site-packages/vdsm/gluster/apiwrapper.py", line 129, in healInfo
  return self._gluster.volumeHealInfo(volumeName)
File: "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 90, in wrapper
  rv = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 776, in volumeHealInfo
  return {'healInfo': self.svdsmProxy.glusterVolumeHealInfo(volumeName)}
File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__
  return callMethod()
File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 53, in <lambda>
  **kwargs)
File: "<string>", line 2, in glusterVolumeHealInfo
File: "/usr/lib64/python2.7/multiprocessing/managers.py", line 759, in _callmethod
  kind, result = conn.recv() (executor:363)
2018-07-30 07:55:34,025-0400 INFO  (periodic/0) [vdsm.api] START repoStats(domains=()) from=internal, task_id=7a0d264c-dd13-4b77-a1ef-bed4637e7017 (api:46)
2018-07-30 07:55:34,025-0400 INFO  (periodic/0) [vdsm.api] FINISH repoStats return={} from=internal, task_id=7a0d264c-dd13-4b77-a1ef-bed4637e7017 (api:52)
2018-07-30 07:55:34,025-0400 INFO  (periodic/0) [vdsm.api] START multipath_health() from=internal, task_id=0d984935-c93d-4ef5-98ee-d3f5a5109c8c (api:46)
2018-07-30 07:55:34,025-0400 INFO  (periodic/0) [vdsm.api] FINISH multipath_health return={} from=internal, task_id=0d984935-c93d-4ef5-98ee-d3f5a5109c8c (api:52)

==> /var/log/vdsm/mom.log <==
2018-07-30 07:55:34,026 - mom.RPCServer - INFO - ping()
2018-07-30 07:55:34,027 - mom.RPCServer - INFO - getStatistics()
2018-07-30 07:55:41,745 - mom.VdsmRpcBase - ERROR - Command Host.getAllVmStats with args {} failed:
(code=1100, message=Not enough resources: {'reason': 'Too many tasks', 'resource': 'jsonrpc', 'current_tasks': 80})


Expected results:

Heal operations do not run or are skipped, and the host remains online.


Additional info:

--- Additional comment from Denis Chaplygin on 2018-08-07 10:16:57 EDT ---

According to additional info provided by Dan, gluster commands may get stuck, and VDSM then waits for them indefinitely, exhausting its worker threads. I think we need to wait for the gluster command result with a timeout, using AsyncProc.
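The AsyncProc-style approach suggested above, starting the child asynchronously and polling it against a deadline instead of issuing a blocking read that can hang a worker, might look roughly like this. This is a sketch only; `wait_with_timeout` and the poll interval are illustrative and do not reflect VDSM's actual AsyncProc API.

```python
import subprocess
import time


def wait_with_timeout(cmd, timeout, poll_interval=0.5):
    # Start the child without blocking, then poll it until it exits or
    # the deadline passes; a stuck command is killed and reported rather
    # than tying up the calling thread forever.
    proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    deadline = time.monotonic() + timeout
    while proc.poll() is None:
        if time.monotonic() >= deadline:
            proc.kill()   # terminate the stuck child
            proc.wait()   # reap it so no zombie is left behind
            raise TimeoutError(
                "command %r still running after %s seconds" % (cmd, timeout))
        time.sleep(poll_interval)
    return proc.returncode
```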

Comment 1 Sahina Bose 2018-08-09 14:51:05 UTC
Cloned this bug to vdsm to investigate the issue Denis mentioned about stuck gluster commands exhausting VDSM threads.

Comment 2 Gobinda Das 2019-01-03 12:43:37 UTC
Moving to 4.3.0 as it's not a blocker

Comment 3 Sandro Bonazzola 2019-01-28 09:41:43 UTC
This bug has not been marked as blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Comment 4 Sahina Bose 2019-02-05 06:36:46 UTC
The patches are merged in master and available in 4.3.0

Comment 7 Sahina Bose 2019-03-04 05:25:57 UTC
Kaustav, please provide doc_text

Comment 10 SATHEESARAN 2019-04-11 09:07:29 UTC
Tested with RHV 4.3.3-3.

With a mix of distribute and replicate volumes managed by RHV Manager,
no issues were found.

Comment 12 errata-xmlrpc 2019-05-08 12:36:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1077

Comment 13 Ryan Barry 2019-05-22 08:31:25 UTC
*** Bug 1712654 has been marked as a duplicate of this bug. ***

Comment 14 Daniel Gur 2019-08-28 13:14:42 UTC
sync2jira

Comment 15 Daniel Gur 2019-08-28 13:19:44 UTC
sync2jira
