Bug 1614430 - vdsm-gluster stuck on operations forcing vdsm to stop working
Summary: vdsm-gluster stuck on operations forcing vdsm to stop working
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.3.3
Target Release: 4.3.0
Assignee: Kaustav Majumder
QA Contact: SATHEESARAN
Docs Contact: Rolfe Dlugy-Hegwer
URL:
Whiteboard:
Duplicates: 1712654 (view as bug list)
Depends On:
Blocks: 1711830
 
Reported: 2018-08-09 14:49 UTC by Sahina Bose
Modified: 2020-04-29 14:51 UTC
CC List: 20 users

Fixed In Version: vdsm-4.30.6
Doc Type: Bug Fix
Doc Text:
vdsm-gluster runs heal operations on all volumes. Previously, if a gluster command became stuck, VDSM waited on it indefinitely, exhausting its worker threads until it stopped communicating with the Manager and went offline. The current release adds a timeout to the gluster heal info command so that the command terminates within a set interval and worker threads are not exhausted. On timeout, the system raises a GlusterCommandTimeoutException, which terminates the command and notifies the Manager. As a result, VDSM threads no longer get stuck and VDSM does not go offline.
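The fix described in the Doc Text can be sketched as follows. This is a minimal illustration, not the actual VDSM code: the `run_with_timeout` and `heal_info` helpers and the 120-second default are assumptions made for the example.

```python
import subprocess


class GlusterCommandTimeoutException(Exception):
    """Raised when a gluster command does not finish within the timeout."""


def run_with_timeout(cmd, timeout):
    # Enforce a hard deadline on the child process instead of letting a
    # worker thread block on it indefinitely.
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout)
        return proc.stdout.decode()
    except subprocess.TimeoutExpired:
        raise GlusterCommandTimeoutException(
            "command %r did not finish within %s seconds" % (cmd, timeout))


def heal_info(volume, timeout=120):
    # Hypothetical wrapper around "gluster volume heal <vol> info".
    return run_with_timeout(["gluster", "volume", "heal", volume, "info"],
                            timeout)
```

On timeout the stuck child is terminated and the exception propagates to the caller, which can report the failure to the Manager rather than silently consuming a thread.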
Clone Of: 1609792
Environment:
Last Closed: 2019-05-08 12:36:02 UTC
oVirt Team: Gluster
Target Upstream Version:



Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1609792 0 high CLOSED Heal operations called on a single node volume forcing vdsm to stop working 2021-02-22 00:41:40 UTC
Red Hat Knowledge Base (Solution) 3888291 0 None None None 2019-07-08 20:36:21 UTC
Red Hat Product Errata RHBA-2019:1077 0 None None None 2019-05-08 12:36:24 UTC
oVirt gerrit 94269 0 None MERGED gluster: Configured gluster heal info command to run in async 2020-12-31 22:57:28 UTC
oVirt gerrit 94481 0 None MERGED gluster: Configured gluster heal info command to run in async 2020-12-31 22:57:28 UTC
oVirt gerrit 94784 0 None MERGED gluster: Fix errors related to heal info command 2020-12-31 22:57:28 UTC
oVirt gerrit 96487 0 None ABANDONED gluster: Fix errors related to heal info command 2020-12-31 22:57:28 UTC

Internal Links: 1609792

Description Sahina Bose 2018-08-09 14:49:50 UTC
+++ This bug was initially created as a clone of Bug #1609792 +++

Description of problem:

vdsm-gluster tries to run heal operations on all volumes. The operation fails on a single-node volume, which causes VDSM to time out, stop communicating with the engine, and go offline.

Version-Release number of selected component (if applicable):


How reproducible:

Always


Steps to Reproduce:
1. Create a single-brick (single-node) volume in Gluster that is part of a RHEV cluster.

For example:

Volume Name: rhev_export
Type: Distribute
Volume ID: ebd22164-41cf-45be-8fb4-c8f684ada43e
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: rdma
Bricks:
Brick1: 100.64.78.11:/gluster/brick/rhev_export
Options Reconfigured:
network.ping-timeout: 30
server.allow-insecure: on
storage.owner-gid: 36
storage.owner-uid: 36
network.remote-dio: enable
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
auth.allow: *
user.cifs: off
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

2. Wait for VDSM to go offline.


Actual results:

==> /var/log/vdsm/mom.log <==
2018-07-30 07:55:26,729 - mom.VdsmRpcBase - ERROR - Command Host.getAllVmStats with args {} failed:
(code=1100, message=Not enough resources: {'reason': 'Too many tasks', 'resource': 'jsonrpc', 'current_tasks': 80})

==> /var/log/vdsm/vdsm.log <==
2018-07-30 07:55:32,647-0400 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/5 running <Task <JsonRpcTask {'params': {u'volumeName': u'rhev_export'}, 'jsonrpc': '2.0', 'method': u'GlusterVolume.healInfo', 'id': u'9daf4941-c9fb-4387-a51e-cadcae22272f'} at 0x7f3148341390> timeout=60, duration=41880 at 0x7f3148341450> task#=29967 at 0x7f316c070950>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 523, in __call__
  self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 566, in _serveRequest
  response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
  res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, in _dynamicMethod
  result = fn(*methodArgs)
File: "/usr/lib/python2.7/site-packages/vdsm/gluster/apiwrapper.py", line 129, in healInfo
  return self._gluster.volumeHealInfo(volumeName)
File: "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 90, in wrapper
  rv = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 776, in volumeHealInfo
  return {'healInfo': self.svdsmProxy.glusterVolumeHealInfo(volumeName)}
File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__
  return callMethod()
File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 53, in <lambda>
  **kwargs)
File: "<string>", line 2, in glusterVolumeHealInfo
File: "/usr/lib64/python2.7/multiprocessing/managers.py", line 759, in _callmethod
  kind, result = conn.recv() (executor:363)
2018-07-30 07:55:34,025-0400 INFO  (periodic/0) [vdsm.api] START repoStats(domains=()) from=internal, task_id=7a0d264c-dd13-4b77-a1ef-bed4637e7017 (api:46)
2018-07-30 07:55:34,025-0400 INFO  (periodic/0) [vdsm.api] FINISH repoStats return={} from=internal, task_id=7a0d264c-dd13-4b77-a1ef-bed4637e7017 (api:52)
2018-07-30 07:55:34,025-0400 INFO  (periodic/0) [vdsm.api] START multipath_health() from=internal, task_id=0d984935-c93d-4ef5-98ee-d3f5a5109c8c (api:46)
2018-07-30 07:55:34,025-0400 INFO  (periodic/0) [vdsm.api] FINISH multipath_health return={} from=internal, task_id=0d984935-c93d-4ef5-98ee-d3f5a5109c8c (api:52)

==> /var/log/vdsm/mom.log <==
2018-07-30 07:55:34,026 - mom.RPCServer - INFO - ping()
2018-07-30 07:55:34,027 - mom.RPCServer - INFO - getStatistics()
2018-07-30 07:55:41,745 - mom.VdsmRpcBase - ERROR - Command Host.getAllVmStats with args {} failed:
(code=1100, message=Not enough resources: {'reason': 'Too many tasks', 'resource': 'jsonrpc', 'current_tasks': 80})


Expected results:

Heal operations do not run or are skipped, and the host remains online.


Additional info:

--- Additional comment from Denis Chaplygin on 2018-08-07 10:16:57 EDT ---

According to additional info provided by Dan, gluster commands may get stuck, and VDSM then waits for them indefinitely, exhausting its worker threads. I think we need to wait for the gluster command result with a timeout, using AsyncProc.
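The AsyncProc-style approach suggested above, starting the child asynchronously and polling it against a deadline instead of issuing a blocking read that can hang a worker, might look roughly like this. This is a sketch only; `wait_with_timeout` and the poll interval are illustrative and do not reflect VDSM's actual AsyncProc API.

```python
import subprocess
import time


def wait_with_timeout(cmd, timeout, poll_interval=0.5):
    # Start the child without blocking, then poll it until it exits or
    # the deadline passes; a stuck command is killed and reported rather
    # than tying up the calling thread forever.
    proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    deadline = time.monotonic() + timeout
    while proc.poll() is None:
        if time.monotonic() >= deadline:
            proc.kill()   # terminate the stuck child
            proc.wait()   # reap it so no zombie is left behind
            raise TimeoutError(
                "command %r still running after %s seconds" % (cmd, timeout))
        time.sleep(poll_interval)
    return proc.returncode
```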

Comment 1 Sahina Bose 2018-08-09 14:51:05 UTC
Cloned this bug to vdsm to investigate the issue Denis mentioned about stuck gluster commands exhausting VDSM threads.

Comment 2 Gobinda Das 2019-01-03 12:43:37 UTC
Moving to 4.3.0 as it's not a blocker

Comment 3 Sandro Bonazzola 2019-01-28 09:41:43 UTC
This bug has not been marked as blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Comment 4 Sahina Bose 2019-02-05 06:36:46 UTC
The patches are merged in master and available in 4.3.0

Comment 7 Sahina Bose 2019-03-04 05:25:57 UTC
Kaustav, please provide doc_text

Comment 10 SATHEESARAN 2019-04-11 09:07:29 UTC
Tested with RHV 4.3.3-3.

With a mix of distribute and replicate volumes managed by RHV Manager,
no issues were found.

Comment 12 errata-xmlrpc 2019-05-08 12:36:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1077

Comment 13 Ryan Barry 2019-05-22 08:31:25 UTC
*** Bug 1712654 has been marked as a duplicate of this bug. ***

Comment 14 Daniel Gur 2019-08-28 13:14:42 UTC
sync2jira

Comment 15 Daniel Gur 2019-08-28 13:19:44 UTC
sync2jira
