Bug 1733520

Summary: potential deadlock while processing callbacks in gfapi
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Soumya Koduri <skoduri>
Component: libgfapi
Assignee: Soumya Koduri <skoduri>
Status: CLOSED ERRATA
QA Contact: Manisha Saini <msaini>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.5
CC: bugs, jthottan, pasik, rhs-bugs, sheggodu, skoduri, storage-qa-internal, vdas
Target Milestone: ---
Target Release: RHGS 3.5.0
Hardware: All
OS: All
Whiteboard:
Fixed In Version: glusterfs-6.0-11
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 1733166
Environment:
Last Closed: 2019-10-30 12:22:31 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1733166, 1736341, 1736342, 1736345
Bug Blocks: 1696809, 1730654

Description Soumya Koduri 2019-07-26 11:29:43 UTC
+++ This bug was initially created as a clone of Bug #1733166 +++

Description of problem:

While running parallel I/O involving many files on an nfs-ganesha mount, the following deadlock was hit in the nfs-ganesha process.


epoll thread: ....glfs_cbk_upcall_data->upcall_syncop_args_init->glfs_h_poll_cache_invalidation->glfs_h_find_handle->priv_glfs_active_subvol->glfs_lock (waiting on lock)

I/O thread:

...glfs_h_stat->glfs_resolve_inode->__glfs_resolve_inode (at this point we acquired glfs_lock) -> ...->glfs_refresh_inode_safe->syncop_lookup

To summarize:
I/O threads that have acquired glfs_lock are waiting on epoll threads to receive a response, whereas the epoll threads are waiting on the I/O threads to release that lock.
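
Condensed into plain pthreads, the cycle looks roughly like the sketch below. This is only an illustration of the lock ordering described above, not gfapi code; fs_lock stands in for glfs_lock, response_ready stands in for the syncop reply that only the epoll thread can deliver, and all names are hypothetical.

    /* Minimal deadlock sketch (not gfapi code). */
    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t fs_lock   = PTHREAD_MUTEX_INITIALIZER; /* role of glfs_lock */
    static pthread_mutex_t resp_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  resp_cond = PTHREAD_COND_INITIALIZER;
    static bool response_ready = false;

    /* Mirrors the I/O thread: takes fs_lock, then blocks waiting for a
     * "network response" that only the epoll thread can deliver. */
    static void *io_thread(void *arg)
    {
        pthread_mutex_lock(&fs_lock);            /* __glfs_resolve_inode holds glfs_lock ... */
        pthread_mutex_lock(&resp_lock);
        while (!response_ready)                  /* ... and waits in syncop_lookup() */
            pthread_cond_wait(&resp_cond, &resp_lock);
        pthread_mutex_unlock(&resp_lock);
        pthread_mutex_unlock(&fs_lock);
        return NULL;
    }

    /* Mirrors the epoll thread: before it can deliver the response it first
     * processes an upcall, which also needs fs_lock -> deadlock. */
    static void *epoll_thread(void *arg)
    {
        pthread_mutex_lock(&fs_lock);            /* glfs_h_find_handle -> priv_glfs_active_subvol -> glfs_lock */
        pthread_mutex_unlock(&fs_lock);

        pthread_mutex_lock(&resp_lock);          /* never reached: the response is never delivered */
        response_ready = true;
        pthread_cond_signal(&resp_cond);
        pthread_mutex_unlock(&resp_lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t io, ep;
        pthread_create(&io, NULL, io_thread, NULL);
        sleep(1);                                /* let the I/O thread grab fs_lock first */
        pthread_create(&ep, NULL, epoll_thread, NULL);
        pthread_join(io, NULL);                  /* hangs: both threads are stuck */
        pthread_join(ep, NULL);
        printf("never printed\n");
        return 0;
    }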

A similar issue was identified earlier (bug 1693575).

There could be other issues at different layers depending on how client xlators choose to process these callbacks.

The correct way to avoid or fix these issues is to redesign the upcall model to use separate sockets for callback communication instead of reusing the same epoll threads. A GitHub issue has been raised for that: https://github.com/gluster/glusterfs/issues/697

Since that redesign may take a while, this BZ is being raised to provide a workaround fix in the gfapi layer for now.
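
One way a gfapi-level workaround can break this cycle is to stop doing the glfs_lock work in the epoll thread itself and hand the upcall data off to a separate worker. The sketch below shows that hand-off pattern with hypothetical names (upcall_cbk, upcall_worker, process_upcall); it illustrates the general approach, not the code in the posted patches.

    /* Hand-off sketch (hypothetical names, not the actual patch). */
    #include <pthread.h>
    #include <stdlib.h>

    struct upcall_item {
        void *upcall_data;               /* whatever the callback hands us */
        struct upcall_item *next;
    };

    static struct upcall_item *queue_head;
    static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  queue_cond = PTHREAD_COND_INITIALIZER;

    /* Called from the epoll thread: must never block on glfs_lock,
     * so it only queues the upcall and returns. */
    void upcall_cbk(void *upcall_data)
    {
        struct upcall_item *item = calloc(1, sizeof(*item));
        if (!item)
            return;
        item->upcall_data = upcall_data;

        pthread_mutex_lock(&queue_lock);
        item->next = queue_head;
        queue_head = item;
        pthread_cond_signal(&queue_cond);
        pthread_mutex_unlock(&queue_lock);
    }

    /* Dedicated worker thread: safe to take glfs_lock here because this
     * thread never has to service network responses itself. */
    void *upcall_worker(void *arg)
    {
        for (;;) {
            pthread_mutex_lock(&queue_lock);
            while (!queue_head)
                pthread_cond_wait(&queue_cond, &queue_lock);
            struct upcall_item *item = queue_head;
            queue_head = item->next;
            pthread_mutex_unlock(&queue_lock);

            /* process_upcall(item->upcall_data);  hypothetical: would do the
             * glfs_h_find_handle()/glfs_lock work that previously ran in the
             * epoll thread */
            free(item);
        }
        return NULL;
    }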

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

--- Additional comment from Worker Ant on 2019-07-25 10:09:58 UTC ---

REVIEW: https://review.gluster.org/23107 (gfapi: Fix deadlock while processing upcall) posted (#1) for review on release-6 by soumya k

--- Additional comment from Worker Ant on 2019-07-25 10:16:57 UTC ---

REVIEW: https://review.gluster.org/23108 (gfapi: Fix deadlock while processing upcall) posted (#1) for review on master by soumya k

Comment 5 Manisha Saini 2019-10-17 18:40:58 UTC
Verified this BZ with

# rpm -qa | grep ganesha
nfs-ganesha-2.7.3-9.el7rhgs.x86_64
glusterfs-ganesha-6.0-17.el7rhgs.x86_64
nfs-ganesha-gluster-2.7.3-9.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.7.3-9.el7rhgs.x86_64

Moving this BZ to verified state.

Comment 7 errata-xmlrpc 2019-10-30 12:22:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:3249