1663131 – [Ganesha] Ganesha failed on one node while exporting volumes in loop

Bug 1663131 - [Ganesha] Ganesha failed on one node while exporting volumes in loop

Summary: [Ganesha] Ganesha failed on one node while exporting volumes in loop

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	libgfapi
Sub Component:
Version:	5
Hardware:	All
OS:	All
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	bugs@gluster.org
QA Contact:	bugs@gluster.org
Docs Contact:
URL:
Whiteboard:
Depends On:	1660577 1663132
Blocks:	1658132
TreeView+	depends on / blocked

Reported:	2019-01-03 09:29 UTC by Soumya Koduri
Modified:	2019-01-22 14:08 UTC (History)
CC List:	1 user (show)
Fixed In Version:	glusterfs-5.3
Clone Of:	1660577
Environment:
Last Closed:	2019-01-22 14:08:56 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Gluster.org Gerrit	21985	0	None	Merged	gfapi: Access fs->oldvolfile under mutex lock	2019-01-09 15:17:03 UTC

Description Soumya Koduri 2019-01-03 09:29:00 UTC

+++ This bug was initially created as a clone of Bug #1660577 +++

+++ This bug was initially created as a clone of Bug #1658132 +++

Description of problem:
-----------------------
ganesha entered failed state in one node of the four node cluster while exporting volumes in loop. Tried to export 109 volumes one after the other in loop.

===============================================================================
Version-Release number of selected component (if applicable):
-------------------------------------------------------------
nfs-ganesha-2.5.5-10.el7rhgs.x86_64
nfs-ganesha-gluster-2.5.5-10.el7rhgs.x86_64
glusterfs-ganesha-3.12.2-28.el7rhgs.x86_64

===============================================================================
How reproducible:
-----------------
1/1

===============================================================================
Steps to Reproduce:
-------------------
1. Create 4 node ganesha cluster.
2. Create and start 100 or more volumes.
3. Verify status of all volumes.
4. Export volumes one after the other in a loop.

===============================================================================
Actual results:
---------------
Ganesha entered failed state in one of the nodes.

===============================================================================
Expected results:
-----------------
No failure should be observed.

==============================================================================
Additional info:
----------------
* All volumes were exported on other 3 nodes in the 4 node cluster.
* The failure observed is on a different node than the one from where export operation was executed.

Setup is kept  in same state and can be shared if required.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2018-12-11 10:35:37 UTC ---

This bug is automatically being proposed for a Z-stream release of Red Hat Gluster Storage 3 under active development and open for bug fixes, by setting the release flag 'rhgs‑3.4.z' to '?'. 

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Jilju Joy on 2018-12-11 10:37:00 UTC ---

Logs and sos report will be shared shortly.

--- Additional comment from Jilju Joy on 2018-12-11 11:59:20 UTC ---

Logs and sosreport : http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/jj/1658132/

--- Additional comment from Soumya Koduri on 2018-12-11 16:26:13 UTC ---

(gdb) bt
#0  __memcmp_sse4_1 () at ../sysdeps/x86_64/multiarch/memcmp-sse4.S:74
#1  0x00007f5d18130664 in glfs_mgmt_getspec_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f5b74002cb0) at glfs-mgmt.c:625
#2  0x00007f5c8e9e8960 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f5a08cc5760, pollin=pollin@entry=0x7f5b7f09acb0) at rpc-clnt.c:778
#3  0x00007f5c8e9e8d03 in rpc_clnt_notify (trans=<optimized out>, mydata=0x7f5a08cc5790, event=<optimized out>, data=0x7f5b7f09acb0) at rpc-clnt.c:971
#4  0x00007f5c8e9e4a73 in rpc_transport_notify (this=this@entry=0x7f5a08cc5930, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f5b7f09acb0) at rpc-transport.c:538
#5  0x00007f5c849e5576 in socket_event_poll_in (this=this@entry=0x7f5a08cc5930, notify_handled=<optimized out>) at socket.c:2322
#6  0x00007f5c849e7b1c in socket_event_handler (fd=565, idx=0, gen=1, data=0x7f5a08cc5930, poll_in=1, poll_out=0, poll_err=0) at socket.c:2474
#7  0x00007f5c8ec7e824 in event_dispatch_epoll_handler (event=0x7f59e1f44500, event_pool=0x7f5a08cb74f0) at event-epoll.c:583
#8  event_dispatch_epoll_worker (data=0x7f5b760922a0) at event-epoll.c:659
#9  0x00007f5d20e44dd5 in start_thread (arg=0x7f59e1f45700) at pthread_create.c:307
#10 0x00007f5d2050fead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) f 1
#1  0x00007f5d18130664 in glfs_mgmt_getspec_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f5b74002cb0) at glfs-mgmt.c:625
625		    (memcmp (fs->oldvolfile, rsp.spec, size) == 0)) {
(gdb) l
620	
621		ret = 0;
622		size = rsp.op_ret;
623	
624		if ((size == fs->oldvollen) &&
625		    (memcmp (fs->oldvolfile, rsp.spec, size) == 0)) {
626			gf_msg (frame->this->name, GF_LOG_INFO, 0,
627	                        API_MSG_VOLFILE_INFO,
628				"No change in volfile, continuing");
629			goto out;
(gdb) p fs->olvollen
There is no member named olvollen.
(gdb) p fs->oldvollen
$1 = 1674
(gdb) p size
$2 = 1674
(gdb) p fs->oldvolfile
$3 = 0x7f5b76097cd0 "volume testvol82201-client-0\n    type protocol/client\n    option send-gids true\n    option transport.socket.keepalive-count 9\n    option transport.socket.keepalive-interval 2\n    option transport.sock"...
(gdb) p rsp.spec
$4 = 0x7f5b7f9da9d0 "volume testvol82201-client-0\n    type protocol/client\n    option send-gids true\n    option transport.socket.keepalive-count 9\n    option transport.socket.keepalive-interval 2\n    option transport.sock"...
(gdb) 


The crash happened while doing memcmp of fs->oldvolfile and the new volfile received in the response (rsp.spec). The contents of both the variables seem fine in the core. 

From code reading observed that we update fs->oldvollen and fs->oldvolfile under fs->mutex lock, but that lock is not taken while reading those values here in glfs_mgmt_spec_cbk. That could have resulted in the crash while accessing un/partially intialized variable.

@Jilju,

Are you able to consistently reproduce this issue?

--- Additional comment from Daniel Gryniewicz on 2018-12-11 16:33:41 UTC ---

Are the buffers smaller than 1674?  It might be going off the end of one of the buffers.

--- Additional comment from Jilju Joy on 2018-12-12 04:50:00 UTC ---

(In reply to Soumya Koduri from comment #4)
> (gdb) bt
> #0  __memcmp_sse4_1 () at ../sysdeps/x86_64/multiarch/memcmp-sse4.S:74
> #1  0x00007f5d18130664 in glfs_mgmt_getspec_cbk (req=<optimized out>,
> iov=<optimized out>, count=<optimized out>, myframe=0x7f5b74002cb0) at
> glfs-mgmt.c:625
> #2  0x00007f5c8e9e8960 in rpc_clnt_handle_reply
> (clnt=clnt@entry=0x7f5a08cc5760, pollin=pollin@entry=0x7f5b7f09acb0) at
> rpc-clnt.c:778
> #3  0x00007f5c8e9e8d03 in rpc_clnt_notify (trans=<optimized out>,
> mydata=0x7f5a08cc5790, event=<optimized out>, data=0x7f5b7f09acb0) at
> rpc-clnt.c:971
> #4  0x00007f5c8e9e4a73 in rpc_transport_notify
> (this=this@entry=0x7f5a08cc5930,
> event=event@entry=RPC_TRANSPORT_MSG_RECEIVED,
> data=data@entry=0x7f5b7f09acb0) at rpc-transport.c:538
> #5  0x00007f5c849e5576 in socket_event_poll_in
> (this=this@entry=0x7f5a08cc5930, notify_handled=<optimized out>) at
> socket.c:2322
> #6  0x00007f5c849e7b1c in socket_event_handler (fd=565, idx=0, gen=1,
> data=0x7f5a08cc5930, poll_in=1, poll_out=0, poll_err=0) at socket.c:2474
> #7  0x00007f5c8ec7e824 in event_dispatch_epoll_handler
> (event=0x7f59e1f44500, event_pool=0x7f5a08cb74f0) at event-epoll.c:583
> #8  event_dispatch_epoll_worker (data=0x7f5b760922a0) at event-epoll.c:659
> #9  0x00007f5d20e44dd5 in start_thread (arg=0x7f59e1f45700) at
> pthread_create.c:307
> #10 0x00007f5d2050fead in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> (gdb) f 1
> #1  0x00007f5d18130664 in glfs_mgmt_getspec_cbk (req=<optimized out>,
> iov=<optimized out>, count=<optimized out>, myframe=0x7f5b74002cb0) at
> glfs-mgmt.c:625
> 625		    (memcmp (fs->oldvolfile, rsp.spec, size) == 0)) {
> (gdb) l
> 620	
> 621		ret = 0;
> 622		size = rsp.op_ret;
> 623	
> 624		if ((size == fs->oldvollen) &&
> 625		    (memcmp (fs->oldvolfile, rsp.spec, size) == 0)) {
> 626			gf_msg (frame->this->name, GF_LOG_INFO, 0,
> 627	                        API_MSG_VOLFILE_INFO,
> 628				"No change in volfile, continuing");
> 629			goto out;
> (gdb) p fs->olvollen
> There is no member named olvollen.
> (gdb) p fs->oldvollen
> $1 = 1674
> (gdb) p size
> $2 = 1674
> (gdb) p fs->oldvolfile
> $3 = 0x7f5b76097cd0 "volume testvol82201-client-0\n    type
> protocol/client\n    option send-gids true\n    option
> transport.socket.keepalive-count 9\n    option
> transport.socket.keepalive-interval 2\n    option transport.sock"...
> (gdb) p rsp.spec
> $4 = 0x7f5b7f9da9d0 "volume testvol82201-client-0\n    type
> protocol/client\n    option send-gids true\n    option
> transport.socket.keepalive-count 9\n    option
> transport.socket.keepalive-interval 2\n    option transport.sock"...
> (gdb) 
> 
> 
> The crash happened while doing memcmp of fs->oldvolfile and the new volfile
> received in the response (rsp.spec). The contents of both the variables seem
> fine in the core. 
> 
> From code reading observed that we update fs->oldvollen and fs->oldvolfile
> under fs->mutex lock, but that lock is not taken while reading those values
> here in glfs_mgmt_spec_cbk. That could have resulted in the crash while
> accessing un/partially intialized variable.
> 
> @Jilju,
> 
> Are you able to consistently reproduce this issue?

Hi Soumya,

The first occurrence is reported here. Kept the setup in same state for the favour of debugging. I can share the setup if required or I can try to reproduce.

--- Additional comment from Worker Ant on 2018-12-18 17:05:42 UTC ---

REVIEW: https://review.gluster.org/21882 (gfapi: Access fs->oldvolfile under mutex lock) posted (#1) for review on master by soumya k

--- Additional comment from Worker Ant on 2018-12-26 02:17:03 UTC ---

REVIEW: https://review.gluster.org/21882 (gfapi: Access fs->oldvolfile under mutex lock) posted (#2) for review on master by Amar Tumballi

--- Additional comment from Worker Ant on 2018-12-26 10:33:07 UTC ---

REVIEW: https://review.gluster.org/21927 (gfapi: nit cleanup related to releasing fs->mutex lock) posted (#1) for review on master by soumya k

--- Additional comment from Worker Ant on 2018-12-31 16:10:41 UTC ---

REVIEW: https://review.gluster.org/21927 (gfapi: nit cleanup related to releasing fs->mutex lock) posted (#2) for review on master by Kaleb KEITHLEY

Comment 1 Worker Ant 2019-01-03 09:31:14 UTC

REVIEW: https://review.gluster.org/21985 (gfapi: Access fs->oldvolfile under mutex lock) posted (#3) for review on release-5 by soumya k

Comment 2 Worker Ant 2019-01-09 15:17:02 UTC

REVIEW: https://review.gluster.org/21985 (gfapi: Access fs->oldvolfile under mutex lock) posted (#4) for review on release-5 by Shyamsundar Ranganathan

Comment 3 Shyamsundar 2019-01-22 14:08:56 UTC

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-5.3, please open a new bug report.

glusterfs-5.3 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2019-January/000118.html
[2] https://www.gluster.org/pipermail/gluster-users/

Note You need to log in before you can comment on or make changes to this bug.