Bug 1658132

Summary: [Ganesha] Ganesha failed on one node while exporting volumes in loop
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Jilju Joy <jijoy>
Component: libgfapi
Assignee: Soumya Koduri <skoduri>
Status: CLOSED ERRATA
QA Contact: Upasana <ubansal>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.4
CC: dang, ffilz, grajoria, jthottan, kkeithle, mbenjamin, nchilaka, pasik, pprakash, puebele, rhs-bugs, rkothiya, sheggodu, skoduri, storage-qa-internal, ubansal
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 3.5.z Batch Update 3
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-6.0-38
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 1660577 (view as bug list)
Environment:
Last Closed: 2020-12-17 04:50:16 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1660577, 1663131, 1663132
Bug Blocks:

Description Jilju Joy 2018-12-11 10:35:34 UTC
Description of problem:
-----------------------
Ganesha entered a failed state on one node of the four-node cluster while exporting volumes in a loop. Tried to export 109 volumes one after the other in a loop.

===============================================================================
Version-Release number of selected component (if applicable):
-------------------------------------------------------------
nfs-ganesha-2.5.5-10.el7rhgs.x86_64
nfs-ganesha-gluster-2.5.5-10.el7rhgs.x86_64
glusterfs-ganesha-3.12.2-28.el7rhgs.x86_64

===============================================================================
How reproducible:
-----------------
1/1

===============================================================================
Steps to Reproduce:
-------------------
1. Create a 4-node ganesha cluster.
2. Create and start 100 or more volumes.
3. Verify the status of all volumes.
4. Export the volumes one after the other in a loop.

===============================================================================
Actual results:
---------------
Ganesha entered failed state in one of the nodes.

===============================================================================
Expected results:
-----------------
No failure should be observed.

==============================================================================
Additional info:
----------------
* All volumes were exported on the other 3 nodes of the 4-node cluster.
* The failure was observed on a different node than the one from which the export operation was executed.

The setup is kept in the same state and can be shared if required.

Comment 2 Jilju Joy 2018-12-11 10:37:00 UTC
Logs and sos report will be shared shortly.

Comment 4 Soumya Koduri 2018-12-11 16:26:13 UTC
(gdb) bt
#0  __memcmp_sse4_1 () at ../sysdeps/x86_64/multiarch/memcmp-sse4.S:74
#1  0x00007f5d18130664 in glfs_mgmt_getspec_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f5b74002cb0) at glfs-mgmt.c:625
#2  0x00007f5c8e9e8960 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f5a08cc5760, pollin=pollin@entry=0x7f5b7f09acb0) at rpc-clnt.c:778
#3  0x00007f5c8e9e8d03 in rpc_clnt_notify (trans=<optimized out>, mydata=0x7f5a08cc5790, event=<optimized out>, data=0x7f5b7f09acb0) at rpc-clnt.c:971
#4  0x00007f5c8e9e4a73 in rpc_transport_notify (this=this@entry=0x7f5a08cc5930, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f5b7f09acb0) at rpc-transport.c:538
#5  0x00007f5c849e5576 in socket_event_poll_in (this=this@entry=0x7f5a08cc5930, notify_handled=<optimized out>) at socket.c:2322
#6  0x00007f5c849e7b1c in socket_event_handler (fd=565, idx=0, gen=1, data=0x7f5a08cc5930, poll_in=1, poll_out=0, poll_err=0) at socket.c:2474
#7  0x00007f5c8ec7e824 in event_dispatch_epoll_handler (event=0x7f59e1f44500, event_pool=0x7f5a08cb74f0) at event-epoll.c:583
#8  event_dispatch_epoll_worker (data=0x7f5b760922a0) at event-epoll.c:659
#9  0x00007f5d20e44dd5 in start_thread (arg=0x7f59e1f45700) at pthread_create.c:307
#10 0x00007f5d2050fead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) f 1
#1  0x00007f5d18130664 in glfs_mgmt_getspec_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f5b74002cb0) at glfs-mgmt.c:625
625		    (memcmp (fs->oldvolfile, rsp.spec, size) == 0)) {
(gdb) l
620	
621		ret = 0;
622		size = rsp.op_ret;
623	
624		if ((size == fs->oldvollen) &&
625		    (memcmp (fs->oldvolfile, rsp.spec, size) == 0)) {
626			gf_msg (frame->this->name, GF_LOG_INFO, 0,
627	                        API_MSG_VOLFILE_INFO,
628				"No change in volfile, continuing");
629			goto out;
(gdb) p fs->olvollen
There is no member named olvollen.
(gdb) p fs->oldvollen
$1 = 1674
(gdb) p size
$2 = 1674
(gdb) p fs->oldvolfile
$3 = 0x7f5b76097cd0 "volume testvol82201-client-0\n    type protocol/client\n    option send-gids true\n    option transport.socket.keepalive-count 9\n    option transport.socket.keepalive-interval 2\n    option transport.sock"...
(gdb) p rsp.spec
$4 = 0x7f5b7f9da9d0 "volume testvol82201-client-0\n    type protocol/client\n    option send-gids true\n    option transport.socket.keepalive-count 9\n    option transport.socket.keepalive-interval 2\n    option transport.sock"...
(gdb) 


The crash happened while doing memcmp of fs->oldvolfile and the new volfile received in the response (rsp.spec). The contents of both variables look fine in the core.

From code reading, I observed that we update fs->oldvollen and fs->oldvolfile under the fs->mutex lock, but that lock is not taken while reading those values here in glfs_mgmt_getspec_cbk. That could have resulted in the crash while accessing an uninitialized or partially initialized variable.
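
For reference, a minimal sketch of the fix direction described above (not the actual patch), assuming fs->mutex is not already held on this path: take the lock around the read and the comparison so the callback never sees a half-updated length/buffer pair.

    /* Sketch only: read fs->oldvollen and fs->oldvolfile under the same
     * lock the updater uses, so length and buffer are always observed as
     * a consistent pair. */
    int unchanged = 0;

    pthread_mutex_lock (&fs->mutex);
    {
            unchanged = (size == fs->oldvollen) &&
                        (memcmp (fs->oldvolfile, rsp.spec, size) == 0);
    }
    pthread_mutex_unlock (&fs->mutex);

    if (unchanged) {
            gf_msg (frame->this->name, GF_LOG_INFO, 0,
                    API_MSG_VOLFILE_INFO,
                    "No change in volfile, continuing");
            goto out;
    }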

@Jilju,

Are you able to consistently reproduce this issue?

Comment 5 Daniel Gryniewicz 2018-12-11 16:33:41 UTC
Are the buffers smaller than 1674?  It might be going off the end of one of the buffers.
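
A sketch of how that could happen even though both strings look intact in the core (assumed shape of the updater, not the exact glfs-mgmt.c source): the cached volfile is replaced under fs->mutex, so a reader that skips the lock can pair the old length with a buffer that has just been freed or reallocated, and memcmp then walks the old length's worth of bytes over the new allocation.

    /* Illustrative updater sketch (assumption, not the exact source):
     * fs->oldvolfile is released and reallocated and fs->oldvollen is
     * updated while holding fs->mutex.  An unlocked reader that already
     * passed the (size == fs->oldvollen) check can then memcmp() against
     * freed memory or a shorter replacement buffer. */
    pthread_mutex_lock (&fs->mutex);
    {
            GF_FREE (fs->oldvolfile);               /* old buffer released     */
            fs->oldvolfile = gf_strdup (rsp.spec);  /* new, maybe shorter copy */
            fs->oldvollen = size;                   /* length updated last     */
    }
    pthread_mutex_unlock (&fs->mutex);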

Comment 6 Jilju Joy 2018-12-12 04:50:00 UTC
(In reply to Soumya Koduri from comment #4)
> @Jilju,
> 
> Are you able to consistently reproduce this issue?

Hi Soumya,

This is the first occurrence of the issue. I have kept the setup in the same state to help with debugging. I can share the setup if required, or I can try to reproduce.

Comment 8 Soumya Koduri 2019-05-06 12:46:02 UTC
The fix is available in the current rhgs-3.5 gluster rebase. Kindly validate and update.

Comment 22 errata-xmlrpc 2020-12-17 04:50:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (glusterfs bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5603