Description of problem:
-----------------------
nfs-ganesha entered the failed state on one node of the four-node cluster while exporting volumes in a loop. Tried to export 109 volumes one after the other in a loop.

===============================================================================

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
nfs-ganesha-2.5.5-10.el7rhgs.x86_64
nfs-ganesha-gluster-2.5.5-10.el7rhgs.x86_64
glusterfs-ganesha-3.12.2-28.el7rhgs.x86_64

===============================================================================

How reproducible:
-----------------
1/1

===============================================================================

Steps to Reproduce:
-------------------
1. Create a 4-node ganesha cluster.
2. Create and start 100 or more volumes.
3. Verify the status of all volumes.
4. Export the volumes one after the other in a loop.

===============================================================================

Actual results:
---------------
Ganesha entered the failed state on one of the nodes.

===============================================================================

Expected results:
-----------------
No failure should be observed.

===============================================================================

Additional info:
----------------
* All volumes were exported on the other 3 nodes of the 4-node cluster.
* The failure was observed on a different node than the one from which the export operation was executed.

The setup has been kept in the same state and can be shared if required.
Logs and sos report will be shared shortly.
(gdb) bt
#0  __memcmp_sse4_1 () at ../sysdeps/x86_64/multiarch/memcmp-sse4.S:74
#1  0x00007f5d18130664 in glfs_mgmt_getspec_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f5b74002cb0) at glfs-mgmt.c:625
#2  0x00007f5c8e9e8960 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f5a08cc5760, pollin=pollin@entry=0x7f5b7f09acb0) at rpc-clnt.c:778
#3  0x00007f5c8e9e8d03 in rpc_clnt_notify (trans=<optimized out>, mydata=0x7f5a08cc5790, event=<optimized out>, data=0x7f5b7f09acb0) at rpc-clnt.c:971
#4  0x00007f5c8e9e4a73 in rpc_transport_notify (this=this@entry=0x7f5a08cc5930, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f5b7f09acb0) at rpc-transport.c:538
#5  0x00007f5c849e5576 in socket_event_poll_in (this=this@entry=0x7f5a08cc5930, notify_handled=<optimized out>) at socket.c:2322
#6  0x00007f5c849e7b1c in socket_event_handler (fd=565, idx=0, gen=1, data=0x7f5a08cc5930, poll_in=1, poll_out=0, poll_err=0) at socket.c:2474
#7  0x00007f5c8ec7e824 in event_dispatch_epoll_handler (event=0x7f59e1f44500, event_pool=0x7f5a08cb74f0) at event-epoll.c:583
#8  event_dispatch_epoll_worker (data=0x7f5b760922a0) at event-epoll.c:659
#9  0x00007f5d20e44dd5 in start_thread (arg=0x7f59e1f45700) at pthread_create.c:307
#10 0x00007f5d2050fead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) f 1
#1  0x00007f5d18130664 in glfs_mgmt_getspec_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f5b74002cb0) at glfs-mgmt.c:625
625                 (memcmp (fs->oldvolfile, rsp.spec, size) == 0)) {
(gdb) l
620
621         ret = 0;
622         size = rsp.op_ret;
623
624         if ((size == fs->oldvollen) &&
625             (memcmp (fs->oldvolfile, rsp.spec, size) == 0)) {
626                 gf_msg (frame->this->name, GF_LOG_INFO, 0,
627                         API_MSG_VOLFILE_INFO,
628                         "No change in volfile, continuing");
629                 goto out;
(gdb) p fs->olvollen
There is no member named olvollen.
(gdb) p fs->oldvollen
$1 = 1674
(gdb) p size
$2 = 1674
(gdb) p fs->oldvolfile
$3 = 0x7f5b76097cd0 "volume testvol82201-client-0\n    type protocol/client\n    option send-gids true\n    option transport.socket.keepalive-count 9\n    option transport.socket.keepalive-interval 2\n    option transport.sock"...
(gdb) p rsp.spec
$4 = 0x7f5b7f9da9d0 "volume testvol82201-client-0\n    type protocol/client\n    option send-gids true\n    option transport.socket.keepalive-count 9\n    option transport.socket.keepalive-interval 2\n    option transport.sock"...
(gdb)

The crash happened while doing a memcmp of fs->oldvolfile and the new volfile received in the response (rsp.spec). The contents of both variables look fine in the core.

From code reading, I observed that we update fs->oldvollen and fs->oldvolfile under the fs->mutex lock, but that lock is not taken while reading those values here in glfs_mgmt_getspec_cbk. That could have resulted in the crash while accessing an uninitialized or partially initialized variable.

@Jilju,

Are you able to consistently reproduce this issue?
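To make the suggested direction concrete, below is a minimal, self-contained C sketch of the locking pattern described above: take the same mutex around both the update and the read of the cached volfile length and contents, so the callback can never compare against a half-updated pair. The fake_fs struct and the cache_volfile()/volfile_unchanged() helpers are hypothetical stand-ins that only mimic the fs->oldvollen/fs->oldvolfile/fs->mutex fields of the real gfapi object; this is not the actual glusterfs patch, whose shape may differ.

/* Sketch only: simplified model of the gfapi volfile cache and the
 * locked comparison; not the upstream fix. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

struct fake_fs {
        pthread_mutex_t mutex;     /* guards oldvollen and oldvolfile */
        ssize_t         oldvollen;
        char           *oldvolfile;
};

/* Writer: cache a newly received volfile under fs->mutex (the real code
 * already takes the lock on this path). */
static void
cache_volfile (struct fake_fs *fs, const char *spec, ssize_t size)
{
        pthread_mutex_lock (&fs->mutex);
        {
                free (fs->oldvolfile);
                fs->oldvolfile = malloc (size + 1);
                memcpy (fs->oldvolfile, spec, size);
                fs->oldvolfile[size] = '\0';
                fs->oldvollen = size;
        }
        pthread_mutex_unlock (&fs->mutex);
}

/* Reader: the length check and memcmp that crashed in
 * glfs_mgmt_getspec_cbk, now done under the same lock so the pair can
 * never be observed mid-update. */
static int
volfile_unchanged (struct fake_fs *fs, const char *spec, ssize_t size)
{
        int unchanged = 0;

        pthread_mutex_lock (&fs->mutex);
        {
                if ((size == fs->oldvollen) && fs->oldvolfile &&
                    (memcmp (fs->oldvolfile, spec, size) == 0))
                        unchanged = 1;
        }
        pthread_mutex_unlock (&fs->mutex);

        return unchanged;
}

int
main (void)
{
        struct fake_fs fs = { .mutex = PTHREAD_MUTEX_INITIALIZER };
        const char *spec = "volume testvol-client-0\n    type protocol/client\n";

        cache_volfile (&fs, spec, strlen (spec));
        printf ("volfile unchanged: %d\n",
                volfile_unchanged (&fs, spec, strlen (spec)));
        free (fs.oldvolfile);
        return 0;
}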
Are the buffers smaller than 1674? It might be going off the end of one of the buffers.
(In reply to Soumya Koduri from comment #4)

> @Jilju,
>
> Are you able to consistently reproduce this issue?

Hi Soumya,

This is the first occurrence and it is reported here. I have kept the setup in the same state to aid debugging. I can share the setup if required, or I can try to reproduce the issue.
The fix is available in the current rhgs-3.5 gluster rebase. Kindly validate and update.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (glusterfs bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:5603