Bug 1633177 - gluster-NFS crash while expanding volume
Summary: gluster-NFS crash while expanding volume
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: gluster-nfs
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 3.4.z Batch Update 3
Assignee: Jiffin
QA Contact: Jilju Joy
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-09-26 11:02 UTC by Vijay Avuthu
Modified: 2019-12-31 07:25 UTC (History)
CC List: 17 users

Fixed In Version: glusterfs-3.12.2-33
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1651439 (view as bug list)
Environment:
Last Closed: 2019-02-04 07:41:25 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Gluster.org Gerrit 21685 0 None None None 2018-11-22 08:31:22 UTC
Red Hat Product Errata RHBA-2019:0263 0 None None None 2019-02-04 07:41:38 UTC

Description Vijay Avuthu 2018-09-26 11:02:07 UTC
Description of problem:

gluster-NFS crashed while expanding a volume

Version-Release number of selected component (if applicable):

glusterfs-3.12.2-18.1.el7rhgs.x86_64

How reproducible: 


Steps to Reproduce:

During automation runs, gluster-NFS crashed while expanding the volume:

1) Create a distribute volume (1 x 4)
2) Write I/O from 2 clients
3) Add bricks while the I/O is in progress
4) Start rebalance
5) Check the I/O

After step 5, the mount point hangs because of the gluster-NFS crash.
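
For reference, a minimal shell sketch of the reproduction flow (hostnames, brick paths and the mount point below are placeholders, not the exact values from this run):

# Create a 1 x 4 distribute volume and make sure gluster-NFS is enabled
gluster volume create testvol_distributed \
    server1:/bricks/brick0/b0 server2:/bricks/brick0/b1 \
    server3:/bricks/brick0/b2 server4:/bricks/brick0/b3
gluster volume set testvol_distributed nfs.disable off
gluster volume start testvol_distributed

# On two clients: mount over NFSv3 and start I/O in the background
mount -t nfs -o vers=3 server1:/testvol_distributed /mnt/testvol

# While the I/O is still running: add a brick and start rebalance
gluster volume add-brick testvol_distributed server1:/bricks/brick1/b4
gluster volume rebalance testvol_distributed start
gluster volume rebalance testvol_distributed status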

Actual results:

gluster-NFS crashes and I/O hangs

Expected results:

I/O should succeed

Additional info:

volume info:

[root@rhsauto023 glusterfs]# gluster vol info
 
Volume Name: testvol_distributed
Type: Distribute
Volume ID: a809a120-f582-4358-8a70-5c53f71734ee
Status: Started
Snapshot Count: 0
Number of Bricks: 5
Transport-type: tcp
Bricks:
Brick1: rhsauto023.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed_brick0
Brick2: rhsauto030.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed_brick1
Brick3: rhsauto031.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed_brick2
Brick4: rhsauto027.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed_brick3
Brick5: rhsauto023.lab.eng.blr.redhat.com:/bricks/brick1/testvol_distributed_brick4
Options Reconfigured:
transport.address-family: inet
nfs.disable: off
[root@rhsauto023 glusterfs]# 


> volume status

[root@rhsauto023 glusterfs]# gluster vol status
Status of volume: testvol_distributed
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick rhsauto023.lab.eng.blr.redhat.com:/br
icks/brick0/testvol_distributed_brick0      49153     0          Y       22557
Brick rhsauto030.lab.eng.blr.redhat.com:/br
icks/brick0/testvol_distributed_brick1      49153     0          Y       21814
Brick rhsauto031.lab.eng.blr.redhat.com:/br
icks/brick0/testvol_distributed_brick2      49153     0          Y       20441
Brick rhsauto027.lab.eng.blr.redhat.com:/br
icks/brick0/testvol_distributed_brick3      49152     0          Y       19886
Brick rhsauto023.lab.eng.blr.redhat.com:/br
icks/brick1/testvol_distributed_brick4      49152     0          Y       23019
NFS Server on localhost                     N/A       N/A        N       N/A  
NFS Server on rhsauto027.lab.eng.blr.redhat
.com                                        2049      0          Y       20008
NFS Server on rhsauto033.lab.eng.blr.redhat
.com                                        2049      0          Y       19752
NFS Server on rhsauto030.lab.eng.blr.redhat
.com                                        2049      0          Y       21936
NFS Server on rhsauto031.lab.eng.blr.redhat
.com                                        2049      0          Y       20557
NFS Server on rhsauto040.lab.eng.blr.redhat
.com                                        2049      0          Y       20047
 
Task Status of Volume testvol_distributed
------------------------------------------------------------------------------
Task                 : Rebalance           
ID                   : 8e5b404f-5740-4d87-a0d7-3ce94178329f
Status               : completed           
 
[root@rhsauto023 glusterfs]#
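
(The "NFS Server on localhost ... Online N" row above is the crashed gNFS process. A quick sketch for confirming the process state and collecting data on that node; /var/log/glusterfs/nfs.log is the standard gNFS log, while the core file location depends on the system's core_pattern/abrt configuration:)

# Is the gNFS process still running on this node?
ps aux | grep '[g]lusterfs.*nfs'

# gluster-NFS log: look for "signal received: 11" and the backtrace
grep -A 40 'signal received' /var/log/glusterfs/nfs.log

# Locate the core dump (exact path depends on kernel.core_pattern / abrt)
coredumpctl list 2>/dev/null || ls -l /core* /var/spool/abrt/ 2>/dev/null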

> NFS crash

[2018-09-25 13:58:35.381085] I [dict.c:471:dict_get] (-->/usr/lib64/glusterfs/3.12.2/xlator/protocol/client.so(+0x22f5d) [0x7f93543fdf5d] -->/usr/lib64/glusterfs/3.12.2/xlator/cluster/distribute.so(+0x202e7) [0x7f93541572e7] -->/lib64/libglusterfs.so.0(dict_get+0x10c) [0x7f9361aefb3c] ) 0-dict: !this || key=trusted.glusterfs.dht.mds [Invalid argument]
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 
2018-09-25 13:58:36
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.2
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xa0)[0x7f9361af8cc0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f9361b02c04]
/lib64/libc.so.6(+0x36280)[0x7f9360158280]
/lib64/libglusterfs.so.0(+0x3b6fa)[0x7f9361b086fa]
/lib64/libglusterfs.so.0(inode_parent+0x52)[0x7f9361b09822]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0xc243)[0x7f934f95c243]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0x3e1d8)[0x7f934f98e1d8]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0x3ea2b)[0x7f934f98ea2b]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0x3ead5)[0x7f934f98ead5]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0x3ecf8)[0x7f934f98ecf8]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0x29d7c)[0x7f934f979d7c]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0x2a184)[0x7f934f97a184]
/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325)[0x7f93618ba955]
/lib64/libgfrpc.so.0(rpcsvc_notify+0x10b)[0x7f93618bab3b]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f93618bca73]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x7566)[0x7f93566e2566]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x9b0c)[0x7f93566e4b0c]
/lib64/libglusterfs.so.0(+0x894c4)[0x7f9361b564c4]
/lib64/libpthread.so.0(+0x7dd5)[0x7f9360957dd5]
/lib64/libc.so.6(clone+0x6d)[0x7f9360220b3d]
---------
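
To get symbolized frames out of a backtrace like the one above, a typical core analysis session would look roughly like this (the core path is a placeholder; the debuginfo packages must match the exact glusterfs build, here glusterfs-3.12.2-18.1.el7rhgs):

# Install matching debug symbols (versions must match the installed packages)
debuginfo-install -y glusterfs glusterfs-server

# Open the core against the gNFS binary (gNFS runs as /usr/sbin/glusterfs)
gdb /usr/sbin/glusterfs /path/to/core.<pid>
(gdb) bt                          # backtrace of the crashing thread
(gdb) thread apply all bt full    # full backtraces of all threads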

Comment 6 Atin Mukherjee 2018-11-20 13:11:22 UTC
If this is fairly reproducible and we find this use case to be important, why aren't we marking it as a blocker for 3.4.2 so that it can come into the triage queue of blocker/exception proposed bugs? What's blocking us here?

(Of course, not every BZ found through an automation test should be marked as a blocker, but this one seems important?)

Jiffin - have we had a chance to look at the automation test that leads to this crash? Have we tried the same in our local setup?

Comment 8 Rahul Hinduja 2018-11-22 08:01:59 UTC
(In reply to Atin Mukherjee from comment #6)
> If this is fairly reproducible and we find this use case to be important,
> why aren't we marking it as a blocker for 3.4.2 so that it can come into the
> triage queue of blocker/exception proposed bugs? What's blocking us here?
> 
I was waiting for the discussion about flags and keywords on the program mailing list to close properly, and hence approached this via need_info. This wasn't an in-flight bug.

However, based on that discussion, I am moving forward with the explanation below.

During the automation runs on the NFS client, we are seeing either

1. a client hang (Bug 1648783), or
2. an NFS crash (Bug 1633177).

From the automation runs' perspective, I consider these AutomationBlockers for the NFS protocol (different use cases) and am hence setting the appropriate keyword and flag for traction and a decision.

Comment 19 Jilju Joy 2019-01-06 20:00:17 UTC
Checked manually and through automation. No crash was observed even after executing the test case multiple times.
Observed the hang (bz 1648783) a couple of times.
Volume types used: Distribute, Distributed-Replicate, Distributed-Replicate(arbiter)

Verified in version: glusterfs-3.12.2-36.el7rhgs.x86_64

Comment 21 errata-xmlrpc 2019-02-04 07:41:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0263

