Bug 1651439 - gluster-NFS crash while expanding volume
Summary: gluster-NFS crash while expanding volume
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: nfs
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Jiffin
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1679275
 
Reported: 2018-11-20 05:56 UTC by Jiffin
Modified: 2019-03-25 16:32 UTC (History)
13 users

Fixed In Version: glusterfs-6.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1633177
: 1679275 (view as bug list)
Environment:
Last Closed: 2019-03-25 16:32:04 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:




Links
System ID Priority Status Summary Last Updated
Gluster.org Gerrit 21685 None Open inode : prevent dentry creation if parent does not have ctx 2018-11-20 06:00:24 UTC
Gluster.org Gerrit 21749 None Merged nfs : set ctx for every inode looked up nfs3_fh_resolve_inode_lookup_cbk() 2018-12-03 05:50:46 UTC
Gluster.org Gerrit 21998 None Open dht: fix double extra unref of inode at heal path 2019-02-13 18:22:32 UTC

Description Jiffin 2018-11-20 05:56:00 UTC
+++ This bug was initially created as a clone of Bug #1633177 +++

Description of problem:

gluster-NFS crashed while expanding a volume.

Version-Release number of selected component (if applicable):

glusterfs-3.12.2-18.1.el7rhgs.x86_64

How reproducible: 


Steps to Reproduce:

While running automation runs, gluster-NFS crashed while expanding the volume.

1) Create a distribute volume (1 x 4)
2) Write IO from 2 clients
3) Add bricks while IO is in progress
4) Start rebalance
5) Check for IO

After step 5, the mount point hangs due to the gluster-NFS crash.

Actual results:

gluster-NFS crashes and IO hangs.

Expected results:

IO should succeed.

Additional info:

volume info:

[root@rhsauto023 glusterfs]# gluster vol info
 
Volume Name: testvol_distributed
Type: Distribute
Volume ID: a809a120-f582-4358-8a70-5c53f71734ee
Status: Started
Snapshot Count: 0
Number of Bricks: 5
Transport-type: tcp
Bricks:
Brick1: rhsauto023.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed_brick0
Brick2: rhsauto030.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed_brick1
Brick3: rhsauto031.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed_brick2
Brick4: rhsauto027.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed_brick3
Brick5: rhsauto023.lab.eng.blr.redhat.com:/bricks/brick1/testvol_distributed_brick4
Options Reconfigured:
transport.address-family: inet
nfs.disable: off
[root@rhsauto023 glusterfs]# 


> volume status

[root@rhsauto023 glusterfs]# gluster vol status
Status of volume: testvol_distributed
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick rhsauto023.lab.eng.blr.redhat.com:/br
icks/brick0/testvol_distributed_brick0      49153     0          Y       22557
Brick rhsauto030.lab.eng.blr.redhat.com:/br
icks/brick0/testvol_distributed_brick1      49153     0          Y       21814
Brick rhsauto031.lab.eng.blr.redhat.com:/br
icks/brick0/testvol_distributed_brick2      49153     0          Y       20441
Brick rhsauto027.lab.eng.blr.redhat.com:/br
icks/brick0/testvol_distributed_brick3      49152     0          Y       19886
Brick rhsauto023.lab.eng.blr.redhat.com:/br
icks/brick1/testvol_distributed_brick4      49152     0          Y       23019
NFS Server on localhost                     N/A       N/A        N       N/A  
NFS Server on rhsauto027.lab.eng.blr.redhat
.com                                        2049      0          Y       20008
NFS Server on rhsauto033.lab.eng.blr.redhat
.com                                        2049      0          Y       19752
NFS Server on rhsauto030.lab.eng.blr.redhat
.com                                        2049      0          Y       21936
NFS Server on rhsauto031.lab.eng.blr.redhat
.com                                        2049      0          Y       20557
NFS Server on rhsauto040.lab.eng.blr.redhat
.com                                        2049      0          Y       20047
 
Task Status of Volume testvol_distributed
------------------------------------------------------------------------------
Task                 : Rebalance           
ID                   : 8e5b404f-5740-4d87-a0d7-3ce94178329f
Status               : completed           
 
[root@rhsauto023 glusterfs]#

> NFS crash

[2018-09-25 13:58:35.381085] I [dict.c:471:dict_get] (-->/usr/lib64/glusterfs/3.12.2/xlator/protocol/client.so(+0x22f5d) [0x7f93543fdf5d] -->/usr/lib64/glusterfs/3.12.2/xlator/cluster/distribute.so(+0x202e7) [0x7f93541572e7] -->/lib64/libglusterfs.so.0(dict_get+0x10c) [0x7f9361aefb3c] ) 0-dict: !this || key=trusted.glusterfs.dht.mds [Invalid argument]
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 
2018-09-25 13:58:36
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.2
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xa0)[0x7f9361af8cc0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f9361b02c04]
/lib64/libc.so.6(+0x36280)[0x7f9360158280]
/lib64/libglusterfs.so.0(+0x3b6fa)[0x7f9361b086fa]
/lib64/libglusterfs.so.0(inode_parent+0x52)[0x7f9361b09822]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0xc243)[0x7f934f95c243]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0x3e1d8)[0x7f934f98e1d8]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0x3ea2b)[0x7f934f98ea2b]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0x3ead5)[0x7f934f98ead5]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0x3ecf8)[0x7f934f98ecf8]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0x29d7c)[0x7f934f979d7c]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0x2a184)[0x7f934f97a184]
/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325)[0x7f93618ba955]
/lib64/libgfrpc.so.0(rpcsvc_notify+0x10b)[0x7f93618bab3b]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f93618bca73]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x7566)[0x7f93566e2566]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x9b0c)[0x7f93566e4b0c]
/lib64/libglusterfs.so.0(+0x894c4)[0x7f9361b564c4]
/lib64/libpthread.so.0(+0x7dd5)[0x7f9360957dd5]
/lib64/libc.so.6(clone+0x6d)[0x7f9360220b3d]
---------

--- Additional comment from Red Hat Bugzilla Rules Engine on 2018-09-26 07:02:14 EDT ---

This bug is automatically being proposed for a Z-stream release of Red Hat Gluster Storage 3 under active development and open for bug fixes, by setting the release flag 'rhgs-3.4.z' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Vijay Avuthu on 2018-09-26 07:03:44 EDT ---

SOS reports: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/vavuthu/nfs_crash_on_expanding_volume/

jenkin Job: http://jenkins-rhs.lab.eng.blr.redhat.com:8080/view/Auto%20RHEL%207.5/job/auto-RHGS_Downstream_BVT_RHEL_7_5_RHGS_3_4_brew/28/consoleFull

Glusto Logs : http://jenkins-rhs.lab.eng.blr.redhat.com:8080/view/Auto%20RHEL%207.5/job/auto-RHGS_Downstream_BVT_RHEL_7_5_RHGS_3_4_brew/ws/glusto_28.log

--- Additional comment from Jiffin on 2018-09-27 08:07:28 EDT ---

0  0x00007f9361b086fa in __inode_get_xl_index (xlator=0x7f9350018d30, inode=0x7f933c0133b0) at inode.c:455
455	        if ((inode->_ctx[xlator->xl_id].xl_key != NULL) &&
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-19.el7.x86_64 libacl-2.2.51-14.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libcom_err-1.42.9-12.el7_5.x86_64 libgcc-4.8.5-28.el7_5.1.x86_64 libselinux-2.5-12.el7.x86_64 libuuid-2.23.2-52.el7_5.1.x86_64 openssl-libs-1.0.2k-12.el7.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  0x00007f9361b086fa in __inode_get_xl_index (xlator=0x7f9350018d30, inode=0x7f933c0133b0) at inode.c:455
#1  __inode_ref (inode=inode@entry=0x7f933c0133b0) at inode.c:537
#2  0x00007f9361b09822 in inode_parent (inode=inode@entry=0x7f933c01d990, pargfid=pargfid@entry=0x7f93400aa2e8 "", name=name@entry=0x0) at inode.c:1359
#3  0x00007f934f95c243 in nfs_inode_loc_fill (inode=inode@entry=0x7f933c01d990, loc=loc@entry=0x7f93400aa2b8, how=how@entry=1) at nfs-common.c:206
#4  0x00007f934f98e1d8 in nfs3_fh_resolve_inode_done (cs=cs@entry=0x7f93400a9df0, inode=inode@entry=0x7f933c01d990) at nfs3-helpers.c:3611
#5  0x00007f934f98ea2b in nfs3_fh_resolve_inode (cs=0x7f93400a9df0) at nfs3-helpers.c:3828
#6  0x00007f934f98ead5 in nfs3_fh_resolve_resume (cs=cs@entry=0x7f93400a9df0) at nfs3-helpers.c:3860
#7  0x00007f934f98ecf8 in nfs3_fh_resolve_root (cs=cs@entry=0x7f93400a9df0) at nfs3-helpers.c:3915
#8  0x00007f934f98ef41 in nfs3_fh_resolve_and_resume (cs=cs@entry=0x7f93400a9df0, fh=fh@entry=0x7f934e195ae0, entry=entry@entry=0x0, resum_fn=resum_fn@entry=0x7f934f9798b0 <nfs3_access_resume>)
    at nfs3-helpers.c:4011
#9  0x00007f934f979d7c in nfs3_access (req=req@entry=0x7f934022dcd0, fh=fh@entry=0x7f934e195ae0, accbits=31) at nfs3.c:1783
#10 0x00007f934f97a184 in nfs3svc_access (req=0x7f934022dcd0) at nfs3.c:1819
#11 0x00007f93618ba955 in rpcsvc_handle_rpc_call (svc=0x7f935002c430, trans=trans@entry=0x7f935007a960, msg=<optimized out>) at rpcsvc.c:695
#12 0x00007f93618bab3b in rpcsvc_notify (trans=0x7f935007a960, mydata=<optimized out>, event=<optimized out>, data=<optimized out>) at rpcsvc.c:789
#13 0x00007f93618bca73 in rpc_transport_notify (this=this@entry=0x7f935007a960, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f9340031290) at rpc-transport.c:538
#14 0x00007f93566e2566 in socket_event_poll_in (this=this@entry=0x7f935007a960, notify_handled=<optimized out>) at socket.c:2315
#15 0x00007f93566e4b0c in socket_event_handler (fd=10, idx=7, gen=46, data=0x7f935007a960, poll_in=1, poll_out=0, poll_err=0) at socket.c:2467
#16 0x00007f9361b564c4 in event_dispatch_epoll_handler (event=0x7f934e195e80, event_pool=0x55c696306210) at event-epoll.c:583
#17 event_dispatch_epoll_worker (data=0x7f9350043b00) at event-epoll.c:659
#18 0x00007f9360957dd5 in start_thread () from /lib64/libpthread.so.0
#19 0x00007f9360220b3d in clone () from /lib64/libc.so.6


In the backtrace above, as part of nfs_inode_loc_fill(), NFS was trying to find the parent inode. A valid inode exists for the parent as well, but the context (_ctx) of that inode is NULL.
From code reading, I was not able to find a place where ctx can be NULL for a valid inode.

p *inode -- parent
$27 = {table = 0x7f935002d000, gfid = "{\033g\270K\202B\202\211\320B\"\373u", <incomplete sequence \311>, lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, 
        __kind = -1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 16 times>, "\377\377\377\377", '\000' <repeats 19 times>, __align = 0}}, nlookup = 0, 
  fd_count = 0, active_fd_count = 0, ref = 1, ia_type = IA_IFDIR, fd_list = {next = 0x7f933c013408, prev = 0x7f933c013408}, dentry_list = {next = 0x7f933c013418, prev = 0x7f933c013418}, hash = {
    next = 0x7f933c013428, prev = 0x7f933c013428}, list = {next = 0x7f93503a5408, prev = 0x7f935002d060}, _ctx = 0x0}

I tried to reproduce the issue (twice), but could not hit it on my setup.

Requesting Vijay to recheck how frequently it can be reproduced, and please try running with debug log level for the nfs-server (diagnostics.client-log-level).

Comment 1 Worker Ant 2018-11-20 06:00:20 UTC
REVIEW: https://review.gluster.org/21685 (inode : prevent dentry creation if parent does not have ctx) posted (#1) for review on master by jiffin tony Thottan

Comment 2 Worker Ant 2018-11-29 14:03:58 UTC
REVIEW: https://review.gluster.org/21749 (nfs : set ctx for every inode looked up nfs3_fh_resolve_inode_lookup_cbk()) posted (#1) for review on master by jiffin tony Thottan

Comment 3 Worker Ant 2018-12-03 05:50:44 UTC
REVIEW: https://review.gluster.org/21749 (nfs : set ctx for every inode looked up nfs3_fh_resolve_inode_lookup_cbk()) posted (#4) for review on master by Amar Tumballi

Comment 4 Worker Ant 2019-01-08 08:49:15 UTC
REVIEW: https://review.gluster.org/21998 (dht: fix inode leak when heal path) posted (#1) for review on master by Kinglong Mee

Comment 5 Worker Ant 2019-02-13 18:22:33 UTC
REVIEW: https://review.gluster.org/21998 (dht: fix double extra unref of inode at heal path) merged (#4) on master by Raghavendra G

Comment 6 Shyamsundar 2019-03-25 16:32:04 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-6.0, please open a new bug report.

glusterfs-6.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2019-March/000120.html
[2] https://www.gluster.org/pipermail/gluster-users/

