Bug 1679275 - dht: fix double extra unref of inode at heal path
Summary: dht: fix double extra unref of inode at heal path
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: nfs
Version: 6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Susant Kumar Palai
QA Contact:
URL:
Whiteboard:
Depends On: 1651439
Blocks: glusterfs-6.0 1732875
 
Reported: 2019-02-20 19:02 UTC by Sunil Kumar Acharya
Modified: 2019-07-24 15:04 UTC (History)
16 users

Fixed In Version: glusterfs-6.0
Clone Of: 1651439
Environment:
Last Closed: 2019-03-25 16:33:20 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:




Links
System: Gluster.org Gerrit
ID: 22244
Private: 0
Priority: None
Status: Open
Summary: dht: fix double extra unref of inode at heal path
Last Updated: 2019-02-22 03:34:32 UTC

Description Sunil Kumar Acharya 2019-02-20 19:02:17 UTC
+++ This bug was initially created as a clone of Bug #1651439 +++

+++ This bug was initially created as a clone of Bug #1633177 +++

Description of problem:

gluster-NFS crashed while expanding a volume.

Version-Release number of selected component (if applicable):

glusterfs-3.12.2-18.1.el7rhgs.x86_64

How reproducible: 


Steps to Reproduce:

While running automation runs, gluster-NFS crashed while expanding the volume.

1) Create a distribute volume (1 x 4)
2) Write IO from 2 clients
3) Add bricks while IO is in progress
4) Start rebalance
5) Check for IO

After step 5, the mount point hangs due to the gluster-NFS crash.

Actual results:

gluster-NFS crashes and IO hangs

Expected results:

IO should succeed

Additional info:

volume info:

[root@rhsauto023 glusterfs]# gluster vol info
 
Volume Name: testvol_distributed
Type: Distribute
Volume ID: a809a120-f582-4358-8a70-5c53f71734ee
Status: Started
Snapshot Count: 0
Number of Bricks: 5
Transport-type: tcp
Bricks:
Brick1: rhsauto023.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed_brick0
Brick2: rhsauto030.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed_brick1
Brick3: rhsauto031.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed_brick2
Brick4: rhsauto027.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed_brick3
Brick5: rhsauto023.lab.eng.blr.redhat.com:/bricks/brick1/testvol_distributed_brick4
Options Reconfigured:
transport.address-family: inet
nfs.disable: off
[root@rhsauto023 glusterfs]# 


> volume status

[root@rhsauto023 glusterfs]# gluster vol status
Status of volume: testvol_distributed
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick rhsauto023.lab.eng.blr.redhat.com:/br
icks/brick0/testvol_distributed_brick0      49153     0          Y       22557
Brick rhsauto030.lab.eng.blr.redhat.com:/br
icks/brick0/testvol_distributed_brick1      49153     0          Y       21814
Brick rhsauto031.lab.eng.blr.redhat.com:/br
icks/brick0/testvol_distributed_brick2      49153     0          Y       20441
Brick rhsauto027.lab.eng.blr.redhat.com:/br
icks/brick0/testvol_distributed_brick3      49152     0          Y       19886
Brick rhsauto023.lab.eng.blr.redhat.com:/br
icks/brick1/testvol_distributed_brick4      49152     0          Y       23019
NFS Server on localhost                     N/A       N/A        N       N/A  
NFS Server on rhsauto027.lab.eng.blr.redhat
.com                                        2049      0          Y       20008
NFS Server on rhsauto033.lab.eng.blr.redhat
.com                                        2049      0          Y       19752
NFS Server on rhsauto030.lab.eng.blr.redhat
.com                                        2049      0          Y       21936
NFS Server on rhsauto031.lab.eng.blr.redhat
.com                                        2049      0          Y       20557
NFS Server on rhsauto040.lab.eng.blr.redhat
.com                                        2049      0          Y       20047
 
Task Status of Volume testvol_distributed
------------------------------------------------------------------------------
Task                 : Rebalance           
ID                   : 8e5b404f-5740-4d87-a0d7-3ce94178329f
Status               : completed           
 
[root@rhsauto023 glusterfs]#

> NFS crash

[2018-09-25 13:58:35.381085] I [dict.c:471:dict_get] (-->/usr/lib64/glusterfs/3.12.2/xlator/protocol/client.so(+0x22f5d) [0x7f93543fdf5d] -->/usr/lib64/glusterfs/3.12.2/xlator/cluster/distri
bute.so(+0x202e7) [0x7f93541572e7] -->/lib64/libglusterfs.so.0(dict_get+0x10c) [0x7f9361aefb3c] ) 0-dict: !this || key=trusted.glusterfs.dht.mds [Invalid argument]
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 
2018-09-25 13:58:36
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.2
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xa0)[0x7f9361af8cc0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f9361b02c04]
/lib64/libc.so.6(+0x36280)[0x7f9360158280]
/lib64/libglusterfs.so.0(+0x3b6fa)[0x7f9361b086fa]
/lib64/libglusterfs.so.0(inode_parent+0x52)[0x7f9361b09822]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0xc243)[0x7f934f95c243]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0x3e1d8)[0x7f934f98e1d8]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0x3ea2b)[0x7f934f98ea2b]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0x3ead5)[0x7f934f98ead5]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0x3ecf8)[0x7f934f98ecf8]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0x29d7c)[0x7f934f979d7c]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0x2a184)[0x7f934f97a184]
/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325)[0x7f93618ba955]
/lib64/libgfrpc.so.0(rpcsvc_notify+0x10b)[0x7f93618bab3b]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f93618bca73]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x7566)[0x7f93566e2566]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x9b0c)[0x7f93566e4b0c]
/lib64/libglusterfs.so.0(+0x894c4)[0x7f9361b564c4]
/lib64/libpthread.so.0(+0x7dd5)[0x7f9360957dd5]
/lib64/libc.so.6(clone+0x6d)[0x7f9360220b3d]
---------

--- Additional comment from Red Hat Bugzilla Rules Engine on 2018-09-26 07:02:14 EDT ---

This bug is automatically being proposed for a Z-stream release of Red Hat Gluster Storage 3 under active development and open for bug fixes, by setting the release flag 'rhgs-3.4.z' to '?'. 

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Vijay Avuthu on 2018-09-26 07:03:44 EDT ---

SOS reports: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/vavuthu/nfs_crash_on_expanding_volume/

jenkin Job: http://jenkins-rhs.lab.eng.blr.redhat.com:8080/view/Auto%20RHEL%207.5/job/auto-RHGS_Downstream_BVT_RHEL_7_5_RHGS_3_4_brew/28/consoleFull

Glusto Logs : http://jenkins-rhs.lab.eng.blr.redhat.com:8080/view/Auto%20RHEL%207.5/job/auto-RHGS_Downstream_BVT_RHEL_7_5_RHGS_3_4_brew/ws/glusto_28.log

--- Additional comment from Jiffin on 2018-09-27 08:07:28 EDT ---

0  0x00007f9361b086fa in __inode_get_xl_index (xlator=0x7f9350018d30, inode=0x7f933c0133b0) at inode.c:455
455	        if ((inode->_ctx[xlator->xl_id].xl_key != NULL) &&
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-19.el7.x86_64 libacl-2.2.51-14.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libcom_err-1.42.9-12.el7_5.x86_64 libgcc-4.8.5-28.el7_5.1.x86_64 libselinux-2.5-12.el7.x86_64 libuuid-2.23.2-52.el7_5.1.x86_64 openssl-libs-1.0.2k-12.el7.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  0x00007f9361b086fa in __inode_get_xl_index (xlator=0x7f9350018d30, inode=0x7f933c0133b0) at inode.c:455
#1  __inode_ref (inode=inode@entry=0x7f933c0133b0) at inode.c:537
#2  0x00007f9361b09822 in inode_parent (inode=inode@entry=0x7f933c01d990, pargfid=pargfid@entry=0x7f93400aa2e8 "", name=name@entry=0x0) at inode.c:1359
#3  0x00007f934f95c243 in nfs_inode_loc_fill (inode=inode@entry=0x7f933c01d990, loc=loc@entry=0x7f93400aa2b8, how=how@entry=1) at nfs-common.c:206
#4  0x00007f934f98e1d8 in nfs3_fh_resolve_inode_done (cs=cs@entry=0x7f93400a9df0, inode=inode@entry=0x7f933c01d990) at nfs3-helpers.c:3611
#5  0x00007f934f98ea2b in nfs3_fh_resolve_inode (cs=0x7f93400a9df0) at nfs3-helpers.c:3828
#6  0x00007f934f98ead5 in nfs3_fh_resolve_resume (cs=cs@entry=0x7f93400a9df0) at nfs3-helpers.c:3860
#7  0x00007f934f98ecf8 in nfs3_fh_resolve_root (cs=cs@entry=0x7f93400a9df0) at nfs3-helpers.c:3915
#8  0x00007f934f98ef41 in nfs3_fh_resolve_and_resume (cs=cs@entry=0x7f93400a9df0, fh=fh@entry=0x7f934e195ae0, entry=entry@entry=0x0, resum_fn=resum_fn@entry=0x7f934f9798b0 <nfs3_access_resume>)
    at nfs3-helpers.c:4011
#9  0x00007f934f979d7c in nfs3_access (req=req@entry=0x7f934022dcd0, fh=fh@entry=0x7f934e195ae0, accbits=31) at nfs3.c:1783
#10 0x00007f934f97a184 in nfs3svc_access (req=0x7f934022dcd0) at nfs3.c:1819
#11 0x00007f93618ba955 in rpcsvc_handle_rpc_call (svc=0x7f935002c430, trans=trans@entry=0x7f935007a960, msg=<optimized out>) at rpcsvc.c:695
#12 0x00007f93618bab3b in rpcsvc_notify (trans=0x7f935007a960, mydata=<optimized out>, event=<optimized out>, data=<optimized out>) at rpcsvc.c:789
#13 0x00007f93618bca73 in rpc_transport_notify (this=this@entry=0x7f935007a960, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f9340031290) at rpc-transport.c:538
#14 0x00007f93566e2566 in socket_event_poll_in (this=this@entry=0x7f935007a960, notify_handled=<optimized out>) at socket.c:2315
#15 0x00007f93566e4b0c in socket_event_handler (fd=10, idx=7, gen=46, data=0x7f935007a960, poll_in=1, poll_out=0, poll_err=0) at socket.c:2467
#16 0x00007f9361b564c4 in event_dispatch_epoll_handler (event=0x7f934e195e80, event_pool=0x55c696306210) at event-epoll.c:583
#17 event_dispatch_epoll_worker (data=0x7f9350043b00) at event-epoll.c:659
#18 0x00007f9360957dd5 in start_thread () from /lib64/libpthread.so.0
#19 0x00007f9360220b3d in clone () from /lib64/libc.so.6


In the backtrace above, nfs_inode_loc_fill() was trying to find the parent inode. A valid parent inode does exist, but the context (_ctx) for that inode is NULL.
From code reading I was not able to find a place where _ctx can be NULL for an otherwise valid inode.

p *inode -- parent
$27 = {table = 0x7f935002d000, gfid = "{\033g\270K\202B\202\211\320B\"\373u", <incomplete sequence \311>, lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, 
        __kind = -1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 16 times>, "\377\377\377\377", '\000' <repeats 19 times>, __align = 0}}, nlookup = 0, 
  fd_count = 0, active_fd_count = 0, ref = 1, ia_type = IA_IFDIR, fd_list = {next = 0x7f933c013408, prev = 0x7f933c013408}, dentry_list = {next = 0x7f933c013418, prev = 0x7f933c013418}, hash = {
    next = 0x7f933c013428, prev = 0x7f933c013428}, list = {next = 0x7f93503a5408, prev = 0x7f935002d060}, _ctx = 0x0}
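
For illustration, here is a minimal self-contained C sketch of why a valid parent inode whose _ctx is NULL faults at inode.c:455 in frame #0 above. The struct names and the get_xl_index_demo() helper are simplified stand-ins, not the real glusterfs definitions, and the NULL guard exists only so the demo itself runs cleanly; at the crash site the _ctx table is indexed without a NULL check on the table itself, so a NULL base dereference raises the SIGSEGV.

/* Simplified illustration only: the struct names below are stand-ins for
 * the glusterfs inode/xlator types, not the real definitions. */
#include <stdio.h>
#include <stdint.h>

struct xl_ctx_slot {
    void    *xl_key;            /* per-xlator key slot, as checked at inode.c:455 */
    uint64_t value;
};

struct fake_inode {
    int                 ref;
    struct xl_ctx_slot *_ctx;   /* table indexed by xlator id; NULL matches the dump above */
};

/* Mirrors the shape of the failing check in __inode_get_xl_index():
 * the crash site indexes inode->_ctx[xlator->xl_id] while _ctx itself is
 * NULL, which dereferences a NULL base and raises the SIGSEGV seen in
 * frame #0. A guard is added here only so that this demo stays runnable. */
static int
get_xl_index_demo(struct fake_inode *inode, int xl_id)
{
    if (inode->_ctx == NULL) {
        fprintf(stderr, "_ctx is NULL: the unguarded lookup would crash here\n");
        return -1;
    }
    return (inode->_ctx[xl_id].xl_key != NULL) ? xl_id : -1;
}

int
main(void)
{
    struct fake_inode parent = { .ref = 1, ._ctx = NULL };  /* "_ctx = 0x0" as in the gdb print */
    printf("xl index = %d\n", get_xl_index_demo(&parent, 3));
    return 0;
}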

I tried to reproduce the issue (twice), but it did not hit on my setup.

Requesting Vijay to recheck how frequently it can be reproduced, and please try to run with debug log level for the nfs-server (diagnostics client log level).

--- Additional comment from Worker Ant on 2018-11-20 06:00:20 UTC ---

REVIEW: https://review.gluster.org/21685 (inode : prevent dentry creation if parent does not have ctx) posted (#1) for review on master by jiffin tony Thottan

--- Additional comment from Worker Ant on 2018-11-29 14:03:58 UTC ---

REVIEW: https://review.gluster.org/21749 (nfs : set ctx for every inode looked up nfs3_fh_resolve_inode_lookup_cbk()) posted (#1) for review on master by jiffin tony Thottan

--- Additional comment from Worker Ant on 2018-12-03 05:50:44 UTC ---

REVIEW: https://review.gluster.org/21749 (nfs : set ctx for every inode looked up nfs3_fh_resolve_inode_lookup_cbk()) posted (#4) for review on master by Amar Tumballi

--- Additional comment from Worker Ant on 2019-01-08 08:49:15 UTC ---

REVIEW: https://review.gluster.org/21998 (dht: fix inode leak when heal path) posted (#1) for review on master by Kinglong Mee

--- Additional comment from Worker Ant on 2019-02-13 18:22:33 UTC ---

REVIEW: https://review.gluster.org/21998 (dht: fix double extra unref of inode at heal path) merged (#4) on master by Raghavendra G

Comment 1 Worker Ant 2019-02-21 05:02:42 UTC
REVIEW: https://review.gluster.org/22244 (dht: fix double extra unref of inode at heal path) posted (#1) for review on release-6 by Susant Palai

Comment 2 Worker Ant 2019-02-22 03:34:33 UTC
REVIEW: https://review.gluster.org/22244 (dht: fix double extra unref of inode at heal path) merged (#2) on release-6 by Shyamsundar Ranganathan

Comment 3 Atin Mukherjee 2019-03-12 05:10:10 UTC
Is there anything pending on this bug? I still see the bug is in POST state even though the above patch is merged (as the commit had 'updates' tag).

Comment 4 Susant Kumar Palai 2019-03-12 07:29:14 UTC
(In reply to Atin Mukherjee from comment #3)
> Is there anything pending on this bug? I still see the bug is in POST state
> even though the above patch is merged (as the commit had 'updates' tag).

There was a crash seen in the dht layer, which was fixed by the above patch. But the patch was originally written for https://bugzilla.redhat.com/show_bug.cgi?id=1651439, which mostly targeted the NFS use case. Since we needed the dht fix in release-6, I guess Sunil cloned the mainline bug directly.

I will change the summary to reflect the dht-crash part and move the bug status to MODIFIED.


Susant
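
For context on the patch title, below is a heavily simplified, self-contained C sketch of the kind of "double extra unref" the fix describes. The types and the heal_path_buggy() helper are hypothetical stand-ins with assumed semantics (they are not the glusterfs inode-table API or the actual dht patch); the point is only that one unref too many steals the caller's reference, so the inode's context can be torn down while the NFS resolver still expects it to be valid, matching the NULL _ctx seen in the backtrace above.

/* Illustrative refcounting sketch only: fake_inode_ref()/fake_inode_unref()
 * are stand-ins with assumed semantics, not the glusterfs inode-table API,
 * and heal_path_buggy() is hypothetical, not the actual dht heal code. */
#include <stdio.h>
#include <stdlib.h>

struct fake_inode {
    int    ref;
    void **ctx;     /* per-xlator context table, released on the last unref */
};

static struct fake_inode *
fake_inode_ref(struct fake_inode *in)
{
    in->ref++;
    return in;
}

static void
fake_inode_unref(struct fake_inode *in)
{
    if (--in->ref == 0) {
        free(in->ctx);      /* context is torn down once the count hits zero */
        in->ctx = NULL;
    }
}

/* A heal-style helper that takes one reference for the duration of its work
 * but drops two: once in its callback and once again in cleanup. */
static void
heal_path_buggy(struct fake_inode *in)
{
    fake_inode_ref(in);     /* +1 for the heal frame          */
    fake_inode_unref(in);   /* -1 in the callback ...         */
    fake_inode_unref(in);   /* ... and -1 again: one too many */
}

int
main(void)
{
    struct fake_inode *in = calloc(1, sizeof(*in));
    in->ctx = calloc(4, sizeof(void *));
    in->ref = 1;            /* the caller's (resolver's) reference */

    heal_path_buggy(in);    /* silently steals the caller's reference */

    /* The caller still believes it holds a reference, yet ref is 0 and the
     * context table is gone: the precondition for the NULL-_ctx crash above. */
    printf("ref = %d, ctx = %p\n", in->ref, (void *)in->ctx);

    free(in);
    return 0;
}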

Comment 5 Shyamsundar 2019-03-25 16:33:20 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-6.0, please open a new bug report.

glusterfs-6.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2019-March/000120.html
[2] https://www.gluster.org/pipermail/gluster-users/

