Description of problem:
gluster-NFS crashed while expanding a volume.

Version-Release number of selected component (if applicable):
glusterfs-3.12.2-18.1.el7rhgs.x86_64

How reproducible:
Hit during automation runs.

Steps to Reproduce:
1. Create a distribute volume (1 x 4).
2. Write I/O from 2 clients.
3. Add bricks while the I/O is in progress.
4. Start rebalance.
5. Check the I/O.

After step 5, the mount point is hung due to the gluster-NFS crash.

Actual results:
gluster-NFS crashes and the I/O hangs.

Expected results:
The I/O should succeed.

Additional info:

> volume info

[root@rhsauto023 glusterfs]# gluster vol info

Volume Name: testvol_distributed
Type: Distribute
Volume ID: a809a120-f582-4358-8a70-5c53f71734ee
Status: Started
Snapshot Count: 0
Number of Bricks: 5
Transport-type: tcp
Bricks:
Brick1: rhsauto023.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed_brick0
Brick2: rhsauto030.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed_brick1
Brick3: rhsauto031.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed_brick2
Brick4: rhsauto027.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed_brick3
Brick5: rhsauto023.lab.eng.blr.redhat.com:/bricks/brick1/testvol_distributed_brick4
Options Reconfigured:
transport.address-family: inet
nfs.disable: off
[root@rhsauto023 glusterfs]#

> volume status

[root@rhsauto023 glusterfs]# gluster vol status
Status of volume: testvol_distributed
Gluster process                                                                    TCP Port  RDMA Port  Online  Pid
-------------------------------------------------------------------------------------------------------------------
Brick rhsauto023.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed_brick0  49153     0          Y       22557
Brick rhsauto030.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed_brick1  49153     0          Y       21814
Brick rhsauto031.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed_brick2  49153     0          Y       20441
Brick rhsauto027.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed_brick3  49152     0          Y       19886
Brick rhsauto023.lab.eng.blr.redhat.com:/bricks/brick1/testvol_distributed_brick4  49152     0          Y       23019
NFS Server on localhost                                                            N/A       N/A        N       N/A
NFS Server on rhsauto027.lab.eng.blr.redhat.com                                    2049      0          Y       20008
NFS Server on rhsauto033.lab.eng.blr.redhat.com                                    2049      0          Y       19752
NFS Server on rhsauto030.lab.eng.blr.redhat.com                                    2049      0          Y       21936
NFS Server on rhsauto031.lab.eng.blr.redhat.com                                    2049      0          Y       20557
NFS Server on rhsauto040.lab.eng.blr.redhat.com                                    2049      0          Y       20047

Task Status of Volume testvol_distributed
------------------------------------------------------------------------------
Task                 : Rebalance
ID                   : 8e5b404f-5740-4d87-a0d7-3ce94178329f
Status               : completed

[root@rhsauto023 glusterfs]#

(Note that the NFS Server on localhost is offline -- this is the crashed gluster-NFS process.)

> NFS crash

[2018-09-25 13:58:35.381085] I [dict.c:471:dict_get] (-->/usr/lib64/glusterfs/3.12.2/xlator/protocol/client.so(+0x22f5d) [0x7f93543fdf5d] -->/usr/lib64/glusterfs/3.12.2/xlator/cluster/distribute.so(+0x202e7) [0x7f93541572e7] -->/lib64/libglusterfs.so.0(dict_get+0x10c) [0x7f9361aefb3c] ) 0-dict: !this || key=trusted.glusterfs.dht.mds [Invalid argument]

pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)

patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2018-09-25 13:58:36
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.2
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xa0)[0x7f9361af8cc0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f9361b02c04]
/lib64/libc.so.6(+0x36280)[0x7f9360158280]
/lib64/libglusterfs.so.0(+0x3b6fa)[0x7f9361b086fa]
/lib64/libglusterfs.so.0(inode_parent+0x52)[0x7f9361b09822]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0xc243)[0x7f934f95c243]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0x3e1d8)[0x7f934f98e1d8]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0x3ea2b)[0x7f934f98ea2b]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0x3ead5)[0x7f934f98ead5]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0x3ecf8)[0x7f934f98ecf8]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0x29d7c)[0x7f934f979d7c]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0x2a184)[0x7f934f97a184]
/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325)[0x7f93618ba955]
/lib64/libgfrpc.so.0(rpcsvc_notify+0x10b)[0x7f93618bab3b]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f93618bca73]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x7566)[0x7f93566e2566]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x9b0c)[0x7f93566e4b0c]
/lib64/libglusterfs.so.0(+0x894c4)[0x7f9361b564c4]
/lib64/libpthread.so.0(+0x7dd5)[0x7f9360957dd5]
/lib64/libc.so.6(clone+0x6d)[0x7f9360220b3d]
---------
If this is fairly reproducible and we consider this use case important, why are we not marking it as a blocker for 3.4.2 so that it can come to the triage queue of blocker/exception proposed bugs? What's blocking us here? (Of course, not every BZ found through an automation test should be marked as a blocker, but this one seems important.)

Jiffin - have we had a chance to look at the automation test that leads to this crash? Have we tried the same in our local setup?
(In reply to Atin Mukherjee from comment #6)
> If this is fairly reproducible and we find this use case to be important why
> we're not marking it as blocker for 3.4.2 so that this can come to the
> triage queue of blocker/exception proposed bugs? What's blocking us here?

I was waiting for the discussion on flags and keywords to close properly on the program mailing list, and hence approached via need_info. This wasn't an in-flight bug. However, based on that discussion, I am moving forward with the explanation below.

During the automation runs on the NFS client, we are seeing either:
1. a client hang - Bug 1648783, or
2. an NFS crash - Bug 1633177.

From the automation-runs perspective, I consider these AutomationBlockers for NFS protocols (different use cases), and hence I am setting the appropriate keyword and flag for traction and a decision.
Checked manually and through automation. No crash was observed even after executing the test case multiple times. Observed the hang (bz 1648783) a couple of times.

Volume types used: Distribute, Distributed-Replicate, Distributed-Replicate (arbiter)

Verified in version: glusterfs-3.12.2-36.el7rhgs.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0263