Description of problem:
=======================
We recently found a reproducible issue in 3.7.5 which causes the NFS service to be repeatedly taken offline when an in-use volume is stopped.

How reproducible:
=================
100%

Methods of reproducing:
=======================
A) Have an active NFS mount from a Linux client, and while data is being either read from or written to that mount, issue a "volume stop" on gluster. To simulate IO, I'm using a simple dd from /dev/zero.

B) Similar to A, but instead of having active data movement, simply have a shell on the client sitting in the mounted directory. Once the volume is stopped, perform an "ls" from the client to trigger the crash. This only works if you were already in the mounted directory when the stop was issued.

(A scripted version of both methods is included at the end of this report.)

Actual results:
===============
For either A or B, the NFS service on the gluster node the client was connected to will continue to crash at roughly 5-minute intervals if manually brought back online after each crash. This will continue until the offending hung process on the client is killed, or the gluster volume is brought back online.

Each time the NFS service crashes, a large core dump is left in "/" on the gluster node the NFS client was communicating with. The dump from this test was 641MB.

Log information:
================
(from nfs.log)

[2016-01-29 23:48:58.996528] E [nfs3.c:2303:nfs3_write] 0-nfs-nfsv3: Failed to map FH to vol: client=10.1.254.125:872, exportid=d9c54d47-26ed-4305-9650-042d28e79234, gfid=f38a51a5-9977-4de5-a12b-792b6bfd30a0
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2016-01-29 23:48:58
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.5
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x7f30494309b6]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x32f)[0x7f304945051f]
/lib64/libc.so.6(+0x326a0)[0x7f3047dd06a0]
/usr/lib64/glusterfs/3.7.5/xlator/nfs/server.so(nfs3_write+0x244)[0x7f303b1ea724]
/usr/lib64/glusterfs/3.7.5/xlator/nfs/server.so(nfs3svc_write+0xbc)[0x7f303b1eab6c]
/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x314)[0x7f30491f9f74]
/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x103)[0x7f30491fa173]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x7f30491fbb28]
/usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so(+0xabd5)[0x7f303df82bd5]
/usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so(+0xc7bd)[0x7f303df847bd]
/usr/lib64/libglusterfs.so.0(+0x8b180)[0x7f3049496180]
/lib64/libpthread.so.0(+0x7a51)[0x7f304851ca51]
/lib64/libc.so.6(clone+0x6d)[0x7f3047e8693d]
---------

Environment Info:
=================
This is a 3 node cluster; node 1 is only for quorum, and nodes 2/3 serve data from 1x2 replicated volumes. We utilize CTDB for NFS HA.

This failure has been reproduced several times in 2 identically set up clusters in different datacenters.

"ctdb status" and "peer status" show healthy prior to starting the tests (exact commands below).

Underlying bricks are XFS, backed by iSCSI SAN LUNs, carved up via LVM.

This is reproducible on newly created volumes.
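For reference, these are the pre-test health checks (output elided here, but every node reported healthy/connected in both clusters):

[root ~]$ ctdb status
[root ~]$ gluster peer status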
(this is the volume I was using when generating the above nfs.log error)

[root ~]$ gluster volume info res_temp

Volume Name: res_temp
Type: Replicate
Volume ID: d9c54d47-26ed-4305-9650-042d28e79234
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gfs-int02.mgmt:/data/glusterfs/res_temp_brick1/brick1
Brick2: gfs-int03.mgmt:/data/glusterfs/res_temp_brick1/brick1
Options Reconfigured:
nfs.rpc-auth-allow: 10.123.12.47,10.1.254.125
performance.readdir-ahead: on
nfs.export-volumes: on
nfs.addr-namelookup: Off
nfs.disable: off
network.ping-timeout: 5
cluster.server-quorum-type: server
cluster.server-quorum-ratio: 51%

[root ~]$ xfs_info /dev/mapper/int-res_temp_brick1
meta-data=/dev/mapper/int-res_temp_brick1 isize=512  agcount=4, agsize=25600000 blks
         =                       sectsz=4096  attr=2, projid32bit=0
data     =                       bsize=4096   blocks=102400000, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=50000, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

[root ~]$ cat /etc/issue
CentOS release 6.7 (Final)
Kernel \r on an \m

[root ~]$ uname -a
Linux gfs-int02.mgmt 2.6.32-573.7.1.el6.x86_64 #1 SMP Tue Sep 22 22:00:00 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

[root ~]$ yum list installed | grep gluster
glusterfs.x86_64                  3.7.5-1.el6   @nwea-util
glusterfs-api.x86_64              3.7.5-1.el6   @nwea-util
glusterfs-cli.x86_64              3.7.5-1.el6   @nwea-util
glusterfs-client-xlators.x86_64   3.7.5-1.el6   @nwea-util
glusterfs-fuse.x86_64             3.7.5-1.el6   @nwea-util
glusterfs-geo-replication.x86_64  3.7.5-1.el6   @nwea-util
glusterfs-libs.x86_64             3.7.5-1.el6   @nwea-util
glusterfs-server.x86_64           3.7.5-1.el6   @nwea-util

Please let me know if further information or specific full log files would be helpful.
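To make the repro steps concrete, here is roughly what I run (a sketch: the mount point /mnt/res_temp, the test filename, and the dd sizes are arbitrary, and mounting via the gfs-int02.mgmt hostname is a simplification of our CTDB-managed NFS address):

# --- Method A: active IO during the stop ---
# On the Linux client:
mount -t nfs -o vers=3,tcp gfs-int02.mgmt:/res_temp /mnt/res_temp
dd if=/dev/zero of=/mnt/res_temp/ddtest bs=1M count=2048 &

# On a gluster node, while the dd is still running (answer y at the prompt):
gluster volume stop res_temp

# --- Method B: idle shell inside the mount ---
# On the client, with no IO running:
cd /mnt/res_temp
# On a gluster node:
gluster volume stop res_temp
# Back on the client; this ls is what triggers the crash:
ls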
If possible, could you upload the core as well?
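If uploading the full core is not practical, a gdb backtrace from it would also help. A sketch, assuming the NFS server binary is /usr/sbin/glusterfs and the core is the one left in / as described (the <pid> suffix is whatever your core file carries):

# Install matching debug symbols first so the frames resolve
# (e.g. the glusterfs-debuginfo package for 3.7.5-1.el6).
gdb /usr/sbin/glusterfs /core.<pid>
(gdb) set pagination off
(gdb) thread apply all bt full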
This bug is getting closed because GlusterFS-3.7 has reached its end-of-life.

Note: This bug is being closed using a script. No verification has been performed to check if it still exists on newer releases of GlusterFS. If this bug still exists in newer GlusterFS releases, please reopen this bug against the newer release.