Description of problem:
=======================
We recently found a reproducible issue in 3.7.5 which causes the NFS service to be repeatedly taken offline when an in-use volume is stopped.

How reproducible:
=================
100%

Methods of reproducing:
=======================
A) Have an active NFS mount from a Linux client, and while data is being either read from or written to that mount, issue a "volume stop" on gluster. To simulate IO, I'm using a simple dd from /dev/zero.

B) Similar to A, but instead of having active data movement, simply have a shell on the client sitting in the mounted directory. Once the volume is stopped, perform an "ls" from the client to trigger the crash. This only works if you were already in the mounted directory when the stop was issued.

(A scripted version of both methods is included at the end of this report.)

Actual results:
===============
For either A or B, the NFS service on the gluster node the client was connected to will continue to crash at roughly 5-minute intervals if manually brought back online after each crash. This will continue until the offending hung process on the client is killed, or the gluster volume is brought back online.

Each time the NFS service crashes, a large core dump is left in "/" on the gluster node the NFS client was communicating with. The dump from this test was 641MB.

Log information:
================
(from nfs.log)

[2016-01-29 23:48:58.996528] E [nfs3.c:2303:nfs3_write] 0-nfs-nfsv3: Failed to map FH to vol: client=10.1.254.125:872, exportid=d9c54d47-26ed-4305-9650-042d28e79234, gfid=f38a51a5-9977-4de5-a12b-792b6bfd30a0
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2016-01-29 23:48:58
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.5
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x7f30494309b6]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x32f)[0x7f304945051f]
/lib64/libc.so.6(+0x326a0)[0x7f3047dd06a0]
/usr/lib64/glusterfs/3.7.5/xlator/nfs/server.so(nfs3_write+0x244)[0x7f303b1ea724]
/usr/lib64/glusterfs/3.7.5/xlator/nfs/server.so(nfs3svc_write+0xbc)[0x7f303b1eab6c]
/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x314)[0x7f30491f9f74]
/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x103)[0x7f30491fa173]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x7f30491fbb28]
/usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so(+0xabd5)[0x7f303df82bd5]
/usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so(+0xc7bd)[0x7f303df847bd]
/usr/lib64/libglusterfs.so.0(+0x8b180)[0x7f3049496180]
/lib64/libpthread.so.0(+0x7a51)[0x7f304851ca51]
/lib64/libc.so.6(clone+0x6d)[0x7f3047e8693d]
---------

Environment Info:
=================
This is a 3 node cluster; node 1 is only for quorum, and nodes 2/3 serve data from 1x2 replicated volumes. We utilize CTDB for NFS HA.

This failure has been reproduced several times in 2 identically set up clusters in different datacenters.

"ctdb status" and "peer status" show healthy prior to starting the tests (exact commands below).

Underlying bricks are XFS, backed by iSCSI SAN LUNs, carved up via LVM.

This is reproducible on newly created volumes.
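For reference, these are the pre-test health checks (output elided here, but every node reported healthy/connected in both clusters):

[root ~]$ ctdb status
[root ~]$ gluster peer status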
(this is the volume I was using when generating the above nfs.log error)

[root ~]$ gluster volume info res_temp

Volume Name: res_temp
Type: Replicate
Volume ID: d9c54d47-26ed-4305-9650-042d28e79234
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gfs-int02.mgmt:/data/glusterfs/res_temp_brick1/brick1
Brick2: gfs-int03.mgmt:/data/glusterfs/res_temp_brick1/brick1
Options Reconfigured:
nfs.rpc-auth-allow: 10.123.12.47,10.1.254.125
performance.readdir-ahead: on
nfs.export-volumes: on
nfs.addr-namelookup: Off
nfs.disable: off
network.ping-timeout: 5
cluster.server-quorum-type: server
cluster.server-quorum-ratio: 51%

[root ~]$ xfs_info /dev/mapper/int-res_temp_brick1
meta-data=/dev/mapper/int-res_temp_brick1 isize=512  agcount=4, agsize=25600000 blks
         =                       sectsz=4096  attr=2, projid32bit=0
data     =                       bsize=4096   blocks=102400000, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=50000, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

[root ~]$ cat /etc/issue
CentOS release 6.7 (Final)
Kernel \r on an \m

[root ~]$ uname -a
Linux gfs-int02.mgmt 2.6.32-573.7.1.el6.x86_64 #1 SMP Tue Sep 22 22:00:00 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

[root ~]$ yum list installed | grep gluster
glusterfs.x86_64                  3.7.5-1.el6   @nwea-util
glusterfs-api.x86_64              3.7.5-1.el6   @nwea-util
glusterfs-cli.x86_64              3.7.5-1.el6   @nwea-util
glusterfs-client-xlators.x86_64   3.7.5-1.el6   @nwea-util
glusterfs-fuse.x86_64             3.7.5-1.el6   @nwea-util
glusterfs-geo-replication.x86_64  3.7.5-1.el6   @nwea-util
glusterfs-libs.x86_64             3.7.5-1.el6   @nwea-util
glusterfs-server.x86_64           3.7.5-1.el6   @nwea-util

Please let me know if further information or specific full log files would be helpful.
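To make the repro steps concrete, here is roughly what I run (a sketch: the mount point /mnt/res_temp, the test filename, and the dd sizes are arbitrary, and mounting via the gfs-int02.mgmt hostname is a simplification of our CTDB-managed NFS address):

# --- Method A: active IO during the stop ---
# On the Linux client:
mount -t nfs -o vers=3,tcp gfs-int02.mgmt:/res_temp /mnt/res_temp
dd if=/dev/zero of=/mnt/res_temp/ddtest bs=1M count=2048 &

# On a gluster node, while the dd is still running (answer y at the prompt):
gluster volume stop res_temp

# --- Method B: idle shell inside the mount ---
# On the client, with no IO running:
cd /mnt/res_temp
# On a gluster node:
gluster volume stop res_temp
# Back on the client; this ls is what triggers the crash:
ls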
If possible, could you upload the core as well?
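If uploading the full core is not practical, a gdb backtrace from it would also help. A sketch, assuming the NFS server binary is /usr/sbin/glusterfs and the core is the one left in / as described (the <pid> suffix is whatever your core file carries):

# Install matching debug symbols first so the frames resolve
# (e.g. the glusterfs-debuginfo package for 3.7.5-1.el6).
gdb /usr/sbin/glusterfs /core.<pid>
(gdb) set pagination off
(gdb) thread apply all bt full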
This bug is getting closed because GlusterFS-3.7 has reached its end-of-life.

Note: This bug is being closed using a script. No verification has been performed to check if it still exists on newer releases of GlusterFS. If this bug still exists in newer GlusterFS releases, please reopen this bug against the newer release.