Bug 893778

Summary: Gluster 3.3.1 NFS service died after writing a bunch of data
Product: [Community] GlusterFS
Reporter: Rob <robinr>
Component: nfs
Assignee: Vivek Agarwal <vagarwal>
Status: CLOSED DUPLICATE
QA Contact:
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 3.3.0
CC: gluster-bugs, jlu, nock, rhs-bugs, sankarshan, shaines, spradhan, vbellur, wica128
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 902857 (view as bug list)
Environment:
Last Closed: 2013-08-29 18:46:52 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 902857, 998649
Attachments (Description / Flags):
  core dump file (as requested) / none
  another core dump file (as requested) / none

Description Rob 2013-01-09 21:51:07 UTC
Description of problem:
Gluster NFS service died.

Version-Release number of selected component (if applicable):
3.3.1

How reproducible:
Unsure. 

Steps to Reproduce:
1. Mount the NFS storage
2. Write a bunch of data
3. Gluster NFS service died
  
Actual results:

Gluster NFS service died.

Expected results:

Gluster NFS service stays up.

Additional info:

Relevant logs are below:

[2013-01-09 14:46:37.198861] W [quota.c:2177:quota_fstat_cbk] 0-RedhawkShared-quota: quota context not set in inode (gfid:1d16621a-b000-401b-b51e-f554f201705a)

pending frames:

patchset: git://git.gluster.com/glusterfs.git
signal received: 6
time of crash: 2013-01-09 16:33:06
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.3.1
/lib64/libc.so.6[0x381c832920]
/lib64/libc.so.6(gsignal+0x35)[0x381c8328a5]
/lib64/libc.so.6(abort+0x175)[0x381c834085]
/lib64/libc.so.6[0x381c86fa37]
/lib64/libc.so.6(__fortify_fail+0x37)[0x381c9012a7]
/lib64/libc.so.6(__fortify_fail+0x0)[0x381c901270]
/usr/lib64/glusterfs/3.3.1/xlator/nfs/server.so(+0x30f47)[0x7f597f9a9f47]
/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x293)[0x3c9e40a443]
/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x93)[0x3c9e40a5b3]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x3c9e40b018]
/usr/lib64/glusterfs/3.3.1/rpc-transport/socket.so(socket_event_poll_in+0x34)[0x7f5985236924]
/usr/lib64/glusterfs/3.3.1/rpc-transport/socket.so(socket_event_handler+0xc7)[0x7f5985236a07]
/usr/lib64/libglusterfs.so.0[0x3c9e03ed14]
/usr/sbin/glusterfs(main+0x58a)[0x40741a]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x381c81ecdd]
/usr/sbin/glusterfs[0x4043c9]
---------
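For context when reading the trace: the abort is reached through glibc's __fortify_fail from inside the NFS xlator (nfs/server.so), which is what glibc raises when a FORTIFY_SOURCE / stack-protector check fails (e.g. "buffer overflow detected"). A minimal sketch of how the core could be inspected for a symbol-resolved backtrace, assuming a glusterfs-debuginfo package matching 3.3.1-1.el6 is available and using a placeholder core path:

# debuginfo-install glusterfs          (from yum-utils; needs the matching 3.3.1 debuginfo repo)
# gdb /usr/sbin/glusterfs /path/to/core    (/path/to/core is a placeholder for the dumped core file)
(gdb) bt                    # backtrace of the crashing thread
(gdb) thread apply all bt   # backtraces of every thread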

Comment 2 Rob 2013-01-10 14:19:48 UTC
### The following info is set

Also, I'm getting a lot of "quota context not set in inode" messages.

# gluster volume info RedhawkShared
 
Volume Name: RedhawkShared
Type: Replicate
Volume ID: f9b943f8-dcb9-448f-a8b2-795d3c19ef3d
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: mualglup01:/mnt/gluster/RedhawkShared
Brick2: mualglup02:/mnt/gluster/RedhawkShared
Options Reconfigured:
nfs.register-with-portmap: 1
nfs.disable: off
auth.allow: 10.0.72.135,10.0.93.*,192.168.251.*
features.quota: on
features.limit-usage: /robintest:1MB
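
Since quota is enabled on this volume, the configured limit and current usage can be cross-checked, and the Gluster NFS server status verified, with the commands below (a sketch against this volume, using gluster 3.3 syntax):

# gluster volume quota RedhawkShared list     (shows the limit and usage for /robintest)
# gluster volume status RedhawkShared nfs     (confirms whether the NFS server process is online)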

Comment 3 Rob 2013-01-10 14:23:29 UTC
I have 2 nodes; the other replica node is mistakenly running 3.3.0. It did not have these NFS problems.

The node with the following RPMs has issues:
glusterfs-server-3.3.1-1.el6.x86_64
glusterfs-fuse-3.3.1-1.el6.x86_64
glusterfs-3.3.1-1.el6.x86_64
 
The node with the following RPMs did _NOT_ have issues:
glusterfs-3.3.0-1.el6.x86_64
glusterfs-server-3.3.0-1.el6.x86_64
glusterfs-fuse-3.3.0-1.el6.x86_64
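
A quick way to confirm which version each peer is actually running, and that the peers still see each other (a sketch; run on both nodes):

# rpm -qa 'glusterfs*'     (installed GlusterFS packages on this node)
# gluster --version        (version of the CLI/daemon)
# gluster peer status      (confirms the replica peers are connected)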

Comment 4 Rob 2013-01-10 15:08:44 UTC
I have the core dumps. I can attach them if you want. Platform is RHEL 6.2.
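
A minimal sketch of how the cores can be located and packaged before attaching; the path and file name below are placeholders, since the actual location depends on the kernel's core_pattern setting (often the daemon's working directory):

# cat /proc/sys/kernel/core_pattern     (shows where and how the kernel names core files)
# ls -lh /core*                         (placeholder: the default pattern drops "core" in the process cwd)
# gzip /core.12345                      (compress the placeholder-named core before attaching)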

Comment 5 Kaleb KEITHLEY 2013-01-22 14:26:55 UTC
(In reply to comment #4)
> I have the core dumps. I can attach them if you want. Platform is RHEL 6.2.

Yes, please provide the core file(s). Thanks

Comment 6 Rob 2013-01-22 15:36:34 UTC
Created attachment 685250 [details]
core dump file (as requested)

Comment 7 Rob 2013-01-22 15:37:21 UTC
Created attachment 685251 [details]
another core dump file (as requested)

Comment 8 Shawn Nock 2013-01-24 21:11:45 UTC
I am seeing this behaviour on one of my nodes as well. 2 upgraded successfully, one fails to start the NFS service with a very similar backtrace.

Comment 9 Shawn Nock 2013-01-24 21:37:13 UTC
(In reply to comment #8)
> I am seeing this behaviour on one of my nodes as well. 2 upgraded
> successfully, one fails to start the NFS service with a very similar
> backtrace.

Reverting to 3.3.0-1 fixes the problem (the NFS server stays online).
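
For anyone else needing the same workaround, the revert amounts to downgrading the three GlusterFS packages and restarting glusterd; a rough sketch using the el6 package names from comment 3 (exact names differ per distro, and it assumes the 3.3.0-1 packages are still available in a configured repo):

# yum downgrade glusterfs-3.3.0-1.el6 glusterfs-server-3.3.0-1.el6 glusterfs-fuse-3.3.0-1.el6
# service glusterd restart     (glusterd manages the NFS server process and will respawn it)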

Comment 10 Shawn Nock 2013-01-24 21:51:38 UTC
All 3.3.1-1 NFS servers eventually crashed. I've had to revert them all to 3.3.0-1.

Given volfile:
+------------------------------------------------------------------------------+
  1: volume mirror-client-0
  2:     type protocol/client
  3:     option remote-host lauterbur
  4:     option remote-subvolume /raid/mirror
  5:     option transport-type tcp
  6: end-volume
  7: 
  8: volume mirror-client-1
  9:     type protocol/client
 10:     option remote-host mansfield
 11:     option remote-subvolume /raid/mirror
 12:     option transport-type tcp
 13: end-volume
 14: 
 15: volume mirror-client-2
 16:     type protocol/client
 17:     option remote-host ogawa
 18:     option remote-subvolume /raid/mirror
 19:     option transport-type tcp
 20: end-volume
 21: 
 22: volume mirror-client-3
 23:     type protocol/client
 24:     option remote-host rabi
 25:     option remote-subvolume /raid/mirror
 26:     option transport-type tcp
 27: end-volume
 28: 
 29: volume mirror-client-4
 30:     type protocol/client
 31:     option remote-host rabi
 32:     option remote-subvolume /raid/mirror2
 33:     option transport-type tcp
 34: end-volume
 35: 
 36: volume mirror-client-5
 37:     type protocol/client
 38:     option remote-host ogawa
 39:     option remote-subvolume /raid/mirror2
 40:     option transport-type tcp
 41: end-volume
 42: 
 43: volume mirror-replicate-0
 44:     type cluster/replicate
 45:     subvolumes mirror-client-0 mirror-client-1
 46: end-volume
 47: 
 48: volume mirror-replicate-1
 49:     type cluster/replicate
 50:     subvolumes mirror-client-2 mirror-client-3
 51: end-volume
 52: 
 53: volume mirror-replicate-2
 54:     type cluster/replicate
 55:     subvolumes mirror-client-4 mirror-client-5
 56: end-volume
 57: 
 58: volume mirror-dht
 59:     type cluster/distribute
 60:     subvolumes mirror-replicate-0 mirror-replicate-1 mirror-replicate-2
 61: end-volume
 62: 
 63: volume mirror
 64:     type debug/io-stats
 65:     option latency-measurement off
 66:     option count-fop-hits off
 67:     subvolumes mirror-dht
 68: end-volume
 69: 
 70: volume stripe-client-0
 71:     type protocol/client
 72:     option remote-host lauterbur
 73:     option remote-subvolume /raid/stripe
 74:     option transport-type tcp
 75: end-volume
 76: 
 77: volume stripe-client-1
 78:     type protocol/client
 79:     option remote-host mansfield
 80:     option remote-subvolume /raid/stripe
 81:     option transport-type tcp
 82: end-volume
 83: 
 84: volume stripe-dht
 85:     type cluster/distribute
 86:     subvolumes stripe-client-0 stripe-client-1
 87: end-volume
 88: 
 89: volume stripe
 90:     type debug/io-stats
 91:     option latency-measurement off
 92:     option count-fop-hits off
 93:     subvolumes stripe-dht
 94: end-volume
 95: 
 96: volume nfs-server
 97:     type nfs/server
 98:     option nfs.dynamic-volumes on
 99:     option nfs.nlm on
100:     option rpc-auth.addr.stripe.allow 127.0.0.1,10.1.3.*,10.1.2.*
101:     option nfs3.stripe.volume-id 7bc6050a-6846-4aae-bfd1-af5930efe95f
102:     option rpc-auth.addr.mirror.allow 127.0.0.1,10.1.3.*,10.1.2.*
103:     option nfs3.mirror.volume-id 2361e511-42a2-4d95-a99b-a1461236f78c
104:     option nfs.enable-ino32 yes
105:     option rpc-auth.addr.namelookup off
106:     option rpc-auth.ports.stripe.insecure on
107:     option rpc-auth.ports.mirror.insecure on
108:     option nfs3.mirror.trusted-sync on
109:     subvolumes stripe mirror
110: end-volume

+------------------------------------------------------------------------------+
[2013-01-24 16:48:57.775744] I [rpc-clnt.c:1657:rpc_clnt_reconfig] 0-stripe-client-0: changing port to 24010 (from 0)
[2013-01-24 16:48:57.775813] I [rpc-clnt.c:1657:rpc_clnt_reconfig] 0-mirror-client-0: changing port to 24009 (from 0)
[2013-01-24 16:48:57.775878] I [rpc-clnt.c:1657:rpc_clnt_reconfig] 0-stripe-client-1: changing port to 24010 (from 0)
[2013-01-24 16:48:57.775948] I [rpc-clnt.c:1657:rpc_clnt_reconfig] 0-mirror-client-1: changing port to 24009 (from 0)
[2013-01-24 16:48:57.775989] I [rpc-clnt.c:1657:rpc_clnt_reconfig] 0-mirror-client-3: changing port to 24010 (from 0)
[2013-01-24 16:48:57.776067] I [rpc-clnt.c:1657:rpc_clnt_reconfig] 0-mirror-client-2: changing port to 24010 (from 0)
[2013-01-24 16:48:57.776107] I [rpc-clnt.c:1657:rpc_clnt_reconfig] 0-mirror-client-5: changing port to 24011 (from 0)
[2013-01-24 16:48:57.776170] I [rpc-clnt.c:1657:rpc_clnt_reconfig] 0-mirror-client-4: changing port to 24011 (from 0)
[2013-01-24 16:48:58.416280] E [nfs3.c:1549:nfs3_access] 0-nfs-nfsv3: Volume is disabled: stripe
pending frames:

patchset: git://git.gluster.com/glusterfs.git
signal received: 6
time of crash: 2013-01-24 16:48:58
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.3.1
/lib64/libc.so.6(+0x36300)[0x7fa8c5232300]
/lib64/libc.so.6(gsignal+0x35)[0x7fa8c5232285]
/lib64/libc.so.6(abort+0x17b)[0x7fa8c5233b9b]
/lib64/libc.so.6(+0x77a7e)[0x7fa8c5273a7e]
/lib64/libc.so.6(__fortify_fail+0x37)[0x7fa8c5304af7]
/lib64/libc.so.6(__fortify_fail+0x0)[0x7fa8c5304ac0]
/usr/lib64/glusterfs/3.3.1/xlator/nfs/server.so(+0x29799)[0x7fa8c0f55799]
/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x258)[0x7fa8c5f8b1c8]
/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x9b)[0x7fa8c5f8b7fb]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x7fa8c5f8f367]
/usr/lib64/glusterfs/3.3.1/rpc-transport/socket.so(socket_event_poll_in+0x34)[0x7fa8c24b5c64]
/usr/lib64/glusterfs/3.3.1/rpc-transport/socket.so(socket_event_handler+0xc7)[0x7fa8c24b5fb7]
/usr/lib64/libglusterfs.so.0(+0x3e8f7)[0x7fa8c61d98f7]
/usr/sbin/glusterfs(main+0x34d)[0x40478d]
/lib64/libc.so.6(__libc_start_main+0xed)[0x7fa8c521d69d]
/usr/sbin/glusterfs[0x404a55]
---------

Comment 11 Rajesh 2013-01-28 07:04:29 UTC
*** Bug 893779 has been marked as a duplicate of this bug. ***

Comment 12 Shawn Nock 2013-02-21 17:11:14 UTC
Any updates, or a link to a patch on the code-review site?

Comment 13 Jiri Hoogeveen 2013-03-05 19:34:55 UTC
Hi,

I have seen this bug on the mailing list and wanted to report that I don't have this issue with glusterfs 3.3.1 and NFS.

I have been running 3.3.1 since 26 Oct 2012, using it for running VMs in a vSphere cluster and for data storage.

One thing that may matter: I use round-robin in my DNS for the NFS hostname, so every peer can be an NFS server at any time.

OS: Ubuntu 12.04.1
packages from http://ppa.launchpad.net/semiosis/ubuntu-glusterfs-3.3/ubuntu

Volume info:

gluster volume info glusterfsvol01

Volume Name: glusterfsvol01
Type: Distributed-Replicate
Volume ID: 1013b94c-7299-46b5-907a-fe7f2ae51f0b
Status: Started
Number of Bricks: 18 x 2 = 36
Transport-type: tcp
Bricks:
Brick1: gluster-brick-01n1:/export/vol1
Brick2: gluster-brick-02n1:/export/vol1
Brick3: gluster-brick-03n1:/export/vol1
Brick4: gluster-brick-01n2:/export/vol1
Brick5: gluster-brick-02n2:/export/vol1
Brick6: gluster-brick-03n2:/export/vol1
Brick7: gluster-brick-01n1:/export/vol2
Brick8: gluster-brick-02n1:/export/vol2
Brick9: gluster-brick-03n1:/export/vol2
Brick10: gluster-brick-01n2:/export/vol2
Brick11: gluster-brick-02n2:/export/vol2
Brick12: gluster-brick-03n2:/export/vol2
Brick13: gluster-brick-01n1:/export/vol3
Brick14: gluster-brick-02n1:/export/vol3
Brick15: gluster-brick-03n1:/export/vol3
Brick16: gluster-brick-01n2:/export/vol3
Brick17: gluster-brick-02n2:/export/vol3
Brick18: gluster-brick-03n2:/export/vol3
Brick19: gluster-brick-01n1:/export/vol4
Brick20: gluster-brick-02n1:/export/vol4
Brick21: gluster-brick-03n1:/export/vol4
Brick22: gluster-brick-01n2:/export/vol4
Brick23: gluster-brick-02n2:/export/vol4
Brick24: gluster-brick-03n2:/export/vol4
Brick25: gluster-brick-01n1:/export/vol5
Brick26: gluster-brick-02n1:/export/vol5
Brick27: gluster-brick-03n1:/export/vol5
Brick28: gluster-brick-01n2:/export/vol5
Brick29: gluster-brick-02n2:/export/vol5
Brick30: gluster-brick-03n2:/export/vol5
Brick31: gluster-brick-01n1:/export/vol6
Brick32: gluster-brick-02n1:/export/vol6
Brick33: gluster-brick-03n1:/export/vol6
Brick34: gluster-brick-01n2:/export/vol6
Brick35: gluster-brick-02n2:/export/vol6
Brick36: gluster-brick-03n2:/export/vol6
Options Reconfigured:
diagnostics.client-sys-log-level: WARNING
diagnostics.brick-sys-log-level: WARNING
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.cache-size: 134217728
performance.io-thread-count: 64
performance.write-behind-window-size: 256MB
performance.io-cache: on
performance.read-ahead: on
auth.allow: 172.16.*
nfs.disable: off

Comment 14 Shawn Nock 2013-03-05 20:13:58 UTC
I am using the Fedora packages from gluster.org.

I too use DNS round-robin; however, the NFS process crashes on all servers (eventually, usually within an hour).

I am willing to provide more information, but I don't know how to proceed. Was anything discovered in the cores posted above or from the backtrace?
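
A sketch of what could be gathered next to help debug, assuming the default RPM log location and the volume names from the volfile above:

# tail -n 200 /var/log/glusterfs/nfs.log     (Gluster NFS server log around the crash)
# gluster volume status mirror nfs           (is the NFS process listed online for each volume?)
# gluster volume status stripe nfs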

Comment 15 santosh pradhan 2013-08-29 18:46:52 UTC

*** This bug has been marked as a duplicate of bug 1002385 ***