Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1273267

Summary: nfs-ganesha: nfs-ganesha server process segfaults and post failover the I/O doesn't resume
Product: [Community] GlusterFS
Reporter: Saurabh <saujain>
Component: ganesha-nfs
Assignee: Kaleb KEITHLEY <kkeithle>
Status: CLOSED EOL
QA Contact:
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.7.5
CC: jthottan, kkeithle, mzywusko, ndevos, skoduri
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1298300 (view as bug list)
Environment:
Last Closed: 2017-03-08 11:00:58 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1298300    
Attachments:
Description      Flags
vm1 messages     none
vm4 messages     none

Description Saurabh 2015-10-20 06:06:40 UTC
Created attachment 1084605 [details]
vm1 messages

Description of problem:
I created a tiered volume and started I/O on an nfs-ganesha mount with vers=4; the I/O workload is the ltp test suite. The tests hang: the nfs-ganesha server process segfaulted, and although failover happens, the I/O still does not resume.

Version-Release number of selected component (if applicable):
nfs-ganesha-2.3-0.rc6.el7.centos.x86_64
glusterfs-3.7.5-1.el7.x86_64

How reproducible:
Segfault seen on the first attempt.

Steps to Reproduce:
1. Create a dist-rep volume with tiering enabled
2. Export the volume over nfs-ganesha and mount it with vers=4
3. Execute the fs-sanity (ltp) test suite; a command-line sketch of these steps follows
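
For reference, a minimal command-line sketch of these steps (the host names, brick paths, volume name, and VIP below are placeholders/assumptions, not the exact values from this setup):

  # On a storage node: create a 2x2 dist-rep volume and attach a hot tier to it
  gluster volume create vol3 replica 2 vm1:/bricks/vol3/b1 vm2:/bricks/vol3/b1 vm3:/bricks/vol3/b1 vm4:/bricks/vol3/b1
  gluster volume start vol3
  gluster volume attach-tier vol3 replica 2 vm1:/bricks/vol3/hot vm2:/bricks/vol3/hot vm3:/bricks/vol3/hot vm4:/bricks/vol3/hot

  # Export the volume through nfs-ganesha (per-volume export flag)
  gluster volume set vol3 ganesha.enable on

  # On the client: mount one of the cluster VIPs over NFSv4 and start the workload
  VIP=192.0.2.10                        # placeholder for the vm1-cluster_ip-1 virtual IP
  mount -t nfs -o vers=4 ${VIP}:/vol3 /mnt/vol3
  cd /mnt/vol3 && runltp -f fs          # or the in-house fs-sanity wrapper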

Actual results:
While the ltp test suite is executing, the nfs-ganesha process hits a segfault, as can be seen in /var/log/messages:
Oct 20 05:21:20 vm1 kernel: ganesha.nfsd[9750]: segfault at 0 ip 00000000004b0ede sp 00007f59122a0ae0 error 4 in ganesha.nfsd[400000+1df000]
Oct 20 05:21:21 vm1 systemd: nfs-ganesha.service: main process exited, code=killed, status=11/SEGV
Oct 20 05:21:21 vm1 systemd: Unit nfs-ganesha.service entered failed state.
Oct 20 05:21:31 vm1 cibadmin[21227]: notice: Additional logging available in /var/log/pacemaker.log
Oct 20 05:21:31 vm1 cibadmin[21227]: notice: Invoked: /usr/sbin/cibadmin --replace -o configuration -V --xml-pipe
Oct 20 05:21:31 vm1 crmd[19954]: notice: Operation vm1-dead_ip-1_monitor_0: not running (node=vm1, call=119, rc=7, cib-update=142, confirmed=true)
Oct 20 05:21:31 vm1 crmd[19954]: notice: Operation vm1-dead_ip-1_start_0: ok (node=vm1, call=120, rc=0, cib-update=143, confirmed=true)
Oct 20 05:21:38 vm1 IPaddr(vm1-cluster_ip-1)[21296]: INFO: IP status = ok, IP_CIP=
Oct 20 05:21:38 vm1 crmd[19954]: notice: Operation vm1-cluster_ip-1_stop_0: ok (node=vm1, call=123, rc=0, cib-update=145, confirmed=true)
Oct 20 05:21:38 vm1 crmd[19954]: notice: Operation nfs-grace_stop_0: ok (node=vm1, call=125, rc=0, cib-update=146, confirmed=true)
Oct 20 05:21:38 vm1 crmd[19954]: notice: Operation vm1-trigger_ip-1_stop_0: ok (node=vm1, call=127, rc=0, cib-update=147, confirmed=true)
Oct 20 05:21:38 vm1 crmd[19954]: notice: Operation nfs-grace_start_0: ok (node=vm1, call=128, rc=0, cib-update=148, confirmed=true)
Oct 20 05:21:48 vm1 logger: warning: pcs resource create vm1-dead_ip-1 ocf:heartbeat:Dummy failed

The failover itself does happen, as per the pcs status below; a quick way to verify this on the takeover node is sketched after the listing:
 vm1-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started vm4
 vm1-trigger_ip-1	(ocf::heartbeat:Dummy):	Started vm4
 vm2-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started vm2
 vm2-trigger_ip-1	(ocf::heartbeat:Dummy):	Started vm2
 vm3-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started vm3
 vm3-trigger_ip-1	(ocf::heartbeat:Dummy):	Started vm3
 vm4-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started vm4
 vm4-trigger_ip-1	(ocf::heartbeat:Dummy):	Started vm4
 vm1-dead_ip-1	(ocf::heartbeat:Dummy):	Started vm1
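
To double-check the failover on the takeover node, something like the following can be used (the VIP value is an assumed placeholder):

  # On vm4: the virtual IP that belonged to vm1 should now be plumbed locally
  ip addr show | grep 192.0.2.10        # placeholder for the vm1-cluster_ip-1 address
  pcs status resources | grep cluster_ip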


But even after the failover, the I/O does not resume; the nfs-ganesha log on the failed-over node shows the errors below:

19/10/2015 23:49:57 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[dbus_heartbeat] glusterfs_create_export :FSAL :EVENT :Volume vol3 exported at : '/'
20/10/2015 05:22:16 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-16] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:16 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-16] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f781c0109c0
20/10/2015 05:22:16 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-12] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:16 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-12] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f7814026c30
20/10/2015 05:22:21 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-14] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:21 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-14] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f781000b1d0
20/10/2015 05:22:24 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-8] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:24 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-8] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f77f803ee00
20/10/2015 05:22:24 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-14] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:24 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-14] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f78180352c0
20/10/2015 05:22:24 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-16] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-16] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f77f8020bd0
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-9] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-9] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f77f803ee00
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-10] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-10] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f784c037f50
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-9] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-9] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f784c037f50
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-4] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-4] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f77f803ee00
20/10/2015 05:22:26 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-9] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:26 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-9] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f77f803ee00
20/10/2015 05:22:28 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-1] cache_inode_lookup_impl :INODE :EVENT :FSAL returned STALE from a lookup.
20/10/2015 05:22:28 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-3] cache_inode_lookup_impl :INODE :EVENT :FSAL returned STALE from a lookup.
20/10/2015 06:09:51 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy.  Not sending heartbeat
20/10/2015 06:11:06 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy.  Not sending heartbeat


Expected results:
Even if nfs-ganesha has segfaulted, the failover should let the I/O resume.
Also, the segfault itself needs to be fixed.

Additional info:
The coredump for the segfault was not found; I will run the test again and see whether it can be reproduced.
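
If it reproduces, one way to make sure a core file is captured next time is sketched below (assuming the systemd-managed nfs-ganesha.service; the drop-in path and core location are assumptions):

  # Lift the core size limit for ganesha.nfsd via a systemd drop-in
  mkdir -p /etc/systemd/system/nfs-ganesha.service.d
  printf '[Service]\nLimitCORE=infinity\n' > /etc/systemd/system/nfs-ganesha.service.d/core.conf
  # Write cores to a known location
  mkdir -p /var/crash
  echo '/var/crash/core.%e.%p' > /proc/sys/kernel/core_pattern
  systemctl daemon-reload && systemctl restart nfs-ganesha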

Comment 1 Saurabh 2015-10-20 06:11:09 UTC
Created attachment 1084606 [details]
vm4 messages

Comment 3 Kaushal 2017-03-08 11:00:58 UTC
This bug is being closed because GlusterFS-3.7 has reached its end-of-life.

Note: This bug is being closed using a script. No verification has been performed to check if it still exists on newer releases of GlusterFS.
If this bug still exists in newer GlusterFS releases, please reopen this bug against the newer release.