Bug 1273267 - nfs-ganesha: nfs-ganesha server process segfaults and post failover the I/O doesn't resume
Status: CLOSED EOL
Product: GlusterFS
Classification: Community
Component: ganesha-nfs
Version: 3.7.5
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Assigned To: Kaleb KEITHLEY
: Triaged
Depends On:
Blocks: 1298300
Reported: 2015-10-20 02:06 EDT by Saurabh
Modified: 2017-03-08 06:00 EST (History)
CC List: 5 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1298300
Environment:
Last Closed: 2017-03-08 06:00:58 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
vm1 messages (699.80 KB, text/plain)
2015-10-20 02:06 EDT, Saurabh
vm4 messages (943.99 KB, text/plain)
2015-10-20 02:11 EDT, Saurabh

Description Saurabh 2015-10-20 02:06:40 EDT
Created attachment 1084605 [details]
vm1 messages

Description of problem:
I created a tiered volume and started I/O on the nfs-ganesha mount with vers=4; the I/O was the LTP test suite. The tests hung because the nfs-ganesha server process segfaulted, and even though failover happened, the I/O still did not resume.

Version-Release number of selected component (if applicable):
nfs-ganesha-2.3-0.rc6.el7.centos.x86_64
glusterfs-3.7.5-1.el7.x86_64

How reproducible:
Segfault seen on the first attempt.

Steps to Reproduce:
1. create a volume of type dist-rep with tiering enabled
2. export the volume over nfs-ganesha and mount it with vers=4
3. execute the fs-sanity test suite
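
For reference, the steps above correspond roughly to the commands below. This is only a sketch: the brick paths, mount point, and cluster VIP are illustrative, and the exact tiering syntax differs between 3.7.x releases.

# 1. distributed-replicated volume with a hot tier attached (paths illustrative)
gluster volume create vol3 replica 2 vm1:/bricks/b1 vm2:/bricks/b1 vm3:/bricks/b1 vm4:/bricks/b1
gluster volume start vol3
gluster volume attach-tier vol3 replica 2 vm1:/bricks/hot1 vm2:/bricks/hot1

# 2. export over nfs-ganesha (assumes the ganesha HA cluster is already configured) and mount with NFSv4
gluster volume set vol3 ganesha.enable on
mount -t nfs -o vers=4 <cluster-VIP>:/vol3 /mnt/vol3

# 3. run the test suite against the mount
cd /mnt/vol3 && <run fs-sanity / LTP here>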

Actual results:
While the LTP test suite is executing, the nfs-ganesha process hits a segfault, as can be seen in /var/log/messages:
Oct 20 05:21:20 vm1 kernel: ganesha.nfsd[9750]: segfault at 0 ip 00000000004b0ede sp 00007f59122a0ae0 error 4 in ganesha.nfsd[400000+1df000]
Oct 20 05:21:21 vm1 systemd: nfs-ganesha.service: main process exited, code=killed, status=11/SEGV
Oct 20 05:21:21 vm1 systemd: Unit nfs-ganesha.service entered failed state.
Oct 20 05:21:31 vm1 cibadmin[21227]: notice: Additional logging available in /var/log/pacemaker.log
Oct 20 05:21:31 vm1 cibadmin[21227]: notice: Invoked: /usr/sbin/cibadmin --replace -o configuration -V --xml-pipe
Oct 20 05:21:31 vm1 crmd[19954]: notice: Operation vm1-dead_ip-1_monitor_0: not running (node=vm1, call=119, rc=7, cib-update=142, confirmed=true)
Oct 20 05:21:31 vm1 crmd[19954]: notice: Operation vm1-dead_ip-1_start_0: ok (node=vm1, call=120, rc=0, cib-update=143, confirmed=true)
Oct 20 05:21:38 vm1 IPaddr(vm1-cluster_ip-1)[21296]: INFO: IP status = ok, IP_CIP=
Oct 20 05:21:38 vm1 crmd[19954]: notice: Operation vm1-cluster_ip-1_stop_0: ok (node=vm1, call=123, rc=0, cib-update=145, confirmed=true)
Oct 20 05:21:38 vm1 crmd[19954]: notice: Operation nfs-grace_stop_0: ok (node=vm1, call=125, rc=0, cib-update=146, confirmed=true)
Oct 20 05:21:38 vm1 crmd[19954]: notice: Operation vm1-trigger_ip-1_stop_0: ok (node=vm1, call=127, rc=0, cib-update=147, confirmed=true)
Oct 20 05:21:38 vm1 crmd[19954]: notice: Operation nfs-grace_start_0: ok (node=vm1, call=128, rc=0, cib-update=148, confirmed=true)
Oct 20 05:21:48 vm1 logger: warning: pcs resource create vm1-dead_ip-1 ocf:heartbeat:Dummy failed

The failover itself does happen, as shown in the pcs status below:
 vm1-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started vm4
 vm1-trigger_ip-1	(ocf::heartbeat:Dummy):	Started vm4
 vm2-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started vm2
 vm2-trigger_ip-1	(ocf::heartbeat:Dummy):	Started vm2
 vm3-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started vm3
 vm3-trigger_ip-1	(ocf::heartbeat:Dummy):	Started vm3
 vm4-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started vm4
 vm4-trigger_ip-1	(ocf::heartbeat:Dummy):	Started vm4
 vm1-dead_ip-1	(ocf::heartbeat:Dummy):	Started vm1
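
To double-check on the takeover node that vm1's virtual IP really moved, something like the following can be run on vm4 (the VIP value is whatever was assigned to vm1-cluster_ip-1):

# confirm the failed-over VIP is plumbed on vm4 and the resource is reported as started there
ip addr show | grep <vm1 cluster VIP>
pcs status resources | grep cluster_ip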


But even after failover, the I/O does not resume; the nfs-ganesha log on the node that took over shows the following errors:

19/10/2015 23:49:57 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[dbus_heartbeat] glusterfs_create_export :FSAL :EVENT :Volume vol3 exported at : '/'
20/10/2015 05:22:16 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-16] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:16 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-16] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f781c0109c0
20/10/2015 05:22:16 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-12] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:16 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-12] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f7814026c30
20/10/2015 05:22:21 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-14] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:21 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-14] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f781000b1d0
20/10/2015 05:22:24 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-8] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:24 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-8] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f77f803ee00
20/10/2015 05:22:24 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-14] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:24 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-14] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f78180352c0
20/10/2015 05:22:24 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-16] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-16] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f77f8020bd0
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-9] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-9] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f77f803ee00
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-10] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-10] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f784c037f50
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-9] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-9] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f784c037f50
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-4] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-4] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f77f803ee00
20/10/2015 05:22:26 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-9] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:26 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-9] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f77f803ee00
20/10/2015 05:22:28 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-1] cache_inode_lookup_impl :INODE :EVENT :FSAL returned STALE from a lookup.
20/10/2015 05:22:28 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-3] cache_inode_lookup_impl :INODE :EVENT :FSAL returned STALE from a lookup.
20/10/2015 06:09:51 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy.  Not sending heartbeat
20/10/2015 06:11:06 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy.  Not sending heartbeat


Expected results:
Even if nfs-ganesha has segfaulted, the failover should let the I/O resume.
In addition, the segfault itself needs to be fixed.

Additional info:
The coredump for the segfault was not found; I will run the test again and see if it can be reproduced.
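
In case it helps on the next attempt, one way to make sure a core is captured from the service (assuming systemd is managing nfs-ganesha and no abrt/systemd-coredump handler is already in place; paths are illustrative):

# allow ganesha.nfsd to dump core and point cores at a writable directory
mkdir -p /etc/systemd/system/nfs-ganesha.service.d /var/crash
cat > /etc/systemd/system/nfs-ganesha.service.d/core.conf <<EOF
[Service]
LimitCORE=infinity
EOF
echo '/var/crash/core.%e.%p' > /proc/sys/kernel/core_pattern
systemctl daemon-reload && systemctl restart nfs-ganesha

# after the next crash, open the core against the installed binary
gdb /usr/bin/ganesha.nfsd /var/crash/core.ganesha.nfsd.<pid>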
Comment 1 Saurabh 2015-10-20 02:11 EDT
Created attachment 1084606 [details]
vm4 messages
Comment 3 Kaushal 2017-03-08 06:00:58 EST
This bug is being closed because GlusterFS-3.7 has reached its end-of-life.

Note: This bug is being closed using a script. No verification has been performed to check if it still exists on newer releases of GlusterFS.
If this bug still exists in newer GlusterFS releases, please reopen this bug against the newer release.
