Bug 1273267 - nfs-ganesha: nfs-ganesha server process segfaults and post failover the I/O doesn't resume
Keywords:
Status: CLOSED EOL
Alias: None
Product: GlusterFS
Classification: Community
Component: ganesha-nfs
Version: 3.7.5
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kaleb KEITHLEY
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1298300
 
Reported: 2015-10-20 06:06 UTC by Saurabh
Modified: 2017-03-08 11:00 UTC (History)
5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1298300
Environment:
Last Closed: 2017-03-08 11:00:58 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments (Terms of Use)
vm1 messages (699.80 KB, text/plain)
2015-10-20 06:06 UTC, Saurabh
vm4 messages (943.99 KB, text/plain)
2015-10-20 06:11 UTC, Saurabh

Description Saurabh 2015-10-20 06:06:40 UTC
Created attachment 1084605 [details]
vm1 messages

Description of problem:
I created a tiered volume and started I/O over nfs-ganesha with vers=4, using the LTP test suite as the workload. The tests hang: the nfs-ganesha server process segfaulted, and although failover happens, the I/O does not resume.

Version-Release number of selected component (if applicable):
nfs-ganesha-2.3-0.rc6.el7.centos.x86_64
glusterfs-3.7.5-1.el7.x86_64

How reproducible:
Segfault seen on the first attempt.

Steps to Reproduce:
1. Create a volume of type dist-rep with tiering enabled.
2. Export the volume over nfs-ganesha and mount it with vers=4.
3. Execute the fs-sanity test suite.
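The steps above might look roughly like the following on a GlusterFS 3.7-era cluster. This is an illustrative sketch only: the hostnames, brick paths, volume name, and mount point are placeholders, and it assumes nfs-ganesha HA is already configured on the nodes.

```shell
# Placeholders throughout: vm1..vm4, /bricks/*, vol3, /mnt/vol3.
# Create a 2x2 dist-rep volume and start it.
gluster volume create vol3 replica 2 \
    vm1:/bricks/b1 vm2:/bricks/b1 vm3:/bricks/b1 vm4:/bricks/b1
gluster volume start vol3

# Attach a hot tier (GlusterFS 3.7 tiering syntax).
gluster volume attach-tier vol3 replica 2 vm1:/bricks/hot vm2:/bricks/hot

# Export the volume via nfs-ganesha (assumes the ganesha HA cluster is up).
gluster volume set vol3 ganesha.enable on

# On the client, mount over NFSv4.
mount -t nfs -o vers=4 vm1:/vol3 /mnt/vol3
```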

Actual results:
While the LTP test suite is executing, the nfs-ganesha process segfaults, as seen in /var/log/messages:
Oct 20 05:21:20 vm1 kernel: ganesha.nfsd[9750]: segfault at 0 ip 00000000004b0ede sp 00007f59122a0ae0 error 4 in ganesha.nfsd[400000+1df000]
Oct 20 05:21:21 vm1 systemd: nfs-ganesha.service: main process exited, code=killed, status=11/SEGV
Oct 20 05:21:21 vm1 systemd: Unit nfs-ganesha.service entered failed state.
Oct 20 05:21:31 vm1 cibadmin[21227]: notice: Additional logging available in /var/log/pacemaker.log
Oct 20 05:21:31 vm1 cibadmin[21227]: notice: Invoked: /usr/sbin/cibadmin --replace -o configuration -V --xml-pipe
Oct 20 05:21:31 vm1 crmd[19954]: notice: Operation vm1-dead_ip-1_monitor_0: not running (node=vm1, call=119, rc=7, cib-update=142, confirmed=true)
Oct 20 05:21:31 vm1 crmd[19954]: notice: Operation vm1-dead_ip-1_start_0: ok (node=vm1, call=120, rc=0, cib-update=143, confirmed=true)
Oct 20 05:21:38 vm1 IPaddr(vm1-cluster_ip-1)[21296]: INFO: IP status = ok, IP_CIP=
Oct 20 05:21:38 vm1 crmd[19954]: notice: Operation vm1-cluster_ip-1_stop_0: ok (node=vm1, call=123, rc=0, cib-update=145, confirmed=true)
Oct 20 05:21:38 vm1 crmd[19954]: notice: Operation nfs-grace_stop_0: ok (node=vm1, call=125, rc=0, cib-update=146, confirmed=true)
Oct 20 05:21:38 vm1 crmd[19954]: notice: Operation vm1-trigger_ip-1_stop_0: ok (node=vm1, call=127, rc=0, cib-update=147, confirmed=true)
Oct 20 05:21:38 vm1 crmd[19954]: notice: Operation nfs-grace_start_0: ok (node=vm1, call=128, rc=0, cib-update=148, confirmed=true)
Oct 20 05:21:48 vm1 logger: warning: pcs resource create vm1-dead_ip-1 ocf:heartbeat:Dummy failed
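As an aside, the kernel segfault line above can be decoded from its page-fault error code: on x86, bit 0 distinguishes a protection fault from a missing page, bit 1 read vs. write, and bit 2 kernel vs. user mode. So `error 4` together with `segfault at 0` indicates a user-mode read through a NULL pointer. A small sketch (the helper name is my own):

```shell
# Decode the x86 page-fault error code shown in kernel "segfault at ..." lines.
# Bit 0: 1 = protection fault, 0 = page not present
# Bit 1: 1 = write access,     0 = read access
# Bit 2: 1 = user mode,        0 = kernel mode
decode_segv() {
  local code=$1 p a m
  (( code & 1 )) && p="protection fault" || p="page not present"
  (( code & 2 )) && a="write" || a="read"
  (( code & 4 )) && m="user-mode" || m="kernel-mode"
  echo "$m $a, $p"
}

decode_segv 4   # the code from the log line above
```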

The failover itself does happen, as shown by the pcs status below:
 vm1-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started vm4
 vm1-trigger_ip-1	(ocf::heartbeat:Dummy):	Started vm4
 vm2-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started vm2
 vm2-trigger_ip-1	(ocf::heartbeat:Dummy):	Started vm2
 vm3-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started vm3
 vm3-trigger_ip-1	(ocf::heartbeat:Dummy):	Started vm3
 vm4-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started vm4
 vm4-trigger_ip-1	(ocf::heartbeat:Dummy):	Started vm4
 vm1-dead_ip-1	(ocf::heartbeat:Dummy):	Started vm1


But even after failover, the I/O doesn't resume: the nfs-ganesha log on the failed-over node shows repeated close failures with ENOTCONN ("Transport endpoint is not connected"):

19/10/2015 23:49:57 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[dbus_heartbeat] glusterfs_create_export :FSAL :EVENT :Volume vol3 exported at : '/'
20/10/2015 05:22:16 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-16] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:16 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-16] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f781c0109c0
20/10/2015 05:22:16 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-12] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:16 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-12] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f7814026c30
20/10/2015 05:22:21 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-14] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:21 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-14] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f781000b1d0
20/10/2015 05:22:24 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-8] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:24 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-8] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f77f803ee00
20/10/2015 05:22:24 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-14] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:24 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-14] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f78180352c0
20/10/2015 05:22:24 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-16] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-16] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f77f8020bd0
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-9] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-9] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f77f803ee00
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-10] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-10] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f784c037f50
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-9] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-9] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f784c037f50
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-4] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-4] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f77f803ee00
20/10/2015 05:22:26 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-9] file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not connected
20/10/2015 05:22:26 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-9] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7f77f803ee00
20/10/2015 05:22:28 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-1] cache_inode_lookup_impl :INODE :EVENT :FSAL returned STALE from a lookup.
20/10/2015 05:22:28 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-3] cache_inode_lookup_impl :INODE :EVENT :FSAL returned STALE from a lookup.
20/10/2015 06:09:51 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy.  Not sending heartbeat
20/10/2015 06:11:06 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy.  Not sending heartbeat


Expected results:
Even if nfs-ganesha has segfaulted, failover should allow the I/O to resume.
Also, the segfault itself needs to be fixed.

Additional info:
The segfault-related coredump was not found; I will run the test again and see whether it can be reproduced.
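For the next reproduction attempt, core capture can be enabled up front. A sketch assuming a systemd-based host such as CentOS 7; the core path is illustrative:

```shell
# Write cores to a known location (path is a placeholder).
mkdir -p /var/crash
echo '/var/crash/core.%e.%p.%t' > /proc/sys/kernel/core_pattern

# Lift the core size limit for the ganesha service via a systemd drop-in,
# then restart the service so the limit takes effect.
mkdir -p /etc/systemd/system/nfs-ganesha.service.d
printf '[Service]\nLimitCORE=infinity\n' \
    > /etc/systemd/system/nfs-ganesha.service.d/core.conf
systemctl daemon-reload
systemctl restart nfs-ganesha
```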

Comment 1 Saurabh 2015-10-20 06:11:09 UTC
Created attachment 1084606 [details]
vm4 messages

Comment 3 Kaushal 2017-03-08 11:00:58 UTC
This bug is being closed because GlusterFS 3.7 has reached its end of life.

Note: This bug is being closed using a script. No verification has been performed to check if it still exists on newer releases of GlusterFS.
If this bug still exists in newer GlusterFS releases, please reopen this bug against the newer release.

