Bug 1224618 - Ganesha server became unresponsive after successful failover
Summary: Ganesha server became unresponsive after successful failover
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: nfs-ganesha
Version: rhgs-3.1
Hardware: x86_64
OS: All
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 3.1.0
Assignee: Kaleb KEITHLEY
QA Contact: Saurabh
URL:
Whiteboard:
Depends On:
Blocks: 1202842
 
Reported: 2015-05-25 07:03 UTC by Apeksha
Modified: 2023-09-14 02:59 UTC
CC List: 9 users

Fixed In Version: glusterfs-3.7.1-2
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-07-29 04:52:37 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1216039 0 high CLOSED nfs-ganesha: Discrepancies with lock states recovery during migration 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1219485 0 high CLOSED nfs-ganesha: Discrepancies with lock states recovery during migration 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1227028 0 high CLOSED nfs-ganesha: Discrepancies with lock states recovery during migration 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHSA-2015:1495 0 normal SHIPPED_LIVE Important: Red Hat Gluster Storage 3.1 update 2015-07-29 08:26:26 UTC

Internal Links: 1216039 1219485 1227028

Description Apeksha 2015-05-25 07:03:30 UTC
Description of problem:
Ganesha server became unresponsive after successful failover and I/O completion

Version-Release number of selected component (if applicable):
nfs-ganesha-2.2.0-0.el6.x86_64
glusterfs-3.7.0-2.el6rhs.x86_64

How reproducible:
once

Steps to Reproduce:
1. Running IO (linux untar process) on a 4-node ganesha cluster
2. Rebooted nfs2; failover to nfs1 happened successfully
3. IO resumed after around 7 min and also completed
4. The showmount command times out on the nfs1 server, but ganesha is still running on that server
5. The showmount command is successful from the nfs3 and nfs4 servers
6. On the client, the df -h command hangs (see the command sketch below)
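A minimal sketch of the checks behind steps 4-6; the hostnames match the cluster nodes above, the volume name vol1 is taken from the gfapi.log excerpt below, and the client mount point is an assumption:

showmount -e nfs1      # times out on nfs1 in this report
showmount -e nfs3      # succeeds on nfs3 and nfs4
showmount -e nfs4
df -h /mnt/vol1        # hangs on the client (mount point assumed)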


Actual results: The showmount command times out on the nfs1 server, and the df -h command hangs on the client.

Expected results: The showmount command must be successful, and the df -h command on the client must not hang.

Additional info:
/var/log/ganesha.log:

24/05/2015 03:24:39 : epoch 555f34d6 : nfs1 : ganesha.nfsd-8909[dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy.  Not sending heartbeat
24/05/2015 03:25:54 : epoch 555f34d6 : nfs1 : ganesha.nfsd-8909[dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy.  Not sending heartbeat
24/05/2015 03:29:08 : epoch 555f34d6 : nfs1 : ganesha.nfsd-8909[dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy.  Not sending heartbeat
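
The heartbeat warnings above indicate that ganesha.nfsd's internal health check is failing, so it stops answering its DBus heartbeat. As a hedged cross-check, assuming the standard nfs-ganesha DBus interface (org.ganesha.nfsd), calling ShowExports directly on nfs1 shows whether the daemon still answers DBus independently of showmount:

# query the export list over DBus; a hung daemon will not reply
dbus-send --system --print-reply \
  --dest=org.ganesha.nfsd /org/ganesha/nfsd/ExportMgr \
  org.ganesha.nfsd.exportmgr.ShowExports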

/var/log/messages:

May 24 11:00:10 nfs1 crmd[9408]:   notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
May 24 11:00:10 nfs1 crmd[9408]:   notice: run_graph: Transition 176 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-107.bz2): Complete
May 24 11:00:10 nfs1 crmd[9408]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
May 24 11:00:10 nfs1 pengine[9406]:   notice: process_pe_message: Calculated Transition 176: /var/lib/pacemaker/pengine/pe-input-107.bz2
May 24 11:00:19 nfs1 lrmd[9401]:   notice: operation_finished: nfs-mon_monitor_10000:21301:stderr [ Error: Resource does not exist. ]
May 24 11:00:31 nfs1 lrmd[9401]:   notice: operation_finished: nfs-mon_monitor_10000:21458:stderr [ Error: Resource does not exist. ]
May 24 11:00:42 nfs1 lrmd[9401]:   notice: operation_finished: nfs-mon_monitor_10000:21515:stderr [ Error: Resource does not exist. ]


pcs status:
Full list of resources:

 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ nfs1 nfs3 nfs4 ]
     Stopped: [ nfs2 ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ nfs1 nfs3 nfs4 ]
     Stopped: [ nfs2 ]
 nfs1-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started nfs1 
 nfs1-trigger_ip-1	(ocf::heartbeat:Dummy):	Started nfs1 
 nfs2-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started nfs1 
 nfs2-trigger_ip-1	(ocf::heartbeat:Dummy):	Started nfs1 
 nfs3-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started nfs3 
 nfs3-trigger_ip-1	(ocf::heartbeat:Dummy):	Started nfs3 
 nfs4-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started nfs4 
 nfs4-trigger_ip-1	(ocf::heartbeat:Dummy):	Started nfs4 

/tmp/gfapi.log

[2015-05-22 16:19:46.448068] I [MSGID: 109036] [dht-common.c:6689:dht_log_new_layout_for_dir_selfheal] 0-vol1-dht: Setting layout of /t2/linux-2.6.31.1/arch/mips/include/asm/mach-rm with [Subvol_name: vol1-disperse-0, Err: -1 , Start: 0 , Stop: 4294967295 , Hash: 1 ],
[2015-05-22 16:19:46.869975] I [MSGID: 109036] [dht-common.c:6689:dht_log_new_layout_for_dir_selfheal] 0-vol1-dht: Setting layout of /t2/linux-2.6.31.1/arch/mips/include/asm/mach-sibyte with [Subvol_name: vol1-disperse-0, Err: -1 , Start: 0 , Stop: 4294967295 , Hash: 1 ],
[2015-05-22 16:19:47.209560] I [MSGID: 109036] [dht-common.c:6689:dht_log_new_layout_for_dir_selfheal] 0-vol1-dht: Setting layout of /t2/linux-2.6.31.1/arch/mips/include/asm/mach-tx39xx with [Subvol_name: vol1-disperse-0, Err: -1 , Start: 0 , Stop: 4294967295 , Hash: 1 ],

Comment 2 Apeksha 2015-05-25 07:11:11 UTC
sosreports : http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1224618/

Comment 5 Saurabh 2015-07-07 09:59:16 UTC
The Ganesha server was not unresponsive post failover, and I/O also finished post failover.

Comment 6 errata-xmlrpc 2015-07-29 04:52:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html

Comment 7 Red Hat Bugzilla 2023-09-14 02:59:34 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

