Bug 1224618 - Ganesha server became unresponsive after successful failover
Summary: Ganesha server became unresponsive after successful failover
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: nfs-ganesha
Version: rhgs-3.1
Hardware: x86_64
OS: All
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 3.1.0
Assignee: Kaleb KEITHLEY
QA Contact: Saurabh
URL:
Whiteboard:
Depends On:
Blocks: 1202842
 
Reported: 2015-05-25 07:03 UTC by Apeksha
Modified: 2023-09-14 02:59 UTC
CC List: 9 users

Fixed In Version: glusterfs-3.7.1-2
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-07-29 04:52:37 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1216039 0 high CLOSED nfs-ganesha: Discrepancies with lock states recovery during migration 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1219485 0 high CLOSED nfs-ganesha: Discrepancies with lock states recovery during migration 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1227028 0 high CLOSED nfs-ganesha: Discrepancies with lock states recovery during migration 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHSA-2015:1495 0 normal SHIPPED_LIVE Important: Red Hat Gluster Storage 3.1 update 2015-07-29 08:26:26 UTC

Internal Links: 1216039 1219485 1227028

Description Apeksha 2015-05-25 07:03:30 UTC
Description of problem:
Ganesha server became unresponsive after successful failover and I/O completion

Version-Release number of selected component (if applicable):
nfs-ganesha-2.2.0-0.el6.x86_64
glusterfs-3.7.0-2.el6rhs.x86_64

How reproducible:
once

Steps to Reproduce:
1. Running IO (linux untar process) on a 4-node ganesha cluster
2. Rebooted nfs2; failover to nfs1 happened successfully
3. IO resumed after around 7 min and also completed
4. The showmount command times out on the nfs1 server, but ganesha is still running on that server
5. The showmount command is successful from the nfs3 and nfs4 servers
6. On the client, the df -h command hangs (see the command sketch below)
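A minimal sketch of the checks behind steps 4-6; the hostnames match the cluster nodes above, the volume name vol1 is taken from the gfapi.log excerpt below, and the client mount point is an assumption:

showmount -e nfs1      # times out on nfs1 in this report
showmount -e nfs3      # succeeds on nfs3 and nfs4
showmount -e nfs4
df -h /mnt/vol1        # hangs on the client (mount point assumed)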


Actual results: The showmount command times out on the nfs1 server, and the df -h command hangs on the client.

Expected results: The showmount command must be successful, and the df -h command on the client must not hang.

Additional info:
/var/log/ganesha.log:

24/05/2015 03:24:39 : epoch 555f34d6 : nfs1 : ganesha.nfsd-8909[dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy.  Not sending heartbeat
24/05/2015 03:25:54 : epoch 555f34d6 : nfs1 : ganesha.nfsd-8909[dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy.  Not sending heartbeat
24/05/2015 03:29:08 : epoch 555f34d6 : nfs1 : ganesha.nfsd-8909[dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy.  Not sending heartbeat
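
The heartbeat warnings above indicate that ganesha.nfsd's internal health check is failing, so it stops answering its DBus heartbeat. As a hedged cross-check, assuming the standard nfs-ganesha DBus interface (org.ganesha.nfsd), calling ShowExports directly on nfs1 shows whether the daemon still answers DBus independently of showmount:

# query the export list over DBus; a hung daemon will not reply
dbus-send --system --print-reply \
  --dest=org.ganesha.nfsd /org/ganesha/nfsd/ExportMgr \
  org.ganesha.nfsd.exportmgr.ShowExports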

/var/log/messages:

May 24 11:00:10 nfs1 crmd[9408]:   notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
May 24 11:00:10 nfs1 crmd[9408]:   notice: run_graph: Transition 176 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-107.bz2): Complete
May 24 11:00:10 nfs1 crmd[9408]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
May 24 11:00:10 nfs1 pengine[9406]:   notice: process_pe_message: Calculated Transition 176: /var/lib/pacemaker/pengine/pe-input-107.bz2
May 24 11:00:19 nfs1 lrmd[9401]:   notice: operation_finished: nfs-mon_monitor_10000:21301:stderr [ Error: Resource does not exist. ]
May 24 11:00:31 nfs1 lrmd[9401]:   notice: operation_finished: nfs-mon_monitor_10000:21458:stderr [ Error: Resource does not exist. ]
May 24 11:00:42 nfs1 lrmd[9401]:   notice: operation_finished: nfs-mon_monitor_10000:21515:stderr [ Error: Resource does not exist. ]


pcs status:
Full list of resources:

 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ nfs1 nfs3 nfs4 ]
     Stopped: [ nfs2 ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ nfs1 nfs3 nfs4 ]
     Stopped: [ nfs2 ]
 nfs1-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started nfs1 
 nfs1-trigger_ip-1	(ocf::heartbeat:Dummy):	Started nfs1 
 nfs2-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started nfs1 
 nfs2-trigger_ip-1	(ocf::heartbeat:Dummy):	Started nfs1 
 nfs3-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started nfs3 
 nfs3-trigger_ip-1	(ocf::heartbeat:Dummy):	Started nfs3 
 nfs4-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started nfs4 
 nfs4-trigger_ip-1	(ocf::heartbeat:Dummy):	Started nfs4 

/tmp/gfapi.log

[2015-05-22 16:19:46.448068] I [MSGID: 109036] [dht-common.c:6689:dht_log_new_layout_for_dir_selfheal] 0-vol1-dht: Setting layout of /t2/linux-2.6.31.1/arch/mips/include/asm/mach-rm with [Subvol_name: vol1-disperse-0, Err: -1 , Start: 0 , Stop: 4294967295 , Hash: 1 ],
[2015-05-22 16:19:46.869975] I [MSGID: 109036] [dht-common.c:6689:dht_log_new_layout_for_dir_selfheal] 0-vol1-dht: Setting layout of /t2/linux-2.6.31.1/arch/mips/include/asm/mach-sibyte with [Subvol_name: vol1-disperse-0, Err: -1 , Start: 0 , Stop: 4294967295 , Hash: 1 ],
[2015-05-22 16:19:47.209560] I [MSGID: 109036] [dht-common.c:6689:dht_log_new_layout_for_dir_selfheal] 0-vol1-dht: Setting layout of /t2/linux-2.6.31.1/arch/mips/include/asm/mach-tx39xx with [Subvol_name: vol1-disperse-0, Err: -1 , Start: 0 , Stop: 4294967295 , Hash: 1 ],

Comment 2 Apeksha 2015-05-25 07:11:11 UTC
sosreports : http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1224618/

Comment 5 Saurabh 2015-07-07 09:59:16 UTC
The Ganesha server was not unresponsive post failover, and I/O also finished post failover.

Comment 6 errata-xmlrpc 2015-07-29 04:52:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html

Comment 7 Red Hat Bugzilla 2023-09-14 02:59:34 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

