Bug 1347286 - dead IPs seen in pcs status after upgrade and failover/failback doesn't work as expected.
Summary: dead IPs seen in pcs status after upgrade and failover/failback doesn't work...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: nfs-ganesha
Version: rhgs-3.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Kaleb KEITHLEY
QA Contact: Manisha Saini
URL:
Whiteboard:
Depends On:
Blocks: 1347196
 
Reported: 2016-06-16 12:49 UTC by Shashank Raj
Modified: 2020-04-13 07:38 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-02-09 12:18:15 UTC
Embargoed:



Description Shashank Raj 2016-06-16 12:49:43 UTC
Description of problem:

Dead IPs are seen in pcs status after the upgrade, and failover/failback does not work as expected.

Version-Release number of selected component (if applicable):

nfs-ganesha-2.3.1-8
glusterfs-3.7.9-10

How reproducible:
Always

Steps to Reproduce:

1. Upgrade to 3.1.3 using the steps below:

1. Stop the nfs-ganesha service on all the nodes of the cluster by executing the following command:

# service nfs-ganesha stop

2. Verify the status by executing the following command on all the nodes:

# pcs status

3. Stop the glusterd service and kill any running gluster process on all the nodes:

# service glusterd stop
# pkill glusterfs
# pkill glusterfsd
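
To confirm that no gluster processes remain after this step, a quick check along these lines can be used (a minimal sketch; it assumes pgrep is available on the nodes):

# pgrep -l gluster

The command should report nothing before you continue.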
4. Place the entire cluster in standby mode on all the nodes by executing the following command:

# pcs cluster standby <node-name>

For example: 
# pcs cluster standby nfs1
# pcs status

Cluster name: G1455878027.97
Last updated: Tue Feb 23 08:05:13 2016
Last change: Tue Feb 23 08:04:55 2016
Stack: cman
Current DC: nfs1 - partition with quorum
Version: 1.1.11-97629de
4 Nodes configured
16 Resources configured


Node nfs1: standby
Online: [ nfs2 nfs3 nfs4 ]

....
5. Stop the cluster software on all the nodes using pcs, by executing the following command:

# pcs cluster stop <node-name>

Ensure that it stops pacemaker and cman. 
For example: 
# pcs cluster stop nfs1
nfs1: Stopping Cluster (pacemaker)...
nfs1: Stopping Cluster (cman)...

6. Update the NFS-Ganesha packages on all the nodes by executing the following command:

# yum update nfs-ganesha
# yum update glusterfs-ganesha
Note:
This will install the glusterfs-ganesha and nfs-ganesha-gluster packages along with other dependent gluster packages.
Some warnings related to shared_storage might appear during the upgrade; these can be ignored.
Verify on all the nodes that the required packages are updated, that the nodes are fully functional, and that they are using the correct versions. If anything does not seem correct, do not proceed until the situation is resolved; contact Red Hat Global Support Services for assistance if needed.
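
One way to verify the updated versions on each node (a sketch only; the exact package list can vary):

# rpm -qa | grep -E 'ganesha|glusterfs'

The output should show the updated nfs-ganesha and glusterfs package versions on every node.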
7. a) After the upgrade, copy the export entries of all the volumes from the old ganesha.conf file to the newly created ganesha.conf.rpmnew file under /etc/ganesha/.
The export entries will look like:
%include "/etc/ganesha/exports/export.vol1.conf"
b) Remove the old ganesha.conf file and rename the new ganesha.conf.rpmnew to ganesha.conf.
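
For step 7, the file operations would be along these lines (a sketch only, assuming the export entries are the %include lines shown above):

# grep '%include' /etc/ganesha/ganesha.conf >> /etc/ganesha/ganesha.conf.rpmnew
# rm /etc/ganesha/ganesha.conf
# mv /etc/ganesha/ganesha.conf.rpmnew /etc/ganesha/ganesha.conf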
8. Change the firewall settings (if required) for the new services and ports, as mentioned in the Important section of 7.2.4. NFS-Ganesha in the 3.1.3 Administration Guide.
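
As an illustration only (this assumes the node runs firewalld; the authoritative list of services and ports is in the Administration Guide section referenced above, and on iptables-based nodes the corresponding ports have to be opened instead):

# firewall-cmd --zone=public --add-service=nfs --add-service=rpc-bind --add-service=mountd --add-service=high-availability
# firewall-cmd --zone=public --add-service=nfs --add-service=rpc-bind --add-service=mountd --add-service=high-availability --permanent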

9. Start glusterd service on all the nodes by executing the following command: 
# service glusterd start
10. Mount the shared storage volume created before the update on all the nodes:
# mount -t glusterfs localhost:/gluster_shared_storage /var/run/gluster/shared_storage

11.  Start the nfs-ganesha service on all the nodes by executing the following command:
# service nfs-ganesha start

12. Start the cluster software on all the nodes by executing the following command:
# pcs cluster start <node-name>

For example: 
# pcs cluster start nfs1
nfs1: Starting Cluster...
13. Check the pcs status output to determine whether everything appears as it should. Once a node is functioning properly, reactivate it for service by taking it out of standby mode, by executing the following command:
# pcs cluster unstandby <node-name>

For example: 
# pcs cluster unstandby nfs1

# pcs status
Cluster name: G1455878027.97
Last updated: Tue Feb 23 08:14:01 2016
Last change: Tue Feb 23 08:13:57 2016
Stack: cman
Current DC: nfs3 - partition with quorum
Version: 1.1.11-97629de
4 Nodes configured
16 Resources configured


Online: [ nfs1 nfs2 nfs3 nfs4 ]

....
Make sure there are no failures or unexpected results.
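
A quick way to spot failures or leftover resources in the pcs output (illustrative only; resource names depend on the cluster):

# pcs status | grep -iE 'failed|stopped|dead_ip'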

2. Observe that after the upgrade, dead IPs are seen in pcs status, as shown below:

 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ dhcp43-139.lab.eng.blr.redhat.com dhcp43-141.lab.eng.blr.redhat.com dhcp43-159.lab.eng.blr.redhat.com dhcp43-243.lab.eng.blr.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ dhcp43-139.lab.eng.blr.redhat.com dhcp43-141.lab.eng.blr.redhat.com dhcp43-159.lab.eng.blr.redhat.com ]
     Stopped: [ dhcp43-243.lab.eng.blr.redhat.com ]
 dhcp43-139.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr):        Started dhcp43-139.lab.eng.blr.redhat.com
 dhcp43-139.lab.eng.blr.redhat.com-trigger_ip-1 (ocf::heartbeat:Dummy): Started dhcp43-139.lab.eng.blr.redhat.com
 dhcp43-243.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr):        Started dhcp43-139.lab.eng.blr.redhat.com
 dhcp43-243.lab.eng.blr.redhat.com-trigger_ip-1 (ocf::heartbeat:Dummy): Started dhcp43-139.lab.eng.blr.redhat.com
 dhcp43-159.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr):        Started dhcp43-159.lab.eng.blr.redhat.com
 dhcp43-159.lab.eng.blr.redhat.com-trigger_ip-1 (ocf::heartbeat:Dummy): Started dhcp43-159.lab.eng.blr.redhat.com
 dhcp43-141.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr):        Started dhcp43-141.lab.eng.blr.redhat.com
 dhcp43-141.lab.eng.blr.redhat.com-trigger_ip-1 (ocf::heartbeat:Dummy): Started dhcp43-141.lab.eng.blr.redhat.com
 dhcp43-139-dead_ip-1   (ocf::heartbeat:Dummy): Started dhcp43-243.lab.eng.blr.redhat.com
 dhcp43-159-dead_ip-1   (ocf::heartbeat:Dummy): Started dhcp43-243.lab.eng.blr.redhat.com
 dhcp43-243-dead_ip-1   (ocf::heartbeat:Dummy): Started dhcp43-243.lab.eng.blr.redhat.com
 dhcp43-141-dead_ip-1   (ocf::heartbeat:Dummy): Started dhcp43-243.lab.eng.blr.redhat.com
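
The stale entries above are the ocf::heartbeat:Dummy resources named *-dead_ip-1; they can be listed on any node with something like (illustrative only):

# pcs resource show | grep dead_ip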

3. After the upgrade, if we try to perform failover/failback, the ganesha nodes do not go into the grace period and I/O does not get blocked.
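
For context, one way such a failover can be exercised and the grace period observed (a sketch under assumptions: stopping nfs-ganesha on one node should trigger failover, and ganesha is assumed to log to /var/log/ganesha.log):

On the node being failed over:
# service nfs-ganesha stop

On a surviving node, the virtual IP of the failed node should move and grace should start:
# pcs status | grep cluster_ip
# grep -i grace /var/log/ganesha.log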

Actual results:

Dead IPs are seen in pcs status after the upgrade, and failover/failback does not work as expected.

Expected results:

All the basic functionality should work fine after the upgrade.

Additional info:

This looks like an issue caused by the way we are performing the upgrade, and the upgrade procedure needs to be changed.

We discussed this within the nfs-ganesha team and agreed to change the current upgrade section again so that the ganesha cluster is disabled before proceeding with the upgrade.

Will update the details in the bug below:
https://bugzilla.redhat.com/show_bug.cgi?id=1347196
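
For reference, disabling the ganesha cluster before the upgrade would presumably use the existing CLI, something along the lines of:

# gluster nfs-ganesha disable

with the cluster re-enabled (gluster nfs-ganesha enable) once the upgrade is complete.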

Comment 3 Kaleb KEITHLEY 2016-06-20 13:00:47 UTC
The upgrade process was rewritten for 3.1.3.

Comment 4 Soumya Koduri 2016-09-29 13:27:59 UTC
As per comment #3, kindly re-test and update if the issue is still seen.

Comment 7 Sweta Anandpara 2018-01-23 09:43:04 UTC
Resetting the needinfo from sraj to current Ganesha QE.

