Description of problem:
Dead IPs are seen in pcs status after upgrade, and failover/failback does not work as expected.

Version-Release number of selected component (if applicable):
nfs-ganesha-2.3.1-8
glusterfs-3.7.9-10

How reproducible:
Always

Steps to Reproduce:
1. Upgrade to 3.1.3 using the below steps:

1. Stop the nfs-ganesha service on all the nodes of the cluster by executing the following command:
# service nfs-ganesha stop

2. Verify the status by executing the following command on all the nodes:
# pcs status

3. Stop the glusterd service and kill any running gluster processes on all the nodes:
# service glusterd stop
# pkill glusterfs
# pkill glusterfsd

4. Place the entire cluster in standby mode by executing the following command for each node:
# pcs cluster standby <node-name>
For example:
# pcs cluster standby nfs1
# pcs status
Cluster name: G1455878027.97
Last updated: Tue Feb 23 08:05:13 2016
Last change: Tue Feb 23 08:04:55 2016
Stack: cman
Current DC: nfs1 - partition with quorum
Version: 1.1.11-97629de
4 Nodes configured
16 Resources configured
Node nfs1: standby
Online: [ nfs2 nfs3 nfs4 ]
....

5. Stop the cluster software on all the nodes using pcs, and ensure that it stops both pacemaker and cman:
# pcs cluster stop <node-name>
For example:
# pcs cluster stop nfs1
nfs1: Stopping Cluster (pacemaker)...
nfs1: Stopping Cluster (cman)...

6. Update the NFS-Ganesha packages on all the nodes by executing the following commands:
# yum update nfs-ganesha
# yum update glusterfs-ganesha
Note: this installs the glusterfs-ganesha and nfs-ganesha-gluster packages along with other dependent gluster packages. Warnings related to shared_storage may appear during the upgrade; these can be ignored.
Verify on all the nodes that the required packages are updated, the nodes are fully functional, and the correct versions are in use. If anything does not look correct, do not proceed until the situation is resolved.
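The export-entry migration performed in step 7 of the procedure can be sketched as a small helper. This is a minimal sketch, not the documented method: it assumes the default /etc/ganesha location from the steps and that every export entry is a %include line.

```shell
# Sketch of step 7: carry the %include export entries from the old
# ganesha.conf into the ganesha.conf.rpmnew created by the upgrade,
# then make the merged file the active config (old file is removed by mv).
merge_ganesha_conf() {
    dir=$1                                   # normally /etc/ganesha
    # Append every export %include line from the old config to the new one
    grep '^%include' "$dir/ganesha.conf" >> "$dir/ganesha.conf.rpmnew"
    # Replace the old config with the merged new one
    mv -f "$dir/ganesha.conf.rpmnew" "$dir/ganesha.conf"
}
```

On a real node this would be run as `merge_ganesha_conf /etc/ganesha`, and the merged file should be reviewed before nfs-ganesha is started again.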
Contact Red Hat Global Support Services for assistance if needed.

7. a) After the upgrade, copy the export entries of all the volumes from the old ganesha.conf file under /etc/ganesha/ to the newly created ganesha.conf.rpmnew file. Export entries look like:
%include "/etc/ganesha/exports/export.vol1.conf"
b) Remove the old ganesha.conf file and rename the new ganesha.conf.rpmnew to ganesha.conf.

8. Change the firewall settings (if required) for the new services and ports, as described in the Important callout of section 7.2.4 (NFS-Ganesha) in the 3.1.3 Administration Guide.

9. Start the glusterd service on all the nodes by executing the following command:
# service glusterd start

10. Mount the shared storage volume created before the update on all the nodes:
# mount -t glusterfs localhost:/gluster_shared_storage /var/run/gluster/shared_storage

11. Start the nfs-ganesha service on all the nodes by executing the following command:
# service nfs-ganesha start

12. Start the cluster software on all the nodes by executing the following command:
# pcs cluster start <node-name>
For example:
# pcs cluster start nfs1
nfs1: Starting Cluster...

13. Check the pcs status output to confirm that everything appears as it should. Once the nodes are functioning properly, return each node to service by taking it out of standby mode:
# pcs cluster unstandby <node-name>
For example:
# pcs cluster unstandby nfs1
# pcs status
Cluster name: G1455878027.97
Last updated: Tue Feb 23 08:14:01 2016
Last change: Tue Feb 23 08:13:57 2016
Stack: cman
Current DC: nfs3 - partition with quorum
Version: 1.1.11-97629de
4 Nodes configured
16 Resources configured
Online: [ nfs1 nfs2 nfs3 nfs4 ]
....
Make sure there are no failures or unexpected results.

2.
Observe that after the upgrade, dead IPs are seen in pcs status as below:

Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ dhcp43-139.lab.eng.blr.redhat.com dhcp43-141.lab.eng.blr.redhat.com dhcp43-159.lab.eng.blr.redhat.com dhcp43-243.lab.eng.blr.redhat.com ]
Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ dhcp43-139.lab.eng.blr.redhat.com dhcp43-141.lab.eng.blr.redhat.com dhcp43-159.lab.eng.blr.redhat.com ]
     Stopped: [ dhcp43-243.lab.eng.blr.redhat.com ]
dhcp43-139.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started dhcp43-139.lab.eng.blr.redhat.com
dhcp43-139.lab.eng.blr.redhat.com-trigger_ip-1 (ocf::heartbeat:Dummy): Started dhcp43-139.lab.eng.blr.redhat.com
dhcp43-243.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started dhcp43-139.lab.eng.blr.redhat.com
dhcp43-243.lab.eng.blr.redhat.com-trigger_ip-1 (ocf::heartbeat:Dummy): Started dhcp43-139.lab.eng.blr.redhat.com
dhcp43-159.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started dhcp43-159.lab.eng.blr.redhat.com
dhcp43-159.lab.eng.blr.redhat.com-trigger_ip-1 (ocf::heartbeat:Dummy): Started dhcp43-159.lab.eng.blr.redhat.com
dhcp43-141.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started dhcp43-141.lab.eng.blr.redhat.com
dhcp43-141.lab.eng.blr.redhat.com-trigger_ip-1 (ocf::heartbeat:Dummy): Started dhcp43-141.lab.eng.blr.redhat.com
dhcp43-139-dead_ip-1 (ocf::heartbeat:Dummy): Started dhcp43-243.lab.eng.blr.redhat.com
dhcp43-159-dead_ip-1 (ocf::heartbeat:Dummy): Started dhcp43-243.lab.eng.blr.redhat.com
dhcp43-243-dead_ip-1 (ocf::heartbeat:Dummy): Started dhcp43-243.lab.eng.blr.redhat.com
dhcp43-141-dead_ip-1 (ocf::heartbeat:Dummy): Started dhcp43-243.lab.eng.blr.redhat.com

3. After the upgrade, attempting a failover/failback shows that the ganesha nodes do not go into the grace period and I/Os are not blocked.

Actual results:
Dead IPs are seen in pcs status after upgrade, and failover/failback does not work as expected.
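The stale entries above can be spotted mechanically by filtering the pcs status output for the dead_ip Dummy resources. A sketch (the helper name is illustrative; a healthy cluster reports a count of 0):

```shell
# Count leftover dead_ip Dummy resources in pcs status output.
# Reads the status text on stdin, so it can be fed from the live cluster
# (pcs status | count_dead_ips) or from a saved log.
count_dead_ips() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so swallow that exit status
    grep -c -e '-dead_ip-' || true
}
```

On the cluster shown above, `pcs status | count_dead_ips` would print 4.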
Expected results:
All basic functionality should work after the upgrade.

Additional info:
This appears to be an issue caused by the way the upgrade is performed, which needs to be changed. This was discussed within the nfs-ganesha team, and we agreed to revise the current upgrade section so that the ganesha cluster is disabled before proceeding with the upgrade. Details will be updated in the bug below:
https://bugzilla.redhat.com/show_bug.cgi?id=1347196
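A rough dry run of the proposed revision, tearing the ganesha HA cluster down before the package update rather than only putting nodes in standby. This is illustrative only: it prints the commands instead of running them, and it assumes the gluster nfs-ganesha enable/disable CLI.

```shell
# Dry run of the proposed upgrade sequence; plan_fix only prints the
# commands that would be executed on a real cluster.
plan_fix() {
    echo "gluster nfs-ganesha disable"                # tear down the HA cluster first
    echo "yum update nfs-ganesha glusterfs-ganesha"   # then upgrade the packages
    echo "gluster nfs-ganesha enable"                 # re-create the cluster afterwards
}
plan_fix
```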
The upgrade process was rewritten for 3.1.3.
As per comment #3, kindly re-test and update whether the issue is still seen.
Resetting the needinfo from sraj to current Ganesha QE.