Bug 1309991

Summary: In-service upgrade of nfs-ganesha from 3.1 to 3.1.2 failed and triggered a shutdown of one of the nodes.
Product: Red Hat Gluster Storage
Reporter: Shashank Raj <sraj>
Component: nfs-ganesha
Assignee: Kaleb KEITHLEY <kkeithle>
Status: CLOSED NOTABUG
QA Contact: storage-qa-internal <storage-qa-internal>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.1
CC: akhakhar, jthottan, kkeithle, ndevos, nlevinki, sashinde, skoduri
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-06-20 12:19:02 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Shashank Raj 2016-02-19 07:25:33 UTC
Description of problem:
In-service upgrade of nfs-ganesha from 3.1 to 3.1.2 failed and triggered a shutdown of one of the nodes.

Version-Release number of selected component (if applicable):
glusterfs-3.7.5-19

How reproducible:
Once

Steps to Reproduce:
1. Create a 4-node cluster and install RHGS 3.1 on all the nodes.
2. Configure all the required settings and set up ganesha on the cluster.
3. Mount the volume through the VIP of node1 and start some dd from the client.
4. Do an upgrade of node1 by following the procedure mentioned below:

service nfs-ganesha stop
failover happened from node1 to node1
service glusterd stop
pkill glusterfs
pkill glusterfsd
pcs cluster standby node1
pcs cluster stop node1 
enable puddles for 3.1.2 latest
yum update nfs-ganesha
pcs cluster start node1
pcs cluster unstandby node1
service glusterd start
service nfs-ganesha start
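The per-node command sequence above can be sketched as a small script. This is an illustrative sketch only: the `upgrade_node` function name, the node-name argument, and the `DRY_RUN` guard (which prints the commands instead of running them) are additions for clarity, not part of the reported procedure, and the repo enablement step from the report is assumed to have been done beforehand.

```shell
#!/bin/sh
# Sketch of the in-service upgrade sequence from this report, one node at a
# time. DRY_RUN=1 (the default) only echoes each command; set DRY_RUN=0 to
# actually execute them on a cluster node.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

upgrade_node() {
    node="$1"
    run service nfs-ganesha stop        # VIP should fail over to a surviving node
    run service glusterd stop
    run pkill glusterfs
    run pkill glusterfsd
    run pcs cluster standby "$node"
    run pcs cluster stop "$node"
    run yum update nfs-ganesha          # assumes 3.1.2 repos are already enabled
    run pcs cluster start "$node"
    run pcs cluster unstandby "$node"
    run service glusterd start
    run service nfs-ganesha start
}

upgrade_node "${1:-node1}"
```

Run as `sh upgrade.sh node1` in dry-run mode first to review the command order before repeating it on each remaining node.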

5. Node1 got upgraded properly without any issues.
6. Followed the same steps for the upgrade of node2 as below

mounted the volume on client with node2 VIP
service nfs-ganesha stop
failover happened from node2 to node1
service glusterd stop
pkill glusterfs
pkill glusterfsd
pcs cluster standby node2
pcs cluster stop node2
IO was going on
yum update nfs-ganesha - all the packages got updated.
pcs cluster start node2
pcs cluster unstandby node2

After this, pcs status gives the output below:

Full list of resources:

 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ nfs1 nfs2 nfs3 ]
     Stopped: [ nfs4 ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ nfs1 nfs2 nfs3 ]
     Stopped: [ nfs4 ]
 nfs1-cluster_ip-1      (ocf::heartbeat:IPaddr):        Started nfs3 
 nfs1-trigger_ip-1      (ocf::heartbeat:Dummy): Started nfs3
 nfs2-cluster_ip-1      (ocf::heartbeat:IPaddr):        Started nfs3
 nfs2-trigger_ip-1      (ocf::heartbeat:Dummy): Started nfs3
 nfs3-cluster_ip-1      (ocf::heartbeat:IPaddr):        Started nfs3
 nfs3-trigger_ip-1      (ocf::heartbeat:Dummy): Started nfs3
 nfs4-cluster_ip-1      (ocf::heartbeat:IPaddr):        Started nfs3
 nfs4-trigger_ip-1      (ocf::heartbeat:Dummy): Started nfs3
 nfs2-dead_ip-1 (ocf::heartbeat:Dummy): Started nfs2
 nfs1-dead_ip-1 (ocf::heartbeat:Dummy): Started nfs1

and then node4 got shut down; IO stopped with "Remote I/O error" and the messages below are observed in /var/log/messages.

Feb 18 08:10:32 nfs2 stonith-ng[14598]:  warning: get_xpath_object: No match for //@st_delegate in /st-reply
Feb 18 08:10:32 nfs2 stonith-ng[14598]:   notice: remote_op_done: Operation reboot of nfs4 by nfs1 for stonith_admin.cman.12399: No such device
Feb 18 08:10:32 nfs2 crmd[14602]:   notice: tengine_stonith_notify: Peer nfs4 was not terminated (reboot) by nfs1 for nfs1: No such device (ref=d6228995-f151-4a89-8b62-69a98ac5d76a) by client stonith_admin.cman.12399
Feb 18 08:10:34 nfs2 root: warning: pcs resource create nfs2-dead_ip-1 ocf:heartbeat:Dummy failed
Feb 18 08:10:35 nfs2 stonith-ng[14598]:  warning: get_xpath_object: No match for //@st_delegate in /st-reply
Feb 18 08:10:35 nfs2 stonith-ng[14598]:   notice: remote_op_done: Operation reboot of nfs4 by nfs1 for stonith_admin.cman.12432: No such device
Feb 18 08:10:35 nfs2 crmd[14602]:   notice: tengine_stonith_notify: Peer nfs4 was not terminated (reboot) by nfs1 for nfs1: No such device (ref=a2c22e17-bf30-4bf4-915b-9129959f1d7c) by client stonith_admin.cman.12432'


Actual results:


Expected results:
Upgrade should be successful.

Additional info:
sos reports and ganesha logs are placed under http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1309984

Comment 2 Kaleb KEITHLEY 2016-06-20 12:19:02 UTC
In-service upgrade is not supported.