Bug 1084925

Summary: [Upgrade]: during an RHS upgrade of a geo-rep setup from RHS-2.1.1 to RHS-3.0, IO on the client failed with OSError: [Errno 116] Stale file handle.
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Vijaykumar Koppad <vkoppad>
Component: geo-replication
Assignee: Bug Updates Notification Mailing List <rhs-bugs>
Status: CLOSED CURRENTRELEASE
QA Contact: storage-qa-internal <storage-qa-internal>
Severity: high
Docs Contact:
Priority: high
Version: rhgs-3.0
CC: avishwan, chrisw, csaba, david.macdonald, mzywusko, nlevinki, nsathyan
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-12-01 12:39:17 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
sosreport from one of the master nodes, where the monitor died after the upgrade (flags: none)

Description Vijaykumar Koppad 2014-04-07 09:54:47 UTC
Description of problem: during the RHS upgrade of a geo-rep setup from RHS-2.1.1 to RHS-3.0 (glusterfs-3.5qa2-0.304.git0c1d78f.el6rhs.x86_64.rpm), IO on the old RHS-2.1.1 client (glusterfs-3.4.0.44rhs) failed with OSError: [Errno 116] Stale file handle: '5341784d%%LXY2Y1W7YB'

Snippet from the client log file:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2014-04-06 16:09:20.219221] I [client-handshake.c:1676:select_server_supported_programs] 0-master-client-8: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2014-04-06 16:09:20.219499] I [client-handshake.c:1676:select_server_supported_programs] 0-master-client-4: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2014-04-06 16:09:20.229233] I [client-handshake.c:1474:client_setvolume_cbk] 0-master-client-8: Connected to 10.70.43.0:49154, attached to remote volume '/bricks/master_brick9'.
[2014-04-06 16:09:20.229261] I [client-handshake.c:1486:client_setvolume_cbk] 0-master-client-8: Server and Client lk-version numbers are not same, reopening the fds
[2014-04-06 16:09:20.229510] I [client-handshake.c:1676:select_server_supported_programs] 1-master-client-8: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2014-04-06 16:09:20.229832] I [client-handshake.c:1676:select_server_supported_programs] 1-master-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2014-04-06 16:09:20.230548] I [client-handshake.c:1474:client_setvolume_cbk] 1-master-client-0: Connected to 10.70.43.0:49152, attached to remote volume '/bricks/master_brick1'.
[2014-04-06 16:09:20.230573] I [client-handshake.c:1486:client_setvolume_cbk] 1-master-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2014-04-06 16:09:20.238699] I [client-handshake.c:450:client_set_lk_version_cbk] 0-master-client-8: Server lk version = 1
[2014-04-06 16:09:20.239253] I [client-handshake.c:1474:client_setvolume_cbk] 1-master-client-8: Connected to 10.70.43.0:49154, attached to remote volume '/bricks/master_brick9'.
[2014-04-06 16:09:20.239380] I [client-handshake.c:1486:client_setvolume_cbk] 1-master-client-8: Server and Client lk-version numbers are not same, reopening the fds
[2014-04-06 16:09:20.240839] I [client-handshake.c:450:client_set_lk_version_cbk] 1-master-client-0: Server lk version = 1
[2014-04-06 16:09:20.241869] I [client-handshake.c:1474:client_setvolume_cbk] 0-master-client-4: Connected to 10.70.43.0:49153, attached to remote volume '/bricks/master_brick5'.
[2014-04-06 16:09:20.241895] I [client-handshake.c:1486:client_setvolume_cbk] 0-master-client-4: Server and Client lk-version numbers are not same, reopening the fds
[2014-04-06 16:09:20.242415] I [client-handshake.c:1676:select_server_supported_programs] 1-master-client-4: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2014-04-06 16:09:20.243037] I [client-handshake.c:450:client_set_lk_version_cbk] 0-master-client-4: Server lk version = 1
[2014-04-06 16:09:20.243222] I [client-handshake.c:1474:client_setvolume_cbk] 1-master-client-4: Connected to 10.70.43.0:49153, attached to remote volume '/bricks/master_brick5'.
[2014-04-06 16:09:20.243245] I [client-handshake.c:1486:client_setvolume_cbk] 1-master-client-4: Server and Client lk-version numbers are not same, reopening the fds
[2014-04-06 16:09:20.244121] I [client-handshake.c:450:client_set_lk_version_cbk] 1-master-client-4: Server lk version = 1
[2014-04-06 16:09:20.250322] I [client-handshake.c:450:client_set_lk_version_cbk] 1-master-client-8: Server lk version = 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
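For reference, errno 116 on Linux is ESTALE ("Stale file handle"), which the glusterfs FUSE client typically returns when a file handle it holds no longer resolves to a valid inode on the bricks, for example after the brick processes are killed and restarted underneath an open file. The report does not name the IO tool that was running on the client; as a hypothetical illustration only, a file-creation loop such as the following, run against an assumed mount point /mnt/master of the master volume, would surface the same error while bricks are being upgraded:

# Hypothetical IO generator; the mount point and file naming are assumptions, not from the report.
MNT=/mnt/master
while true; do
    f="$MNT/$(date +%s)_$RANDOM"
    dd if=/dev/urandom of="$f" bs=64k count=16 || { echo "IO failed on $f"; break; }
done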

Version-Release number of selected component (if applicable): glusterfs-3.5qa2-0.304.git0c1d78f.el6rhs.x86_64.rpm


How reproducible: didn't try to reproduce the issue. 


Steps to Reproduce:

Setup:

1. Create master and slave clusters with RHS-2.1.1

2. Create and start a geo-rep relationship between master and slave (6x2); an illustrative command sketch follows step 3.

3. Keep creating data on master.
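As an illustration of steps 1-3, the commands below sketch how such a session is typically created and started from a master node on RHS-2.1.1. The volume names (master, slave), the slave host (slavenode1), the master host (masternode1), and the client mount point are assumptions, not taken from the report.

# Create and start the geo-rep session (assumed names).
gluster volume geo-replication master slavenode1::slave create push-pem
gluster volume geo-replication master slavenode1::slave start
gluster volume geo-replication master slavenode1::slave status

# Keep creating data from a client through a FUSE mount of the master volume.
mount -t glusterfs masternode1:/master /mnt/master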

Actions:

Steps to upgrade a master or slave node:

===============================================================================================

1. Kill all the gsync and gluster processes. The following commands can be used to do so.

ps -aef | grep gluster | grep gsync | awk '{print $2}' | xargs kill -9

pkill glusterfsd

pkill glusterfs

pkill glusterd

2. Upgrade the node to RHS-3.0 (an illustrative upgrade command follows step 3).

3. Reboot the node

reboot
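For step 2, the report does not give the exact upgrade command. Assuming the node is already subscribed to the RHS-3.0 repositories, a yum-based update along the following lines would be a typical way to pull in the new packages (the exact method, yum vs. ISO, is an assumption):

yum update -y    # pulls in the RHS-3.0 glusterfs packages from the configured repos
reboot           # step 3: reboot the node after the upgrade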

 

Repeat the above process for each slave and master node. The order of upgrading them should follow this convention:

a. Start with upgrading any slave.

b. During the upgrade, the geo-rep status shows that the session on one of the master nodes has gone faulty. That faulty node is the next one to be upgraded (see the status commands after this list).

c. The next node to be upgraded is the replica pair of the slave that was just upgraded. Make sure both backend bricks are in sync during the upgrade of this node; if the bricks are not in sync at the time of upgrade, you may end up in a split-brain situation.

d. See step b (upgrade the master node whose session went faulty).
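For steps b and c, the faulty session and the brick sync state can be checked with standard gluster commands. The volume and host names below are the same assumptions used in the earlier sketch, not values from the report.

# Step b: find the master node whose session has gone faulty.
gluster volume geo-replication master slavenode1::slave status

# Step c: confirm the replica bricks are in sync before upgrading the replica pair
# (no pending heal entries should be listed).
gluster volume heal master info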


Actual results: During the upgrade there was an IO failure on the client.


Expected results: During a rolling upgrade, there should not be any IO failure on the client.


Additional info:

Comment 2 Vijaykumar Koppad 2014-06-18 09:53:01 UTC
Created attachment 909917 [details]
sosreport from one of the master nodes, where the monitor died after the upgrade