Bug 1066389

Summary: [AFR] I/O fails when one of the replica nodes go down
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: replicate
Version: rhgs-3.0
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: urgent
Priority: high
Reporter: Sachidananda Urs <surs>
Assignee: Pranith Kumar K <pkarampu>
QA Contact: Sachidananda Urs <surs>
CC: nlevinki, nsathyan, rhs-bugs, ssamanta, storage-qa-internal, vbellur
Keywords: Regression
Target Milestone: ---
Target Release: RHGS 3.0.0
Fixed In Version: glusterfs-3.6.0.18-1
Doc Type: Bug Fix
Type: Bug
Clones: 1106408
Bug Blocks: 1106408, 1112348
Last Closed: 2014-09-22 19:34:02 UTC

Description Sachidananda Urs 2014-02-18 10:15:59 UTC
Description of problem:

When one of the replica nodes in each replica sub-volume is reset, I/O on the mount fails.

For example, in the following volume:

Volume Name: nafr
Type: Distributed-Replicate
Volume ID: 825ceba6-d098-4237-9371-3fa093c0f85b
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.46.161:/rhs/brick1/r0
Brick2: 10.70.46.162:/rhs/brick1/r0
Brick3: 10.70.46.168:/rhs/brick1/r1
Brick4: 10.70.46.170:/rhs/brick1/r1

If nodes 10.70.46.162 and 10.70.46.168 are reset, I/O on the mount fails.
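For reference, a minimal sketch of how a volume with this layout would typically be created and mounted (brick paths are from the listing above; the mount point /mnt/nafr is an assumption):

# Create the 2x2 distributed-replicate volume; consecutive bricks form a replica pair
gluster volume create nafr replica 2 \
  10.70.46.161:/rhs/brick1/r0 10.70.46.162:/rhs/brick1/r0 \
  10.70.46.168:/rhs/brick1/r1 10.70.46.170:/rhs/brick1/r1
gluster volume start nafr

# FUSE-mount the volume on a client
mkdir -p /mnt/nafr
mount -t glusterfs 10.70.46.161:/nafr /mnt/nafr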


The following errors are seen on the terminal:

tar: linux-3.13.3/arch/mips/include/asm/uasm.h: Cannot open: Stale file handle
tar: linux-3.13.3/arch/mips/include/asm/unaligned.h: Cannot open: Stale file handle
tar: linux-3.13.3/arch/mips/include/asm/user.h: Cannot open: Stale file handle
tar: linux-3.13.3/arch/mips/include/asm/vr41xx/giu.h: Cannot open: Stale file handle
tar: linux-3.13.3/arch/mips/include/asm/vr41xx/irq.h: Cannot open: Stale file handle
tar: linux-3.13.3/arch/mips/include/asm/vr41xx/mpc30x.h: Cannot open: Stale file handle
tar: linux-3.13.3/arch/mips/include/asm/vr41xx/pci.h: Cannot open: Stale file handle

-snip-

tar: linux-3.13.3/arch/arm/boot: Cannot stat: Transport endpoint is not connected
tar: linux-3.13.3/arch/arm: Cannot stat: Transport endpoint is not connected
tar: linux-3.13.3/arch: Cannot stat: Transport endpoint is not connected
tar: linux-3.13.3: Cannot stat: Transport endpoint is not connected
tar: Error is not recoverable: exiting now

Version-Release number of selected component (if applicable):
glusterfs 3.4afr2.2 built on Feb 12 2014 01:43:08

How reproducible:
Always

Steps to Reproduce:
1. Create a 2x2 volume
2. Generate some I/O on the mount
3. Reset a couple of servers, one from each replica pair (see the sketch below).
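A rough sketch of steps 2 and 3, assuming the nafr volume above is mounted at /mnt/nafr; the tarball path and the sysrq-based 'reset' are assumptions, any hard reboot of the nodes should do:

# Step 2: generate I/O on the mount, e.g. untar a kernel source tree
cd /mnt/nafr
tar xf /root/linux-3.13.3.tar.xz &

# Step 3: while the untar runs, reset one node from each replica pair
ssh 10.70.46.162 'echo b > /proc/sysrq-trigger'
ssh 10.70.46.168 'echo b > /proc/sysrq-trigger'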

Actual results:
I/O fails on the mount.

Expected results:
I/O should not fail, since the other replica in each pair is still online.

Additional info:
Attached sosreports with the bug

Comment 2 Sachidananda Urs 2014-02-18 10:21:29 UTC
Sosreports are located at: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1066389/

Comment 3 Sachidananda Urs 2014-04-23 11:02:35 UTC
The same errors are seen in:

glusterfs 3.5qa2 built on Apr  8 2014 10:40:16

Comment 4 Sachidananda Urs 2014-04-24 07:26:20 UTC
Inducing I/O errors on the brick with xfstests/src/godown <mount-point> makes I/O on the client fail.
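A sketch of that check, assuming the brick sits on an XFS filesystem mounted at /rhs/brick1/r0 and xfstests was built under /root/xfstests (both paths are assumptions):

# On a brick node: force-shutdown the brick's XFS filesystem
cd /root/xfstests
./src/godown /rhs/brick1/r0

# On the client: I/O fails instead of being served by the healthy replica
dd if=/dev/zero of=/mnt/nafr/testfile bs=1M count=10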

Comment 8 Pranith Kumar K 2014-05-15 09:48:14 UTC
I misunderstood the steps given by Sac in the bug description; these are the things performed when a 'reset' is done:

1) Run: for i in {1..30}; do mkdir $i; tar xf glusterfs-3.5git.tar.gz -C $i & done
2) While this is running, kill one of the bricks in the replica pair.
3) After a while, kill all tar processes.
4) Create a backup directory and move all the 1..30 directories inside 'backup'.
5) Start the untar processes from 1) again.
6) Bring the brick back up.
Now the ESTALE failures are observed.
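A shell sketch of the sequence above, run from the client mount; the way the brick is killed and restarted here (pgrep/kill plus 'gluster volume start ... force') is an assumption, any equivalent method should work:

# On the client mount
cd /mnt/nafr
for i in {1..30}; do mkdir $i; tar xf glusterfs-3.5git.tar.gz -C $i & done

# On one server: kill the brick process of one replica pair
kill $(pgrep -f 'glusterfsd.*rhs/brick1/r0')

# Back on the client: stop the untars, move the directories aside, restart the untars
killall tar
mkdir backup && mv {1..30} backup/
for i in {1..30}; do mkdir $i; tar xf glusterfs-3.5git.tar.gz -C $i & done

# Bring the killed brick back online
gluster volume start nafr force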

Comment 11 Sachidananda Urs 2014-06-17 11:13:36 UTC
Verified on: glusterfs 3.6.0.18
Looks good.

Comment 13 errata-xmlrpc 2014-09-22 19:34:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html