Description of problem:
When one of the replica nodes in each of the replica sub-volumes is reset, IO on the mount fails. For example, in the following volume:

Volume Name: nafr
Type: Distributed-Replicate
Volume ID: 825ceba6-d098-4237-9371-3fa093c0f85b
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.46.161:/rhs/brick1/r0
Brick2: 10.70.46.162:/rhs/brick1/r0
Brick3: 10.70.46.168:/rhs/brick1/r1
Brick4: 10.70.46.170:/rhs/brick1/r1

If nodes 10.70.46.162 and 10.70.46.168 are reset, IO on the mount fails. The following errors are seen on the terminal:

tar: linux-3.13.3/arch/mips/include/asm/uasm.h: Cannot open: Stale file handle
tar: linux-3.13.3/arch/mips/include/asm/unaligned.h: Cannot open: Stale file handle
tar: linux-3.13.3/arch/mips/include/asm/user.h: Cannot open: Stale file handle
tar: linux-3.13.3/arch/mips/include/asm/vr41xx/giu.h: Cannot open: Stale file handle
tar: linux-3.13.3/arch/mips/include/asm/vr41xx/irq.h: Cannot open: Stale file handle
tar: linux-3.13.3/arch/mips/include/asm/vr41xx/mpc30x.h: Cannot open: Stale file handle
tar: linux-3.13.3/arch/mips/include/asm/vr41xx/pci.h: Cannot open: Stale file handle
-snip-
tar: linux-3.13.3/arch/arm/boot: Cannot stat: Transport endpoint is not connected
tar: linux-3.13.3/arch/arm: Cannot stat: Transport endpoint is not connected
tar: linux-3.13.3/arch: Cannot stat: Transport endpoint is not connected
tar: linux-3.13.3: Cannot stat: Transport endpoint is not connected
tar: Error is not recoverable: exiting now

Version-Release number of selected component (if applicable):
glusterfs 3.4afr2.2 built on Feb 12 2014 01:43:08

How reproducible:
Always

Steps to Reproduce:
1. Create a 2x2 volume (see the sketch below)
2. Start some IO on the mount
3. Reset a couple of the servers (one from each replica pair)

Actual results:
IO fails on the mount.

Expected results:
IO should not fail, since the other replica in each pair is still online.

Additional info:
Sosreports are attached to the bug.
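As a reference for steps 1 and 2, a minimal sketch using the brick layout from the volume info above. The client mount point /mnt/nafr and the tarball path are assumptions; the kernel tarball name matches the paths in the tar errors:

# Step 1: create and start the 2x2 distributed-replicate volume
gluster volume create nafr replica 2 \
    10.70.46.161:/rhs/brick1/r0 10.70.46.162:/rhs/brick1/r0 \
    10.70.46.168:/rhs/brick1/r1 10.70.46.170:/rhs/brick1/r1
gluster volume start nafr

# Step 2: mount on a client and start IO (mount point is illustrative)
mount -t glusterfs 10.70.46.161:/nafr /mnt/nafr
cd /mnt/nafr && tar xf /root/linux-3.13.3.tar.xz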
Sosreports are located at: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1066389/
The same errors are seen with glusterfs 3.5qa2 built on Apr 8 2014 10:40:16.
IO errors can also be induced on the brick with xfstests/src/godown <mount-point>; IO on the client then fails.
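For reference, a sketch of that godown invocation, assuming the brick directory sits on its own XFS filesystem mounted at /rhs/brick1 (godown must be pointed at the filesystem mount point, and xfstests must be built first):

# Build xfstests, then shut down the XFS filesystem backing the brick
cd xfstests && make
./src/godown /rhs/brick1    # assumption: /rhs/brick1 is the XFS mount point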
I misunderstood the steps given by Sac in the bug description. These are the things performed when a 'reset' is done (see the sketch after these steps):
1) Run: for i in {1..30}; do mkdir $i; tar xf glusterfs-3.5git.tar.gz -C $i & done
2) Kill one of the bricks in the replica pair while this is going on.
3) After a while, kill all tar processes.
4) Create a backup directory and move all 1..30 dirs inside 'backup'.
5) Start the untar processes from 1) again.
6) Bring the brick back up. Now the ESTALE failures are observed.
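A sketch of that sequence, assuming the client mount point /mnt/nafr from earlier; the brick is killed by its glusterfsd pid (readable from gluster volume status) and restarted with the standard force-start:

# On the client (mount point assumed):
cd /mnt/nafr
for i in {1..30}; do mkdir $i; tar xf glusterfs-3.5git.tar.gz -C $i & done

# On one server, while the untars run: kill one brick of a replica pair.
# The brick pid is listed by: gluster volume status nafr
kill -9 <brick-pid>

# Back on the client:
pkill tar                               # step 3: kill all tar processes
mkdir backup && mv {1..30} backup/      # step 4: move dirs 1..30 into backup
for i in {1..30}; do mkdir $i; tar xf glusterfs-3.5git.tar.gz -C $i & done   # step 5

# On the server: bring the killed brick back up (step 6).
gluster volume start nafr force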
Verified on: glusterfs 3.6.0.18
Looks good.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-1278.html