Bug 1066389 - [AFR] I/O fails when one of the replica nodes goes down
Summary: [AFR] I/O fails when one of the replica nodes goes down
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: replicate
Version: rhgs-3.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.0.0
Assignee: Pranith Kumar K
QA Contact: Sachidananda Urs
URL:
Whiteboard:
Depends On:
Blocks: 1106408 1112348
 
Reported: 2014-02-18 10:15 UTC by Sachidananda Urs
Modified: 2016-09-17 12:20 UTC
CC List: 6 users

Fixed In Version: glusterfs-3.6.0.18-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1106408
Environment:
Last Closed: 2014-09-22 19:34:02 UTC
Embargoed:




Links:
System: Red Hat Product Errata
ID: RHEA-2014:1278
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Storage Server 3.0 bug fix and enhancement update
Last Updated: 2014-09-22 23:26:55 UTC

Description Sachidananda Urs 2014-02-18 10:15:59 UTC
Description of problem:

When one of the replica nodes in each replica sub-volume is reset, I/O on the mount fails.

For example, in the following volume:

Volume Name: nafr
Type: Distributed-Replicate
Volume ID: 825ceba6-d098-4237-9371-3fa093c0f85b
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.46.161:/rhs/brick1/r0
Brick2: 10.70.46.162:/rhs/brick1/r0
Brick3: 10.70.46.168:/rhs/brick1/r1
Brick4: 10.70.46.170:/rhs/brick1/r1
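
For reference, a 2x2 volume with this layout can be created with the standard gluster CLI. A minimal sketch using the brick hosts and paths from the output above (not necessarily the exact commands the reporter used):

    gluster volume create nafr replica 2 transport tcp \
        10.70.46.161:/rhs/brick1/r0 10.70.46.162:/rhs/brick1/r0 \
        10.70.46.168:/rhs/brick1/r1 10.70.46.170:/rhs/brick1/r1
    gluster volume start nafr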

If nodes 10.70.46.162 and 10.70.46.168 (one from each replica pair) are reset, I/O on the mount fails.


The following errors are seen on the terminal:

tar: linux-3.13.3/arch/mips/include/asm/uasm.h: Cannot open: Stale file handle
tar: linux-3.13.3/arch/mips/include/asm/unaligned.h: Cannot open: Stale file handle
tar: linux-3.13.3/arch/mips/include/asm/user.h: Cannot open: Stale file handle
tar: linux-3.13.3/arch/mips/include/asm/vr41xx/giu.h: Cannot open: Stale file handle
tar: linux-3.13.3/arch/mips/include/asm/vr41xx/irq.h: Cannot open: Stale file handle
tar: linux-3.13.3/arch/mips/include/asm/vr41xx/mpc30x.h: Cannot open: Stale file handle
tar: linux-3.13.3/arch/mips/include/asm/vr41xx/pci.h: Cannot open: Stale file handle

-snip-

tar: linux-3.13.3/arch/arm/boot: Cannot stat: Transport endpoint is not connected
tar: linux-3.13.3/arch/arm: Cannot stat: Transport endpoint is not connected
tar: linux-3.13.3/arch: Cannot stat: Transport endpoint is not connected
tar: linux-3.13.3: Cannot stat: Transport endpoint is not connected
tar: Error is not recoverable: exiting now

Version-Release number of selected component (if applicable):
glusterfs 3.4afr2.2 built on Feb 12 2014 01:43:08

How reproducible:
Always

Steps to Reproduce:
1. Create a 2x2 volume
2. Run some I/O on the mount
3. Reset a couple of servers (one from each replica pair), as in the sketch below.
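
A rough end-to-end sketch of these steps; the mount point, tarball, and reset method are illustrative assumptions, not taken from the report:

    # on the client: mount the volume and generate I/O
    mount -t glusterfs 10.70.46.161:/nafr /mnt/nafr
    cd /mnt/nafr && tar xf linux-3.13.3.tar.xz &

    # on one node from each replica pair (e.g. 10.70.46.162 and 10.70.46.168):
    reboot -f    # hard reset, no clean shutdown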

Actual results:
I/O fails on the mount.

Expected results:
I/O should not fail, since the other replica in each pair is still online.

Additional info:
Sosreports are attached to the bug.

Comment 2 Sachidananda Urs 2014-02-18 10:21:29 UTC
Sosreports are located at: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1066389/

Comment 3 Sachidananda Urs 2014-04-23 11:02:35 UTC
The same errors are seen in:

glusterfs 3.5qa2 built on Apr  8 2014 10:40:16

Comment 4 Sachidananda Urs 2014-04-24 07:26:20 UTC
Inducing I/O errors on the brick with xfstests/src/godown <mount-point> makes I/O on the client fail.
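
For context, godown is a helper shipped with xfstests that forces a filesystem shutdown via ioctl, simulating a dying disk. A minimal sketch, assuming the brick sits on its own XFS filesystem mounted at /rhs/brick1 (path illustrative):

    # build the xfstests helpers, then force-shutdown the fs under the brick
    cd xfstests && make
    ./src/godown /rhs/brick1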

Comment 8 Pranith Kumar K 2014-05-15 09:48:14 UTC
I misunderstood the steps given by Sac in the bug description. These are the steps performed when a 'reset' is done (see the sketch after the list):

1) Run: for i in {1..30}; do mkdir $i; tar xf glusterfs-3.5git.tar.gz -C $i & done
2) While this is going on, kill one of the bricks in the replica pair.
3) After a while, kill all tar processes.
4) Create a backup directory and move all of the 1..30 dirs inside 'backup'.
5) Start the untar processes from 1) again.
6) Bring up the brick.
Now the ESTALE failures are observed.
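
The same sequence as a rough script; the mount point, brick PID lookup, and backup path are illustrative assumptions:

    # on the client (FUSE mount of the volume at /mnt/nafr):
    cd /mnt/nafr
    for i in {1..30}; do mkdir $i; tar xf glusterfs-3.5git.tar.gz -C $i & done

    # on one server of a replica pair: kill its brick process
    # (the brick PID is listed in 'gluster volume status nafr')
    kill -9 <brick-pid>

    # back on the client: stop the untars, move everything aside, restart them
    pkill tar
    mkdir backup && mv {1..30} backup/
    for i in {1..30}; do mkdir $i; tar xf glusterfs-3.5git.tar.gz -C $i & done

    # on the server: restart the killed brick
    gluster volume start nafr force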

Comment 11 Sachidananda Urs 2014-06-17 11:13:36 UTC
Verified on: glusterfs 3.6.0.18
Looks good.

Comment 13 errata-xmlrpc 2014-09-22 19:34:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html

