Bug 996987
Summary: | AFR: processes on mount point fail when one of the disks crashes | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Sachidananda Urs <surs> |
Component: | glusterfs | Assignee: | Ravishankar N <ravishankar> |
Status: | CLOSED ERRATA | QA Contact: | senaik |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 2.1 | CC: | amarts, pkarampu, rhs-bugs, surs, vbellur |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | glusterfs-3.4.0.22rhs-1 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2013-09-23 22:36:02 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 996089 |
Description
Sachidananda Urs 2013-08-14 12:21:42 UTC
Sachi, Bug 892730 was causing EIO errors to the client, whereas this issue causes ENOTCONN. I just verified that the test case attached to the commit (http://review.gluster.org/#/c/4376/2/tests/bugs/bug-892730.t) succeeds on downstream, so this is not a regression of that bug but a new issue. This bug looks a bit similar to https://bugzilla.redhat.com/show_bug.cgi?id=996089. We are in the process of figuring out the root cause. Pranith.

Sac, are you able to hit the issue consistently? I was not able to reproduce it on the RHS-2.1-20130814 ISO. The test was the same, i.e. a kernel untar while bringing down one of the replicas with xfstest-godown. Could you please upload the SOS report if you are able to hit it?

Since the sosreports are ~30M, I've uploaded them to http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/996987/. I've shown the steps to reproduce to Ravi, and the setup is provided for investigation as well.

More details on reproducing the issue (a consolidated sketch of these steps follows at the end of this report):
=====================================================================
1. Create a 2x2 distributed-replicate volume.

2. Fuse mount the volume and create some files on the mount point:

   for i in {100..1000}; do dd if=/dev/urandom of=f"$i" bs=10M count=1; done

3. While file creation is in progress, bring down one of the bricks in the replica pair:

   [root@boost b1]# gluster v status Vol3
   Status of volume: Vol3
   Gluster process                                  Port    Online  Pid
   ------------------------------------------------------------------------------
   Brick 10.70.34.85:/rhs/brick1/c1                 N/A     N       10003
   Brick 10.70.34.86:/rhs/brick1/c2                 49281   Y       2536
   Brick 10.70.34.87:/rhs/brick1/c3                 49256   Y       20002
   Brick 10.70.34.88:/rhs/brick1/c4                 49201   Y       3810
   NFS Server on localhost                          2049    Y       10015
   Self-heal Daemon on localhost                    N/A     Y       10022
   NFS Server on 10.70.34.86                        2049    Y       2550
   Self-heal Daemon on 10.70.34.86                  N/A     Y       2558
   NFS Server on 10.70.34.87                        2049    Y       20014
   Self-heal Daemon on 10.70.34.87                  N/A     Y       20021
   NFS Server on 10.70.34.88                        2049    Y       3822
   Self-heal Daemon on 10.70.34.88                  N/A     Y       3831

   There are no active volume tasks

4. After file creation has completed, calculate the arequal checksum on the mount point:

   [root@RHEL6 Vol3]# /opt/qa/tools/arequal-checksum /mnt/Vol3/
   md5sum: /mnt/Vol3/f100: Transport endpoint is not connected
   /mnt/Vol3/f100: short read
   ftw (/mnt/Vol3/) returned -1 (Success), terminating

Verified in version glusterfs-3.4.0.22rhs-1. Followed the same steps as mentioned in comment 5; unable to reproduce.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html
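
For convenience, the reproduction steps above can be consolidated into a single script. This is a minimal sketch and not part of the original report: the volume name, brick hosts/paths, and mount point are taken from the status output above, but the brick-kill step (killing the brick process to simulate a disk crash), the sleep, and the PID extraction are assumptions; the original runs brought the replica down with xfstest-godown. Adjust hostnames and paths to your environment.

```bash
#!/bin/bash
# Rough reproduction sketch (hedged: the brick-kill step and PID lookup are
# assumptions; hostnames and paths are taken from the status output above).

VOL=Vol3
MNT=/mnt/Vol3

# 1. Create and start a 2x2 distributed-replicate volume
#    (run on one of the servers in the trusted pool).
gluster volume create $VOL replica 2 \
    10.70.34.85:/rhs/brick1/c1 10.70.34.86:/rhs/brick1/c2 \
    10.70.34.87:/rhs/brick1/c3 10.70.34.88:/rhs/brick1/c4
gluster volume start $VOL

# 2. Fuse mount the volume on the client and start creating files
#    in the background.
mount -t glusterfs 10.70.34.85:/$VOL $MNT
cd $MNT
for i in {100..1000}; do dd if=/dev/urandom of=f"$i" bs=10M count=1; done &

# 3. While file creation is in progress, bring down one brick of a replica
#    pair. Killing the brick process (on the server hosting that brick) is
#    one way to simulate the brick going down.
sleep 30
kill -9 "$(gluster volume status $VOL 10.70.34.85:/rhs/brick1/c1 \
           | awk '/brick1\/c1/ {print $NF}')"

# 4. After file creation completes, run the checksum tool on the mount point
#    and check for "Transport endpoint is not connected" errors.
wait
/opt/qa/tools/arequal-checksum $MNT/
```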