Description of problem:
In a replicate setup (1x2), when one of the backend disks crashes, the processes running on the mount point fail and exit with errors.

[root@boggs xfstests]# gluster volume info

Volume Name: foo
Type: Replicate
Volume ID: b47b4690-1594-4f44-ae3e-e1e86ceacd53
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.70.37.72:/rhs/brick1/foo
Brick2: 10.70.37.97:/rhs/brick1/foo

Version-Release number of selected component (if applicable):
glusterfs 3.4.0.19rhs built on Aug 14 2013 00:11:42

How reproducible:
Always

Steps to Reproduce:
1. Create a replicate volume and mount it.
2. Do some I/O on the client: create, modify and delete files.
3. Crash one of the disks, using: xfstests/src/godown <mount-point>
4. The processes on the client mount point fail.

Actual results:
Processes fail with errors.

Expected results:
The client should see no difference, since one of the replicas is still alive and healthy.

Additional info:
The same scenario was tested and verified in Bug 892730. This is a regression from glusterfs-3.4.0qa8, on which it was verified.

Below is the log snippet when the processes fail:

[root@bob-the-minion 2.0]# tar xvf linux-3.10.3.tar.xz
tar: linux-3.10.3.tar.xz: Cannot read: Transport endpoint is not connected
tar: At beginning of tape, quitting now
tar: Error is not recoverable: exiting now

=========================================================

[2013-08-14 07:02:16.785953] W [client-rpc-fops.c:866:client3_3_writev_cbk] 0-foo-client-0: remote operation failed: Input/output error
[2013-08-14 07:02:16.786320] W [client-rpc-fops.c:2604:client3_3_lookup_cbk] 0-foo-client-0: remote operation failed: Input/output error. Path: /2.0/2/linux-3.10.3/drivers/uio/uio_pdrv.c (00000000-0000-0000-0000-000000000000)
[2013-08-14 07:02:16.787246] W [client-rpc-fops.c:2604:client3_3_lookup_cbk] 0-foo-client-0: remote operation failed: Input/output error. Path: /2.0/1/linux-3.10.3/drivers/usb/gadget/acm_ms.c (00000000-0000-0000-0000-000000000000)
[2013-08-14 07:02:16.789109] W [client-rpc-fops.c:1983:client3_3_setattr_cbk] 0-foo-client-0: remote operation failed: Input/output error
[2013-08-14 07:02:16.789717] W [client-rpc-fops.c:2058:client3_3_create_cbk] 0-foo-client-0: remote operation failed: Input/output error. Path: /2.0/2/linux-3.10.3/drivers/uio/uio_pdrv.c
[2013-08-14 07:02:16.790258] W [client-rpc-fops.c:2058:client3_3_create_cbk] 0-foo-client-0: remote operation failed: Input/output error. Path: /2.0/1/linux-3.10.3/drivers/usb/gadget/acm_ms.c
[2013-08-14 07:02:16.790651] W [client-rpc-fops.c:1744:client3_3_xattrop_cbk] 0-foo-client-0: remote operation failed: Success. Path: /2.0/3/linux-3.10.3/drivers/usb/c67x00/Makefile (b17498dc-8dae-460c-89ed-d9f9aaccb31b)
[2013-08-14 07:02:16.790793] W [client-rpc-fops.c:1744:client3_3_xattrop_cbk] 0-foo-client-0: remote operation failed: Success. Path: /2.0/2/linux-3.10.3/drivers/uio (86b4219d-1b5d-488b-9df4-71e92eeb42be)
[2013-08-14 07:02:16.791160] W [client-rpc-fops.c:1744:client3_3_xattrop_cbk] 0-foo-client-0: remote operation failed: Success. Path: /2.0/1/linux-3.10.3/drivers/usb/gadget (9f75a59c-545e-4956-8537-46a0def72b9d)
[2013-08-14 07:02:16.792593] W [client-rpc-fops.c:1983:client3_3_setattr_cbk] 0-foo-client-0: remote operation failed: Input/output error
[2013-08-14 07:02:16.793228] W [client-rpc-fops.c:1744:client3_3_xattrop_cbk] 0-foo-client-0: remote operation failed: Success. Path: /2.0/3/linux-3.10.3/drivers/usb/c67x00/Makefile (b17498dc-8dae-460c-89ed-d9f9aaccb31b)
[2013-08-14 07:02:16.808014] W [client-rpc-fops.c:464:client3_3_open_cbk] 0-foo-client-0: remote operation failed: No such file or directory. Path: /2.0/2/linux-3.10.3/drivers/uio/uio_pdrv.c (6018e9b0-60bc-4cd7-b0cb-2d417f202255)
[2013-08-14 07:02:16.808070] E [afr-open.c:273:afr_openfd_fix_open_cbk] 0-foo-replicate-0: Failed to open /2.0/2/linux-3.10.3/drivers/uio/uio_pdrv.c on subvolume foo-client-0
[2013-08-14 07:02:16.808526] W [client-rpc-fops.c:1579:client3_3_finodelk_cbk] 0-foo-client-0: remote operation failed: No such file or directory
[2013-08-14 07:02:16.808827] W [client-rpc-fops.c:464:client3_3_open_cbk] 0-foo-client-0: remote operation failed: No such file or directory. Path: /2.0/1/linux-3.10.3/drivers/usb/gadget/acm_ms.c (5bce55b9-78ad-465a-8ea6-8d58caa524d4)
[2013-08-14 07:02:16.808863] E [afr-open.c:273:afr_openfd_fix_open_cbk] 0-foo-replicate-0: Failed to open /2.0/1/linux-3.10.3/drivers/usb/gadget/acm_ms.c on subvolume foo-client-0
[2013-08-14 07:02:16.809001] W [client-rpc-fops.c:1579:client3_3_finodelk_cbk] 0-foo-client-0: remote operation failed: No such file or directory
[2013-08-14 07:02:16.809164] W [client-rpc-fops.c:1983:client3_3_setattr_cbk] 0-foo-client-0: remote operation failed: Input/output error
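For triage, it can help to see which client callbacks are failing and with which error string, rather than reading the log line by line. The following is a minimal sketch, not part of the original report: the function name summarize_remote_failures is hypothetical, and it assumes the log lines keep the "remote operation failed: <error>" layout shown in the snippet above.

```shell
# Summarize "remote operation failed" messages from a GlusterFS client log.
# Reads log text on stdin and prints one "<count> <callback> <error>" line
# per distinct (callback, error) pair, most frequent first.
# NOTE: summarize_remote_failures is a hypothetical helper name.
summarize_remote_failures() {
    # Capture the callback name (last field inside the [file:line:func]
    # bracket) and the error text up to the first period or end of line.
    sed -n 's/.*\] [WE] \[[^]]*:\([A-Za-z0-9_]*\)\] [^ ]*: remote operation failed: \([^.]*\).*/\1 \2/p' |
        sort | uniq -c | sort -rn |
        awk '{$1=$1; print}'   # normalize the padding that uniq -c adds
}
```

Usage would look like `summarize_remote_failures < client.log`, where client.log is a placeholder for the client mount log of the affected volume; the report does not give that path. Lines that do not contain "remote operation failed", such as the afr_openfd_fix_open_cbk errors, are ignored.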
Sachi, Bug 892730 was causing EIO errors on the client, whereas this issue is causing ENOTCONN on the client. I just verified that the test case attached to the commit (http://review.gluster.org/#/c/4376/2/tests/bugs/bug-892730.t) succeeds on downstream. So this is not a regression of that bug but a new issue. This bug looks somewhat similar to https://bugzilla.redhat.com/show_bug.cgi?id=996089. We are in the process of figuring out the root cause. Pranith.
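The EIO versus ENOTCONN distinction can be read directly from the message text that tools print on the client: on Linux, ENOTCONN is reported as "Transport endpoint is not connected" and EIO as "Input/output error". A small sketch of that mapping follows; the function name classify_client_error is hypothetical, and only these two errno values are covered.

```shell
# Map an error message seen on the client to the errno it corresponds to.
# NOTE: classify_client_error is a hypothetical helper name; the strings are
# the standard Linux strerror() texts for ENOTCONN and EIO.
classify_client_error() {
    case "$1" in
        *"Transport endpoint is not connected"*) echo ENOTCONN ;;
        *"Input/output error"*)                  echo EIO ;;
        *)                                       echo OTHER ;;
    esac
}
```

Applied to the tar failure quoted in the description ("Cannot read: Transport endpoint is not connected"), this prints ENOTCONN, which matches the point that the symptom differs from the EIO seen in Bug 892730.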
Sac, are you able to hit the issue consistently? I was not able to reproduce it on the RHS-2.1-20130814 ISO. The test was the same, i.e. a kernel untar while bringing down one of the replicas with xfstest-godown. Could you please upload the SOS report if you are able to hit it?
Since the sosreports are ~30M, I've uploaded them to http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/996987/ I've shown the steps to reproduce to Ravi.
Setup is provided for investigation as well.
More details on reproducing the issue:
=====================================================================

1. Create a 2x2 distributed-replicate volume.

2. FUSE mount the volume and create some files on the mount point:

for i in {100..1000} ; do dd if=/dev/urandom of=f"$i" bs=10M count=1; done

3. While file creation is in progress, bring down one of the bricks in the replica pair.

[root@boost b1]# gluster v status Vol3
Status of volume: Vol3
Gluster process                                 Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.34.85:/rhs/brick1/c1                N/A     N       10003
Brick 10.70.34.86:/rhs/brick1/c2                49281   Y       2536
Brick 10.70.34.87:/rhs/brick1/c3                49256   Y       20002
Brick 10.70.34.88:/rhs/brick1/c4                49201   Y       3810
NFS Server on localhost                         2049    Y       10015
Self-heal Daemon on localhost                   N/A     Y       10022
NFS Server on 10.70.34.86                       2049    Y       2550
Self-heal Daemon on 10.70.34.86                 N/A     Y       2558
NFS Server on 10.70.34.87                       2049    Y       20014
Self-heal Daemon on 10.70.34.87                 N/A     Y       20021
NFS Server on 10.70.34.88                       2049    Y       3822
Self-heal Daemon on 10.70.34.88                 N/A     Y       3831

There are no active volume tasks

4. After file creation is completed, calculate the arequal checksum on the mount point:

[root@RHEL6 Vol3]# /opt/qa/tools/arequal-checksum /mnt/Vol3/
md5sum: /mnt/Vol3/f100: Transport endpoint is not connected
/mnt/Vol3/f100: short read
ftw (/mnt/Vol3/) returned -1 (Success), terminating
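The client-side part of these steps (write files, then read them all back) can be wrapped in two small helpers so the run is repeatable. This is only a sketch under assumptions: write_files and unreadable_files are hypothetical names, plain md5sum stands in for the arequal-checksum tool, and bringing the brick down in step 3 is still done by hand on the server.

```shell
# write_files DIR START END [BS]
# Create files f<START>..f<END> in DIR, one dd block of size BS each
# (default 10M, as in step 2). NOTE: hypothetical helper name.
write_files() (
    dir=$1; start=$2; end=$3; bs=${4:-10M}
    i=$start
    while [ "$i" -le "$end" ]; do
        dd if=/dev/urandom of="$dir/f$i" bs="$bs" count=1 2>/dev/null || exit 1
        i=$((i + 1))
    done
)

# unreadable_files DIR
# Print every regular file under DIR that md5sum fails to read.
# md5sum is a stand-in for arequal-checksum, which is not reproduced here.
unreadable_files() (
    find "$1" -type f | sort | while IFS= read -r f; do
        md5sum "$f" >/dev/null 2>&1 || printf '%s\n' "$f"
    done
)
```

For the scenario above this would be `write_files /mnt/Vol3 100 1000` while the brick is taken down, followed by `unreadable_files /mnt/Vol3`. On a healthy replica pair the second command should print nothing; in the failing case reported here, /mnt/Vol3/f100 would be expected in its output.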
https://code.engineering.redhat.com/gerrit/#/c/11666
Verified in version: glusterfs-3.4.0.22rhs-1. Followed the same steps as mentioned in comment 5. Unable to reproduce.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1262.html