Bug 874348
| Summary: | Mount point breaks on the client when a LUN from a storage backend goes offline or is missing; afterwards the data is unusable | |||
|---|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Andreas Huser <ahuser> | |
| Component: | replicate | Assignee: | Pranith Kumar K <pkarampu> | |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | ||
| Severity: | urgent | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | mainline | CC: | ahuser, bfoster, gluster-bugs, jdarcy | |
| Target Milestone: | --- | |||
| Target Release: | --- | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | glusterfs-3.4.0 | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 878874 (view as bug list) | Environment: | ||
| Last Closed: | 2013-07-24 17:55:17 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 878874 | |||
|
Description
Andreas Huser
2012-11-08 01:11:47 UTC
Hi Andreas,
According to the log messages, the entry /test2/usr appears to have changed file type: for example, it could be a file on one machine and a directory on the other. This is why you are seeing Input/Output errors.
"when i go to a Storage backend and set one lun to offline or remove a infiniband cable then crashed the complete gluster volume."
What exactly do you mean by 'crashed the complete gluster volume'? Does it mean only the Input/Output errors you are observing, or something more?
Hi Pranith,
test2 is only a test mount point with a copied usr folder.
I have checked several different environments and scenarios. The errors occur only with an XFS filesystem; I have no problems with ext4.
When XFS loses its device /dev/sdX, whatever the source (LUN, disk, FC LUN, iSCSI, RAID controller, etc.), XFS performs an xfs_do_force_shutdown. The result is an unusable replica mount point for clients, and the volume mounted on a client fails with I/O errors.
To test this, build the following environment (InfiniBand is not needed):
1.) Two GlusterFS servers, each with two hard drives:
one for the system and one for the GlusterFS volume.
2.) A client to mount the GlusterFS volume.
Format the GlusterFS volume drive with XFS on server 1 and server 2. Create a replica volume and mount it on the client. Now copy a lot of data into this mount point. During the copy, pull the SATA cable from the volume hard drive on server 1 or server 2 and wait a few seconds. You will now see that the GlusterFS volume on the client has failed. You cannot remount this volume. You must reset the failed device and restart the GlusterFS volume.
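The setup described above could be scripted roughly as follows. This is only a sketch: it builds the shell commands for each role and prints them, without executing anything. The host names, the device `/dev/sdb1`, the brick directory, the volume name, and the client mount point are all assumptions chosen for illustration, and pulling the SATA cable remains a manual step.

```python
from typing import Dict, List


def build_repro_commands(
    servers: List[str],
    device: str = "/dev/sdb1",           # assumed brick device
    brick_dir: str = "/bricks/brick1",   # assumed brick mount point
    volume: str = "testvol",             # assumed volume name
    client_mount: str = "/mnt/test2",    # assumed client mount point
) -> Dict[str, List[str]]:
    """Return the shell commands for each role; nothing is executed."""
    if len(servers) != 2:
        raise ValueError("the reproduction uses exactly two servers")

    # Run on both servers: format the second drive with XFS and mount it.
    each_server = [
        f"mkfs.xfs {device}",
        f"mkdir -p {brick_dir}",
        f"mount {device} {brick_dir}",
        f"mkdir -p {brick_dir}/data",
    ]

    # Run on the first server only: form the cluster and the replica volume.
    bricks = " ".join(f"{s}:{brick_dir}/data" for s in servers)
    first_server = [
        f"gluster peer probe {servers[1]}",
        f"gluster volume create {volume} replica 2 {bricks}",
        f"gluster volume start {volume}",
    ]

    # Run on the client: mount the volume and start a long copy.
    client = [
        f"mkdir -p {client_mount}",
        f"mount -t glusterfs {servers[0]}:/{volume} {client_mount}",
        f"cp -a /usr {client_mount}/",
    ]
    return {
        "each_server": each_server,
        "first_server": first_server,
        "client": client,
    }


if __name__ == "__main__":
    for role, commands in build_repro_commands(["server1", "server2"]).items():
        print(f"# {role}")
        for command in commands:
            print(command)
```

While the `cp` is running on the client, disconnect the brick drive on one server to trigger the failure.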
I hope that helps you.
Many greetings from Germany
Andreas
Nov 13 14:43:16 kvm01 sm-notify[2387]: Version 1.2.3 starting
Nov 13 14:44:15 kvm01 kernel: ata4: exception Emask 0x10 SAct 0x0 SErr 0x90000 action 0xe frozen
Nov 13 14:44:15 kvm01 kernel: ata4: irq_stat 0x00400000, PHY RDY changed
Nov 13 14:44:15 kvm01 kernel: ata4: SError: { PHYRdyChg 10B8B }
Nov 13 14:44:15 kvm01 kernel: ata4: hard resetting link
Nov 13 14:44:15 kvm01 kernel: ata4: SATA link down (SStatus 0 SControl 300)
Nov 13 14:44:20 kvm01 kernel: ata4: hard resetting link
Nov 13 14:44:21 kvm01 kernel: ata4: SATA link down (SStatus 0 SControl 300)
Nov 13 14:44:21 kvm01 kernel: ata4: limiting SATA link speed to 1.5 Gbps
Nov 13 14:44:26 kvm01 kernel: ata4: hard resetting link
Nov 13 14:44:26 kvm01 kernel: ata4: SATA link down (SStatus 0 SControl 310)
Nov 13 14:44:26 kvm01 kernel: ata4.00: disabled
Nov 13 14:44:26 kvm01 kernel: ata4: EH complete
Nov 13 14:44:26 kvm01 kernel: ata4.00: detaching (SCSI 3:0:0:0)
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: [sdb] killing request
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: [sdb] Unhandled error code
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: [sdb] CDB: Write(10): 2a 00 12 8e 1e 1f 00 00 08 00
Nov 13 14:44:26 kvm01 kernel: Buffer I/O error on device sdb1, logical block 61108397
Nov 13 14:44:26 kvm01 kernel: lost page write due to I/O error on sdb1
Nov 13 14:44:26 kvm01 kernel: Buffer I/O error on device sdb1, logical block 61108439
Nov 13 14:44:26 kvm01 kernel: lost page write due to I/O error on sdb1
Nov 13 14:44:26 kvm01 kernel: Buffer I/O error on device sdb1, logical block 61108465
Nov 13 14:44:26 kvm01 kernel: lost page write due to I/O error on sdb1
Nov 13 14:44:26 kvm01 kernel: Buffer I/O error on device sdb1, logical block 61108571
Nov 13 14:44:26 kvm01 kernel: lost page write due to I/O error on sdb1
Nov 13 14:44:26 kvm01 kernel: Buffer I/O error on device sdb1, logical block 61108620
Nov 13 14:44:26 kvm01 kernel: lost page write due to I/O error on sdb1
Nov 13 14:44:26 kvm01 kernel: Buffer I/O error on device sdb1, logical block 61108621
Nov 13 14:44:26 kvm01 kernel: lost page write due to I/O error on sdb1
Nov 13 14:44:26 kvm01 kernel: Buffer I/O error on device sdb1, logical block 61108622
Nov 13 14:44:26 kvm01 kernel: lost page write due to I/O error on sdb1
Nov 13 14:44:26 kvm01 kernel: Buffer I/O error on device sdb1, logical block 61108623
Nov 13 14:44:26 kvm01 kernel: lost page write due to I/O error on sdb1
Nov 13 14:44:26 kvm01 kernel: Buffer I/O error on device sdb1, logical block 61108624
Nov 13 14:44:26 kvm01 kernel: lost page write due to I/O error on sdb1
Nov 13 14:44:26 kvm01 kernel: Buffer I/O error on device sdb1, logical block 61108625
Nov 13 14:44:26 kvm01 kernel: lost page write due to I/O error on sdb1
Nov 13 14:44:26 kvm01 kernel: XFS (sdb1): Device sdb1: metadata write error block 0xe8e24b0
Nov 13 14:44:26 kvm01 kernel: XFS (sdb1): I/O error occurred: meta-data dev sdb1 block 0x1d1c4e19 ("xlog_iodone") error 5 buf count 32768
Nov 13 14:44:26 kvm01 kernel: XFS (sdb1): xfs_do_force_shutdown(0x2) called from line 891 of file fs/xfs/xfs_log.c. Return address = 0xffffffffa087c8dc
Nov 13 14:44:26 kvm01 kernel: XFS (sdb1): Log I/O Error Detected. Shutting down filesystem
Nov 13 14:44:26 kvm01 kernel: XFS (sdb1): Please umount the filesystem and rectify the problem(s)
Nov 13 14:44:26 kvm01 kernel: XFS (sdb1): xfs_log_force: error 5 returned.
Nov 13 14:44:26 kvm01 kernel: XFS (sdb1): xfs_log_force: error 5 returned.
Nov 13 14:44:26 kvm01 kernel: XFS (sdb1): xfs_do_force_shutdown(0x1) called from line 1056 of file fs/xfs/linux-2.6/xfs_buf.c. Return address = 0xffffffffa0898283
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: [sdb] Synchronizing SCSI cache
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: [sdb] Stopping disk
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: [sdb] START_STOP FAILED
Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Nov 13 14:44:38 kvm01 kernel: XFS (sdb1): xfs_log_force: error 5 returned.
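A log like the one above is easier to read once the repeated lines are counted and the XFS shutdown events are pulled out. The following is a minimal sketch of such a summary; the function name and the regular expressions are assumptions written against the message formats visible in this log, not a general kernel log parser.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List

# Matches e.g. "XFS (sdb1): xfs_do_force_shutdown(0x2) called from line 891
# of file fs/xfs/xfs_log.c. Return address = ..."
SHUTDOWN_RE = re.compile(
    r"XFS \((?P<dev>\w+)\): xfs_do_force_shutdown\((?P<flags>0x[0-9a-fA-F]+)\) "
    r"called from line (?P<line>\d+) of file (?P<file>\S+)\. Return address"
)
# Matches e.g. "sd 3:0:0:0: rejecting I/O to offline device"
REJECT_RE = re.compile(r"sd (?P<scsi>[\d:]+): rejecting I/O to offline device")


def summarize_kernel_log(lines: Iterable[str]) -> Dict[str, object]:
    """Count rejected I/O per SCSI address and collect XFS forced shutdowns."""
    rejected: Counter = Counter()
    shutdowns: List[Dict[str, str]] = []
    for text in lines:
        match = REJECT_RE.search(text)
        if match:
            rejected[match.group("scsi")] += 1
            continue
        match = SHUTDOWN_RE.search(text)
        if match:
            shutdowns.append(match.groupdict())
    return {"rejected_io": dict(rejected), "xfs_shutdowns": shutdowns}


SAMPLE = [
    "Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device",
    "Nov 13 14:44:26 kvm01 kernel: sd 3:0:0:0: rejecting I/O to offline device",
    "Nov 13 14:44:26 kvm01 kernel: XFS (sdb1): xfs_do_force_shutdown(0x2) called "
    "from line 891 of file fs/xfs/xfs_log.c. Return address = 0xffffffffa087c8dc",
]

if __name__ == "__main__":
    print(summarize_kernel_log(SAMPLE))
```

Feeding it the full log (for example, the lines of /var/log/messages) would show how many writes were rejected before XFS shut the filesystem down.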
Brian,
This seems a lot like bug 892730. Could you take a look at these XFS logs and confirm that it is the same issue?
Here is the commit log of the bug to refresh your memory.
Pranith.
commit 679cb2399fc1f8e97f2b29654ec422f267b03783
Author: Brian Foster <bfoster>
Date: Thu Jan 10 10:49:17 2013 -0500

afr: conditionally prioritize EIO errors over ENOENT

The most important errno logic historically only prioritized ESTALE
over ENOENT. Commit c8c0942d added EIO prioritization over ENOENT
to ensure that split-brain was reported when it occurs in
conjunction with bricks missing the file entry. The unintended side
effect of this change is that (non split-brain) EIO errors reported
from the bricks themselves are now reported to the client when the
expectation is that afr should squash said errors in favor of
marking the file inconsistent.

The high-level problem is that EIO is overloaded with different
meanings from different contexts. This commit adds an eio parameter
to the errno priority logic to conditionally flag when EIO is of
higher priority and should be propagated to the client.

BUG: 892730
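The idea in the commit message can be illustrated with a small sketch. This is not the afr source (which is written in C); the function names, the numeric ranks, the relative order of ESTALE and EIO, and the "keep the older errno on a tie" fallback are all assumptions made for illustration. Only two rules come from the commit text: ESTALE outranks ENOENT, and EIO outranks ENOENT only when the caller sets the eio flag.

```python
import errno


def errno_rank(err: int, eio: bool) -> int:
    """Return an illustrative priority for an errno (higher wins)."""
    if err == 0:
        return 0                      # no error recorded yet
    if err == errno.ESTALE:
        return 3                      # always outranks ENOENT
    if err == errno.EIO and eio:
        return 2                      # outranks ENOENT only when flagged
    return 1                          # ENOENT and everything else (assumption)


def most_important_errno(old: int, new: int, eio: bool = False) -> int:
    """Pick which of two per-brick errnos to report to the client."""
    if errno_rank(new, eio) > errno_rank(old, eio):
        return new
    return old                        # ties keep the older value (assumption)


if __name__ == "__main__":
    # Split-brain style case: the caller flags EIO as significant.
    print(most_important_errno(errno.ENOENT, errno.EIO, eio=True) == errno.EIO)
    # Brick-level EIO: no flag, so EIO gets no boost over ENOENT.
    print(most_important_errno(errno.ENOENT, errno.EIO, eio=False) == errno.ENOENT)
```

The point of the flag is that the caller, which knows the context, decides whether an EIO means split-brain and must reach the client, or is just one brick failing.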
(In reply to comment #3)
> Brian,
> This seems a lot like the bug 892730. Could you take a look at these xfs
> logs and confirm that this is the same.
>

It's certainly similar and could be a factor. I think the caveat to note is that 892730 addressed an error prioritization issue as opposed to an error suppression issue. If I recall correctly, the invalid exposure of EIO was tied to a lookup failure which was expected to fail, but with ENOENT instead of EIO. I suppose that could be a source of errors here if a fileset copy is in progress. I don't recall seeing such a blatant problem as ls failure on the mountpoint, however. This might be worth attempting to simulate in a VM.

> Here is the commit log of the bug to refresh your memory.
>
> Pranith.
>
> commit 679cb2399fc1f8e97f2b29654ec422f267b03783
> Author: Brian Foster <bfoster>
> Date: Thu Jan 10 10:49:17 2013 -0500
>
> afr: conditionally prioritize EIO errors over ENOENT
>
> The most important errno logic historically only prioritized ESTALE
> over ENOENT. Commit c8c0942d added EIO prioritization over ENOENT
> to ensure that split-brain was reported when it occurs in
> conjunction with bricks missing the file entry. The unintended side
> effect of this change is that (non split-brain) EIO errors reported
> from the bricks themselves are now reported to the client when the
> expectation is that afr should squash said errors in favor of
> marking the file inconsistent.
>
> The high-level problem is that EIO is overloaded with different
> meanings from different contexts. This commit adds an eio parameter
> to the errno priority logic to conditionally flag when EIO is of
> higher priority and should be propagated to the client.
>
> BUG: 892730

Andreas Huser,
The bug is most probably fixed by Brian's patch. Could you let us know the results with 3.4 alpha? You can install it from the packages at http://download.gluster.org/pub/gluster/glusterfs/qa-releases/3.4.0alpha/
Pranith.
Hi Pranith,
I will gladly do this, but I must build a new test environment first. This can take a few days. Many thanks for your great work!
Regards
Andreas

Andreas Huser,
Please add a comment with your findings after you get to test it.
Pranith
Andreas,
Please feel free to re-open the bug if you think that the previous patch does not address the problem. I am closing it for now.
Pranith
Hi Pranith,
I'm sorry, I have too much work at the moment. Next week looks better. I will post the results as soon as possible.
Many thanks
Regards
Andreas |