Bug 231230
Summary: leg failure on cmirrors causes devices to be stuck in SUSPEND state

| Field | Value |
|---|---|
| Product | [Retired] Red Hat Cluster Suite |
| Component | cmirror |
| Version | 4 |
| Status | CLOSED CURRENTRELEASE |
| Severity | high |
| Priority | high |
| Reporter | Corey Marthaler <cmarthal> |
| Assignee | Jonathan Earl Brassow <jbrassow> |
| QA Contact | Cluster QE <mspqa-list> |
| CC | agk, cfeist, dwysocha, jbrassow, mbroz, prockai |
| Keywords | Regression |
| Hardware | All |
| OS | Linux |
| Doc Type | Bug Fix |
| Target Milestone | --- |
| Target Release | --- |
| Last Closed | 2008-08-05 21:42:36 UTC |
Description
Corey Marthaler
2007-03-06 21:27:24 UTC
Created attachment 149388 [details]: log from link-02
Created attachment 149389 [details]: log from link-04
Created attachment 149390 [details]: log from link-07
Created attachment 149391 [details]: log from link-08
This bug is reproducible.

```
Name:              corey-mirror4
State:             SUSPENDED
Tables present:    LIVE & INACTIVE
Open count:        1
Event number:      1
Major, minor:      253, 5
Number of targets: 1
UUID: LVM-r25APQaO2jVckDoLKsoAARInYt1p0mqop2zn17MV0vt2xIcSGS1tjxAfGr0KB9X5
```

After trying this a few times with complicated 3- and 4-legged cmirror configurations, I attempted the simplest case: one cmirror, one GFS filesystem, one failure. It appears to hang just like the more difficult cases above. This failure scenario used to work more than 90% of the time, so it appears that a regression has been introduced in cmirror-kernel-largesmp-2.6.9-24.0. That, or this is a largesmp issue. I'll attempt this now with smp.

```
[root@link-08 ~]# lvs -a -o +devices
  LV                 VG    Attr   LSize Origin Snap% Move Log          Copy%  Devices
  cmirror            feist mwi-ao 8.00G             cmirror_mlog 100.00 cmirror_mimage_0(0),cmirror_mimage_1(0)
  [cmirror_mimage_0] feist iwi-ao 8.00G                                 /dev/sdh1(0)
  [cmirror_mimage_1] feist iwi-ao 8.00G                                 /dev/sda1(0)
  [cmirror_mlog]     feist lwi-ao 4.00M                                 /dev/sdg1(0)
```

[ killed /dev/sdh ]

```
[root@link-08 ~]# lvs -a -o +devices
  /dev/sdh1: read failed after 0 of 2048 at 0: Input/output error
  LV                 VG    Attr   LSize Origin Snap% Move Log          Copy% Devices
  cmirror            feist mwi-so 8.00G             cmirror_mlog 99.95  cmirror_mimage_0(0),cmirror_mimage_1(0)
  [cmirror_mimage_0] feist iwi-so 8.00G
  [cmirror_mimage_1] feist iwi-so 8.00G                                 /dev/sda1(0)
  [cmirror_mlog]     feist lwi-so 4.00M                                 /dev/sdg1(0)

[root@link-08 ~]# lvs -a -o +devices
  /dev/sdh1: read failed after 0 of 2048 at 0: Input/output error
  [HANG]
```

```
[root@link-08 ~]# dmsetup info
Name:              feist-cmirror
State:             SUSPENDED
Tables present:    LIVE & INACTIVE
Open count:        1
Event number:      1
Major, minor:      253, 5
Number of targets: 1
UUID: LVM-Bdy1hJ3CV5h3FjQYzF6oUj1Diy8WlHuFrP78jR8t27LUNs2TuoUI4bNvnf9VIJcG

Name:              feist-cmirror_mlog
State:             SUSPENDED
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 2
Number of targets: 1
UUID: LVM-Bdy1hJ3CV5h3FjQYzF6oUj1Diy8WlHuFHuA3j3gV3DsqnwONzJrvi6wAHTW0402u

Name:              feist-cmirror_mimage_1
State:             SUSPENDED
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 4
Number of targets: 1
UUID: LVM-Bdy1hJ3CV5h3FjQYzF6oUj1Diy8WlHuFgFCtJ639S8sFOi12g2F6JvO1ZBZxLW8w

Name:              feist-cmirror_mimage_0
State:             SUSPENDED
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 3
Number of targets: 1
UUID: LVM-Bdy1hJ3CV5h3FjQYzF6oUj1Diy8WlHuFcIpRkYVU2XcQNyqdZkrjWlaXhdCh6q1m

Name:              VolGroup00-LogVol01
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 1
Number of targets: 1
UUID: LVM-kXi6raoZjwmxhIMi9yE47lLMxD7rCSgkT1h6aidVTpJgaACOvfJduKZ88jI03FEf

Name:              VolGroup00-LogVol00
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 0
Number of targets: 1
UUID: LVM-kXi6raoZjwmxhIMi9yE47lLMxD7rCSgkXAqtIL9TdeqFs3GAoSPDj3JVguOzhxCd
```

Reproduced on the smp cluster as well.

Reproduced this issue without I/O or GFS on top of the cmirror. It no longer appears that a proper down convert works after a cmirror leg failure. This issue will need to block the release of cluster mirrors.

I can easily reproduce this. It appears that the suspended devices are not the problem, as those appear on the machines that are making progress. It's the machine that has a LIVE & INACTIVE table for the mirror (but no suspended devices) that is causing the problem. This machine also seems to be the log server (every time).

Reverting one of the previous changes put in for another bug seems to fix the problem. Not sure why yet...

Notes from check-in: The problem here appears to be timeouts related to clvmd. During failures under heavy load, clvmd commands (suspend/resume/activate/deactivate) can take a long time. Clvmd assumes too quickly that they have failed. This results in the fault handling being left half done. Further calls to vgreduce (by hand or by dmeventd) will not help, because the _on-disk_ version of the metadata is consistent - that is, the faulty device has been removed.
The most significant change in this patch is the removal of the 'is_remote_recovering' function. This function was designed to check whether a remote node was recovering a region, so that writes to the region could be delayed. However, even with this function, it was possible for a remote node to begin recovery on a region _after_ the function was called, but before the write (mark request) took place. Because of this, checking is done during the mark request stage - rendering the call to 'is_remote_recovering' meaningless. Given the useless nature of this function, it has been pulled.

The benefits of its removal are increased performance and much faster (more than an order of magnitude) response during the mirror suspend process. The faster suspend process leads to fewer clvmd timeouts and a reduced probability that bug 231230 will be triggered.

However, when a mirror device is reconfigured, the mirror sub-devices are removed. This is done by activating them cluster-wide before their removal. With high enough load during recovery, these operations can still take a long time - even though they are linear devices. This too has the potential for causing clvmd to time out and trigger bug 231230. There is no cluster logging fix for this issue. The delay on the linear devices must be determined. A temporary work-around would be to increase the timeout of clvmd (e.g. clvmd -t #).

assigned -> post

post -> modified

Initial tests seem to show that simple mirror failure cases once again work. However, before marking this verified: when a cmirror is not fully synced and a failure takes place, clvmd appears to hang. Is this a new bug, or an unresolved portion of this bug?

<insert the usual questions> Also, if you can capture the clvmd debug output, that might help us figure out what clvmd is stuck on.
To do this, you can run 'clvmd -d >& clvmd_output.txt &'

I added my latest non-synced cmirror leg failure attempt to bz 217895, as it resulted in lost election results from the cmirror server and caused a node to be fenced.

Lately I can no longer recreate the hang mentioned in comment #14; however, the mirror -> linear down conversion is still not taking place. Instead it appears to be corrupted.

SCENARIO - [fail_non_syncd_primary_leg]

```
Creating mirror on link-08 using device /dev/sdh1 (that we will fail) for primary leg
lvcreate -m 1 -n fail_non_syncd_primary -L 500M helter_skelter /dev/sdh1:0-500 /dev/sdf1:0-500 /dev/sdb1:0-50
mirror is only 65.60% synced right now
Disabling device sdh on link-02
Disabling device sdh on link-04
Disabling device sdh on link-07
Disabling device sdh on link-08
Attempting I/O to cause mirror conversion
dd: writing to `/dev/helter_skelter/fail_non_syncd_primary': Input/output error
1+0 records in
0+0 records out
Verifying the down conversion from mirror to linear
/dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
/dev/dm-5: read failed after 0 of 4096 at 0: Input/output error
/dev/sdh1: read failed after 0 of 2048 at 0: Input/output error
/dev/dm-3: read failed after 0 of 4096 at 524222464: Input/output error
[...]
/dev/sdh1: read failed after 0 of 2048 at 0: Input/output error
Couldn't find device with uuid 'Xhx3VZ-NGbq-1Mzy-n2rk-3xS1-aG2k-LJYjqz'.
Couldn't find all physical volumes for volume group helter_skelter.
Volume group "helter_skelter" not found
```

The down conversion didn't appear to work, as /dev/sdf1 should now be part of a linear volume.

I see these I/O errors on all the consoles:

```
device-mapper: A read failure occurred on a mirror device.
device-mapper: Unable to retry read.
scsi2 (1:1): rejecting I/O to offline device
scsi2 (1:1): rejecting I/O to offline device
device-mapper: A read failure occurred on a mirror device.
device-mapper: Unable to retry read.
scsi2 (1:1): rejecting I/O to offline device
scsi2 (1:1): rejecting I/O to offline device
device-mapper: A read failure occurred on a mirror device.
device-mapper: Unable to retry read.
scsi2 (1:1): rejecting I/O to offline device
```

Jon - are you seeing any issues w/ non-synced cmirror failures, or am I still the only one?

I am not seeing this at all. I wonder if it has something to do with new LVM packages. [When you update to new versions of the software, please indicate that.]

I will try updating my LVM/device-mapper packages.

I've updated and I still don't see it (after about ten tries). I'll keep trying, but I hope you can help me reproduce.

Jon and I have narrowed this issue down to the write size being used to trigger the down conversion. The following will cause the down conversion to work properly:

```
dd if=/dev/zero of=/dev/vg/cmirror bs=4M count=1
```

The following will not:

```
dd if=/dev/zero of=/dev/vg/cmirror count=1
```

I've seen this now. It depends on the arguments (and tools) you use to do the "write". Corey has been using 'dd if=/dev/zero of=/dev/vg/lv count=10'. dd actually does a read before writing if the request sizes are small enough. Looking at the mirror code for reads:

```c
static void do_reads(struct mirror_set *ms, struct bio_list *reads)
{
	struct bio *bio;
	struct mirror *m;

	while ((bio = bio_list_pop(reads))) {
		/*
		 * We can only read balance if the region is in sync.
		 */
		if (likely(rh_in_sync(&ms->rh, bio_to_region(&ms->rh, bio), 0)))
			m = choose_mirror(ms);
		else {
			m = ms->default_mirror;

			/* If the default fails, we give up. */
			if (unlikely(m && atomic_read(&m->error_count)))
				m = NULL;
		}

		if (likely(m))
			read_async_bio(m, bio);
		else
			bio_endio(bio, bio->bi_size, -EIO);
	}
}
```

We see that we call rh_in_sync without the ability to block (the third argument). Cluster mirroring (unlike single-machine mirroring) must always block. Therefore, it is never allowed to 'choose_mirror'. Since the primary device has failed, 'bio_endio(bio, bio->bi_size, -EIO);' is called.
This fails the read, and the writes from dd never take place... meaning an event never gets triggered and the mirror does not get down converted.

I believe it is a bug in the kernel that we do not allow blocking in rh_in_sync. Since we do not, reads are never allowed to balance [even when the mirrors are in sync]. There should also be a debate about whether or not to trigger an event if a read fails... I don't think so, but that can be discussed in another bug.

assigned -> modified

Marking this bug verified. The easy-to-reproduce cmirror leg failure regression no longer happens. All other less reproducible cmirror failure cases should be documented in other BZs.

Fixed in current release (4.7).