Bug 596453
Summary: | multiple mirror image (leg) device failure cause lvm repair/allocation to fail and lvm to deadlock
---|---
Product: | Red Hat Enterprise Linux 6
Component: | lvm2
Version: | 6.0
Hardware: | All
OS: | Linux
Status: | CLOSED CURRENTRELEASE
Severity: | high
Priority: | high
Reporter: | Corey Marthaler <cmarthal>
Assignee: | Jonathan Earl Brassow <jbrassow>
QA Contact: | Corey Marthaler <cmarthal>
CC: | agk, antillon.maurizio, coughlan, dwysocha, heinzm, jbrassow, joe.thornber, mbroz, msnitzer, prajnoha, prockai, tyasui
Keywords: | Regression, TestBlocker
Target Milestone: | rc
Fixed In Version: | lvm2-2.02.72-8.el6
Doc Type: | Bug Fix
Last Closed: | 2010-11-10 21:08:00 UTC
Bug Blocks: | 606931
Description (Corey Marthaler, 2010-05-26 19:03:33 UTC):
Created attachment 416984 [details]
kern dump from taft-01
*** Bug 605447 has been marked as a duplicate of this bug. ***

Corey, could you please verify that you can still hit this with just one mirror?

I can hit problems with this specific test: kill_primary_and_log_synced_2_legs. However, it is only after I have 2 mirrors. Both are properly converted to linear (if you look at 'dmsetup table'), but a failure to remove the old *mimage/mlog devices for one of the mirrors causes the test to halt. At this stage, I/O seems to continue happily, but there are left-over DM devices and LVM's perspective is all screwed up. Below you can see what I'm talking about. First, the system messages showing that the sub-devices cannot be deactivated - this is what ultimately screws everything up. Also shown are the 'lvs' listing and 'dmsetup table'. You can see that the mirrors have been properly converted to linear, but the clean-up was not finished.

/var/log/messages:
```
Jul 1 13:39:24 bp-01 lvm[2140]: LV helter_skelter/syncd_primary_log_2legs_2_mimage_0 in use: not deactivating
Jul 1 13:39:24 bp-01 lvm[2140]: Repair of mirrored LV helter_skelter/syncd_primary_log_2legs_2 failed.
Jul 1 13:39:24 bp-01 lvm[2140]: Failed to remove faulty devices in helter_skelter-syncd_primary_log_2legs_2.
Jul 1 13:39:26 bp-01 lvm[2140]: No longer monitoring mirror device helter_skelter-syncd_primary_log_2legs_2 for events.
```

```
[root@bp-01 ~]# devices
  LV                                  Copy%  Devices
  syncd_primary_log_2legs_1                  /dev/sdg1(0)
  syncd_primary_log_2legs_2                  /dev/sdg1(150)
  syncd_primary_log_2legs_2_mimage_0         unknown device(150)
  syncd_primary_log_2legs_2_mimage_1
  syncd_primary_log_2legs_2_mlog             unknown device(1)

[root@bp-01 ~]# dmsetup table
h-syncd_primary_log_2legs_2_mlog: 0 8192 linear 8:81 8576
h-syncd_primary_log_2legs_2_mimage_1: 0 1228800 error
h-syncd_primary_log_2legs_2_mimage_0: 0 1228800 linear 8:113 1229184
h-syncd_primary_log_2legs_2: 0 1228800 linear 8:97 1229184
h-syncd_primary_log_2legs_1: 0 1228800 linear 8:97 384
```

This *could* be related to the persistent filter bug.

Jon - I can hit this with only one mirror.
```
Jul 2 15:44:08 taft-01 lvm[4279]: LV helter_skelter/syncd_primary_log_2legs_1_mimage_0 in use: not deactivating
Jul 2 15:44:08 taft-01 kernel: sd 3:0:0:6: rejecting I/O to offline device
Jul 2 15:44:08 taft-01 lvm[4279]: Repair of mirrored LV helter_skelter/syncd_primary_log_2legs_1 failed.
Jul 2 15:44:08 taft-01 lvm[4279]: Failed to remove faulty devices in helter_skelter-syncd_primary_log_2legs_1.
```

```
[root@silver helter_skelter]# cat foo
  LV                                 VG             Attr   LSize   Log Copy%  Devices
  syncd_primary_log_2legs_1          helter_skelter -wi-ao 600.00m            /dev/sdb1(0)
  syncd_primary_log_2legs_1_mimage_0 helter_skelter -wi-a- 600.00m            unknown device(0)
  syncd_primary_log_2legs_1_mimage_1 helter_skelter vwi-a- 600.00m
  syncd_primary_log_2legs_1_mlog     helter_skelter -wi-a-   4.00m            unknown device(0)
```

Just tested over the weekend... 38 iterations before failing - still the same messages. Need to bump up verbosity - might need to alter dmeventd to get messages through.

Jon, are you trying with current upstream CVS or with a released version (which one)? In particular, does it include the persistent filter fixes that were done a couple of weeks ago? I am less inclined to think this is actually the filter bug, but just to make sure...

I have analysed Jon's logs and the code a bit, and these are my conclusions:

1) The problematic mimage is a *visible* one -- therefore, it must be one of those that are being removed from the mirror. Since it is visible, any userspace process could be opening it. Not least an intervening LVM scan.

2) Jon has 3 failure logs, all of which exhibit this:
```
Jul 21 19:26:58 bp-01 qarshd[10511]: Running cmdline: pvs -a
Jul 21 19:26:58 bp-01 lvm[19659]: LV helter_skelter/syncd_primary_log_2legs_1_mlog in use: not deactivating
Jul 21 19:27:10 bp-01 xinetd[1588]: EXIT: qarsh status=0 pid=10511 duration=12(sec)

Jul 22 17:40:41 bp-01 qarshd[20140]: Running cmdline: pvs -a
Jul 22 17:40:51 bp-01 lvm[28669]: LV helter_skelter/syncd_primary_log_2legs_3_mimage_0 in use: not deactivating
Jul 22 17:40:51 bp-01 xinetd[1588]: EXIT: qarsh status=0 pid=20140 duration=10(sec)

Jul 22 13:03:32 bp-01 qarshd[4297]: Running cmdline: pvs -a
Jul 22 13:03:38 bp-01 lvm[25536]: LV helter_skelter/syncd_primary_log_2legs_2_mimage_0 appears to be in use: sleep/retry
Jul 22 13:03:46 bp-01 xinetd[1588]: EXIT: qarsh status=0 pid=4297 duration=14(sec)
```

Moreover, Corey's log from bug 606931 shows:
```
Jun 28 23:18:24 taft-01 qarshd[12247]: Running cmdline: pvs -a
Jun 28 23:18:31 taft-01 lvm[2266]: LV helter_skelter/syncd_primary_2legs_2_mimage_0 in use: not deactivating
Jun 28 23:18:32 taft-01 xinetd[1761]: EXIT: qarsh status=0 pid=12247 duration=8(sec)
```

From these logs, and from Jon's probes, it can be derived that the pvs and dmeventd I/O is running in parallel and gets interleaved. It is fully conceivable that pvs would open the (visible) mimage (or mlog, for that matter) during a label/metadata scan.

Now, there are two options to address this, assuming the hypothesis is correct. One is Jon's "retry" patch, which should work, although it is somewhat fragile and "icky" in how it works. I would rather avoid that. The other is a lot more robust, in my opinion: avoid marking the detached LVs as visible. This would prevent any userspace process from interfering with the ongoing repair. Arguably, they are made visible for the cases where something fails later, before they can be removed. There are two ways to deal with this, other than the current one (which has the above side effects). Either we can keep them hidden and add functionality to "vgck" that will report orphaned hidden LVs (and possibly offer to remove them) -- when a fatal error occurs during a VG-editing command, we could simply urge the administrator to run vgck on the affected VG. Another option involves the same check, but done by lvconvert and related tools themselves just before they fail. I like the former option (augment and advertise vgck) more.

PS: A simple way to further validate the above hypothesis is to forbid LVM from scanning any mlog and mimage devices (using the regex filter). If things still fail with such a filter, the above pvs interaction could be ruled out, as unlikely as that is.
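To illustrate the filter suggestion in that PS: a minimal sketch of what such a reject filter could look like in /etc/lvm/lvm.conf, assuming the default configuration otherwise accepts all devices. The patterns are illustrative guesses that match device names containing `_mimage_` or `_mlog`; whether the /dev/dm-X alias is also kept out of the scan is a separate question, as a later comment notes.

```
# Diagnostic-only sketch, not a recommended permanent setting: reject mirror
# sub-LV device names (e.g. /dev/mapper/<vg>-<lv>_mimage_0, ..._mlog) from
# LVM scanning, accept everything else.
devices {
    filter = [ "r|_mimage_|", "r|_mlog|", "a|.*|" ]
}

# If a persistent filter cache is in use, clear it so the new filter takes
# effect on the next scan:
#   rm -f /etc/lvm/cache/.cache
```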
I have manually reproduced the race between pvs and lvconvert using gdb. I believe this confirms the hypothesis about the interaction of scanning in pvs and lvconvert. Unfortunately, pvs is accessing /dev/dm-X, which will exist regardless of the visibility of the LV, so my original proposed fix is inadequate. On the LVM level, it would certainly be an option to skip (even label) scans of DM devices that correspond to LVs with reserved names. Nevertheless, that still leaves us open to other userspace tools looking at /dev/dm-X.
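The comment above does not spell out the reproduction steps; the following is only a sketch of how such a race can be provoked by hand with gdb. The breakpoint location is a guess (any point after the images are detached but before they are deactivated would do), not necessarily the function actually used:

```
# Terminal 1: run the repair under gdb and pause it between detaching the
# _mimage/_mlog LVs and deactivating them.  "lv_deactivate" is a guessed
# symbol name, used purely for illustration.
gdb --args lvconvert --repair --use-policies helter_skelter/syncd_primary_log_2legs_1
(gdb) break lv_deactivate
(gdb) run

# Terminal 2: while lvconvert is paused, run a scan that opens the
# still-visible sub-LVs, just as the qarshd "pvs -a" did in the logs above.
# (To widen the window, pvs can itself be paused under gdb while it holds
# the device open.)
pvs -a

# Terminal 1: let lvconvert continue; with pvs holding a device open it
# reports "in use: not deactivating" and the repair clean-up fails.
(gdb) continue
```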
Fairly certain that bug 596318 is affected by this issue too.

Patch proposed upstream, pending review.

(In reply to comment #15)
> Patch proposed upstream, pending review.

Snap 11 deadline is noon on Wednesday.

Checked in the relevant patch into upstream CVS.

This bug appears partially fixed. It now appears to work with single-machine lvm; however, it still fails with cmirrors.

```
2.6.32-59.1.el6.x86_64
lvm2-2.02.72-7.el6                      BUILT: Wed Aug 11 17:12:24 CDT 2010
lvm2-libs-2.02.72-7.el6                 BUILT: Wed Aug 11 17:12:24 CDT 2010
lvm2-cluster-2.02.72-7.el6              BUILT: Wed Aug 11 17:12:24 CDT 2010
udev-147-2.22.el6                       BUILT: Fri Jul 23 07:21:33 CDT 2010
device-mapper-1.02.53-7.el6             BUILT: Wed Aug 11 17:12:24 CDT 2010
device-mapper-libs-1.02.53-7.el6        BUILT: Wed Aug 11 17:12:24 CDT 2010
device-mapper-event-1.02.53-7.el6       BUILT: Wed Aug 11 17:12:24 CDT 2010
device-mapper-event-libs-1.02.53-7.el6  BUILT: Wed Aug 11 17:12:24 CDT 2010
cmirror-2.02.72-7.el6                   BUILT: Wed Aug 11 17:12:24 CDT 2010
```

Scenario: Kill multiple legs of synced 4 leg mirror(s)

```
********* Mirror hash info for this scenario *********
* names:              syncd_multiple_legs_4legs_1
* sync:               1
* leg devices:        /dev/sdf1 /dev/sdh1 /dev/sde1 /dev/sdc1
* log devices:        /dev/sdd1
* failpv(s):          /dev/sdf1 /dev/sdc1
* failnode(s):        taft-01 taft-02 taft-03 taft-04
* leg fault policy:   remove
* log fault policy:   allocate
******************************************************

Creating mirror(s) on taft-02...
taft-02: lvcreate -m 3 -n syncd_multiple_legs_4legs_1 -L 600M helter_skelter /dev/sdf1:0-1000 /dev/sdh1:0-1000 /dev/sde1:0-1000 /dev/sdc1:0-1000 /dev/sdd1:0-150
PV=/dev/sdf1 syncd_multiple_legs_4legs_1_mimage_0: 6
PV=/dev/sdc1 syncd_multiple_legs_4legs_1_mimage_3: 6
PV=/dev/sdf1 syncd_multiple_legs_4legs_1_mimage_0: 6
PV=/dev/sdc1 syncd_multiple_legs_4legs_1_mimage_3: 6

Waiting until all mirrors become fully syncd...
0/1 mirror(s) are fully synced: ( 64.92% )
1/1 mirror(s) are fully synced: ( 100.00% )

Creating gfs2 on top of mirror(s) on taft-01...
Mounting mirrored gfs2 filesystems on taft-01...
Mounting mirrored gfs2 filesystems on taft-02...
Mounting mirrored gfs2 filesystems on taft-03...
Mounting mirrored gfs2 filesystems on taft-04...

Writing verification files (checkit) to mirror(s) on...
---- taft-01 ----
---- taft-02 ----
---- taft-03 ----
---- taft-04 ----
<start name="taft-01_syncd_multiple_legs_4legs_1" pid="23176" time="Thu Aug 12 12:37:33 2010" type="cmd" />
<start name="taft-02_syncd_multiple_legs_4legs_1" pid="23178" time="Thu Aug 12 12:37:33 2010" type="cmd" />
<start name="taft-03_syncd_multiple_legs_4legs_1" pid="23179" time="Thu Aug 12 12:37:33 2010" type="cmd" />
<start name="taft-04_syncd_multiple_legs_4legs_1" pid="23180" time="Thu Aug 12 12:37:33 2010" type="cmd" />

Sleeping 10 seconds to get some outsanding GFS I/O locks before the failure
Verifying files (checkit) on mirror(s) on...
---- taft-01 ----
---- taft-02 ----
---- taft-03 ----
---- taft-04 ----

Disabling device sdf on taft-01
Disabling device sdc on taft-01
Disabling device sdf on taft-02
Disabling device sdc on taft-02
Disabling device sdf on taft-03
Disabling device sdc on taft-03
Disabling device sdf on taft-04
Disabling device sdc on taft-04

[HANG]
```

Console:
```
[...]
Aug 12 17:38:32 taft-01 lvm[2644]: Primary mirror device 253:4 has failed (D).
Aug 12 17:38:32 taft-01 lvm[2644]: Secondary mirror device 253:7 has failed (D).
Aug 12 17:38:32 taft-01 lvm[2644]: Device failure in helter_skelter-syncd_multiple_legs_4legs_1.
[...]
Aug 12 17:38:45 taft-01 lvm[2644]: Mirrored transient status: "4 253:4 253:5 253:6 253:7 1183/1200 1 DAAD 3 clustered-disk 253:3 A"
Aug 12 17:38:45 taft-01 lvm[2644]: Mirror status: 2 of 4 images failed.
[...]
Aug 12 17:38:46 taft-01 lvm[2644]: cluster request failed: Resource temporarily unavailable
Aug 12 17:38:46 taft-01 lvm[2644]: Failed to lock syncd_multiple_legs_4legs_1
[...]
Aug 12 17:38:46 taft-01 lvm[2644]: Repair of mirrored LV helter_skelter/syncd_multiple_legs_4legs_1 failed.
Aug 12 17:38:46 taft-01 lvm[2644]: Failed to remove faulty devices in helter_skelter-syncd_multiple_legs_4legs_1.
```
I'll post the full logs from each of the nodes in the cluster...

Created attachment 438499 [details]
log from taft-01
Created attachment 438500 [details]
log from taft-02
Created attachment 438501 [details]
log from taft-03
Created attachment 438502 [details]
log from taft-04
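For context on the "Disabling device sdX on taft-0N" steps in the scenario above: the harness's exact mechanism is not shown in this report, but the "rejecting I/O to offline device" kernel messages are consistent with offlining the disks at the SCSI layer. A hedged sketch of that style of failure injection:

```
# Assumed failure-injection sketch (the test harness's real mechanism is not
# shown here).  Offline the two leg devices on each node; subsequent I/O to
# them fails with "rejecting I/O to offline device".
echo offline > /sys/block/sdf/device/state
echo offline > /sys/block/sdc/device/state

# After the test, the devices can be brought back with:
echo running > /sys/block/sdf/device/state
echo running > /sys/block/sdc/device/state
```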
> Aug 12 17:38:46 taft-01 lvm[2644]: cluster request failed: Resource temporarily unavailable
> Aug 12 17:38:46 taft-01 lvm[2644]: Failed to lock syncd_multiple_legs_4legs_1
Looks like something is wrong with clvmd: it is giving EBUSY in the status. It *seems* that this is due to clvmd being in_progress, which should mean that it is already processing a request from the same process. I am not sure how to verify that, though, and it does make me wonder why that is the case... Jon, could you maybe have a look? You certainly know more about cmirror/clvmd than I do.

I just verified that 2-, 3-, and 4-way mirrors where a leg and the log (multiple devices, but not multiple legs) are failed do work. So what we know:

1. This bug does *not* affect 2-way mirrors.
2. This bug does *not* affect multiple device failures that include one leg and the log device.
3. This bug *does* affect multiple legs being failed (3 or more legged mirrors).

I think the symptoms put forth in this bug are the tip of the iceberg. While investigating its cause, I found more disturbing possibilities... From my in-line comments:

```
 * If the mirror was successfully recovered, we want to always
 * force every machine to write to all devices - otherwise,
 * corruption will occur.  Here's how:
 *   Node1 suffers a failure and marks a region out-of-sync
 *   Node2 attempts a write, gets by is_remote_recovering,
 *     and queries the sync status of the region - finding
 *     it out-of-sync.
 *   Node2 thinks the write should be a nosync write, but it
 *     hasn't suffered the drive failure that Node1 has yet.
 *     It then issues a generic_make_request directly to
 *     the primary image only - which is exactly the device
 *     that has suffered the failure.
 *   Node2 suffers a lost write - which completely bypasses the
 *     mirror layer because it had gone through generic_m_r.
 *   The file system will likely explode at this point due to
 *     I/O errors.  If it wasn't the primary that failed, it is
 *     easily possible in this case to issue writes to just one
 *     of the remaining images - also leaving the mirror inconsistent.
 *
 * We let in_sync() return 1 in a cluster regardless of what is
 * in the bitmap once recovery has successfully completed on a
 * mirror.  This ensures the mirroring code will continue to
 * attempt to write to all mirror images.  The worst that can
 * happen for reads is that additional read attempts may be
 * taken.
```
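Restated as code, the behaviour described in the last paragraph of that comment could look roughly like the sketch below. This is only an illustration under assumed names (`cluster_log`, `region_in_sync`, and `bitmap_test` are all hypothetical); it is not the actual cmirror source.

```c
#include <stdint.h>

/* Hypothetical sketch (not the real cmirror code) of the fix described above:
 * once recovery has completed on a clustered mirror, report every region as
 * in-sync regardless of the local bitmap, so that every node keeps writing
 * to all mirror images instead of short-cutting to the primary. */

struct cluster_log {
    int            recovery_complete; /* set once the initial resync finishes */
    unsigned char *sync_bits;         /* per-region sync bitmap */
};

static int bitmap_test(const unsigned char *bits, uint64_t region)
{
    return (bits[region / 8] >> (region % 8)) & 1;
}

/* Answer a DM_ULOG_IN_SYNC-style query for one region. */
static int region_in_sync(const struct cluster_log *lc, uint64_t region)
{
    if (lc->recovery_complete)
        return 1;                     /* force writes to every mirror image */

    return bitmap_test(lc->sync_bits, region);
}
```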
Checked-in upstream.

This bug appears to be fixed in the latest build; however, I'm going to let the tests run all night before marking it verified.

The test cases specific to this bug ran for quite a while last night before eventually hitting bug 595507. Marking this bug verified in the latest rpms.

```
2.6.32-59.1.el6.x86_64
lvm2-2.02.72-8.el6                      BUILT: Wed Aug 18 10:41:52 CDT 2010
lvm2-libs-2.02.72-8.el6                 BUILT: Wed Aug 18 10:41:52 CDT 2010
lvm2-cluster-2.02.72-8.el6              BUILT: Wed Aug 18 10:41:52 CDT 2010
udev-147-2.22.el6                       BUILT: Fri Jul 23 07:21:33 CDT 2010
device-mapper-1.02.53-8.el6             BUILT: Wed Aug 18 10:41:52 CDT 2010
device-mapper-libs-1.02.53-8.el6        BUILT: Wed Aug 18 10:41:52 CDT 2010
device-mapper-event-1.02.53-8.el6       BUILT: Wed Aug 18 10:41:52 CDT 2010
device-mapper-event-libs-1.02.53-8.el6  BUILT: Wed Aug 18 10:41:52 CDT 2010
cmirror-2.02.72-8.el6                   BUILT: Wed Aug 18 10:41:52 CDT 2010
```

Red Hat Enterprise Linux 6.0 is now available and should resolve the problem described in this bug report. This report is therefore being closed with a resolution of CURRENTRELEASE. You may reopen this bug report if the solution does not work for you.