Bug 651590
| Summary: | failure to lock LV results in failure to repair mirror after transient error | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Corey Marthaler <cmarthal> | ||||
| Component: | lvm2 | Assignee: | Petr Rockai <prockai> | ||||
| Status: | CLOSED ERRATA | QA Contact: | Corey Marthaler <cmarthal> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | high | ||||||
| Version: | 5.6 | CC: | agk, coughlan, dwysocha, heinzm, jbrassow, joe.thornber, maurizio.antillon, mbroz, prajnoha, prockai, syeghiay | ||||
| Target Milestone: | rc | Keywords: | TestBlocker | ||||
| Target Release: | --- | ||||||
| Hardware: | x86_64 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | lvm2-2.02.84-6.el5 | Doc Type: | Bug Fix | ||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2011-07-21 10:50:57 UTC | Type: | --- | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
|
||||||
|
Description
Corey Marthaler
2010-11-09 21:13:23 UTC
I wrote a test for the testsuite for this problem: . lib/test aux prepare_vg 5 aux prepare_dmeventd lvcreate -m 3 --ig -L 1 -n 4way $vg lvchange --monitor y $vg/4way aux disable_dev $dev2 $dev4 mkfs.ext3 $DM_DEV_DIR/$vg/4way aux enable_dev $dev2 $dev4 sleep 10 lvs -a -o +devices | not grep unknown check mirror $vg 4way check mirror_legs $vg 4way 2 lvs -a -o +devices | not grep mimage_1 lvs -a -o +devices | not grep mimage_3 ---- If I look at the syslog, I can see the inconsistent metadata messages (even more than you do, it seems). Nevertheless, the repair is actually successful: the thing that breaks down is that mimage_1 is not removed: it is detached from the mirror, but it stays around as a toplevel LV. The mirror is however in a good shape (the right images are removed). Am I correct in assuming you mean the messages in syslog and the dangling mimage to be the bug? Or do you actually get a mirror in bad shape as well? (As in images not removed *from the mirror* correctly, e.g....) Thanks. Correct. The mirror appears to be fine. This bug is for the fact that the repair is listed as having failed in the syslog. I have tracked this down to a somewhat more fundamental problem. Description of that problem and a patch fixing it available at http://www.redhat.com/archives/lvm-devel/2011-April/msg00022.html Bumping to 5.8 and waiting for review upstream. I appear to have reproduced this issue during 5.7 testing. Marking TestBlocker.
2.6.18-256.el5
lvm2-2.02.84-3.el5 BUILT: Wed Apr 27 03:42:24 CDT 2011
lvm2-cluster-2.02.84-3.el5 BUILT: Wed Apr 27 03:42:43 CDT 2011
device-mapper-1.02.63-2.el5 BUILT: Fri Mar 4 10:23:17 CST 2011
device-mapper-event-1.02.63-2.el5 BUILT: Fri Mar 4 10:23:17 CST 2011
cmirror-1.1.39-10.el5 BUILT: Wed Sep 8 16:32:05 CDT 2010
kmod-cmirror-0.1.22-3.el5 BUILT: Tue Dec 22 13:39:47 CST 2009
Scenario: Kill secondary leg and disk log of synced 4 leg mirror(s)
********* Mirror hash info for this scenario *********
* names: syncd_secondary_log_4legs_1
* sync: 1
* leg devices: /dev/sdc1 /dev/sdh1 /dev/sdf1 /dev/sdg1
* log devices: /dev/sdb1
* no MDA devices:
* failpv(s): /dev/sdh1 /dev/sdb1
* failnode(s): taft-01 taft-02 taft-03 taft-04
* leg fault policy: remove
* log fault policy: allocate
******************************************************
Creating mirror(s) on taft-04...
taft-04: lvcreate -m 3 -n syncd_secondary_log_4legs_1 -L 600M helter_skelter /dev/sdc1:0-1000 /dev/sdh1:0-1000 /dev/sdf1:0-1000 /dev/sdg1:0-1000 /dev/sdb1:0-150
PV=/dev/sdh1
syncd_secondary_log_4legs_1_mimage_1: 6
PV=/dev/sdb1
syncd_secondary_log_4legs_1_mlog: 1.2
PV=/dev/sdh1
syncd_secondary_log_4legs_1_mimage_1: 6
PV=/dev/sdb1
syncd_secondary_log_4legs_1_mlog: 1.2
Waiting until all mirrors become fully syncd...
0/1 mirror(s) are fully synced: ( 84.42% )
1/1 mirror(s) are fully synced: ( 100.00% )
Creating gfs2 on top of mirror(s) on taft-01...
Mounting mirrored gfs2 filesystems on taft-01...
Mounting mirrored gfs2 filesystems on taft-02...
Mounting mirrored gfs2 filesystems on taft-03...
Mounting mirrored gfs2 filesystems on taft-04...
Writing verification files (checkit) to mirror(s) on...
---- taft-01 ----
---- taft-02 ----
---- taft-03 ----
---- taft-04 ----
Sleeping 10 seconds to get some outsanding GFS I/O locks before the failure
Verifying files (checkit) on mirror(s) on...
---- taft-01 ----
---- taft-02 ----
---- taft-03 ----
---- taft-04 ----
Disabling device sdh on taft-01
Disabling device sdb on taft-01
Disabling device sdh on taft-02
Disabling device sdb on taft-02
Disabling device sdh on taft-03
Disabling device sdb on taft-03
Disabling device sdh on taft-04
Disabling device sdb on taft-04
Attempting I/O to cause mirror down conversion(s) on taft-01
[DEADLOCK]
May 3 13:45:28 taft-01 lvm[7661]: helter_skelter-syncd_secondary_log_4legs_1 is now in-sync.
May 3 13:46:24 taft-01 lvm[7661]: Log device 253:2 has failed (D).
May 3 13:46:24 taft-01 lvm[7661]: Device failure in helter_skelter-syncd_secondary_log_4legs_1.
May 3 13:46:24 taft-01 lvm[7661]: /dev/sdb1: read failed after 0 of 512 at 145669554176: Input/output error
[...]
May 3 13:46:24 taft-01 lvm[7661]: /dev/sdh1: read failed after 0 of 2048 at 0: Input/output error
May 3 13:46:25 taft-01 lvm[7661]: Couldn't find device with uuid a0v0ep-87lx-CWqY-A78n-djT1-ZTdz-TyKgg8.
May 3 13:46:25 taft-01 lvm[7661]: Couldn't find device with uuid VUwo2f-V8u5-QHF6-94Zu-thas-o03V-fH9vc2.
May 3 13:46:27 taft-01 lvm[7661]: Mirror status: 1 of 4 images failed.
May 3 13:46:27 taft-01 lvm[7661]: Mirror log status: 1 of 1 images failed - switching to core
May 3 13:46:30 taft-01 lvm[7661]: Monitoring mirror device helter_skelter-syncd_secondary_log_4legs_1 for events.
May 3 13:46:30 taft-01 lvm[7661]: Another thread is handling an event. Waiting...
May 3 13:46:39 taft-01 lvm[7661]: Error locking on node taft-03: Unable to suspend helter_skelter-syncd_secondary_log_4legs_1 (253:7)
May 3 13:46:39 taft-01 lvm[7661]: Failed to lock syncd_secondary_log_4legs_1
May 3 13:46:39 taft-01 lvm[7661]: Repair of mirrored LV helter_skelter/syncd_secondary_log_4legs_1 failed.
May 3 13:46:39 taft-01 lvm[7661]: Failed to remove faulty devices in helter_skelter-syncd_secondary_log_4legs_1.
May 3 13:46:40 taft-01 lvm[7661]: No longer monitoring mirror device helter_skelter-syncd_secondary_log_4legs_1 for events.
May 3 13:46:42 taft-01 lvm[7661]: No longer monitoring mirror device helter_skelter-syncd_secondary_log_4legs_1 for events.
May 3 13:49:43 taft-01 kernel: INFO: task glock_workqueue:23475 blocked for more than 120 seconds.
May 3 13:49:43 taft-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 3 13:49:43 taft-01 kernel: glock_workque D ffffffff801540f1 0 23475 87 23476 23474 (L-TLB)
May 3 13:49:43 taft-01 kernel: ffff8101b6d3bbe0 0000000000000046 ffff8101b6d3bbe0 000000000000001d
May 3 13:49:43 taft-01 kernel: ffff810206f86480 0000000000000009 ffff8101ffdd30c0 ffff810107798100
May 3 13:49:43 taft-01 kernel: 000046e0377023c8 00000000000089d8 ffff8101ffdd32a8 00000003fedbbbc0
May 3 13:49:43 taft-01 kernel: Call Trace:
May 3 13:49:43 taft-01 kernel: [<ffffffff8006ec3f>] do_gettimeofday+0x40/0x90
May 3 13:49:43 taft-01 kernel: [<ffffffff800155c8>] sync_buffer+0x0/0x3f
May 3 13:49:43 taft-01 kernel: [<ffffffff800637ce>] io_schedule+0x3f/0x67
May 3 13:49:43 taft-01 kernel: [<ffffffff80015603>] sync_buffer+0x3b/0x3f
May 3 13:49:43 taft-01 kernel: [<ffffffff800639fa>] __wait_on_bit+0x40/0x6e
May 3 13:49:43 taft-01 kernel: [<ffffffff800155c8>] sync_buffer+0x0/0x3f
May 3 13:49:43 taft-01 kernel: [<ffffffff80063a94>] out_of_line_wait_on_bit+0x6c/0x78
May 3 13:49:43 taft-01 kernel: [<ffffffff800a2a54>] wake_bit_function+0x0/0x23
May 3 13:49:43 taft-01 kernel: [<ffffffff8001abcf>] submit_bh+0x10d/0x114
May 3 13:49:43 taft-01 kernel: [<ffffffff887f9b67>] :gfs2:gfs2_meta_wait+0x2d/0x83
May 3 13:49:43 taft-01 kernel: [<ffffffff887f9d1a>] :gfs2:gfs2_meta_read+0x46/0x65
May 3 13:49:43 taft-01 kernel: [<ffffffff887fa0ec>] :gfs2:gfs2_meta_indirect_buffer+0xbe/0x162
[...]
Appeared to have hit this during 5.7 mirror stripe failure testing. Attaching log... 2.6.18-266.el5 lvm2-2.02.84-5.el5 BUILT: Fri Jun 10 12:31:19 CDT 2011 lvm2-cluster-2.02.84-5.el5 BUILT: Fri Jun 10 12:30:24 CDT 2011 device-mapper-1.02.63-4.el5 BUILT: Fri Jun 10 12:28:16 CDT 2011 device-mapper-event-1.02.63-4.el5 BUILT: Fri Jun 10 12:28:16 CDT 2011 cmirror-1.1.39-10.el5 BUILT: Wed Sep 8 16:32:05 CDT 2010 kmod-cmirror-0.1.22-3.el5 BUILT: Tue Dec 22 13:39:47 CST 2009 Created attachment 504756 [details]
log from taft-01
This appears to be quite an issue. I can't get cmirror repair to work at all anymore with the latest 5.7 rpms, or even earlier 5.7 ones. lvm2-2.02.84-4.el5 BUILT: Thu Jun 2 05:04:32 CDT 2011 lvm2-cluster-2.02.84-3.el5 BUILT: Wed Apr 27 03:42:43 CDT 2011 device-mapper-1.02.63-3.el5 BUILT: Thu May 19 08:09:22 CDT 2011 device-mapper-event-1.02.63-3.el5 BUILT: Thu May 19 08:09:22 CDT 2011 cmirror-1.1.39-10.el5 BUILT: Wed Sep 8 16:32:05 CDT 2010 kmod-cmirror-0.1.22-3.el5 BUILT: Tue Dec 22 13:39:47 CST 2009 We should still look at this for 5.7. The last few comments occurred while running with subset cluster device failures. So, although this issue still exists, this may not end up being as big a deal since subset failure is not supported with clvm. Backported fix in lvm2-2.02.84-6.el5. Mirror device failure regression testing has passed with this fix. Marking verified in the latest rpms. 2.6.18-266.el5 lvm2-2.02.84-6.el5 BUILT: Thu Jun 16 10:31:15 CDT 2011 lvm2-cluster-2.02.84-6.el5 BUILT: Thu Jun 16 10:32:56 CDT 2011 device-mapper-1.02.63-3.el5 BUILT: Thu May 19 08:09:22 CDT 2011 device-mapper-event-1.02.63-3.el5 BUILT: Thu May 19 08:09:22 CDT 2011 cmirror-1.1.39-10.el5 BUILT: Wed Sep 8 16:32:05 CDT 2010 kmod-cmirror-0.1.22-3.el5 BUILT: Tue Dec 22 13:39:47 CST 2009 An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2011-1071.html An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2011-1071.html |