Description of problem:
This bug is for the issue mentioned in comment 32 of bug 568138. When running
that failure test, the initial repair attempt failed after a failed lock attempt.

+ FAILED=sdf
+ pvcreate /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
  Physical volume "/dev/sdc1" successfully created
  Physical volume "/dev/sdd1" successfully created
  Physical volume "/dev/sde1" successfully created
  Physical volume "/dev/sdf1" successfully created
+ vgcreate -c n vg_bar /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
  Volume group "vg_bar" successfully created
+ lvcreate -m 2 -n lv1 -L 600M vg_bar /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
  Logical volume "lv1" created
+ lvs -a -o +devices
  LV             VG     Attr   LSize   Log      Copy%  Devices
  lv1            vg_bar mwi-a- 600.00M lv1_mlog   1.33 lv1_mimage_0(0),lv1_mimage_1(0),lv1_mimage_2(0)
  [lv1_mimage_0] vg_bar Iwi-ao 600.00M                 /dev/sdc1(0)
  [lv1_mimage_1] vg_bar Iwi-ao 600.00M                 /dev/sdd1(0)
  [lv1_mimage_2] vg_bar Iwi-ao 600.00M                 /dev/sde1(0)
  [lv1_mlog]     vg_bar lwi-ao   4.00M                 /dev/sdf1(0)
+ echo offline
+ dmsetup status
vg_bar-lv1_mlog: 0 8192 linear
vg_bar-lv1_mimage_2: 0 1228800 linear
vg_bar-lv1_mimage_1: 0 1228800 linear
vg_bar-lv1: 0 1228800 mirror 3 253:3 253:4 253:5 20/1200 1 AAA 3 disk 253:2 A
vg_bar-lv1_mimage_0: 0 1228800 linear
+ sleep 1
+ dmsetup status
vg_bar-lv1_mlog: 0 8192 linear
vg_bar-lv1_mimage_2: 0 1228800 linear
vg_bar-lv1_mimage_1: 0 1228800 linear
vg_bar-lv1: 0 1228800 mirror 3 253:3 253:4 253:5 20/1200 1 AAA 3 disk 253:2 A
vg_bar-lv1_mimage_0: 0 1228800 linear
+ sleep 2
+ dd if=/dev/zero of=/dev/vg_bar/lv1 bs=512 count=1
+ echo running
+ dmsetup status
vg_bar-lv1_mlog: 0 8192 linear
vg_bar-lv1_mimage_2: 0 1228800 linear
vg_bar-lv1_mimage_1: 0 1228800 linear
vg_bar-lv1: 0 1228800 mirror 3 253:3 253:4 253:5 141/1200 1 AAA 3 disk 253:2 D
vg_bar-lv1_mimage_0: 0 1228800 linear
+ lvs -a -o +devices
1+0 records in
1+0 records out
512 bytes (512 B) copied, 17.6062 seconds, 0.0 kB/s
  LV             VG     Attr   LSize   Log      Copy%  Devices
  lv1            vg_bar mwi-a- 600.00M lv1_mlog 100.00 lv1_mimage_0(0),lv1_mimage_1(0),lv1_mimage_2(0)
  [lv1_mimage_0] vg_bar iwi-ao 600.00M                 /dev/sdc1(0)
  [lv1_mimage_1] vg_bar iwi-ao 600.00M                 /dev/sdd1(0)
  [lv1_mimage_2] vg_bar iwi-ao 600.00M                 /dev/sde1(0)
  [lv1_mlog]     vg_bar lwi-ao   4.00M                 /dev/sdf1(0)
+ vgs
  VG         #PV #LV #SN Attr   VSize   VFree
  VolGroup00   1   2   0 wz--n-  68.12G       0
  vg_bar       4   1   0 wz--n- 542.66G 540.89G
+ lvremove -f vg_bar/lv1
  Logical volume "lv1" successfully removed
+ vgremove -ff vg_bar
  Volume group "vg_bar" successfully removed
+ pvremove /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
  Labels on physical volume "/dev/sdc1" successfully wiped
  Can't open /dev/sdd1 exclusively - not removing. Mounted filesystem?
  Can't open /dev/sde1 exclusively - not removing. Mounted filesystem?
  Labels on physical volume "/dev/sdf1" successfully wiped

repair log:
lvm[8287]: Mirror log status: 1 of 1 images failed - switching to core
lvm[8287]: Inconsistent pre-commit metadata copies for volume group vg_bar
lvm[8287]: Inconsistent pre-commit metadata copies for volume group vg_bar
lvm[8287]: Volume group for uuid not found: lyvMWjSlDSHcStrES2dAYkYjdZzhQN70PLdinAeC7D1aBRMc7dvd4CCxnaw57eeo
lvm[8287]: Failed to lock lv1
lvm[8287]: Repair of mirrored LV vg_bar/lv1 failed.
lvm[8287]: Failed to remove faulty devices in vg_bar-lv1.
lvm[8287]: Log device 253:2 has failed (D).
lvm[8287]: Device failure in vg_bar-lv1.
lvm[8287]: Mirror log status: 1 of 1 images failed - switching to core
lvm[8287]: Monitoring mirror device vg_bar-lv1 for events.
lvm[8287]: Another thread is handling an event. Waiting...
lvm[8287]: Monitoring mirror device vg_bar-lv1 for events.
lvm[8287]: Another thread is handling an event. Waiting...
lvm[8287]: Trying to up-convert to 3 images, 1 logs.
lvm[8287]: Monitoring mirror device vg_bar-lv1 for events.
lvm[8287]: Another thread is handling an event. Waiting...
lvm[8287]: Repair of mirrored LV vg_bar/lv1 finished successfully.
lvm[8287]: vg_bar-lv1 is now in-sync.

Version-Release number of selected component (if applicable):
2.6.18-227.el5
lvm2-2.02.74-2.el5            BUILT: Tue Nov 9 08:03:06 CST 2010
lvm2-cluster-2.02.74-3.el5    BUILT: Tue Nov 9 08:01:59 CST 2010
device-mapper-1.02.55-2.el5   BUILT: Tue Nov 9 06:41:00 CST 2010
cmirror-1.1.39-10.el5         BUILT: Wed Sep 8 16:32:05 CDT 2010
kmod-cmirror-0.1.22-3.el5     BUILT: Tue Dec 22 13:39:47 CST 2009
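The bare "echo offline" / "echo running" steps in the transcript above do not show their redirection targets. The failure is presumably injected by toggling the SCSI state of the device named in FAILED (sdf here) through sysfs; the following is a minimal sketch of that assumed mechanism, not a copy of the actual test script:

# Assumption: failure is injected via the sysfs state attribute of the log device (sdf).
echo offline > /sys/block/sdf/device/state           # make the log device start failing I/O
dd if=/dev/zero of=/dev/vg_bar/lv1 bs=512 count=1    # generate I/O so dmeventd notices the failure and attempts repair
echo running > /sys/block/sdf/device/state           # restore the device afterwards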
I wrote a test for the testsuite for this problem:

. lib/test

aux prepare_vg 5
aux prepare_dmeventd

lvcreate -m 3 --ig -L 1 -n 4way $vg
lvchange --monitor y $vg/4way
aux disable_dev $dev2 $dev4
mkfs.ext3 $DM_DEV_DIR/$vg/4way
aux enable_dev $dev2 $dev4
sleep 10
lvs -a -o +devices | not grep unknown
check mirror $vg 4way
check mirror_legs $vg 4way 2
lvs -a -o +devices | not grep mimage_1
lvs -a -o +devices | not grep mimage_3

----

If I look at the syslog, I can see the inconsistent metadata messages (even more than you do, it seems). Nevertheless, the repair is actually successful. The thing that breaks down is that mimage_1 is not removed: it is detached from the mirror, but it stays around as a top-level LV. The mirror itself is, however, in good shape (the right images are removed).

Am I correct in assuming that you mean the syslog messages and the dangling mimage to be the bug? Or do you actually get a mirror in bad shape as well? (As in, images not removed *from the mirror* correctly.)

Thanks.
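If the leftover image does stay behind, it shows up as an ordinary top-level LV in lvs output and should be removable by hand. A rough sketch, assuming the dangling LV ends up named 4way_mimage_1 as the greps in the test expect:

lvs -a -o name,attr,devices $vg    # the detached image appears as a normal LV, no longer under the mirror
lvremove -f $vg/4way_mimage_1      # hypothetical manual cleanup of the dangling image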
Correct. The mirror appears to be fine. This bug is for the fact that the repair is listed as having failed in the syslog.
I have tracked this down to a somewhat more fundamental problem. A description of that problem and a patch fixing it are available at
http://www.redhat.com/archives/lvm-devel/2011-April/msg00022.html
Bumping to 5.8 and waiting for review upstream.
I appear to have reproduced this issue during 5.7 testing. Marking TestBlocker.

2.6.18-256.el5
lvm2-2.02.84-3.el5                BUILT: Wed Apr 27 03:42:24 CDT 2011
lvm2-cluster-2.02.84-3.el5        BUILT: Wed Apr 27 03:42:43 CDT 2011
device-mapper-1.02.63-2.el5       BUILT: Fri Mar 4 10:23:17 CST 2011
device-mapper-event-1.02.63-2.el5 BUILT: Fri Mar 4 10:23:17 CST 2011
cmirror-1.1.39-10.el5             BUILT: Wed Sep 8 16:32:05 CDT 2010
kmod-cmirror-0.1.22-3.el5         BUILT: Tue Dec 22 13:39:47 CST 2009

Scenario: Kill secondary leg and disk log of synced 4 leg mirror(s)

********* Mirror hash info for this scenario *********
* names:              syncd_secondary_log_4legs_1
* sync:               1
* leg devices:        /dev/sdc1 /dev/sdh1 /dev/sdf1 /dev/sdg1
* log devices:        /dev/sdb1
* no MDA devices:
* failpv(s):          /dev/sdh1 /dev/sdb1
* failnode(s):        taft-01 taft-02 taft-03 taft-04
* leg fault policy:   remove
* log fault policy:   allocate
******************************************************

Creating mirror(s) on taft-04...
taft-04: lvcreate -m 3 -n syncd_secondary_log_4legs_1 -L 600M helter_skelter /dev/sdc1:0-1000 /dev/sdh1:0-1000 /dev/sdf1:0-1000 /dev/sdg1:0-1000 /dev/sdb1:0-150

PV=/dev/sdh1
        syncd_secondary_log_4legs_1_mimage_1: 6
PV=/dev/sdb1
        syncd_secondary_log_4legs_1_mlog: 1.2
PV=/dev/sdh1
        syncd_secondary_log_4legs_1_mimage_1: 6
PV=/dev/sdb1
        syncd_secondary_log_4legs_1_mlog: 1.2

Waiting until all mirrors become fully syncd...
   0/1 mirror(s) are fully synced: ( 84.42% )
   1/1 mirror(s) are fully synced: ( 100.00% )

Creating gfs2 on top of mirror(s) on taft-01...
Mounting mirrored gfs2 filesystems on taft-01...
Mounting mirrored gfs2 filesystems on taft-02...
Mounting mirrored gfs2 filesystems on taft-03...
Mounting mirrored gfs2 filesystems on taft-04...

Writing verification files (checkit) to mirror(s) on...
        ---- taft-01 ----
        ---- taft-02 ----
        ---- taft-03 ----
        ---- taft-04 ----

Sleeping 10 seconds to get some outstanding GFS I/O locks before the failure

Verifying files (checkit) on mirror(s) on...
        ---- taft-01 ----
        ---- taft-02 ----
        ---- taft-03 ----
        ---- taft-04 ----

Disabling device sdh on taft-01
Disabling device sdb on taft-01
Disabling device sdh on taft-02
Disabling device sdb on taft-02
Disabling device sdh on taft-03
Disabling device sdb on taft-03
Disabling device sdh on taft-04
Disabling device sdb on taft-04

Attempting I/O to cause mirror down conversion(s) on taft-01
[DEADLOCK]

May 3 13:45:28 taft-01 lvm[7661]: helter_skelter-syncd_secondary_log_4legs_1 is now in-sync.
May 3 13:46:24 taft-01 lvm[7661]: Log device 253:2 has failed (D).
May 3 13:46:24 taft-01 lvm[7661]: Device failure in helter_skelter-syncd_secondary_log_4legs_1.
May 3 13:46:24 taft-01 lvm[7661]: /dev/sdb1: read failed after 0 of 512 at 145669554176: Input/output error
[...]
May 3 13:46:24 taft-01 lvm[7661]: /dev/sdh1: read failed after 0 of 2048 at 0: Input/output error
May 3 13:46:25 taft-01 lvm[7661]: Couldn't find device with uuid a0v0ep-87lx-CWqY-A78n-djT1-ZTdz-TyKgg8.
May 3 13:46:25 taft-01 lvm[7661]: Couldn't find device with uuid VUwo2f-V8u5-QHF6-94Zu-thas-o03V-fH9vc2.
May 3 13:46:27 taft-01 lvm[7661]: Mirror status: 1 of 4 images failed.
May 3 13:46:27 taft-01 lvm[7661]: Mirror log status: 1 of 1 images failed - switching to core
May 3 13:46:30 taft-01 lvm[7661]: Monitoring mirror device helter_skelter-syncd_secondary_log_4legs_1 for events.
May 3 13:46:30 taft-01 lvm[7661]: Another thread is handling an event. Waiting...
May 3 13:46:39 taft-01 lvm[7661]: Error locking on node taft-03: Unable to suspend helter_skelter-syncd_secondary_log_4legs_1 (253:7)
May 3 13:46:39 taft-01 lvm[7661]: Failed to lock syncd_secondary_log_4legs_1
May 3 13:46:39 taft-01 lvm[7661]: Repair of mirrored LV helter_skelter/syncd_secondary_log_4legs_1 failed.
May 3 13:46:39 taft-01 lvm[7661]: Failed to remove faulty devices in helter_skelter-syncd_secondary_log_4legs_1.
May 3 13:46:40 taft-01 lvm[7661]: No longer monitoring mirror device helter_skelter-syncd_secondary_log_4legs_1 for events.
May 3 13:46:42 taft-01 lvm[7661]: No longer monitoring mirror device helter_skelter-syncd_secondary_log_4legs_1 for events.
May 3 13:49:43 taft-01 kernel: INFO: task glock_workqueue:23475 blocked for more than 120 seconds.
May 3 13:49:43 taft-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 3 13:49:43 taft-01 kernel: glock_workque D ffffffff801540f1     0 23475     87         23476 23474 (L-TLB)
May 3 13:49:43 taft-01 kernel:  ffff8101b6d3bbe0 0000000000000046 ffff8101b6d3bbe0 000000000000001d
May 3 13:49:43 taft-01 kernel:  ffff810206f86480 0000000000000009 ffff8101ffdd30c0 ffff810107798100
May 3 13:49:43 taft-01 kernel:  000046e0377023c8 00000000000089d8 ffff8101ffdd32a8 00000003fedbbbc0
May 3 13:49:43 taft-01 kernel: Call Trace:
May 3 13:49:43 taft-01 kernel:  [<ffffffff8006ec3f>] do_gettimeofday+0x40/0x90
May 3 13:49:43 taft-01 kernel:  [<ffffffff800155c8>] sync_buffer+0x0/0x3f
May 3 13:49:43 taft-01 kernel:  [<ffffffff800637ce>] io_schedule+0x3f/0x67
May 3 13:49:43 taft-01 kernel:  [<ffffffff80015603>] sync_buffer+0x3b/0x3f
May 3 13:49:43 taft-01 kernel:  [<ffffffff800639fa>] __wait_on_bit+0x40/0x6e
May 3 13:49:43 taft-01 kernel:  [<ffffffff800155c8>] sync_buffer+0x0/0x3f
May 3 13:49:43 taft-01 kernel:  [<ffffffff80063a94>] out_of_line_wait_on_bit+0x6c/0x78
May 3 13:49:43 taft-01 kernel:  [<ffffffff800a2a54>] wake_bit_function+0x0/0x23
May 3 13:49:43 taft-01 kernel:  [<ffffffff8001abcf>] submit_bh+0x10d/0x114
May 3 13:49:43 taft-01 kernel:  [<ffffffff887f9b67>] :gfs2:gfs2_meta_wait+0x2d/0x83
May 3 13:49:43 taft-01 kernel:  [<ffffffff887f9d1a>] :gfs2:gfs2_meta_read+0x46/0x65
May 3 13:49:43 taft-01 kernel:  [<ffffffff887fa0ec>] :gfs2:gfs2_meta_indirect_buffer+0xbe/0x162
[...]
Appeared to have hit this during 5.7 mirror stripe failure testing. Attaching log...

2.6.18-266.el5
lvm2-2.02.84-5.el5                BUILT: Fri Jun 10 12:31:19 CDT 2011
lvm2-cluster-2.02.84-5.el5        BUILT: Fri Jun 10 12:30:24 CDT 2011
device-mapper-1.02.63-4.el5       BUILT: Fri Jun 10 12:28:16 CDT 2011
device-mapper-event-1.02.63-4.el5 BUILT: Fri Jun 10 12:28:16 CDT 2011
cmirror-1.1.39-10.el5             BUILT: Wed Sep 8 16:32:05 CDT 2010
kmod-cmirror-0.1.22-3.el5         BUILT: Tue Dec 22 13:39:47 CST 2009
Created attachment 504756 [details] log from taft-01
This appears to be quite an issue. I can't get cmirror repair to work at all anymore with the latest 5.7 rpms, or even earlier 5.7 ones.

lvm2-2.02.84-4.el5                BUILT: Thu Jun 2 05:04:32 CDT 2011
lvm2-cluster-2.02.84-3.el5        BUILT: Wed Apr 27 03:42:43 CDT 2011
device-mapper-1.02.63-3.el5       BUILT: Thu May 19 08:09:22 CDT 2011
device-mapper-event-1.02.63-3.el5 BUILT: Thu May 19 08:09:22 CDT 2011
cmirror-1.1.39-10.el5             BUILT: Wed Sep 8 16:32:05 CDT 2010
kmod-cmirror-0.1.22-3.el5         BUILT: Tue Dec 22 13:39:47 CST 2009
We should still look at this for 5.7.
The last few comments occurred while running with subset cluster device failures. So, although this issue still exists, it may not end up being as big a deal, since subset failure is not supported with clvm.
Backported fix in lvm2-2.02.84-6.el5.
Mirror device failure regression testing has passed with this fix. Marking verified in the latest rpms.

2.6.18-266.el5
lvm2-2.02.84-6.el5                BUILT: Thu Jun 16 10:31:15 CDT 2011
lvm2-cluster-2.02.84-6.el5        BUILT: Thu Jun 16 10:32:56 CDT 2011
device-mapper-1.02.63-3.el5       BUILT: Thu May 19 08:09:22 CDT 2011
device-mapper-event-1.02.63-3.el5 BUILT: Thu May 19 08:09:22 CDT 2011
cmirror-1.1.39-10.el5             BUILT: Wed Sep 8 16:32:05 CDT 2010
kmod-cmirror-0.1.22-3.el5         BUILT: Tue Dec 22 13:39:47 CST 2009
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1071.html