Bug 651590 - failure to lock LV results in failure to repair mirror after transient error
Summary: failure to lock LV results in failure to repair mirror after transient error
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: lvm2
Version: 5.6
Hardware: x86_64
OS: Linux
high
high
Target Milestone: rc
: ---
Assignee: Petr Rockai
QA Contact: Corey Marthaler
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2010-11-09 21:13 UTC by Corey Marthaler
Modified: 2013-03-11 20:25 UTC (History)
11 users (show)

Fixed In Version: lvm2-2.02.84-6.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-07-21 10:50:57 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
log from taft-01 (25.72 KB, text/plain)
2011-06-14 20:02 UTC, Corey Marthaler
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2011:1071 0 normal SHIPPED_LIVE lvm2 bug fix and enhancement update 2011-07-21 10:50:01 UTC

Description Corey Marthaler 2010-11-09 21:13:23 UTC
Description of problem:
This bug is for the issue mentioned in comment 32 of bug 568138. When running that failure test, the initial repair attempt failed after a failed lock attempt.

+ FAILED=sdf
+ pvcreate /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
  Physical volume "/dev/sdc1" successfully created
  Physical volume "/dev/sdd1" successfully created
  Physical volume "/dev/sde1" successfully created
  Physical volume "/dev/sdf1" successfully created
+ vgcreate -c n vg_bar /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
  Volume group "vg_bar" successfully created
+ lvcreate -m 2 -n lv1 -L 600M vg_bar /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
  Logical volume "lv1" created
+ lvs -a -o +devices
  LV             VG         Attr   LSize   Log      Copy%  Devices
  lv1            vg_bar     mwi-a- 600.00M lv1_mlog  1.33 
lv1_mimage_0(0),lv1_mimage_1(0),lv1_mimage_2(0)
  [lv1_mimage_0] vg_bar     Iwi-ao 600.00M                 /dev/sdc1(0)
  [lv1_mimage_1] vg_bar     Iwi-ao 600.00M                 /dev/sdd1(0)
  [lv1_mimage_2] vg_bar     Iwi-ao 600.00M                 /dev/sde1(0)
  [lv1_mlog]     vg_bar     lwi-ao   4.00M                 /dev/sdf1(0)
+ echo offline
+ dmsetup status
vg_bar-lv1_mlog: 0 8192 linear
vg_bar-lv1_mimage_2: 0 1228800 linear
vg_bar-lv1_mimage_1: 0 1228800 linear
vg_bar-lv1: 0 1228800 mirror 3 253:3 253:4 253:5 20/1200 1 AAA 3 disk 253:2 A
vg_bar-lv1_mimage_0: 0 1228800 linear
+ sleep 1
+ dmsetup status
vg_bar-lv1_mlog: 0 8192 linear
vg_bar-lv1_mimage_2: 0 1228800 linear
vg_bar-lv1_mimage_1: 0 1228800 linear
vg_bar-lv1: 0 1228800 mirror 3 253:3 253:4 253:5 20/1200 1 AAA 3 disk 253:2 A
vg_bar-lv1_mimage_0: 0 1228800 linear
+ sleep 2
+ dd if=/dev/zero of=/dev/vg_bar/lv1 bs=512 count=1
+ echo running
+ dmsetup status
vg_bar-lv1_mlog: 0 8192 linear
vg_bar-lv1_mimage_2: 0 1228800 linear
vg_bar-lv1_mimage_1: 0 1228800 linear
vg_bar-lv1: 0 1228800 mirror 3 253:3 253:4 253:5 141/1200 1 AAA 3 disk 253:2 D
vg_bar-lv1_mimage_0: 0 1228800 linear
+ lvs -a -o +devices
1+0 records in
1+0 records out
512 bytes (512 B) copied, 17.6062 seconds, 0.0 kB/s
  LV             VG         Attr   LSize   Log      Copy%  Devices
  lv1            vg_bar     mwi-a- 600.00M lv1_mlog 100.00
lv1_mimage_0(0),lv1_mimage_1(0),lv1_mimage_2(0)
  [lv1_mimage_0] vg_bar     iwi-ao 600.00M                 /dev/sdc1(0)
  [lv1_mimage_1] vg_bar     iwi-ao 600.00M                 /dev/sdd1(0)
  [lv1_mimage_2] vg_bar     iwi-ao 600.00M                 /dev/sde1(0)
  [lv1_mlog]     vg_bar     lwi-ao   4.00M                 /dev/sdf1(0)
+ vgs
  VG         #PV #LV #SN Attr   VSize   VFree
  VolGroup00   1   2   0 wz--n-  68.12G      0
  vg_bar       4   1   0 wz--n- 542.66G 540.89G
+ lvremove -f vg_bar/lv1
  Logical volume "lv1" successfully removed
+ vgremove -ff vg_bar
  Volume group "vg_bar" successfully removed
+ pvremove /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
  Labels on physical volume "/dev/sdc1" successfully wiped
  Can't open /dev/sdd1 exclusively - not removing. Mounted filesystem?
  Can't open /dev/sde1 exclusively - not removing. Mounted filesystem?
  Labels on physical volume "/dev/sdf1" successfully wiped


repair log:
lvm[8287]: Mirror log status: 1 of 1 images failed - switching to core
lvm[8287]: Inconsistent pre-commit metadata copies for volume group vg_bar
lvm[8287]: Inconsistent pre-commit metadata copies for volume group vg_bar
lvm[8287]: Volume group for uuid not found:
lyvMWjSlDSHcStrES2dAYkYjdZzhQN70PLdinAeC7D1aBRMc7dvd4CCxnaw57eeo
lvm[8287]: Failed to lock lv1
lvm[8287]: Repair of mirrored LV vg_bar/lv1 failed.
lvm[8287]: Failed to remove faulty devices in vg_bar-lv1.
lvm[8287]: Log device 253:2 has failed (D).
lvm[8287]: Device failure in vg_bar-lv1.
lvm[8287]: Mirror log status: 1 of 1 images failed - switching to core
lvm[8287]: Monitoring mirror device vg_bar-lv1 for events.
lvm[8287]: Another thread is handling an event. Waiting...
lvm[8287]: Monitoring mirror device vg_bar-lv1 for events.
lvm[8287]: Another thread is handling an event. Waiting...
lvm[8287]: Trying to up-convert to 3 images, 1 logs.
lvm[8287]: Monitoring mirror device vg_bar-lv1 for events.
lvm[8287]: Another thread is handling an event. Waiting...
lvm[8287]: Repair of mirrored LV vg_bar/lv1 finished successfully.
lvm[8287]: vg_bar-lv1 is now in-sync.


Version-Release number of selected component (if applicable):
2.6.18-227.el5
lvm2-2.02.74-2.el5    BUILT: Tue Nov  9 08:03:06 CST 2010
lvm2-cluster-2.02.74-3.el5    BUILT: Tue Nov  9 08:01:59 CST 2010
device-mapper-1.02.55-2.el5    BUILT: Tue Nov  9 06:41:00 CST 2010
cmirror-1.1.39-10.el5    BUILT: Wed Sep  8 16:32:05 CDT 2010
kmod-cmirror-0.1.22-3.el5    BUILT: Tue Dec 22 13:39:47 CST 2009

Comment 1 Petr Rockai 2011-01-05 14:38:55 UTC
I wrote a test for the testsuite for this problem:

. lib/test

aux prepare_vg 5
aux prepare_dmeventd

lvcreate -m 3 --ig -L 1 -n 4way $vg
lvchange --monitor y $vg/4way
aux disable_dev $dev2 $dev4
mkfs.ext3 $DM_DEV_DIR/$vg/4way
aux enable_dev $dev2 $dev4
sleep 10
lvs -a -o +devices | not grep unknown
check mirror $vg 4way
check mirror_legs $vg 4way 2
lvs -a -o +devices | not grep mimage_1
lvs -a -o +devices | not grep mimage_3

----

If I look at the syslog, I can see the inconsistent metadata messages (even more than you do, it seems). Nevertheless, the repair is actually successful: the thing that breaks down is that mimage_1 is not removed: it is detached from the mirror, but it stays around as a toplevel LV. The mirror is however in a good shape (the right images are removed).

Am I correct in assuming you mean the messages in syslog and the dangling mimage to be the bug? Or do you actually get a mirror in bad shape as well? (As in images not removed *from the mirror* correctly, e.g....)

Thanks.

Comment 2 Corey Marthaler 2011-01-11 00:22:24 UTC
Correct. The mirror appears to be fine. This bug is for the fact that the repair is listed as having failed in the syslog.

Comment 3 Petr Rockai 2011-04-10 21:31:49 UTC
I have tracked this down to a somewhat more fundamental problem. Description of that problem and a patch fixing it available at http://www.redhat.com/archives/lvm-devel/2011-April/msg00022.html

Comment 4 Petr Rockai 2011-04-11 15:39:21 UTC
Bumping to 5.8 and waiting for review upstream.

Comment 5 Corey Marthaler 2011-05-03 20:56:43 UTC
I appear to have reproduced this issue during 5.7 testing. Marking TestBlocker.

2.6.18-256.el5

lvm2-2.02.84-3.el5    BUILT: Wed Apr 27 03:42:24 CDT 2011
lvm2-cluster-2.02.84-3.el5    BUILT: Wed Apr 27 03:42:43 CDT 2011
device-mapper-1.02.63-2.el5    BUILT: Fri Mar  4 10:23:17 CST 2011
device-mapper-event-1.02.63-2.el5    BUILT: Fri Mar  4 10:23:17 CST 2011
cmirror-1.1.39-10.el5    BUILT: Wed Sep  8 16:32:05 CDT 2010
kmod-cmirror-0.1.22-3.el5    BUILT: Tue Dec 22 13:39:47 CST 2009



Scenario: Kill secondary leg and disk log of synced 4 leg mirror(s)

********* Mirror hash info for this scenario *********
* names:              syncd_secondary_log_4legs_1
* sync:               1
* leg devices:        /dev/sdc1 /dev/sdh1 /dev/sdf1 /dev/sdg1
* log devices:        /dev/sdb1
* no MDA devices:     
* failpv(s):          /dev/sdh1 /dev/sdb1
* failnode(s):        taft-01 taft-02 taft-03 taft-04
* leg fault policy:   remove
* log fault policy:   allocate
******************************************************

Creating mirror(s) on taft-04...
taft-04: lvcreate -m 3 -n syncd_secondary_log_4legs_1 -L 600M helter_skelter /dev/sdc1:0-1000 /dev/sdh1:0-1000 /dev/sdf1:0-1000 /dev/sdg1:0-1000 /dev/sdb1:0-150

PV=/dev/sdh1
        syncd_secondary_log_4legs_1_mimage_1: 6
PV=/dev/sdb1
        syncd_secondary_log_4legs_1_mlog: 1.2
PV=/dev/sdh1
        syncd_secondary_log_4legs_1_mimage_1: 6
PV=/dev/sdb1
        syncd_secondary_log_4legs_1_mlog: 1.2

Waiting until all mirrors become fully syncd...
   0/1 mirror(s) are fully synced: ( 84.42% )
   1/1 mirror(s) are fully synced: ( 100.00% )

Creating gfs2 on top of mirror(s) on taft-01...
Mounting mirrored gfs2 filesystems on taft-01...
Mounting mirrored gfs2 filesystems on taft-02...
Mounting mirrored gfs2 filesystems on taft-03...
Mounting mirrored gfs2 filesystems on taft-04...

Writing verification files (checkit) to mirror(s) on...
        ---- taft-01 ----
        ---- taft-02 ----
        ---- taft-03 ----
        ---- taft-04 ----

Sleeping 10 seconds to get some outsanding GFS I/O locks before the failure 
Verifying files (checkit) on mirror(s) on...
        ---- taft-01 ----
        ---- taft-02 ----
        ---- taft-03 ----
        ---- taft-04 ----

Disabling device sdh on taft-01
Disabling device sdb on taft-01
Disabling device sdh on taft-02
Disabling device sdb on taft-02
Disabling device sdh on taft-03
Disabling device sdb on taft-03
Disabling device sdh on taft-04
Disabling device sdb on taft-04

Attempting I/O to cause mirror down conversion(s) on taft-01
[DEADLOCK]



May  3 13:45:28 taft-01 lvm[7661]: helter_skelter-syncd_secondary_log_4legs_1 is now in-sync.
May  3 13:46:24 taft-01 lvm[7661]: Log device 253:2 has failed (D).
May  3 13:46:24 taft-01 lvm[7661]: Device failure in helter_skelter-syncd_secondary_log_4legs_1.
May  3 13:46:24 taft-01 lvm[7661]: /dev/sdb1: read failed after 0 of 512 at 145669554176: Input/output error
[...]
May  3 13:46:24 taft-01 lvm[7661]: /dev/sdh1: read failed after 0 of 2048 at 0: Input/output error
May  3 13:46:25 taft-01 lvm[7661]: Couldn't find device with uuid a0v0ep-87lx-CWqY-A78n-djT1-ZTdz-TyKgg8.
May  3 13:46:25 taft-01 lvm[7661]: Couldn't find device with uuid VUwo2f-V8u5-QHF6-94Zu-thas-o03V-fH9vc2.
May  3 13:46:27 taft-01 lvm[7661]: Mirror status: 1 of 4 images failed.
May  3 13:46:27 taft-01 lvm[7661]: Mirror log status: 1 of 1 images failed - switching to core
May  3 13:46:30 taft-01 lvm[7661]: Monitoring mirror device helter_skelter-syncd_secondary_log_4legs_1 for events.
May  3 13:46:30 taft-01 lvm[7661]: Another thread is handling an event. Waiting...
May  3 13:46:39 taft-01 lvm[7661]: Error locking on node taft-03: Unable to suspend helter_skelter-syncd_secondary_log_4legs_1 (253:7)
May  3 13:46:39 taft-01 lvm[7661]: Failed to lock syncd_secondary_log_4legs_1
May  3 13:46:39 taft-01 lvm[7661]: Repair of mirrored LV helter_skelter/syncd_secondary_log_4legs_1 failed.
May  3 13:46:39 taft-01 lvm[7661]: Failed to remove faulty devices in helter_skelter-syncd_secondary_log_4legs_1.
May  3 13:46:40 taft-01 lvm[7661]: No longer monitoring mirror device helter_skelter-syncd_secondary_log_4legs_1 for events.
May  3 13:46:42 taft-01 lvm[7661]: No longer monitoring mirror device helter_skelter-syncd_secondary_log_4legs_1 for events.

May  3 13:49:43 taft-01 kernel: INFO: task glock_workqueue:23475 blocked for more than 120 seconds.
May  3 13:49:43 taft-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May  3 13:49:43 taft-01 kernel: glock_workque D ffffffff801540f1     0 23475     87         23476 23474 (L-TLB)
May  3 13:49:43 taft-01 kernel:  ffff8101b6d3bbe0 0000000000000046 ffff8101b6d3bbe0 000000000000001d
May  3 13:49:43 taft-01 kernel:  ffff810206f86480 0000000000000009 ffff8101ffdd30c0 ffff810107798100
May  3 13:49:43 taft-01 kernel:  000046e0377023c8 00000000000089d8 ffff8101ffdd32a8 00000003fedbbbc0
May  3 13:49:43 taft-01 kernel: Call Trace:
May  3 13:49:43 taft-01 kernel:  [<ffffffff8006ec3f>] do_gettimeofday+0x40/0x90
May  3 13:49:43 taft-01 kernel:  [<ffffffff800155c8>] sync_buffer+0x0/0x3f
May  3 13:49:43 taft-01 kernel:  [<ffffffff800637ce>] io_schedule+0x3f/0x67
May  3 13:49:43 taft-01 kernel:  [<ffffffff80015603>] sync_buffer+0x3b/0x3f
May  3 13:49:43 taft-01 kernel:  [<ffffffff800639fa>] __wait_on_bit+0x40/0x6e
May  3 13:49:43 taft-01 kernel:  [<ffffffff800155c8>] sync_buffer+0x0/0x3f
May  3 13:49:43 taft-01 kernel:  [<ffffffff80063a94>] out_of_line_wait_on_bit+0x6c/0x78
May  3 13:49:43 taft-01 kernel:  [<ffffffff800a2a54>] wake_bit_function+0x0/0x23
May  3 13:49:43 taft-01 kernel:  [<ffffffff8001abcf>] submit_bh+0x10d/0x114
May  3 13:49:43 taft-01 kernel:  [<ffffffff887f9b67>] :gfs2:gfs2_meta_wait+0x2d/0x83
May  3 13:49:43 taft-01 kernel:  [<ffffffff887f9d1a>] :gfs2:gfs2_meta_read+0x46/0x65
May  3 13:49:43 taft-01 kernel:  [<ffffffff887fa0ec>] :gfs2:gfs2_meta_indirect_buffer+0xbe/0x162
[...]

Comment 6 Corey Marthaler 2011-06-14 20:02:11 UTC
Appeared to have hit this during 5.7 mirror stripe failure testing. Attaching log...


2.6.18-266.el5

lvm2-2.02.84-5.el5    BUILT: Fri Jun 10 12:31:19 CDT 2011
lvm2-cluster-2.02.84-5.el5    BUILT: Fri Jun 10 12:30:24 CDT 2011
device-mapper-1.02.63-4.el5    BUILT: Fri Jun 10 12:28:16 CDT 2011
device-mapper-event-1.02.63-4.el5    BUILT: Fri Jun 10 12:28:16 CDT 2011
cmirror-1.1.39-10.el5    BUILT: Wed Sep  8 16:32:05 CDT 2010
kmod-cmirror-0.1.22-3.el5    BUILT: Tue Dec 22 13:39:47 CST 2009

Comment 7 Corey Marthaler 2011-06-14 20:02:45 UTC
Created attachment 504756 [details]
log from taft-01

Comment 8 Corey Marthaler 2011-06-14 22:01:24 UTC
This appears to be quite an issue. I can't get cmirror repair to work at all anymore with the latest 5.7 rpms, or even earlier 5.7 ones.


lvm2-2.02.84-4.el5    BUILT: Thu Jun  2 05:04:32 CDT 2011
lvm2-cluster-2.02.84-3.el5    BUILT: Wed Apr 27 03:42:43 CDT 2011
device-mapper-1.02.63-3.el5    BUILT: Thu May 19 08:09:22 CDT 2011
device-mapper-event-1.02.63-3.el5    BUILT: Thu May 19 08:09:22 CDT 2011
cmirror-1.1.39-10.el5    BUILT: Wed Sep  8 16:32:05 CDT 2010
kmod-cmirror-0.1.22-3.el5    BUILT: Tue Dec 22 13:39:47 CST 2009

Comment 9 Corey Marthaler 2011-06-14 22:09:43 UTC
We should still look at this for 5.7.

Comment 10 Corey Marthaler 2011-06-15 18:33:37 UTC
The last few comments occurred while running with subset cluster device failures. So, although this issue still exists, this may not end up being as big a deal since subset failure is not supported with clvm.

Comment 11 Milan Broz 2011-06-16 15:30:50 UTC
Backported fix in lvm2-2.02.84-6.el5.

Comment 13 Corey Marthaler 2011-06-27 18:39:44 UTC
Mirror device failure regression testing has passed with this fix. Marking verified in the latest rpms.

2.6.18-266.el5

lvm2-2.02.84-6.el5    BUILT: Thu Jun 16 10:31:15 CDT 2011
lvm2-cluster-2.02.84-6.el5    BUILT: Thu Jun 16 10:32:56 CDT 2011
device-mapper-1.02.63-3.el5    BUILT: Thu May 19 08:09:22 CDT 2011
device-mapper-event-1.02.63-3.el5    BUILT: Thu May 19 08:09:22 CDT 2011
cmirror-1.1.39-10.el5    BUILT: Wed Sep  8 16:32:05 CDT 2010
kmod-cmirror-0.1.22-3.el5    BUILT: Tue Dec 22 13:39:47 CST 2009

Comment 14 errata-xmlrpc 2011-07-21 10:50:57 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1071.html

Comment 15 errata-xmlrpc 2011-07-21 12:28:56 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1071.html


Note You need to log in before you can comment on or make changes to this bug.