Description of problem: lvm mirror: I/O hungup at medium error Version-Release number of selected component (if applicable): 2.6.18-187.el5 How reproducible: See the following steps. Steps to Reproduce: This issue is related to an failure of phyical device, but we can reproduce a similar situation. 1. create two way mirror # dmsetup ls vg00-lv00_mimage_1 (253, 2) vg00-lv00_mimage_0 (253, 1) vg00-lv00_mlog (253, 0) vg00-lv00 (253, 3) 2. disable one of the mirror leg # echo offline > /sys/block/<dev>/device/state 3. kill dmeventd # ps -ef | grep dmeventd root 3378 1 0 11:49 ? 00:00:00 [dmeventd] # kill -9 3378 4. write I/O to the mirror device # dd if=/dev/zero of=/dev/vg00/lv00 bs=4096 count=1 *** This command doesn't return. *** 5. check a status of the mirror device # dmsetup status vg00-lv00_mimage_1: 0 24576 linear vg00-lv00_mimage_0: 0 24576 linear vg00-lv00_mlog: 0 8192 linear vg00-lv00: 0 24576 mirror 2 253:1 253:2 23/24 1 DA 3 disk 253:0 A 6. enable the mirror leg which was disabled in step 2 # echo online > /sys/block/<dev>/device/state 7. Execute lvconvert command on behalf of dmeventd # lvconvert --config devices{ignore_suspended_devices=1} --repair \ --use-policies vg00/lv00 The mirror is consistent. Nothing to repair. 8. Check a status of the mirror device again # dmsetup status vg00-lv00_mimage_1: 0 24576 linear vg00-lv00_mimage_0: 0 24576 linear vg00-lv00_mlog: 0 8192 linear vg00-lv00: 0 24576 mirror 2 253:1 253:2 23/24 1 DA 3 disk 253:0 A *** Error hasn't been recovered *** Actual results: An error of mirror device is not recovered by dmeventd (lvconvert) and I/Os issued by application hungup and never finish. Expected results: dmeventd (lvconvert) removes the failed mirror device and I/Os issued by applicaion completes. Additional info: - I/O hungup doesn't happen on the 5.4 kernel (2.6.18-164.el5) (Introduced by the patch fixsing bug 502927) - This issue was reported on dm-devel in November, 2009 https://www.redhat.com/archives/dm-devel/2009-November/msg00288.html
This bug is related to bug 568139.
*** Bug 568139 has been marked as a duplicate of this bug. ***
ok, my understanding is that 1) medium/transient error happens (some sector on data area) 2) kernel dm-mirror marks device with 'D' status, sends event and waits for userspace to handle problem 3) device reappears _or_ the lvm metadata area is still readable without error 4) dmeventd parses status, identifies failed device and run lvconvert 5) but because lvconvert see all devices and metadata on them, no action happens and mirror is left in frozen state
(In reply to comment #5) Thank you for describing the sequence. Yes, it is correct. lvm command can't handle medium/transient error because it can check only lvm metadata area. I'm thinking how to fix, but the only way I have is to remove/replace the failed device *forcibly* if lvconvert failed to recover and 'D' state remained.
what is really needed here is a suspend/resume to kick things off again.
I am able to reproduce this on -164 kernel too, removing regression keyword. It is for sure bug, but not regression by its definition. (And yes, suspend/resume helps.)
ok, so we could either recheck the 'dmsetup status' from dmeventd and if the 'lvconvert --repair' was successful and the status line still shows problems, we could issue another LVM command to "refresh" the LV. It may be simpler to assume that 'lvconvert --repair' was called for a reason, and even though nothing has changed, a suspend/resume is in order. Then, dmeventd would not need to perform the extra work. The suspend/resume would have to affect the entire cluster.
Suspend/resume fixes the hung I/O but it still leaves the mirror device in a state when it thinks that the leg is failed. You need to reload completely.
do you think we should reset those when the device is resumed (would require a kernel patch)?
Resetting the error flags would break snapshots of mirrors. They suspend the origin too and we don't want to forget errors in this case.
(In reply to comment #8) > I am able to reproduce this on -164 kernel too, removing regression keyword. > It is for sure bug, but not regression by its definition. Could you really reproduced this issue? There is no mechanism holding bio in -164 and I/O successfully completes, so I/Os (step 4) aren't blocked in -164. This issue is "regression." Let me add the keyword, "regression," again.
(In reply to comment #11) > do you think we should reset those when the device is resumed (would require a > kernel patch)? "reset" would solve "transient" error but won't solve medium error. In case of medium error, I/Os will be blocked again and again and the issue will never be solved. we should remove/replace the device marked 'D' *forcibly* according to configuration policy, shouldn't we?
Well, my script simulates transient log device fail, and there is even no 'D' status in the end, just waiting proceses. #!/bin/bash -x FAILED=sdf pvcreate /dev/sd[cdef] vgcreate -c n vg_bar /dev/sd[cdef] lvcreate -m 2 -n lv1 -L 600M vg_bar /dev/sd[cdef] echo "offline">/sys/block/$FAILED/device/state dmsetup status { sleep 1 ; dd if=/dev/zero of=/dev/vg_bar/lv1 bs=512 count=1; } & dmsetup status sleep 2 echo "running">/sys/block/$FAILED/device/state dmsetup status vgs lvremove -f vg_bar/lv1 vgremove -ff vg_bar pvremove /dev/sd[cdef] Produces: # /mirror_transient + FAILED=sdf + pvcreate /dev/sdc /dev/sdd /dev/sde /dev/sdf Physical volume "/dev/sdc" successfully created Physical volume "/dev/sdd" successfully created Physical volume "/dev/sde" successfully created Physical volume "/dev/sdf" successfully created + vgcreate -c n vg_bar /dev/sdc /dev/sdd /dev/sde /dev/sdf Non-clustered volume group "vg_bar" successfully created + lvcreate -m 2 -n lv1 -L 600M vg_bar /dev/sdc /dev/sdd /dev/sde /dev/sdf Logical volume "lv1" created + echo offline + dmsetup status vg_bar-lv1_mlog: 0 8192 linear vg_bar-lv1_mimage_2: 0 1228800 linear vg_bar-lv1_mimage_1: 0 1228800 linear vg_bar-lv1: 0 1228800 mirror 3 253:3 253:4 253:5 176/1200 1 AAA 3 disk 253:2 A vg_bar-lv1_mimage_0: 0 1228800 linear + sleep 1 + dmsetup status vg_bar-lv1_mlog: 0 8192 linear vg_bar-lv1_mimage_2: 0 1228800 linear vg_bar-lv1_mimage_1: 0 1228800 linear vg_bar-lv1: 0 1228800 mirror 3 253:3 253:4 253:5 176/1200 1 AAA 3 disk 253:2 A vg_bar-lv1_mimage_0: 0 1228800 linear + sleep 2 + dd if=/dev/zero of=/dev/vg_bar/lv1 bs=512 count=1 + echo running + dmsetup status vg_bar-lv1_mlog: 0 8192 linear vg_bar-lv1_mimage_2: 0 1228800 linear vg_bar-lv1_mimage_1: 0 1228800 linear vg_bar-lv1: 0 1228800 mirror 3 253:3 253:4 253:5 223/1200 1 AAA 3 disk 253:2 A vg_bar-lv1_mimage_0: 0 1228800 linear + vgs <HANG> syslog: Feb 27 09:35:07 bar-01 dmeventd[3622]: dmeventd ready for processing. Feb 27 09:35:12 bar-01 lvm[3622]: Monitoring mirror device vg_bar-lv1 for events Feb 27 09:35:12 bar-01 kernel: sd 1:0:0:3: rejecting I/O to offline device Feb 27 09:35:12 bar-01 lvm[3622]: Log device, 253:2, has failed. Feb 27 09:35:12 bar-01 lvm[3622]: Device failure in vg_bar-lv1 Feb 27 09:35:12 bar-01 kernel: sd 1:0:0:3: rejecting I/O to offline device Feb 27 09:35:15 bar-01 last message repeated 52 times Feb 27 09:35:29 bar-01 lvm[3622]: Repair of mirrored LV vg_bar/lv1 finished successfully. ps ax: 3598 pts/0 S+ 0:00 /bin/bash -x /mirror_transient 3609 ? S< 0:00 [kmirrord] 3610 ? S< 0:00 [kcopyd] 3622 ? S<Lsl 0:00 [dmeventd] 3626 pts/0 S+ 0:00 /bin/bash -x /mirror_transient 3630 pts/0 D+ 0:00 dd if /dev/zero of /dev/vg_bar/lv1 bs 512 count 1 3633 pts/0 D+ 0:00 vgs # uname -a Linux bar-01 2.6.18-164.11.1.el5 #1 SMP Wed Jan 6 13:26:31 EST 2010 i686 i686 i386 GNU/Linux Everything continues after # dmsetup suspend vg_bar-lv1 # dmsetup resume vg_bar-lv1 I see this still like old bug not regression, just now it is worse with mirror images probably? Can anyone reproduce that too?
(In reply to comment #15) > I see this still like old bug not regression, just now it is worse with mirror > images probably? Thank you for the explanation. I/Os are blocked if a log device got broken, but it isn't the point of *regression* here. It might be easier to think core log. In 5.4, transient/medium errors aren't recovered, but systems can keep running with a device marked failed if there is still valid mirror legs. It is an expected behavior for mirroring. On the other hand, in 5.5 beta1, application which issued I/Os is stuck and system can't keep running. This happens because dm-raid1 was updated so that I/Os are blocked until a failure is recovered. But yes, as you mentioned, I compared behaviors in the case that mirror images failed. > Can anyone reproduce that too? I'm trying but I couldn't reproduce it. I continue trying, but please tell me if there is some configuration required.
> # uname -a > Linux bar-01 2.6.18-164.11.1.el5 #1 SMP Wed Jan 6 13:26:31 EST 2010 i686 i686 > i386 GNU/Linux I was very surprised to see the big difference between 2.6.18-164.11.1.el5 kernel and 2.6.18-187.el5 kernel. Is the 2.6.18-164.11.1.el5 kernel a regular kernel? > + vgcreate -c n vg_bar /dev/sdc /dev/sdd /dev/sde /dev/sdf > Non-clustered volume group "vg_bar" successfully created Your lvm2 is likely newer than one in RHEL5.4. lvm2-2.02.46-8.el5 doesn't show such a message. Is it a lvm2 in RHEL5.5 beta1?
(In reply to comment #17) > I was very surprised to see the big difference between 2.6.18-164.11.1.el5 > kernel and 2.6.18-187.el5 kernel. Is the 2.6.18-164.11.1.el5 kernel a regular > kernel? This is latest EUS 5.4.z kernel from RHN. > Your lvm2 is likely newer than one in RHEL5.4. lvm2-2.02.46-8.el5 doesn't show > such a message. Is it a lvm2 in RHEL5.5 beta1? > Non-clustered volume group "vg_bar" successfully created Of course I am using beta1 packages. But this message appears only when you have cluster locking switched on. (We added that to code because clustered VG bit is by default set when using cluster locking and many peple were confused when they created clustered VG later.) Anyway, if you run your testcase on -164 kernel, what is the dmsetup status after that testcase finished? I guess there still must be 'D' for mirror image because lvconvert did not changed anything. (So the mirror was not repaired even on -164 kernel, just application did not hang in IO, am I correct?)
(In reply to comment #18) > Anyway, if you run your testcase on -164 kernel, what is the dmsetup status > after that testcase finished? I guess there still must be 'D' for mirror image > because lvconvert did not changed anything. (So the mirror was not repaired > even on -164 kernel, just application did not hang in IO, am I correct?) Yes, you are correct. When I run my testcase in -164 kernel, - the mirror was not repaired - apllication did not hang in IO - the status of the mirror image was 'D' The mirror of -164 kernel can process IO if there is some valid mirror legs. (It can't process in case of log error and "mirrored" log is very important feature to enhance mirror's availability.)
(In reply to comment #15) > Well, my script simulates transient log device fail, and there is even no 'D' > status in the end, just waiting proceses. In RHEL5 kernel (-164, 176), log status can be restored if log device processed IO successfully. (see dm-log.c:disk_flush()) In your log, I guess that the status of log device was showed after it was restored. On the other hand, in upstream kernel (2.6.33), the log status is never restored and remains in the failed state 'D'. The behavior you encounter in RHEL5 kernel is better to be fixed because it is very confusing. Users will also get confused.
(In reply to comment #20) I opened bug 570583 for mismatch between a log state and its behavior.
I am working on the lvconvert --repair extension needed for this to work. Re-titling bug to reflect what is needed.
How close is this to solved?
We are now quite close to merging the required patch upstream. Hopefully today, a couple days at most.
Has been merged.
Fix in lvm2-2.02.73-1.el5.
I ran the test mentioned in comment #15 and it passed, however there were issues with failing to lock the failed vg which may have caused one of the repair attempts to fail. I'll file another bug for that issue and mark this one verified. 2.6.18-227.el5 lvm2-2.02.74-2.el5 BUILT: Tue Nov 9 08:03:06 CST 2010 lvm2-cluster-2.02.74-3.el5 BUILT: Tue Nov 9 08:01:59 CST 2010 device-mapper-1.02.55-2.el5 BUILT: Tue Nov 9 06:41:00 CST 2010 cmirror-1.1.39-10.el5 BUILT: Wed Sep 8 16:32:05 CDT 2010 kmod-cmirror-0.1.22-3.el5 BUILT: Tue Dec 22 13:39:47 CST 2009 + FAILED=sdf + pvcreate /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 Physical volume "/dev/sdc1" successfully created Physical volume "/dev/sdd1" successfully created Physical volume "/dev/sde1" successfully created Physical volume "/dev/sdf1" successfully created + vgcreate -c n vg_bar /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 Volume group "vg_bar" successfully created + lvcreate -m 2 -n lv1 -L 600M vg_bar /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 Logical volume "lv1" created + lvs -a -o +devices LV VG Attr LSize Log Copy% Devices lv1 vg_bar mwi-a- 600.00M lv1_mlog 1.33 lv1_mimage_0(0),lv1_mimage_1(0),lv1_mimage_2(0) [lv1_mimage_0] vg_bar Iwi-ao 600.00M /dev/sdc1(0) [lv1_mimage_1] vg_bar Iwi-ao 600.00M /dev/sdd1(0) [lv1_mimage_2] vg_bar Iwi-ao 600.00M /dev/sde1(0) [lv1_mlog] vg_bar lwi-ao 4.00M /dev/sdf1(0) + echo offline + dmsetup status vg_bar-lv1_mlog: 0 8192 linear vg_bar-lv1_mimage_2: 0 1228800 linear vg_bar-lv1_mimage_1: 0 1228800 linear vg_bar-lv1: 0 1228800 mirror 3 253:3 253:4 253:5 20/1200 1 AAA 3 disk 253:2 A vg_bar-lv1_mimage_0: 0 1228800 linear + sleep 1 + dmsetup status vg_bar-lv1_mlog: 0 8192 linear vg_bar-lv1_mimage_2: 0 1228800 linear vg_bar-lv1_mimage_1: 0 1228800 linear vg_bar-lv1: 0 1228800 mirror 3 253:3 253:4 253:5 20/1200 1 AAA 3 disk 253:2 A vg_bar-lv1_mimage_0: 0 1228800 linear + sleep 2 + dd if=/dev/zero of=/dev/vg_bar/lv1 bs=512 count=1 + echo running + dmsetup status vg_bar-lv1_mlog: 0 8192 linear vg_bar-lv1_mimage_2: 0 1228800 linear vg_bar-lv1_mimage_1: 0 1228800 linear vg_bar-lv1: 0 1228800 mirror 3 253:3 253:4 253:5 141/1200 1 AAA 3 disk 253:2 D vg_bar-lv1_mimage_0: 0 1228800 linear + lvs -a -o +devices 1+0 records in 1+0 records out 512 bytes (512 B) copied, 17.6062 seconds, 0.0 kB/s LV VG Attr LSize Log Copy% Devices lv1 vg_bar mwi-a- 600.00M lv1_mlog 100.00 lv1_mimage_0(0),lv1_mimage_1(0),lv1_mimage_2(0) [lv1_mimage_0] vg_bar iwi-ao 600.00M /dev/sdc1(0) [lv1_mimage_1] vg_bar iwi-ao 600.00M /dev/sdd1(0) [lv1_mimage_2] vg_bar iwi-ao 600.00M /dev/sde1(0) [lv1_mlog] vg_bar lwi-ao 4.00M /dev/sdf1(0) + vgs VG #PV #LV #SN Attr VSize VFree VolGroup00 1 2 0 wz--n- 68.12G 0 vg_bar 4 1 0 wz--n- 542.66G 540.89G + lvremove -f vg_bar/lv1 Logical volume "lv1" successfully removed + vgremove -ff vg_bar Volume group "vg_bar" successfully removed + pvremove /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 Labels on physical volume "/dev/sdc1" successfully wiped Can't open /dev/sdd1 exclusively - not removing. Mounted filesystem? Can't open /dev/sde1 exclusively - not removing. Mounted filesystem? Labels on physical volume "/dev/sdf1" successfully wiped repair log: lvm[8287]: Mirror log status: 1 of 1 images failed - switching to core lvm[8287]: Inconsistent pre-commit metadata copies for volume group vg_bar lvm[8287]: Inconsistent pre-commit metadata copies for volume group vg_bar lvm[8287]: Volume group for uuid not found: lyvMWjSlDSHcStrES2dAYkYjdZzhQN70PLdinAeC7D1aBRMc7dvd4CCxnaw57eeo lvm[8287]: Failed to lock lv1 lvm[8287]: Repair of mirrored LV vg_bar/lv1 failed. lvm[8287]: Failed to remove faulty devices in vg_bar-lv1. lvm[8287]: Log device 253:2 has failed (D). lvm[8287]: Device failure in vg_bar-lv1. lvm[8287]: Mirror log status: 1 of 1 images failed - switching to core lvm[8287]: Monitoring mirror device vg_bar-lv1 for events. lvm[8287]: Another thread is handling an event. Waiting... lvm[8287]: Monitoring mirror device vg_bar-lv1 for events. lvm[8287]: Another thread is handling an event. Waiting... lvm[8287]: Trying to up-convert to 3 images, 1 logs. lvm[8287]: Monitoring mirror device vg_bar-lv1 for events. lvm[8287]: Another thread is handling an event. Waiting... lvm[8287]: Repair of mirrored LV vg_bar/lv1 finished successfully. lvm[8287]: vg_bar-lv1 is now in-sync.
bug 651590 was filed for the issue listed above.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2011-0052.html