Description of problem:
Scenario: Kill primary leg of synced core log 2 leg mirror(s)

********* Mirror hash info for this scenario *********
* names:              syncd_primary_core_2legs_1
* sync:               1
* disklog:            0
* failpv(s):          /dev/sdb1
* failnode(s):        taft-01 taft-02 taft-03 taft-04
* leg devices:        /dev/sdb1 /dev/sdh1
* leg fault policy:   allocate
* log fault policy:   allocate
******************************************************

Creating mirror(s) on taft-02...
taft-02: lvcreate --corelog -m 1 -n syncd_primary_core_2legs_1 -L 600M helter_skelter /dev/sdb1:0-1000 /dev/sdh1:0-1000

PV=/dev/sdb1
        syncd_primary_core_2legs_1_mimage_0: 5:
PV=/dev/sdb1
        syncd_primary_core_2legs_1_mimage_0: 5:

Waiting until all mirrors become fully syncd...
   0/1 mirror(s) are fully synced: ( 83.08% )
   1/1 mirror(s) are fully synced: ( 100.00% )

Creating gfs on top of mirror(s) on taft-01...
Mounting mirrored gfs filesystems on taft-01...
Mounting mirrored gfs filesystems on taft-02...
Mounting mirrored gfs filesystems on taft-03...
Mounting mirrored gfs filesystems on taft-04...

Writing verification files (checkit) to mirror(s) on...
        ---- taft-01 ----
        ---- taft-02 ----
        ---- taft-03 ----
        ---- taft-04 ----

<start name="taft-01_syncd_primary_core_2legs_1" pid="13964" time="Thu Apr 29 21:48:24 2010" type="cmd" />
<start name="taft-02_syncd_primary_core_2legs_1" pid="13966" time="Thu Apr 29 21:48:24 2010" type="cmd" />
<start name="taft-03_syncd_primary_core_2legs_1" pid="13968" time="Thu Apr 29 21:48:24 2010" type="cmd" />
<start name="taft-04_syncd_primary_core_2legs_1" pid="13970" time="Thu Apr 29 21:48:24 2010" type="cmd" />

Sleeping 10 seconds to get some outsanding GFS I/O locks before the failure

Verifying files (checkit) on mirror(s) on...
        ---- taft-01 ----
        ---- taft-02 ----
        ---- taft-03 ----
        ---- taft-04 ----

Disabling device sdb on taft-01
Disabling device sdb on taft-02
Disabling device sdb on taft-03
Disabling device sdb on taft-04

Attempting I/O to cause mirror down conversion(s) on taft-01
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.095469 seconds, 439 MB/s

Verifying current sanity of lvm after the failure
  /dev/mapper/helter_skelter-syncd_primary_core_2legs_1_mimage_0: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdb1: read failed after 0 of 2048 at 0: Input/output error
  [...]

Verifying FAILED device /dev/sdb1 is *NOT* in the volume(s)
  /dev/mapper/helter_skelter-syncd_primary_core_2legs_1_mimage_0: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdb1: read failed after 0 of 2048 at 0: Input/output error
  [...]

Verifying LEG device /dev/sdh1 *IS* in the volume(s)
  /dev/mapper/helter_skelter-syncd_primary_core_2legs_1_mimage_0: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdb1: read failed after 0 of 2048 at 0: Input/output error

verify the newly allocated dm devices were added as a result of the failures
Checking EXISTENCE of syncd_primary_core_2legs_1_mimage_0 on: taft-01
syncd_primary_core_2legs_1_mimage_0 on taft-01 should now exist

[root@taft-01 ~]# lvs -a -o +devices
  /dev/sdb1: open failed: No such device or address
  Couldn't find device with uuid tddjwI-mToJ-OK9x-kxr4-rOxB-tM2z-TkoN4P.
  [...]
  LV                                  VG             Attr   LSize   Log Copy%  Convert Devices
  syncd_primary_core_2legs_1          helter_skelter -wi-ao 600.00M                    /dev/sdh1(0)
  syncd_primary_core_2legs_1_mimage_0 helter_skelter -wi--- 600.00M                    unknown device(0)
  syncd_primary_core_2legs_1_mimage_1 helter_skelter vwi-a- 600.00M

# Note that an "unknown device" is never listed for mimage_1

# Repair failed on taft-03:
Apr 29 21:49:18 taft-01 lvm[8664]: Repair of mirrored LV helter_skelter/syncd_primary_core_2legs_1 finished successfully.
Apr 29 21:47:50 taft-02 lvm[3729]: Repair of mirrored LV helter_skelter/syncd_primary_core_2legs_1 finished successfully.
Apr 29 21:48:42 taft-03 lvm[31801]: Repair of mirrored LV helter_skelter/syncd_primary_core_2legs_1 failed.
Apr 29 21:49:22 taft-04 lvm[18291]: Repair of mirrored LV helter_skelter/syncd_primary_core_2legs_1 finished successfully.

Version-Release number of selected component (if applicable):
2.6.18-194.el5.gtest.87
lvm2-2.02.56-8.el5_5.3          BUILT: Wed Apr 28 03:08:59 CDT 2010
lvm2-cluster-2.02.56-7.el5_5.1  BUILT: Thu Apr 22 07:02:04 CDT 2010
device-mapper-1.02.39-1.el5_5.2 BUILT: Thu Apr 22 04:30:57 CDT 2010
cmirror-1.1.39-8.el5            BUILT: Wed Mar 3 09:31:58 CST 2010
kmod-cmirror-0.1.22-3.el5       BUILT: Tue Dec 22 13:39:47 CST 2009

I'll try to reproduce this to verify whether or not it's a regression.
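For reference, a rough sketch of how a leg failure like the one above might be reproduced by hand. This assumes the harness fails devices by offlining them via sysfs, which is not shown in this report; the GFS mount point is hypothetical, and the device/VG names are taken from the scenario above:

  echo offline > /sys/block/sdb/device/state        # on each node: fail the sdb leg (assumed failure mechanism)
  dd if=/dev/zero of=/mnt/gfs/io_trigger bs=4M count=10   # generate I/O so dmeventd notices the fault (hypothetical mount point)
  grep 'Repair of mirrored LV' /var/log/messages    # check the repair outcome on each node
  lvs -a -o +devices helter_skelter                 # confirm /dev/sdb1 is no longer in the volume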
I was unable to reproduce this after attempting another 5 iterations of that specific test case. This appears to be some timing issue rather than a regression.
Reproduced this when restarting cmirror device failure tests (different testcase but same failed results).

Scenario: Kill multiple legs of synced 4 leg mirror(s)

********* Mirror hash info for this scenario *********
* names:              syncd_multiple_legs_4legs_1
* sync:               1
* disklog:            /dev/sdg1
* failpv(s):          /dev/sdh1 /dev/sdd1
* failnode(s):        taft-01 taft-02 taft-03 taft-04
* leg devices:        /dev/sdh1 /dev/sdc1 /dev/sde1 /dev/sdd1
* leg fault policy:   allocate
* log fault policy:   allocate
******************************************************

Creating mirror(s) on taft-04...
taft-04: lvcreate -m 3 -n syncd_multiple_legs_4legs_1 -L 600M helter_skelter /dev/sdh1:0-1000 /dev/sdc1:0-1000 /dev/sde1:0-1000 /dev/sdd1:0-1000 /dev/sdg1:0-150

[root@taft-01 ~]# grep dm_task_run /var/log/messages
May 3 14:49:05 taft-01 lvm[8440]: dm_task_run failed, errno = 6, No such device or address

[root@taft-02 ~]# grep dm_task_run /var/log/messages

[root@taft-03 ~]# grep dm_task_run /var/log/messages
May 3 15:00:55 taft-03 lvm[8231]: dm_task_run failed, errno = 6, No such device or address

[root@taft-04 ~]# grep dm_task_run /var/log/messages

May 3 15:29:15 taft-01 lvm[8440]: Repair of mirrored LV helter_skelter/syncd_multiple_legs_4legs_1 finished successfully.
May 3 15:27:40 taft-02 lvm[8246]: Repair of mirrored LV helter_skelter/syncd_multiple_legs_4legs_1 finished successfully.
May 3 15:28:34 taft-03 lvm[8231]: Repair of mirrored LV helter_skelter/syncd_multiple_legs_4legs_1 failed.
May 3 15:29:11 taft-04 lvm[8231]: Repair of mirrored LV helter_skelter/syncd_multiple_legs_4legs_1 finished successfully.
I wasn't able to hit this bug with single-machine mirroring, though I did eventually end up hitting bug 574448; maybe it's related (though bug 574448 requires three 4-way mirrors and this one doesn't).
As I said in bug 574448 -- I can't do much with this kind of failure without seeing the dmeventd logs. So likewise, this needs to wait until we can reproduce it with a new version of dmeventd that properly logs messages (errors) from lvconvert --repair.
I was unable to hit this bug with the following previous rpms:
lvm2-2.02.56-8.el5_5.1
lvm2-cluster-2.02.56-7.el5
device-mapper-1.02.39-1.el5
I was able to hit this bug with just the previous lvm2-cluster and the latest lvm2 and device-mapper rpms:

lvm2-2.02.56-8.el5_5.3          BUILT: Wed Apr 28 03:08:59 CDT 2010
lvm2-cluster-2.02.56-7.el5      BUILT: Mon Feb 8 10:24:29 CST 2010
device-mapper-1.02.39-1.el5_5.2 BUILT: Thu Apr 22 04:30:57 CDT 2010

[root@taft-01 ~]# grep Repair /var/log/messages | grep failed
May 5 15:29:38 taft-01 lvm[7863]: Repair of mirrored LV helter_skelter/nonsyncd_log_3legs_1 failed.
Hit this bug with these rpms as well:
lvm2-2.02.56-8.el5_5.3          BUILT: Wed Apr 28 03:08:59 CDT 2010
lvm2-cluster-2.02.56-7.el5_5.1  BUILT: Thu Apr 22 07:02:04 CDT 2010
device-mapper-1.02.39-1.el5_5.1 BUILT: Thu Apr 22 04:00:59 CDT 2010
There doesn't seem to be any way to distill this bug. It happens regardless of whether a primary or a secondary device is failed, whether the mirror is or is not in sync, and whether the mirror has a core log or a disk log.
It even happens if it's the log device that gets failed:

May 10 16:47:52 taft-01 lvm[7092]: Repair of mirrored LV helter_skelter/syncd_log_3legs_1 failed.

  syncd_log_3legs_1            helter_skelter mwi-ao 600.00M 100.00 syncd_log_3legs_1_mimage_0(0),syncd_log_3legs_1_mimage_1(0),syncd_log_3legs_1_mimage_2(0)
  [syncd_log_3legs_1_mimage_0] helter_skelter iwi-ao 600.00M        /dev/sdh1(0)
  [syncd_log_3legs_1_mimage_1] helter_skelter iwi-ao 600.00M        /dev/sdf1(0)
  [syncd_log_3legs_1_mimage_2] helter_skelter iwi-ao 600.00M        /dev/sdg1(0)
  syncd_log_3legs_1_mlog       helter_skelter -wi---   4.00M        unknown device(0)
Actually, Corey, can you check whether dmeventd is logging to syslog at LOG_DEBUG level? It seems that logging was inadvertently enabled in dmeventd by an unrelated patch, so it may be possible to extract log output even with your current versions. You may need to check your syslog configuration, or /var/log/debug if that exists on your system(s). Thanks!
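In case it helps, a minimal sketch of a sysklogd configuration line that would capture debug-level messages to a file (the selector and target path are assumptions; adjust to the local setup):

  *.debug                                         /var/log/debug

and then restart the logger (e.g. service syslog restart) so it picks up the change.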
Attaching the latest failure, complete with debugging turned on.
Created attachment 413222 [details]
debug log from the node (taft-02) where the repair failed
Ok, thanks -- this seems to be the error causing the failure:

May 11 13:41:55 taft-02 lvm[21556]: Error locking on node taft-03: LV helter_skelter/syncd_multiple_legs_4legs_1_mimage_0 in use: not deactivating

so it seems that something on taft-03 is holding the mimage_0 device open. In theory this could be a running LVM scan, or something like udev. Could you please try setting up your filter in lvm.conf to ignore all mimage devices and try again? If this is related to udev, then I don't know a good way around it. Peter Rajnoha may know more...
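For example, a minimal lvm.conf filter sketch that rejects the internal mirror image/log devices before accepting the SCSI devices (the exact regexes are assumptions; adjust to your device naming):

  # reject internal mirror images and logs, accept /dev/sd*, reject everything else
  filter = [ "r|/dev/mapper/.*_mimage_.*|", "r|/dev/mapper/.*_mlog|", "a|/dev/sd|", "r|.*|" ]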
The most straightforward way to see whether udev is interfering or not is to kill the udev daemon while testing, so maybe try this as well (it shouldn't do any harm to dm/lvm at all). Since 5.5 doesn't have udev_sync enabled, it's highly possible that this really is about udev touching devices from within udev rules and keeping them open for a while. I'd need to look at exactly which rules are used there.
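A sketch of one way to do that across the cluster (assumes the daemon is plain udevd as on RHEL 5 and that passwordless ssh between the nodes is set up):

  for node in taft-01 taft-02 taft-03 taft-04; do
      ssh $node 'pkill udevd; ps -ef | grep [u]devd'   # kill udevd and confirm it is gone
  done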
Hit the same issue with udev killed on all 4 nodes.

[root@taft-01 ~]# ps -ef | grep udev
root      8995  6922  0 11:26 pts/0    00:00:00 grep udev

May 12 11:14:27 taft-01 lvm[7202]: Locking LV ahSj4mi6uPZFmLUoTLzIJELskjjtjYNSj1aoF9cUkafXhO6XM44mXPXO2ff5wdvT NL (LV|NONBLOCK|CLUSTER) (0x98)
May 12 11:14:27 taft-01 lvm[7202]: Error locking on node taft-03: LV helter_skelter/nonsyncd_log_3legs_1_mlog in use: not deactivating
May 12 11:14:27 taft-01 lvm[7202]: <backtrace>
Hopefully fixed upstream by ignoring internal lvm devices during scan.
That this is indeed the problem can be easily verified without that patch by updating the lvm.conf filter as suggested in comment #13.
I hope that is what just happened -- Corey, can you please elaborate here?
Correct. All the failure scenarios that had previously failed passed (twice) with the following filter:

filter = [ "a|/dev/sd|", "r/.*/" ]
The latest fix failed. Marking FailsQA, and will post the debug log...

lvm2-2.02.56-8.el5_5.3.5         BUILT: Thu May 13 15:29:00 CDT 2010
lvm2-cluster-2.02.56-7.el5_5.2.5 BUILT: Thu May 13 15:28:30 CDT 2010
device-mapper-1.02.39-1.el5_5.2  BUILT: Thu Apr 22 04:30:57 CDT 2010
Created attachment 414095 [details]
debug log from the node (taft-03) where the repair failed
Reading the log from comment #21 -- it seems that it works (including the patch!). What's the problem now? See:

lvm[7670]: /dev/mapper/helter_skelter-syncd_primary_core_2legs_1_mimage_0: Reserved internal LVM device not usable.
lvm[7670]: /dev/mapper/helter_skelter-syncd_primary_core_2legs_1_mimage_0: Skipping: Suspended or internal dm device

(IOW the scan no longer locks/scans internal volumes.)

lvm[7664]: Completed: lvconvert --config devices{ignore_suspended_devices=1} --repair --use-policies helter_skelter/syncd_primary_core_2legs_1
lvm[7664]: Repair of mirrored LV helter_skelter/syncd_primary_core_2legs_1 finished successfully.
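For reference, the same repair that dmeventd triggered can be run by hand in roughly this form, taken from the Completed: line above (the shell quoting of the --config string is an assumption):

  lvconvert --config 'devices{ignore_suspended_devices=1}' --repair --use-policies helter_skelter/syncd_primary_core_2legs_1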
I must have posted the log from the wrong node, as this is still easy to reproduce.

  nonsyncd_secondary_2legs_1          helter_skelter -wi-ao 600.00M        /dev/sdd1(0)
  nonsyncd_secondary_2legs_1_mimage_0 helter_skelter vwi-a- 600.00M
  nonsyncd_secondary_2legs_1_mimage_1 helter_skelter -wi--- 600.00M        unknown device(0)
  nonsyncd_secondary_2legs_1_mlog     helter_skelter -wi-a-   4.00M        /dev/sdh1(0)

May 17 15:51:01 taft-02 lvm[7131]: Repair of mirrored LV helter_skelter/nonsyncd_secondary_2legs_1 failed.

I'll post the log here from taft-02...
Created attachment 414669 [details]
debug log from the node (taft-02) where the repair failed
The posted log shows that the node still scans internal devices (so the binaries there do not include the patch). The patch was removed from the current 5.5.z lvm2 build, so it is expected to fail. But upstream now has the final version of it, so let's retest for 5.6 later.
Fix in lvm2-2.02.73-1.el5.
No repair failures have been found in the latest mirror failure testing. Marking this bug verified.

2.6.18-227.el5
lvm2-2.02.74-1.el5          BUILT: Fri Oct 15 10:26:21 CDT 2010
lvm2-cluster-2.02.74-2.el5  BUILT: Fri Oct 29 07:48:11 CDT 2010
device-mapper-1.02.55-1.el5 BUILT: Fri Oct 15 06:15:55 CDT 2010
cmirror-1.1.39-10.el5       BUILT: Wed Sep 8 16:32:05 CDT 2010
kmod-cmirror-0.1.22-3.el5   BUILT: Tue Dec 22 13:39:47 CST 2009
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0052.html