Description of problem:
I feel like this bug may have already been filed, but I couldn't find one that didn't involve partial allocation (bug 824159) or mirrored volumes (bug 1016296). In this case two devices failed and there were two free devices in the VG, so allocation had what it needed, and it did appear to work; it's just that the repair itself failed.

================================================================================
Iteration 0.1 started at Thu Nov 14 17:36:08 CST 2013
================================================================================
Scenario kill_multiple_synced_raid1_4legs: Kill multiple legs of synced 4 leg raid1 volume(s)

********* RAID hash info for this scenario *********
* names:              synced_multiple_raid1_4legs_1
* sync:               1
* type:               raid1
* -m |-i value:       4
* leg devices:        /dev/sdf1 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdg1
* failpv(s):          /dev/sdf1 /dev/sdb1
* failnode(s):        virt-004.cluster-qe.lab.eng.brq.redhat.com
* additional snap:    /dev/sda1
* lvmetad:            0
* raid fault policy:  allocate
******************************************************

Creating raids(s) on virt-004.cluster-qe.lab.eng.brq.redhat.com...
virt-004.cluster-qe.lab.eng.brq.redhat.com: lvcreate --type raid1 -m 4 -n synced_multiple_raid1_4legs_1 -L 500M black_bird /dev/sdf1:0-2000 /dev/sda1:0-2000 /dev/sdb1:0-2000 /dev/sdc1:0-2000 /dev/sdg1:0-2000

Current mirror/raid device structure(s):
  LV                                        Attr       LSize   Cpy%Sync Devices
  synced_multiple_raid1_4legs_1             rwi-a-r--- 500.00m     0.00 synced_multiple_raid1_4legs_1_rimage_0(0),synced_multiple_raid1_4legs_1_rimage_1(0),synced_multiple_raid1_4legs_1_rimage_2(0),synced_multiple_raid1_4legs_1_rimage_3(0),synced_multiple_raid1_4legs_1_rimage_4(0)
  [synced_multiple_raid1_4legs_1_rimage_0]  Iwi-aor--- 500.00m          /dev/sdf1(1)
  [synced_multiple_raid1_4legs_1_rimage_1]  Iwi-aor--- 500.00m          /dev/sda1(1)
  [synced_multiple_raid1_4legs_1_rimage_2]  Iwi-aor--- 500.00m          /dev/sdb1(1)
  [synced_multiple_raid1_4legs_1_rimage_3]  Iwi-aor--- 500.00m          /dev/sdc1(1)
  [synced_multiple_raid1_4legs_1_rimage_4]  Iwi-aor--- 500.00m          /dev/sdg1(1)
  [synced_multiple_raid1_4legs_1_rmeta_0]   ewi-aor---   4.00m          /dev/sdf1(0)
  [synced_multiple_raid1_4legs_1_rmeta_1]   ewi-aor---   4.00m          /dev/sda1(0)
  [synced_multiple_raid1_4legs_1_rmeta_2]   ewi-aor---   4.00m          /dev/sdb1(0)
  [synced_multiple_raid1_4legs_1_rmeta_3]   ewi-aor---   4.00m          /dev/sdc1(0)
  [synced_multiple_raid1_4legs_1_rmeta_4]   ewi-aor---   4.00m          /dev/sdg1(0)

/dev/sda1 IS in the mirror
/dev/sdb1 IS in the mirror
/dev/sdc1 IS in the mirror
/dev/sde1 is NOT in the mirror
/dev/sdf1 IS in the mirror
/dev/sdg1 IS in the mirror
/dev/sdh1 is NOT in the mirror
AVAIL:2 - NEEDED:2  will_alloc_work=yes

Waiting until all mirror|raid volumes become fully syncd...
1/1 mirror(s) are fully synced: ( 100.00% )

Creating ext on top of mirror(s) on virt-004.cluster-qe.lab.eng.brq.redhat.com...
mke2fs 1.41.12 (17-May-2010)
Mounting mirrored ext filesystems on virt-004.cluster-qe.lab.eng.brq.redhat.com...

PV=/dev/sdb1
        synced_multiple_raid1_4legs_1_rimage_2: 1.0
        synced_multiple_raid1_4legs_1_rmeta_2: 1.0
PV=/dev/sdf1
        synced_multiple_raid1_4legs_1_rimage_0: 1.0
        synced_multiple_raid1_4legs_1_rmeta_0: 1.0

Creating a snapshot volume of each of the raids
Writing verification files (checkit) to mirror(s) on...
        ---- virt-004.cluster-qe.lab.eng.brq.redhat.com ----

Sleeping 15 seconds to get some outsanding EXT I/O locks before the failure
Verifying files (checkit) on mirror(s) on...
        ---- virt-004.cluster-qe.lab.eng.brq.redhat.com ----

Disabling device sdf on virt-004.cluster-qe.lab.eng.brq.redhat.com
Disabling device sdb on virt-004.cluster-qe.lab.eng.brq.redhat.com

Getting recovery check start time from /var/log/messages: Nov 15 00:37
Attempting I/O to cause mirror down conversion(s) on virt-004.cluster-qe.lab.eng.brq.redhat.com
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.312702 s, 134 MB/s

Verifying current sanity of lvm after the failure

Current mirror/raid device structure(s):
  Couldn't find device with uuid 6913Qo-v4h6-Wa2D-lh2O-pQq5-v5Ii-BiybDm.
  Couldn't find device with uuid xTH2Ah-DWUp-QBKb-X0fa-nDpd-dk8N-6HqTrr.
  LV                                        Attr       LSize   Cpy%Sync Devices
  bb_snap1                                  swi-a-s--- 252.00m          /dev/sda1(127)
  synced_multiple_raid1_4legs_1             owi-aor--- 500.00m    53.60 synced_multiple_raid1_4legs_1_rimage_0(0),synced_multiple_raid1_4legs_1_rimage_1(0),synced_multiple_raid1_4legs_1_rimage_2(0),synced_multiple_raid1_4legs_1_rimage_3(0),synced_multiple_raid1_4legs_1_rimage_4(0)
  [synced_multiple_raid1_4legs_1_rimage_0]  Iwi-aor--- 500.00m          /dev/sde1(1)
  [synced_multiple_raid1_4legs_1_rimage_1]  iwi-aor--- 500.00m          /dev/sda1(1)
  [synced_multiple_raid1_4legs_1_rimage_2]  Iwi-aor--- 500.00m          /dev/sdh1(1)
  [synced_multiple_raid1_4legs_1_rimage_3]  iwi-aor--- 500.00m          /dev/sdc1(1)
  [synced_multiple_raid1_4legs_1_rimage_4]  iwi-aor--- 500.00m          /dev/sdg1(1)
  [synced_multiple_raid1_4legs_1_rmeta_0]   ewi-aor---   4.00m          /dev/sde1(0)
  [synced_multiple_raid1_4legs_1_rmeta_1]   ewi-aor---   4.00m          /dev/sda1(0)
  [synced_multiple_raid1_4legs_1_rmeta_2]   ewi-aor---   4.00m          /dev/sdh1(0)
  [synced_multiple_raid1_4legs_1_rmeta_3]   ewi-aor---   4.00m          /dev/sdc1(0)
  [synced_multiple_raid1_4legs_1_rmeta_4]   ewi-aor---   4.00m          /dev/sdg1(0)

Verifying FAILED device /dev/sdf1 is *NOT* in the volume(s)
Verifying FAILED device /dev/sdb1 is *NOT* in the volume(s)
Verifying IMAGE device /dev/sda1 *IS* in the volume(s)
Verifying IMAGE device /dev/sdc1 *IS* in the volume(s)
Verifying IMAGE device /dev/sdg1 *IS* in the volume(s)

verify the rimage/rmeta dm devices remain after the failures
Checking EXISTENCE and STATE of synced_multiple_raid1_4legs_1_rimage_2 on: virt-004.cluster-qe.lab.eng.brq.redhat.com
Checking EXISTENCE and STATE of synced_multiple_raid1_4legs_1_rmeta_2 on: virt-004.cluster-qe.lab.eng.brq.redhat.com
Checking EXISTENCE and STATE of synced_multiple_raid1_4legs_1_rimage_0 on: virt-004.cluster-qe.lab.eng.brq.redhat.com
Checking EXISTENCE and STATE of synced_multiple_raid1_4legs_1_rmeta_0 on: virt-004.cluster-qe.lab.eng.brq.redhat.com

Verify the raid image order is what's expected based on raid fault policy
EXPECTED LEG ORDER: unknown /dev/sda1 unknown /dev/sdc1 /dev/sdg1
ACTUAL LEG ORDER:   /dev/sde1 /dev/sda1 /dev/sdh1 /dev/sdc1 /dev/sdg1
unknown ne /dev/sde1
/dev/sda1 ne /dev/sda1
unknown ne /dev/sdh1
/dev/sdc1 ne /dev/sdc1
/dev/sdg1 ne /dev/sdg1

Verifying files (checkit) on mirror(s) on...
        ---- virt-004.cluster-qe.lab.eng.brq.redhat.com ----

Enabling device sdf on virt-004.cluster-qe.lab.eng.brq.redhat.com
Enabling device sdb on virt-004.cluster-qe.lab.eng.brq.redhat.com

Verify that each of the raid repairs finished successfully
repair of raid LV black_bird-synced_multiple_raid1_4legs_1 failed on virt-004.cluster-qe.lab.eng.brq.redhat.com

Nov 15 00:37:35 virt-004 qarshd[7370]: Running cmdline: echo offline > /sys/block/sdf/device/state &
Nov 15 00:37:36 virt-004 qarshd[7373]: Running cmdline: echo offline > /sys/block/sdb/device/state &
[...]
Nov 15 00:37:37 virt-004 lvm[5930]: /dev/sdb1: read failed after 0 of 1024 at 4096: Input/output error
Nov 15 00:37:37 virt-004 lvm[5930]: Failed to write changes to synced_multiple_raid1_4legs_1 in black_bird
Nov 15 00:37:37 virt-004 lvm[5930]: Failed to replace faulty devices in black_bird/synced_multiple_raid1_4legs_1.
Nov 15 00:37:37 virt-004 lvm[5930]: Repair of RAID device black_bird-synced_multiple_raid1_4legs_1-real failed.
Nov 15 00:37:37 virt-004 lvm[5930]: Failed to process event for black_bird-synced_multiple_raid1_4legs_1-real
Nov 15 00:37:42 virt-004 kernel: sd 6:0:0:1: rejecting I/O to offline device
Nov 15 00:37:42 virt-004 kernel: md: super_written gets error=-5, uptodate=0
Nov 15 00:37:42 virt-004 kernel: md/raid1:mdX: Disk failure on dm-7, disabling device.
Nov 15 00:37:42 virt-004 kernel: md/raid1:mdX: Operation continuing on 3 devices.
Nov 15 00:37:42 virt-004 lvm[5930]: Device #0 of raid1 array, black_bird-synced_multiple_raid1_4legs_1-real, has failed.
[...]
Nov 15 00:37:42 virt-004 kernel: sd 3:0:0:1: rejecting I/O to offline device
Nov 15 00:37:42 virt-004 kernel: sd 3:0:0:1: rejecting I/O to offline device
Nov 15 00:37:42 virt-004 kernel: device-mapper: raid: Device 2 specified for rebuild: Clearing superblock
Nov 15 00:37:42 virt-004 kernel: device-mapper: raid: Device 0 specified for rebuild: Clearing superblock
Nov 15 00:37:42 virt-004 kernel: md/raid1:mdX: active with 3 out of 5 mirrors
Nov 15 00:37:42 virt-004 kernel: created bitmap (1 pages) for device mdX
Nov 15 00:37:42 virt-004 kernel: mdX: bitmap initialized from disk: read 1 pages, set 4 of 1000 bits
Nov 15 00:37:42 virt-004 lvm[5930]: Monitoring RAID device black_bird-synced_multiple_raid1_4legs_1-real for events.
Nov 15 00:37:43 virt-004 lvm[5930]: Monitoring RAID device black_bird-synced_multiple_raid1_4legs_1-real for events.
Nov 15 00:37:43 virt-004 lvm[5930]: Faulty devices in black_bird/synced_multiple_raid1_4legs_1 successfully replaced.

Version-Release number of selected component (if applicable):
2.6.32-425.el6.x86_64

lvm2-2.02.100-8.el6                          BUILT: Wed Oct 30 09:10:56 CET 2013
lvm2-libs-2.02.100-8.el6                     BUILT: Wed Oct 30 09:10:56 CET 2013
lvm2-cluster-2.02.100-8.el6                  BUILT: Wed Oct 30 09:10:56 CET 2013
udev-147-2.51.el6                            BUILT: Thu Oct 17 13:14:34 CEST 2013
device-mapper-1.02.79-8.el6                  BUILT: Wed Oct 30 09:10:56 CET 2013
device-mapper-libs-1.02.79-8.el6             BUILT: Wed Oct 30 09:10:56 CET 2013
device-mapper-event-1.02.79-8.el6            BUILT: Wed Oct 30 09:10:56 CET 2013
device-mapper-event-libs-1.02.79-8.el6       BUILT: Wed Oct 30 09:10:56 CET 2013
device-mapper-persistent-data-0.2.8-2.el6    BUILT: Mon Oct 21 16:14:25 CEST 2013
cmirror-2.02.100-8.el6                       BUILT: Wed Oct 30 09:10:56 CET 2013

How reproducible:
Often
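For anyone trying this by hand rather than through the test harness, a minimal sketch of the setup step follows. It assumes a VG named black_bird built on the same PVs as above (device names will differ per host), and the sync-wait loop is my own addition, not harness code:

  # dmeventd only attempts leg replacement via allocation when
  # activation/raid_fault_policy is set to "allocate" in /etc/lvm/lvm.conf:
  lvm dumpconfig activation/raid_fault_policy

  # Create the 5-leg (-m 4) raid1 exactly as the harness did:
  lvcreate --type raid1 -m 4 -n synced_multiple_raid1_4legs_1 -L 500M \
      black_bird /dev/sdf1:0-2000 /dev/sda1:0-2000 /dev/sdb1:0-2000 \
      /dev/sdc1:0-2000 /dev/sdg1:0-2000

  # Wait for the initial sync to finish before injecting any failures:
  while [ "$(lvs --noheadings -o copy_percent black_bird/synced_multiple_raid1_4legs_1 | tr -d ' ')" != "100.00" ]; do
      sleep 5
  done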
This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.
Using the following command to test: ./black_bird -o bp-01 -l /usr/tests/sts-rhel6.5/ -r /usr/tests/sts-rhel6.5/ -e kill_multiple_synced_raid1_4legs
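For reference, the failure-injection step that scenario performs, distilled from the qarshd lines in the log above (the mount point below is an assumption; substitute wherever the ext filesystem is mounted):

  # Offline both legs at the SCSI layer, as the harness does:
  echo offline > /sys/block/sdf/device/state
  echo offline > /sys/block/sdb/device/state

  # Push I/O through the raid so the kernel marks the legs failed and
  # dmeventd kicks off the "allocate" repair:
  dd if=/dev/zero of=/mnt/synced_multiple_raid1_4legs_1/ddfile bs=4M count=10 oflag=sync

  # Check whether the automatic repair went through:
  grep 'Repair of RAID device' /var/log/messages | tail -5
  lvs -a -o +devices black_bird

  # If dmeventd's attempt failed, a manual retry is:
  lvconvert --repair black_bird/synced_multiple_raid1_4legs_1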
Seems to run just fine for me (10+ iterations) if lvmetad isn't used. I'm running into issues re-enabling devices when lvmetad is used. I'll work around that and try those tests again.
After working around the issue with re-enabling failed devices, lvmetad seems to work fine as well. I am testing the upstream code ATM, so the issue may already have been addressed outside the RAID code. I will attempt testing with the 6.5 RPMs and see if I can reproduce.
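Something along these lines works for re-enabling (a sketch, assuming the legs were failed with "echo offline" as in the log; the pvscan step is only needed with use_lvmetad = 1):

  # Bring the SCSI devices back online:
  echo running > /sys/block/sdf/device/state
  echo running > /sys/block/sdb/device/state

  # With lvmetad in use, tell it the PVs have returned:
  pvscan --cache /dev/sdf1 /dev/sdb1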
10 iterations of black_bird with the RHEL6.5 RPMs - no reproduction. Maybe I'll save this for a weekend or overnight run. In the meantime, can QA reproduce it?
I'm closing this one. If it can be reproduced then we'll reopen, but I think we've given this enough consideration.