Bug 1031204 - raid volumes experiencing multiple device failures can see repair failures
Status: CLOSED WORKSFORME
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: lvm2
Version: 6.5
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Jonathan Earl Brassow
QA Contact: Cluster QE
Depends On:
Blocks:
 
Reported: 2013-11-15 16:47 EST by Corey Marthaler
Modified: 2014-12-11 18:07 EST
CC List: 10 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Cloned To: 1067229
Environment:
Last Closed: 2014-05-30 13:12:55 EDT
Type: Bug


Attachments: None
Description Corey Marthaler 2013-11-15 16:47:37 EST
Description of problem:
I feel like this bug may have already been filed, but I couldn't find an existing report that didn't involve partial allocation (bug 824159) or mirrored volumes (bug 1016296).

In this case, two devices were failed and there were two free devices in the VG for the allocation to work with. The allocation did appear to work; it's just that the repair itself reported a failure.
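
For reference, a minimal manual sketch of the sequence the harness is exercising below (device names, mount point, and LV name are illustrative assumptions, not the exact test configuration):

  # Create a 5-way (-m 4) raid1 LV, leaving at least two unused PVs in the
  # VG so the 'allocate' fault policy has somewhere to place replacements.
  lvcreate --type raid1 -m 4 -L 500M -n repro_raid1 black_bird
  lvs -a -o name,attr,copy_percent,devices black_bird

  # Once the LV is fully synced, fail two of the legs at the SCSI layer
  # (the same mechanism the test uses), then generate I/O so dmeventd
  # notices the failures and attempts a repair.
  echo offline > /sys/block/sdf/device/state
  echo offline > /sys/block/sdb/device/state
  dd if=/dev/zero of=/mnt/repro_raid1/ddfile bs=4M count=10 oflag=sync

  # dmeventd's raid plugin performs a repair roughly equivalent to the
  # command below; running it by hand shows whether replacement legs
  # can actually be allocated.
  lvconvert --repair black_bird/repro_raid1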

================================================================================
Iteration 0.1 started at Thu Nov 14 17:36:08 CST 2013
================================================================================
Scenario kill_multiple_synced_raid1_4legs: Kill multiple legs of synced 4 leg raid1 volume(s)

********* RAID hash info for this scenario *********
* names:              synced_multiple_raid1_4legs_1
* sync:               1
* type:               raid1
* -m |-i value:       4
* leg devices:        /dev/sdf1 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdg1
* failpv(s):          /dev/sdf1 /dev/sdb1
* failnode(s):        virt-004.cluster-qe.lab.eng.brq.redhat.com
* additional snap:    /dev/sda1
* lvmetad:             0
* raid fault policy:   allocate
******************************************************

Creating raids(s) on virt-004.cluster-qe.lab.eng.brq.redhat.com...
virt-004.cluster-qe.lab.eng.brq.redhat.com: lvcreate --type raid1 -m 4 -n synced_multiple_raid1_4legs_1 -L 500M black_bird /dev/sdf1:0-2000 /dev/sda1:0-2000 /dev/sdb1:0-2000 /dev/sdc1:0-2000 /dev/sdg1:0-2000

Current mirror/raid device structure(s):
  LV                                       Attr       LSize   Cpy%Sync Devices
  synced_multiple_raid1_4legs_1            rwi-a-r--- 500.00m     0.00 synced_multiple_raid1_4legs_1_rimage_0(0),synced_multiple_raid1_4legs_1_rimage_1(0),synced_multiple_raid1_4legs_1_rimage_2(0),synced_multiple_raid1_4legs_1_rimage_3(0),synced_multiple_raid1_4legs_1_rimage_4(0)
  [synced_multiple_raid1_4legs_1_rimage_0] Iwi-aor--- 500.00m          /dev/sdf1(1)
  [synced_multiple_raid1_4legs_1_rimage_1] Iwi-aor--- 500.00m          /dev/sda1(1)
  [synced_multiple_raid1_4legs_1_rimage_2] Iwi-aor--- 500.00m          /dev/sdb1(1)
  [synced_multiple_raid1_4legs_1_rimage_3] Iwi-aor--- 500.00m          /dev/sdc1(1)
  [synced_multiple_raid1_4legs_1_rimage_4] Iwi-aor--- 500.00m          /dev/sdg1(1)
  [synced_multiple_raid1_4legs_1_rmeta_0]  ewi-aor---   4.00m          /dev/sdf1(0)
  [synced_multiple_raid1_4legs_1_rmeta_1]  ewi-aor---   4.00m          /dev/sda1(0)
  [synced_multiple_raid1_4legs_1_rmeta_2]  ewi-aor---   4.00m          /dev/sdb1(0)
  [synced_multiple_raid1_4legs_1_rmeta_3]  ewi-aor---   4.00m          /dev/sdc1(0)
  [synced_multiple_raid1_4legs_1_rmeta_4]  ewi-aor---   4.00m          /dev/sdg1(0)

/dev/sda1 IS in the mirror
/dev/sdb1 IS in the mirror
/dev/sdc1 IS in the mirror
/dev/sde1 is NOT in the mirror
/dev/sdf1 IS in the mirror
/dev/sdg1 IS in the mirror
/dev/sdh1 is NOT in the mirror
AVAIL:2 - NEEDED:2
will_alloc_work=yes

Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )

Creating ext on top of mirror(s) on virt-004.cluster-qe.lab.eng.brq.redhat.com...
mke2fs 1.41.12 (17-May-2010)
Mounting mirrored ext filesystems on virt-004.cluster-qe.lab.eng.brq.redhat.com...

PV=/dev/sdb1
        synced_multiple_raid1_4legs_1_rimage_2: 1.0
        synced_multiple_raid1_4legs_1_rmeta_2: 1.0
PV=/dev/sdf1
        synced_multiple_raid1_4legs_1_rimage_0: 1.0
        synced_multiple_raid1_4legs_1_rmeta_0: 1.0

Creating a snapshot volume of each of the raids
Writing verification files (checkit) to mirror(s) on...
        ---- virt-004.cluster-qe.lab.eng.brq.redhat.com ----

Sleeping 15 seconds to get some outstanding EXT I/O locks before the failure
Verifying files (checkit) on mirror(s) on...
        ---- virt-004.cluster-qe.lab.eng.brq.redhat.com ----

Disabling device sdf on virt-004.cluster-qe.lab.eng.brq.redhat.com
Disabling device sdb on virt-004.cluster-qe.lab.eng.brq.redhat.com

Getting recovery check start time from /var/log/messages: Nov 15 00:37
Attempting I/O to cause mirror down conversion(s) on virt-004.cluster-qe.lab.eng.brq.redhat.com
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.312702 s, 134 MB/s

Verifying current sanity of lvm after the failure

Current mirror/raid device structure(s):
  Couldn't find device with uuid 6913Qo-v4h6-Wa2D-lh2O-pQq5-v5Ii-BiybDm.
  Couldn't find device with uuid xTH2Ah-DWUp-QBKb-X0fa-nDpd-dk8N-6HqTrr.
  LV                                       Attr       LSize   Cpy%Sync Devices
  bb_snap1                                 swi-a-s--- 252.00m          /dev/sda1(127)
  synced_multiple_raid1_4legs_1            owi-aor--- 500.00m    53.60 synced_multiple_raid1_4legs_1_rimage_0(0),synced_multiple_raid1_4legs_1_rimage_1(0),synced_multiple_raid1_4legs_1_rimage_2(0),synced_multiple_raid1_4legs_1_rimage_3(0),synced_multiple_raid1_4legs_1_rimage_4(0)
  [synced_multiple_raid1_4legs_1_rimage_0] Iwi-aor--- 500.00m          /dev/sde1(1)
  [synced_multiple_raid1_4legs_1_rimage_1] iwi-aor--- 500.00m          /dev/sda1(1)
  [synced_multiple_raid1_4legs_1_rimage_2] Iwi-aor--- 500.00m          /dev/sdh1(1)
  [synced_multiple_raid1_4legs_1_rimage_3] iwi-aor--- 500.00m          /dev/sdc1(1)
  [synced_multiple_raid1_4legs_1_rimage_4] iwi-aor--- 500.00m          /dev/sdg1(1)
  [synced_multiple_raid1_4legs_1_rmeta_0]  ewi-aor---   4.00m          /dev/sde1(0)
  [synced_multiple_raid1_4legs_1_rmeta_1]  ewi-aor---   4.00m          /dev/sda1(0)
  [synced_multiple_raid1_4legs_1_rmeta_2]  ewi-aor---   4.00m          /dev/sdh1(0)
  [synced_multiple_raid1_4legs_1_rmeta_3]  ewi-aor---   4.00m          /dev/sdc1(0)
  [synced_multiple_raid1_4legs_1_rmeta_4]  ewi-aor---   4.00m          /dev/sdg1(0)

Verifying FAILED device /dev/sdf1 is *NOT* in the volume(s)
Verifying FAILED device /dev/sdb1 is *NOT* in the volume(s)
Verifying IMAGE device /dev/sda1 *IS* in the volume(s)
Verifying IMAGE device /dev/sdc1 *IS* in the volume(s)
Verifying IMAGE device /dev/sdg1 *IS* in the volume(s)
verify the rimage/rmeta dm devices remain after the failures
Checking EXISTENCE and STATE of synced_multiple_raid1_4legs_1_rimage_2 on: virt-004.cluster-qe.lab.eng.brq.redhat.com 
Checking EXISTENCE and STATE of synced_multiple_raid1_4legs_1_rmeta_2 on: virt-004.cluster-qe.lab.eng.brq.redhat.com 
Checking EXISTENCE and STATE of synced_multiple_raid1_4legs_1_rimage_0 on: virt-004.cluster-qe.lab.eng.brq.redhat.com 
Checking EXISTENCE and STATE of synced_multiple_raid1_4legs_1_rmeta_0 on: virt-004.cluster-qe.lab.eng.brq.redhat.com 

Verify the raid image order is what's expected based on raid fault policy
EXPECTED LEG ORDER: unknown /dev/sda1 unknown /dev/sdc1 /dev/sdg1
ACTUAL LEG ORDER: /dev/sde1 /dev/sda1 /dev/sdh1 /dev/sdc1 /dev/sdg1
unknown ne /dev/sde1
/dev/sda1 ne /dev/sda1
unknown ne /dev/sdh1
/dev/sdc1 ne /dev/sdc1
/dev/sdg1 ne /dev/sdg1
Verifying files (checkit) on mirror(s) on...
        ---- virt-004.cluster-qe.lab.eng.brq.redhat.com ----

Enabling device sdf on virt-004.cluster-qe.lab.eng.brq.redhat.com
Enabling device sdb on virt-004.cluster-qe.lab.eng.brq.redhat.com

Verify that each of the raid repairs finished successfully
repair of raid LV black_bird-synced_multiple_raid1_4legs_1 failed on virt-004.cluster-qe.lab.eng.brq.redhat.com
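
The repair reported above is attempted automatically by dmeventd according to the RAID fault policy noted in the scenario header. A quick way to confirm the policy in effect on the node (the expected value is assumed from the scenario settings):

  # Show the active policy from the activation section of lvm.conf
  # (grep /etc/lvm/lvm.conf for raid_fault_policy works too).
  lvm dumpconfig activation/raid_fault_policy
  # Expected for this scenario:
  #   raid_fault_policy="allocate"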


Nov 15 00:37:35 virt-004 qarshd[7370]: Running cmdline: echo offline > /sys/block/sdf/device/state &
Nov 15 00:37:36 virt-004 qarshd[7373]: Running cmdline: echo offline > /sys/block/sdb/device/state &
[...]
Nov 15 00:37:37 virt-004 lvm[5930]: /dev/sdb1: read failed after 0 of 1024 at 4096: Input/output error
Nov 15 00:37:37 virt-004 lvm[5930]: Failed to write changes to synced_multiple_raid1_4legs_1 in black_bird
Nov 15 00:37:37 virt-004 lvm[5930]: Failed to replace faulty devices in black_bird/synced_multiple_raid1_4legs_1.
Nov 15 00:37:37 virt-004 lvm[5930]: Repair of RAID device black_bird-synced_multiple_raid1_4legs_1-real failed.
Nov 15 00:37:37 virt-004 lvm[5930]: Failed to process event for black_bird-synced_multiple_raid1_4legs_1-real
Nov 15 00:37:42 virt-004 kernel: sd 6:0:0:1: rejecting I/O to offline device
Nov 15 00:37:42 virt-004 kernel: md: super_written gets error=-5, uptodate=0
Nov 15 00:37:42 virt-004 kernel: md/raid1:mdX: Disk failure on dm-7, disabling device.
Nov 15 00:37:42 virt-004 kernel: md/raid1:mdX: Operation continuing on 3 devices.
Nov 15 00:37:42 virt-004 lvm[5930]: Device #0 of raid1 array, black_bird-synced_multiple_raid1_4legs_1-real, has failed.
[...]
Nov 15 00:37:42 virt-004 kernel: sd 3:0:0:1: rejecting I/O to offline device
Nov 15 00:37:42 virt-004 kernel: sd 3:0:0:1: rejecting I/O to offline device
Nov 15 00:37:42 virt-004 kernel: device-mapper: raid: Device 2 specified for rebuild: Clearing superblock
Nov 15 00:37:42 virt-004 kernel: device-mapper: raid: Device 0 specified for rebuild: Clearing superblock
Nov 15 00:37:42 virt-004 kernel: md/raid1:mdX: active with 3 out of 5 mirrors
Nov 15 00:37:42 virt-004 kernel: created bitmap (1 pages) for device mdX
Nov 15 00:37:42 virt-004 kernel: mdX: bitmap initialized from disk: read 1 pages, set 4 of 1000 bits
Nov 15 00:37:42 virt-004 lvm[5930]: Monitoring RAID device black_bird-synced_multiple_raid1_4legs_1-real for events.
Nov 15 00:37:43 virt-004 lvm[5930]: Monitoring RAID device black_bird-synced_multiple_raid1_4legs_1-real for events.
Nov 15 00:37:43 virt-004 lvm[5930]: Faulty devices in black_bird/synced_multiple_raid1_4legs_1 successfully replaced.
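
Since the syslog first reports "Failed to replace faulty devices" and then, seconds later, "successfully replaced", a hedged way to double-check the final state by hand (names taken from the log above):

  # Did replacement legs actually get allocated, and are they resyncing?
  lvs -a -o name,attr,copy_percent,devices black_bird
  # Kernel-level health of the raid1 target backing the snapshot origin.
  dmsetup status black_bird-synced_multiple_raid1_4legs_1-real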




Version-Release number of selected component (if applicable):
2.6.32-425.el6.x86_64

lvm2-2.02.100-8.el6    BUILT: Wed Oct 30 09:10:56 CET 2013
lvm2-libs-2.02.100-8.el6    BUILT: Wed Oct 30 09:10:56 CET 2013
lvm2-cluster-2.02.100-8.el6    BUILT: Wed Oct 30 09:10:56 CET 2013
udev-147-2.51.el6    BUILT: Thu Oct 17 13:14:34 CEST 2013
device-mapper-1.02.79-8.el6    BUILT: Wed Oct 30 09:10:56 CET 2013
device-mapper-libs-1.02.79-8.el6    BUILT: Wed Oct 30 09:10:56 CET 2013
device-mapper-event-1.02.79-8.el6    BUILT: Wed Oct 30 09:10:56 CET 2013
device-mapper-event-libs-1.02.79-8.el6    BUILT: Wed Oct 30 09:10:56 CET 2013
device-mapper-persistent-data-0.2.8-2.el6    BUILT: Mon Oct 21 16:14:25 CEST 2013
cmirror-2.02.100-8.el6    BUILT: Wed Oct 30 09:10:56 CET 2013


How reproducible:
Often
Comment 2 RHEL Product and Program Management 2013-11-18 17:45:05 EST
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.
Comment 3 Jonathan Earl Brassow 2014-04-04 18:29:22 EDT
Using the following command to test:

./black_bird -o bp-01 -l /usr/tests/sts-rhel6.5/ -r /usr/tests/sts-rhel6.5/ -e kill_multiple_synced_raid1_4legs
Comment 4 Jonathan Earl Brassow 2014-04-07 15:20:54 EDT
Seems to run just fine for me (10+ iterations) if lvmetad isn't used.

I'm running into issues re-enabling devices when lvmetad is used.  I'll work around that and try those tests again.
Comment 5 Jonathan Earl Brassow 2014-04-07 16:02:17 EDT
After working around the problem with re-enabling failed devices, the lvmetad case seems to work fine as well.
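
(The exact workaround isn't spelled out here; the following is only an assumed sketch of the usual way a device failed via sysfs is brought back and made visible to lvmetad again.)

  # Reverse the earlier 'echo offline' at the SCSI layer.
  echo running > /sys/block/sdf/device/state
  echo running > /sys/block/sdb/device/state
  # With use_lvmetad = 1, re-scan the PVs so lvmetad sees the devices again.
  pvscan --cache /dev/sdf1 /dev/sdb1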


I am testing the upstream code ATM, so the issue may have been addressed outside RAID code already.

I will attempt 6.5 rpm testing and see if I can reproduce.
Comment 6 Jonathan Earl Brassow 2014-04-07 16:48:34 EDT
10 iterations of black_bird with the RHEL6.5 RPMs - no reproduction.

Maybe I'll save this for a weekend or overnight run.

In the meantime, can QA reproduce it?
Comment 9 Jonathan Earl Brassow 2014-05-30 13:12:55 EDT
I'm closing this one.  If it can be reproduced then we'll reopen, but I think we've given this enough consideration.
