Bug 1031204 - raid volumes experiencing multiple device failures can see repair failures
Status: CLOSED WORKSFORME
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: lvm2
Version: 6.5
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Jonathan Earl Brassow
QA Contact: Cluster QE
Depends On:
Blocks:
 
Reported: 2013-11-15 16:47 EST by Corey Marthaler
Modified: 2014-12-11 18:07 EST
CC List: 10 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Cloned To: 1067229
Environment:
Last Closed: 2014-05-30 13:12:55 EDT
Type: Bug


Attachments: None
Description Corey Marthaler 2013-11-15 16:47:37 EST
Description of problem:
I feel like this bug may have already been filed, but I couldn't find an existing report that didn't involve partial allocation (bug 824159) or mirrored volumes (bug 1016296).

In this case, two devices were failed and there were two free devices in the VG for the allocation to work with. The allocation did appear to work; it's just that the repair itself reported a failure.
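
For reference, a minimal manual sketch of the sequence the harness is exercising below (device names, mount point, and LV name are illustrative assumptions, not the exact test configuration):

  # Create a 5-way (-m 4) raid1 LV, leaving at least two unused PVs in the
  # VG so the 'allocate' fault policy has somewhere to place replacements.
  lvcreate --type raid1 -m 4 -L 500M -n repro_raid1 black_bird
  lvs -a -o name,attr,copy_percent,devices black_bird

  # Once the LV is fully synced, fail two of the legs at the SCSI layer
  # (the same mechanism the test uses), then generate I/O so dmeventd
  # notices the failures and attempts a repair.
  echo offline > /sys/block/sdf/device/state
  echo offline > /sys/block/sdb/device/state
  dd if=/dev/zero of=/mnt/repro_raid1/ddfile bs=4M count=10 oflag=sync

  # dmeventd's raid plugin performs a repair roughly equivalent to the
  # command below; running it by hand shows whether replacement legs
  # can actually be allocated.
  lvconvert --repair black_bird/repro_raid1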

================================================================================
Iteration 0.1 started at Thu Nov 14 17:36:08 CST 2013
================================================================================
Scenario kill_multiple_synced_raid1_4legs: Kill multiple legs of synced 4 leg raid1 volume(s)

********* RAID hash info for this scenario *********
* names:              synced_multiple_raid1_4legs_1
* sync:               1
* type:               raid1
* -m |-i value:       4
* leg devices:        /dev/sdf1 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdg1
* failpv(s):          /dev/sdf1 /dev/sdb1
* failnode(s):        virt-004.cluster-qe.lab.eng.brq.redhat.com
* additional snap:    /dev/sda1
* lvmetad:             0
* raid fault policy:   allocate
******************************************************

Creating raids(s) on virt-004.cluster-qe.lab.eng.brq.redhat.com...
virt-004.cluster-qe.lab.eng.brq.redhat.com: lvcreate --type raid1 -m 4 -n synced_multiple_raid1_4legs_1 -L 500M black_bird /dev/sdf1:0-2000 /dev/sda1:0-2000 /dev/sdb1:0-2000 /dev/sdc1:0-2000 /dev/sdg1:0-2000

Current mirror/raid device structure(s):
  LV                                       Attr       LSize   Cpy%Sync Devices
  synced_multiple_raid1_4legs_1            rwi-a-r--- 500.00m     0.00 synced_multiple_raid1_4legs_1_rimage_0(0),synced_multiple_raid1_4legs_1_rimage_1(0),synced_multiple_raid1_4legs_1_rimage_2(0),synced_multiple_raid1_4legs_1_rimage_3(0),synced_multiple_raid1_4legs_1_rimage_4(0)
  [synced_multiple_raid1_4legs_1_rimage_0] Iwi-aor--- 500.00m          /dev/sdf1(1)
  [synced_multiple_raid1_4legs_1_rimage_1] Iwi-aor--- 500.00m          /dev/sda1(1)
  [synced_multiple_raid1_4legs_1_rimage_2] Iwi-aor--- 500.00m          /dev/sdb1(1)
  [synced_multiple_raid1_4legs_1_rimage_3] Iwi-aor--- 500.00m          /dev/sdc1(1)
  [synced_multiple_raid1_4legs_1_rimage_4] Iwi-aor--- 500.00m          /dev/sdg1(1)
  [synced_multiple_raid1_4legs_1_rmeta_0]  ewi-aor---   4.00m          /dev/sdf1(0)
  [synced_multiple_raid1_4legs_1_rmeta_1]  ewi-aor---   4.00m          /dev/sda1(0)
  [synced_multiple_raid1_4legs_1_rmeta_2]  ewi-aor---   4.00m          /dev/sdb1(0)
  [synced_multiple_raid1_4legs_1_rmeta_3]  ewi-aor---   4.00m          /dev/sdc1(0)
  [synced_multiple_raid1_4legs_1_rmeta_4]  ewi-aor---   4.00m          /dev/sdg1(0)

/dev/sda1 IS in the mirror
/dev/sdb1 IS in the mirror
/dev/sdc1 IS in the mirror
/dev/sde1 is NOT in the mirror
/dev/sdf1 IS in the mirror
/dev/sdg1 IS in the mirror
/dev/sdh1 is NOT in the mirror
AVAIL:2 - NEEDED:2
will_alloc_work=yes

Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )

Creating ext on top of mirror(s) on virt-004.cluster-qe.lab.eng.brq.redhat.com...
mke2fs 1.41.12 (17-May-2010)
Mounting mirrored ext filesystems on virt-004.cluster-qe.lab.eng.brq.redhat.com...

PV=/dev/sdb1
        synced_multiple_raid1_4legs_1_rimage_2: 1.0
        synced_multiple_raid1_4legs_1_rmeta_2: 1.0
PV=/dev/sdf1
        synced_multiple_raid1_4legs_1_rimage_0: 1.0
        synced_multiple_raid1_4legs_1_rmeta_0: 1.0

Creating a snapshot volume of each of the raids
Writing verification files (checkit) to mirror(s) on...
        ---- virt-004.cluster-qe.lab.eng.brq.redhat.com ----

Sleeping 15 seconds to get some outstanding EXT I/O locks before the failure
Verifying files (checkit) on mirror(s) on...
        ---- virt-004.cluster-qe.lab.eng.brq.redhat.com ----

Disabling device sdf on virt-004.cluster-qe.lab.eng.brq.redhat.com
Disabling device sdb on virt-004.cluster-qe.lab.eng.brq.redhat.com

Getting recovery check start time from /var/log/messages: Nov 15 00:37
Attempting I/O to cause mirror down conversion(s) on virt-004.cluster-qe.lab.eng.brq.redhat.com
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.312702 s, 134 MB/s

Verifying current sanity of lvm after the failure

Current mirror/raid device structure(s):
  Couldn't find device with uuid 6913Qo-v4h6-Wa2D-lh2O-pQq5-v5Ii-BiybDm.
  Couldn't find device with uuid xTH2Ah-DWUp-QBKb-X0fa-nDpd-dk8N-6HqTrr.
  LV                                       Attr       LSize   Cpy%Sync Devices
  bb_snap1                                 swi-a-s--- 252.00m          /dev/sda1(127)
  synced_multiple_raid1_4legs_1            owi-aor--- 500.00m    53.60 synced_multiple_raid1_4legs_1_rimage_0(0),synced_multiple_raid1_4legs_1_rimage_1(0),synced_multiple_raid1_4legs_1_rimage_2(0),synced_multiple_raid1_4legs_1_rimage_3(0),synced_multiple_raid1_4legs_1_rimage_4(0)
  [synced_multiple_raid1_4legs_1_rimage_0] Iwi-aor--- 500.00m          /dev/sde1(1)
  [synced_multiple_raid1_4legs_1_rimage_1] iwi-aor--- 500.00m          /dev/sda1(1)
  [synced_multiple_raid1_4legs_1_rimage_2] Iwi-aor--- 500.00m          /dev/sdh1(1)
  [synced_multiple_raid1_4legs_1_rimage_3] iwi-aor--- 500.00m          /dev/sdc1(1)
  [synced_multiple_raid1_4legs_1_rimage_4] iwi-aor--- 500.00m          /dev/sdg1(1)
  [synced_multiple_raid1_4legs_1_rmeta_0]  ewi-aor---   4.00m          /dev/sde1(0)
  [synced_multiple_raid1_4legs_1_rmeta_1]  ewi-aor---   4.00m          /dev/sda1(0)
  [synced_multiple_raid1_4legs_1_rmeta_2]  ewi-aor---   4.00m          /dev/sdh1(0)
  [synced_multiple_raid1_4legs_1_rmeta_3]  ewi-aor---   4.00m          /dev/sdc1(0)
  [synced_multiple_raid1_4legs_1_rmeta_4]  ewi-aor---   4.00m          /dev/sdg1(0)

Verifying FAILED device /dev/sdf1 is *NOT* in the volume(s)
Verifying FAILED device /dev/sdb1 is *NOT* in the volume(s)
Verifying IMAGE device /dev/sda1 *IS* in the volume(s)
Verifying IMAGE device /dev/sdc1 *IS* in the volume(s)
Verifying IMAGE device /dev/sdg1 *IS* in the volume(s)
verify the rimage/rmeta dm devices remain after the failures
Checking EXISTENCE and STATE of synced_multiple_raid1_4legs_1_rimage_2 on: virt-004.cluster-qe.lab.eng.brq.redhat.com 
Checking EXISTENCE and STATE of synced_multiple_raid1_4legs_1_rmeta_2 on: virt-004.cluster-qe.lab.eng.brq.redhat.com 
Checking EXISTENCE and STATE of synced_multiple_raid1_4legs_1_rimage_0 on: virt-004.cluster-qe.lab.eng.brq.redhat.com 
Checking EXISTENCE and STATE of synced_multiple_raid1_4legs_1_rmeta_0 on: virt-004.cluster-qe.lab.eng.brq.redhat.com 

Verify the raid image order is what's expected based on raid fault policy
EXPECTED LEG ORDER: unknown /dev/sda1 unknown /dev/sdc1 /dev/sdg1
ACTUAL LEG ORDER: /dev/sde1 /dev/sda1 /dev/sdh1 /dev/sdc1 /dev/sdg1
unknown ne /dev/sde1
/dev/sda1 ne /dev/sda1
unknown ne /dev/sdh1
/dev/sdc1 ne /dev/sdc1
/dev/sdg1 ne /dev/sdg1
Verifying files (checkit) on mirror(s) on...
        ---- virt-004.cluster-qe.lab.eng.brq.redhat.com ----

Enabling device sdf on virt-004.cluster-qe.lab.eng.brq.redhat.com
Enabling device sdb on virt-004.cluster-qe.lab.eng.brq.redhat.com

Verify that each of the raid repairs finished successfully
repair of raid LV black_bird-synced_multiple_raid1_4legs_1 failed on virt-004.cluster-qe.lab.eng.brq.redhat.com
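
The repair reported above is attempted automatically by dmeventd according to the RAID fault policy noted in the scenario header. A quick way to confirm the policy in effect on the node (the expected value is assumed from the scenario settings):

  # Show the active policy from the activation section of lvm.conf
  # (grep /etc/lvm/lvm.conf for raid_fault_policy works too).
  lvm dumpconfig activation/raid_fault_policy
  # Expected for this scenario:
  #   raid_fault_policy="allocate"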


Nov 15 00:37:35 virt-004 qarshd[7370]: Running cmdline: echo offline > /sys/block/sdf/device/state &
Nov 15 00:37:36 virt-004 qarshd[7373]: Running cmdline: echo offline > /sys/block/sdb/device/state &
[...]
Nov 15 00:37:37 virt-004 lvm[5930]: /dev/sdb1: read failed after 0 of 1024 at 4096: Input/output error
Nov 15 00:37:37 virt-004 lvm[5930]: Failed to write changes to synced_multiple_raid1_4legs_1 in black_bird
Nov 15 00:37:37 virt-004 lvm[5930]: Failed to replace faulty devices in black_bird/synced_multiple_raid1_4legs_1.
Nov 15 00:37:37 virt-004 lvm[5930]: Repair of RAID device black_bird-synced_multiple_raid1_4legs_1-real failed.
Nov 15 00:37:37 virt-004 lvm[5930]: Failed to process event for black_bird-synced_multiple_raid1_4legs_1-real
Nov 15 00:37:42 virt-004 kernel: sd 6:0:0:1: rejecting I/O to offline device
Nov 15 00:37:42 virt-004 kernel: md: super_written gets error=-5, uptodate=0
Nov 15 00:37:42 virt-004 kernel: md/raid1:mdX: Disk failure on dm-7, disabling device.
Nov 15 00:37:42 virt-004 kernel: md/raid1:mdX: Operation continuing on 3 devices.
Nov 15 00:37:42 virt-004 lvm[5930]: Device #0 of raid1 array, black_bird-synced_multiple_raid1_4legs_1-real, has failed.
[...]
Nov 15 00:37:42 virt-004 kernel: sd 3:0:0:1: rejecting I/O to offline device
Nov 15 00:37:42 virt-004 kernel: sd 3:0:0:1: rejecting I/O to offline device
Nov 15 00:37:42 virt-004 kernel: device-mapper: raid: Device 2 specified for rebuild: Clearing superblock
Nov 15 00:37:42 virt-004 kernel: device-mapper: raid: Device 0 specified for rebuild: Clearing superblock
Nov 15 00:37:42 virt-004 kernel: md/raid1:mdX: active with 3 out of 5 mirrors
Nov 15 00:37:42 virt-004 kernel: created bitmap (1 pages) for device mdX
Nov 15 00:37:42 virt-004 kernel: mdX: bitmap initialized from disk: read 1 pages, set 4 of 1000 bits
Nov 15 00:37:42 virt-004 lvm[5930]: Monitoring RAID device black_bird-synced_multiple_raid1_4legs_1-real for events.
Nov 15 00:37:43 virt-004 lvm[5930]: Monitoring RAID device black_bird-synced_multiple_raid1_4legs_1-real for events.
Nov 15 00:37:43 virt-004 lvm[5930]: Faulty devices in black_bird/synced_multiple_raid1_4legs_1 successfully replaced.
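
Since the syslog first reports "Failed to replace faulty devices" and then, seconds later, "successfully replaced", a hedged way to double-check the final state by hand (names taken from the log above):

  # Did replacement legs actually get allocated, and are they resyncing?
  lvs -a -o name,attr,copy_percent,devices black_bird
  # Kernel-level health of the raid1 target backing the snapshot origin.
  dmsetup status black_bird-synced_multiple_raid1_4legs_1-real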




Version-Release number of selected component (if applicable):
2.6.32-425.el6.x86_64

lvm2-2.02.100-8.el6    BUILT: Wed Oct 30 09:10:56 CET 2013
lvm2-libs-2.02.100-8.el6    BUILT: Wed Oct 30 09:10:56 CET 2013
lvm2-cluster-2.02.100-8.el6    BUILT: Wed Oct 30 09:10:56 CET 2013
udev-147-2.51.el6    BUILT: Thu Oct 17 13:14:34 CEST 2013
device-mapper-1.02.79-8.el6    BUILT: Wed Oct 30 09:10:56 CET 2013
device-mapper-libs-1.02.79-8.el6    BUILT: Wed Oct 30 09:10:56 CET 2013
device-mapper-event-1.02.79-8.el6    BUILT: Wed Oct 30 09:10:56 CET 2013
device-mapper-event-libs-1.02.79-8.el6    BUILT: Wed Oct 30 09:10:56 CET 2013
device-mapper-persistent-data-0.2.8-2.el6    BUILT: Mon Oct 21 16:14:25 CEST 2013
cmirror-2.02.100-8.el6    BUILT: Wed Oct 30 09:10:56 CET 2013


How reproducible:
Often
Comment 2 RHEL Product and Program Management 2013-11-18 17:45:05 EST
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.
Comment 3 Jonathan Earl Brassow 2014-04-04 18:29:22 EDT
Using the following command to test:

./black_bird -o bp-01 -l /usr/tests/sts-rhel6.5/ -r /usr/tests/sts-rhel6.5/ -e kill_multiple_synced_raid1_4legs
Comment 4 Jonathan Earl Brassow 2014-04-07 15:20:54 EDT
Seems to run just fine for me (10+ iterations) if lvmetad isn't used.

I'm running into issues re-enabling devices when lvmetad is used.  I'll work around that and try those tests again.
Comment 5 Jonathan Earl Brassow 2014-04-07 16:02:17 EDT
After working around the problem with re-enabling failed devices, the lvmetad case seems to work fine as well.
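
(The exact workaround isn't spelled out here; the following is only an assumed sketch of the usual way a device failed via sysfs is brought back and made visible to lvmetad again.)

  # Reverse the earlier 'echo offline' at the SCSI layer.
  echo running > /sys/block/sdf/device/state
  echo running > /sys/block/sdb/device/state
  # With use_lvmetad = 1, re-scan the PVs so lvmetad sees the devices again.
  pvscan --cache /dev/sdf1 /dev/sdb1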


I am testing the upstream code ATM, so the issue may have been addressed outside RAID code already.

I will attempt 6.5 rpm testing and see if I can reproduce.
Comment 6 Jonathan Earl Brassow 2014-04-07 16:48:34 EDT
10 iterations of black_bird with the RHEL6.5 RPMs - no reproduction.

Maybe I'll save this for a weekend or overnight run.

In the meantime, can QA reproduce it?
Comment 9 Jonathan Earl Brassow 2014-05-30 13:12:55 EDT
I'm closing this one.  If it can be reproduced then we'll reopen, but I think we've given this enough consideration.
