Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets there.

Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against the components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED".

If you cannot log in to RH Jira, please consult article #7032570. Failing that, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry; the e-mail creates a ServiceNow ticket with Red Hat.

Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", will have a little "two-footprint" icon next to it, and will direct you to the "RHEL project" in Red Hat Jira (issue links are of the form "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). The same link will also be available in a blue banner at the top of the page informing you that the bug has been migrated.

Bug 811669

Summary: Suspend/resume of an out-of-sync RAID LV will cause the sync process to stall
Product: Red Hat Enterprise Linux 6
Component: kernel
Version: 6.3
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Corey Marthaler <cmarthal>
Assignee: Jonathan Earl Brassow <jbrassow>
QA Contact: Cluster QE <mspqa-list>
Docs Contact:
CC: agk, dwysocha, heinzm, jbrassow, mbroz, msnitzer, prajnoha, prockai, thornber, zkabelac
Target Milestone: rc
Target Release: ---
Whiteboard:
Fixed In Version: kernel-2.6.32-269.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-06-20 08:46:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 739162

Description Corey Marthaler 2012-04-11 16:51:52 UTC
Description of problem:
Scenario kill_primary_synced_raid1_2legs: Kill primary leg of synced 2 leg raid1 volume(s)

********* RAID hash info for this scenario *********
* names:              synced_primary_raid1_2legs_1
* sync:               1
* type:               raid1
* -m |-i value:       2
* leg devices:        /dev/sde1 /dev/sdd1 /dev/sdc1
* failpv(s):          /dev/sde1
* failnode(s):        taft-01
* additional snap:    /dev/sdd1
* raid fault policy:   allocate
******************************************************

Creating raids(s) on taft-01...
taft-01: lvcreate --type raid1 -m 2 -n synced_primary_raid1_2legs_1 -L 500M black_bird /dev/sde1:0-1000 /dev/sdd1:0-1000 /dev/sdc1:0-1000

Creating a snapshot volume of each of the raids

RAID Structure(s):
  LV                                      Attr     LSize   Copy%  Devices
  bb_snap1                                swi-a-s- 252.00m        /dev/sdd1(126)
  synced_primary_raid1_2legs_1            owi-a-m- 500.00m   8.80 synced_primary_raid1_2legs_1_rimage_0(0),synced_primary_raid1_2legs_1_rimage_1(0),synced_primary_raid1_2legs_1_rimage_2(0)
  [synced_primary_raid1_2legs_1_rimage_0] Iwi-aor- 500.00m        /dev/sde1(1)
  [synced_primary_raid1_2legs_1_rimage_1] Iwi-aor- 500.00m        /dev/sdd1(1)
  [synced_primary_raid1_2legs_1_rimage_2] Iwi-aor- 500.00m        /dev/sdc1(1)
  [synced_primary_raid1_2legs_1_rmeta_0]  ewi-aor-   4.00m        /dev/sde1(0)
  [synced_primary_raid1_2legs_1_rmeta_1]  ewi-aor-   4.00m        /dev/sdd1(0)
  [synced_primary_raid1_2legs_1_rmeta_2]  ewi-aor-   4.00m        /dev/sdc1(0)

Waiting until all mirror|raid volumes become fully syncd...
   0/1 mirror(s) are fully synced: ( 8.82% )
   0/1 mirror(s) are fully synced: ( 8.82% )
   0/1 mirror(s) are fully synced: ( 8.82% )
   0/1 mirror(s) are fully synced: ( 8.82% )

# SYNC IS STUCK
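The test harness detects the stall by polling the copy percentage and seeing the same value on successive samples. A minimal sketch of that check, assuming a volume path like "black_bird/synced_primary_raid1_2legs_1" (the helper names here are hypothetical, not part of the harness):

```shell
#!/bin/sh
# Read the sync progress of an LV as a bare number, e.g. "8.82"
# (the volume path argument is an assumption for illustration).
sample_copy_percent() {
    lvs --noheadings -o copy_percent "$1" | tr -d ' %'
}

# Compare two samples taken a few seconds apart: an unchanged value on an
# unfinished sync is the stall this bug describes.
is_stuck() {
    if [ "$1" = "$2" ] && [ "$2" != "100.00" ]; then
        echo stuck
    else
        echo progressing
    fi
}
```

With the repeated 8.82% samples above, `is_stuck 8.82 8.82` prints `stuck`.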



Version-Release number of selected component (if applicable):
2.6.32-251.el6.x86_64
lvm2-2.02.95-4.el6    BUILT: Wed Apr 11 09:03:19 CDT 2012
lvm2-libs-2.02.95-4.el6    BUILT: Wed Apr 11 09:03:19 CDT 2012
lvm2-cluster-2.02.95-4.el6    BUILT: Wed Apr 11 09:03:19 CDT 2012
udev-147-2.40.el6    BUILT: Fri Sep 23 07:51:13 CDT 2011
device-mapper-1.02.74-4.el6    BUILT: Wed Apr 11 09:03:19 CDT 2012
device-mapper-libs-1.02.74-4.el6    BUILT: Wed Apr 11 09:03:19 CDT 2012
device-mapper-event-1.02.74-4.el6    BUILT: Wed Apr 11 09:03:19 CDT 2012
device-mapper-event-libs-1.02.74-4.el6    BUILT: Wed Apr 11 09:03:19 CDT 2012
cmirror-2.02.95-4.el6    BUILT: Wed Apr 11 09:03:19 CDT 2012

How reproducible:
Every time

Comment 1 Jonathan Earl Brassow 2012-04-12 13:46:11 UTC
It isn't limited to snapshots or to RAID1. This bug affects any RAID type and is induced by the suspend/resume cycle (which happens to occur during snapshot creation).

Comment 3 RHEL Program Management 2012-04-18 19:30:00 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 4 Jonathan Earl Brassow 2012-04-18 21:44:03 UTC
Before patch (Testing RAID1, then RAID5):
[root@bp-01 ~]# lvcreate --type raid1 -m2 -L 500M -n lv vg; sleep 1; dmsetup suspend vg-lv; dmsetup resume vg-lv ; dmsetup status vg-lv; sleep 30; dmsetup status vg-lv
  Logical volume "lv" created
0 1024000 raid raid1 3 aaa 4096/1024000
0 1024000 raid raid1 3 aaa 4096/1024000

[root@bp-01 ~]# lvcreate --type raid5 -i3 -L 500M -n lv vg; sleep 1; dmsetup suspend vg-lv; dmsetup resume vg-lv ; dmsetup status vg-lv; sleep 30; dmsetup status vg-lv
  Using default stripesize 64.00 KiB
  Rounding size (125 extents) up to stripe boundary size (126 extents)
  Logical volume "lv" created
0 1032192 raid raid5_ls 4 aaaa 23352/344064
0 1032192 raid raid5_ls 4 aaaa 23352/344064


After patch (Testing RAID1, then RAID5):
[root@bp-01 ~]# lvcreate --type raid1 -m2 -L 500M -n lv vg; sleep 1; dmsetup suspend vg-lv; dmsetup resume vg-lv ; dmsetup status vg-lv; sleep 30; dmsetup status vg-lv
  Logical volume "lv" created
0 1024000 raid raid1 3 aaa 0/1024000
0 1024000 raid raid1 3 AAA 1024000/1024000

[root@bp-01 ~]# lvcreate --type raid5 -i3 -L 500M -n lv vg; sleep 1; dmsetup suspend vg-lv; dmsetup resume vg-lv ; dmsetup status vg-lv; sleep 30; dmsetup status vg-lv
  Using default stripesize 64.00 KiB
  Rounding size (125 extents) up to stripe boundary size (126 extents)
  Logical volume "lv" created
0 1032192 raid raid5_ls 4 aaaa 22528/344064
0 1032192 raid raid5_ls 4 AAAA 344064/344064
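For reference, the `dmsetup status` lines above follow the dm-raid target's layout: start, length, "raid", raid type, device count, one health character per device ('A' = alive and in-sync, 'a' = alive but not yet in-sync, 'D' = dead), then synced/total sectors. A small sketch that pulls out the health string and sync percentage from one such line (the `raid_sync_pct` helper is hypothetical):

```shell
#!/bin/sh
# Parse one dm-raid "dmsetup status" line: field 6 is the per-device
# health string, field 7 is the "synced/total" sector ratio.
raid_sync_pct() {
    echo "$1" | awk '{ split($7, r, "/"); printf "%s %.0f%%\n", $6, r[1] * 100 / r[2] }'
}
```

For example, `raid_sync_pct '0 1024000 raid raid1 3 AAA 1024000/1024000'` prints `AAA 100%`, matching the fully recovered post-patch state above.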

Comment 6 Jarod Wilson 2012-05-02 16:19:57 UTC
Patch(es) available on kernel-2.6.32-269.el6

Comment 9 Corey Marthaler 2012-05-02 19:20:28 UTC
The raid + snapshot failure cases now work with the latest kernel.

2.6.32-269.el6.x86_64
lvm2-2.02.95-7.el6    BUILT: Wed May  2 05:14:03 CDT 2012
lvm2-libs-2.02.95-7.el6    BUILT: Wed May  2 05:14:03 CDT 2012
lvm2-cluster-2.02.95-7.el6    BUILT: Wed May  2 05:14:03 CDT 2012
udev-147-2.41.el6    BUILT: Thu Mar  1 13:01:08 CST 2012
device-mapper-1.02.74-7.el6    BUILT: Wed May  2 05:14:03 CDT 2012
device-mapper-libs-1.02.74-7.el6    BUILT: Wed May  2 05:14:03 CDT 2012
device-mapper-event-1.02.74-7.el6    BUILT: Wed May  2 05:14:03 CDT 2012
device-mapper-event-libs-1.02.74-7.el6    BUILT: Wed May  2 05:14:03 CDT 2012
cmirror-2.02.95-7.el6    BUILT: Wed May  2 05:14:03 CDT 2012

Comment 11 errata-xmlrpc 2012-06-20 08:46:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0862.html