Bug 811669

Summary: Suspend/resume of an out-of-sync RAID LV will cause the sync process to stall
Product: Red Hat Enterprise Linux 6
Reporter: Corey Marthaler <cmarthal>
Component: kernel
Assignee: Jonathan Earl Brassow <jbrassow>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: high
Docs Contact:
Priority: high
Version: 6.3
CC: agk, dwysocha, heinzm, jbrassow, mbroz, msnitzer, prajnoha, prockai, thornber, zkabelac
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: kernel-2.6.32-269.el6
Doc Type: Bug Fix
Last Closed: 2012-06-20 08:46:28 UTC
Type: Bug
Bug Blocks: 739162    

Description Corey Marthaler 2012-04-11 16:51:52 UTC
Description of problem:
Scenario kill_primary_synced_raid1_2legs: Kill primary leg of synced 2 leg raid1 volume(s)

********* RAID hash info for this scenario *********
* names:              synced_primary_raid1_2legs_1
* sync:               1
* type:               raid1
* -m |-i value:       2
* leg devices:        /dev/sde1 /dev/sdd1 /dev/sdc1
* failpv(s):          /dev/sde1
* failnode(s):        taft-01
* additional snap:    /dev/sdd1
* raid fault policy:   allocate
******************************************************

Creating raids(s) on taft-01...
taft-01: lvcreate --type raid1 -m 2 -n synced_primary_raid1_2legs_1 -L 500M black_bird /dev/sde1:0-1000 /dev/sdd1:0-1000 /dev/sdc1:0-1000

Creating a snapshot volume of each of the raids

RAID Structure(s):
  LV                                      Attr     LSize   Copy%  Devices
  bb_snap1                                swi-a-s- 252.00m        /dev/sdd1(126)
  synced_primary_raid1_2legs_1            owi-a-m- 500.00m   8.80 synced_primary_raid1_2legs_1_rimage_0(0),synced_primary_raid1_2legs_1_rimage_1(0),synced_primary_raid1_2legs_1_rimage_2(0)
  [synced_primary_raid1_2legs_1_rimage_0] Iwi-aor- 500.00m        /dev/sde1(1)
  [synced_primary_raid1_2legs_1_rimage_1] Iwi-aor- 500.00m        /dev/sdd1(1)
  [synced_primary_raid1_2legs_1_rimage_2] Iwi-aor- 500.00m        /dev/sdc1(1)
  [synced_primary_raid1_2legs_1_rmeta_0]  ewi-aor-   4.00m        /dev/sde1(0)
  [synced_primary_raid1_2legs_1_rmeta_1]  ewi-aor-   4.00m        /dev/sdd1(0)
  [synced_primary_raid1_2legs_1_rmeta_2]  ewi-aor-   4.00m        /dev/sdc1(0)

Waiting until all mirror|raid volumes become fully syncd...
   0/1 mirror(s) are fully synced: ( 8.82% )
   0/1 mirror(s) are fully synced: ( 8.82% )
   0/1 mirror(s) are fully synced: ( 8.82% )
   0/1 mirror(s) are fully synced: ( 8.82% )

# SYNC IS STUCK
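
The stall seen above (the same Copy% on every poll) can be checked mechanically. The following is a minimal sketch, not part of the test harness: the function name sync_stalled is hypothetical, and it assumes the caller has already collected the Copy% samples as plain strings, oldest first.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) when the sync looks stalled.
# Arguments: two or more Copy% samples, oldest first (e.g. 8.82 8.82 8.82).
sync_stalled() {
    # A single sample cannot show a stall.
    [ "$#" -ge 2 ] || return 1
    ss_first=$1
    for ss_sample in "$@"; do
        # Any differing sample means the sync made progress.
        [ "$ss_sample" = "$ss_first" ] || return 1
    done
    # Identical samples at 100% mean the sync finished, not stalled.
    case $ss_first in
        100|100.0|100.00) return 1 ;;
    esac
    return 0
}
```

With the four samples logged above, `sync_stalled 8.82 8.82 8.82 8.82` succeeds. On a live system the samples could presumably be gathered with something like `lvs --noheadings -o copy_percent <vg>/<lv>` at a fixed interval; that collection step is an assumption and is not shown in this report.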



Version-Release number of selected component (if applicable):
2.6.32-251.el6.x86_64
lvm2-2.02.95-4.el6    BUILT: Wed Apr 11 09:03:19 CDT 2012
lvm2-libs-2.02.95-4.el6    BUILT: Wed Apr 11 09:03:19 CDT 2012
lvm2-cluster-2.02.95-4.el6    BUILT: Wed Apr 11 09:03:19 CDT 2012
udev-147-2.40.el6    BUILT: Fri Sep 23 07:51:13 CDT 2011
device-mapper-1.02.74-4.el6    BUILT: Wed Apr 11 09:03:19 CDT 2012
device-mapper-libs-1.02.74-4.el6    BUILT: Wed Apr 11 09:03:19 CDT 2012
device-mapper-event-1.02.74-4.el6    BUILT: Wed Apr 11 09:03:19 CDT 2012
device-mapper-event-libs-1.02.74-4.el6    BUILT: Wed Apr 11 09:03:19 CDT 2012
cmirror-2.02.95-4.el6    BUILT: Wed Apr 11 09:03:19 CDT 2012

How reproducible:
Every time

Comment 1 Jonathan Earl Brassow 2012-04-12 13:46:11 UTC
This is not limited to snapshots or to RAID1. The bug affects any RAID type and is triggered by the suspend/resume cycle, which also happens to occur when a snapshot is created.

Comment 3 RHEL Program Management 2012-04-18 19:30:00 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 4 Jonathan Earl Brassow 2012-04-18 21:44:03 UTC
Before patch (Testing RAID1, then RAID5):
[root@bp-01 ~]# lvcreate --type raid1 -m2 -L 500M -n lv vg; sleep 1; dmsetup suspend vg-lv; dmsetup resume vg-lv ; dmsetup status vg-lv; sleep 30; dmsetup status vg-lv
  Logical volume "lv" created
0 1024000 raid raid1 3 aaa 4096/1024000
0 1024000 raid raid1 3 aaa 4096/1024000

[root@bp-01 ~]# lvcreate --type raid5 -i3 -L 500M -n lv vg; sleep 1; dmsetup suspend vg-lv; dmsetup resume vg-lv ; dmsetup status vg-lv; sleep 30; dmsetup status vg-lv
  Using default stripesize 64.00 KiB
  Rounding size (125 extents) up to stripe boundary size (126 extents)
  Logical volume "lv" created
0 1032192 raid raid5_ls 4 aaaa 23352/344064
0 1032192 raid raid5_ls 4 aaaa 23352/344064


After patch (Testing RAID1, then RAID5):
[root@bp-01 ~]# lvcreate --type raid1 -m2 -L 500M -n lv vg; sleep 1; dmsetup suspend vg-lv; dmsetup resume vg-lv ; dmsetup status vg-lv; sleep 30; dmsetup status vg-lv
  Logical volume "lv" created
0 1024000 raid raid1 3 aaa 0/1024000
0 1024000 raid raid1 3 AAA 1024000/1024000

[root@bp-01 ~]# lvcreate --type raid5 -i3 -L 500M -n lv vg; sleep 1; dmsetup suspend vg-lv; dmsetup resume vg-lv ; dmsetup status vg-lv; sleep 30; dmsetup status vg-lv
  Using default stripesize 64.00 KiB
  Rounding size (125 extents) up to stripe boundary size (126 extents)
  Logical volume "lv" created
0 1032192 raid raid5_ls 4 aaaa 22528/344064
0 1032192 raid raid5_ls 4 AAAA 344064/344064
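
For reading the status lines above: in `dmsetup status` output for the raid target, the sixth field is one health character per device ('a' is alive but not in sync, 'A' is alive and in sync) and the seventh field is the sync ratio. The sketch below compares two such lines to decide whether the resync advanced. The function names are hypothetical and the field positions are taken from the lines shown in this comment.

```shell
#!/bin/sh
# Hypothetical helper: print the sync ratio (field 7, "done/total")
# from one dm-raid status line passed as a single argument.
raid_sync_ratio() {
    # Word-split the status line into positional parameters.
    set -- $1
    # Field 3 is the target type; anything but "raid" is rejected.
    [ "$3" = "raid" ] || return 1
    printf '%s\n' "$7"
}

# Hypothetical helper: succeed when the second status line shows a
# larger "done" count than the first. Returns 2 on unparsable input.
raid_sync_advanced() {
    rsa_before=$(raid_sync_ratio "$1") || return 2
    rsa_after=$(raid_sync_ratio "$2") || return 2
    # Compare only the part before the slash.
    [ "${rsa_after%%/*}" -gt "${rsa_before%%/*}" ]
}
```

Applied to the pairs above, the "before patch" RAID1 lines (4096/1024000 twice) do not count as advanced, while the "after patch" lines (0/1024000, then 1024000/1024000) do.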

Comment 6 Jarod Wilson 2012-05-02 16:19:57 UTC
Patch(es) available on kernel-2.6.32-269.el6

Comment 9 Corey Marthaler 2012-05-02 19:20:28 UTC
The raid + snapshot failure cases now work with the latest kernel.

2.6.32-269.el6.x86_64
lvm2-2.02.95-7.el6    BUILT: Wed May  2 05:14:03 CDT 2012
lvm2-libs-2.02.95-7.el6    BUILT: Wed May  2 05:14:03 CDT 2012
lvm2-cluster-2.02.95-7.el6    BUILT: Wed May  2 05:14:03 CDT 2012
udev-147-2.41.el6    BUILT: Thu Mar  1 13:01:08 CST 2012
device-mapper-1.02.74-7.el6    BUILT: Wed May  2 05:14:03 CDT 2012
device-mapper-libs-1.02.74-7.el6    BUILT: Wed May  2 05:14:03 CDT 2012
device-mapper-event-1.02.74-7.el6    BUILT: Wed May  2 05:14:03 CDT 2012
device-mapper-event-libs-1.02.74-7.el6    BUILT: Wed May  2 05:14:03 CDT 2012
cmirror-2.02.95-7.el6    BUILT: Wed May  2 05:14:03 CDT 2012

Comment 11 errata-xmlrpc 2012-06-20 08:46:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0862.html