Bug 789421

Summary: RAID4 device failure causes mdX_resync to block
Product: Red Hat Enterprise Linux 6 Reporter: Corey Marthaler <cmarthal>
Component: lvm2 Assignee: Jonathan Earl Brassow <jbrassow>
Status: CLOSED ERRATA QA Contact: Cluster QE <mspqa-list>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 6.3 CC: agk, djansa, dwysocha, heinzm, jbrassow, mbroz, prajnoha, prockai, thornber, zkabelac
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: lvm2-2.02.95-1.el6 Doc Type: Bug Fix
Doc Text:
New Feature to 6.3. No documentation required. Bug 732458 is the bug that requires a release note for the RAID features. Other documentation is found in the LVM manual. Operational bugs need no documentation because they are being fixed before their initial release.
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-06-20 15:01:13 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Corey Marthaler 2012-02-10 18:40:22 UTC
Description of problem:
This may be related to bug 789408.

Scenario kill_primary_synced_raid4_3legs: Kill primary leg of synced 3 leg raid4 volume(s)

********* RAID hash info for this scenario *********
* names:              synced_primary_raid4_3legs_1
* sync:               1
* type:               raid4
* -m or -i value:     3
* leg devices:        /dev/sdg1 /dev/sdd1 /dev/sdh1 /dev/sdf1
* failpv(s):          /dev/sdg1
* failnode(s):        taft-01
* raid fault policy:   warn
******************************************************

Creating raids(s) on taft-01...
taft-01: lvcreate --type raid4 -i 3 -n synced_primary_raid4_3legs_1 -L 500M black_bird /dev/sdg1:0-1000 /dev/sdd1:0-1000 /dev/sdh1:0-1000 /dev/sdf1:0-1000

RAID Structure(s):
 LV                                      Attr     LSize     Devices
 synced_primary_raid4_3legs_1            rwi-a-r- 504.00m   synced_primary_raid4_3legs_1_rimage_0(0),synced_primary_raid4_3legs_1_rimage_1(0),synced_primary_raid4_3legs_1_rimage_2(0),synced_primary_raid4_3legs_1_rimage_3(0)
 [synced_primary_raid4_3legs_1_rimage_0] Iwi-aor- 168.00m   /dev/sdg1(1)
 [synced_primary_raid4_3legs_1_rimage_1] Iwi-aor- 168.00m   /dev/sdd1(1)
 [synced_primary_raid4_3legs_1_rimage_2] Iwi-aor- 168.00m   /dev/sdh1(1)
 [synced_primary_raid4_3legs_1_rimage_3] Iwi-aor- 168.00m   /dev/sdf1(1)
 [synced_primary_raid4_3legs_1_rmeta_0]  ewi-aor-   4.00m   /dev/sdg1(0)
 [synced_primary_raid4_3legs_1_rmeta_1]  ewi-aor-   4.00m   /dev/sdd1(0)
 [synced_primary_raid4_3legs_1_rmeta_2]  ewi-aor-   4.00m   /dev/sdh1(0)
 [synced_primary_raid4_3legs_1_rmeta_3]  ewi-aor-   4.00m   /dev/sdf1(0)
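
Note on the sizes above: the 500M request shows up as a 504.00m LV with 168.00m per leg. A minimal sketch of that rounding follows, assuming a 4 MiB extent size (consistent with the 4.00m rmeta sub-LVs, but not stated explicitly in this report). The function name raid4_layout is hypothetical and is not part of LVM.

```python
import math

# Assumption: 4 MiB extents, inferred from the 4.00m rmeta sub-LVs.
EXTENT_MIB = 4


def raid4_layout(requested_mib, data_stripes, extent_mib=EXTENT_MIB):
    """Round a requested size up so it divides evenly across the data stripes."""
    extents = math.ceil(requested_mib / extent_mib)
    # Every leg gets the same number of extents, so round up per leg.
    per_leg = math.ceil(extents / data_stripes)
    return {
        "lv_mib": per_leg * data_stripes * extent_mib,
        "leg_mib": per_leg * extent_mib,
        "legs": data_stripes + 1,  # raid4 adds one parity device
    }


layout = raid4_layout(500, 3)
print(layout)
```

With `-i 3` and `-L 500M` this gives 42 extents per leg, which matches the four 168.00m rimage sub-LVs and the 504.00m top-level LV listed above.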

PV=/dev/sdg1
        synced_primary_raid4_3legs_1_rimage_0: 2
        synced_primary_raid4_3legs_1_rmeta_0: 2

Disabling device sdg on taft-01

Attempting I/O to cause mirror down conversion(s) on taft-01
[DEADLOCK]


qarshd[5787]: Running cmdline: echo offline > /sys/block/sdg/device/state &
lvm[1256]: Device #0 of raid4 array, black_bird-synced_primary_raid4_3legs_1, has failed.
kernel: md/raid:mdX: Disk failure on dm-3, disabling device.
kernel: md/raid:mdX: Operation continuing on 3 devices.
kernel: md/raid:mdX: read error not correctable (sector 126760 on dm-3).
[...]
kernel: md/raid:mdX: read error not correctable (sector 126832 on dm-3).
kernel: md: mdX: resync done.
lvm[1256]: /dev/sdg1: read failed after 0 of 512 at 145669554176: Input/output error
[...]
lvm[1256]: /dev/sdg1: read failed after 0 of 2048 at 0: Input/output error
lvm[1256]: Couldn't find device with uuid 403agt-g0GQ-LPZ0-zcYq-3PTc-3R6A-efKAfT.
qarshd[5790]: Running cmdline: pvs -a
qarshd[5792]: Running cmdline: dd if=/dev/zero of=/dev/black_bird/synced_primary_raid4_3legs_1 count=1
kernel: INFO: task mdX_resync:5760 blocked for more than 120 seconds.
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: mdX_resync    D 0000000000000002     0  5760      2 0x00000080
kernel: ffff88021729dcd0 0000000000000046 0000000000000000 ffff880217fa5c00
kernel: ffff880217fa5e20 0000000000000286 ffff880217fa5d28 ffff8802175e1028
kernel: ffff880218c89038 ffff88021729dfd8 000000000000f4e8 ffff880218c89038
kernel: Call Trace:
kernel: [<ffffffff813eaf52>] md_do_sync+0xaf2/0xbe0
kernel: [<ffffffff81090bf0>] ? autoremove_wake_function+0x0/0x40
kernel: [<ffffffff813eb2d6>] md_thread+0x116/0x150
kernel: [<ffffffff813eb1c0>] ? md_thread+0x0/0x150
kernel: [<ffffffff81090886>] kthread+0x96/0xa0
kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
kernel: [<ffffffff810907f0>] ? kthread+0x0/0xa0
kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
kernel: INFO: task dd:5793 blocked for more than 120 seconds.
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: dd            D 0000000000000001     0  5793   5792 0x00000080
kernel: ffff880216a57bf8 0000000000000086 0000000000000000 0000000000000001
kernel: 0000000000008460 ffff880216a57c88 ffff880216a57d00 0000000000000286
kernel: ffff880216d89a78 ffff880216a57fd8 000000000000f4e8 ffff880216d89a78
kernel: Call Trace:
kernel: [<ffffffff81110b10>] ? sync_page+0x0/0x50
kernel: [<ffffffff814ed1e3>] io_schedule+0x73/0xc0
kernel: [<ffffffff81110b4d>] sync_page+0x3d/0x50
kernel: [<ffffffff814edb9f>] __wait_on_bit+0x5f/0x90
kernel: [<ffffffff81110d03>] wait_on_page_bit+0x73/0x80
kernel: [<ffffffff81090c30>] ? wake_bit_function+0x0/0x50
kernel: [<ffffffff811271a5>] ? pagevec_lookup_tag+0x25/0x40
kernel: [<ffffffff8111111b>] wait_on_page_writeback_range+0xfb/0x190
kernel: [<ffffffff81126324>] ? generic_writepages+0x24/0x30
kernel: [<ffffffff81126351>] ? do_writepages+0x21/0x40
kernel: [<ffffffff8111126b>] ? __filemap_fdatawrite_range+0x5b/0x60
kernel: [<ffffffff811111df>] filemap_fdatawait+0x2f/0x40
kernel: [<ffffffff811117c4>] filemap_write_and_wait+0x44/0x60
kernel: [<ffffffff811afa74>] __sync_blockdev+0x24/0x50
kernel: [<ffffffff811afab3>] sync_blockdev+0x13/0x20
kernel: [<ffffffff811afb68>] __blkdev_put+0xa8/0x190
kernel: [<ffffffff811afc60>] blkdev_put+0x10/0x20
kernel: [<ffffffff811afca3>] blkdev_close+0x33/0x60
kernel: [<ffffffff81177e85>] __fput+0xf5/0x210
kernel: [<ffffffff81177fc5>] fput+0x25/0x30
kernel: [<ffffffff81173a0d>] filp_close+0x5d/0x90
kernel: [<ffffffff81173ae5>] sys_close+0xa5/0x100
kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b


[root@taft-01 ~]# dmsetup status
black_bird-synced_primary_raid4_3legs_1_rimage_1: 0 344064 linear 
black_bird-synced_primary_raid4_3legs_1: 0 1032192 raid raid4 4 DAAA 107560/344064
black_bird-synced_primary_raid4_3legs_1_rimage_0: 0 344064 linear 
black_bird-synced_primary_raid4_3legs_1_rmeta_3: 0 8192 linear 
black_bird-synced_primary_raid4_3legs_1_rmeta_2: 0 8192 linear 
black_bird-synced_primary_raid4_3legs_1_rmeta_1: 0 8192 linear 
black_bird-synced_primary_raid4_3legs_1_rmeta_0: 0 8192 linear 
black_bird-synced_primary_raid4_3legs_1_rimage_3: 0 344064 linear 
black_bird-synced_primary_raid4_3legs_1_rimage_2: 0 344064 linear 
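
For reading the raid line in the status output above, here is a minimal sketch of a parser. It assumes the dm-raid status layout of this era (raid type, device count, one health character per device, then a sync ratio), where "A" means alive and in sync, "a" means alive but not in sync, and "D" means dead or failed. The names RaidStatus and parse_raid_status are hypothetical helpers, not part of dmsetup or LVM.

```python
from dataclasses import dataclass


@dataclass
class RaidStatus:
    name: str
    raid_type: str
    health: str
    synced: int
    total: int

    @property
    def failed_devices(self):
        """Indexes of devices whose health character is 'D'."""
        return [i for i, c in enumerate(self.health) if c == "D"]


def parse_raid_status(line):
    """Parse one 'dmsetup status' line; return None for non-raid targets."""
    name, rest = line.split(":", 1)
    fields = rest.split()
    if len(fields) < 7 or fields[2] != "raid":
        return None
    raid_type, ndevs, health, ratio = fields[3:7]
    if len(health) != int(ndevs):
        raise ValueError("health string does not match device count")
    synced, total = (int(x) for x in ratio.split("/"))
    return RaidStatus(name.strip(), raid_type, health, synced, total)


line = ("black_bird-synced_primary_raid4_3legs_1: "
        "0 1032192 raid raid4 4 DAAA 107560/344064")
status = parse_raid_status(line)
print(status.failed_devices, status.synced, status.total)
```

Applied to the line from this report, the sketch flags device 0 as failed (matching "Device #0 of raid4 array ... has failed" in the log) and shows the sync counter at 107560 of 344064.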

[root@taft-01 ~]# dmsetup table
black_bird-synced_primary_raid4_3legs_1_rimage_1: 0 344064 linear 8:49 10240
black_bird-synced_primary_raid4_3legs_1: 0 1032192 raid raid4 3 128 region_size 1024 4 253:2 253:3 253:4 253:5 253:6 253:7 253:8 253:9
black_bird-synced_primary_raid4_3legs_1_rimage_0: 0 344064 linear 8:97 10240
black_bird-synced_primary_raid4_3legs_1_rmeta_3: 0 8192 linear 8:81 2048
black_bird-synced_primary_raid4_3legs_1_rmeta_2: 0 8192 linear 8:113 2048
black_bird-synced_primary_raid4_3legs_1_rmeta_1: 0 8192 linear 8:49 2048
black_bird-synced_primary_raid4_3legs_1_rmeta_0: 0 8192 linear 8:97 2048
black_bird-synced_primary_raid4_3legs_1_rimage_3: 0 344064 linear 8:81 10240
black_bird-synced_primary_raid4_3legs_1_rimage_2: 0 344064 linear 8:113 10240
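
The raid table line above can be decoded the same way. This is a sketch only, assuming the dm-raid table layout of raid type, parameter count, parameters, device count, then metadata/data device pairs. The helper name parse_raid_table is hypothetical.

```python
def parse_raid_table(line):
    """Split a dm-raid 'dmsetup table' line into its parameters and device pairs."""
    name, rest = line.split(":", 1)
    fields = rest.split()
    if fields[2] != "raid":
        raise ValueError("not a raid target")
    nparams = int(fields[4])
    params = fields[5:5 + nparams]
    ndevs = int(fields[5 + nparams])
    devs = fields[6 + nparams:6 + nparams + 2 * ndevs]
    # Devices are listed as alternating metadata and data devices.
    pairs = list(zip(devs[0::2], devs[1::2]))
    return {
        "name": name.strip(),
        "length": int(fields[1]),
        "raid_type": fields[3],
        "params": params,
        "pairs": pairs,
    }


line = ("black_bird-synced_primary_raid4_3legs_1: 0 1032192 raid raid4 "
        "3 128 region_size 1024 4 "
        "253:2 253:3 253:4 253:5 253:6 253:7 253:8 253:9")
table = parse_raid_table(line)
for index, (meta, data) in enumerate(table["pairs"]):
    print(index, meta, data)

# Usable length is three data stripes of 344064 sectors each (one leg holds parity).
assert table["length"] == 3 * 344064
```

Under that reading, device 0 pairs metadata 253:2 with data 253:3. If 253 is the device-mapper major on this host (an assumption), 253:3 is dm-3, the device named in the kernel's "Disk failure on dm-3" message.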


Version:
2.6.32-220.el6.x86_64

lvm2-2.02.90-0.25.el6    BUILT: Sat Jan 28 18:03:08 CST 2012
lvm2-libs-2.02.90-0.25.el6    BUILT: Sat Jan 28 18:03:08 CST 2012
lvm2-cluster-2.02.90-0.25.el6    BUILT: Sat Jan 28 18:03:08 CST 2012
udev-147-2.40.el6    BUILT: Fri Sep 23 07:51:13 CDT 2011
device-mapper-1.02.69-0.25.el6    BUILT: Sat Jan 28 18:03:08 CST 2012
device-mapper-libs-1.02.69-0.25.el6    BUILT: Sat Jan 28 18:03:08 CST 2012
device-mapper-event-1.02.69-0.25.el6    BUILT: Sat Jan 28 18:03:08 CST 2012
device-mapper-event-libs-1.02.69-0.25.el6    BUILT: Sat Jan 28 18:03:08 CST 2012
cmirror-2.02.90-0.25.el6    BUILT: Sat Jan 28 18:03:08 CST 2012

Comment 1 Corey Marthaler 2012-02-10 21:35:26 UTC
This too is reproducible.

Comment 2 Jonathan Earl Brassow 2012-02-20 19:33:09 UTC
Fixed by the latest RHEL 6 kernel (2.6.32-236).

Comment 4 Corey Marthaler 2012-02-20 23:57:11 UTC
Verified fixed in the latest kernel + scratch lvm builds.

2.6.32-236.el6.x86_64

lvm2-2.02.92-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012
lvm2-libs-2.02.92-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012
lvm2-cluster-2.02.92-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012
udev-147-2.40.el6    BUILT: Fri Sep 23 07:51:13 CDT 2011
device-mapper-1.02.71-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012
device-mapper-libs-1.02.71-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012
device-mapper-event-1.02.71-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012
device-mapper-event-libs-1.02.71-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012
cmirror-2.02.92-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012

Comment 7 Jonathan Earl Brassow 2012-04-23 18:29:02 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
New Feature to 6.3.  No documentation required.

Bug 732458 is the bug that requires a release note for the RAID features.  Other documentation is found in the LVM manual.

Operational bugs need no documentation because they are being fixed before their initial release.

Comment 9 errata-xmlrpc 2012-06-20 15:01:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0962.html