
Bug 789421

Summary: RAID4 device failure causes mdX_resync to block
Product: Red Hat Enterprise Linux 6 Reporter: Corey Marthaler <cmarthal>
Component: lvm2 Assignee: Jonathan Earl Brassow <jbrassow>
Status: CLOSED ERRATA QA Contact: Cluster QE <mspqa-list>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 6.3 CC: agk, djansa, dwysocha, heinzm, jbrassow, mbroz, prajnoha, prockai, thornber, zkabelac
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: lvm2-2.02.95-1.el6 Doc Type: Bug Fix
Doc Text:
New Feature to 6.3. No documentation required. Bug 732458 is the bug that requires a release note for the RAID features. Other documentation is found in the LVM manual. Operational bugs need no documentation because they are being fixed before their initial release.
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-06-20 15:01:13 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Corey Marthaler 2012-02-10 18:40:22 UTC
Description of problem:
This may be related to bug 789408.

Scenario kill_primary_synced_raid4_3legs: Kill primary leg of synced 3 leg raid4 volume(s)

********* RAID hash info for this scenario *********
* names:              synced_primary_raid4_3legs_1
* sync:               1
* type:               raid4
* -m or -i value:     3
* leg devices:        /dev/sdg1 /dev/sdd1 /dev/sdh1 /dev/sdf1
* failpv(s):          /dev/sdg1
* failnode(s):        taft-01
* raid fault policy:   warn
******************************************************

Creating raids(s) on taft-01...
taft-01: lvcreate --type raid4 -i 3 -n synced_primary_raid4_3legs_1 -L 500M black_bird /dev/sdg1:0-1000 /dev/sdd1:0-1000 /dev/sdh1:0-1000 /dev/sdf1:0-1000

RAID Structure(s):
 LV                                      Attr     LSize     Devices
 synced_primary_raid4_3legs_1            rwi-a-r- 504.00m   synced_primary_raid4_3legs_1_rimage_0(0),synced_primary_raid4_3legs_1_rimage_1(0),synced_primary_raid4_3legs_1_rimage_2(0),synced_primary_raid4_3legs_1_rimage_3(0)
 [synced_primary_raid4_3legs_1_rimage_0] Iwi-aor- 168.00m   /dev/sdg1(1)
 [synced_primary_raid4_3legs_1_rimage_1] Iwi-aor- 168.00m   /dev/sdd1(1)
 [synced_primary_raid4_3legs_1_rimage_2] Iwi-aor- 168.00m   /dev/sdh1(1)
 [synced_primary_raid4_3legs_1_rimage_3] Iwi-aor- 168.00m   /dev/sdf1(1)
 [synced_primary_raid4_3legs_1_rmeta_0]  ewi-aor-   4.00m   /dev/sdg1(0)
 [synced_primary_raid4_3legs_1_rmeta_1]  ewi-aor-   4.00m   /dev/sdd1(0)
 [synced_primary_raid4_3legs_1_rmeta_2]  ewi-aor-   4.00m   /dev/sdh1(0)
 [synced_primary_raid4_3legs_1_rmeta_3]  ewi-aor-   4.00m   /dev/sdf1(0)

PV=/dev/sdg1
        synced_primary_raid4_3legs_1_rimage_0: 2
        synced_primary_raid4_3legs_1_rmeta_0: 2

Disabling device sdg on taft-01

Attempting I/O to cause mirror down conversion(s) on taft-01
[DEADLOCK]


qarshd[5787]: Running cmdline: echo offline > /sys/block/sdg/device/state &
lvm[1256]: Device #0 of raid4 array, black_bird-synced_primary_raid4_3legs_1, has failed.
kernel: md/raid:mdX: Disk failure on dm-3, disabling device.
kernel: md/raid:mdX: Operation continuing on 3 devices.
kernel: md/raid:mdX: read error not correctable (sector 126760 on dm-3).
[...]
kernel: md/raid:mdX: read error not correctable (sector 126832 on dm-3).
kernel: md: mdX: resync done.
lvm[1256]: /dev/sdg1: read failed after 0 of 512 at 145669554176: Input/output error
[...]
lvm[1256]: /dev/sdg1: read failed after 0 of 2048 at 0: Input/output error
lvm[1256]: Couldn't find device with uuid 403agt-g0GQ-LPZ0-zcYq-3PTc-3R6A-efKAfT.
qarshd[5790]: Running cmdline: pvs -a
qarshd[5792]: Running cmdline: dd if=/dev/zero of=/dev/black_bird/synced_primary_raid4_3legs_1 count=1
kernel: INFO: task mdX_resync:5760 blocked for more than 120 seconds.
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: mdX_resync    D 0000000000000002     0  5760      2 0x00000080
kernel: ffff88021729dcd0 0000000000000046 0000000000000000 ffff880217fa5c00
kernel: ffff880217fa5e20 0000000000000286 ffff880217fa5d28 ffff8802175e1028
kernel: ffff880218c89038 ffff88021729dfd8 000000000000f4e8 ffff880218c89038
kernel: Call Trace:
kernel: [<ffffffff813eaf52>] md_do_sync+0xaf2/0xbe0
kernel: [<ffffffff81090bf0>] ? autoremove_wake_function+0x0/0x40
kernel: [<ffffffff813eb2d6>] md_thread+0x116/0x150
kernel: [<ffffffff813eb1c0>] ? md_thread+0x0/0x150
kernel: [<ffffffff81090886>] kthread+0x96/0xa0
kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
kernel: [<ffffffff810907f0>] ? kthread+0x0/0xa0
kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
kernel: INFO: task dd:5793 blocked for more than 120 seconds.
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: dd            D 0000000000000001     0  5793   5792 0x00000080
kernel: ffff880216a57bf8 0000000000000086 0000000000000000 0000000000000001
kernel: 0000000000008460 ffff880216a57c88 ffff880216a57d00 0000000000000286
kernel: ffff880216d89a78 ffff880216a57fd8 000000000000f4e8 ffff880216d89a78
kernel: Call Trace:
kernel: [<ffffffff81110b10>] ? sync_page+0x0/0x50
kernel: [<ffffffff814ed1e3>] io_schedule+0x73/0xc0
kernel: [<ffffffff81110b4d>] sync_page+0x3d/0x50
kernel: [<ffffffff814edb9f>] __wait_on_bit+0x5f/0x90
kernel: [<ffffffff81110d03>] wait_on_page_bit+0x73/0x80
kernel: [<ffffffff81090c30>] ? wake_bit_function+0x0/0x50
kernel: [<ffffffff811271a5>] ? pagevec_lookup_tag+0x25/0x40
kernel: [<ffffffff8111111b>] wait_on_page_writeback_range+0xfb/0x190
kernel: [<ffffffff81126324>] ? generic_writepages+0x24/0x30
kernel: [<ffffffff81126351>] ? do_writepages+0x21/0x40
kernel: [<ffffffff8111126b>] ? __filemap_fdatawrite_range+0x5b/0x60
kernel: [<ffffffff811111df>] filemap_fdatawait+0x2f/0x40
kernel: [<ffffffff811117c4>] filemap_write_and_wait+0x44/0x60
kernel: [<ffffffff811afa74>] __sync_blockdev+0x24/0x50
kernel: [<ffffffff811afab3>] sync_blockdev+0x13/0x20
kernel: [<ffffffff811afb68>] __blkdev_put+0xa8/0x190
kernel: [<ffffffff811afc60>] blkdev_put+0x10/0x20
kernel: [<ffffffff811afca3>] blkdev_close+0x33/0x60
kernel: [<ffffffff81177e85>] __fput+0xf5/0x210
kernel: [<ffffffff81177fc5>] fput+0x25/0x30
kernel: [<ffffffff81173a0d>] filp_close+0x5d/0x90
kernel: [<ffffffff81173ae5>] sys_close+0xa5/0x100
kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
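
The hung-task warnings above name each blocked task and its PID. When scanning a longer log for the same symptom, a small filter such as the following can help. This is only a sketch: the regular expression is written against the message format shown in this report, and the function name find_hung_tasks is made up for illustration.

```python
import re

# Matches the hung-task warning as it appears in the log above, e.g.
# "INFO: task mdX_resync:5760 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_lines):
    """Return (command, pid, seconds) for every hung-task warning found."""
    found = []
    for line in log_lines:
        match = HUNG_TASK.search(line)
        if match:
            found.append(
                (match["comm"], int(match["pid"]), int(match["secs"]))
            )
    return found


# Three lines copied from the log in this report.
sample = [
    "kernel: INFO: task mdX_resync:5760 blocked for more than 120 seconds.",
    "kernel: mdX_resync    D 0000000000000002     0  5760      2 0x00000080",
    "kernel: INFO: task dd:5793 blocked for more than 120 seconds.",
]

for comm, pid, secs in find_hung_tasks(sample):
    print(f"{comm} (pid {pid}) blocked for more than {secs}s")
```

Applied to the log in this report, it picks out the two blocked tasks, mdX_resync and dd.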


[root@taft-01 ~]# dmsetup status
black_bird-synced_primary_raid4_3legs_1_rimage_1: 0 344064 linear 
black_bird-synced_primary_raid4_3legs_1: 0 1032192 raid raid4 4 DAAA 107560/344064
black_bird-synced_primary_raid4_3legs_1_rimage_0: 0 344064 linear 
black_bird-synced_primary_raid4_3legs_1_rmeta_3: 0 8192 linear 
black_bird-synced_primary_raid4_3legs_1_rmeta_2: 0 8192 linear 
black_bird-synced_primary_raid4_3legs_1_rmeta_1: 0 8192 linear 
black_bird-synced_primary_raid4_3legs_1_rmeta_0: 0 8192 linear 
black_bird-synced_primary_raid4_3legs_1_rimage_3: 0 344064 linear 
black_bird-synced_primary_raid4_3legs_1_rimage_2: 0 344064 linear 
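
The raid line in the status output above carries the array state: the RAID type, the device count, one health character per device, and a sync ratio. The sketch below pulls those fields apart. It assumes the field layout seen in this output and the dm-raid convention that "D" marks a failed device. The function name parse_raid_status is hypothetical.

```python
def parse_raid_status(line):
    """Parse one 'dmsetup status' line for a dm-raid target.

    Assumed layout, matching the output above:
      <name>: <start> <length> raid <type> <#devs> <health> <synced>/<total>
    """
    name, rest = line.split(":", 1)
    fields = rest.split()
    if len(fields) < 7 or fields[2] != "raid":
        raise ValueError("not a dm-raid status line")
    health = fields[5]
    synced, total = (int(x) for x in fields[6].split("/"))
    return {
        "name": name.strip(),
        "raid_type": fields[3],
        "devices": int(fields[4]),
        "health": health,
        # Zero-based positions of devices flagged as failed ("D").
        "failed": [i for i, c in enumerate(health) if c == "D"],
        "synced": synced,
        "total": total,
    }


# The raid line from the status output in this report.
status = parse_raid_status(
    "black_bird-synced_primary_raid4_3legs_1: "
    "0 1032192 raid raid4 4 DAAA 107560/344064"
)
print("failed device indexes:", status["failed"])
print("sync progress: {:.1%}".format(status["synced"] / status["total"]))
```

For this report the only failed position is device 0, which agrees with the lvm message "Device #0 of raid4 array ... has failed."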

[root@taft-01 ~]# dmsetup table
black_bird-synced_primary_raid4_3legs_1_rimage_1: 0 344064 linear 8:49 10240
black_bird-synced_primary_raid4_3legs_1: 0 1032192 raid raid4 3 128 region_size 1024 4 253:2 253:3 253:4 253:5 253:6 253:7 253:8 253:9
black_bird-synced_primary_raid4_3legs_1_rimage_0: 0 344064 linear 8:97 10240
black_bird-synced_primary_raid4_3legs_1_rmeta_3: 0 8192 linear 8:81 2048
black_bird-synced_primary_raid4_3legs_1_rmeta_2: 0 8192 linear 8:113 2048
black_bird-synced_primary_raid4_3legs_1_rmeta_1: 0 8192 linear 8:49 2048
black_bird-synced_primary_raid4_3legs_1_rmeta_0: 0 8192 linear 8:97 2048
black_bird-synced_primary_raid4_3legs_1_rimage_3: 0 344064 linear 8:81 10240
black_bird-synced_primary_raid4_3legs_1_rimage_2: 0 344064 linear 8:113 10240
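
The raid line in the table output above can be cross-checked against the LV sizes reported earlier. The sketch below splits the line into its parameters and its metadata/data device pairs, then confirms that three data legs of 344064 sectors add up to the 1032192-sector (504 MiB) volume. It assumes the layout seen in this output and that every optional parameter is a key/value pair, which holds for this table but not for flag-style options. The function name parse_raid_table is hypothetical.

```python
SECTOR_BYTES = 512  # device-mapper tables count in 512-byte sectors


def parse_raid_table(line):
    """Parse one 'dmsetup table' line for a dm-raid target.

    Assumed layout, matching the output above:
      <name>: <start> <length> raid <type> <#params> <params...>
              <#devs> <meta_dev> <data_dev> ...
    """
    name, rest = line.split(":", 1)
    f = rest.split()
    if len(f) < 6 or f[2] != "raid":
        raise ValueError("not a dm-raid table line")
    n_params = int(f[4])
    params = f[5:5 + n_params]
    n_devs = int(f[5 + n_params])
    devs = f[6 + n_params:6 + n_params + 2 * n_devs]
    if len(devs) != 2 * n_devs:
        raise ValueError("device list shorter than declared")
    return {
        "name": name.strip(),
        "length": int(f[1]),
        "raid_type": f[3],
        "chunk_sectors": int(params[0]),
        # Simplification: treats the remaining params as key/value pairs.
        "options": dict(zip(params[1::2], params[2::2])),
        # Each leg is a (metadata device, data device) pair.
        "legs": list(zip(devs[0::2], devs[1::2])),
    }


table = parse_raid_table(
    "black_bird-synced_primary_raid4_3legs_1: 0 1032192 raid raid4 3 128 "
    "region_size 1024 4 253:2 253:3 253:4 253:5 253:6 253:7 253:8 253:9"
)

leg_sectors = 344064                  # length of each rimage device above
data_legs = len(table["legs"]) - 1    # raid4 keeps one leg's worth of parity
print("legs:", table["legs"])
print("size in MiB:", table["length"] * SECTOR_BYTES / 2**20)
print("matches data legs:", table["length"] == data_legs * leg_sectors)
```

The first pair, 253:2 and 253:3, is leg 0. Its data device 253:3 corresponds to dm-3, the device named in the kernel's "Disk failure on dm-3" message.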


Version:
2.6.32-220.el6.x86_64

lvm2-2.02.90-0.25.el6    BUILT: Sat Jan 28 18:03:08 CST 2012
lvm2-libs-2.02.90-0.25.el6    BUILT: Sat Jan 28 18:03:08 CST 2012
lvm2-cluster-2.02.90-0.25.el6    BUILT: Sat Jan 28 18:03:08 CST 2012
udev-147-2.40.el6    BUILT: Fri Sep 23 07:51:13 CDT 2011
device-mapper-1.02.69-0.25.el6    BUILT: Sat Jan 28 18:03:08 CST 2012
device-mapper-libs-1.02.69-0.25.el6    BUILT: Sat Jan 28 18:03:08 CST 2012
device-mapper-event-1.02.69-0.25.el6    BUILT: Sat Jan 28 18:03:08 CST 2012
device-mapper-event-libs-1.02.69-0.25.el6    BUILT: Sat Jan 28 18:03:08 CST 2012
cmirror-2.02.90-0.25.el6    BUILT: Sat Jan 28 18:03:08 CST 2012

Comment 1 Corey Marthaler 2012-02-10 21:35:26 UTC
This too is reproducible.

Comment 2 Jonathan Earl Brassow 2012-02-20 19:33:09 UTC
Fixed by the latest RHEL 6 kernel (2.6.32-236).

Comment 4 Corey Marthaler 2012-02-20 23:57:11 UTC
Verified fixed in the latest kernel + scratch lvm builds.

2.6.32-236.el6.x86_64

lvm2-2.02.92-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012
lvm2-libs-2.02.92-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012
lvm2-cluster-2.02.92-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012
udev-147-2.40.el6    BUILT: Fri Sep 23 07:51:13 CDT 2011
device-mapper-1.02.71-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012
device-mapper-libs-1.02.71-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012
device-mapper-event-1.02.71-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012
device-mapper-event-libs-1.02.71-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012
cmirror-2.02.92-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012

Comment 7 Jonathan Earl Brassow 2012-04-23 18:29:02 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
New Feature to 6.3.  No documentation required.

Bug 732458 is the bug that requires a release note for the RAID features.  Other documentation is found in the LVM manual.

Operational bugs need no documentation because they are being fixed before their initial release.

Comment 9 errata-xmlrpc 2012-06-20 15:01:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0962.html