Bug 789408 - RAID5 device failure causes dmeventd to block
Summary: RAID5 device failure causes dmeventd to block
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: lvm2
Version: 6.3
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Jonathan Earl Brassow
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-02-10 17:33 UTC by Corey Marthaler
Modified: 2012-06-20 15:01 UTC (History)
CC: 10 users

Fixed In Version: lvm2-2.02.95-1.el6
Doc Type: Bug Fix
Doc Text:
New feature in 6.3; no documentation required. Bug 732458 is the bug that carries the release note for the RAID features. Other documentation is in the LVM manual. Operational bugs need no documentation because they are being fixed before their initial release.
Clone Of:
Environment:
Last Closed: 2012-06-20 15:01:08 UTC
Target Upstream Version:




Links
Red Hat Product Errata RHBA-2012:0962 (normal, SHIPPED_LIVE): lvm2 bug fix and enhancement update, last updated 2012-06-19 21:12:11 UTC

Description Corey Marthaler 2012-02-10 17:33:46 UTC
Description of problem:
Scenario kill_primary_synced_raid5_3legs: Kill primary leg of synced 3 leg raid5 volume(s)

********* RAID hash info for this scenario *********
* names:              synced_primary_raid5_3legs_1
* sync:               1
* leg devices:        /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdg1
* failpv(s):          /dev/sdc1
* failnode(s):        taft-01
* raid fault policy:   warn
******************************************************
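
For reference, the "warn" fault policy above corresponds to the raid_fault_policy setting in lvm.conf. A minimal sketch of the relevant stanza (standard lvm.conf syntax; the values shown are the documented choices, not taken from this machine's config):

    # /etc/lvm/lvm.conf (sketch)
    activation {
        # "warn": dmeventd only logs leg failures and leaves repair to the admin.
        # "allocate": dmeventd tries to replace the failed leg from spare space.
        raid_fault_policy = "warn"
    }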

Creating raid(s) on taft-01...
taft-01: lvcreate --type raid5 -i 3 -n synced_primary_raid5_3legs_1 -L 500M black_bird /dev/sdc1:0-1000 /dev/sdd1:0-1000 /dev/sde1:0-1000 /dev/sdg1:0-1000

RAID Structure(s):
 LV                                      Attr     LSize   Copy%  Devices
 synced_primary_raid5_3legs_1            rwi-a-r- 504.00m        synced_primary_raid5_3legs_1_rimage_0(0),synced_primary_raid5_3legs_1_rimage_1(0),synced_primary_raid5_3legs_1_rimage_2(0),synced_primary_raid5_3legs_1_rimage_3(0)
 [synced_primary_raid5_3legs_1_rimage_0] Iwi-aor- 168.00m        /dev/sdc1(1)
 [synced_primary_raid5_3legs_1_rimage_1] Iwi-aor- 168.00m        /dev/sdd1(1)
 [synced_primary_raid5_3legs_1_rimage_2] Iwi-aor- 168.00m        /dev/sde1(1)
 [synced_primary_raid5_3legs_1_rimage_3] Iwi-aor- 168.00m        /dev/sdg1(1)
 [synced_primary_raid5_3legs_1_rmeta_0]  ewi-aor-   4.00m        /dev/sdc1(0)
 [synced_primary_raid5_3legs_1_rmeta_1]  ewi-aor-   4.00m        /dev/sdd1(0)
 [synced_primary_raid5_3legs_1_rmeta_2]  ewi-aor-   4.00m        /dev/sde1(0)
 [synced_primary_raid5_3legs_1_rmeta_3]  ewi-aor-   4.00m        /dev/sdg1(0)
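
(The layout above is a standard lvs report; a sketch of an equivalent invocation, assuming only the documented field names:)

    lvs -a -o lv_name,attr,size,copy_percent,devices black_bird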

PV=/dev/sdc1
        synced_primary_raid5_3legs_1_rimage_0: 2
        synced_primary_raid5_3legs_1_rmeta_0: 2

Continuing on without fully synced raid1 mirror(s), currently at...
        ( 6.25% )

Disabling device sdc on taft-01
[DEADLOCK]
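
Condensed reproduction, pieced together from the log above (a sketch; VG, LV, and device names are the ones from this run):

    # 1. Create the 3-legged raid5 LV and let it start syncing:
    lvcreate --type raid5 -i 3 -n synced_primary_raid5_3legs_1 -L 500M black_bird \
        /dev/sdc1:0-1000 /dev/sdd1:0-1000 /dev/sde1:0-1000 /dev/sdg1:0-1000
    # 2. Kill the primary leg out from under the array:
    echo offline > /sys/block/sdc/device/state
    # 3. Any subsequent device scan blocks behind dmeventd:
    pvs -a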




qarshd[3131]: Running cmdline: echo offline > /sys/block/sdc/device/state &
kernel: sd 3:0:0:2: rejecting I/O to offline device
kernel: sd 3:0:0:2: rejecting I/O to offline device
kernel: md/raid:mdX: Disk failure on dm-3, disabling device.
kernel: md/raid:mdX: Operation continuing on 3 devices.
kernel: md: mdX: resync done.
kernel: md: checkpointing resync of mdX.
lvm[1153]: Device #0 of raid5_ls array, black_bird-synced_primary_raid5_3legs_1, has failed.
qarshd[3134]: Running cmdline: pvs -a
kernel: INFO: task dmeventd:3108 blocked for more than 120 seconds.
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: dmeventd      D 0000000000000003     0  3108      1 0x00000080
kernel: ffff880218c37b18 0000000000000086 0000000000000000 ffffffffa000422e
kernel: ffff880218c37ae8 00000000bd278ab4 ffff880218c37b08 ffff880219021980
kernel: ffff880216ea3ab8 ffff880218c37fd8 000000000000f4e8 ffff880216ea3ab8
kernel: Call Trace:
kernel: [<ffffffffa000422e>] ? dm_table_unplug_all+0x8e/0x100 [dm_mod]
kernel: [<ffffffff814ed1e3>] io_schedule+0x73/0xc0
kernel: [<ffffffff811b1a2e>] __blockdev_direct_IO_newtrunc+0x6fe/0xb90
kernel: [<ffffffff8125821d>] ? get_disk+0x7d/0xf0
kernel: [<ffffffff811b1f1e>] __blockdev_direct_IO+0x5e/0xd0
kernel: [<ffffffff811ae820>] ? blkdev_get_blocks+0x0/0xc0
kernel: [<ffffffff8126cd7a>] ? kobject_get+0x1a/0x30
kernel: [<ffffffff811af687>] blkdev_direct_IO+0x57/0x60
kernel: [<ffffffff811ae820>] ? blkdev_get_blocks+0x0/0xc0
kernel: [<ffffffff811128db>] generic_file_aio_read+0x6bb/0x700
kernel: [<ffffffff81213a31>] ? avc_has_perm+0x71/0x90
kernel: [<ffffffff8120d52f>] ? security_inode_permission+0x1f/0x30
kernel: [<ffffffff8117641a>] do_sync_read+0xfa/0x140
kernel: [<ffffffff81090bf0>] ? autoremove_wake_function+0x0/0x40
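
(The "blocked for more than 120 seconds" warning above is the kernel's hung-task watchdog; for reference, the standard procfs knob the log message itself points at:)

    cat /proc/sys/kernel/hung_task_timeout_secs   # 120 by default
    # Setting it to 0 only silences the warning; it does not unblock dmeventd.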


[root@taft-01 ~]# dmsetup status
black_bird-synced_primary_raid5_3legs_1_rimage_3: 0 344064 linear 
black_bird-synced_primary_raid5_3legs_1_rimage_2: 0 344064 linear 
black_bird-synced_primary_raid5_3legs_1_rimage_1: 0 344064 linear 
black_bird-synced_primary_raid5_3legs_1_rimage_0: 0 344064 linear 
black_bird-synced_primary_raid5_3legs_1: 0 1032192 raid raid5_ls 4 DAAA 150584/344064
black_bird-synced_primary_raid5_3legs_1_rmeta_3: 0 8192 linear 
black_bird-synced_primary_raid5_3legs_1_rmeta_2: 0 8192 linear 
black_bird-synced_primary_raid5_3legs_1_rmeta_1: 0 8192 linear 
black_bird-synced_primary_raid5_3legs_1_rmeta_0: 0 8192 linear 
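
Reading the raid status line (per the kernel's dm-raid target documentation; an annotated sketch, not output from this machine):

    # <start> <len> raid <raid_type> <#devices> <health_chars> <resynced/total>
    # Here: 4 devices; "DAAA" = device 0 Dead (the offlined /dev/sdc1 leg),
    # devices 1-3 Alive; resync checkpointed at 150584 of 344064 sectors.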


[root@taft-01 ~]# dmsetup table
black_bird-synced_primary_raid5_3legs_1_rimage_3: 0 344064 linear 8:97 10240
black_bird-synced_primary_raid5_3legs_1_rimage_2: 0 344064 linear 8:65 10240
black_bird-synced_primary_raid5_3legs_1_rimage_1: 0 344064 linear 8:49 10240
black_bird-synced_primary_raid5_3legs_1_rimage_0: 0 344064 linear 8:33 10240
black_bird-synced_primary_raid5_3legs_1: 0 1032192 raid raid5_ls 3 128 region_size 1024 4 253:2 253:3 253:4 253:5 253:6 253:7 253:8 253:9
black_bird-synced_primary_raid5_3legs_1_rmeta_3: 0 8192 linear 8:97 2048
black_bird-synced_primary_raid5_3legs_1_rmeta_2: 0 8192 linear 8:65 2048
black_bird-synced_primary_raid5_3legs_1_rmeta_1: 0 8192 linear 8:49 2048
black_bird-synced_primary_raid5_3legs_1_rmeta_0: 0 8192 linear 8:33 2048
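
The corresponding table line decodes as follows (again per the dm-raid documentation; a sketch):

    # <start> <len> raid <raid_type> <#params> <params...> <#devs> <meta> <data> ...
    # Here: 3 params ("128" = chunk size in sectors, i.e. 64KiB, plus
    # "region_size 1024"), then 4 metadata/data device pairs: 253:2/253:3
    # (rmeta_0/rimage_0) through 253:8/253:9 (rmeta_3/rimage_3).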


Version-Release number of selected component (if applicable):
2.6.32-220.el6.x86_64

lvm2-2.02.90-0.25.el6    BUILT: Sat Jan 28 18:03:08 CST 2012
lvm2-libs-2.02.90-0.25.el6    BUILT: Sat Jan 28 18:03:08 CST 2012
lvm2-cluster-2.02.90-0.25.el6    BUILT: Sat Jan 28 18:03:08 CST 2012
udev-147-2.40.el6    BUILT: Fri Sep 23 07:51:13 CDT 2011
device-mapper-1.02.69-0.25.el6    BUILT: Sat Jan 28 18:03:08 CST 2012
device-mapper-libs-1.02.69-0.25.el6    BUILT: Sat Jan 28 18:03:08 CST 2012
device-mapper-event-1.02.69-0.25.el6    BUILT: Sat Jan 28 18:03:08 CST 2012
device-mapper-event-libs-1.02.69-0.25.el6    BUILT: Sat Jan 28 18:03:08 CST 2012
cmirror-2.02.90-0.25.el6    BUILT: Sat Jan 28 18:03:08 CST 2012


How reproducible:
Every time

Comment 1 Jonathan Earl Brassow 2012-02-20 19:29:13 UTC
Seems to be fixed by the latest version of the RHEL 6 kernel (2.6.32-236.el6).

However, I did notice that the helpful message RAID1 prints when a device is lost is not printed for the higher RAID levels. This is not a problem with the kernel or dmeventd, but with the lvconvert command run by dmeventd. Perhaps this is worth another bug?
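
For reference, the repair path in question is the lvconvert call dmeventd makes when a leg fails; the manual equivalent would be something like the following (a sketch using the names from this report; /dev/sdh1 is a hypothetical replacement PV):

    # With raid_fault_policy "warn", repair is left to the administrator:
    lvconvert --repair black_bird/synced_primary_raid5_3legs_1
    # Optionally restrict where the replacement leg is allocated:
    # lvconvert --repair black_bird/synced_primary_raid5_3legs_1 /dev/sdh1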

Comment 3 Corey Marthaler 2012-02-20 23:56:31 UTC
Verified fixed in the latest kernel + scratch lvm builds.

2.6.32-236.el6.x86_64

lvm2-2.02.92-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012
lvm2-libs-2.02.92-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012
lvm2-cluster-2.02.92-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012
udev-147-2.40.el6    BUILT: Fri Sep 23 07:51:13 CDT 2011
device-mapper-1.02.71-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012
device-mapper-libs-1.02.71-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012
device-mapper-event-1.02.71-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012
device-mapper-event-libs-1.02.71-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012
cmirror-2.02.92-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012

Comment 6 Jonathan Earl Brassow 2012-04-23 18:28:56 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
New feature in 6.3.  No documentation required.

Bug 732458 is the bug that requires a release note for the RAID features.  Other documentation is found in the LVM manual.

Operational bugs need no documentation because they are being fixed before their initial release.

Comment 8 errata-xmlrpc 2012-06-20 15:01:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0962.html

