Bug 1319316

Summary: multiple raid device failures may lead to deadlock
Product: Red Hat Enterprise Linux 6 Reporter: Corey Marthaler <cmarthal>
Component: lvm2Assignee: Heinz Mauelshagen <heinzm>
lvm2 sub component: Mirroring and RAID (RHEL6) QA Contact: cluster-qe <cluster-qe>
Status: CLOSED WONTFIX Docs Contact:
Severity: medium    
Priority: unspecified CC: agk, heinzm, jbrassow, msnitzer, prajnoha, prockai, zkabelac
Version: 6.8   
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-10-04 01:12:52 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Corey Marthaler 2016-03-18 19:24:15 UTC
Description of problem:
================================================================================
Iteration 1.24 started at Thu Mar 17 22:46:26 CDT 2016
================================================================================
Scenario kill_multiple_synced_raid1_4legs: Kill multiple legs of synced 4 leg raid1 volume(s)

********* RAID hash info for this scenario *********
* names:              synced_multiple_raid1_4legs_1 synced_multiple_raid1_4legs_2
* sync:               1
* type:               raid1
* -m |-i value:       4
* leg devices:        /dev/sdg1 /dev/sdf1 /dev/sdc1 /dev/sde1 /dev/sda1
* spanned legs:        0
* failpv(s):          /dev/sdf1 /dev/sde1
* failnode(s):        host-113.virt.lab.msp.redhat.com
* lvmetad:            0
* raid fault policy:  allocate
******************************************************

Creating raids(s) on host-113.virt.lab.msp.redhat.com...
host-113.virt.lab.msp.redhat.com: lvcreate --type raid1 -m 4 -n synced_multiple_raid1_4legs_1 -L 500M black_bird /dev/sdg1:0-2400 /dev/sdf1:0-2400 /dev/sdc1:0-2400 /dev/sde1:0-2400 /dev/sda1:0-2400
host-113.virt.lab.msp.redhat.com: lvcreate --type raid1 -m 4 -n synced_multiple_raid1_4legs_2 -L 500M black_bird /dev/sdg1:0-2400 /dev/sdf1:0-2400 /dev/sdc1:0-2400 /dev/sde1:0-2400 /dev/sda1:0-2400

Current mirror/raid device structure(s):
  LV                                       Attr       LSize   Cpy%Sync Devices
   synced_multiple_raid1_4legs_1            rwi-a-r--- 500.00m 5.60     synced_multiple_raid1_4legs_1_rimage_0(0),synced_multiple_raid1_4legs_1_rimage_1(0),synced_multiple_raid1_4legs_1_rimage_2(0),synced_multiple_raid1_4legs_1_rimage_3(0),synced_multiple_raid1_4legs_1_rimage_4(0)
   [synced_multiple_raid1_4legs_1_rimage_0] Iwi-aor--- 500.00m          /dev/sdg1(1)
   [synced_multiple_raid1_4legs_1_rimage_1] Iwi-aor--- 500.00m          /dev/sdf1(1)
   [synced_multiple_raid1_4legs_1_rimage_2] Iwi-aor--- 500.00m          /dev/sdc1(1)
   [synced_multiple_raid1_4legs_1_rimage_3] Iwi-aor--- 500.00m          /dev/sde1(1)
   [synced_multiple_raid1_4legs_1_rimage_4] Iwi-aor--- 500.00m          /dev/sda1(1)
   [synced_multiple_raid1_4legs_1_rmeta_0]  ewi-aor---   4.00m          /dev/sdg1(0)
   [synced_multiple_raid1_4legs_1_rmeta_1]  ewi-aor---   4.00m          /dev/sdf1(0)
   [synced_multiple_raid1_4legs_1_rmeta_2]  ewi-aor---   4.00m          /dev/sdc1(0)
   [synced_multiple_raid1_4legs_1_rmeta_3]  ewi-aor---   4.00m          /dev/sde1(0)
   [synced_multiple_raid1_4legs_1_rmeta_4]  ewi-aor---   4.00m          /dev/sda1(0)
   synced_multiple_raid1_4legs_2            rwi-a-r--- 500.00m 0.00     synced_multiple_raid1_4legs_2_rimage_0(0),synced_multiple_raid1_4legs_2_rimage_1(0),synced_multiple_raid1_4legs_2_rimage_2(0),synced_multiple_raid1_4legs_2_rimage_3(0),synced_multiple_raid1_4legs_2_rimage_4(0)
   [synced_multiple_raid1_4legs_2_rimage_0] Iwi-aor--- 500.00m          /dev/sdg1(127)
   [synced_multiple_raid1_4legs_2_rimage_1] Iwi-aor--- 500.00m          /dev/sdf1(127)
   [synced_multiple_raid1_4legs_2_rimage_2] Iwi-aor--- 500.00m          /dev/sdc1(127)
   [synced_multiple_raid1_4legs_2_rimage_3] Iwi-aor--- 500.00m          /dev/sde1(127)
   [synced_multiple_raid1_4legs_2_rimage_4] Iwi-aor--- 500.00m          /dev/sda1(127)
   [synced_multiple_raid1_4legs_2_rmeta_0]  ewi-aor---   4.00m          /dev/sdg1(126)
   [synced_multiple_raid1_4legs_2_rmeta_1]  ewi-aor---   4.00m          /dev/sdf1(126)
   [synced_multiple_raid1_4legs_2_rmeta_2]  ewi-aor---   4.00m          /dev/sdc1(126)
   [synced_multiple_raid1_4legs_2_rmeta_3]  ewi-aor---   4.00m          /dev/sde1(126)
   [synced_multiple_raid1_4legs_2_rmeta_4]  ewi-aor---   4.00m          /dev/sda1(126)


Waiting until all mirror|raid volumes become fully syncd...
   2/2 mirror(s) are fully synced: ( 100.00% 100.00% )

Creating ext on top of mirror(s) on host-113.virt.lab.msp.redhat.com...
mke2fs 1.41.12 (17-May-2010)
mke2fs 1.41.12 (17-May-2010)
Mounting mirrored ext filesystems on host-113.virt.lab.msp.redhat.com...

PV=/dev/sdf1
        synced_multiple_raid1_4legs_1_rimage_1: 1.0
        synced_multiple_raid1_4legs_1_rmeta_1: 1.0
        synced_multiple_raid1_4legs_2_rimage_1: 1.0
        synced_multiple_raid1_4legs_2_rmeta_1: 1.0
PV=/dev/sde1
        synced_multiple_raid1_4legs_1_rimage_3: 1.0
        synced_multiple_raid1_4legs_1_rmeta_3: 1.0
        synced_multiple_raid1_4legs_2_rimage_3: 1.0
        synced_multiple_raid1_4legs_2_rmeta_3: 1.0

Writing verification files (checkit) to mirror(s) on...
        ---- host-113.virt.lab.msp.redhat.com ----

<start name="host-113.virt.lab.msp.redhat.com_synced_multiple_raid1_4legs_1" pid="9562" time="Thu Mar 17 22:47:12 2016" type="cmd" />
<start name="host-113.virt.lab.msp.redhat.com_synced_multiple_raid1_4legs_2" pid="9564" time="Thu Mar 17 22:47:12 2016" type="cmd" />
Sleeping 15 seconds to get some outsanding I/O locks before the failure 
Verifying files (checkit) on mirror(s) on...
        ---- host-113.virt.lab.msp.redhat.com ----

Disabling device sdf on host-113.virt.lab.msp.redhat.com
Disabling device sde on host-113.virt.lab.msp.redhat.com

[DEADLOCK]



Mar 17 22:47:41 host-113 qarshd[8170]: Running cmdline: echo offline > /sys/block/sdf/device/state
Mar 17 22:47:41 host-113 qarshd[8172]: Talking to peer 10.15.80.224:47832 (IPv4)
Mar 17 22:47:41 host-113 kernel: sd 7:0:0:1: rejecting I/O to offline device
Mar 17 22:47:41 host-113 kernel: md: super_written gets error=-5, uptodate=0
Mar 17 22:47:41 host-113 kernel: md/raid1:mdX: Disk failure on dm-5, disabling device.
Mar 17 22:47:41 host-113 kernel: md/raid1:mdX: Operation continuing on 4 devices.
Mar 17 22:47:41 host-113 lvm[4069]: Device #1 of raid1 array, black_bird-synced_multiple_raid1_4legs_1, has failed.
Mar 17 22:47:41 host-113 qarshd[8172]: Running cmdline: echo offline > /sys/block/sde/device/state
Mar 17 22:47:41 host-113 lvm[4069]: /dev/sdf1: read failed after 0 of 4096 at 0: Input/output error
Mar 17 22:47:41 host-113 kernel: sd 7:0:0:1: rejecting I/O to offline device
Mar 17 22:47:41 host-113 kernel: sd 6:0:0:1: rejecting I/O to offline device
Mar 17 22:47:41 host-113 kernel: sd 6:0:0:1: rejecting I/O to offline device
Mar 17 22:47:41 host-113 kernel: sd 6:0:0:1: rejecting I/O to offline device
Mar 17 22:47:41 host-113 kernel: sd 6:0:0:1: rejecting I/O to offline device
Mar 17 22:47:41 host-113 kernel: sd 6:0:0:1: rejecting I/O to offline device
Mar 17 22:47:41 host-113 kernel: md/raid1:mdX: dm-9: rescheduling sector 0
Mar 17 22:47:41 host-113 kernel: sd 6:0:0:1: rejecting I/O to offline device
Mar 17 22:47:41 host-113 kernel: md/raid1:mdX: Disk failure on dm-9, disabling device.
Mar 17 22:47:41 host-113 kernel: md/raid1:mdX: Operation continuing on 3 devices.
Mar 17 22:47:41 host-113 kernel: sd 6:0:0:1: rejecting I/O to offline device
Mar 17 22:47:41 host-113 kernel: sd 6:0:0:1: rejecting I/O to offline device
Mar 17 22:47:41 host-113 kernel: sd 6:0:0:1: rejecting I/O to offline device
Mar 17 22:47:42 host-113 kernel: sd 6:0:0:1: rejecting I/O to offline device
Mar 17 22:47:42 host-113 kernel: md: super_written gets error=-5, uptodate=0
Mar 17 22:47:42 host-113 kernel: md/raid1:mdX: Disk failure on dm-20, disabling device.
Mar 17 22:47:42 host-113 kernel: md/raid1:mdX: Operation continuing on 4 devices.
Mar 17 22:47:42 host-113 lvm[4069]: Device #3 of raid1 array, black_bird-synced_multiple_raid1_4legs_2, has failed.
Mar 17 22:47:42 host-113 kernel: sd 7:0:0:1: rejecting I/O to offline device
Mar 17 22:47:42 host-113 kernel: md: super_written gets error=-5, uptodate=0
Mar 17 22:47:42 host-113 kernel: md/raid1:mdX: Disk failure on dm-16, disabling device.
Mar 17 22:47:42 host-113 kernel: md/raid1:mdX: Operation continuing on 3 devices.
Mar 17 22:47:46 host-113 qarshd[8174]: Talking to peer 10.15.80.224:47836 (IPv4)
Mar 17 22:47:46 host-113 qarshd[8174]: Running cmdline: pvs -a
Mar 17 22:50:25 host-113 kernel: INFO: task dmeventd:7954 blocked for more than 120 seconds.
Mar 17 22:50:25 host-113 kernel:      Not tainted 2.6.32-616.el6.x86_64 #1
Mar 17 22:50:25 host-113 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 17 22:50:25 host-113 kernel: dmeventd      D 0000000000000000     0  7954      1 0x00000080
Mar 17 22:50:25 host-113 kernel: ffff88003b8f7a18 0000000000000082 ffff880000000000 000db58a99167f2b
Mar 17 22:50:25 host-113 kernel: ffff88003b8f7978 ffff88001b65ed00 0000000001bff72a ffffffffa914c87e
Mar 17 22:50:25 host-113 kernel: 00000000390ca2cc ffffffff81aa8340 ffff880039f9bad8 ffff88003b8f7fd8
Mar 17 22:50:25 host-113 kernel: Call Trace:
Mar 17 22:50:25 host-113 kernel: [<ffffffff81547063>] io_schedule+0x73/0xc0
Mar 17 22:50:25 host-113 kernel: [<ffffffff811daa5d>] __blockdev_direct_IO_newtrunc+0xb7d/0x1270
Mar 17 22:50:25 host-113 kernel: [<ffffffff811d63c0>] ? blkdev_get_block+0x0/0x20
Mar 17 22:50:25 host-113 kernel: [<ffffffff811db1c7>] __blockdev_direct_IO+0x77/0xe0
Mar 17 22:50:25 host-113 kernel: [<ffffffff811d63c0>] ? blkdev_get_block+0x0/0x20
Mar 17 22:50:25 host-113 kernel: [<ffffffff811d7447>] blkdev_direct_IO+0x57/0x60
Mar 17 22:50:25 host-113 kernel: [<ffffffff811d63c0>] ? blkdev_get_block+0x0/0x20
Mar 17 22:50:25 host-113 kernel: [<ffffffff8113048b>] generic_file_aio_read+0x6bb/0x700
Mar 17 22:50:25 host-113 kernel: [<ffffffff812422b1>] ? avc_has_perm+0x71/0x90
Mar 17 22:50:25 host-113 kernel: [<ffffffff811d6931>] blkdev_aio_read+0x51/0x80
Mar 17 22:50:25 host-113 kernel: [<ffffffff81199afa>] do_sync_read+0xfa/0x140
Mar 17 22:50:25 host-113 kernel: [<ffffffff810a6ac0>] ? autoremove_wake_function+0x0/0x40
Mar 17 22:50:25 host-113 kernel: [<ffffffff811d675c>] ? block_ioctl+0x3c/0x40
Mar 17 22:50:25 host-113 kernel: [<ffffffff811af5d2>] ? vfs_ioctl+0x22/0xa0
Mar 17 22:50:25 host-113 kernel: [<ffffffff81247deb>] ? selinux_file_permission+0xfb/0x150
Mar 17 22:50:25 host-113 kernel: [<ffffffff8123aa76>] ? security_file_permission+0x16/0x20
Mar 17 22:50:25 host-113 kernel: [<ffffffff8119a3f5>] vfs_read+0xb5/0x1a0
Mar 17 22:50:25 host-113 kernel: [<ffffffff8119b1a6>] ? fget_light_pos+0x16/0x50
Mar 17 22:50:25 host-113 kernel: [<ffffffff8119a741>] sys_read+0x51/0xb0
Mar 17 22:50:25 host-113 kernel: [<ffffffff810ee45e>] ? __audit_syscall_exit+0x25e/0x290
Mar 17 22:50:25 host-113 kernel: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
Mar 17 22:50:25 host-113 kernel: INFO: task mdX_raid1:7947 blocked for more than 120 seconds.
Mar 17 22:50:25 host-113 kernel:      Not tainted 2.6.32-616.el6.x86_64 #1
Mar 17 22:50:25 host-113 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 17 22:50:25 host-113 kernel: mdX_raid1     D 0000000000000000     0  7947      2 0x00000080
Mar 17 22:50:25 host-113 kernel: ffff88003dee7c90 0000000000000046 ffff880039c69a00 0000000300000001
Mar 17 22:50:25 host-113 kernel: ffff880024828800 ffff88003d2bbed0 0000000000000282 0000000000000003
Mar 17 22:50:25 host-113 kernel: 0000000000000001 0000000000000282 ffff88003a569068 ffff88003dee7fd8
Mar 17 22:50:25 host-113 kernel: Call Trace:
Mar 17 22:50:25 host-113 kernel: [<ffffffff814290f9>] ? md_wakeup_thread+0x39/0x80
Mar 17 22:50:25 host-113 kernel: [<ffffffffa04944e5>] freeze_array+0xa5/0x100 [raid1]
Mar 17 22:50:25 host-113 kernel: [<ffffffff810a6ac0>] ? autoremove_wake_function+0x0/0x40
Mar 17 22:50:25 host-113 kernel: [<ffffffffa0496528>] raid1d+0xb08/0x18c0 [raid1]
Mar 17 22:50:25 host-113 kernel: [<ffffffff8108f14c>] ? lock_timer_base+0x3c/0x70
Mar 17 22:50:25 host-113 kernel: [<ffffffff8154771a>] ? schedule_timeout+0x19a/0x2e0
Mar 17 22:50:25 host-113 kernel: [<ffffffff8108f260>] ? process_timeout+0x0/0x10
Mar 17 22:50:25 host-113 kernel: [<ffffffff8142fc25>] md_thread+0x115/0x150
Mar 17 22:50:25 host-113 kernel: [<ffffffff810a6ac0>] ? autoremove_wake_function+0x0/0x40
Mar 17 22:50:25 host-113 kernel: [<ffffffff8142fb10>] ? md_thread+0x0/0x150
Mar 17 22:50:25 host-113 kernel: [<ffffffff810a662e>] kthread+0x9e/0xc0
Mar 17 22:50:25 host-113 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Mar 17 22:50:25 host-113 kernel: [<ffffffff810a6590>] ? kthread+0x0/0xc0
Mar 17 22:50:25 host-113 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
Mar 17 22:50:25 host-113 kernel: INFO: task flush-253:12:8127 blocked for more than 120 seconds.
Mar 17 22:50:25 host-113 kernel:      Not tainted 2.6.32-616.el6.x86_64 #1
Mar 17 22:50:25 host-113 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 17 22:50:25 host-113 kernel: flush-253:12  D 0000000000000000     0  8127      2 0x00000080
Mar 17 22:50:25 host-113 kernel: ffff8800375979a0 0000000000000046 ffff880000000000 000db595d5b4ae7d
Mar 17 22:50:25 host-113 kernel: ffff880037597960 ffff88001b65eec0 0000000001c00bfc ffffffffa914c87e
Mar 17 22:50:25 host-113 kernel: 00000000390ca2cc ffffffff81aa8340 ffff8800248785f8 ffff880037597fd8
Mar 17 22:50:25 host-113 kernel: Call Trace:
Mar 17 22:50:25 host-113 kernel: [<ffffffff8112e5f0>] ? sync_page+0x0/0x50
Mar 17 22:50:25 host-113 kernel: [<ffffffff81547063>] io_schedule+0x73/0xc0
Mar 17 22:50:25 host-113 kernel: [<ffffffff8112e62d>] sync_page+0x3d/0x50
Mar 17 22:50:25 host-113 kernel: [<ffffffff8154791a>] __wait_on_bit_lock+0x5a/0xc0
Mar 17 22:50:25 host-113 kernel: [<ffffffffa04d8d00>] ? buffer_unmapped+0x0/0x20 [ext3]
Mar 17 22:50:25 host-113 kernel: [<ffffffff8112e5c7>] __lock_page+0x67/0x70
Mar 17 22:50:25 host-113 kernel: [<ffffffff810a6b40>] ? wake_bit_function+0x0/0x50
Mar 17 22:50:25 host-113 kernel: [<ffffffff81142437>] ? __writepage+0x17/0x40
Mar 17 22:50:25 host-113 kernel: [<ffffffff811438cd>] write_cache_pages+0x3cd/0x4c0
Mar 17 22:50:25 host-113 kernel: [<ffffffffa0498d00>] ? raid1_end_write_request+0x0/0x2b0 [raid1]
Mar 17 22:50:25 host-113 kernel: [<ffffffff81142420>] ? __writepage+0x0/0x40
Mar 17 22:50:25 host-113 kernel: [<ffffffff81431470>] ? mddev_bio_destructor+0x0/0x20
Mar 17 22:50:25 host-113 kernel: [<ffffffff811439e4>] generic_writepages+0x24/0x30
Mar 17 22:50:25 host-113 kernel: [<ffffffff81143a25>] do_writepages+0x35/0x40
Mar 17 22:50:25 host-113 kernel: [<ffffffff811c6eed>] writeback_single_inode+0xdd/0x290
Mar 17 22:50:25 host-113 kernel: [<ffffffff811c72ed>] writeback_sb_inodes+0xbd/0x170
Mar 17 22:50:25 host-113 kernel: [<ffffffff811c744b>] writeback_inodes_wb+0xab/0x1b0
Mar 17 22:50:25 host-113 kernel: [<ffffffff811c7843>] wb_writeback+0x2f3/0x410
Mar 17 22:50:25 host-113 kernel: [<ffffffff8108fdd2>] ? del_timer_sync+0x22/0x30
Mar 17 22:50:25 host-113 kernel: [<ffffffff811c7b05>] wb_do_writeback+0x1a5/0x240
Mar 17 22:50:25 host-113 kernel: [<ffffffff811c7c03>] bdi_writeback_task+0x63/0x1b0
Mar 17 22:50:25 host-113 kernel: [<ffffffff810a6947>] ? bit_waitqueue+0x17/0xd0
Mar 17 22:50:25 host-113 kernel: [<ffffffff81152a90>] ? bdi_start_fn+0x0/0x100
Mar 17 22:50:25 host-113 kernel: [<ffffffff81152b16>] bdi_start_fn+0x86/0x100
Mar 17 22:50:25 host-113 kernel: [<ffffffff81152a90>] ? bdi_start_fn+0x0/0x100
Mar 17 22:50:25 host-113 kernel: [<ffffffff810a662e>] kthread+0x9e/0xc0
Mar 17 22:50:25 host-113 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Mar 17 22:50:25 host-113 kernel: [<ffffffff810a6590>] ? kthread+0x0/0xc0
Mar 17 22:50:25 host-113 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20


Version-Release number of selected component (if applicable):
2.6.32-616.el6.x86_64

lvm2-2.02.143-2.el6    BUILT: Wed Mar 16 08:30:42 CDT 2016
lvm2-libs-2.02.143-2.el6    BUILT: Wed Mar 16 08:30:42 CDT 2016
lvm2-cluster-2.02.143-2.el6    BUILT: Wed Mar 16 08:30:42 CDT 2016
udev-147-2.72.el6    BUILT: Tue Mar  1 06:14:05 CST 2016
device-mapper-1.02.117-2.el6    BUILT: Wed Mar 16 08:30:42 CDT 2016
device-mapper-libs-1.02.117-2.el6    BUILT: Wed Mar 16 08:30:42 CDT 2016
device-mapper-event-1.02.117-2.el6    BUILT: Wed Mar 16 08:30:42 CDT 2016
device-mapper-event-libs-1.02.117-2.el6    BUILT: Wed Mar 16 08:30:42 CDT 2016
device-mapper-persistent-data-0.6.2-0.1.rc5.el6    BUILT: Wed Feb 24 07:07:09 CST 2016
cmirror-2.02.143-2.el6    BUILT: Wed Mar 16 08:30:42 CDT 2016


How reproducible:
Once