Bug 1309484 - device failure on exclusively active raid w/ snapshot (on top of cluster VG) may lead to deadlock
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: lvm2
Version: 6.8
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: low
Target Milestone: rc
Assigned To: LVM and device-mapper development team
QA Contact: cluster-qe@redhat.com
Reported: 2016-02-17 18:39 EST by Corey Marthaler
Modified: 2017-10-03 21:00 EDT
CC: 7 users

Last Closed: 2017-10-03 21:00:59 EDT
Type: Bug

Attachments: None
Description (Corey Marthaler, 2016-02-17 18:39:08 EST)
Description of problem:
This seems similar to bug 1219222 and was found while attempting to verify that fix. However, bug 1219222 was nearly 100% reproducible, whereas this issue has been seen only once in hundreds of iterations.
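
For context, the failing scenario reduces to roughly the following sequence (a hypothetical sketch: the VG name and device paths are taken from the harness output below, but the snapshot size and LV names here are illustrative, not the harness's exact commands):

vgcreate -cy black_bird /dev/sdf1 /dev/sdc1 /dev/sda1 /dev/sde1    # clustered VG
lvcreate --type raid1 -m 1 -L 500M -n raid1_lv black_bird          # 2-leg raid1
lvchange -aey black_bird/raid1_lv                                  # exclusive activation
lvcreate -s -L 100M -n raid1_snap black_bird/raid1_lv              # snapshot on top
# wait for lvs to report Cpy%Sync at 100.00, then fail the primary leg:
echo offline > /sys/block/sda/device/state
pvs                                                                # may now block (see below)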

================================================================================
Iteration 4.17 started at Wed Feb 17 07:06:21 CST 2016
================================================================================
Scenario kill_second_spanned_primary_synced_raid1_2legs: Kill primary leg of synced 2 leg raid1 volume(s)

********* RAID hash info for this scenario *********
* names:              synced_spanned_primary_raid1_2legs_1
* sync:               1
* type:               raid1
* -m |-i value:       1
* leg devices:        /dev/sdf1 /dev/sdc1 /dev/sda1 /dev/sde1
* spanned legs:        1
* failpv(s):          /dev/sda1
* additional snap:    /dev/sdf1
* failnode(s):        host-030.virt.lab.msp.redhat.com
* lvmetad:            0
* raid fault policy:  warn
******************************************************

Creating raids(s) on host-030.virt.lab.msp.redhat.com...
host-030.virt.lab.msp.redhat.com: lvcreate --type raid1 -m 1 -n synced_spanned_primary_raid1_2legs_1 -L 500M black_bird /dev/sdf1:0-62 /dev/sdc1:0-62 /dev/sda1:0-62 /dev/sde1:0-62
EXCLUSIVELY ACTIVATING RAID on host-030.virt.lab.msp.redhat.com

Current mirror/raid device structure(s):
  LV                                              Attr       LSize   Cpy%Sync Devices
  synced_spanned_primary_raid1_2legs_1            rwi-a-r--- 500.00m 0.80     synced_spanned_primary_raid1_2legs_1_rimage_0(0),synced_spanned_primary_raid1_2legs_1_rimage_1(0)
  [synced_spanned_primary_raid1_2legs_1_rimage_0] Iwi-aor--- 500.00m          /dev/sdf1(1)
  [synced_spanned_primary_raid1_2legs_1_rimage_0] Iwi-aor--- 500.00m          /dev/sda1(0)
  [synced_spanned_primary_raid1_2legs_1_rimage_1] Iwi-aor--- 500.00m          /dev/sdc1(1)
  [synced_spanned_primary_raid1_2legs_1_rimage_1] Iwi-aor--- 500.00m          /dev/sde1(0)
  [synced_spanned_primary_raid1_2legs_1_rmeta_0]  ewi-aor---   4.00m          /dev/sdf1(0)
  [synced_spanned_primary_raid1_2legs_1_rmeta_1]  ewi-aor---   4.00m          /dev/sdc1(0)


Waiting until all mirror|raid volumes become fully synced...
   1/1 mirror(s) are fully synced: ( 100.00% )

Creating ext on top of mirror(s) on host-030.virt.lab.msp.redhat.com...
mke2fs 1.41.12 (17-May-2010)
Mounting mirrored ext filesystems on host-030.virt.lab.msp.redhat.com...

PV=/dev/sda1
        synced_spanned_primary_raid1_2legs_1_rimage_0: 2

Creating a snapshot volume of each of the raids
Writing verification files (checkit) to mirror(s) on...
        ---- host-030.virt.lab.msp.redhat.com ----

<start name="host-030.virt.lab.msp.redhat.com_synced_spanned_primary_raid1_2legs_1" pid="2580" time="Wed Feb 17 07:06:57 2016" type="cmd" />
Sleeping 15 seconds to get some outstanding I/O locks before the failure
Verifying files (checkit) on mirror(s) on...
        ---- host-030.virt.lab.msp.redhat.com ----


Disabling device sda on host-030.virt.lab.msp.redhat.com

simple pvs failed
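
(The harness's pvs probe no longer completes promptly at this point; the syslog below shows the follow-up pvs -a held for the full 120 seconds of its qarsh session, consistent with the scan blocking behind the failed raid leg.)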



Feb 17 07:07:19 host-030 qarshd[28377]: Running cmdline: echo offline > /sys/block/sda/device/state
Feb 17 07:07:19 host-030 xinetd[3157]: EXIT: qarsh status=0 pid=28377 duration=0(sec)
Feb 17 07:07:26 host-030 kernel: sd 2:0:0:1: rejecting I/O to offline device
Feb 17 07:07:26 host-030 kernel: md/raid1:mdX: dm-3: rescheduling sector 507904
Feb 17 07:07:26 host-030 kernel: sd 2:0:0:1: rejecting I/O to offline device
Feb 17 07:07:26 host-030 kernel: md/raid1:mdX: dm-3: rescheduling sector 509072
Feb 17 07:07:26 host-030 kernel: sd 2:0:0:1: rejecting I/O to offline device
Feb 17 07:07:26 host-030 kernel: md/raid1:mdX: dm-3: rescheduling sector 884736
Feb 17 07:07:26 host-030 kernel: md/raid1:mdX: dm-3: rescheduling sector 884744
Feb 17 07:07:26 host-030 kernel: sd 2:0:0:1: rejecting I/O to offline device
Feb 17 07:07:26 host-030 kernel: md/raid1:mdX: dm-3: rescheduling sector 917504
Feb 17 07:07:26 host-030 kernel: sd 2:0:0:1: rejecting I/O to offline device
Feb 17 07:07:26 host-030 lvm[13214]: Device #0 of raid1 array, black_bird-synced_spanned_primary_raid1_2legs_1-real, has failed.
Feb 17 07:07:26 host-030 lvm[13214]: /dev/sda1: read failed after 0 of 2048 at 0: Input/output error
Feb 17 07:07:26 host-030 kernel: sd 2:0:0:1: rejecting I/O to offline device
Feb 17 07:07:26 host-030 lvm[13214]: /dev/sda1: read failed after 0 of 512 at 21467824128: Input/output error
[...]
Feb 17 07:07:34 host-030 qarshd[28380]: Running cmdline: pvs -a
[...]
Feb 17 07:09:34 host-030 xinetd[3157]: EXIT: qarsh status=0 pid=28380 duration=120(sec)
Feb 17 07:09:54 host-030 kernel: INFO: task clvmd:7838 blocked for more than 120 seconds.
Feb 17 07:09:54 host-030 kernel:      Not tainted 2.6.32-615.el6.x86_64 #1
Feb 17 07:09:54 host-030 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 17 07:09:54 host-030 kernel: clvmd         D 0000000000000000     0  7838      1 0x00000080
Feb 17 07:09:54 host-030 kernel: ffff88003661b5c8 0000000000000086 0000000000000000 0000000300000001
Feb 17 07:09:54 host-030 kernel: 0000000000000000 ffff880014da3e50 0000000000000282 0000000000000003
Feb 17 07:09:54 host-030 kernel: 0000000000000001 0000000000000282 ffff88003a5b7068 ffff88003661bfd8
Feb 17 07:09:54 host-030 kernel: Call Trace:
Feb 17 07:09:54 host-030 kernel: [<ffffffff814290a9>] ? md_wakeup_thread+0x39/0x80
Feb 17 07:09:54 host-030 kernel: [<ffffffffa052267b>] wait_barrier+0xdb/0x260 [raid1]
Feb 17 07:09:54 host-030 kernel: [<ffffffff810a6ac0>] ? autoremove_wake_function+0x0/0x40
Feb 17 07:09:54 host-030 kernel: [<ffffffffa052536f>] make_request+0x8f/0xe50 [raid1]
Feb 17 07:09:54 host-030 kernel: [<ffffffff81130d55>] ? mempool_alloc_slab+0x15/0x20
Feb 17 07:09:54 host-030 kernel: [<ffffffff81130ef3>] ? mempool_alloc+0x63/0x140
Feb 17 07:09:54 host-030 kernel: [<ffffffff811d611b>] ? bio_alloc_bioset+0x5b/0xf0
Feb 17 07:09:54 host-030 kernel: [<ffffffffa053e018>] raid_map+0x18/0x20 [dm_raid]
Feb 17 07:09:54 host-030 kernel: [<ffffffffa0002568>] __map_bio+0x48/0x140 [dm_mod]
Feb 17 07:09:54 host-030 kernel: [<ffffffffa0002cec>] __split_and_process_bio+0x4bc/0x5e0 [dm_mod]
Feb 17 07:09:54 host-030 kernel: [<ffffffff811d87ca>] ? dio_bio_end_io+0x5a/0x90
Feb 17 07:09:54 host-030 kernel: [<ffffffffa00043d7>] dm_make_request+0x117/0x180 [dm_mod]
Feb 17 07:09:54 host-030 kernel: [<ffffffff8127ddd0>] generic_make_request+0x240/0x5a0
Feb 17 07:09:54 host-030 kernel: [<ffffffff811d946c>] ? do_direct_IO+0x57c/0xfa0
Feb 17 07:09:54 host-030 kernel: [<ffffffffa0001fe3>] ? dm_merge_bvec+0x63/0x100 [dm_mod]
Feb 17 07:09:54 host-030 kernel: [<ffffffffa05ba0db>] ? origin_merge+0x3b/0x50 [dm_snapshot]
Feb 17 07:09:54 host-030 kernel: [<ffffffffa000205c>] ? dm_merge_bvec+0xdc/0x100 [dm_mod]
Feb 17 07:09:54 host-030 kernel: [<ffffffff8127e1a0>] submit_bio+0x70/0x120
Feb 17 07:09:54 host-030 kernel: [<ffffffff811dab0d>] __blockdev_direct_IO_newtrunc+0xc7d/0x1270
Feb 17 07:09:54 host-030 kernel: [<ffffffff811d6370>] ? blkdev_get_block+0x0/0x20
Feb 17 07:09:54 host-030 kernel: [<ffffffff811db177>] __blockdev_direct_IO+0x77/0xe0
Feb 17 07:09:54 host-030 kernel: [<ffffffff811d6370>] ? blkdev_get_block+0x0/0x20
Feb 17 07:09:54 host-030 kernel: [<ffffffff811d73f7>] blkdev_direct_IO+0x57/0x60
Feb 17 07:09:54 host-030 kernel: [<ffffffff811d6370>] ? blkdev_get_block+0x0/0x20
Feb 17 07:09:54 host-030 kernel: [<ffffffff8113043b>] generic_file_aio_read+0x6bb/0x700
Feb 17 07:09:54 host-030 kernel: [<ffffffff81242261>] ? avc_has_perm+0x71/0x90
Feb 17 07:09:54 host-030 kernel: [<ffffffff811d68e1>] blkdev_aio_read+0x51/0x80
Feb 17 07:09:54 host-030 kernel: [<ffffffff81199aaa>] do_sync_read+0xfa/0x140
Feb 17 07:09:54 host-030 kernel: [<ffffffff810a6ac0>] ? autoremove_wake_function+0x0/0x40
Feb 17 07:09:54 host-030 kernel: [<ffffffff811d670c>] ? block_ioctl+0x3c/0x40
Feb 17 07:09:54 host-030 kernel: [<ffffffff811af582>] ? vfs_ioctl+0x22/0xa0
Feb 17 07:09:54 host-030 kernel: [<ffffffff81247d9b>] ? selinux_file_permission+0xfb/0x150
Feb 17 07:09:54 host-030 kernel: [<ffffffff8123aa26>] ? security_file_permission+0x16/0x20
Feb 17 07:09:54 host-030 kernel: [<ffffffff8119a3a5>] vfs_read+0xb5/0x1a0
Feb 17 07:09:54 host-030 kernel: [<ffffffff8119b156>] ? fget_light_pos+0x16/0x50
Feb 17 07:09:54 host-030 kernel: [<ffffffff8119a6f1>] sys_read+0x51/0xb0
Feb 17 07:09:54 host-030 kernel: [<ffffffff810ee45e>] ? __audit_syscall_exit+0x25e/0x290
Feb 17 07:09:54 host-030 kernel: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
Feb 17 07:09:54 host-030 kernel: INFO: task dmeventd:28347 blocked for more than 120 seconds.
Feb 17 07:09:54 host-030 kernel:      Not tainted 2.6.32-615.el6.x86_64 #1
Feb 17 07:09:54 host-030 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 17 07:09:54 host-030 kernel: dmeventd      D 0000000000000000     0 28347      1 0x00000080
Feb 17 07:09:54 host-030 kernel: ffff88002aacf5c8 0000000000000086 0000000000000000 0000000300000001
Feb 17 07:09:54 host-030 kernel: 0000000000000000 ffff880014da3e50 0000000000000282 0000000000000003
Feb 17 07:09:54 host-030 kernel: 0000000000000001 0000000000000282 ffff880022ab9ad8 ffff88002aacffd8
Feb 17 07:09:54 host-030 kernel: Call Trace:
Feb 17 07:09:54 host-030 kernel: [<ffffffff814290a9>] ? md_wakeup_thread+0x39/0x80
Feb 17 07:09:54 host-030 kernel: [<ffffffffa052267b>] wait_barrier+0xdb/0x260 [raid1]
Feb 17 07:09:54 host-030 kernel: [<ffffffff810a6ac0>] ? autoremove_wake_function+0x0/0x40
Feb 17 07:09:54 host-030 kernel: [<ffffffffa052536f>] make_request+0x8f/0xe50 [raid1]
Feb 17 07:09:54 host-030 kernel: [<ffffffff81130d55>] ? mempool_alloc_slab+0x15/0x20
Feb 17 07:09:54 host-030 kernel: [<ffffffff81130ef3>] ? mempool_alloc+0x63/0x140
Feb 17 07:09:54 host-030 kernel: [<ffffffff81296f79>] ? cfq_set_request+0x329/0x560
Feb 17 07:09:54 host-030 kernel: [<ffffffff811d611b>] ? bio_alloc_bioset+0x5b/0xf0
Feb 17 07:09:54 host-030 kernel: [<ffffffffa053e018>] raid_map+0x18/0x20 [dm_raid]
Feb 17 07:09:54 host-030 kernel: [<ffffffffa0002568>] __map_bio+0x48/0x140 [dm_mod]
Feb 17 07:09:54 host-030 kernel: [<ffffffffa0002cec>] __split_and_process_bio+0x4bc/0x5e0 [dm_mod]
Feb 17 07:09:54 host-030 kernel: [<ffffffffa00043d7>] dm_make_request+0x117/0x180 [dm_mod]
Feb 17 07:09:54 host-030 kernel: [<ffffffff8127ddd0>] generic_make_request+0x240/0x5a0
Feb 17 07:09:54 host-030 kernel: [<ffffffff811d946c>] ? do_direct_IO+0x57c/0xfa0
Feb 17 07:09:54 host-030 kernel: [<ffffffffa0001fe3>] ? dm_merge_bvec+0x63/0x100 [dm_mod]
Feb 17 07:09:54 host-030 kernel: [<ffffffffa05ba0db>] ? origin_merge+0x3b/0x50 [dm_snapshot]
Feb 17 07:09:54 host-030 kernel: [<ffffffffa000205c>] ? dm_merge_bvec+0xdc/0x100 [dm_mod]
Feb 17 07:09:54 host-030 kernel: [<ffffffff8127e1a0>] submit_bio+0x70/0x120
Feb 17 07:09:54 host-030 kernel: [<ffffffff811dab0d>] __blockdev_direct_IO_newtrunc+0xc7d/0x1270
Feb 17 07:09:54 host-030 kernel: [<ffffffff811d6370>] ? blkdev_get_block+0x0/0x20
Feb 17 07:09:54 host-030 kernel: [<ffffffff811db177>] __blockdev_direct_IO+0x77/0xe0
Feb 17 07:09:54 host-030 kernel: [<ffffffff811d6370>] ? blkdev_get_block+0x0/0x20
Feb 17 07:09:54 host-030 kernel: [<ffffffff811d73f7>] blkdev_direct_IO+0x57/0x60
Feb 17 07:09:54 host-030 kernel: [<ffffffff811d6370>] ? blkdev_get_block+0x0/0x20
Feb 17 07:09:54 host-030 kernel: [<ffffffff8113043b>] generic_file_aio_read+0x6bb/0x700
Feb 17 07:09:54 host-030 kernel: [<ffffffff81242261>] ? avc_has_perm+0x71/0x90
Feb 17 07:09:54 host-030 kernel: [<ffffffff811d68e1>] blkdev_aio_read+0x51/0x80
Feb 17 07:09:54 host-030 kernel: [<ffffffff81199aaa>] do_sync_read+0xfa/0x140
Feb 17 07:09:54 host-030 kernel: [<ffffffff810a6ac0>] ? autoremove_wake_function+0x0/0x40
Feb 17 07:09:54 host-030 kernel: [<ffffffff811d670c>] ? block_ioctl+0x3c/0x40
Feb 17 07:09:54 host-030 kernel: [<ffffffff811af582>] ? vfs_ioctl+0x22/0xa0
Feb 17 07:09:54 host-030 kernel: [<ffffffff81247d9b>] ? selinux_file_permission+0xfb/0x150
Feb 17 07:09:54 host-030 kernel: [<ffffffff8123aa26>] ? security_file_permission+0x16/0x20
Feb 17 07:09:54 host-030 kernel: [<ffffffff8119a3a5>] vfs_read+0xb5/0x1a0
Feb 17 07:09:54 host-030 kernel: [<ffffffff8119b156>] ? fget_light_pos+0x16/0x50
Feb 17 07:09:54 host-030 kernel: [<ffffffff8119a6f1>] sys_read+0x51/0xb0
Feb 17 07:09:54 host-030 kernel: [<ffffffff810ee45e>] ? __audit_syscall_exit+0x25e/0x290
Feb 17 07:09:54 host-030 kernel: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
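
Reading the two traces: clvmd and dmeventd are both stuck in md/raid1's wait_barrier() while issuing direct block reads through the dm-snapshot/dm-raid stack, meaning the raid1 array is still holding its I/O barrier after the leg failure. If clearing that barrier requires either daemon to make progress (clvmd for cluster locking, dmeventd for device-failure event handling), the node deadlocks instead of recovering, which matches the summary. For triage of a hang like this, something along these lines can confirm the picture (illustrative, non-destructive commands):

dmsetup info -c                    # look for dm devices left in a suspended state
cat /proc/$(pidof clvmd)/stack     # where clvmd is blocked (needs CONFIG_STACKTRACE)
cat /proc/$(pidof dmeventd)/stack
echo w > /proc/sysrq-trigger       # dump all blocked (D-state) tasks to the kernel log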


Version-Release number of selected component (if applicable):
2.6.32-615.el6.x86_64

lvm2-2.02.141-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016
lvm2-libs-2.02.141-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016
lvm2-cluster-2.02.141-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016
udev-147-2.71.el6    BUILT: Wed Feb 10 07:07:17 CST 2016
device-mapper-1.02.115-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016
device-mapper-libs-1.02.115-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016
device-mapper-event-1.02.115-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016
device-mapper-event-libs-1.02.115-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016
device-mapper-persistent-data-0.6.2-0.1.rc1.el6    BUILT: Wed Feb 10 09:52:15 CST 2016
cmirror-2.02.141-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016

How reproducible:
Only once so far in many iterations
