Bug 1219222

Summary: device failure on exclusively active raid w/ snapshot (on top of cluster VG) leads to deadlock
Product: Red Hat Enterprise Linux 6
Component: lvm2
Sub component: Mirroring and RAID (RHEL6)
Version: 6.7
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Keywords: ZStream
Target Milestone: rc
Reporter: Corey Marthaler <cmarthal>
Assignee: Peter Rajnoha <prajnoha>
QA Contact: cluster-qe <cluster-qe>
CC: agk, cluster-qe, heinzm, jbrassow, lvm-team, msnitzer, prajnoha, prockai, salmy, tlavigne, zkabelac
Fixed In Version: lvm2-2.02.140-1.el6
Doc Type: Bug Fix
Doc Text:
Previously, if an exclusively activated RAID logical volume with a snapshot was being altered and other LVM commands were simultaneously run, it could cause I/O and LVM commands to hang. This could happen, for example, if 'pvs' was run at the same time as a repair of the RAID logical volume. The cause was traced to LVM commands attempting to read logical volumes while they were being altered. This issue has been resolved.
Clone Of: 1193222
Last Closed: 2016-05-11 01:16:52 UTC
Type: Bug
Bug Depends On: 1193222
Bug Blocks: 1236690, 1268411
Attachments:
- messages from 124
- strace of clvmd -d2 -f on 122
- strace of clvmd -d2 -f on 123
- strace of clvmd -d2 -f on 124
- clvmd log for raid1 with snapshot repair after device failure - pvs run in the middle - hang
- clvmd log for raid1 with snapshot repair after device failure - pvs run after all processing - pvs not hung

Description Corey Marthaler 2015-05-06 20:58:46 UTC
+++ This bug was initially created as a clone of Bug #1193222 +++

Description of problem:
This appears to be 100% reproducible when the exclusively active raid volume contains a snapshot, and isn't reproducible when a snapshot is not present. This deadlock causes the CPU load to keep increasing.


[root@host-127 ~]# pcs status
Cluster name: STSRHTS16897
Last updated: Mon Feb 16 16:07:07 2015
Last change: Mon Feb 16 15:11:31 2015
Stack: corosync
Current DC: host-129 (3) - partition with quorum
Version: 1.1.12-a14efad
3 Nodes configured
9 Resources configured


Online: [ host-127 host-128 host-129 ]

Full list of resources:

 fence-host-127 (stonith:fence_xvm):    Started host-127 
 fence-host-128 (stonith:fence_xvm):    Started host-128 
 fence-host-129 (stonith:fence_xvm):    Started host-129 
 Clone Set: dlm-clone [dlm]
     Started: [ host-127 host-128 host-129 ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ host-127 host-128 host-129 ]

PCSD Status:
  host-127: Online
  host-128: Online
  host-129: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled




black_bird -R $sts_resource_file -E host-127.virt.lab.msp.redhat.com -i 1 -F -w EXT

creating lvm devices...
Create 7 PV(s) for black_bird on host-127
Create VG black_bird on host-127
Enabling raid allocate fault policies on: host-127
================================================================================
Iteration 0.1 started at Mon Feb 16 15:12:13 CST 2015
================================================================================
Scenario kill_random_synced_raid1_2legs: Kill random leg of synced 2 leg raid1 volume(s)

********* RAID hash info for this scenario *********
* names:              synced_random_raid1_2legs_1
* sync:               1
* type:               raid1
* -m |-i value:       2
* leg devices:        /dev/sdd1 /dev/sda1 /dev/sdh1
* spanned legs:        0
* failpv(s):          /dev/sdh1
* additional snap:    /dev/sdd1
* failnode(s):        host-127
* lvmetad:            0
* raid fault policy:  allocate
******************************************************

Creating raids(s) on host-127...
host-127: lvcreate --type raid1 -m 2 -n synced_random_raid1_2legs_1 -L 500M black_bird /dev/sdd1:0-2400 /dev/sda1:0-2400 /dev/sdh1:0-2400
EXCLUSIVELY ACTIVATING RAID on host-127

Current mirror/raid device structure(s):
  LV                                     Attr       LSize   Cpy%Sync Devices
  synced_random_raid1_2legs_1            rwi-a-r--- 500.00m 0.80     synced_random_raid1_2legs_1_rimage_0(0),synced_random_raid1_2legs_1_rimage_1(0),synced_random_raid1_2legs_1_rimage_2(0)
  [synced_random_raid1_2legs_1_rimage_0] Iwi-aor--- 500.00m          /dev/sdd1(1)
  [synced_random_raid1_2legs_1_rimage_1] Iwi-aor--- 500.00m          /dev/sda1(1)
  [synced_random_raid1_2legs_1_rimage_2] Iwi-aor--- 500.00m          /dev/sdh1(1)
  [synced_random_raid1_2legs_1_rmeta_0]  ewi-aor---   4.00m          /dev/sdd1(0)
  [synced_random_raid1_2legs_1_rmeta_1]  ewi-aor---   4.00m          /dev/sda1(0)
  [synced_random_raid1_2legs_1_rmeta_2]  ewi-aor---   4.00m          /dev/sdh1(0)

Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )

Creating ext on top of mirror(s) on host-127...
mke2fs 1.42.9 (28-Dec-2013)
Mounting mirrored ext filesystems on host-127...

PV=/dev/sdh1
   synced_random_raid1_2legs_1_rimage_2: 1.0
   synced_random_raid1_2legs_1_rmeta_2: 1.0

Creating a snapshot volume of each of the raids
host-127: lvcreate -L 250M -n bb_snap1 -s black_bird/synced_random_raid1_2legs_1 /dev/sdd1
Writing verification files (checkit) to mirror(s) on...
   ---- host-127 ----

Sleeping 15 seconds to get some outstanding I/O locks before the failure
Verifying files (checkit) on mirror(s) on...
   ---- host-127 ----

Disabling device sdh on host-127

simple pvs failed/segfaulted possible regression of BZ 571963
FI_engine: inject() method failed
[DEADLOCK]




Feb 16 15:12:58 host-127 qarshd[4627]: Running cmdline: echo offline > /sys/block/sdh/device/state
Feb 16 15:12:58 host-127 crmd[2523]: notice: throttle_handle_load: High CPU load detected: 1.090000
Feb 16 15:13:01 host-127 kernel: sd 8:0:0:1: rejecting I/O to offline device
Feb 16 15:13:01 host-127 kernel: md: super_written gets error=-5, uptodate=0
Feb 16 15:13:01 host-127 kernel: md/raid1:mdX: Disk failure on dm-7, disabling device.
md/raid1:mdX: Operation continuing on 2 devices.
Feb 16 15:13:01 host-127 lvm[4293]: Device #2 of raid1 array, black_bird-synced_random_raid1_2legs_1-real, has failed.
Feb 16 15:13:01 host-127 kernel: sd 8:0:0:1: rejecting I/O to offline device
Feb 16 15:13:01 host-127 lvm[4293]: /dev/sdh1: read failed after 0 of 4096 at 26838958080: Input/output error
Feb 16 15:13:01 host-127 lvm[4293]: /dev/sdh1: read failed after 0 of 4096 at 26839048192: Input/output error
Feb 16 15:13:01 host-127 lvm[4293]: /dev/sdh1: read failed after 0 of 4096 at 0: Input/output error
Feb 16 15:13:01 host-127 lvm[4293]: /dev/sdh1: read failed after 0 of 4096 at 4096: Input/output error
Feb 16 15:13:01 host-127 lvm[4293]: Couldn't find device with uuid CvXsDs-3qf5-UQzm-UjPt-GbZc-evEe-QLYbp8.
Feb 16 15:13:01 host-127 kernel: sd 8:0:0:1: rejecting I/O to offline device
Feb 16 15:13:02 host-127 kernel: sd 8:0:0:1: rejecting I/O to offline device
Feb 16 15:13:02 host-127 kernel: sd 8:0:0:1: rejecting I/O to offline device
Feb 16 15:13:02 host-127 kernel: device-mapper: raid: Device 2 specified for rebuild: Clearing superblock
Feb 16 15:13:02 host-127 kernel: md/raid1:mdX: active with 2 out of 3 mirrors
Feb 16 15:13:02 host-127 kernel: created bitmap (1 pages) for device mdX
Feb 16 15:13:03 host-127 qarshd[4655]: Running cmdline: pvs -a
Feb 16 15:13:28 host-127 crmd[2523]: notice: throttle_handle_load: High CPU load detected: 1.670000
Feb 16 15:13:58 host-127 crmd[2523]: notice: throttle_handle_load: High CPU load detected: 2.480000
Feb 16 15:14:28 host-127 crmd[2523]: notice: throttle_handle_load: High CPU load detected: 3.080000
Feb 16 15:14:58 host-127 crmd[2523]: notice: throttle_handle_load: High CPU load detected: 3.440000
Feb 16 15:15:10 host-127 lvm[4293]: Error locking on node 1: Command timed out
Feb 16 15:15:22 host-127 kernel: INFO: task clvmd:3096 blocked for more than 120 seconds.
Feb 16 15:15:22 host-127 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 16 15:15:22 host-127 kernel: clvmd           D ffff88003fc13680     0  3096      1 0x00000080
Feb 16 15:15:22 host-127 kernel: ffff88002349fa70 0000000000000086 ffff8800234471c0 ffff88002349ffd8
Feb 16 15:15:22 host-127 kernel: ffff88002349ffd8 ffff88002349ffd8 ffff8800234471c0 ffff88003fc13f48
Feb 16 15:15:22 host-127 kernel: ffff88001edbaf00 ffff8800234471c0 0000000000000000 0000000000000000
Feb 16 15:15:22 host-127 kernel: Call Trace:
Feb 16 15:15:22 host-127 kernel: [<ffffffff8160955d>] io_schedule+0x9d/0x130
Feb 16 15:15:22 host-127 kernel: [<ffffffff81203ea3>] do_blockdev_direct_IO+0xc03/0x2620
Feb 16 15:15:22 host-127 kernel: [<ffffffff811ffa60>] ? I_BDEV+0x10/0x10
Feb 16 15:15:22 host-127 kernel: [<ffffffff81205915>] __blockdev_direct_IO+0x55/0x60
Feb 16 15:15:22 host-127 kernel: [<ffffffff811ffa60>] ? I_BDEV+0x10/0x10
Feb 16 15:15:22 host-127 kernel: [<ffffffff812000b7>] blkdev_direct_IO+0x57/0x60
Feb 16 15:15:22 host-127 kernel: [<ffffffff811ffa60>] ? I_BDEV+0x10/0x10
Feb 16 15:15:22 host-127 kernel: [<ffffffff81158213>] generic_file_aio_read+0x6d3/0x750
Feb 16 15:15:22 host-127 kernel: [<ffffffff811e61de>] ? mntput_no_expire+0x3e/0x120
Feb 16 15:15:22 host-127 kernel: [<ffffffff811e62e4>] ? mntput+0x24/0x40
Feb 16 15:15:22 host-127 kernel: [<ffffffff8120063c>] blkdev_aio_read+0x4c/0x70
Feb 16 15:15:22 host-127 kernel: [<ffffffff811c5d6d>] do_sync_read+0x8d/0xd0
Feb 16 15:15:22 host-127 kernel: [<ffffffff811c644c>] vfs_read+0x9c/0x170
Feb 16 15:15:22 host-127 kernel: [<ffffffff811c6f78>] SyS_read+0x58/0xb0
Feb 16 15:15:22 host-127 kernel: [<ffffffff81613da9>] system_call_fastpath+0x16/0x1b
Feb 16 15:15:22 host-127 kernel: INFO: task xdoio:4598 blocked for more than 120 seconds.
Feb 16 15:15:22 host-127 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 16 15:15:22 host-127 kernel: xdoio           D ffff88003fc13680     0  4598   4597 0x00000080
Feb 16 15:15:22 host-127 kernel: ffff88001edf7978 0000000000000082 ffff88001f0ec440 ffff88001edf7fd8
Feb 16 15:15:22 host-127 kernel: ffff88001edf7fd8 ffff88001edf7fd8 ffff88001f0ec440 ffff88003fc13f48
Feb 16 15:15:22 host-127 kernel: ffff88001edf7a00 0000000000000002 ffffffff81155f10 ffff88001edf79f0
Feb 16 15:15:22 host-127 kernel: Call Trace:
Feb 16 15:15:22 host-127 kernel: [<ffffffff81155f10>] ? wait_on_page_read+0x60/0x60
Feb 16 15:15:22 host-127 kernel: [<ffffffff8160955d>] io_schedule+0x9d/0x130
Feb 16 15:15:22 host-127 kernel: [<ffffffff81155f1e>] sleep_on_page+0xe/0x20
Feb 16 15:15:22 host-127 kernel: [<ffffffff8160746b>] __wait_on_bit_lock+0x5b/0xc0
Feb 16 15:15:22 host-127 kernel: [<ffffffff81156038>] __lock_page+0x78/0xa0
Feb 16 15:15:22 host-127 kernel: [<ffffffff81098270>] ? autoremove_wake_function+0x40/0x40
Feb 16 15:15:22 host-127 kernel: [<ffffffff81156974>] __find_lock_page+0x54/0x70
Feb 16 15:15:22 host-127 kernel: [<ffffffff81157462>] grab_cache_page_write_begin+0x62/0xd0
Feb 16 15:15:22 host-127 kernel: [<ffffffffa035723c>] ext4_write_begin+0x9c/0x420 [ext4]
Feb 16 15:15:22 host-127 kernel: [<ffffffff8115648d>] generic_file_buffered_write+0x11d/0x290
Feb 16 15:15:22 host-127 kernel: [<ffffffff811585f5>] __generic_file_aio_write+0x1d5/0x3e0
Feb 16 15:15:22 host-127 kernel: [<ffffffff810abe58>] ? sched_clock_cpu+0xa8/0x100
Feb 16 15:15:22 host-127 kernel: [<ffffffff8115885d>] generic_file_aio_write+0x5d/0xc0
Feb 16 15:15:22 host-127 kernel: [<ffffffffa034cb75>] ext4_file_write+0xb5/0x460 [ext4]
Feb 16 15:15:22 host-127 kernel: [<ffffffff811e22fb>] ? iput+0x3b/0x180
Feb 16 15:15:22 host-127 kernel: [<ffffffff810d1e05>] ? drop_futex_key_refs.isra.13+0x35/0x70
Feb 16 15:15:22 host-127 kernel: [<ffffffff811c5ef9>] do_sync_readv_writev+0x79/0xd0
Feb 16 15:15:22 host-127 kernel: [<ffffffff811c73ee>] do_readv_writev+0xce/0x260
Feb 16 15:15:22 host-127 kernel: [<ffffffffa034cac0>] ? ext4_file_mmap+0x30/0x30 [ext4]
Feb 16 15:15:22 host-127 kernel: [<ffffffff811c5db0>] ? do_sync_read+0xd0/0xd0
Feb 16 15:15:22 host-127 kernel: [<ffffffff81052b0f>] ? kvm_clock_get_cycles+0x1f/0x30
Feb 16 15:15:22 host-127 kernel: [<ffffffff810c895a>] ? __getnstimeofday+0x3a/0xd0
Feb 16 15:15:22 host-127 kernel: [<ffffffff811c7615>] vfs_writev+0x35/0x60
Feb 16 15:15:22 host-127 kernel: [<ffffffff811c776c>] SyS_writev+0x5c/0xd0
Feb 16 15:15:22 host-127 kernel: [<ffffffff81613da9>] system_call_fastpath+0x16/0x1b
Feb 16 15:15:27 host-127 systemd: Starting Session 3 of user root.
Feb 16 15:15:27 host-127 systemd: Started Session 3 of user root.
Feb 16 15:15:27 host-127 systemd-logind: New session 3 of user root.
Feb 16 15:15:28 host-127 crmd[2523]: notice: throttle_handle_load: High CPU load detected: 3.660000
Feb 16 15:15:58 host-127 crmd[2523]: notice: throttle_handle_load: High CPU load detected: 3.790000
Feb 16 15:16:11 host-127 lvm[4293]: Error locking on node 1: Command timed out
Feb 16 15:16:11 host-127 lvm[4293]: Problem reactivating logical volume black_bird/synced_random_raid1_2legs_1.
Feb 16 15:16:28 host-127 crmd[2523]: notice: throttle_handle_load: High CPU load detected: 3.880000
Feb 16 15:16:58 host-127 crmd[2523]: notice: throttle_handle_load: High CPU load detected: 3.920000
Feb 16 15:17:22 host-127 kernel: INFO: task kworker/u2:2:263 blocked for more than 120 seconds.
Feb 16 15:17:22 host-127 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 16 15:17:22 host-127 kernel: kworker/u2:2    D ffff88003fc13680     0   263      2 0x00000000
Feb 16 15:17:22 host-127 kernel: Workqueue: writeback bdi_writeback_workfn (flush-253:8)
Feb 16 15:17:22 host-127 kernel: ffff880036cab8d8 0000000000000046 ffff880036c8c440 ffff880036cabfd8
Feb 16 15:17:22 host-127 kernel: ffff880036cabfd8 ffff880036cabfd8 ffff880036c8c440 ffff88003fc13f48
Feb 16 15:17:22 host-127 kernel: ffff880036cab960 0000000000000002 ffffffff81155f10 ffff880036cab950
Feb 16 15:17:22 host-127 kernel: Call Trace:
Feb 16 15:17:22 host-127 kernel: [<ffffffff81155f10>] ? wait_on_page_read+0x60/0x60
Feb 16 15:17:22 host-127 kernel: [<ffffffff8160955d>] io_schedule+0x9d/0x130
Feb 16 15:17:22 host-127 kernel: [<ffffffff81155f1e>] sleep_on_page+0xe/0x20
Feb 16 15:17:22 host-127 kernel: [<ffffffff8160746b>] __wait_on_bit_lock+0x5b/0xc0
Feb 16 15:17:22 host-127 kernel: [<ffffffff81156038>] __lock_page+0x78/0xa0
Feb 16 15:17:22 host-127 kernel: [<ffffffff81098270>] ? autoremove_wake_function+0x40/0x40
Feb 16 15:17:22 host-127 kernel: [<ffffffffa0351f90>] mpage_prepare_extent_to_map+0x2d0/0x2e0 [ext4]
Feb 16 15:17:22 host-127 kernel: [<ffffffffa0356043>] ext4_writepages+0x463/0xd60 [ext4]
Feb 16 15:17:22 host-127 kernel: [<ffffffff81161ac8>] ? generic_writepages+0x58/0x80
Feb 16 15:17:22 host-127 kernel: [<ffffffff81162b6e>] do_writepages+0x1e/0x40
Feb 16 15:17:22 host-127 kernel: [<ffffffff811efe10>] __writeback_single_inode+0x40/0x220
Feb 16 15:17:22 host-127 kernel: [<ffffffff811f0b0e>] writeback_sb_inodes+0x25e/0x420
Feb 16 15:17:22 host-127 kernel: [<ffffffff811f0d6f>] __writeback_inodes_wb+0x9f/0xd0
Feb 16 15:17:22 host-127 kernel: [<ffffffff811f15b3>] wb_writeback+0x263/0x2f0
Feb 16 15:17:22 host-127 kernel: [<ffffffff811e027c>] ? get_nr_inodes+0x4c/0x70
Feb 16 15:17:22 host-127 kernel: [<ffffffff811f2beb>] bdi_writeback_workfn+0x2cb/0x460
Feb 16 15:17:22 host-127 kernel: [<ffffffff8108f0ab>] process_one_work+0x17b/0x470
Feb 16 15:17:22 host-127 kernel: [<ffffffff8108fe8b>] worker_thread+0x11b/0x400
Feb 16 15:17:22 host-127 kernel: [<ffffffff8108fd70>] ? rescuer_thread+0x400/0x400
Feb 16 15:17:22 host-127 kernel: [<ffffffff8109726f>] kthread+0xcf/0xe0
Feb 16 15:17:22 host-127 kernel: [<ffffffff810971a0>] ? kthread_create_on_node+0x140/0x140
Feb 16 15:17:22 host-127 kernel: [<ffffffff81613cfc>] ret_from_fork+0x7c/0xb0
Feb 16 15:17:22 host-127 kernel: [<ffffffff810971a0>] ? kthread_create_on_node+0x140/0x140
Feb 16 15:17:22 host-127 kernel: INFO: task clvmd:3096 blocked for more than 120 seconds.
Feb 16 15:17:22 host-127 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 16 15:17:22 host-127 kernel: clvmd           D ffff88003fc13680     0  3096      1 0x00000080
Feb 16 15:17:22 host-127 kernel: ffff88002349fa70 0000000000000086 ffff8800234471c0 ffff88002349ffd8
Feb 16 15:17:22 host-127 kernel: ffff88002349ffd8 ffff88002349ffd8 ffff8800234471c0 ffff88003fc13f48
Feb 16 15:17:22 host-127 kernel: ffff88001edbaf00 ffff8800234471c0 0000000000000000 0000000000000000
Feb 16 15:17:22 host-127 kernel: Call Trace:
Feb 16 15:17:22 host-127 kernel: [<ffffffff8160955d>] io_schedule+0x9d/0x130
Feb 16 15:17:22 host-127 kernel: [<ffffffff81203ea3>] do_blockdev_direct_IO+0xc03/0x2620
Feb 16 15:17:22 host-127 kernel: [<ffffffff811ffa60>] ? I_BDEV+0x10/0x10
Feb 16 15:17:22 host-127 kernel: [<ffffffff81205915>] __blockdev_direct_IO+0x55/0x60
Feb 16 15:17:22 host-127 kernel: [<ffffffff811ffa60>] ? I_BDEV+0x10/0x10
Feb 16 15:17:22 host-127 kernel: [<ffffffff812000b7>] blkdev_direct_IO+0x57/0x60
Feb 16 15:17:22 host-127 kernel: [<ffffffff811ffa60>] ? I_BDEV+0x10/0x10
Feb 16 15:17:22 host-127 kernel: [<ffffffff81158213>] generic_file_aio_read+0x6d3/0x750
Feb 16 15:17:22 host-127 kernel: [<ffffffff811e61de>] ? mntput_no_expire+0x3e/0x120
Feb 16 15:17:22 host-127 kernel: [<ffffffff811e62e4>] ? mntput+0x24/0x40
Feb 16 15:17:22 host-127 kernel: [<ffffffff8120063c>] blkdev_aio_read+0x4c/0x70
Feb 16 15:17:22 host-127 kernel: [<ffffffff811c5d6d>] do_sync_read+0x8d/0xd0
Feb 16 15:17:22 host-127 kernel: [<ffffffff811c644c>] vfs_read+0x9c/0x170
Feb 16 15:17:22 host-127 kernel: [<ffffffff811c6f78>] SyS_read+0x58/0xb0
Feb 16 15:17:22 host-127 kernel: [<ffffffff81613da9>] system_call_fastpath+0x16/0x1b
Feb 16 15:17:28 host-127 crmd[2523]: notice: throttle_handle_load: High CPU load detected: 3.950000
Feb 16 15:17:58 host-127 crmd[2523]: notice: throttle_handle_load: High CPU load detected: 3.970000
Feb 16 15:18:11 host-127 lvm[4293]: Error locking on node 1: Command timed out
Feb 16 15:18:11 host-127 lvm[4293]: Failed to replace faulty devices in black_bird/synced_random_raid1_2legs_1.
Feb 16 15:18:28 host-127 crmd[2523]: notice: throttle_handle_load: High CPU load detected: 3.980000
Feb 16 15:18:58 host-127 crmd[2523]: notice: throttle_handle_load: High CPU load detected: 3.990000
Feb 16 15:19:28 host-127 crmd[2523]: notice: throttle_handle_load: High CPU load detected: 3.990000
Feb 16 15:19:58 host-127 crmd[2523]: notice: throttle_handle_load: High CPU load detected: 4.000000
Feb 16 15:20:11 host-127 lvm[4293]: Error locking on node 1: Command timed out
Feb 16 15:20:28 host-127 crmd[2523]: notice: throttle_handle_load: High CPU load detected: 4.000000
Feb 16 15:20:58 host-127 crmd[2523]: notice: throttle_handle_load: High CPU load detected: 4.000000

[...]

Feb 16 15:41:59 host-127 crmd[2523]: notice: throttle_handle_load: High CPU load detected: 5.000000
Feb 16 15:42:29 host-127 crmd[2523]: notice: throttle_handle_load: High CPU load detected: 5.000000
Feb 16 15:42:59 host-127 crmd[2523]: notice: throttle_handle_load: High CPU load detected: 5.000000




Version-Release number of selected component (if applicable):
3.10.0-229.el7.x86_64
lvm2-2.02.115-3.el7    BUILT: Wed Jan 28 09:59:01 CST 2015
lvm2-libs-2.02.115-3.el7    BUILT: Wed Jan 28 09:59:01 CST 2015
lvm2-cluster-2.02.115-3.el7    BUILT: Wed Jan 28 09:59:01 CST 2015
device-mapper-1.02.93-3.el7    BUILT: Wed Jan 28 09:59:01 CST 2015
device-mapper-libs-1.02.93-3.el7    BUILT: Wed Jan 28 09:59:01 CST 2015
device-mapper-event-1.02.93-3.el7    BUILT: Wed Jan 28 09:59:01 CST 2015
device-mapper-event-libs-1.02.93-3.el7    BUILT: Wed Jan 28 09:59:01 CST 2015
device-mapper-persistent-data-0.4.1-2.el7    BUILT: Wed Nov 12 12:39:46 CST 2014
cmirror-2.02.115-3.el7    BUILT: Wed Jan 28 09:59:01 CST 2015


How reproducible:
Every time

--- Additional comment from Corey Marthaler on 2015-02-16 17:13:39 EST ---

Comment 1 Corey Marthaler 2015-05-06 21:05:06 UTC
This exists in rhel6.7 as well. 


2.6.32-554.el6.x86_64
lvm2-2.02.118-2.el6    BUILT: Wed Apr 15 06:34:08 CDT 2015
lvm2-libs-2.02.118-2.el6    BUILT: Wed Apr 15 06:34:08 CDT 2015
lvm2-cluster-2.02.118-2.el6    BUILT: Wed Apr 15 06:34:08 CDT 2015
udev-147-2.61.el6    BUILT: Mon Mar  2 05:08:11 CST 2015
device-mapper-1.02.95-2.el6    BUILT: Wed Apr 15 06:34:08 CDT 2015
device-mapper-libs-1.02.95-2.el6    BUILT: Wed Apr 15 06:34:08 CDT 2015
device-mapper-event-1.02.95-2.el6    BUILT: Wed Apr 15 06:34:08 CDT 2015
device-mapper-event-libs-1.02.95-2.el6    BUILT: Wed Apr 15 06:34:08 CDT 2015
device-mapper-persistent-data-0.3.2-1.el6    BUILT: Fri Apr  4 08:43:06 CDT 2014
cmirror-2.02.118-2.el6    BUILT: Wed Apr 15 06:34:08 CDT 2015

[root@host-122 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M     92   2015-05-06 15:48:58  host-122
   2   M    100   2015-05-06 15:48:59  host-123
   3   M     96   2015-05-06 15:48:59  host-124

================================================================================
Iteration 0.1 started at Wed May  6 14:44:30 CDT 2015
================================================================================
Scenario kill_primary_synced_raid4_2legs: Kill primary leg of synced 2 leg raid4 volume(s)

********* RAID hash info for this scenario *********
* names:              synced_primary_raid4_2legs_1
* sync:               1
* type:               raid4
* -m |-i value:       2
* leg devices:        /dev/sda1 /dev/sdb1 /dev/sde1
* spanned legs:        0
* failpv(s):          /dev/sda1
* additional snap:    /dev/sdb1
* failnode(s):        host-122
* lvmetad:            0
* raid fault policy:  allocate
******************************************************

Creating raids(s) on host-122...
host-122: lvcreate --type raid4 -i 2 -n synced_primary_raid4_2legs_1 -L 500M black_bird /dev/sda1:0-2400 /dev/sdb1:0-2400 /dev/sde1:0-2400
EXCLUSIVELY ACTIVATING RAID on host-122

Current mirror/raid device structure(s):
  LV                                      Attr       LSize   Cpy%Sync Devices
   synced_primary_raid4_2legs_1            rwi-a-r--- 504.00m 3.17     synced_primary_raid4_2legs_1_rimage_0(0),synced_primary_raid4_2legs_1_rimage_1(0),synced_primary_raid4_2legs_1_rimage_2(0)
   [synced_primary_raid4_2legs_1_rimage_0] Iwi-aor--- 252.00m          /dev/sda1(1)
   [synced_primary_raid4_2legs_1_rimage_1] Iwi-aor--- 252.00m          /dev/sdb1(1)
   [synced_primary_raid4_2legs_1_rimage_2] Iwi-aor--- 252.00m          /dev/sde1(1)
   [synced_primary_raid4_2legs_1_rmeta_0]  ewi-aor---   4.00m          /dev/sda1(0)
   [synced_primary_raid4_2legs_1_rmeta_1]  ewi-aor---   4.00m          /dev/sdb1(0)
   [synced_primary_raid4_2legs_1_rmeta_2]  ewi-aor---   4.00m          /dev/sde1(0)

Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )

Creating ext on top of mirror(s) on host-122...
mke2fs 1.41.12 (17-May-2010)
Mounting mirrored ext filesystems on host-122...

PV=/dev/sda1
   synced_primary_raid4_2legs_1_rimage_0: 1.0
   synced_primary_raid4_2legs_1_rmeta_0: 1.0

Creating a snapshot volume of each of the raids
Writing verification files (checkit) to mirror(s) on...
   ---- host-122 ----

Sleeping 15 seconds to get some outstanding EXT I/O locks before the failure
Verifying files (checkit) on mirror(s) on...
   ---- host-122 ----

Disabling device sda on host-122

simple pvs failed
FI_engine: inject() method failed


host-122 qarshd[3711]: Running cmdline: pvs -a
host-122 kernel: sd 2:0:0:1: rejecting I/O to offline device
host-122 kernel: sd 2:0:0:1: rejecting I/O to offline device
host-122 kernel: sd 2:0:0:1: rejecting I/O to offline device
host-122 kernel: sd 2:0:0:1: rejecting I/O to offline device
host-122 kernel: sd 2:0:0:1: rejecting I/O to offline device
host-122 xinetd[2395]: EXIT: qarsh status=0 pid=3711 duration=67(sec)
host-122 lvm[15257]: Error locking on node host-122: Command timed out
host-122 kernel: INFO: task clvmd:29382 blocked for more than 120 seconds.
host-122 kernel:      Not tainted 2.6.32-554.el6.x86_64 #1
host-122 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
host-122 kernel: clvmd         D 0000000000000000     0 29382      1 0x00000080
host-122 kernel: ffff88003a53fa18 0000000000000082 ffff880000000000 0003af8a169e9404
host-122 kernel: ffff88000591b6e0 ffff88002166f130 0000000005e3ab6c ffffffffaabbf2f5
host-122 kernel: 0000000012b294b7 ffffffff81aa6600 ffff880039cc7ad8 ffff88003a53ffd8
host-122 kernel: Call Trace:
host-122 kernel: [<ffffffff81537bc3>] io_schedule+0x73/0xc0
host-122 kernel: [<ffffffff811d1e8d>] __blockdev_direct_IO_newtrunc+0xb7d/0x1270
host-122 kernel: [<ffffffff811cd7f0>] ? blkdev_get_block+0x0/0x20
host-122 kernel: [<ffffffff811d25f7>] __blockdev_direct_IO+0x77/0xe0
host-122 kernel: [<ffffffff811cd7f0>] ? blkdev_get_block+0x0/0x20
host-122 kernel: [<ffffffff811ce877>] blkdev_direct_IO+0x57/0x60
host-122 kernel: [<ffffffff811cd7f0>] ? blkdev_get_block+0x0/0x20
host-122 kernel: [<ffffffff811290db>] generic_file_aio_read+0x6bb/0x700
host-122 kernel: [<ffffffff812386d1>] ? avc_has_perm+0x71/0x90
host-122 kernel: [<ffffffff8123a582>] ? selinux_inode_permission+0x72/0xb0
host-122 kernel: [<ffffffff811cdd61>] blkdev_aio_read+0x51/0x80
host-122 kernel: [<ffffffff811913fa>] do_sync_read+0xfa/0x140
host-122 kernel: [<ffffffff810a14a0>] ? autoremove_wake_function+0x0/0x40
host-122 kernel: [<ffffffff811cdb8c>] ? block_ioctl+0x3c/0x40
host-122 kernel: [<ffffffff811a6dd2>] ? vfs_ioctl+0x22/0xa0
host-122 kernel: [<ffffffff8123e13b>] ? selinux_file_permission+0xfb/0x150
host-122 kernel: [<ffffffff81230ea6>] ? security_file_permission+0x16/0x20
host-122 kernel: [<ffffffff81191cf5>] vfs_read+0xb5/0x1a0
host-122 kernel: [<ffffffff81192aa6>] ? fget_light_pos+0x16/0x50
host-122 kernel: [<ffffffff81192041>] sys_read+0x51/0xb0
host-122 kernel: [<ffffffff810e882e>] ? __audit_syscall_exit+0x25e/0x290
host-122 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
host-122 kernel: INFO: task flush-253:8:3585 blocked for more than 120 seconds.
host-122 kernel:      Not tainted 2.6.32-554.el6.x86_64 #1
host-122 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
host-122 kernel: flush-253:8   D 0000000000000000     0  3585      2 0x00000080
host-122 kernel: ffff88000b8779a0 0000000000000046 ffff880000000000 0003af909dd2e65a
host-122 kernel: ffff88000b877960 ffff880007b3dbb0 0000000005e3bb62 ffffffffaabbf2f5
host-122 kernel: 0000000012b294b7 ffffffff81aa6600 ffff880037e9e5f8 ffff88000b877fd8
host-122 kernel: Call Trace:
host-122 kernel: [<ffffffff81127240>] ? sync_page+0x0/0x50
host-122 kernel: [<ffffffff81537bc3>] io_schedule+0x73/0xc0
host-122 kernel: [<ffffffff8112727d>] sync_page+0x3d/0x50
host-122 kernel: [<ffffffff8153845a>] __wait_on_bit_lock+0x5a/0xc0
host-122 kernel: [<ffffffff81127217>] __lock_page+0x67/0x70
host-122 kernel: [<ffffffff810a1520>] ? wake_bit_function+0x0/0x50
host-122 kernel: [<ffffffff81127357>] ? unlock_page+0x27/0x30
host-122 kernel: [<ffffffff8113c51d>] write_cache_pages+0x3cd/0x4c0
host-122 kernel: [<ffffffff8113b070>] ? __writepage+0x0/0x40
host-122 kernel: [<ffffffff8113c634>] generic_writepages+0x24/0x30
host-122 kernel: [<ffffffff8113c675>] do_writepages+0x35/0x40
host-122 kernel: [<ffffffff811be40d>] writeback_single_inode+0xdd/0x290
host-122 kernel: [<ffffffff811be80d>] writeback_sb_inodes+0xbd/0x170
host-122 kernel: [<ffffffff811be96b>] writeback_inodes_wb+0xab/0x1b0
host-122 kernel: [<ffffffff811bed63>] wb_writeback+0x2f3/0x410
host-122 kernel: [<ffffffff8108a772>] ? del_timer_sync+0x22/0x30
host-122 kernel: [<ffffffff811bf025>] wb_do_writeback+0x1a5/0x240
host-122 kernel: [<ffffffff811bf123>] bdi_writeback_task+0x63/0x1b0
host-122 kernel: [<ffffffff810a1327>] ? bit_waitqueue+0x17/0xd0
host-122 kernel: [<ffffffff8114b380>] ? bdi_start_fn+0x0/0x100
host-122 kernel: [<ffffffff8114b406>] bdi_start_fn+0x86/0x100
host-122 kernel: [<ffffffff8114b380>] ? bdi_start_fn+0x0/0x100
host-122 kernel: [<ffffffff810a100e>] kthread+0x9e/0xc0
host-122 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
host-122 kernel: [<ffffffff810a0f70>] ? kthread+0x0/0xc0
host-122 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20

Comment 4 Corey Marthaler 2015-05-26 21:04:59 UTC
FWIW, this is reproducible on physical machines.

[root@taft-02 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M     68   2015-05-26 14:21:30  taft-01
   2   M     68   2015-05-26 14:21:30  taft-02
   4   M     68   2015-05-26 14:21:30  taft-04

[root@taft-02 ~]# cman_tool services
fence domain
member count  3
victim count  0
victim now    0
master nodeid 1
wait state    none
members       1 2 4 

dlm lockspaces
name          clvmd
id            0x4104eefa
flags         0x00000000 
change        member 3 joined 1 remove 0 failed 0 seq 2,2
members       1 2 4 

./black_bird -R ../../var/share/resource_files/taft.xml -l /home/msp/cmarthal/work/sts/sts-root -r /usr/tests/sts-rhel6.7  -E taft-02 -i 2 -F -w EXT
8 disk(s) to be used:
        taft-01=/dev/mapper/mpathf /dev/mapper/mpathg /dev/mapper/mpathe /dev/mapper/mpathh /dev/mapper/mpathb /dev/mapper/mpatha /dev/mapper/mpathc /dev/mapper/mpathd
        taft-02=/dev/mapper/mpathf /dev/mapper/mpathg /dev/mapper/mpathe /dev/mapper/mpathh /dev/mapper/mpathc /dev/mapper/mpathd /dev/mapper/mpathb /dev/mapper/mpatha
        taft-04=/dev/mapper/mpathh /dev/mapper/mpathe /dev/mapper/mpathd /dev/mapper/mpathf /dev/mapper/mpathg /dev/mapper/mpathc /dev/mapper/mpathb /dev/mapper/mpatha
[...]
creating lvm devices...
Create 7 PV(s) for black_bird on taft-01
Create VG black_bird on taft-01
Enabling raid allocate fault policies on: taft-02
================================================================================
Iteration 0.1 started at Tue May 26 15:52:49 CDT 2015
================================================================================
Scenario kill_primary_synced_raid1_2legs: Kill primary leg of synced 2 leg raid1 volume(s)

********* RAID hash info for this scenario *********
* names:              synced_primary_raid1_2legs_1
* sync:               1
* type:               raid1
* -m |-i value:       2
* leg devices:        /dev/mapper/mpathbp1 /dev/mapper/mpathcp1 /dev/mapper/mpathdp1
* spanned legs:        0
* failpv(s):          /dev/mapper/mpathbp1
* additional snap:    /dev/mapper/mpathcp1
* failnode(s):        taft-02
* lvmetad:            0
* raid fault policy:  allocate
******************************************************

Creating raids(s) on taft-02...
taft-02: lvcreate --type raid1 -m 2 -n synced_primary_raid1_2legs_1 -L 500M black_bird /dev/mapper/mpathbp1:0-2400 /dev/mapper/mpathcp1:0-2400 /dev/mapper/mpathdp1:0-2400
EXCLUSIVELY ACTIVATING RAID on taft-02

Current mirror/raid device structure(s):
  LV                                      Attr       LSize   Cpy%Sync Devices
  synced_primary_raid1_2legs_1            rwi-a-r--- 500.00m 1.60     synced_primary_raid1_2legs_1_rimage_0(0),synced_primary_raid1_2legs_1_rimage_1(0),synced_primary_raid1_2legs_1_rimage_2(0)
  [synced_primary_raid1_2legs_1_rimage_0] Iwi-aor--- 500.00m          /dev/mapper/mpathbp1(1)
  [synced_primary_raid1_2legs_1_rimage_1] Iwi-aor--- 500.00m          /dev/mapper/mpathcp1(1)
  [synced_primary_raid1_2legs_1_rimage_2] Iwi-aor--- 500.00m          /dev/mapper/mpathdp1(1)
  [synced_primary_raid1_2legs_1_rmeta_0]  ewi-aor---   4.00m          /dev/mapper/mpathbp1(0)
  [synced_primary_raid1_2legs_1_rmeta_1]  ewi-aor---   4.00m          /dev/mapper/mpathcp1(0)
  [synced_primary_raid1_2legs_1_rmeta_2]  ewi-aor---   4.00m          /dev/mapper/mpathdp1(0)


Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )

Creating ext on top of mirror(s) on taft-02...
mke2fs 1.41.12 (17-May-2010)
Mounting mirrored ext filesystems on taft-02...

PV=/dev/mapper/mpathbp1
        synced_primary_raid1_2legs_1_rimage_0: 1.0
        synced_primary_raid1_2legs_1_rmeta_0: 1.0

Creating a snapshot volume of each of the raids
Writing verification files (checkit) to mirror(s) on...
        ---- taft-02 ----

Sleeping 15 seconds to get some outstanding EXT I/O locks before the failure
Verifying files (checkit) on mirror(s) on...
        ---- taft-02 ----

Failing path sdc on taft-02
Failing path sdk on taft-02
Failing path sdaa on taft-02
Failing path sds on taft-02

simple pvs failed
[DEADLOCK]

May 26 15:57:28 taft-02 kernel: INFO: task clvmd:10589 blocked for more than 120 seconds.
May 26 15:57:28 taft-02 kernel:      Not tainted 2.6.32-562.el6.x86_64 #1
May 26 15:57:28 taft-02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 26 15:57:28 taft-02 kernel: clvmd         D 0000000000000001     0 10589      1 0x00000080
May 26 15:57:28 taft-02 kernel: ffff8802196fba18 0000000000000082 ffff8802196fb978 ffffffff8126f104
May 26 15:57:28 taft-02 kernel: ffff880217202d00 ffff880219d88900 ffff8802196fb9e8 ffffffffa0004daf
May 26 15:57:28 taft-02 kernel: ffff8802196fb9d8 ffff8802196fb9d8 ffff880207c5dad8 ffff8802196fbfd8
May 26 15:57:28 taft-02 kernel: Call Trace:
May 26 15:57:28 taft-02 kernel: [<ffffffff8126f104>] ? blk_unplug+0x34/0x70
May 26 15:57:28 taft-02 kernel: [<ffffffffa0004daf>] ? dm_table_unplug_all+0x5f/0x100 [dm_mod]
May 26 15:57:28 taft-02 kernel: [<ffffffff81538313>] io_schedule+0x73/0xc0
May 26 15:57:28 taft-02 kernel: [<ffffffff811d205d>] __blockdev_direct_IO_newtrunc+0xb7d/0x1270
May 26 15:57:28 taft-02 kernel: [<ffffffff811cd9c0>] ? blkdev_get_block+0x0/0x20
May 26 15:57:28 taft-02 kernel: [<ffffffff811d27c7>] __blockdev_direct_IO+0x77/0xe0
May 26 15:57:28 taft-02 kernel: [<ffffffff811cd9c0>] ? blkdev_get_block+0x0/0x20
May 26 15:57:28 taft-02 kernel: [<ffffffff811cea47>] blkdev_direct_IO+0x57/0x60
May 26 15:57:28 taft-02 kernel: [<ffffffff811cd9c0>] ? blkdev_get_block+0x0/0x20
May 26 15:57:28 taft-02 kernel: [<ffffffff8112910b>] generic_file_aio_read+0x6bb/0x700
May 26 15:57:28 taft-02 kernel: [<ffffffff812388c1>] ? avc_has_perm+0x71/0x90
May 26 15:57:28 taft-02 kernel: [<ffffffff8123a772>] ? selinux_inode_permission+0x72/0xb0
May 26 15:57:28 taft-02 kernel: [<ffffffff811cdf31>] blkdev_aio_read+0x51/0x80
May 26 15:57:28 taft-02 kernel: [<ffffffff811915ca>] do_sync_read+0xfa/0x140
May 26 15:57:28 taft-02 kernel: [<ffffffff810a14b0>] ? autoremove_wake_function+0x0/0x40
May 26 15:57:28 taft-02 kernel: [<ffffffff811cdd5c>] ? block_ioctl+0x3c/0x40
May 26 15:57:28 taft-02 kernel: [<ffffffff811a6fa2>] ? vfs_ioctl+0x22/0xa0
May 26 15:57:28 taft-02 kernel: [<ffffffff8123e32b>] ? selinux_file_permission+0xfb/0x150
May 26 15:57:28 taft-02 kernel: [<ffffffff81231096>] ? security_file_permission+0x16/0x20
May 26 15:57:28 taft-02 kernel: [<ffffffff81191ec5>] vfs_read+0xb5/0x1a0
May 26 15:57:28 taft-02 kernel: [<ffffffff81192c76>] ? fget_light_pos+0x16/0x50
May 26 15:57:28 taft-02 kernel: [<ffffffff81192211>] sys_read+0x51/0xb0
May 26 15:57:28 taft-02 kernel: [<ffffffff810e885e>] ? __audit_syscall_exit+0x25e/0x290
May 26 15:57:28 taft-02 kernel: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b

Comment 11 Corey Marthaler 2015-06-05 17:20:49 UTC
Created attachment 1035388 [details]
messages from 124

Comment 12 Corey Marthaler 2015-06-05 17:22:05 UTC
Created attachment 1035389 [details]
strace of clvmd -d2 -f on 122

Comment 13 Corey Marthaler 2015-06-05 17:22:35 UTC
Created attachment 1035390 [details]
strace of clvmd -d2 -f on 123

Comment 14 Corey Marthaler 2015-06-05 17:23:09 UTC
Created attachment 1035391 [details]
strace of clvmd -d2 -f on 124

Comment 18 Peter Rajnoha 2015-06-15 11:07:26 UTC
I've managed to reproduce with this:

1) /etc/init.d/clvmd stop
2) clvmd -f -d1
3) vgcreate vg /dev/sd{a..p}
4) lvcreate --type raid1 -m 2 -L 100M -aey vg
5) lvcreate -L50m -s vg/lvol0
6) echo 1 > /sys/block/sda/device/delete
7) dd if=/dev/zero of=/dev/sda count=1 (sda had rimage0 and rmeta0)
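
The same steps as a single script (just a sketch - the LV names lvol0/lvol1 are the defaults, and the disk names match this particular test box):

  #!/bin/bash
  # Reproducer sketch for the deadlock described in this bz.
  /etc/init.d/clvmd stop
  clvmd -f -d1 &
  sleep 2                                      # give clvmd time to come up
  vgcreate vg /dev/sd{a..p}
  lvcreate --type raid1 -m 2 -L 100M -aey vg   # becomes vg/lvol0
  lvcreate -L 50M -s vg/lvol0                  # snapshot vg/lvol1
  echo 1 > /sys/block/sda/device/delete        # sda held rimage_0 and rmeta_0
  dd if=/dev/zero of=/dev/sda count=1          # force I/O to the failed leg
  pvs                                          # run during the repair -> hang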

There's one line in the clvmd log pointing to something being done incorrectly:

"Inconsistent pre-commit metadata copies for volume group vg"

While clvmd is processing these requests, running pvs in the middle of the operation produces a hang.

If I wait for clvmd to finish all the processing, pvs is fine (though the "inconsistent pre-commit metadata" message is still in the log).

So something is wrong here; it looks like unprotected operations are running in parallel. I'll check the logs to track down the source of the problem.

Comment 19 Peter Rajnoha 2015-06-15 11:08:28 UTC
Created attachment 1038894 [details]
clvmd log for raid1 with snapshot repair after device failure - pvs run in the middle - hang

Comment 20 Peter Rajnoha 2015-06-15 11:09:18 UTC
Created attachment 1038895 [details]
clvmd log for raid1 with snapshot repair after device failure - pvs run after all processing - pvs not hung

Comment 21 Peter Rajnoha 2015-06-15 11:15:09 UTC
I've attached the logs from clvmd for the case where pvs fails (that happens when I run pvs in the middle of clvmd processing operations, presumably ones coming from dmeventd). That pvs fails while trying to obtain the global lock from clvmd. The other log shows the situation in which I waited for all the clvmd operations to finish (and hence pvs passed just fine). I haven't looked at these logs in detail yet, so I still need to go through them... Hopefully they will point us to the source of the problem.

Comment 22 Peter Rajnoha 2015-06-15 12:59:16 UTC
The dmeventd thread is hung here:

#0  0x00007fb585bab7dd in read () from /lib64/libpthread.so.0
#1  0x00007fb58464d002 in _send_request (inbuf=0x7fb58632bfd0 "3", inlen=24, retbuf=0x7fb58632c058) at locking/cluster_locking.c:120
#2  0x00007fb58464d4a7 in _cluster_request (clvmd_cmd=51 '3', node=0x7fb5846cb29d "", data=0x7fb58632c0c0, len=7, response=0x7fb58632c108, num=0x7fb58632c120) at locking/cluster_locking.c:224
#3  0x00007fb58464da99 in _lock_for_cluster (cmd=0x7fb5886b0d10, clvmd_cmd=51 '3', flags=292, name=0x7fb58632c200 "P_vg") at locking/cluster_locking.c:383
#4  0x00007fb58464e23c in _lock_resource (cmd=0x7fb5886b0d10, resource=0x7fb58632d2b0 "vg", flags=292, lv=0x0) at locking/cluster_locking.c:515
#5  0x00007fb5845e6036 in _lock_vol (cmd=0x7fb5886b0d10, resource=0x7fb58632d2b0 "vg", flags=292, lv_op=LV_NOOP, lv=0x0) at locking/locking.c:274
#6  0x00007fb5845e6372 in lock_vol (cmd=0x7fb5886b0d10, vol=0x7fb57c066a0a "vg", flags=292, lv=0x0) at locking/locking.c:354
#7  0x00007fb5846122df in vg_commit (vg=0x7fb57c0668c0) at metadata/metadata.c:2979
#8  0x00007fb584600717 in _lv_update_and_reload (lv=0x7fb57c068198, origin_only=1) at metadata/lv_manip.c:5968
#9  0x00007fb584600969 in lv_update_and_reload_origin (lv=0x7fb57c068198) at metadata/lv_manip.c:5995
#10 0x00007fb58462c916 in lv_raid_replace (lv=0x7fb57c068198, remove_pvs=0x7fb57c069c98, allocate_pvs=0x7fb57c066988) at metadata/raid_manip.c:1788
#11 0x00007fb584672cd8 in _lvconvert_raid (lv=0x7fb57c068198, lp=0x7fb58632d710) at lvconvert.c:1949
#12 0x00007fb58467a3de in _lvconvert_single (cmd=0x7fb5886b0d10, lv=0x7fb57c068198, handle=0x7fb58632d710) at lvconvert.c:3372
#13 0x00007fb58467aa0e in lvconvert_single (cmd=0x7fb5886b0d10, lp=0x7fb58632d710) at lvconvert.c:3476
#14 0x00007fb58467ae45 in lvconvert (cmd=0x7fb5886b0d10, argc=1, argv=0x7fb58632d908) at lvconvert.c:3566
#15 0x00007fb584688920 in lvm_run_command (cmd=0x7fb5886b0d10, argc=1, argv=0x7fb58632d908) at lvmcmdline.c:1519
#16 0x00007fb5846a83f4 in lvm2_run (handle=0x7fb5886b0d10, cmdline=0x7fb58632db70 "lvconvert --config devices{ignore_suspended_devices=1} --repair --use-policies vg/lvol0") at lvmcmdlib.c:86
#17 0x00007fb58492ae8b in dmeventd_lvm2_run (cmdline=0x7fb58632db70 "lvconvert --config devices{ignore_suspended_devices=1} --repair --use-policies vg/lvol0") at dmeventd_lvm.c:146
#18 0x00007fb58438fbae in run_repair (device=0x7fb57c000a20 "vg-lvol0-real") at dmeventd_raid.c:50
#19 0x00007fb58438fd3b in _process_raid_event (params=0x7fb57c004a30 "raid1", device=0x7fb57c000a20 "vg-lvol0-real") at dmeventd_raid.c:105
#20 0x00007fb58438fec4 in process_event	(dmt=0x7fb57c0008c0, event=DM_EVENT_DEVICE_ERROR, unused=0x7fb588700eb8) at dmeventd_raid.c:147
#21 0x00007fb5863eb518 in _do_process_event (thread=0x7fb588700e40, task=0x7fb57c0008c0) at dmeventd.c:804
#22 0x00007fb5863eb844 in _monitor_thread (arg=0x7fb588700e40) at dmeventd.c:928
#23 0x00007fb585ba4a51 in start_thread () from /lib64/libpthread.so.0
#24 0x00007fb58548296d in clone () from /lib64/libc.so.6


So the dmeventd processing and pvs are fighting each other for a lock.

Comment 23 Peter Rajnoha 2015-06-15 13:17:45 UTC
The pvs and its request for the global lock cause the "Releasing activation in critical section." error to be hit in clvmd (this does not happen in the case where both dmeventd and pvs pass).

Comment 24 Peter Rajnoha 2015-06-15 13:53:34 UTC
One of the clvmd threads is trying to access the top-level raid volume (vg/lvol0 from my test, which is a raid1 LV), which is not suspended itself - I still need to find out exactly why (I suspect a full scan). From the strace:

stat("/dev/vg/lvol0", {st_mode=S_IFBLK|0660, st_rdev=makedev(253, 8), ...}) = 0
open("/dev/vg/lvol0", O_RDONLY|O_DIRECT|O_NOATIME) = 17
fstat(17, {st_mode=S_IFBLK|0660, st_rdev=makedev(253, 8), ...}) = 0
ioctl(17, BLKBSZGET, 4096)              = 0
ioctl(17, BLKPBSZGET, 512)              = 0
lseek(17, 0, SEEK_SET)                  = 0
read(17,

But this volume is right in the middle of a repair, with some of its non-top-level components still suspended:

[root@rhel6-b ~]# dmsetup info -c -o name,attr  
Name              Stat
vg-lvol0_rimage_0 L-sw
vg-lvol0_rmeta_0  L-sw
vg-lvol1          L--w
vg-lvol0          L--w
vg-lvol0-real     LIsw
vg-lvol0_rimage_3 L--w
vg-lvol0_rmeta_3  L--w
vg-lvol0_rimage_2 L-sw
vg-lvol1-cow      L--w
vg-lvol0_rmeta_2  L-sw
vg-lvol0_rimage_1 L-sw
vg-lvol0_rmeta_1  L-sw

This leads to the problem: such a top-level raid device is not yet usable and should be skipped during any scans. I think this is not properly covered by the filter-usable.c code.

We already have a check for the old mirror devices (the _ignore_blocked_mirror_devices fn, which decides that and is called from the filter-usable.c code), but we don't have such a deeper check for raid devices. I'll inspect this part a bit more, but it seems this is exactly the source of the problem.
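
As a shell-level cross-check (using the same dmsetup report as above - the third character of the attr field is 's' while a device is suspended), the components a scan would hang on can be listed with:

  # List suspended dm devices - exactly the ones a full scan must skip.
  dmsetup info -c --noheadings --separator ' ' -o name,attr \
      | awk 'substr($2, 3, 1) == "s" { print $1 }'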

Comment 25 Peter Rajnoha 2015-06-15 14:02:30 UTC
Ah, I'm starting to recall this one now:
https://git.fedorahosted.org/cgit/lvm2.git/tree/lib/activate/dev_manager.c#n548

And this related commit:

https://git.fedorahosted.org/cgit/lvm2.git/commit/?id=c0e17bca90326104020ba31b00fb9f66abf278b6

OK, so I need to find a better check for the situation we're hitting in this bug report. We can't use the original check; we need a better one - the reason is explained in the comment in https://git.fedorahosted.org/cgit/lvm2.git/tree/lib/activate/dev_manager.c#n548.

Comment 26 Peter Rajnoha 2015-06-15 14:27:36 UTC
I tried to reinstate the original check (just to see if this is the issue), but I can still reproduce. This time, though, the hang occurs on vg/lvol1 in my test (which is the snapshot; before, it was on the origin). So it seems we've located the problem, but the checks in the "is device usable" filter need to be enhanced to also cover more complex situations like the one described in this bz report. I'm taking this BZ.

Comment 27 Jonathan Earl Brassow 2015-06-15 15:17:57 UTC
I've noticed that the 'lv' isn't the device that is suspended, but 'lv-real'.  Could this be part of the problem?

Should it be allowable that 'pvs' can get in the middle of a suspend/resume cycle?

Be careful that mirror "blocked I/O" is not the same as "suspended".  RAID does not have a "blocked I/O" issue.  You only have to deal with "suspended" with RAID.

Comment 28 Peter Rajnoha 2015-06-16 07:18:18 UTC
(In reply to Jonathan Earl Brassow from comment #27)
> I've noticed that the 'lv' isn't the device that is suspended, but 'lv-real'.
> Could this be part of the problem?

Yes, that's exactly the case here - we're checking just the top-level snapshot/snapshot-origin LV, which is not suspended, but the LVs underneath are! That's why we need to extend the existing check for suspended devs...

> 
> Should it be allowable that 'pvs' can get in the middle of a suspend/resume
> cycle?

Indeed! Well, in my opinion, this shouldn't be allowed, but I'm just about to check the code in this area. I suspect there's a bug exactly here which allows this interleaving of operations.

> 
> Be careful that mirror "blocked I/O" is not the same as "suspended".  RAID
> does not have a "blocked I/O" issue.  You only have to deal with "suspended"
> with RAID.

Sure, I just meant this to be a similar situation, but not exactly the same... RAIDs need to be handled in a different way here.

Comment 29 Peter Rajnoha 2015-06-16 11:18:14 UTC
I've tried to run lvconvert --repair manually together with pvs in parallel and all seems to be running fine - lvconvert takes the VG write lock, any parallel pvs is paused until lvconvert finishes, and I don't see room for any interruption here. It looks like something is done differently when lvconvert is run from within dmeventd...

Comment 30 Peter Rajnoha 2015-06-16 13:31:28 UTC
(In reply to Peter Rajnoha from comment #29)
> I've tried to run lvconvert --repair manually together with pvs in parallel
> and all seems to be running fine

Well, there's a small window in which I can make it hang!

Take the stacktrace of the lvconvert --repair from comment #22.

In the _lv_update_and_reload fn, we suspend devices:

        if (!(origin_only ? suspend_lv_origin(vg->cmd, lock_lv) : suspend_lv(vg->cmd, lock_lv))) {
                log_error("Failed to lock logical volume %s.",
                          display_lvname(lock_lv));
                vg_revert(vg);
        } else if (!(r = vg_commit(vg)))
                stack; /* !vg_commit() has implict vg_revert() */

And vg_commit, called later, contains:

       if (cache_updated) {
                /* Instruct remote nodes to upgrade cached metadata. */
                if (!remote_commit_cached_metadata(vg))
                 ....

If we call pvs just after that suspend and before remote_commit_cached_metadata, the lvconvert hangs on remote_commit_cached_metadata because it's waiting on clvmd - and clvmd in turn hangs on the suspended devices while trying to process the remote_commit_cached_metadata request.

Hence we end up with a deadlock here!

Comment 31 Peter Rajnoha 2015-06-16 13:37:59 UTC
I'm still not sure why the LCK_VG_COMMIT clvmd command (caused by the remote_commit_cached_metadata call) causes the suspended devices to be accessed on the clvmd side; I'm inspecting this now...

Comment 32 Peter Rajnoha 2015-06-16 13:53:17 UTC
(In reply to Peter Rajnoha from comment #31)
> I'm still not sure why the LCK_VG_COMMIT clvmd command (caused by the
> remote_commit_cached_metadata call) causes the suspended devices to be
> accessed on the clvmd side; I'm inspecting this now...

Nope, it's the pvs that hangs the whole clvmd and makes it unresponsive to any later requests (including the remote_commit_cached_metadata call). The pvs causes suspended devs to be accessed even though we have the VG write lock held; this is the exact lock state at the time the hang occurs:

[root@rhel6-b ~]# dlm_tool lockdump clvmd
id 03510001 gr PR rq IV pid 3790 master 0 "P_#global"
id 013a0001 gr EX rq IV pid 3412 master 0 "vFe05B7s7BlCwSITdMBcCtQG7ZbmBiVIUfemA25ggpkIaLbtVCQpdxgucjPY3jQq"
id 02a00001 gr EX rq IV pid 3412 master 0 "vFe05B7s7BlCwSITdMBcCtQG7ZbmBiVIvNot9ETRK3QGLb19CfcHdGs3tQrcZdYW"
id 00950001 gr EX rq IV pid 3412 master 0 "vFe05B7s7BlCwSITdMBcCtQG7ZbmBiVILrKxDIReTZwpPkIumvraAYqPIpaPw00W"
id 010a0001 gr PW rq IV pid 3714 master 0 "V_vg"


So we're holding the V_vg write lock (dlm's PW) and we're requesting the P_#global lock via the pvs call (dlm's PR) - PW must be finished before PR can be taken, but it seems something extra runs *before* that and causes the suspended devs to be accessed even though the PR lock is never granted. I suspect an incorrect scan is being triggered here...

Comment 33 Peter Rajnoha 2015-06-16 14:13:34 UTC
So, this is what happens while handling pvs and its P_#global request while V_vg is held and suspended devices are present:

First, this is the backtrace of the pvs request on the clvmd side, hitting the "Releasing activation in critical section." error and leaving the suspended devices in place:

(gdb) bt
#0  activation_release () at activate/activate.c:2368
#1  0x00007ffff7f19bc8 in refresh_toolcontext (cmd=0x7ffff00008c0) at commands/toolcontext.c:1898
#2  0x00007ffff7f061e3 in do_refresh_cache () at lvm-functions.c:658
#3  0x00007ffff7f06317 in do_lock_vg (command=1 '\001', lock_flags=68 'D', resource=0x7ffff8f46223 "P_#global") at lvm-functions.c:694
#4  0x00007ffff7efe9f6 in do_command (msg=0x7ffff8f46210, msglen=29, buf=0x7ffff7ecbd08, buflen=1481, retlen=0x7ffff7ecbd18) at clvmd-command.c:112
#5  0x00007ffff7f0364f in process_local_command (msg=0x7ffff8f46210, msglen=29, xid=51) at clvmd.c:1767
#6  0x00007ffff7f03e50 in process_work_item (cmd=0x7ffff8f3d860) at clvmd.c:1988
#7  0x00007ffff7f03f86 in lvm_thread_fn (arg=0x7fffffffe310) at clvmd.c:2033
#8  0x00007ffff6c31a51 in start_thread () from /lib64/libpthread.so.0
#9  0x00007ffff697e96d in clone () from /lib64/libc.so.6

Later on, as the P_#global clvmd request is handled, we refresh the cache, causing a full scan to trigger: see daemons/clvmd/lvm-functions.c and the do_lock_vg -> do_refresh_cache -> lvmcache_label_scan call chain. That full scan hangs on the suspended devs.
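
For reference, a backtrace like the one above can be pulled from the hung clvmd directly (a generic gdb invocation; assumes the matching debuginfo packages are installed):

  # Dump backtraces of all clvmd threads without killing the daemon.
  gdb -p "$(pidof clvmd)" -batch -ex "thread apply all bt"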

Comment 34 Peter Rajnoha 2015-06-17 11:54:08 UTC
OK, so here's a quick fix for now - we don't have time to fiddle with this more.

https://git.fedorahosted.org/cgit/lvm2.git/commit/?id=1e6a926e857d8df767415c6ff91d4dd1de74f363

I've induced the bug scenario as described in my previous comments and it passes fine with this patch. I've also run the full test suite on my machines and it passes. So I believe there are no regressions from this patch.

However, we need to revisit this to fix it properly! I added comments to the code (FIXME) as well as to the commit message.

Will do a scratch build in a while (Jon or Corey, you can try that on your machines before I do the build).

Comment 38 Corey Marthaler 2015-06-17 22:28:45 UTC
This is no longer reproducible with the fix. I ran this test all day on two different clusters and saw no issues.


2.6.32-569.el6.x86_64
lvm2-2.02.118-3.el6    BUILT: Wed Jun 17 09:40:21 CDT 2015
lvm2-libs-2.02.118-3.el6    BUILT: Wed Jun 17 09:40:21 CDT 2015
lvm2-cluster-2.02.118-3.el6    BUILT: Wed Jun 17 09:40:21 CDT 2015
udev-147-2.63.el6    BUILT: Mon Jun  8 06:26:08 CDT 2015
device-mapper-1.02.95-3.el6    BUILT: Wed Jun 17 09:40:21 CDT 2015
device-mapper-libs-1.02.95-3.el6    BUILT: Wed Jun 17 09:40:21 CDT 2015
device-mapper-event-1.02.95-3.el6    BUILT: Wed Jun 17 09:40:21 CDT 2015
device-mapper-event-libs-1.02.95-3.el6    BUILT: Wed Jun 17 09:40:21 CDT 2015
device-mapper-persistent-data-0.3.2-1.el6    BUILT: Fri Apr  4 08:43:06 CDT 2014
cmirror-2.02.118-3.el6    BUILT: Wed Jun 17 09:40:21 CDT 2015

Comment 42 Steve Almy 2015-06-29 17:54:49 UTC
PMApproved for 6.7 0-day.

Comment 48 Corey Marthaler 2016-02-17 23:40:53 UTC
Marking this verified in the latest rpms.

This test case passed hundreds of iterations. I filed bug 1309484 for the one failure.


2.6.32-615.el6.x86_64
lvm2-2.02.141-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016
lvm2-libs-2.02.141-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016
lvm2-cluster-2.02.141-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016
udev-147-2.71.el6    BUILT: Wed Feb 10 07:07:17 CST 2016
device-mapper-1.02.115-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016
device-mapper-libs-1.02.115-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016
device-mapper-event-1.02.115-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016
device-mapper-event-libs-1.02.115-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016
device-mapper-persistent-data-0.6.2-0.1.rc1.el6    BUILT: Wed Feb 10 09:52:15 CST 2016
cmirror-2.02.141-2.el6    BUILT: Wed Feb 10 07:49:03 CST 2016

Comment 50 errata-xmlrpc 2016-05-11 01:16:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0964.html