Bug 1219222
| Field | Value |
|---|---|
| Summary | device failure on exclusively active raid w/ snapshot (on top of cluster VG) leads to deadlock |
| Product | Red Hat Enterprise Linux 6 |
| Component | lvm2 |
| lvm2 sub component | Mirroring and RAID (RHEL6) |
| Version | 6.7 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Keywords | ZStream |
| Target Milestone | rc |
| Target Release | --- |
| Reporter | Corey Marthaler <cmarthal> |
| Assignee | Peter Rajnoha <prajnoha> |
| QA Contact | cluster-qe <cluster-qe> |
| CC | agk, cluster-qe, heinzm, jbrassow, lvm-team, msnitzer, prajnoha, prockai, salmy, tlavigne, zkabelac |
| Fixed In Version | lvm2-2.02.140-1.el6 |
| Doc Type | Bug Fix |
| Story Points | --- |
| Clone Of | 1193222 |
| Clones | 1236690 (view as bug list) |
| Bug Depends On | 1193222 |
| Bug Blocks | 1236690, 1268411 |
| Type | Bug |
| Regression | --- |
| Last Closed | 2016-05-11 01:16:52 UTC |

Doc Text:

Previously, if an exclusively activated RAID logical volume with a snapshot was being altered while other LVM commands ran at the same time, I/O and LVM commands could hang. This could happen, for example, when 'pvs' was run at the same time as a repair of the RAID logical volume. The cause was traced to LVM commands attempting to read logical volumes while those volumes were being altered. This issue has been resolved.
Description (Corey Marthaler, 2015-05-06 20:58:46 UTC)
This exists in rhel6.7 as well.

```
2.6.32-554.el6.x86_64
lvm2-2.02.118-2.el6                        BUILT: Wed Apr 15 06:34:08 CDT 2015
lvm2-libs-2.02.118-2.el6                   BUILT: Wed Apr 15 06:34:08 CDT 2015
lvm2-cluster-2.02.118-2.el6                BUILT: Wed Apr 15 06:34:08 CDT 2015
udev-147-2.61.el6                          BUILT: Mon Mar  2 05:08:11 CST 2015
device-mapper-1.02.95-2.el6                BUILT: Wed Apr 15 06:34:08 CDT 2015
device-mapper-libs-1.02.95-2.el6           BUILT: Wed Apr 15 06:34:08 CDT 2015
device-mapper-event-1.02.95-2.el6          BUILT: Wed Apr 15 06:34:08 CDT 2015
device-mapper-event-libs-1.02.95-2.el6     BUILT: Wed Apr 15 06:34:08 CDT 2015
device-mapper-persistent-data-0.3.2-1.el6  BUILT: Fri Apr  4 08:43:06 CDT 2014
cmirror-2.02.118-2.el6                     BUILT: Wed Apr 15 06:34:08 CDT 2015

[root@host-122 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M     92   2015-05-06 15:48:58  host-122
   2   M    100   2015-05-06 15:48:59  host-123
   3   M     96   2015-05-06 15:48:59  host-124
```

```
================================================================================
Iteration 0.1 started at Wed May 6 14:44:30 CDT 2015
================================================================================
Scenario kill_primary_synced_raid4_2legs: Kill primary leg of synced 2 leg raid4 volume(s)

********* RAID hash info for this scenario *********
* names:              synced_primary_raid4_2legs_1
* sync:               1
* type:               raid4
* -m |-i value:       2
* leg devices:        /dev/sda1 /dev/sdb1 /dev/sde1
* spanned legs:       0
* failpv(s):          /dev/sda1
* additional snap:    /dev/sdb1
* failnode(s):        host-122
* lvmetad:            0
* raid fault policy:  allocate
******************************************************

Creating raids(s) on host-122...
host-122: lvcreate --type raid4 -i 2 -n synced_primary_raid4_2legs_1 -L 500M black_bird /dev/sda1:0-2400 /dev/sdb1:0-2400 /dev/sde1:0-2400

EXCLUSIVELY ACTIVATING RAID on host-122

Current mirror/raid device structure(s):
  LV                                       Attr       LSize   Cpy%Sync Devices
  synced_primary_raid4_2legs_1             rwi-a-r--- 504.00m 3.17     synced_primary_raid4_2legs_1_rimage_0(0),synced_primary_raid4_2legs_1_rimage_1(0),synced_primary_raid4_2legs_1_rimage_2(0)
  [synced_primary_raid4_2legs_1_rimage_0]  Iwi-aor--- 252.00m          /dev/sda1(1)
  [synced_primary_raid4_2legs_1_rimage_1]  Iwi-aor--- 252.00m          /dev/sdb1(1)
  [synced_primary_raid4_2legs_1_rimage_2]  Iwi-aor--- 252.00m          /dev/sde1(1)
  [synced_primary_raid4_2legs_1_rmeta_0]   ewi-aor---   4.00m          /dev/sda1(0)
  [synced_primary_raid4_2legs_1_rmeta_1]   ewi-aor---   4.00m          /dev/sdb1(0)
  [synced_primary_raid4_2legs_1_rmeta_2]   ewi-aor---   4.00m          /dev/sde1(0)

Waiting until all mirror|raid volumes become fully syncd...
1/1 mirror(s) are fully synced: ( 100.00% )

Creating ext on top of mirror(s) on host-122...
mke2fs 1.41.12 (17-May-2010)
Mounting mirrored ext filesystems on host-122...

PV=/dev/sda1
    synced_primary_raid4_2legs_1_rimage_0: 1.0
    synced_primary_raid4_2legs_1_rmeta_0: 1.0

Creating a snapshot volume of each of the raids
Writing verification files (checkit) to mirror(s) on...
    ---- host-122 ----
Sleeping 15 seconds to get some outsanding EXT I/O locks before the failure
Verifying files (checkit) on mirror(s) on...
    ---- host-122 ----

Disabling device sda on host-122

simple pvs failed
FI_engine: inject() method failed
```

```
host-122 qarshd[3711]: Running cmdline: pvs -a
host-122 kernel: sd 2:0:0:1: rejecting I/O to offline device
host-122 kernel: sd 2:0:0:1: rejecting I/O to offline device
host-122 kernel: sd 2:0:0:1: rejecting I/O to offline device
host-122 kernel: sd 2:0:0:1: rejecting I/O to offline device
host-122 kernel: sd 2:0:0:1: rejecting I/O to offline device
host-122 xinetd[2395]: EXIT: qarsh status=0 pid=3711 duration=67(sec)
host-122 lvm[15257]: Error locking on node host-122: Command timed out
host-122 kernel: INFO: task clvmd:29382 blocked for more than 120 seconds.
host-122 kernel: Not tainted 2.6.32-554.el6.x86_64 #1
host-122 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
host-122 kernel: clvmd D 0000000000000000 0 29382 1 0x00000080
host-122 kernel: ffff88003a53fa18 0000000000000082 ffff880000000000 0003af8a169e9404
host-122 kernel: ffff88000591b6e0 ffff88002166f130 0000000005e3ab6c ffffffffaabbf2f5
host-122 kernel: 0000000012b294b7 ffffffff81aa6600 ffff880039cc7ad8 ffff88003a53ffd8
host-122 kernel: Call Trace:
host-122 kernel: [<ffffffff81537bc3>] io_schedule+0x73/0xc0
host-122 kernel: [<ffffffff811d1e8d>] __blockdev_direct_IO_newtrunc+0xb7d/0x1270
host-122 kernel: [<ffffffff811cd7f0>] ? blkdev_get_block+0x0/0x20
host-122 kernel: [<ffffffff811d25f7>] __blockdev_direct_IO+0x77/0xe0
host-122 kernel: [<ffffffff811cd7f0>] ? blkdev_get_block+0x0/0x20
host-122 kernel: [<ffffffff811ce877>] blkdev_direct_IO+0x57/0x60
host-122 kernel: [<ffffffff811cd7f0>] ? blkdev_get_block+0x0/0x20
host-122 kernel: [<ffffffff811290db>] generic_file_aio_read+0x6bb/0x700
host-122 kernel: [<ffffffff812386d1>] ? avc_has_perm+0x71/0x90
host-122 kernel: [<ffffffff8123a582>] ? selinux_inode_permission+0x72/0xb0
host-122 kernel: [<ffffffff811cdd61>] blkdev_aio_read+0x51/0x80
host-122 kernel: [<ffffffff811913fa>] do_sync_read+0xfa/0x140
host-122 kernel: [<ffffffff810a14a0>] ? autoremove_wake_function+0x0/0x40
host-122 kernel: [<ffffffff811cdb8c>] ? block_ioctl+0x3c/0x40
host-122 kernel: [<ffffffff811a6dd2>] ? vfs_ioctl+0x22/0xa0
host-122 kernel: [<ffffffff8123e13b>] ? selinux_file_permission+0xfb/0x150
host-122 kernel: [<ffffffff81230ea6>] ? security_file_permission+0x16/0x20
host-122 kernel: [<ffffffff81191cf5>] vfs_read+0xb5/0x1a0
host-122 kernel: [<ffffffff81192aa6>] ? fget_light_pos+0x16/0x50
host-122 kernel: [<ffffffff81192041>] sys_read+0x51/0xb0
host-122 kernel: [<ffffffff810e882e>] ? __audit_syscall_exit+0x25e/0x290
host-122 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
host-122 kernel: INFO: task flush-253:8:3585 blocked for more than 120 seconds.
host-122 kernel: Not tainted 2.6.32-554.el6.x86_64 #1
host-122 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
host-122 kernel: flush-253:8 D 0000000000000000 0 3585 2 0x00000080
host-122 kernel: ffff88000b8779a0 0000000000000046 ffff880000000000 0003af909dd2e65a
host-122 kernel: ffff88000b877960 ffff880007b3dbb0 0000000005e3bb62 ffffffffaabbf2f5
host-122 kernel: 0000000012b294b7 ffffffff81aa6600 ffff880037e9e5f8 ffff88000b877fd8
host-122 kernel: Call Trace:
host-122 kernel: [<ffffffff81127240>] ? sync_page+0x0/0x50
host-122 kernel: [<ffffffff81537bc3>] io_schedule+0x73/0xc0
host-122 kernel: [<ffffffff8112727d>] sync_page+0x3d/0x50
host-122 kernel: [<ffffffff8153845a>] __wait_on_bit_lock+0x5a/0xc0
host-122 kernel: [<ffffffff81127217>] __lock_page+0x67/0x70
host-122 kernel: [<ffffffff810a1520>] ? wake_bit_function+0x0/0x50
host-122 kernel: [<ffffffff81127357>] ? unlock_page+0x27/0x30
host-122 kernel: [<ffffffff8113c51d>] write_cache_pages+0x3cd/0x4c0
host-122 kernel: [<ffffffff8113b070>] ? __writepage+0x0/0x40
host-122 kernel: [<ffffffff8113c634>] generic_writepages+0x24/0x30
host-122 kernel: [<ffffffff8113c675>] do_writepages+0x35/0x40
host-122 kernel: [<ffffffff811be40d>] writeback_single_inode+0xdd/0x290
host-122 kernel: [<ffffffff811be80d>] writeback_sb_inodes+0xbd/0x170
host-122 kernel: [<ffffffff811be96b>] writeback_inodes_wb+0xab/0x1b0
host-122 kernel: [<ffffffff811bed63>] wb_writeback+0x2f3/0x410
host-122 kernel: [<ffffffff8108a772>] ? del_timer_sync+0x22/0x30
host-122 kernel: [<ffffffff811bf025>] wb_do_writeback+0x1a5/0x240
host-122 kernel: [<ffffffff811bf123>] bdi_writeback_task+0x63/0x1b0
host-122 kernel: [<ffffffff810a1327>] ? bit_waitqueue+0x17/0xd0
host-122 kernel: [<ffffffff8114b380>] ? bdi_start_fn+0x0/0x100
host-122 kernel: [<ffffffff8114b406>] bdi_start_fn+0x86/0x100
host-122 kernel: [<ffffffff8114b380>] ? bdi_start_fn+0x0/0x100
host-122 kernel: [<ffffffff810a100e>] kthread+0x9e/0xc0
host-122 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
host-122 kernel: [<ffffffff810a0f70>] ? kthread+0x0/0xc0
host-122 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
```

FWIW, this is reproducible on physical machines.

```
[root@taft-02 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M     68   2015-05-26 14:21:30  taft-01
   2   M     68   2015-05-26 14:21:30  taft-02
   4   M     68   2015-05-26 14:21:30  taft-04

[root@taft-02 ~]# cman_tool services
fence domain
member count  3
victim count  0
victim now    0
master nodeid 1
wait state    none
members       1 2 4

dlm lockspaces
name          clvmd
id            0x4104eefa
flags         0x00000000
change        member 3 joined 1 remove 0 failed 0 seq 2,2
members       1 2 4
```

```
./black_bird -R ../../var/share/resource_files/taft.xml -l /home/msp/cmarthal/work/sts/sts-root -r /usr/tests/sts-rhel6.7 -E taft-02 -i 2 -F -w EXT

8 disk(s) to be used:
    taft-01=/dev/mapper/mpathf /dev/mapper/mpathg /dev/mapper/mpathe /dev/mapper/mpathh /dev/mapper/mpathb /dev/mapper/mpatha /dev/mapper/mpathc /dev/mapper/mpathd
    taft-02=/dev/mapper/mpathf /dev/mapper/mpathg /dev/mapper/mpathe /dev/mapper/mpathh /dev/mapper/mpathc /dev/mapper/mpathd /dev/mapper/mpathb /dev/mapper/mpatha
    taft-04=/dev/mapper/mpathh /dev/mapper/mpathe /dev/mapper/mpathd /dev/mapper/mpathf /dev/mapper/mpathg /dev/mapper/mpathc /dev/mapper/mpathb /dev/mapper/mpatha

[...]

creating lvm devices...
Create 7 PV(s) for black_bird on taft-01
Create VG black_bird on taft-01
Enabling raid allocate fault policies on: taft-02

================================================================================
Iteration 0.1 started at Tue May 26 15:52:49 CDT 2015
================================================================================
Scenario kill_primary_synced_raid1_2legs: Kill primary leg of synced 2 leg raid1 volume(s)

********* RAID hash info for this scenario *********
* names:              synced_primary_raid1_2legs_1
* sync:               1
* type:               raid1
* -m |-i value:       2
* leg devices:        /dev/mapper/mpathbp1 /dev/mapper/mpathcp1 /dev/mapper/mpathdp1
* spanned legs:       0
* failpv(s):          /dev/mapper/mpathbp1
* additional snap:    /dev/mapper/mpathcp1
* failnode(s):        taft-02
* lvmetad:            0
* raid fault policy:  allocate
******************************************************

Creating raids(s) on taft-02...
taft-02: lvcreate --type raid1 -m 2 -n synced_primary_raid1_2legs_1 -L 500M black_bird /dev/mapper/mpathbp1:0-2400 /dev/mapper/mpathcp1:0-2400 /dev/mapper/mpathdp1:0-2400

EXCLUSIVELY ACTIVATING RAID on taft-02

Current mirror/raid device structure(s):
  LV                                       Attr       LSize   Cpy%Sync Devices
  synced_primary_raid1_2legs_1             rwi-a-r--- 500.00m 1.60     synced_primary_raid1_2legs_1_rimage_0(0),synced_primary_raid1_2legs_1_rimage_1(0),synced_primary_raid1_2legs_1_rimage_2(0)
  [synced_primary_raid1_2legs_1_rimage_0]  Iwi-aor--- 500.00m          /dev/mapper/mpathbp1(1)
  [synced_primary_raid1_2legs_1_rimage_1]  Iwi-aor--- 500.00m          /dev/mapper/mpathcp1(1)
  [synced_primary_raid1_2legs_1_rimage_2]  Iwi-aor--- 500.00m          /dev/mapper/mpathdp1(1)
  [synced_primary_raid1_2legs_1_rmeta_0]   ewi-aor---   4.00m          /dev/mapper/mpathbp1(0)
  [synced_primary_raid1_2legs_1_rmeta_1]   ewi-aor---   4.00m          /dev/mapper/mpathcp1(0)
  [synced_primary_raid1_2legs_1_rmeta_2]   ewi-aor---   4.00m          /dev/mapper/mpathdp1(0)

Waiting until all mirror|raid volumes become fully syncd...
1/1 mirror(s) are fully synced: ( 100.00% )

Creating ext on top of mirror(s) on taft-02...
mke2fs 1.41.12 (17-May-2010)
Mounting mirrored ext filesystems on taft-02...

PV=/dev/mapper/mpathbp1
    synced_primary_raid1_2legs_1_rimage_0: 1.0
    synced_primary_raid1_2legs_1_rmeta_0: 1.0

Creating a snapshot volume of each of the raids
Writing verification files (checkit) to mirror(s) on...
    ---- taft-02 ----
Sleeping 15 seconds to get some outsanding EXT I/O locks before the failure
Verifying files (checkit) on mirror(s) on...
    ---- taft-02 ----

Failing path sdc on taft-02
Failing path sdk on taft-02
Failing path sdaa on taft-02
Failing path sds on taft-02

simple pvs failed
[DEADLOCK]
```

```
May 26 15:57:28 taft-02 kernel: INFO: task clvmd:10589 blocked for more than 120 seconds.
May 26 15:57:28 taft-02 kernel: Not tainted 2.6.32-562.el6.x86_64 #1
May 26 15:57:28 taft-02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 26 15:57:28 taft-02 kernel: clvmd D 0000000000000001 0 10589 1 0x00000080
May 26 15:57:28 taft-02 kernel: ffff8802196fba18 0000000000000082 ffff8802196fb978 ffffffff8126f104
May 26 15:57:28 taft-02 kernel: ffff880217202d00 ffff880219d88900 ffff8802196fb9e8 ffffffffa0004daf
May 26 15:57:28 taft-02 kernel: ffff8802196fb9d8 ffff8802196fb9d8 ffff880207c5dad8 ffff8802196fbfd8
May 26 15:57:28 taft-02 kernel: Call Trace:
May 26 15:57:28 taft-02 kernel: [<ffffffff8126f104>] ? blk_unplug+0x34/0x70
May 26 15:57:28 taft-02 kernel: [<ffffffffa0004daf>] ? dm_table_unplug_all+0x5f/0x100 [dm_mod]
May 26 15:57:28 taft-02 kernel: [<ffffffff81538313>] io_schedule+0x73/0xc0
May 26 15:57:28 taft-02 kernel: [<ffffffff811d205d>] __blockdev_direct_IO_newtrunc+0xb7d/0x1270
May 26 15:57:28 taft-02 kernel: [<ffffffff811cd9c0>] ? blkdev_get_block+0x0/0x20
May 26 15:57:28 taft-02 kernel: [<ffffffff811d27c7>] __blockdev_direct_IO+0x77/0xe0
May 26 15:57:28 taft-02 kernel: [<ffffffff811cd9c0>] ? blkdev_get_block+0x0/0x20
May 26 15:57:28 taft-02 kernel: [<ffffffff811cea47>] blkdev_direct_IO+0x57/0x60
May 26 15:57:28 taft-02 kernel: [<ffffffff811cd9c0>] ? blkdev_get_block+0x0/0x20
May 26 15:57:28 taft-02 kernel: [<ffffffff8112910b>] generic_file_aio_read+0x6bb/0x700
May 26 15:57:28 taft-02 kernel: [<ffffffff812388c1>] ? avc_has_perm+0x71/0x90
May 26 15:57:28 taft-02 kernel: [<ffffffff8123a772>] ? selinux_inode_permission+0x72/0xb0
May 26 15:57:28 taft-02 kernel: [<ffffffff811cdf31>] blkdev_aio_read+0x51/0x80
May 26 15:57:28 taft-02 kernel: [<ffffffff811915ca>] do_sync_read+0xfa/0x140
May 26 15:57:28 taft-02 kernel: [<ffffffff810a14b0>] ? autoremove_wake_function+0x0/0x40
May 26 15:57:28 taft-02 kernel: [<ffffffff811cdd5c>] ? block_ioctl+0x3c/0x40
May 26 15:57:28 taft-02 kernel: [<ffffffff811a6fa2>] ? vfs_ioctl+0x22/0xa0
May 26 15:57:28 taft-02 kernel: [<ffffffff8123e32b>] ? selinux_file_permission+0xfb/0x150
May 26 15:57:28 taft-02 kernel: [<ffffffff81231096>] ? security_file_permission+0x16/0x20
May 26 15:57:28 taft-02 kernel: [<ffffffff81191ec5>] vfs_read+0xb5/0x1a0
May 26 15:57:28 taft-02 kernel: [<ffffffff81192c76>] ? fget_light_pos+0x16/0x50
May 26 15:57:28 taft-02 kernel: [<ffffffff81192211>] sys_read+0x51/0xb0
May 26 15:57:28 taft-02 kernel: [<ffffffff810e885e>] ? __audit_syscall_exit+0x25e/0x290
May 26 15:57:28 taft-02 kernel: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
```
Created attachment 1035388 [details]: messages from 124
Created attachment 1035389 [details]: strace of clvmd -d2 -f on 122
Created attachment 1035390 [details]: strace of clvmd -d2 -f on 123
Created attachment 1035391 [details]: strace of clvmd -d2 -f on 124
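For reference, traces like those attached can be captured by running clvmd in the foreground under strace, as was done here. A minimal sketch (the output path is illustrative):

```sh
# Stop the init-script instance, then run clvmd with debug logging (-d2)
# in the foreground (-f) under strace, following forks (-f) and
# timestamping each call (-tt).
/etc/init.d/clvmd stop
strace -f -tt -o /tmp/clvmd.strace clvmd -d2 -f
```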
I've managed to reproduce with this:

1) /etc/init.d/clvmd stop
2) clvmd -f -d1
3) vgcreate vg /dev/sd{a..p}
4) lvcreate --type raid1 -m 2 -L 100M -aey vg
5) lvcreate -L50m -s vg/lvol0
6) echo 1 > /sys/block/sda/device/delete
7) dd if=/dev/zero of=/dev/sda count=1
   (sda had rimage0 and rmeta0)

There's one line in the clvmd log pointing to something done incorrectly:

"Inconsistent pre-commit metadata copies for volume group vg"

While clvmd is processing these requests, I get a hang if I run pvs in the middle of the operation. If I wait for clvmd to finish all the processing, the pvs is fine (though the "inconsistent pre-commit metadata" message is still in the log). So something is wrong here - it looks like unprotected operations running in parallel. I'll check the logs to track down the source of the problem.

Created attachment 1038894 [details]: clvmd log for raid1 with snapshot repair after device failure - pvs run in the middle - hang
Created attachment 1038895 [details]: clvmd log for raid1 with snapshot repair after device failure - pvs run after all processing - pvs not hung
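For convenience, here is the reproducer above consolidated into a single script. This is a sketch rather than a hardened test: it assumes bash on a cluster node (for the {a..p} expansion), that the listed disks are free for the clustered VG, and that sda ends up holding rimage_0/rmeta_0 as in the original run; the sleep is an arbitrary settle time I added.

```sh
#!/bin/bash
# Reproducer for the deadlock, consolidated from the numbered steps above.
/etc/init.d/clvmd stop
clvmd -f -d1 &                              # clvmd in the foreground with debug logging
sleep 5                                     # arbitrary: give clvmd time to come up
vgcreate vg /dev/sd{a..p}
lvcreate --type raid1 -m 2 -L 100M -aey vg  # exclusively activated raid1 -> vg/lvol0
lvcreate -L50m -s vg/lvol0                  # snapshot on top of the raid
echo 1 > /sys/block/sda/device/delete       # drop the leg device
dd if=/dev/zero of=/dev/sda count=1         # force I/O errors -> dmeventd starts a repair
pvs                                         # run while the repair is still in flight -> hang
```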
I've attached the logs from clvmd for the case where pvs fails (that happens when I run pvs in the middle of clvmd processing the operations, which presumably come from dmeventd). That pvs fails while trying to obtain the global lock from clvmd. The other log shows the situation in which I waited for all the clvmd operations to finish (and hence pvs passed just fine in this case). I haven't looked at these logs in detail yet, so I still need to go through them... Hopefully that will point us to the source of the problem.

The dmeventd thread is hung here:

```
#0  0x00007fb585bab7dd in read () from /lib64/libpthread.so.0
#1  0x00007fb58464d002 in _send_request (inbuf=0x7fb58632bfd0 "3", inlen=24, retbuf=0x7fb58632c058) at locking/cluster_locking.c:120
#2  0x00007fb58464d4a7 in _cluster_request (clvmd_cmd=51 '3', node=0x7fb5846cb29d "", data=0x7fb58632c0c0, len=7, response=0x7fb58632c108, num=0x7fb58632c120) at locking/cluster_locking.c:224
#3  0x00007fb58464da99 in _lock_for_cluster (cmd=0x7fb5886b0d10, clvmd_cmd=51 '3', flags=292, name=0x7fb58632c200 "P_vg") at locking/cluster_locking.c:383
#4  0x00007fb58464e23c in _lock_resource (cmd=0x7fb5886b0d10, resource=0x7fb58632d2b0 "vg", flags=292, lv=0x0) at locking/cluster_locking.c:515
#5  0x00007fb5845e6036 in _lock_vol (cmd=0x7fb5886b0d10, resource=0x7fb58632d2b0 "vg", flags=292, lv_op=LV_NOOP, lv=0x0) at locking/locking.c:274
#6  0x00007fb5845e6372 in lock_vol (cmd=0x7fb5886b0d10, vol=0x7fb57c066a0a "vg", flags=292, lv=0x0) at locking/locking.c:354
#7  0x00007fb5846122df in vg_commit (vg=0x7fb57c0668c0) at metadata/metadata.c:2979
#8  0x00007fb584600717 in _lv_update_and_reload (lv=0x7fb57c068198, origin_only=1) at metadata/lv_manip.c:5968
#9  0x00007fb584600969 in lv_update_and_reload_origin (lv=0x7fb57c068198) at metadata/lv_manip.c:5995
#10 0x00007fb58462c916 in lv_raid_replace (lv=0x7fb57c068198, remove_pvs=0x7fb57c069c98, allocate_pvs=0x7fb57c066988) at metadata/raid_manip.c:1788
#11 0x00007fb584672cd8 in _lvconvert_raid (lv=0x7fb57c068198, lp=0x7fb58632d710) at lvconvert.c:1949
#12 0x00007fb58467a3de in _lvconvert_single (cmd=0x7fb5886b0d10, lv=0x7fb57c068198, handle=0x7fb58632d710) at lvconvert.c:3372
#13 0x00007fb58467aa0e in lvconvert_single (cmd=0x7fb5886b0d10, lp=0x7fb58632d710) at lvconvert.c:3476
#14 0x00007fb58467ae45 in lvconvert (cmd=0x7fb5886b0d10, argc=1, argv=0x7fb58632d908) at lvconvert.c:3566
#15 0x00007fb584688920 in lvm_run_command (cmd=0x7fb5886b0d10, argc=1, argv=0x7fb58632d908) at lvmcmdline.c:1519
#16 0x00007fb5846a83f4 in lvm2_run (handle=0x7fb5886b0d10, cmdline=0x7fb58632db70 "lvconvert --config devices{ignore_suspended_devices=1} --repair --use-policies vg/lvol0") at lvmcmdlib.c:86
#17 0x00007fb58492ae8b in dmeventd_lvm2_run (cmdline=0x7fb58632db70 "lvconvert --config devices{ignore_suspended_devices=1} --repair --use-policies vg/lvol0") at dmeventd_lvm.c:146
#18 0x00007fb58438fbae in run_repair (device=0x7fb57c000a20 "vg-lvol0-real") at dmeventd_raid.c:50
#19 0x00007fb58438fd3b in _process_raid_event (params=0x7fb57c004a30 "raid1", device=0x7fb57c000a20 "vg-lvol0-real") at dmeventd_raid.c:105
#20 0x00007fb58438fec4 in process_event (dmt=0x7fb57c0008c0, event=DM_EVENT_DEVICE_ERROR, unused=0x7fb588700eb8) at dmeventd_raid.c:147
#21 0x00007fb5863eb518 in _do_process_event (thread=0x7fb588700e40, task=0x7fb57c0008c0) at dmeventd.c:804
#22 0x00007fb5863eb844 in _monitor_thread (arg=0x7fb588700e40) at dmeventd.c:928
#23 0x00007fb585ba4a51 in start_thread () from /lib64/libpthread.so.0
#24 0x00007fb58548296d in clone () from /lib64/libc.so.6
```
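As an aside, a backtrace like the one above can be pulled from the running daemon with gdb; a minimal sketch (it assumes the matching debuginfo packages are installed so the frames resolve to source lines):

```sh
# Attach to dmeventd non-interactively and dump all thread backtraces.
gdb --batch -p "$(pidof dmeventd)" -ex "thread apply all bt"
```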
So the dmeventd-driven repair and pvs fight with each other for a lock. The pvs request for the global lock causes the "Releasing activation in critical section." error to be hit in clvmd (this does not happen in the case where both the dmeventd repair and the pvs pass). One of the clvmd threads is trying to access the top-level raid volume (vg/lvol0 from my test, a raid1 LV), which is itself not suspended - I still need to find out exactly why (a full scan, I suspect). From the strace:

```
stat("/dev/vg/lvol0", {st_mode=S_IFBLK|0660, st_rdev=makedev(253, 8), ...}) = 0
open("/dev/vg/lvol0", O_RDONLY|O_DIRECT|O_NOATIME) = 17
fstat(17, {st_mode=S_IFBLK|0660, st_rdev=makedev(253, 8), ...}) = 0
ioctl(17, BLKBSZGET, 4096)  = 0
ioctl(17, BLKPBSZGET, 512)  = 0
lseek(17, 0, SEEK_SET)      = 0
read(17,
```

But this volume is right in the middle of a repair, with some of its non-top-level components still suspended:

```
[root@rhel6-b ~]# dmsetup info -c -o name,attr
Name               Stat
vg-lvol0_rimage_0  L-sw
vg-lvol0_rmeta_0   L-sw
vg-lvol1           L--w
vg-lvol0           L--w
vg-lvol0-real      LIsw
vg-lvol0_rimage_3  L--w
vg-lvol0_rmeta_3   L--w
vg-lvol0_rimage_2  L-sw
vg-lvol1-cow       L--w
vg-lvol0_rmeta_2   L-sw
vg-lvol0_rimage_1  L-sw
vg-lvol0_rmeta_1   L-sw
```

This leads to the problem: such a top-level raid device is not yet usable and should be skipped during any scans. I think this is not properly covered by the filter-usable.c code. We already have a check for the old mirror devices (the _ignore_blocked_mirror_devices fn, called from the filter-usable.c code), but we don't have a comparably deep check for raid devices. I'll inspect this part a bit more, but it seems this is exactly the source of the problem.
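The suspended sub-LVs can be spotted mechanically in the output above: reading the attr field of `dmsetup info -c -o name,attr` as it appears there, an 's' in the third position marks a suspended device. A small diagnostic sketch built on that assumption about the attr layout:

```sh
# List dm devices whose third attr character is 's' (suspended), e.g. the
# rimage/rmeta sub-LVs and the -real device mid-repair, while the
# top-level LV itself still shows "L--w". NR > 1 skips the header row.
dmsetup info -c -o name,attr | awk 'NR > 1 && substr($2, 3, 1) == "s" { print "suspended:", $1 }'
```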
Ah, I'm starting to recall this one now:

https://git.fedorahosted.org/cgit/lvm2.git/tree/lib/activate/dev_manager.c#n548

And this related commit:

https://git.fedorahosted.org/cgit/lvm2.git/commit/?id=c0e17bca90326104020ba31b00fb9f66abf278b6

OK, so I need to find a better check for the situation we're hitting in this bug report. We can't use the original check - the reason is explained in the comment at the dev_manager.c link above. I tried to reinstate the original check (just to see whether this is the issue), but I can still reproduce. This time, though, the hang occurs on vg/lvol1 in my test (the snapshot; before, it was on the origin). So the problem is located, but the checks in the "is device usable" filter need to be enhanced to also cover more complex situations such as the ones described in this bz report. I'm taking this BZ.

I've noticed that the 'lv' isn't the device that is suspended, but 'lv-real'. Could this be part of the problem?

Should it be allowable that 'pvs' can get in the middle of a suspend/resume cycle?

Be careful that mirror "blocked I/O" is not the same as "suspended". RAID does not have a "blocked I/O" issue. You only have to deal with "suspended" with RAID.

(In reply to Jonathan Earl Brassow from comment #27)
> I've noticed that the 'lv' isn't the device that is suspended, but
> 'lv-real'. Could this be part of the problem?

Yes, that's exactly the case here - we're checking just the top-level snapshot/snapshot-origin LV, which is not suspended here, but the LVs underneath it are! That's why we need to extend the existing check for suspended devs...

> Should it be allowable that 'pvs' can get in the middle of a suspend/resume
> cycle?

Indeed! In my opinion this shouldn't be allowed, but I'm just about to check the code in this area. I suspect there's a bug exactly here which allows this interleaving of operations.

> Be careful that mirror "blocked I/O" is not the same as "suspended". RAID
> does not have a "blocked I/O" issue. You only have to deal with "suspended"
> with RAID.

Sure, I just meant it as a similar situation, not exactly the same... RAIDs need to be handled in a different way here.

I've tried to run lvconvert --repair manually together with pvs in parallel and all seems to run fine - lvconvert takes the VG write lock and any parallel pvs is paused until lvconvert finishes, and I don't see room for any interruption here. It looks like something is done differently when lvconvert is run from within dmeventd...

(In reply to Peter Rajnoha from comment #29)
> I've tried to run lvconvert --repair manually together with pvs in parallel
> and all seems to be running fine

Well, there's a small window in which I can make it hang! Take the stack trace of the lvconvert --repair from comment #22. In the _lv_update_and_reload fn, we suspend devices:

```c
if (!(origin_only ? suspend_lv_origin(vg->cmd, lock_lv) :
		    suspend_lv(vg->cmd, lock_lv))) {
	log_error("Failed to lock logical volume %s.",
		  display_lvname(lock_lv));
	vg_revert(vg);
} else if (!(r = vg_commit(vg)))
	stack; /* !vg_commit() has implict vg_revert() */
```

while vg_commit, called later, contains:

```c
if (cache_updated) {
	/* Instruct remote nodes to upgrade cached metadata. */
	if (!remote_commit_cached_metadata(vg))
		...
```

If we call pvs just after that suspend and before remote_commit_cached_metadata, the lvconvert hangs in remote_commit_cached_metadata because it's waiting on clvmd - and clvmd hangs on the suspended devices too while trying to process the remote_commit_cached_metadata request. Hence we end up with a deadlock here!
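That window can be probed from a shell by firing pvs repeatedly while a manual repair runs, hoping to land between the suspend and the remote commit described above. A timing-dependent sketch, not a reliable trigger: it presumes a leg has already been failed (as in the reproducer earlier), reuses the vg/lvol0 names from that reproducer, and uses timeout(1) from coreutils so a wedged pvs does not hang the loop itself.

```sh
# Race a manual repair (the comment #29/#30 scenario) against pvs.
lvconvert -y --repair vg/lvol0 &
repair_pid=$!
while kill -0 "$repair_pid" 2>/dev/null; do
    # A pvs that neither completes nor errors within 10s suggests the
    # suspend->remote_commit_cached_metadata window was hit.
    timeout 10 pvs >/dev/null 2>&1 || echo "pvs failed or timed out (window possibly hit)"
done
wait "$repair_pid"
```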
I'm not sure yet why the LCK_VG_COMMIT clvmd command (caused by the remote_commit_cached_metadata call) causes the suspended devices to be accessed on the clvmd side; I'm just inspecting this...

(In reply to Peter Rajnoha from comment #31)
> I'm not sure still why LCK_VG_COMMIT clvmd command (caused by
> remote_commit_cached_metadata call) causes the suspended devices to be
> accessed on clvmd side, I'm just inspecting this...

Nope, it's the pvs that hangs the whole clvmd and makes it unresponsive to any later requests (including the remote_commit_cached_metadata call). The pvs triggers the suspended devices to be accessed even though we have the VG write lock held. This is the exact lock state at the time the hang occurs:

```
[root@rhel6-b ~]# dlm_tool lockdump clvmd
id 03510001 gr PR rq IV pid 3790 master 0 "P_#global"
id 013a0001 gr EX rq IV pid 3412 master 0 "vFe05B7s7BlCwSITdMBcCtQG7ZbmBiVIUfemA25ggpkIaLbtVCQpdxgucjPY3jQq"
id 02a00001 gr EX rq IV pid 3412 master 0 "vFe05B7s7BlCwSITdMBcCtQG7ZbmBiVIvNot9ETRK3QGLb19CfcHdGs3tQrcZdYW"
id 00950001 gr EX rq IV pid 3412 master 0 "vFe05B7s7BlCwSITdMBcCtQG7ZbmBiVILrKxDIReTZwpPkIumvraAYqPIpaPw00W"
id 010a0001 gr PW rq IV pid 3714 master 0 "V_vg"
```

So we're holding the V_vg write lock (dlm's PW) and we're requesting the P_#global lock via the pvs call (dlm's PR) - the PW lock must be finished before the PR one is taken, but it seems something extra happens *before* that and causes the suspended devices to be accessed even though the PR lock is not yet granted. I suspect an incorrect scan is triggered here...

So, while handling pvs and its P_#global request while V_vg is held and suspended devices are present: first, this is the backtrace of the pvs request on the clvmd side, hitting the "Releasing activation in critical section." error and leaving the suspended devices in place:

```
(gdb) bt
#0  activation_release () at activate/activate.c:2368
#1  0x00007ffff7f19bc8 in refresh_toolcontext (cmd=0x7ffff00008c0) at commands/toolcontext.c:1898
#2  0x00007ffff7f061e3 in do_refresh_cache () at lvm-functions.c:658
#3  0x00007ffff7f06317 in do_lock_vg (command=1 '\001', lock_flags=68 'D', resource=0x7ffff8f46223 "P_#global") at lvm-functions.c:694
#4  0x00007ffff7efe9f6 in do_command (msg=0x7ffff8f46210, msglen=29, buf=0x7ffff7ecbd08, buflen=1481, retlen=0x7ffff7ecbd18) at clvmd-command.c:112
#5  0x00007ffff7f0364f in process_local_command (msg=0x7ffff8f46210, msglen=29, xid=51) at clvmd.c:1767
#6  0x00007ffff7f03e50 in process_work_item (cmd=0x7ffff8f3d860) at clvmd.c:1988
#7  0x00007ffff7f03f86 in lvm_thread_fn (arg=0x7fffffffe310) at clvmd.c:2033
#8  0x00007ffff6c31a51 in start_thread () from /lib64/libpthread.so.0
#9  0x00007ffff697e96d in clone () from /lib64/libc.so.6
```

Later on, as part of handling the P_#global clvmd request, we refresh the cache, causing a full scan to trigger - see daemons/clvmd/lvm-functions.c and the do_lock_vg -> do_refresh_cache -> lvmcache_label_scan call chain. That full scan hangs on the suspended devices.

OK, so here's a quick fix for now - we don't have time to fiddle with this more:

https://git.fedorahosted.org/cgit/lvm2.git/commit/?id=1e6a926e857d8df767415c6ff91d4dd1de74f363

I've induced the bug scenario as described in my previous comments and it passes fine with this patch. I've also run the full test suite on my machines and it passes, so I believe there are no regressions from this patch. However, we need to revisit this to fix it properly! I added comments (FIXME) to the code as well as to the commit message. I'll do a scratch build in a while (Jon or Corey, you can try that on your machines before I do the build).

This is no longer reproducible with the fix. I ran this test all day on two different clusters and saw no issues.

```
2.6.32-569.el6.x86_64
lvm2-2.02.118-3.el6                        BUILT: Wed Jun 17 09:40:21 CDT 2015
lvm2-libs-2.02.118-3.el6                   BUILT: Wed Jun 17 09:40:21 CDT 2015
lvm2-cluster-2.02.118-3.el6                BUILT: Wed Jun 17 09:40:21 CDT 2015
udev-147-2.63.el6                          BUILT: Mon Jun  8 06:26:08 CDT 2015
device-mapper-1.02.95-3.el6                BUILT: Wed Jun 17 09:40:21 CDT 2015
device-mapper-libs-1.02.95-3.el6           BUILT: Wed Jun 17 09:40:21 CDT 2015
device-mapper-event-1.02.95-3.el6          BUILT: Wed Jun 17 09:40:21 CDT 2015
device-mapper-event-libs-1.02.95-3.el6     BUILT: Wed Jun 17 09:40:21 CDT 2015
device-mapper-persistent-data-0.3.2-1.el6  BUILT: Fri Apr  4 08:43:06 CDT 2014
cmirror-2.02.118-3.el6                     BUILT: Wed Jun 17 09:40:21 CDT 2015
```

PMApproved for 6.7 0-day.

Marking this verified in the latest rpms. This test case passed hundreds of iterations. I filed bug 1309484 for the one failure.
```
2.6.32-615.el6.x86_64
lvm2-2.02.141-2.el6                              BUILT: Wed Feb 10 07:49:03 CST 2016
lvm2-libs-2.02.141-2.el6                         BUILT: Wed Feb 10 07:49:03 CST 2016
lvm2-cluster-2.02.141-2.el6                      BUILT: Wed Feb 10 07:49:03 CST 2016
udev-147-2.71.el6                                BUILT: Wed Feb 10 07:07:17 CST 2016
device-mapper-1.02.115-2.el6                     BUILT: Wed Feb 10 07:49:03 CST 2016
device-mapper-libs-1.02.115-2.el6                BUILT: Wed Feb 10 07:49:03 CST 2016
device-mapper-event-1.02.115-2.el6               BUILT: Wed Feb 10 07:49:03 CST 2016
device-mapper-event-libs-1.02.115-2.el6          BUILT: Wed Feb 10 07:49:03 CST 2016
device-mapper-persistent-data-0.6.2-0.1.rc1.el6  BUILT: Wed Feb 10 09:52:15 CST 2016
cmirror-2.02.141-2.el6                           BUILT: Wed Feb 10 07:49:03 CST 2016
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0964.html