| Summary: | deadlock when attempting to uncache a suspended exclusively activated cluster cache pool | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Corey Marthaler <cmarthal> |
| Component: | lvm2 | Assignee: | Zdenek Kabelac <zkabelac> |
| lvm2 sub component: | Cache Logical Volumes | QA Contact: | cluster-qe <cluster-qe> |
| Status: | CLOSED WONTFIX | Docs Contact: | |
| Severity: | medium | ||
| Priority: | unspecified | CC: | agk, heinzm, jbrassow, msnitzer, prajnoha, zkabelac |
| Version: | 7.3 | ||
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-11-18 16:57:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Corey Marthaler
2016-10-27 16:05:16 UTC
FWIW, this is actually not reproducible every time. In fact, I have yet to reproduce this in a non-virt machine environment.

This is now easily reproducible in rhel7.4 w/ exclusively activated clustered VGs. It is not reproducible with lvmlockd or single-node VGs.

3.10.0-657.el7.x86_64

lvm2-2.02.171-1.el7    BUILT: Wed May 3 07:05:13 CDT 2017
lvm2-libs-2.02.171-1.el7    BUILT: Wed May 3 07:05:13 CDT 2017
lvm2-cluster-2.02.171-1.el7    BUILT: Wed May 3 07:05:13 CDT 2017
device-mapper-1.02.140-1.el7    BUILT: Wed May 3 07:05:13 CDT 2017
device-mapper-libs-1.02.140-1.el7    BUILT: Wed May 3 07:05:13 CDT 2017
device-mapper-event-1.02.140-1.el7    BUILT: Wed May 3 07:05:13 CDT 2017
device-mapper-event-libs-1.02.140-1.el7    BUILT: Wed May 3 07:05:13 CDT 2017
device-mapper-persistent-data-0.7.0-0.1.rc6.el7    BUILT: Mon Mar 27 10:15:46 CDT 2017

SCENARIO - [attempt_to_remove_suspended_cache_pool]
Create a cache volume, suspend the pool and meta devices, then attempt to remove it

*** Cache info for this scenario ***
* origin (slow): /dev/mapper/mpatha1
* pool (fast):   /dev/mapper/mpathd1
************************************

Adding "slow" and "fast" tags to corresponding pvs

Create origin (slow) volume
lvcreate --wipesignatures y --activate ey -L 4G -n suspend cache_sanity @slow

Create cache data and cache metadata (fast) volumes
lvcreate --activate ey -L 4G -n POOL cache_sanity @fast
lvcreate --activate ey -L 12M -n POOL_meta cache_sanity @fast

Create cache pool volume by combining the cache data and cache metadata (fast) volumes with policy: smq, mode: writethrough
lvconvert --yes --type cache-pool --cachepolicy smq --cachemode writethrough -c 64 --poolmetadata cache_sanity/POOL_meta cache_sanity/POOL
  WARNING: Converting logical volume cache_sanity/POOL and cache_sanity/POOL_meta to cache pool's data and metadata volumes with metadata wiping.
  THIS WILL DESTROY CONTENT OF LOGICAL VOLUME (filesystem etc.)

Create cached volume by combining the cache pool (fast) and origin (slow) volumes
lvconvert --yes --type cache --cachemetadataformat 2 --cachepool cache_sanity/POOL cache_sanity/suspend

CHECKING META FMT: lvs --noheadings -o lv_name --select 'cachemetadataformat=2' | grep suspend
dmsetup status | grep cache_sanity-suspend | grep writethrough | grep -w smq

dmsetup suspend cache_sanity-POOL_cdata

Uncache/remove the cache pool while it's suspended [deadlock]

May 11 12:15:42 mckinley-03 qarshd[46792]: Running cmdline: lvconvert --yes --type cache --cachemetadataformat 2 --cachepool cache_sanity/POOL cache_sanity/suspend
May 11 12:15:42 mckinley-03 systemd-udevd: inotify_add_watch(7, /dev/dm-19, 10) failed: No such file or directory
May 11 12:15:42 mckinley-03 multipathd: cache_sanity-POOL: adding map
May 11 12:15:42 mckinley-03 multipathd: cache_sanity-POOL: devmap dm-19 added
May 11 12:15:42 mckinley-03 multipathd: dm-19: remove map (uevent)
May 11 12:15:42 mckinley-03 multipathd: dm-19: devmap not registered, can't remove
May 11 12:15:42 mckinley-03 multipathd: dm-19: remove map (uevent)
May 11 12:15:43 mckinley-03 systemd: Started qarsh Per-Connection Server (10.15.80.218:33194).
May 11 12:15:43 mckinley-03 systemd: Starting qarsh Per-Connection Server (10.15.80.218:33194)...
May 11 12:15:43 mckinley-03 qarshd[46861]: Talking to peer ::ffff:10.15.80.218:33194 (IPv6)
May 11 12:15:43 mckinley-03 qarshd[46861]: Running cmdline: lvs --noheadings -o lv_name --select 'cachemetadataformat=2' | grep suspend
May 11 12:15:43 mckinley-03 systemd: Started qarsh Per-Connection Server (10.15.80.218:33196).
May 11 12:15:43 mckinley-03 systemd: Starting qarsh Per-Connection Server (10.15.80.218:33196)...
May 11 12:15:43 mckinley-03 qarshd[46869]: Talking to peer ::ffff:10.15.80.218:33196 (IPv6)
May 11 12:15:43 mckinley-03 qarshd[46869]: Running cmdline: lvs --noheadings -o lv_name --select 'cachemode=writethrough && cache_policy=smq' | grep suspend
May 11 12:15:44 mckinley-03 systemd: Started qarsh Per-Connection Server (10.15.80.218:33198).
May 11 12:15:44 mckinley-03 systemd: Starting qarsh Per-Connection Server (10.15.80.218:33198)...
May 11 12:15:44 mckinley-03 qarshd[46878]: Talking to peer ::ffff:10.15.80.218:33198 (IPv6)
May 11 12:15:44 mckinley-03 qarshd[46878]: Running cmdline: dmsetup status | grep cache_sanity-suspend | grep writethrough | grep -w smq
May 11 12:15:44 mckinley-03 systemd: Started qarsh Per-Connection Server (10.15.80.218:33200).
May 11 12:15:44 mckinley-03 systemd: Starting qarsh Per-Connection Server (10.15.80.218:33200)...
May 11 12:15:44 mckinley-03 qarshd[46887]: Talking to peer ::ffff:10.15.80.218:33200 (IPv6)
May 11 12:15:44 mckinley-03 qarshd[46887]: Running cmdline: dmsetup suspend cache_sanity-POOL_cdata
May 11 12:15:45 mckinley-03 systemd: Started qarsh Per-Connection Server (10.15.80.218:33206).
May 11 12:15:45 mckinley-03 systemd: Starting qarsh Per-Connection Server (10.15.80.218:33206)...
May 11 12:15:45 mckinley-03 qarshd[46892]: Talking to peer ::ffff:10.15.80.218:33206 (IPv6)
May 11 12:15:45 mckinley-03 qarshd[46892]: Running cmdline: lvconvert --uncache cache_sanity/suspend
May 11 12:18:59 mckinley-03 kernel: INFO: task lvconvert:46893 blocked for more than 120 seconds.
May 11 12:18:59 mckinley-03 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 11 12:18:59 mckinley-03 kernel: lvconvert D ffff99f1fcc44d80 0 46893 46892 0x00000080
May 11 12:18:59 mckinley-03 kernel: ffff99e1eda33990 0000000000000086 ffff99e1f6ea2f10 ffff99e1eda33fd8
May 11 12:18:59 mckinley-03 kernel: ffff99e1eda33fd8 ffff99e1eda33fd8 ffff99e1f6ea2f10 ffff99f1fea96cc0
May 11 12:19:00 mckinley-03 kernel: 0000000000000000 7fffffffffffffff 0000000000000000 ffff99f1fcc44d80
May 11 12:19:00 mckinley-03 kernel: Call Trace:
May 11 12:19:00 mckinley-03 kernel: [<ffffffff930a4aa9>] schedule+0x29/0x70
May 11 12:19:00 mckinley-03 kernel: [<ffffffff930a25b9>] schedule_timeout+0x239/0x2c0
May 11 12:19:00 mckinley-03 kernel: [<ffffffffc01b6e0b>] ? dm_make_request+0x16b/0x190 [dm_mod]
May 11 12:19:00 mckinley-03 kernel: [<ffffffff92cf6c26>] ? generic_make_request+0x106/0x1e0
May 11 12:19:00 mckinley-03 kernel: [<ffffffff92ae933c>] ? ktime_get_ts64+0x4c/0xf0
May 11 12:19:00 mckinley-03 kernel: [<ffffffff930a412d>] io_schedule_timeout+0xad/0x130
May 11 12:19:00 mckinley-03 kernel: [<ffffffff930a41c8>] io_schedule+0x18/0x20
May 11 12:19:00 mckinley-03 kernel: [<ffffffff92c41845>] do_blockdev_direct_IO+0x1c45/0x2020
May 11 12:19:00 mckinley-03 kernel: [<ffffffff92c3c1a0>] ? I_BDEV+0x10/0x10
May 11 12:19:00 mckinley-03 kernel: [<ffffffff92c41c75>] __blockdev_direct_IO+0x55/0x60
May 11 12:19:00 mckinley-03 kernel: [<ffffffff92c3c1a0>] ? I_BDEV+0x10/0x10
May 11 12:19:00 mckinley-03 kernel: [<ffffffff92c3ca90>] blkdev_direct_IO+0x60/0xa0
May 11 12:19:00 mckinley-03 kernel: [<ffffffff92c3c1a0>] ? I_BDEV+0x10/0x10
May 11 12:19:00 mckinley-03 kernel: [<ffffffff92b8438c>] generic_file_aio_read+0x70c/0x790
May 11 12:19:00 mckinley-03 kernel: [<ffffffff92c11116>] ? path_openat+0x166/0x490
May 11 12:19:00 mckinley-03 kernel: [<ffffffff92c3cfec>] blkdev_aio_read+0x4c/0x70
May 11 12:19:00 mckinley-03 kernel: [<ffffffff92bffd1d>] do_sync_read+0x8d/0xd0
May 11 12:19:00 mckinley-03 kernel: [<ffffffff92c0070c>] vfs_read+0x9c/0x170
May 11 12:19:00 mckinley-03 kernel: [<ffffffff92c015cf>] SyS_read+0x7f/0xe0
May 11 12:19:00 mckinley-03 kernel: [<ffffffff930b05c9>] system_call_fastpath+0x16/0x1b
May 11 12:21:00 mckinley-03 kernel: INFO: task lvconvert:46893 blocked for more than 120 seconds.
May 11 12:21:00 mckinley-03 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 11 12:21:00 mckinley-03 kernel: lvconvert D ffff99f1fcc44d80 0 46893 46892 0x00000080
May 11 12:21:00 mckinley-03 kernel: ffff99e1eda33990 0000000000000086 ffff99e1f6ea2f10 ffff99e1eda33fd8
May 11 12:21:00 mckinley-03 kernel: ffff99e1eda33fd8 ffff99e1eda33fd8 ffff99e1f6ea2f10 ffff99f1fea96cc0
May 11 12:21:00 mckinley-03 kernel: 0000000000000000 7fffffffffffffff 0000000000000000 ffff99f1fcc44d80
May 11 12:21:00 mckinley-03 kernel: Call Trace:
May 11 12:21:00 mckinley-03 kernel: [<ffffffff930a4aa9>] schedule+0x29/0x70
May 11 12:21:00 mckinley-03 kernel: [<ffffffff930a25b9>] schedule_timeout+0x239/0x2c0
May 11 12:21:00 mckinley-03 kernel: [<ffffffffc01b6e0b>] ? dm_make_request+0x16b/0x190 [dm_mod]
May 11 12:21:00 mckinley-03 kernel: [<ffffffff92cf6c26>] ? generic_make_request+0x106/0x1e0
May 11 12:21:00 mckinley-03 kernel: [<ffffffff92ae933c>] ? ktime_get_ts64+0x4c/0xf0
May 11 12:21:00 mckinley-03 kernel: [<ffffffff930a412d>] io_schedule_timeout+0xad/0x130
May 11 12:21:00 mckinley-03 kernel: [<ffffffff930a41c8>] io_schedule+0x18/0x20
May 11 12:21:00 mckinley-03 kernel: [<ffffffff92c41845>] do_blockdev_direct_IO+0x1c45/0x2020
May 11 12:21:00 mckinley-03 kernel: [<ffffffff92c3c1a0>] ? I_BDEV+0x10/0x10
May 11 12:21:00 mckinley-03 kernel: [<ffffffff92c41c75>] __blockdev_direct_IO+0x55/0x60
May 11 12:21:00 mckinley-03 kernel: [<ffffffff92c3c1a0>] ? I_BDEV+0x10/0x10
May 11 12:21:00 mckinley-03 kernel: [<ffffffff92c3ca90>] blkdev_direct_IO+0x60/0xa0
May 11 12:21:00 mckinley-03 kernel: [<ffffffff92c3c1a0>] ? I_BDEV+0x10/0x10
May 11 12:21:00 mckinley-03 kernel: [<ffffffff92b8438c>] generic_file_aio_read+0x70c/0x790
May 11 12:21:00 mckinley-03 kernel: [<ffffffff92c11116>] ? path_openat+0x166/0x490
May 11 12:21:00 mckinley-03 kernel: [<ffffffff92c3cfec>] blkdev_aio_read+0x4c/0x70
May 11 12:21:00 mckinley-03 kernel: [<ffffffff92bffd1d>] do_sync_read+0x8d/0xd0
May 11 12:21:00 mckinley-03 kernel: [<ffffffff92c0070c>] vfs_read+0x9c/0x170
May 11 12:21:00 mckinley-03 kernel: [<ffffffff92c015cf>] SyS_read+0x7f/0xe0
May 11 12:21:00 mckinley-03 kernel: [<ffffffff930b05c9>] system_call_fastpath+0x16/0x1b
May 11 12:23:00 mckinley-03 kernel: INFO: task lvconvert:46893 blocked for more than 120 seconds.
May 11 12:23:00 mckinley-03 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 11 12:23:00 mckinley-03 kernel: lvconvert D ffff99f1fcc44d80 0 46893 46892 0x00000080
May 11 12:23:00 mckinley-03 kernel: ffff99e1eda33990 0000000000000086 ffff99e1f6ea2f10 ffff99e1eda33fd8
May 11 12:23:00 mckinley-03 kernel: ffff99e1eda33fd8 ffff99e1eda33fd8 ffff99e1f6ea2f10 ffff99f1fea96cc0
May 11 12:23:00 mckinley-03 kernel: 0000000000000000 7fffffffffffffff 0000000000000000 ffff99f1fcc44d80
May 11 12:23:00 mckinley-03 kernel: Call Trace:
May 11 12:23:00 mckinley-03 kernel: [<ffffffff930a4aa9>] schedule+0x29/0x70
May 11 12:23:00 mckinley-03 kernel: [<ffffffff930a25b9>] schedule_timeout+0x239/0x2c0
May 11 12:23:00 mckinley-03 kernel: [<ffffffffc01b6e0b>] ? dm_make_request+0x16b/0x190 [dm_mod]
May 11 12:23:00 mckinley-03 kernel: [<ffffffff92cf6c26>] ? generic_make_request+0x106/0x1e0
May 11 12:23:00 mckinley-03 kernel: [<ffffffff92ae933c>] ? ktime_get_ts64+0x4c/0xf0
May 11 12:23:00 mckinley-03 kernel: [<ffffffff930a412d>] io_schedule_timeout+0xad/0x130
May 11 12:23:00 mckinley-03 kernel: [<ffffffff930a41c8>] io_schedule+0x18/0x20
May 11 12:23:00 mckinley-03 kernel: [<ffffffff92c41845>] do_blockdev_direct_IO+0x1c45/0x2020
May 11 12:23:00 mckinley-03 kernel: [<ffffffff92c3c1a0>] ? I_BDEV+0x10/0x10
May 11 12:23:00 mckinley-03 kernel: [<ffffffff92c41c75>] __blockdev_direct_IO+0x55/0x60
May 11 12:23:00 mckinley-03 kernel: [<ffffffff92c3c1a0>] ? I_BDEV+0x10/0x10
May 11 12:23:00 mckinley-03 kernel: [<ffffffff92c3ca90>] blkdev_direct_IO+0x60/0xa0
May 11 12:23:00 mckinley-03 kernel: [<ffffffff92c3c1a0>] ? I_BDEV+0x10/0x10
May 11 12:23:00 mckinley-03 kernel: [<ffffffff92b8438c>] generic_file_aio_read+0x70c/0x790
May 11 12:23:00 mckinley-03 kernel: [<ffffffff92c11116>] ? path_openat+0x166/0x490
May 11 12:23:00 mckinley-03 kernel: [<ffffffff92c3cfec>] blkdev_aio_read+0x4c/0x70
May 11 12:23:00 mckinley-03 kernel: [<ffffffff92bffd1d>] do_sync_read+0x8d/0xd0
May 11 12:23:00 mckinley-03 kernel: [<ffffffff92c0070c>] vfs_read+0x9c/0x170
May 11 12:23:00 mckinley-03 kernel: [<ffffffff92c015cf>] SyS_read+0x7f/0xe0
May 11 12:23:00 mckinley-03 kernel: [<ffffffff930b05c9>] system_call_fastpath+0x16/0x1b

Looks like this is actually reproducible in 7.6 w/ exclusively activated cache pools on shared lvmlockd VGs. Turning off this scenario in *all* clustered environments going forward...
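For reference, the failing sequence above condenses to the short reproducer below. This is a sketch only: the commands are taken verbatim from the scenario log, and the VG name (cache_sanity) plus the @slow/@fast PV tags are specific to this test run.

# Reproducer distilled from the scenario log above (assumes a VG named
# "cache_sanity" whose PVs carry the @slow/@fast tags used in this run).
lvcreate --wipesignatures y --activate ey -L 4G -n suspend cache_sanity @slow   # origin (slow)
lvcreate --activate ey -L 4G -n POOL cache_sanity @fast                         # cache data (fast)
lvcreate --activate ey -L 12M -n POOL_meta cache_sanity @fast                   # cache metadata (fast)

# Combine data + metadata into a cache pool, then attach it to the origin.
lvconvert --yes --type cache-pool --cachepolicy smq --cachemode writethrough -c 64 --poolmetadata cache_sanity/POOL_meta cache_sanity/POOL
lvconvert --yes --type cache --cachemetadataformat 2 --cachepool cache_sanity/POOL cache_sanity/suspend

# Suspend the cache-data sub-LV behind lvm2's back, then try to uncache it.
dmsetup suspend cache_sanity-POOL_cdata
lvconvert --uncache cache_sanity/suspend    # <-- blocks here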
3.10.0-927.el7.x86_64

lvm2-2.02.180-1.el7    BUILT: Fri Jul 20 12:21:35 CDT 2018
lvm2-libs-2.02.180-1.el7    BUILT: Fri Jul 20 12:21:35 CDT 2018
lvm2-cluster-2.02.180-1.el7    BUILT: Fri Jul 20 12:21:35 CDT 2018
lvm2-lockd-2.02.180-1.el7    BUILT: Fri Jul 20 12:21:35 CDT 2018
lvm2-python-boom-0.9-4.el7    BUILT: Fri Jul 20 12:23:30 CDT 2018
cmirror-2.02.180-1.el7    BUILT: Fri Jul 20 12:21:35 CDT 2018
device-mapper-1.02.149-1.el7    BUILT: Fri Jul 20 12:21:35 CDT 2018
device-mapper-libs-1.02.149-1.el7    BUILT: Fri Jul 20 12:21:35 CDT 2018
device-mapper-event-1.02.149-1.el7    BUILT: Fri Jul 20 12:21:35 CDT 2018
device-mapper-event-libs-1.02.149-1.el7    BUILT: Fri Jul 20 12:21:35 CDT 2018
device-mapper-persistent-data-0.7.3-3.el7    BUILT: Tue Nov 14 05:07:18 CST 2017

SCENARIO - [attempt_to_remove_suspended_cache_pool]
Create a cache volume, suspend the pool and meta devices, then attempt to remove it

*** Cache info for this scenario ***
* origin (slow): /dev/mapper/mpathb1
* pool (fast):   /dev/mapper/mpathd1
************************************

Adding "slow" and "fast" tags to corresponding pvs

Create origin (slow) volume
lvcreate --wipesignatures y --activate ey -L 4G -n suspend cache_sanity @slow

Create cache data and cache metadata (fast) volumes
lvcreate --activate ey -L 4G -n POOL cache_sanity @fast
lvcreate --activate ey -L 12M -n POOL_meta cache_sanity @fast

Create cache pool volume by combining the cache data and cache metadata (fast) volumes with policy: mq, mode: writethrough
lvconvert --yes --type cache-pool --cachepolicy mq --cachemode writethrough -c 64 --poolmetadata cache_sanity/POOL_meta cache_sanity/POOL
  WARNING: Converting cache_sanity/POOL and cache_sanity/POOL_meta to cache pool's data and metadata volumes with metadata wiping.
  THIS WILL DESTROY CONTENT OF LOGICAL VOLUME (filesystem etc.)

Create cached volume by combining the cache pool (fast) and origin (slow) volumes
lvconvert --yes --type cache --cachemetadataformat 2 --cachepool cache_sanity/POOL cache_sanity/suspend
lvm cache metadata format doesn't appear to be consistent with what was specified (2)

dmsetup suspend cache_sanity-POOL_cdata

Uncache/remove the cache pool while it's suspended
lvconvert --uncache cache_sanity/suspend
[DEADLOCK]

Jul 26 17:06:33 harding-03 qarshd[10825]: Running cmdline: lvconvert --uncache cache_sanity/suspend
Jul 26 17:09:01 harding-03 kernel: INFO: task lvconvert:10826 blocked for more than 120 seconds.
Jul 26 17:09:01 harding-03 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 26 17:09:01 harding-03 kernel: lvconvert D ffffa0d16d34b0c0 0 10826 10825 0x00000080
Jul 26 17:09:01 harding-03 kernel: Call Trace:
Jul 26 17:09:01 harding-03 kernel: [<ffffffffa6d61069>] schedule+0x29/0x70
Jul 26 17:09:01 harding-03 kernel: [<ffffffffa6d62955>] rwsem_down_write_failed+0x225/0x3a0
Jul 26 17:09:01 harding-03 kernel: [<ffffffffa66c67df>] ? try_check_zero+0xbf/0xf0
Jul 26 17:09:01 harding-03 kernel: [<ffffffffa69831c7>] call_rwsem_down_write_failed+0x17/0x30
Jul 26 17:09:01 harding-03 kernel: [<ffffffffa6d601dd>] down_write+0x2d/0x3d
Jul 26 17:09:01 harding-03 kernel: [<ffffffffc08cb8d5>] cache_postsuspend+0x35/0x330 [dm_cache]
Jul 26 17:09:01 harding-03 kernel: [<ffffffffc08cb8a0>] ? writethrough_endio+0x170/0x170 [dm_cache]
Jul 26 17:09:01 harding-03 kernel: [<ffffffffc02feedd>] dm_table_postsuspend_targets+0x4d/0x60 [dm_mod]
Jul 26 17:09:01 harding-03 kernel: [<ffffffffc02fc09c>] dm_suspend+0xbc/0xc0 [dm_mod]
Jul 26 17:09:01 harding-03 kernel: [<ffffffffc0301a87>] dev_suspend+0x197/0x260 [dm_mod]
Jul 26 17:09:01 harding-03 kernel: [<ffffffffc0302b4e>] ctl_ioctl+0x24e/0x550 [dm_mod]
Jul 26 17:09:01 harding-03 kernel: [<ffffffffc03018f0>] ? table_load+0x390/0x390 [dm_mod]
Jul 26 17:09:01 harding-03 kernel: [<ffffffffc0302e5e>] dm_ctl_ioctl+0xe/0x20 [dm_mod]
Jul 26 17:09:01 harding-03 kernel: [<ffffffffa6854470>] do_vfs_ioctl+0x3a0/0x5a0
Jul 26 17:09:01 harding-03 kernel: [<ffffffffa6d6076e>] ? __schedule+0x14e/0xa20
Jul 26 17:09:01 harding-03 kernel: [<ffffffffa6854711>] SyS_ioctl+0xa1/0xc0
Jul 26 17:09:01 harding-03 kernel: [<ffffffffa6d6dc95>] ? system_call_after_swapgs+0xa2/0x146
Jul 26 17:09:01 harding-03 kernel: [<ffffffffa6d6dd55>] system_call_fastpath+0x1c/0x21
Jul 26 17:09:01 harding-03 kernel: [<ffffffffa6d6dca1>] ? system_call_after_swapgs+0xae/0x146
Jul 26 17:11:01 harding-03 kernel: INFO: task lvconvert:10826 blocked for more than 120 seconds.
Jul 26 17:11:01 harding-03 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 26 17:11:01 harding-03 kernel: lvconvert D ffffa0d16d34b0c0 0 10826 10825 0x00000080
Jul 26 17:11:01 harding-03 kernel: Call Trace:
Jul 26 17:11:01 harding-03 kernel: [<ffffffffa6d61069>] schedule+0x29/0x70
Jul 26 17:11:01 harding-03 kernel: [<ffffffffa6d62955>] rwsem_down_write_failed+0x225/0x3a0
Jul 26 17:11:01 harding-03 kernel: [<ffffffffa66c67df>] ? try_check_zero+0xbf/0xf0
Jul 26 17:11:01 harding-03 kernel: [<ffffffffa69831c7>] call_rwsem_down_write_failed+0x17/0x30
Jul 26 17:11:01 harding-03 kernel: [<ffffffffa6d601dd>] down_write+0x2d/0x3d
Jul 26 17:11:01 harding-03 kernel: [<ffffffffc08cb8d5>] cache_postsuspend+0x35/0x330 [dm_cache]
Jul 26 17:11:01 harding-03 kernel: [<ffffffffc08cb8a0>] ? writethrough_endio+0x170/0x170 [dm_cache]
Jul 26 17:11:01 harding-03 kernel: [<ffffffffc02feedd>] dm_table_postsuspend_targets+0x4d/0x60 [dm_mod]
Jul 26 17:11:01 harding-03 kernel: [<ffffffffc02fc09c>] dm_suspend+0xbc/0xc0 [dm_mod]
Jul 26 17:11:01 harding-03 kernel: [<ffffffffc0301a87>] dev_suspend+0x197/0x260 [dm_mod]
Jul 26 17:11:01 harding-03 kernel: [<ffffffffc0302b4e>] ctl_ioctl+0x24e/0x550 [dm_mod]
Jul 26 17:11:01 harding-03 kernel: [<ffffffffc03018f0>] ? table_load+0x390/0x390 [dm_mod]
Jul 26 17:11:01 harding-03 kernel: [<ffffffffc0302e5e>] dm_ctl_ioctl+0xe/0x20 [dm_mod]
Jul 26 17:11:01 harding-03 kernel: [<ffffffffa6854470>] do_vfs_ioctl+0x3a0/0x5a0
Jul 26 17:11:01 harding-03 kernel: [<ffffffffa6d6076e>] ? __schedule+0x14e/0xa20
Jul 26 17:11:01 harding-03 kernel: [<ffffffffa6854711>] SyS_ioctl+0xa1/0xc0
Jul 26 17:11:02 harding-03 kernel: [<ffffffffa6d6dc95>] ? system_call_after_swapgs+0xa2/0x146
Jul 26 17:11:02 harding-03 kernel: [<ffffffffa6d6dd55>] system_call_fastpath+0x1c/0x21
Jul 26 17:11:02 harding-03 kernel: [<ffffffffa6d6dca1>] ? system_call_after_swapgs+0xae/0x146
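The state that triggers the hang is visible before the uncache is even attempted: the sub-LV that was suspended via dmsetup reports SUSPENDED. A minimal pre-flight check, assuming the device names from this report:

# Sketch: confirm no cache sub-device was suspended outside of lvm2 before
# uncaching; a SUSPENDED state here is what the later lvconvert blocks on.
dmsetup info cache_sanity-POOL_cdata | grep '^State'
# State:             SUSPENDED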
This can deadlock in single machine (non cluster/shared) mode as well.

3.10.0-931.el7.x86_64

lvm2-2.02.180-2.el7    BUILT: Wed Aug 1 11:22:48 CDT 2018
lvm2-libs-2.02.180-2.el7    BUILT: Wed Aug 1 11:22:48 CDT 2018
device-mapper-1.02.149-2.el7    BUILT: Wed Aug 1 11:22:48 CDT 2018
device-mapper-libs-1.02.149-2.el7    BUILT: Wed Aug 1 11:22:48 CDT 2018
device-mapper-event-1.02.149-2.el7    BUILT: Wed Aug 1 11:22:48 CDT 2018
device-mapper-event-libs-1.02.149-2.el7    BUILT: Wed Aug 1 11:22:48 CDT 2018
device-mapper-persistent-data-0.7.3-3.el7    BUILT: Tue Nov 14 05:07:18 CST 2017

SCENARIO - [attempt_to_remove_suspended_cache_pool]
Create a cache volume, suspend the pool and meta devices, then attempt to remove it
WARNING: Not using lvmetad because a repair command was run.

*** Cache info for this scenario ***
* origin (slow): /dev/sda1 /dev/sdd1 /dev/sdf1
* pool (fast):   /dev/sdc1 /dev/sde1 /dev/sdg1
************************************

Adding "slow" and "fast" tags to corresponding pvs
WARNING: Not using lvmetad because a repair command was run.
WARNING: Not using lvmetad because a repair command was run.

Create origin (slow) volume
lvcreate -i 3 -L 4G -n suspend cache_sanity @slow
WARNING: Not using lvmetad because a repair command was run.

Create cache data and cache metadata (fast) volumes
lvcreate -i 3 -L 4G -n POOL cache_sanity @fast
WARNING: Not using lvmetad because a repair command was run.
WARNING: xfs_external_log signature detected on /dev/cache_sanity/POOL at offset 196608. Wipe it? [y/n]: [n]
  Aborted wiping of xfs_external_log.
  1 existing signature left on the device.
lvcreate -i 3 -L 12M -n POOL_meta cache_sanity @fast
WARNING: Not using lvmetad because a repair command was run.

Create cache pool volume by combining the cache data and cache metadata (fast) volumes with policy: smq, mode: writethrough
lvconvert --yes --type cache-pool --cachepolicy smq --cachemode writethrough -c 64 --poolmetadata cache_sanity/POOL_meta cache_sanity/POOL
WARNING: Not using lvmetad because a repair command was run.
  WARNING: Converting cache_sanity/POOL and cache_sanity/POOL_meta to cache pool's data and metadata volumes with metadata wiping.
  THIS WILL DESTROY CONTENT OF LOGICAL VOLUME (filesystem etc.)

Create cached volume by combining the cache pool (fast) and origin (slow) volumes
lvconvert --yes --type cache --cachemetadataformat 2 --cachepool cache_sanity/POOL cache_sanity/suspend
WARNING: Not using lvmetad because a repair command was run.
WARNING: Not using lvmetad because a repair command was run.

dmsetup suspend cache_sanity-POOL_cdata

Uncache/remove the cache pool while it's suspended
lvconvert --uncache cache_sanity/suspend
WARNING: Not using lvmetad because a repair command was run.

Aug 6 19:13:16 host-093 kernel: [<ffffffffc03bafad>] dm_table_postsuspend_targets+0x4d/0x60 [dm_mod]
Aug 6 19:13:16 host-093 kernel: [<ffffffffc03b807c>] dm_suspend+0xbc/0xc0 [dm_mod]
Aug 6 19:13:16 host-093 kernel: [<ffffffffc03bdb57>] dev_suspend+0x197/0x260 [dm_mod]
Aug 6 19:13:16 host-093 kernel: [<ffffffffc03bec1e>] ctl_ioctl+0x24e/0x550 [dm_mod]
Aug 6 19:13:16 host-093 kernel: [<ffffffffc03bd9c0>] ? table_load+0x390/0x390 [dm_mod]
Aug 6 19:13:16 host-093 kernel: [<ffffffffc03bef2e>] dm_ctl_ioctl+0xe/0x20 [dm_mod]
Aug 6 19:13:16 host-093 kernel: [<ffffffff920549a0>] do_vfs_ioctl+0x3a0/0x5a0
Aug 6 19:13:16 host-093 kernel: [<ffffffff9256313e>] ? __schedule+0x14e/0xa20
Aug 6 19:13:16 host-093 kernel: [<ffffffff92054c41>] SyS_ioctl+0xa1/0xc0
Aug 6 19:13:16 host-093 kernel: [<ffffffff92570c95>] ? system_call_after_swapgs+0xa2/0x146
Aug 6 19:13:16 host-093 kernel: [<ffffffff92570d55>] system_call_fastpath+0x1c/0x21
Aug 6 19:13:16 host-093 kernel: [<ffffffff92570ca1>] ? system_call_after_swapgs+0xae/0x146
Aug 6 19:15:16 host-093 kernel: INFO: task lvconvert:7390 blocked for more than 120 seconds.
Aug 6 19:15:16 host-093 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 6 19:15:16 host-093 kernel: lvconvert D ffffa031ce0cc100 0 7390 7389 0x00000080
Aug 6 19:15:16 host-093 kernel: Call Trace:
Aug 6 19:15:16 host-093 kernel: [<ffffffff92563a39>] schedule+0x29/0x70
Aug 6 19:15:16 host-093 kernel: [<ffffffff92565325>] rwsem_down_write_failed+0x225/0x3a0
Aug 6 19:15:16 host-093 kernel: [<ffffffff91ec66bf>] ? try_check_zero+0xbf/0xf0
Aug 6 19:15:16 host-093 kernel: [<ffffffff92183d37>] call_rwsem_down_write_failed+0x17/0x30
Aug 6 19:15:16 host-093 kernel: [<ffffffff92562bad>] down_write+0x2d/0x3d
Aug 6 19:15:16 host-093 kernel: [<ffffffffc07a48d5>] cache_postsuspend+0x35/0x330 [dm_cache]
Aug 6 19:15:16 host-093 kernel: [<ffffffffc03bafad>] dm_table_postsuspend_targets+0x4d/0x60 [dm_mod]
Aug 6 19:15:16 host-093 kernel: [<ffffffffc03b807c>] dm_suspend+0xbc/0xc0 [dm_mod]
Aug 6 19:15:16 host-093 kernel: [<ffffffffc03bdb57>] dev_suspend+0x197/0x260 [dm_mod]
Aug 6 19:15:16 host-093 kernel: [<ffffffffc03bec1e>] ctl_ioctl+0x24e/0x550 [dm_mod]
Aug 6 19:15:16 host-093 kernel: [<ffffffffc03bd9c0>] ? table_load+0x390/0x390 [dm_mod]
Aug 6 19:15:16 host-093 kernel: [<ffffffffc03bef2e>] dm_ctl_ioctl+0xe/0x20 [dm_mod]
Aug 6 19:15:16 host-093 kernel: [<ffffffff920549a0>] do_vfs_ioctl+0x3a0/0x5a0
Aug 6 19:15:16 host-093 kernel: [<ffffffff9256313e>] ? __schedule+0x14e/0xa20
Aug 6 19:15:16 host-093 kernel: [<ffffffff92054c41>] SyS_ioctl+0xa1/0xc0
Aug 6 19:15:16 host-093 kernel: [<ffffffff92570c95>] ? system_call_after_swapgs+0xa2/0x146
Aug 6 19:15:16 host-093 kernel: [<ffffffff92570d55>] system_call_fastpath+0x1c/0x21
Aug 6 19:15:16 host-093 kernel: [<ffffffff92570ca1>] ? system_call_after_swapgs+0xae/0x146
Aug 6 19:17:16 host-093 kernel: INFO: task lvconvert:7390 blocked for more than 120 seconds.
Aug 6 19:17:16 host-093 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 6 19:17:16 host-093 kernel: lvconvert D ffffa031ce0cc100 0 7390 7389 0x00000080
Aug 6 19:17:16 host-093 kernel: Call Trace:
Aug 6 19:17:16 host-093 kernel: [<ffffffff92563a39>] schedule+0x29/0x70
Aug 6 19:17:16 host-093 kernel: [<ffffffff92565325>] rwsem_down_write_failed+0x225/0x3a0
Aug 6 19:17:16 host-093 kernel: [<ffffffff91ec66bf>] ? try_check_zero+0xbf/0xf0
Aug 6 19:17:16 host-093 kernel: [<ffffffff92183d37>] call_rwsem_down_write_failed+0x17/0x30
Aug 6 19:17:16 host-093 kernel: [<ffffffff92562bad>] down_write+0x2d/0x3d
Aug 6 19:17:16 host-093 kernel: [<ffffffffc07a48d5>] cache_postsuspend+0x35/0x330 [dm_cache]
Aug 6 19:17:16 host-093 kernel: [<ffffffffc03bafad>] dm_table_postsuspend_targets+0x4d/0x60 [dm_mod]
Aug 6 19:17:16 host-093 kernel: [<ffffffffc03b807c>] dm_suspend+0xbc/0xc0 [dm_mod]
Aug 6 19:17:16 host-093 kernel: [<ffffffffc03bdb57>] dev_suspend+0x197/0x260 [dm_mod]
Aug 6 19:17:16 host-093 kernel: [<ffffffffc03bec1e>] ctl_ioctl+0x24e/0x550 [dm_mod]
Aug 6 19:17:16 host-093 kernel: [<ffffffffc03bd9c0>] ? table_load+0x390/0x390 [dm_mod]
Aug 6 19:17:16 host-093 kernel: [<ffffffffc03bef2e>] dm_ctl_ioctl+0xe/0x20 [dm_mod]
Aug 6 19:17:16 host-093 kernel: [<ffffffff920549a0>] do_vfs_ioctl+0x3a0/0x5a0
Aug 6 19:17:16 host-093 kernel: [<ffffffff9256313e>] ? __schedule+0x14e/0xa20
Aug 6 19:17:16 host-093 kernel: [<ffffffff92054c41>] SyS_ioctl+0xa1/0xc0
Aug 6 19:17:16 host-093 kernel: [<ffffffff92570c95>] ? system_call_after_swapgs+0xa2/0x146
Aug 6 19:17:16 host-093 kernel: [<ffffffff92570d55>] system_call_fastpath+0x1c/0x21
Aug 6 19:17:16 host-093 kernel: [<ffffffff92570ca1>] ? system_call_after_swapgs+0xae/0x146

Closing, as there is no design yet for handling these situations upstream. Currently lvm2 works as expected and waits until devices suspended outside of the lvm2 world are resumed.
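Consistent with that closing note, the blocked lvconvert would be expected to proceed once the externally suspended device is resumed. A sketch using the device from this report (not verified against these exact runs):

# Resume the device that was suspended outside of lvm2; per the comment
# above, the waiting "lvconvert --uncache" should then continue.
dmsetup resume cache_sanity-POOL_cdata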