Description of problem:

If a thin-pool device runs out of free space, it may leave threads (e.g. dd) blocked on IO that cannot complete until the thin-pool is given additional space.

Analysis of the stack traces below (blkdev_put from dd versus blkdev_get from lvresize): dd's blkdev_put() took bdev->bd_mutex and is stuck in sync_blockdev() because the thinp device ran out of space to provision. Since dd is blocked while holding bdev->bd_mutex, other threads are expected to get stuck in blkdev_get() waiting for the same (already held) bdev->bd_mutex.

So this definitely impacts what LVM can expect to do against a thinp device whose thin-pool has exhausted its free space.

Version-Release number of selected component (if applicable):
The basic issue exists in older kernels, but it is a larger concern with thinp-enabled kernels: Linux >= 3.2

How reproducible:
Always (AFAIK).

Steps to Reproduce (a command-level sketch is included after this report):
1. Create a thin pool.
2. Create a thin device from the pool (with monitoring disabled, so no dmeventd resize happens).
3. Run: dd if=/dev/zero of=/dev/vg/lv
4. The thin pool fills up, so all IO to the thin device waits for a resize
   (device-mapper: thin: 253:2: no free space available.)
5. Now run: dmsetup info - it freezes, and only unfreezes once a successful resize of the pool device happens.

Actual results:
A thread blocks on IO while holding bdev->bd_mutex; the thread gets stuck in sync_blockdev() on last close:

INFO: task dd:6857 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
dd              D 0000000000000000     0  6857   5194 0x00000004
 ffff880128ccbc08 0000000000000046 ffffffff814da724 0000000000000001
 0000000028ccbcc0 ffff88006e208000 ffff88006e208000 ffff880128ccbfd8
 ffff880128ccbfd8 ffff880128ccbfd8 ffffffff81a0d020 ffff88006e208000
Call Trace:
 [<ffffffff814da724>] ? __schedule+0x994/0xb50
 [<ffffffff8110baa0>] ? __lock_page+0x70/0x70
 [<ffffffff814da97f>] schedule+0x3f/0x60
 [<ffffffff814daa2f>] io_schedule+0x8f/0xd0
 [<ffffffff8110baae>] sleep_on_page+0xe/0x20
 [<ffffffff814db4ef>] __wait_on_bit+0x5f/0x90
 [<ffffffff8110cde2>] ? find_get_pages_tag+0x132/0x210
 [<ffffffff8110ccb0>] ? grab_cache_page_nowait+0xa0/0xa0
 [<ffffffff8110bc18>] wait_on_page_bit+0x78/0x80
 [<ffffffff8107c6f0>] ? autoremove_wake_function+0x40/0x40
 [<ffffffff8110bd2c>] filemap_fdatawait_range+0x10c/0x1b0
 [<ffffffff8110bdfb>] filemap_fdatawait+0x2b/0x30
 [<ffffffff8110e434>] filemap_write_and_wait+0x44/0x60
 [<ffffffff8119b661>] __sync_blockdev+0x21/0x40
 [<ffffffff8119b693>] sync_blockdev+0x13/0x20
 [<ffffffff8119b70e>] __blkdev_put+0x6e/0x1f0
 [<ffffffff8119b8ed>] blkdev_put+0x5d/0x180
 [<ffffffff8119ba38>] blkdev_close+0x28/0x30
 [<ffffffff811623fe>] fput+0xfe/0x240
 [<ffffffff8115df69>] filp_close+0x69/0x90
 [<ffffffff8115e220>] sys_close+0xc0/0x1a0
 [<ffffffff814e5d6b>] system_call_fastpath+0x16/0x1b
1 lock held by dd/6857:
 #0:  (&bdev->bd_mutex){+.+.+.}, at: [<ffffffff8119b6df>] __blkdev_put+0x3f/0x1f0

Other threads then try to access the device and get blocked behind the original thread that holds bdev->bd_mutex:

lvresize        D 0000000000000000     0  7669   6750 0x00000000
 ffff88001be91ac8 0000000000000046 ffffffff814da724 ffff88001bbfa300
 0000000000000007 ffff88001bbfa300 ffff88001be91aa8 ffff88001be91fd8
 ffff88001be91fd8 ffff88001be91fd8 ffff880136d44600 ffff88001bbfa300
Call Trace:
 [<ffffffff814da724>] ? __schedule+0x994/0xb50
 [<ffffffff814da97f>] schedule+0x3f/0x60
 [<ffffffff814db902>] mutex_lock_nested+0x1b2/0x3f0
 [<ffffffff8119babe>] ? __blkdev_get+0x7e/0x430
 [<ffffffff8119babe>] ? __blkdev_get+0x7e/0x430
 [<ffffffff8119babe>] __blkdev_get+0x7e/0x430
 [<ffffffff8119bec3>] blkdev_get+0x53/0x350
 [<ffffffff814e1d79>] ? sub_preempt_count+0xa9/0xe0
 [<ffffffff814de535>] ? _raw_spin_unlock+0x35/0x60
 [<ffffffff8119c22c>] blkdev_open+0x6c/0x90
 [<ffffffff8115da84>] __dentry_open+0x334/0x4a0
 [<ffffffff814e1d79>] ? sub_preempt_count+0xa9/0xe0
 [<ffffffff8119c1c0>] ? blkdev_get+0x350/0x350
 [<ffffffff8115f1e4>] nameidata_to_filp+0x74/0x80
 [<ffffffff8117080c>] do_last+0x26c/0x900
 [<ffffffff8117119b>] path_openat+0x2bb/0x3e0
 [<ffffffff8112e343>] ? might_fault+0x53/0xb0
 [<ffffffff811713e2>] do_filp_open+0x42/0xa0
 [<ffffffff814de535>] ? _raw_spin_unlock+0x35/0x60
 [<ffffffff8117f25d>] ? alloc_fd+0x18d/0x1f0
 [<ffffffff8115f2e7>] do_sys_open+0xf7/0x1d0
 [<ffffffff8115f3e0>] sys_open+0x20/0x30
 [<ffffffff814e5d6b>] system_call_fastpath+0x16/0x1b

Expected results:
An exhausted thin-pool should not result in tasks blocking on IO while holding bdev->bd_mutex.
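For convenience, here is a minimal command-level sketch of the reproduction steps above. The VG/LV names (vg, pool, thinlv) and the sizes are illustrative only, and "lvchange --monitor n" is just one way to keep dmeventd from auto-extending the pool; adjust to your setup.

Create a small thin pool and an over-provisioned thin volume, with monitoring disabled so dmeventd will not auto-extend the pool:
# lvcreate -L 64M -T vg/pool
# lvcreate -V 1G -T vg/pool -n thinlv
# lvchange --monitor n vg/pool

Fill the pool with buffered writes; once it is exhausted the kernel logs "no free space available" and dd ends up blocked in sync_blockdev() on the last close while holding bdev->bd_mutex:
# dd if=/dev/zero of=/dev/vg/thinlv bs=1M

In another shell, any new open of the device now hangs waiting for bd_mutex:
# dmsetup info

Extending the pool lets the queued IO complete and unblocks both tasks:
# lvextend -L +128M vg/pool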
I reproduced this deadlock with multipath when using queue_if_no_path.

The block layer was modified to increase the window for DM devices to hit the flush-on-close code path (allowing me to fail the remaining path(s) in the multipath device before the block layer proceeds to flush outstanding buffered IO on the last close):

diff --git a/fs/block_dev.c b/fs/block_dev.c
index b07f1da..d055ed4 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -26,6 +26,7 @@
 #include <linux/log2.h>
 #include <linux/kmemleak.h>
 #include <asm/uaccess.h>
+#include <linux/delay.h>
 #include "internal.h"
 
 struct bdev_inode {
@@ -1429,6 +1430,10 @@ static int __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
 
 	if (!--bdev->bd_openers) {
 		WARN_ON_ONCE(bdev->bd_holders);
+		if (MAJOR(bdev->bd_dev) == 253) {
+			printk("%s: sleeping for 60 seconds\n", __func__);
+			msleep(60000);
+		}
 		sync_blockdev(bdev);
 		kill_bdev(bdev);
 		/* ->release can cause the old bdi to disappear,

1) stop multipathd
2) fail 3 of the 4 paths:
# dmsetup message mpathb 0 fail_path /dev/sda
# dmsetup message mpathb 0 fail_path /dev/sdb
# dmsetup message mpathb 0 fail_path /dev/sdc
3) start buffered IO:
# dd if=/dev/zero of=/dev/mapper/mpathb bs=1024k count=100
(you'll see: "__blkdev_put: sleeping for 60 seconds" in the kernel log)
4) fail the last path in the mpath device:
# dmsetup message mpathb 0 fail_path /dev/sdd

This results in the following backtrace (once the msleep completes):

dd              D ffff8801174e5750  5104  1555   1272 0x00000000
 ffff880118eb7bf8 0000000000000046 0000000000036afa ffffffffb0fcef3b
 ffff880118eb6010 ffff8801174e53c0 00000000001d25c0 ffff880118eb7fd8
 ffff880118eb7fd8 00000000001d25c0 ffff880118468600 ffff8801174e53c0
Call Trace:
 [<ffffffff810c94d4>] ? __lock_page+0x6d/0x6d
 [<ffffffff8138c9ce>] schedule+0x5a/0x5c
 [<ffffffff8138ca5c>] io_schedule+0x8c/0xcf
 [<ffffffff810c94e2>] sleep_on_page+0xe/0x12
 [<ffffffff8138cee6>] __wait_on_bit+0x48/0x7b
 [<ffffffff810c96d8>] wait_on_page_bit+0x72/0x79
 [<ffffffff8105fb87>] ? autoremove_wake_function+0x3d/0x3d
 [<ffffffff810d2869>] ? pagevec_lookup_tag+0x25/0x2e
 [<ffffffff810c9a26>] filemap_fdatawait_range+0xa4/0x171
 [<ffffffff810c9b17>] filemap_fdatawait+0x24/0x26
 [<ffffffff810ca0e7>] filemap_write_and_wait+0x2f/0x3b
 [<ffffffff81132da2>] __sync_blockdev+0x2c/0x2e
 [<ffffffff81132db7>] sync_blockdev+0x13/0x15
 [<ffffffff81132e65>] __blkdev_put+0xac/0x198
 [<ffffffff8113307d>] blkdev_put+0x12c/0x135
 [<ffffffff811330ad>] blkdev_close+0x27/0x29
 [<ffffffff81107d76>] fput+0x114/0x1e4
 [<ffffffff8110472d>] filp_close+0x75/0x80
 [<ffffffff811047e3>] sys_close+0xab/0xe5
 [<ffffffff81395b02>] system_call_fastpath+0x16/0x1b

Any other attempt to open mpathb (e.g. dmsetup info mpathb) will hang trying to take bdev->bd_mutex. (Reinstating a path allows IO to continue, eliminating the bdev->bd_mutex deadlock.)
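A hedged note on verifying and clearing the multipath variant above. The device and path names follow the example; the sysrq and /proc interfaces used here are standard kernel facilities, though the exact output depends on the kernel config.

Dump blocked (D state) tasks to the kernel log; dd should show the __blkdev_put -> sync_blockdev trace from above:
# echo w > /proc/sysrq-trigger
# dmesg | tail -n 60

Or inspect the stuck dd directly (requires CONFIG_STACKTRACE):
# cat /proc/$(pgrep -x dd)/stack

Reinstate the failed path; the queued IO completes, dd finishes its close, and the blocked opener finally gets bd_mutex:
# dmsetup message mpathb 0 reinstate_path /dev/sdd
# dmsetup info mpathb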
This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle. Changing version to '19'. (As we did not run this process for some time, it could also affect pre-Fedora 19 development cycle bugs. We are very sorry. It will help us with cleanup during the Fedora 19 End Of Life. Thank you.) More information and the reason for this action are here: https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19
Is this still a problem with the 3.9-based F19 kernels?
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 2 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.
This shouldn't have been closed. Is it still a problem? If not, which change fixed it?
*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Because of this, we are doing a mass bug update across all of the Fedora 19 kernel bugs.

Fedora 19 has now been rebased to 3.12.6-200.fc19. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 20 and are still experiencing this issue, please change the version to Fedora 20. If you experience different issues, please open a new bug report for those.
*********** MASS BUG UPDATE **************

This bug has been in a needinfo state for more than 1 month and is being closed with insufficient data due to inactivity. If this is still an issue with Fedora 19, please feel free to reopen the bug and provide the additional information requested.