Bug 769917

Summary: deadlock in block core when thin-pool runs out of space
Product: Fedora
Component: kernel
Version: 19
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: unspecified
Reporter: Mike Snitzer <msnitzer>
Assignee: Mike Snitzer <msnitzer>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: agk, bmarzins, dwysocha, gansalmon, heinzm, itamar, jforbes, jonathan, kernel-maint, madhu.chinakonda, msnitzer, prockai, zkabelac
Keywords: Reopened
Flags: jforbes: needinfo?
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2014-03-10 14:42:46 UTC

Description Mike Snitzer 2011-12-22 18:15:06 UTC
Description of problem:

If a thin-pool device runs out of free space, it may leave threads (e.g. dd) blocked on IO that cannot complete until the thin-pool is given additional space.

Analysis of the stack traces below:
(blkdev_put, dd) versus (blkdev_get, lvresize):

dd's blkdev_put() got bdev->bd_mutex and is stuck in sync_blockdev()
(because the thinp device ran out of space to provision).

Given that dd is blocked while holding bdev->bd_mutex, it is expected that
other threads will get stuck in blkdev_get() trying to take the same
bdev->bd_mutex (already held).

So this definitely impacts what LVM can expect to do against a thinp
device that has exhausted its thin-pool's free space.


Version-Release number of selected component (if applicable):
The basic issue exists in older kernels, but it is a larger concern with thinp-enabled kernels: Linux >= 3.2

How reproducible:
Always (AFAIK).

Steps to Reproduce:
1) create a thin pool
2) create a thin device from the pool
   (with monitoring disabled - so no dmeventd resize happens)
3) run:   dd if=/dev/zero of=/dev/vg/lv
4) the thin pool fills up, so the whole thin device waits for a resize
   (device-mapper: thin: 253:2: no free space available.)
5) now run:   dmsetup info   - it freezes, and only unfreezes when a
   successful resize of the pool device happens
   (a concrete command sketch follows below)
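For example, a command sequence along these lines should reproduce it (a sketch only: the volume group, physical volume, and sizes are placeholders and assume current lvm2 thin-provisioning syntax; they are not taken from the original report):

   # vgcreate vg /dev/sdX
   # lvcreate --type thin-pool -L 64M -n pool vg
   # lvchange --monitor n vg/pool
     (monitoring off, so dmeventd will not auto-extend the pool)
   # lvcreate --thin -V 1G -n lv vg/pool
   # dd if=/dev/zero of=/dev/vg/lv bs=1M
     (fills the 64M pool; dd ends up blocked in the kernel)
   # dmsetup info
     (hangs trying to take bdev->bd_mutex)
   # lvextend -L +128M vg/pool
     (from another shell; once the pool grows, the blocked IO completes)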

  
Actual results:

The dd thread blocks on IO while holding bdev->bd_mutex; it gets stuck in
sync_blockdev() on the last close:
INFO: task dd:6857 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
dd              D 0000000000000000     0  6857   5194 0x00000004
 ffff880128ccbc08 0000000000000046 ffffffff814da724 0000000000000001
 0000000028ccbcc0 ffff88006e208000 ffff88006e208000 ffff880128ccbfd8
 ffff880128ccbfd8 ffff880128ccbfd8 ffffffff81a0d020 ffff88006e208000
Call Trace:
 [<ffffffff814da724>] ? __schedule+0x994/0xb50
 [<ffffffff8110baa0>] ? __lock_page+0x70/0x70
 [<ffffffff814da97f>] schedule+0x3f/0x60
 [<ffffffff814daa2f>] io_schedule+0x8f/0xd0
 [<ffffffff8110baae>] sleep_on_page+0xe/0x20
 [<ffffffff814db4ef>] __wait_on_bit+0x5f/0x90
 [<ffffffff8110cde2>] ? find_get_pages_tag+0x132/0x210
 [<ffffffff8110ccb0>] ? grab_cache_page_nowait+0xa0/0xa0
 [<ffffffff8110bc18>] wait_on_page_bit+0x78/0x80
 [<ffffffff8107c6f0>] ? autoremove_wake_function+0x40/0x40
 [<ffffffff8110bd2c>] filemap_fdatawait_range+0x10c/0x1b0
 [<ffffffff8110bdfb>] filemap_fdatawait+0x2b/0x30
 [<ffffffff8110e434>] filemap_write_and_wait+0x44/0x60
 [<ffffffff8119b661>] __sync_blockdev+0x21/0x40
 [<ffffffff8119b693>] sync_blockdev+0x13/0x20
 [<ffffffff8119b70e>] __blkdev_put+0x6e/0x1f0
 [<ffffffff8119b8ed>] blkdev_put+0x5d/0x180
 [<ffffffff8119ba38>] blkdev_close+0x28/0x30
 [<ffffffff811623fe>] fput+0xfe/0x240
 [<ffffffff8115df69>] filp_close+0x69/0x90
 [<ffffffff8115e220>] sys_close+0xc0/0x1a0
 [<ffffffff814e5d6b>] system_call_fastpath+0x16/0x1b
1 lock held by dd/6857:
 #0:  (&bdev->bd_mutex){+.+.+.}, at: [<ffffffff8119b6df>] __blkdev_put+0x3f/0x1f0

and other threads that try to access the device get blocked behind the original thread that holds bdev->bd_mutex:

lvresize        D 0000000000000000     0  7669   6750 0x00000000
 ffff88001be91ac8 0000000000000046 ffffffff814da724 ffff88001bbfa300
 0000000000000007 ffff88001bbfa300 ffff88001be91aa8 ffff88001be91fd8
 ffff88001be91fd8 ffff88001be91fd8 ffff880136d44600 ffff88001bbfa300
Call Trace:
 [<ffffffff814da724>] ? __schedule+0x994/0xb50
 [<ffffffff814da97f>] schedule+0x3f/0x60
 [<ffffffff814db902>] mutex_lock_nested+0x1b2/0x3f0
 [<ffffffff8119babe>] ? __blkdev_get+0x7e/0x430
 [<ffffffff8119babe>] ? __blkdev_get+0x7e/0x430
 [<ffffffff8119babe>] __blkdev_get+0x7e/0x430
 [<ffffffff8119bec3>] blkdev_get+0x53/0x350
 [<ffffffff814e1d79>] ? sub_preempt_count+0xa9/0xe0
 [<ffffffff814de535>] ? _raw_spin_unlock+0x35/0x60
 [<ffffffff8119c22c>] blkdev_open+0x6c/0x90
 [<ffffffff8115da84>] __dentry_open+0x334/0x4a0
 [<ffffffff814e1d79>] ? sub_preempt_count+0xa9/0xe0
 [<ffffffff8119c1c0>] ? blkdev_get+0x350/0x350
 [<ffffffff8115f1e4>] nameidata_to_filp+0x74/0x80
 [<ffffffff8117080c>] do_last+0x26c/0x900
 [<ffffffff8117119b>] path_openat+0x2bb/0x3e0
 [<ffffffff8112e343>] ? might_fault+0x53/0xb0
 [<ffffffff811713e2>] do_filp_open+0x42/0xa0
 [<ffffffff814de535>] ? _raw_spin_unlock+0x35/0x60
 [<ffffffff8117f25d>] ? alloc_fd+0x18d/0x1f0
 [<ffffffff8115f2e7>] do_sys_open+0xf7/0x1d0
 [<ffffffff8115f3e0>] sys_open+0x20/0x30
 [<ffffffff814e5d6b>] system_call_fastpath+0x16/0x1b

Expected results:
An exhausted thin-pool should not result in tasks blocking on IO while holding bdev->bd_mutex.

Comment 1 Mike Snitzer 2012-01-03 15:31:50 UTC
I reproduced this deadlock with multipath when using queue_if_no_path.

The block layer was modified to widen the window for DM devices to hit the flush-on-close code path (allowing me to fail the remaining path(s) in the multipath device before the block layer proceeds to flush outstanding buffered IO on the last close):

diff --git a/fs/block_dev.c b/fs/block_dev.c
index b07f1da..d055ed4 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -26,6 +26,7 @@
 #include <linux/log2.h>
 #include <linux/kmemleak.h>
 #include <asm/uaccess.h>
+#include <linux/delay.h>
 #include "internal.h"
 
 struct bdev_inode {
@@ -1429,6 +1430,10 @@ static int __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
 
        if (!--bdev->bd_openers) {
                WARN_ON_ONCE(bdev->bd_holders);
+               if (MAJOR(bdev->bd_dev) == 253) {
+                       printk("%s: sleeping for 60 seconds\n", __func__);
+                       msleep(60000);
+               }
                sync_blockdev(bdev);
                kill_bdev(bdev);
                /* ->release can cause the old bdi to disappear,

1) stop multipathd
2) fail 3 of the 4 paths:
   # dmsetup message mpathb 0 fail_path /dev/sda
   # dmsetup message mpathb 0 fail_path /dev/sdb
   # dmsetup message mpathb 0 fail_path /dev/sdc
3) start buffered IO:
   # dd if=/dev/zero of=/dev/mapper/mpathb bs=1024k count=100
   (you'll see: "__blkdev_put: sleeping for 60 seconds" in the kernel log)
4) fail the last path in the mpath device:
   # dmsetup message mpathb 0 fail_path /dev/sdd

Results in the following backtrace (once the msleep completes):
dd              D ffff8801174e5750  5104  1555   1272 0x00000000
 ffff880118eb7bf8 0000000000000046 0000000000036afa ffffffffb0fcef3b
 ffff880118eb6010 ffff8801174e53c0 00000000001d25c0 ffff880118eb7fd8
 ffff880118eb7fd8 00000000001d25c0 ffff880118468600 ffff8801174e53c0
Call Trace:
 [<ffffffff810c94d4>] ? __lock_page+0x6d/0x6d
 [<ffffffff8138c9ce>] schedule+0x5a/0x5c
 [<ffffffff8138ca5c>] io_schedule+0x8c/0xcf
 [<ffffffff810c94e2>] sleep_on_page+0xe/0x12
 [<ffffffff8138cee6>] __wait_on_bit+0x48/0x7b
 [<ffffffff810c96d8>] wait_on_page_bit+0x72/0x79
 [<ffffffff8105fb87>] ? autoremove_wake_function+0x3d/0x3d
 [<ffffffff810d2869>] ? pagevec_lookup_tag+0x25/0x2e
 [<ffffffff810c9a26>] filemap_fdatawait_range+0xa4/0x171
 [<ffffffff810c9b17>] filemap_fdatawait+0x24/0x26
 [<ffffffff810ca0e7>] filemap_write_and_wait+0x2f/0x3b
 [<ffffffff81132da2>] __sync_blockdev+0x2c/0x2e
 [<ffffffff81132db7>] sync_blockdev+0x13/0x15
 [<ffffffff81132e65>] __blkdev_put+0xac/0x198
 [<ffffffff8113307d>] blkdev_put+0x12c/0x135
 [<ffffffff811330ad>] blkdev_close+0x27/0x29
 [<ffffffff81107d76>] fput+0x114/0x1e4
 [<ffffffff8110472d>] filp_close+0x75/0x80
 [<ffffffff811047e3>] sys_close+0xab/0xe5
 [<ffffffff81395b02>] system_call_fastpath+0x16/0x1b

Any other attempt to open mpathb (e.g. dmsetup info mpathb) will hang trying to take bdev->bd_mutex.

(reinstating a path allows IO to continue, eliminating the bdev->bd_mutex deadlock)
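For reference, the path can be brought back with the same dmsetup message interface used above for fail_path, for example (reusing the path failed in step 4):

   # dmsetup message mpathb 0 reinstate_path /dev/sdd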

Comment 2 Fedora End Of Life 2013-04-03 19:34:39 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle.
Changing version to '19'.

(As we did not run this process for some time, it could also affect pre-Fedora 19 development
cycle bugs. We are very sorry. It will help us with cleanup during the Fedora 19 End Of Life. Thank you.)

More information and the reason for this action are here:
https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19

Comment 3 Justin M. Forbes 2013-04-05 15:55:59 UTC
Is this still a problem with 3.9 based F19 kernels?

Comment 4 Justin M. Forbes 2013-04-23 17:28:20 UTC
This bug is being closed with INSUFFICIENT_DATA as there has not been a
response in 2 weeks.  If you are still experiencing this issue,
please reopen and attach the relevant data from the latest kernel you are
running and any data that might have been requested previously.

Comment 5 Alasdair Kergon 2013-10-04 11:44:36 UTC
This shouldn't have been closed.  Is it still a problem or if not, which change fixed it?

Comment 6 Justin M. Forbes 2014-01-03 22:09:58 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 19 kernel bugs.

Fedora 19 has now been rebased to 3.12.6-200.fc19.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 20, and are still experiencing this issue, please change the version to Fedora 20.

If you experience different issues, please open a new bug report for those.

Comment 7 Justin M. Forbes 2014-03-10 14:42:46 UTC
*********** MASS BUG UPDATE **************

This bug has been in a needinfo state for more than 1 month and is being closed with insufficient data due to inactivity. If this is still an issue with Fedora 19, please feel free to reopen the bug and provide the additional information requested.