Description of problem:

When a thin-pool is exhausted it may block waiting on IO (due to no space) while holding a very coarse-grained lock (pmd->root_lock) that other threads will attempt to take, creating a cascade of hung tasks.

Version-Release number of selected component (if applicable):

Linux 3.2 with DM thinp support

Steps to Reproduce:

This happened during creation of a thin provisioned device - mkfs.ext4 was running and hit out-of-space, dmeventd then ran lvresize, and this appeared:

INFO: task kworker/u:1:34 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u:1     D 0000000000000001     0    34      2 0x00000000
 ffff88001d49f700 0000000000000046 ffffffff814da9a4 0000000000000001
 000000001d49f7b8 ffff88001d48a300 ffff88001d48a300 ffff88001d49ffd8
 ffff88001d49ffd8 ffff88001d49ffd8 ffff88001184c600 ffff88001d48a300
Call Trace:
 [<ffffffff814da9a4>] ? __schedule+0x994/0xb50
 [<ffffffffa02ce420>] ? submit_io+0x1c0/0x1c0 [dm_bufio]
 [<ffffffff814dabff>] schedule+0x3f/0x60
 [<ffffffff814dacaf>] io_schedule+0x8f/0xd0
 [<ffffffffa02ce42e>] do_io_schedule+0xe/0x20 [dm_bufio]
 [<ffffffff814db76f>] __wait_on_bit+0x5f/0x90
 [<ffffffffa02ce420>] ? submit_io+0x1c0/0x1c0 [dm_bufio]
 [<ffffffff814db81c>] out_of_line_wait_on_bit+0x7c/0x90
 [<ffffffff8107c7b0>] ? autoremove_wake_function+0x40/0x40
 [<ffffffffa02cf315>] new_read+0xc5/0x380 [dm_bufio]
 [<ffffffffa02cf629>] dm_bufio_read+0x29/0x30 [dm_bufio]
 [<ffffffffa02d8605>] dm_bm_read_lock+0x35/0x1c0 [dm_persistent_data]
 [<ffffffff810967cd>] ? trace_hardirqs_on_caller+0x11d/0x1b0
 [<ffffffff8109686d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffffa02db858>] dm_tm_read_lock+0x18/0x30 [dm_persistent_data]
 [<ffffffffa02d9768>] sm_ll_find_free_block+0x108/0x1b0 [dm_persistent_data]
 [<ffffffffa02dab3c>] sm_metadata_new_block+0x3c/0xe0 [dm_persistent_data]
 [<ffffffffa02db76d>] dm_tm_shadow_block+0xbd/0x190 [dm_persistent_data]
 [<ffffffffa02ddd89>] shadow_step+0x59/0x110 [dm_persistent_data]
 [<ffffffffa02dc14b>] btree_insert_raw+0x3b/0x620 [dm_persistent_data]
 [<ffffffffa02dc90c>] insert+0x1dc/0x340 [dm_persistent_data]
 [<ffffffffa02e9800>] ? process_shared_bio+0x3d0/0x3d0 [dm_thin_pool]
 [<ffffffffa02dca7e>] dm_btree_insert_notify+0xe/0x10 [dm_persistent_data]
 [<ffffffffa02ec160>] dm_thin_insert_block+0x80/0xc0 [dm_thin_pool]
 [<ffffffff814de845>] ? _raw_spin_unlock_irqrestore+0x65/0x80
 [<ffffffffa02e8a2f>] process_prepared_mapping+0x4f/0x1e0 [dm_thin_pool]
 [<ffffffff814e1ff9>] ? sub_preempt_count+0xa9/0xe0
 [<ffffffffa02e9800>] ? process_shared_bio+0x3d0/0x3d0 [dm_thin_pool]
 [<ffffffffa02e98db>] do_worker+0xdb/0x620 [dm_thin_pool]
 [<ffffffff810739fa>] ? process_one_work+0x12a/0x6e0
 [<ffffffffa02e9800>] ? process_shared_bio+0x3d0/0x3d0 [dm_thin_pool]
 [<ffffffffa02e9800>] ? process_shared_bio+0x3d0/0x3d0 [dm_thin_pool]
 [<ffffffff81073a5b>] process_one_work+0x18b/0x6e0
 [<ffffffff810739fa>] ? process_one_work+0x12a/0x6e0
 [<ffffffff8107438f>] worker_thread+0x15f/0x350
 [<ffffffff81074230>] ? rescuer_thread+0x240/0x240
 [<ffffffff8107bd2b>] kthread+0x9b/0xa0
 [<ffffffff814e7274>] kernel_thread_helper+0x4/0x10
 [<ffffffff81037b3c>] ? finish_task_switch+0x7c/0x120
 [<ffffffff814de89b>] ? _raw_spin_unlock_irq+0x3b/0x60
 [<ffffffff814dede1>] ? retint_restore_args+0xe/0xe
 [<ffffffff8107bc90>] ? __init_kthread_worker+0x70/0x70
 [<ffffffff814e7270>] ? gs_change+0xb/0xb
3 locks held by kworker/u:1/34:
 #0:  (name){.+.+..}, at: [<ffffffff810739fa>] process_one_work+0x12a/0x6e0
 #1:  ((&pool->worker)){+.+...}, at: [<ffffffff810739fa>] process_one_work+0x12a/0x6e0
 #2:  (&pmd->root_lock){++++..}, at: [<ffffffffa02ec116>] dm_thin_insert_block+0x36/0xc0 [dm_thin_pool]

versus:

INFO: task dmeventd:4946 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
dmeventd        D ffff88001d54f960     0  4946      1 0x00020000
 ffff88001d7f5b20 0000000000000046 ffff88001d7f5ae0 ffffffff810965e6
 000000939c71a7ec ffff88001c114600 0000000000000046 ffff88001d7f5fd8
 ffff88001d7f5fd8 ffff88001d7f5fd8 ffff88001184c600 ffff88001c114600
Call Trace:
 [<ffffffff810965e6>] ? mark_held_locks+0x86/0x150
 [<ffffffff814dabff>] schedule+0x3f/0x60
 [<ffffffff814ddc25>] rwsem_down_failed_common+0xc5/0x160
 [<ffffffff814ddcf5>] rwsem_down_read_failed+0x15/0x17
 [<ffffffff812d2f84>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffffa02ebe7e>] ? dm_pool_get_metadata_transaction_id+0x2e/0x60 [dm_thin_pool]
 [<ffffffff814dd08e>] ? down_read+0x7e/0x93
 [<ffffffffa02ebe7e>] ? dm_pool_get_metadata_transaction_id+0x2e/0x60 [dm_thin_pool]
 [<ffffffffa02ebe7e>] dm_pool_get_metadata_transaction_id+0x2e/0x60 [dm_thin_pool]
 [<ffffffffa02e81b4>] pool_status+0x44/0x290 [dm_thin_pool]
 [<ffffffffa00cd9bb>] ? dm_get_live_table+0x2b/0x60 [dm_mod]
 [<ffffffff814de765>] ? _raw_read_unlock_irqrestore+0x65/0x80
 [<ffffffffa00d427b>] retrieve_status+0x8b/0x1a0 [dm_mod]
 [<ffffffffa00d43ec>] table_status+0x5c/0x90 [dm_mod]
 [<ffffffffa00d5c3a>] ctl_ioctl+0x15a/0x2c0 [dm_mod]
 [<ffffffffa00d4390>] ? retrieve_status+0x1a0/0x1a0 [dm_mod]
 [<ffffffff81161c48>] ? fget_light+0x48/0x3b0
 [<ffffffffa00d5dd2>] dm_compat_ctl_ioctl+0x12/0x20 [dm_mod]
 [<ffffffff811b4711>] compat_sys_ioctl+0xc1/0x1540
 [<ffffffff81181d2f>] ? mntput+0x1f/0x30
 [<ffffffff8116beb2>] ? path_put+0x22/0x30
 [<ffffffff81165a4a>] ? vfs_fstatat+0x5a/0x70
 [<ffffffff814e7424>] ? sysenter_dispatch+0x22/0x2c
 [<ffffffff812d308d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
 [<ffffffff814e7409>] sysenter_dispatch+0x7/0x2c
 [<ffffffff812d304e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
1 lock held by dmeventd/4946:
 #0:  (&pmd->root_lock){++++..}, at: [<ffffffffa02ebe7e>] dm_pool_get_metadata_transaction_id+0x2e/0x60 [dm_thin_pool]

and:

INFO: task dmsetup:5260 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
dmsetup         D 0000000000000000     0  5260   5209 0x00020000
 ffff880011ff7b80 0000000000000046 ffffffff814da9a4 ffffffff810965e6
 ffff88001fdd3040 ffff8800152e2300 0000000000000046 ffff880011ff7fd8
 ffff880011ff7fd8 ffff880011ff7fd8 ffff88001e904600 ffff8800152e2300
Call Trace:
 [<ffffffff814da9a4>] ? __schedule+0x994/0xb50
 [<ffffffff810965e6>] ? mark_held_locks+0x86/0x150
 [<ffffffff814dabff>] schedule+0x3f/0x60
 [<ffffffff814ddc25>] rwsem_down_failed_common+0xc5/0x160
 [<ffffffff814ddcf5>] rwsem_down_read_failed+0x15/0x17
 [<ffffffff812d2f84>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffffa02ecba2>] ? dm_thin_get_mapped_count+0x32/0x60 [dm_thin_pool]
 [<ffffffff814dd08e>] ? down_read+0x7e/0x93
 [<ffffffffa02ecba2>] ? dm_thin_get_mapped_count+0x32/0x60 [dm_thin_pool]
 [<ffffffffa02ecba2>] dm_thin_get_mapped_count+0x32/0x60 [dm_thin_pool]
 [<ffffffffa02e9e89>] thin_status+0x69/0x1b0 [dm_thin_pool]
 [<ffffffff814de765>] ? _raw_read_unlock_irqrestore+0x65/0x80
 [<ffffffffa00d427b>] retrieve_status+0x8b/0x1a0 [dm_mod]
 [<ffffffffa00d43ec>] table_status+0x5c/0x90 [dm_mod]
 [<ffffffffa00d5c3a>] ctl_ioctl+0x15a/0x2c0 [dm_mod]
 [<ffffffffa00d4390>] ? retrieve_status+0x1a0/0x1a0 [dm_mod]
 [<ffffffffa00d5dd2>] dm_compat_ctl_ioctl+0x12/0x20 [dm_mod]
 [<ffffffff811b4711>] compat_sys_ioctl+0xc1/0x1540
 [<ffffffff8109686d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff814df1e3>] ? error_sti+0x5/0x6
 [<ffffffff814dedce>] ? retint_swapgs+0xe/0x13
 [<ffffffff810967cd>] ? trace_hardirqs_on_caller+0x11d/0x1b0
 [<ffffffff812d304e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff814e7409>] sysenter_dispatch+0x7/0x2c
1 lock held by dmsetup/5260:
 #0:  (&pmd->root_lock){++++..}, at: [<ffffffffa02ecba2>] dm_thin_get_mapped_count+0x32/0x60 [dm_thin_pool]

Actual results:

kworker/u:1:34 is blocked deep in the dm_thin_insert_block() path while holding pmd->root_lock for write (dmsetup and dmeventd are waiting for a read lock on pmd->root_lock).

Expected results:

It seems we need to requeue the work item when there is no free space but the space map needs a new block, rather than waiting on IO while still holding pmd->root_lock.
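To make the lock interaction above easier to follow, here is a minimal user-space analogue. This is only an illustrative sketch, not kernel code: pmd->root_lock is really a kernel rw_semaphore, modelled here with a pthread rwlock, and sleep() stands in for the dm_bufio IO that never completes because the pool is out of space. The worker thread takes the lock for write and then stalls; the "dmeventd"/"dmsetup" status readers simply queue up behind it.

/* Build with: cc -pthread cascade.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-in for pmd->root_lock (a kernel rw_semaphore in dm-thin). */
static pthread_rwlock_t root_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Stand-in for the pool worker: takes the lock for write (as the
 * dm_thin_insert_block() path does) and then blocks, mimicking dm_bufio
 * waiting for IO that cannot make progress. */
static void *worker(void *arg)
{
    pthread_rwlock_wrlock(&root_lock);
    printf("worker: holding root_lock for write, now 'waiting on IO'\n");
    sleep(5);                       /* stands in for io_schedule() */
    pthread_rwlock_unlock(&root_lock);
    return NULL;
}

/* Stand-in for pool_status()/thin_status(): only needs the lock for read,
 * but hangs for as long as the writer is stuck. */
static void *status(void *arg)
{
    const char *who = arg;

    printf("%s: waiting for read lock\n", who);
    pthread_rwlock_rdlock(&root_lock);
    printf("%s: finally got read lock\n", who);
    pthread_rwlock_unlock(&root_lock);
    return NULL;
}

int main(void)
{
    pthread_t w, r1, r2;

    pthread_create(&w, NULL, worker, NULL);
    sleep(1);                       /* let the worker win the lock first */
    pthread_create(&r1, NULL, status, "dmeventd");
    pthread_create(&r2, NULL, status, "dmsetup");

    pthread_join(w, NULL);
    pthread_join(r1, NULL);
    pthread_join(r2, NULL);
    return 0;
}

The point of the sketch is simply that nothing which can block on IO (or wait for free space) should run inside the write-locked region; the requeue suggested under "Expected results" is one way to keep readers like pool_status()/thin_status() responsive.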
Can't reproduce, though I do believe there is a real issue here. We're not sure what the development version of dmeventd did to cause the issue, though we do know it was trying to open the out-of-space thin dev - which it just shouldn't do. The test setup Kabi used is also quite esoteric: the PV was a loop dev of a file on /dev/shm. The above traces show the space map was in the process of submitting IO via dm_bufio, so this is not a simple deadlock on a lock.

Here's the test script I tried, which failed to produce any issues; it's been added to the thinp-test-suite:

  def test_bug_769921
    # we create a pool with a really tiny data volume that won't be
    # able to complete a mkfs.
    with_standard_pool(8) do |pool|
      with_new_thin(pool, @volume_size, 0) do |thin|
        event_tracker = pool.event_tracker
        thin_fs = FS::file_system(:ext4, thin)
        fork {thin_fs.format}

        # FIXME: this is so common it should go into a utility lib
        event_tracker.wait do
          status = PoolStatus.new(pool)
          status.used_data_blocks >= status.total_data_blocks - @low_water_mark
        end

        # we're not sure what the development version of dmeventd was
        # doing to create the issue; some experiments to find out.
        pool.info

        # Resize the pool so the format can complete
        table = Table.new(ThinPool.new(@size, @metadata_dev, @data_dev,
                                       @data_block_size, @low_water_mark))
        pool.load(table)
        pool.resume
      end
    end

    Process.wait
    if $?.exitstatus > 0
      raise RuntimeError, "wipe sub process failed"
    end
  end
Will reopen if we hit this again.