Bug 769921 - deadlock in thin-pool when free space is exhausted
Summary: deadlock in thin-pool when free space is exhausted
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: LVM and device-mapper development team
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-12-22 18:31 UTC by Mike Snitzer
Modified: 2012-01-05 15:26 UTC
CC List: 13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-01-05 15:26:35 UTC
Type: ---



Description Mike Snitzer 2011-12-22 18:31:21 UTC
Description of problem:
When the thin-pool is exhausted it may block waiting on IO (due to no space) while holding a very coarse-grained lock (pmd->root_lock) that other threads will attempt to take, creating a cascade of hung tasks.

Version-Release number of selected component (if applicable):
Linux 3.2 with DM thinp support

Steps to Reproduce:
This happened during creation of a thin-provisioned device: mkfs.ext4 was running
against it and ran the pool out of space, dmeventd then ran lvresize, and the
following hung tasks were reported:

INFO: task kworker/u:1:34 blocked for more than 120 seconds.
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 kworker/u:1     D 0000000000000001     0    34      2 0x00000000
  ffff88001d49f700 0000000000000046 ffffffff814da9a4 0000000000000001
  000000001d49f7b8 ffff88001d48a300 ffff88001d48a300 ffff88001d49ffd8
  ffff88001d49ffd8 ffff88001d49ffd8 ffff88001184c600 ffff88001d48a300
 Call Trace:
  [<ffffffff814da9a4>] ? __schedule+0x994/0xb50
  [<ffffffffa02ce420>] ? submit_io+0x1c0/0x1c0 [dm_bufio]
  [<ffffffff814dabff>] schedule+0x3f/0x60
  [<ffffffff814dacaf>] io_schedule+0x8f/0xd0
  [<ffffffffa02ce42e>] do_io_schedule+0xe/0x20 [dm_bufio]
  [<ffffffff814db76f>] __wait_on_bit+0x5f/0x90
  [<ffffffffa02ce420>] ? submit_io+0x1c0/0x1c0 [dm_bufio]
  [<ffffffff814db81c>] out_of_line_wait_on_bit+0x7c/0x90
  [<ffffffff8107c7b0>] ? autoremove_wake_function+0x40/0x40
  [<ffffffffa02cf315>] new_read+0xc5/0x380 [dm_bufio]
  [<ffffffffa02cf629>] dm_bufio_read+0x29/0x30 [dm_bufio]
  [<ffffffffa02d8605>] dm_bm_read_lock+0x35/0x1c0 [dm_persistent_data]
  [<ffffffff810967cd>] ? trace_hardirqs_on_caller+0x11d/0x1b0
  [<ffffffff8109686d>] ? trace_hardirqs_on+0xd/0x10
  [<ffffffffa02db858>] dm_tm_read_lock+0x18/0x30 [dm_persistent_data]
  [<ffffffffa02d9768>] sm_ll_find_free_block+0x108/0x1b0 [dm_persistent_data]
  [<ffffffffa02dab3c>] sm_metadata_new_block+0x3c/0xe0 [dm_persistent_data]
  [<ffffffffa02db76d>] dm_tm_shadow_block+0xbd/0x190 [dm_persistent_data]
  [<ffffffffa02ddd89>] shadow_step+0x59/0x110 [dm_persistent_data]
  [<ffffffffa02dc14b>] btree_insert_raw+0x3b/0x620 [dm_persistent_data]
  [<ffffffffa02dc90c>] insert+0x1dc/0x340 [dm_persistent_data]
  [<ffffffffa02e9800>] ? process_shared_bio+0x3d0/0x3d0 [dm_thin_pool]
  [<ffffffffa02dca7e>] dm_btree_insert_notify+0xe/0x10 [dm_persistent_data]
  [<ffffffffa02ec160>] dm_thin_insert_block+0x80/0xc0 [dm_thin_pool]
  [<ffffffff814de845>] ? _raw_spin_unlock_irqrestore+0x65/0x80
  [<ffffffffa02e8a2f>] process_prepared_mapping+0x4f/0x1e0 [dm_thin_pool]
  [<ffffffff814e1ff9>] ? sub_preempt_count+0xa9/0xe0
  [<ffffffffa02e9800>] ? process_shared_bio+0x3d0/0x3d0 [dm_thin_pool]
  [<ffffffffa02e98db>] do_worker+0xdb/0x620 [dm_thin_pool]
  [<ffffffff810739fa>] ? process_one_work+0x12a/0x6e0
  [<ffffffffa02e9800>] ? process_shared_bio+0x3d0/0x3d0 [dm_thin_pool]
  [<ffffffffa02e9800>] ? process_shared_bio+0x3d0/0x3d0 [dm_thin_pool]
  [<ffffffff81073a5b>] process_one_work+0x18b/0x6e0
  [<ffffffff810739fa>] ? process_one_work+0x12a/0x6e0
  [<ffffffff8107438f>] worker_thread+0x15f/0x350
  [<ffffffff81074230>] ? rescuer_thread+0x240/0x240
  [<ffffffff8107bd2b>] kthread+0x9b/0xa0
  [<ffffffff814e7274>] kernel_thread_helper+0x4/0x10
  [<ffffffff81037b3c>] ? finish_task_switch+0x7c/0x120
  [<ffffffff814de89b>] ? _raw_spin_unlock_irq+0x3b/0x60
  [<ffffffff814dede1>] ? retint_restore_args+0xe/0xe
  [<ffffffff8107bc90>] ? __init_kthread_worker+0x70/0x70
  [<ffffffff814e7270>] ? gs_change+0xb/0xb
 3 locks held by kworker/u:1/34:
  #0:  (name){.+.+..}, at: [<ffffffff810739fa>] process_one_work+0x12a/0x6e0
  #1:  ((&pool->worker)){+.+...}, at: [<ffffffff810739fa>] process_one_work+0x12a/0x6e0
  #2:  (&pmd->root_lock){++++..}, at: [<ffffffffa02ec116>] dm_thin_insert_block+0x36/0xc0 [dm_thin_pool]

versus:

 INFO: task dmeventd:4946 blocked for more than 120 seconds.
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 dmeventd        D ffff88001d54f960     0  4946      1 0x00020000
  ffff88001d7f5b20 0000000000000046 ffff88001d7f5ae0 ffffffff810965e6
  000000939c71a7ec ffff88001c114600 0000000000000046 ffff88001d7f5fd8
  ffff88001d7f5fd8 ffff88001d7f5fd8 ffff88001184c600 ffff88001c114600
 Call Trace:
  [<ffffffff810965e6>] ? mark_held_locks+0x86/0x150
  [<ffffffff814dabff>] schedule+0x3f/0x60
  [<ffffffff814ddc25>] rwsem_down_failed_common+0xc5/0x160
  [<ffffffff814ddcf5>] rwsem_down_read_failed+0x15/0x17  
  [<ffffffff812d2f84>] call_rwsem_down_read_failed+0x14/0x30
  [<ffffffffa02ebe7e>] ? dm_pool_get_metadata_transaction_id+0x2e/0x60 [dm_thin_pool]
  [<ffffffff814dd08e>] ? down_read+0x7e/0x93
  [<ffffffffa02ebe7e>] ? dm_pool_get_metadata_transaction_id+0x2e/0x60 [dm_thin_pool]
  [<ffffffffa02ebe7e>] dm_pool_get_metadata_transaction_id+0x2e/0x60 [dm_thin_pool]
  [<ffffffffa02e81b4>] pool_status+0x44/0x290 [dm_thin_pool]
  [<ffffffffa00cd9bb>] ? dm_get_live_table+0x2b/0x60 [dm_mod]
  [<ffffffff814de765>] ? _raw_read_unlock_irqrestore+0x65/0x80
  [<ffffffffa00d427b>] retrieve_status+0x8b/0x1a0 [dm_mod]   
  [<ffffffffa00d43ec>] table_status+0x5c/0x90 [dm_mod]
  [<ffffffffa00d5c3a>] ctl_ioctl+0x15a/0x2c0 [dm_mod]
  [<ffffffffa00d4390>] ? retrieve_status+0x1a0/0x1a0 [dm_mod]
  [<ffffffff81161c48>] ? fget_light+0x48/0x3b0
  [<ffffffffa00d5dd2>] dm_compat_ctl_ioctl+0x12/0x20 [dm_mod]
  [<ffffffff811b4711>] compat_sys_ioctl+0xc1/0x1540
  [<ffffffff81181d2f>] ? mntput+0x1f/0x30
  [<ffffffff8116beb2>] ? path_put+0x22/0x30
  [<ffffffff81165a4a>] ? vfs_fstatat+0x5a/0x70
  [<ffffffff814e7424>] ? sysenter_dispatch+0x22/0x2c
  [<ffffffff812d308d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
  [<ffffffff814e7409>] sysenter_dispatch+0x7/0x2c
  [<ffffffff812d304e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 1 lock held by dmeventd/4946:
  #0:  (&pmd->root_lock){++++..}, at: [<ffffffffa02ebe7e>] dm_pool_get_metadata_transaction_id+0x2e/0x60 [dm_thin_pool]

and:

 INFO: task dmsetup:5260 blocked for more than 120 seconds.
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 dmsetup         D 0000000000000000     0  5260   5209 0x00020000
  ffff880011ff7b80 0000000000000046 ffffffff814da9a4 ffffffff810965e6
  ffff88001fdd3040 ffff8800152e2300 0000000000000046 ffff880011ff7fd8
  ffff880011ff7fd8 ffff880011ff7fd8 ffff88001e904600 ffff8800152e2300
 Call Trace:
  [<ffffffff814da9a4>] ? __schedule+0x994/0xb50
  [<ffffffff810965e6>] ? mark_held_locks+0x86/0x150
  [<ffffffff814dabff>] schedule+0x3f/0x60
  [<ffffffff814ddc25>] rwsem_down_failed_common+0xc5/0x160
  [<ffffffff814ddcf5>] rwsem_down_read_failed+0x15/0x17  
  [<ffffffff812d2f84>] call_rwsem_down_read_failed+0x14/0x30
  [<ffffffffa02ecba2>] ? dm_thin_get_mapped_count+0x32/0x60 [dm_thin_pool]
  [<ffffffff814dd08e>] ? down_read+0x7e/0x93
  [<ffffffffa02ecba2>] ? dm_thin_get_mapped_count+0x32/0x60 [dm_thin_pool]
  [<ffffffffa02ecba2>] dm_thin_get_mapped_count+0x32/0x60 [dm_thin_pool]
  [<ffffffffa02e9e89>] thin_status+0x69/0x1b0 [dm_thin_pool]
  [<ffffffff814de765>] ? _raw_read_unlock_irqrestore+0x65/0x80
  [<ffffffffa00d427b>] retrieve_status+0x8b/0x1a0 [dm_mod]   
  [<ffffffffa00d43ec>] table_status+0x5c/0x90 [dm_mod]
  [<ffffffffa00d5c3a>] ctl_ioctl+0x15a/0x2c0 [dm_mod]
  [<ffffffffa00d4390>] ? retrieve_status+0x1a0/0x1a0 [dm_mod]
  [<ffffffffa00d5dd2>] dm_compat_ctl_ioctl+0x12/0x20 [dm_mod]
  [<ffffffff811b4711>] compat_sys_ioctl+0xc1/0x1540
  [<ffffffff8109686d>] ? trace_hardirqs_on+0xd/0x10
  [<ffffffff814df1e3>] ? error_sti+0x5/0x6 
  [<ffffffff814dedce>] ? retint_swapgs+0xe/0x13
 [<ffffffff810967cd>] ? trace_hardirqs_on_caller+0x11d/0x1b0
  [<ffffffff812d304e>] ? trace_hardirqs_on_thunk+0x3a/0x3f 
  [<ffffffff814e7409>] sysenter_dispatch+0x7/0x2c
 1 lock held by dmsetup/5260:
  #0:  (&pmd->root_lock){++++..}, at: [<ffffffffa02ecba2>] dm_thin_get_mapped_count+0x32/0x60 [dm_thin_pool]

  
Actual results:
kworker/u:1:34 is blocked deep in the dm_thin_insert_block() path while
holding the write lock on pmd->root_lock (dmsetup and dmeventd are waiting
for a read lock on pmd->root_lock).
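
For illustration, here is a minimal userspace analogue of the contention (a
pthreads rwlock standing in for the kernel rw_semaphore; none of this is the
actual dm-thin code, the names just mirror the symbols in the traces above):

  /*
   * One writer takes the pool-wide lock and then blocks indefinitely on
   * "I/O" (the pool has no free space), so every status query that only
   * needs a read lock piles up behind it.
   */
  #include <pthread.h>
  #include <stdio.h>
  #include <unistd.h>

  static pthread_rwlock_t root_lock = PTHREAD_RWLOCK_INITIALIZER;

  /* Stand-in for the btree insert waiting on metadata I/O that can
   * never complete because the pool is out of space. */
  static void *worker_insert_block(void *arg)
  {
      pthread_rwlock_wrlock(&root_lock);   /* like dm_thin_insert_block() */
      printf("worker: write lock held, waiting for I/O...\n");
      pause();                             /* simulated stuck I/O, never returns */
      pthread_rwlock_unlock(&root_lock);
      return NULL;
  }

  /* Stand-in for pool_status()/thin_status() called from the dm ioctl path. */
  static void *status_query(void *arg)
  {
      printf("%s: waiting for read lock\n", (const char *)arg);
      pthread_rwlock_rdlock(&root_lock);   /* blocks forever behind the writer */
      pthread_rwlock_unlock(&root_lock);
      return NULL;
  }

  int main(void)
  {
      pthread_t worker, dmeventd, dmsetup;

      pthread_create(&worker, NULL, worker_insert_block, NULL);
      sleep(1);                            /* let the writer win the lock */
      pthread_create(&dmeventd, NULL, status_query, "dmeventd");
      pthread_create(&dmsetup, NULL, status_query, "dmsetup");

      pthread_join(worker, NULL);          /* never returns: every task is stuck */
      return 0;
  }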

Expected results:
It seems we need to requeue the work item when there is no free space but the
space map would need a new block.
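
The sketch below is a standalone userspace simulation of that idea
(process_prepared_mapping and the deferred list are names borrowed from the
traces above; everything else is made up for illustration and is not the
dm-thin implementation): notice the out-of-space condition, put the mapping
back on the deferred list instead of blocking on I/O under root_lock, and let
the re-queued work go through once the pool has been grown.

  #include <stdbool.h>
  #include <stdio.h>

  struct mapping { int virt_block; };

  enum { MAX_DEFERRED = 8 };
  static struct mapping deferred[MAX_DEFERRED];
  static int n_deferred;

  static int free_metadata_blocks = 2;     /* pretend the pool is almost full */

  static void defer_mapping(struct mapping m)
  {
      deferred[n_deferred++] = m;          /* small enough to never overflow here */
  }

  /* Returns false when the mapping had to be re-queued for lack of space. */
  static bool process_prepared_mapping(struct mapping m)
  {
      if (free_metadata_blocks == 0) {
          defer_mapping(m);                /* re-queue instead of blocking */
          return false;
      }
      free_metadata_blocks--;              /* "allocate" the new block */
      printf("inserted mapping for virtual block %d\n", m.virt_block);
      return true;
  }

  static void do_worker(void)
  {
      struct mapping todo[MAX_DEFERRED];
      int n_todo = n_deferred;

      /* Take a private copy so re-queued mappings land on a fresh list. */
      for (int i = 0; i < n_todo; i++)
          todo[i] = deferred[i];
      n_deferred = 0;

      for (int i = 0; i < n_todo; i++) {
          if (!process_prepared_mapping(todo[i])) {
              printf("out of space: block %d re-queued, waiting for pool resize\n",
                     todo[i].virt_block);
              for (int j = i + 1; j < n_todo; j++)
                  defer_mapping(todo[j]);  /* keep the rest queued as well */
              return;
          }
      }
  }

  int main(void)
  {
      for (int i = 0; i < 4; i++)
          defer_mapping((struct mapping){ .virt_block = i });

      do_worker();                         /* inserts blocks 0 and 1, re-queues 2 and 3 */

      free_metadata_blocks += 4;           /* simulate dmeventd growing the pool */
      do_worker();                         /* now 2 and 3 go through */
      return 0;
  }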

Comment 1 Joe Thornber 2012-01-05 15:07:20 UTC
Can't reproduce, though I do believe there is a real issue here.

We're not sure what the development version of dmeventd did to cause the issue, though we do know it was trying to open the out-of-space thin device, which it just shouldn't do.

The test setup Kabi used is also quite esoteric: the PV was a loop device over a file on /dev/shm.  The above traces show the space map was in the process of submitting I/O via dm_bufio, so this is not a simple deadlock on a lock.

Here's the test script I tried, which failed to produce any issues; it's been added to the thinp-test-suite.

  def test_bug_769921
    # we create a pool with a really tiny data volume that won't be
    # able to complete a mkfs.
    with_standard_pool(8) do |pool|
      with_new_thin(pool, @volume_size, 0) do |thin|

        event_tracker = pool.event_tracker;

        thin_fs = FS::file_system(:ext4, thin)
        fork {thin_fs.format}

        # FIXME: this is so common it should go into a utility lib
        event_tracker.wait do
          status = PoolStatus.new(pool)
          status.used_data_blocks >= status.total_data_blocks - @low_water_mark
        end

        # we're not sure what the development version of dmeventd was
        # doing to create the issue; some experiments to find out.
        pool.info

        # Resize the pool so the format can complete
        table = Table.new(ThinPool.new(@size, @metadata_dev, @data_dev,
                                       @data_block_size, @low_water_mark))
        pool.load(table)
        pool.resume
      end
    end

    Process.wait
    if $?.exitstatus > 0
      raise RuntimeError, "wipe sub process failed"
    end
  end

Comment 2 Mike Snitzer 2012-01-05 15:26:35 UTC
Will reopen if we hit this again.

