706036 – pvmove stuck waiting for I/O to complete

Bug 706036 - pvmove stuck waiting for I/O to complete

Summary: pvmove stuck waiting for I/O to complete

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	lvm2
Sub Component:
Version:	5.6
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Alasdair Kergon
QA Contact:	Corey Marthaler
Docs Contact:
URL:
Whiteboard:
Depends On:	684083
Blocks:	602516 668957
TreeView+	depends on / blocked

Reported:	2011-05-19 09:13 UTC by hank
Modified:	2018-12-03 17:17 UTC (History)
CC List:	17 users (show)
Fixed In Version:	lvm2-2.02.88-1.el5
Doc Type:	Bug Fix
Doc Text:
Clone Of:	602516
Clones:	1020385 (view as bug list)
Environment:
Last Closed:	2012-02-21 06:04:22 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
a patch for solve pvmove hung problem (1.88 KB, patch) 2011-05-26 06:00 UTC, hank	no flags	Details \| Diff
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Legacy)	57723	0	None	None	None	Never
Red Hat Product Errata	RHBA-2012:0161	0	normal	SHIPPED_LIVE	lvm2 bug fix and enhancement update	2012-02-20 15:07:59 UTC

Description hank 2011-05-19 09:13:57 UTC

+++ This bug was initially created as a clone of Bug #602516 +++

Description of problem:
A customer reported that when they are doing pvmove while doing some file IO and database IO on the LV, the pvmove and those IO may hang.

They see that in their production system before. And they setup a testing system with the following structure:

LV:Ftest               LV:Ftestdb
   |                        |
   +-----------+------------+
               |
           VG:vgtest
               |
   +-----+-----+-----+-----+
   |     |           |     |
  sdb   sdd         sdg   sdi  

They do the test like this:
1. Create vgtest with 4 PVs
2. Create LV "Ftest" and "Ftestdb" on the VG, specify PV as sdb and sdd
3. Create ext3 fs on Ftest and Ftestdb and mount them
4. Run some file IO workload on Ftest and run oracle workload on Ftestdb
5. do pvmove from sdb to sdg, finish successfully in 5 mins (see pvmove_1)
6. do pvmove from sdd to sdi, seems hanged at this line (see pvmove_2):

   Suspending vgtest-Ftestdb (253:14) without device flush

And customer waited for a few hours and it didn't have any progress. Moreover, all the IO on the LV seems stopped and the related process hanged at "D" state.

Moreover, we asked to customer to use "iostat -x 1" to check and no IO traffic on the disk. And "ctrl+c" cannot break the pvmove process.

Version-Release number of selected component (if applicable):
kernel-2.6.9-67.EL

How reproducible:
Every time the above test is run.


--- Additional comment from lmcilroy on 2010-06-09 23:06:25 EDT ---

$ cat pvmove_1
root-root@hklxdv55-DR:/$pvmove -v /dev/sdb /dev/sdg
    Wiping cache of LVM-capable devices
    Finding volume group "vgtest"
    Archiving volume group "vgtest" metadata (seqno 13).
    Creating logical volume pvmove0
    Moving 16 extents of logical volume vgtest/Ftest
    Moving 15 extents of logical volume vgtest/Ftestdb
    Found volume group "vgtest"
    Found volume group "vgtest"
    Updating volume group metadata
    Creating volume group backup "/etc/lvm/backup/vgtest" (seqno 14).
    Found volume group "vgtest"
    Found volume group "vgtest"
    Suspending vgtest-Ftest (253:13) without device flush
    Found volume group "vgtest"
    Found volume group "vgtest"
    Suspending vgtest-Ftestdb (253:14) without device flush
    Found volume group "vgtest"
    Creating vgtest-pvmove0
    Loading vgtest-pvmove0 table
    Resuming vgtest-pvmove0 (253:15)
    Found volume group "vgtest"
    Loading vgtest-pvmove0 table
    Suppressed vgtest-pvmove0 identical table reload.
    Loading vgtest-Ftest table
    Resuming vgtest-Ftest (253:13)
    Found volume group "vgtest"
    Loading vgtest-pvmove0 table
    Suppressed vgtest-pvmove0 identical table reload.
    Loading vgtest-Ftestdb table
    Resuming vgtest-Ftestdb (253:14)
    Checking progress every 15 seconds
  /dev/sdb: Moved: 16.1%
  /dev/sdb: Moved: 32.3%
  /dev/sdb: Moved: 51.6%
    Updating volume group metadata
    Creating volume group backup "/etc/lvm/backup/vgtest" (seqno 15).
    Found volume group "vgtest"
    Found volume group "vgtest"
    Suspending vgtest-Ftest (253:13) without device flush
    Suspending vgtest-pvmove0 (253:15) without device flush
    Found volume group "vgtest"
    Found volume group "vgtest"
    Suspending vgtest-Ftestdb (253:14) without device flush
    Found volume group "vgtest"
    Found volume group "vgtest"
    Found volume group "vgtest"
    Loading vgtest-pvmove0 table
    Resuming vgtest-pvmove0 (253:15)
    Found volume group "vgtest"
    Loading vgtest-pvmove0 table
    Suppressed vgtest-pvmove0 identical table reload.
    Loading vgtest-Ftest table
    Suppressed vgtest-Ftest identical table reload.
    Resuming vgtest-Ftest (253:13)
    Found volume group "vgtest"
    Loading vgtest-pvmove0 table
    Suppressed vgtest-pvmove0 identical table reload.
    Loading vgtest-Ftestdb table
    Suppressed vgtest-Ftestdb identical table reload.
    Resuming vgtest-Ftestdb (253:14)
  /dev/sdb: Moved: 77.4%
  /dev/sdb: Moved: 96.8%
  /dev/sdb: Moved: 100.0%
    Found volume group "vgtest"
    Found volume group "vgtest"
    Loading vgtest-Ftest table
    Suspending vgtest-Ftest (253:13) without device flush
    Suspending vgtest-pvmove0 (253:15) without device flush
    Found volume group "vgtest"
    Found volume group "vgtest"
    Resuming vgtest-pvmove0 (253:15)
    Loading vgtest-Ftestdb table
    Suspending vgtest-Ftestdb (253:14) without device flush
    Suspending vgtest-pvmove0 (253:15) without device flush
    Found volume group "vgtest"
    Found volume group "vgtest"
    Found volume group "vgtest"
    Resuming vgtest-pvmove0 (253:15)
    Found volume group "vgtest"
    Resuming vgtest-Ftest (253:13)
    Found volume group "vgtest"
    Resuming vgtest-Ftestdb (253:14)
    Found volume group "vgtest"
    Removing vgtest-pvmove0 (253:15)
    Found volume group "vgtest"
    Removing temporary pvmove LV
    Writing out final volume group after pvmove
    Creating volume group backup "/etc/lvm/backup/vgtest" (seqno 17).

$ cat pvmove_2
root-root@hklxdv55-DR:/$pvmove -v /dev/sdd /dev/sdi
    Wiping cache of LVM-capable devices
    Finding volume group "vgtest"
    Archiving volume group "vgtest" metadata (seqno 17).
    Creating logical volume pvmove0
    Moving 8 extents of logical volume vgtest/Ftest
    Moving 17 extents of logical volume vgtest/Ftestdb
    Found volume group "vgtest"
    Found volume group "vgtest"
    Updating volume group metadata
    Creating volume group backup "/etc/lvm/backup/vgtest" (seqno 18).
    Found volume group "vgtest"
    Found volume group "vgtest"
    Suspending vgtest-Ftest (253:13) without device flush
    Found volume group "vgtest"
    Found volume group "vgtest"
    Suspending vgtest-Ftestdb (253:14) without device flush
    Found volume group "vgtest"
    Creating vgtest-pvmove0
    Loading vgtest-pvmove0 table
    Resuming vgtest-pvmove0 (253:15)
    Found volume group "vgtest"
    Loading vgtest-pvmove0 table
    Suppressed vgtest-pvmove0 identical table reload.
    Loading vgtest-Ftest table
    Resuming vgtest-Ftest (253:13)
    Found volume group "vgtest"
    Loading vgtest-pvmove0 table
    Suppressed vgtest-pvmove0 identical table reload.
    Loading vgtest-Ftestdb table
    Resuming vgtest-Ftestdb (253:14)
    Checking progress every 15 seconds
  /dev/sdd: Moved: 16.0%
  /dev/sdd: Moved: 32.0%
    Updating volume group metadata
    Creating volume group backup "/etc/lvm/backup/vgtest" (seqno 19).
    Found volume group "vgtest"
    Found volume group "vgtest"
    Suspending vgtest-Ftest (253:13) without device flush
    Suspending vgtest-pvmove0 (253:15) without device flush
    Found volume group "vgtest"
    Found volume group "vgtest"
    Suspending vgtest-Ftestdb (253:14) without device flush

--- Additional comment from lmcilroy on 2010-06-09 23:15:05 EDT ---

Looking at the vmcore, the pvmove process is blocked attempting to suspend the device - it hasn't run for 13 minutes:

PID: 7373   TASK: 1020e063110       CPU: 0   COMMAND: "pvmove"
#0 [101e05d5c48] schedule at ffffffff8035fb2c
#1 [101e05d5cd0] io_schedule at ffffffff80360fcb
#2 [101e05d5ce0] dm_suspend at ffffffffa00de43d
#3 [101e05d5d80] dev_suspend at ffffffffa00e11a1
#4 [101e05d5db0] ctl_ioctl at ffffffffa00e2383
#5 [101e05d5f40] sys_ioctl at ffffffff801a668a
   RIP: 0000003185dc25d9  RSP: 0000007fbfffb688  RFLAGS: 00000206
   RAX: 0000000000000010  RBX: ffffffff80110a92  RCX: 000000000000000a
   RDX: 000000000066c920  RSI: 00000000c138fd06  RDI: 0000000000000008
   RBP: 0000000000000000   R8: 0000000000000005   R9: 000000000066c950
   R10: 0000003186114700  R11: 0000000000000246  R12: 000000000066ca58
   R13: 0000000000000000  R14: 00000000000000fd  R15: 000000000066c920
   ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b

I found this oracle process waiting for I/O to complete and according to the last_ran time got stuck just before the pvmove process so may be of interest:

PID: 7240   TASK: 10222fbf650       CPU: 0   COMMAND: "oracle"
#0 [10222fc1b38] schedule at ffffffff8035fb2c
#1 [10222fc1bc0] io_schedule at ffffffff80360fcb
#2 [10222fc1bd0] wait_on_page_bit at ffffffff801688ae
#3 [10222fc1c70] wait_on_page_writeback_range at ffffffff80168935
#4 [10222fc1d40] sync_page_range_nolock at ffffffff80168aa2
#5 [10222fc1d70] generic_file_aio_write_nolock at ffffffff8016b159
#6 [10222fc1da0] generic_file_aio_write at ffffffff8016b1eb
#7 [10222fc1df0] ext3_file_write at ffffffffa010b2f9
#8 [10222fc1e10] do_sync_write at ffffffff8018f412
#9 [10222fc1f10] vfs_write at ffffffff8018f50d
#10 [10222fc1f40] sys_pwrite64 at ffffffff8018f6ea
#11 [10222fc1f80] system_call at ffffffff80110a92
   RIP: 000000318660bf75  RSP: 0000007fbfffdda8  RFLAGS: 00010246
   RAX: 0000000000000012  RBX: ffffffff80110a92  RCX: 000000002ae835f7
   RDX: 0000000000000400  RSI: 000000006071ec00  RDI: 0000000000000016
   RBP: 00048753a5be767c   R8: 0000000000092abc   R9: 000000000020e0c2
   R10: 000000000238e000  R11: 0000000000000246  R12: 0000000000000200
   R13: 0000007fbfffd8f0  R14: 0000002a97d56c48  R15: 0000000000000b6f
   ORIG_RAX: 0000000000000012  CS: 0033  SS: 002b

And this oracle process got stuck just after the pvmove process:

PID: 7238   TASK: 10223ba4530       CPU: 0   COMMAND: "oracle"
#0 [1022302db38] schedule at ffffffff8035fb2c
#1 [1022302dbc0] io_schedule at ffffffff80360fcb
#2 [1022302dbd0] wait_on_page_bit at ffffffff801688ae
#3 [1022302dc70] wait_on_page_writeback_range at ffffffff80168935
#4 [1022302dd40] sync_page_range_nolock at ffffffff80168aa2
#5 [1022302dd70] generic_file_aio_write_nolock at ffffffff8016b159
#6 [1022302dda0] generic_file_aio_write at ffffffff8016b1eb
#7 [1022302ddf0] ext3_file_write at ffffffffa010b2f9
#8 [1022302de10] do_sync_write at ffffffff8018f412
#9 [1022302df10] vfs_write at ffffffff8018f50d
#10 [1022302df40] sys_pwrite64 at ffffffff8018f6ea
#11 [1022302df80] system_call at ffffffff80110a92
   RIP: 000000318660bf75  RSP: 0000007fbfffd808  RFLAGS: 00010212
   RAX: 0000000000000012  RBX: ffffffff80110a92  RCX: 000000008c663000
   RDX: 00000000000a0000  RSI: 000000008c75f000  RDI: 0000000000000014
   RBP: 00048753a5c610a1   R8: 00000000000182a1   R9: 000000000020e4a6
   R10: 0000000029108000  R11: 0000000000000246  R12: 0000000000002000
   R13: 0000007fbfffd750  R14: 0000002a975bf4b0  R15: 000000009758e4c8
   ORIG_RAX: 0000000000000012  CS: 0033  SS: 002b

The pvmove process appears to be stuck in this code in dm_suspend():

/*
* Then we wait for the already mapped ios to
* complete.
*/
while (1) {
set_current_state(TASK_INTERRUPTIBLE);

if (!atomic_read(&md->pending) || signal_pending(current))
break;

io_schedule();
}

The suspend will only continue if pvmove receives a signal or the count of pending I/Os drops to zero.  Let's check both cases:

static inline int signal_pending(struct task_struct *p)
{
return unlikely(test_tsk_thread_flag(p,TIF_SIGPENDING));
}

crash> task 1020e063110
PID: 7373   TASK: 1020e063110       CPU: 0   COMMAND: "pvmove"
struct task_struct {
 state = 1,
 thread_info = 0x101e05d4000,
...

crash> thread_info 0x101e05d4000
struct thread_info {
 task = 0x1020e063110,
 exec_domain = 0xffffffff8043d0c0,
 flags = 0,
...

The TIF_SIGPENDING flag is not set so it hasn't received a signal.

The dm_suspend() function leaves the mapped_device pointer in %rbp which we can see here:

crash> dis dm_suspend
0xffffffffa00de24f <dm_suspend>:        push   %r14
0xffffffffa00de251 <dm_suspend+2>:      xor    %eax,%eax
0xffffffffa00de253 <dm_suspend+4>:      mov    $0xa,%ecx
0xffffffffa00de258 <dm_suspend+9>:      cld   
0xffffffffa00de259 <dm_suspend+10>:     push   %r13
0xffffffffa00de25b <dm_suspend+12>:     mov    %esi,%edx
0xffffffffa00de25d <dm_suspend+14>:     push   %r12
0xffffffffa00de25f <dm_suspend+16>:     push   %rbp
0xffffffffa00de260 <dm_suspend+17>:     mov    %rdi,%rbp
...
0xffffffffa00de41a <dm_suspend+459>:    mov    0xe8(%rbp),%eax   # md->pending
0xffffffffa00de420 <dm_suspend+465>:    test   %eax,%eax
0xffffffffa00de422 <dm_suspend+467>:    je     0xffffffffa00de43f
0xffffffffa00de424 <dm_suspend+469>:    mov    %gs:0x0,%rax
0xffffffffa00de42d <dm_suspend+478>:    mov    0x8(%rax),%rax
0xffffffffa00de431 <dm_suspend+482>:    mov    0x10(%rax),%eax
0xffffffffa00de434 <dm_suspend+485>:    test   $0x4,%al
0xffffffffa00de436 <dm_suspend+487>:    jne    0xffffffffa00de43f
0xffffffffa00de438 <dm_suspend+489>:    callq  0xffffffff80360fbc <io_schedule>

And io_schedule() pushes it onto the stack here:

crash> dis io_schedule
0xffffffff80360fbc <io_schedule>:       push   %rbp
0xffffffff80360fbd <io_schedule+1>:     mov    %rsp,%rbp
0xffffffff80360fc0 <io_schedule+4>:     incl   0x166d1e(%rip)        # 0xffffffff804c7ce4 <per_cpu__runqueues+4676>
0xffffffff80360fc6 <io_schedule+10>:    callq  0xffffffff8035f58b <schedule>
0xffffffff80360fcb <io_schedule+15>:    decl   0x166d13(%rip)        # 0xffffffff804c7ce4 <per_cpu__runqueues+4676>
0xffffffff80360fd1 <io_schedule+21>:    leaveq
0xffffffff80360fd2 <io_schedule+22>:    retq   

crash> bt -f 7373
PID: 7373   TASK: 1020e063110       CPU: 0   COMMAND: "pvmove"
#0 [101e05d5c48] schedule at ffffffff8035fb2c
   101e05d5c50: 0000000000000006 00000101e05d5d28
   101e05d5c60: 0000007300000046 000001020e063110
   101e05d5c70: 00000000000bc4cd 000001022e9d3110
   101e05d5c80: 000001020e0633f8 000001022e2c7200
   101e05d5c90: ffffffffa00dfa35 00000101e05d5cd4
   101e05d5ca0: 000001023c09cc00 0000000000000000
   101e05d5cb0: 00000000ffffffea 0000000000000000
   101e05d5cc0: ffffffffa00e1040 00000101e05d5cd8
   101e05d5cd0: ffffffff80360fcb
#1 [101e05d5cd0] io_schedule at ffffffff80360fcb
   101e05d5cd8: 000001023f213e00 ffffffffa00de43d
...

crash> px *(struct mapped_device *)0x000001023f213e00
$7 = {
 io_lock = {
   activity = 0x0,
   wait_lock = {
     magic = 0x1d244b3c,
     lock = 0x0,
     babble = 0x5,
     module = 0xffffffff80384ccd "lib/rwsem-spinlock.c",
     owner = 0xffffffff80384ccd "lib/rwsem-spinlock.c",
     oline = 0x12e
   },
   wait_list = {
     next = 0x1023f213e38,
     prev = 0x1023f213e38
   }
 },
 suspend_lock = {
   count = {
     counter = 0x0
   },
   sleepers = 0x0,
   wait = {
     lock = {
       magic = 0x1d244b3c,
       lock = 0x0,
       babble = 0xa,
       module = 0xffffffffa00e5161 "include/linux/wait.h",
       owner = 0x0,
       oline = 0x0
     },
     task_list = {
       next = 0x1023f213e80,
       prev = 0x1023f213e80
     }
   }
 },
 pushback_lock = {
   magic = 0x1d244b3c,
   lock = 0x0,
   babble = 0x5,
   module = 0xffffffffa00e4f96 "drivers/md/dm.c",
   owner = 0x0,
   oline = 0x0
 },
 map_lock = {<No data fields>},
 holders = {
   counter = 0x3
 },
 open_count = {
   counter = 0x1
 },
 flags = 0x1,
 queue = 0x1023e379420,
 disk = 0x100bfc97e00,
 interface_ptr = 0x1023ee29200,
 pending = {
   counter = 0x1
 },
 wait = {
   lock = {
     magic = 0x1d244b3c,
     lock = 0x0,
     babble = 0xa,
     module = 0xffffffffa00e5161 "include/linux/wait.h",
     owner = 0xffffffff8037aba9 "kernel/fork.c",
     oline = 0x6c
   },
   task_list = {
     next = 0x101e05d5d40,
     prev = 0x101e05d5d40
   }
 },
 deferred = {
   head = 0x101e2a69980,
   tail = 0x1020289a380
 },
 pushback = {
   head = 0x0,
   tail = 0x0
 },
 map = 0x1023c09cc00,
 io_pool = 0x1023d338b80,
 tio_pool = 0x1023d338c80,
 bs = 0x1023ee29cc0,
 event_nr = {
   counter = 0x0
 },
 eventq = {
   lock = {
     magic = 0x1d244b3c,
     lock = 0x0,
     babble = 0xa,
     module = 0xffffffffa00e5161 "include/linux/wait.h",
     owner = 0x0,
     oline = 0x0
   },
   task_list = {
     next = 0x1023f213fa8,
     prev = 0x1023f213fa8
   }
 },
 frozen_sb = 0x0,
 suspended_bdev = 0x1023c89a300
}

And here we have md->pending == 1 so it's still waiting for a device mapped I/O to complete and call dec_pending()->end_io_acct() which will wake up the pvmove process.  So we need to work out what is holding up that last I/O.

--- Additional comment from lmcilroy on 2010-06-09 23:51:22 EDT ---

Some analysis from Bryn:

PV table:

pv0 8  16 /dev/sdb KccCqL-ngfW-l0s9-EJZ0-SsGa-rL0Q-VZbUKe
pv1 8  48 /dev/sdd jIXNDc-yBd3-dRKg-8m20-qWRV-jyhr-d5DTu4
pv2 8  96 /dev/sdg iqCYlL-Hl0j-SyCN-XwKF-dxgE-UwXx-VfSU22
pv3 8 128 /dev/sdi RGoqRm-BZ3H-b1sP-8JJ9-ijgi-gJW2-bV0a0c

According to the dm data in the lvmdump we have two suspended devices:

$ grep '\ L.s.\ ' lvmdump-hklxdv55-2010051831514/dmsetup_info
vgtest-Ftest     253  13 L-sw    3    2      0 LVM-53JBzdDpdYHnBaKBdbaLXmiZ7VdTG606m6TeYUkWcl1U3kfqjIh5zJEhorK53Emr
vgtest-pvmove0   253  15 L-sw    3    3      1 LVM-53JBzdDpdYHnBaKBdbaLXmiZ7VdTG6069vX6BhJ0Z7etvouBH8VZM82I3GnhRXu5

This is consistent with the state recorded in pvmove_2:

root-root@hklxdv55-DR:/$pvmove -v /dev/sdd /dev/sdi
   Wiping cache of LVM-capable devices
   Finding volume group "vgtest"
   Archiving volume group "vgtest" metadata (seqno 17).
   Creating logical volume pvmove0
[...]
   Suspending vgtest-Ftest (253:13) without device flush
   Suspending vgtest-pvmove0 (253:15) without device flush
   Found volume group "vgtest"
   Found volume group "vgtest"
   Suspending vgtest-Ftestdb (253:14) without device flush

We can see that the temporary vgtest-pvmove0 LV has been created:

vgtest-pvmove0: 0 1048576 mirror core 1 1024 2 8:48 2097536 8:128 384
vgtest-pvmove0: 1048576 2097152 linear 8:48 384
vgtest-pvmove0: 3145728 131072 linear 8:48 3146112

Looking at the device stack at this point:

vgtest-Ftest: 0 2097152 linear 8:96 384
vgtest-Ftest: 2097152 1048576 linear 253:15 0

vgtest-Ftestdb: 0 2097152 linear 253:15 1048576
vgtest-Ftestdb: 2097152 1966080 linear 8:96 2097536
vgtest-Ftestdb: 4063232 131072 linear 253:15 3145728

vgtest-Ftest is a two-segment volume consisting of one region on /dev/sdg and one region on the temporary mirror:

+--------------+
| vgtest-Ftest | [S]
+--------------+
/    / \      \
+-----+ +----------------+
| sdg | | vgtest-pvmove0 | [S]
+-----+ +----------------+

vgtest-Ftestdb is a three-segment volume consisting of one region on /dev/sdg and two regions on the temporary mirror:

 +----------------+
 | vgtest-Ftestdb | [*]
 +----------------+
  /  / |   |   \   \
+----+ +-----+ +----------------+
|pvm0| | sdg | |       pvm0     | [S]
+----+ +-----+ +----------------+

The pvmove command running on the system is blocked attempting to suspend vgtest-Ftestdb:

kernel: pvmove        S 00004aae9fdf7c94     0 25787  24814                     (NOTLB)
kernel: 0000010061f4dcc8 0000000000000006 0000010061f4dd28 0000007300000046
kernel:        00000101ad8b9110 00000000000c4342 0000010230b88230 00000101ad8b93f8
kernel:        0000010068967700 ffffffffa00dfa35
kernel: Call Trace:<ffffffffa00dfa35>{:dm_mod:dm_table_unplug_all+49}
kernel:        <ffffffffa00e1040>{:dm_mod:dev_suspend+0} <ffffffff80360fcb>{io_schedule+15}
kernel:        <ffffffffa00de43d>{:dm_mod:dm_suspend+494} <ffffffff80183b4b>{map_vm_area+447}
kernel:        <ffffffff801355dd>{default_wake_function+0} <ffffffff801355dd>{default_wake_function+0}
kernel:        <ffffffffa00e11a1>{:dm_mod:dev_suspend+353} <ffffffffa00e2383>{:dm_mod:ctl_ioctl+602}
kernel:        <ffffffff801a668a>{sys_ioctl+1006} <ffffffff80110a92>{system_call+126}

The pvmove has made no progress at all between the three sets of call traces (register values are unchanged between the 3 sets).

This data also indicates that the first pvmove did indeed complete correctly as reported (all segments have been transferred correctly to /dev/sdg and the updated metadata reflects this).

--- Additional comment from lmcilroy on 2010-06-10 02:38:04 EDT ---

Tried to reproduce this issue in a VM using four physical volumes (each backed by a 2GB file), combined them into one volume group, created logical volumes over the first two disks and formatted them with ext3.  I ran a load on the filesystems (dd and fsstress) while pvmove'ing sda to sdc and sdb to sdd and vice versa.  I could not cause pvmove to get stuck but it did run very slowly.

  --- Volume group ---
  VG Name               vgtest
  System ID             
  Format                lvm2
  Metadata Areas        4
  Metadata Sequence No  11
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                2
  Open LV               2
  Max PV                0
  Cur PV                4
  Act PV                4
  VG Size               7.75 GB
  PE Size               64.00 MB
  Total PE              124
  Alloc PE / Size       62 / 3.88 GB
  Free  PE / Size       62 / 3.88 GB
  VG UUID               wWsTLq-NprO-E0Wn-OdYa-77Eq-MOtH-MrxxVf
   
  --- Logical volume ---
  LV Name                /dev/vgtest/lv0
  VG Name                vgtest
  LV UUID                2SIki7-aIKj-yJMd-4zej-YoRQ-SyBF-3R3wKr
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                1.94 GB
  Current LE             31
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:2
   
  --- Logical volume ---
  LV Name                /dev/vgtest/lv1
  VG Name                vgtest
  LV UUID                iFfX3c-naiq-oPfo-bOxT-G3f8-hj0X-SLgvOG
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                1.94 GB
  Current LE             31
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:3
   
  --- Physical volumes ---
  PV Name               /dev/sda     
  PV UUID               f4GWeC-f2zr-1wTm-dzlL-Q0ue-l4OY-SC2CC5
  PV Status             allocatable
  Total PE / Free PE    31 / 31
   
  PV Name               /dev/sdb     
  PV UUID               aPwCGC-X4fl-sMvF-fkiW-PXOS-zGGr-eDyc3o
  PV Status             allocatable
  Total PE / Free PE    31 / 31
   
  PV Name               /dev/sdc     
  PV UUID               F1e9Yb-IcHS-Ae3u-OmLB-lR14-ytqg-fEnGV2
  PV Status             allocatable
  Total PE / Free PE    31 / 0
   
  PV Name               /dev/sdd     
  PV UUID               MhmBhu-Vk7q-g75v-mX8G-3lMP-b1RI-YthlbL
  PV Status             allocatable
  Total PE / Free PE    31 / 0


--- Additional comment from lmcilroy on 2010-06-28 02:02:39 EDT ---

Continuing the analysis:

crash> px *(struct mapped_device *)0x000001023f213e00
$32 = {
  io_lock = {
    activity = 0x0, 
    wait_lock = {
      magic = 0x1d244b3c, 
      lock = 0x0, 
      babble = 0x5, 
      module = 0xffffffff80384ccd "lib/rwsem-spinlock.c", 
      owner = 0xffffffff80384ccd "lib/rwsem-spinlock.c", 
      oline = 0x12e
    }, 
    wait_list = {
      next = 0x1023f213e38, 
      prev = 0x1023f213e38
    }
  }, 
  suspend_lock = {
    count = {
      counter = 0x0
    }, 
    sleepers = 0x0, 
    wait = {
      lock = {
        magic = 0x1d244b3c, 
        lock = 0x0, 
        babble = 0xa, 
        module = 0xffffffffa00e5161 "include/linux/wait.h", 
        owner = 0x0, 
        oline = 0x0
      }, 
      task_list = {
        next = 0x1023f213e80, 
        prev = 0x1023f213e80
      }
    }
  }, 
  pushback_lock = {
    magic = 0x1d244b3c, 
    lock = 0x0, 
    babble = 0x5, 
    module = 0xffffffffa00e4f96 "drivers/md/dm.c", 
    owner = 0x0, 
    oline = 0x0
  }, 
  map_lock = {<No data fields>}, 
  holders = {
    counter = 0x3
  }, 
  open_count = {
    counter = 0x1
  }, 
  flags = 0x1, 
  queue = 0x1023e379420, 
  disk = 0x100bfc97e00, 
  interface_ptr = 0x1023ee29200, 
  pending = {
    counter = 0x1
  }, 
  wait = {
    lock = {
      magic = 0x1d244b3c, 
      lock = 0x0, 
      babble = 0xa, 
      module = 0xffffffffa00e5161 "include/linux/wait.h", 
      owner = 0xffffffff8037aba9 "kernel/fork.c", 
      oline = 0x6c
    }, 
    task_list = {
      next = 0x101e05d5d40, 
      prev = 0x101e05d5d40
    }
  }, 
  deferred = {
    head = 0x101e2a69980, 
    tail = 0x1020289a380
  }, 
  pushback = {
    head = 0x0, 
    tail = 0x0
  }, 
  map = 0x1023c09cc00, 
  io_pool = 0x1023d338b80, 
  tio_pool = 0x1023d338c80, 
  bs = 0x1023ee29cc0, 
  event_nr = {
    counter = 0x0
  }, 
  eventq = {
    lock = {
      magic = 0x1d244b3c, 
      lock = 0x0, 
      babble = 0xa, 
      module = 0xffffffffa00e5161 "include/linux/wait.h", 
      owner = 0x0, 
      oline = 0x0
    }, 
    task_list = {
      next = 0x1023f213fa8, 
      prev = 0x1023f213fa8
    }
  }, 
  frozen_sb = 0x0, 
  suspended_bdev = 0x1023c89a300
}

pending is 1
flags is DMF_BLOCK_IO (stalling I/Os while trying to suspend)
what does the disk queue say?

crash> gendisk 0x100bfc97e00
struct gendisk {
  major = 253, 
  first_minor = 14, 
  minors = 1, 
  disk_name = "dm-14\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000", 
  part = 0x0, 
  fops = 0xffffffffa00eef00, 
  queue = 0x1023e379420, 
  private_data = 0x1023f213e00, 
  capacity = 4194304, 
  flags = 16, 
  devfs_name = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000", 
  number = 0, 
  driverfs_dev = 0x0, 
  kobj = {
    k_name = 0x100bfc97eb0 "dm-14", 
    name = "dm-14\000\000\000\000\000\000\000\000\000\000\000\000\000\000", 
    kref = {
      refcount = {
        counter = 3
      }
    }, 
    entry = {
      next = 0x1023c09ecc8, 
      prev = 0x1023f213cc8
    }, 
    parent = 0xffffffff804646e0, 
    kset = 0xffffffff804646c0, 
    ktype = 0x0, 
    dentry = 0x1023e05da20
  }, 
  random = 0x1023de20f00, 
  policy = 0, 
  sync_io = {
    counter = 0
  }, 
  stamp = 4296507683, 
  stamp_idle = 4296507683, 
  in_flight = 1, 
  dkstats = {
    read_sectors = 435498, 
    write_sectors = 5866608, 
    reads = 12972, 
    writes = 733326, 
    read_merges = 0, 
    write_merges = 0, 
    read_ticks = 55488, 
    write_ticks = 1125481, 
    io_ticks = 1096527, 
    time_in_queue = 1943931
  }
}

in_flight is 1 so that agrees.
Device is 253:14 (dm-14)

Let's look a the dm map and find out what's below this device.

crash> px *((struct mapped_device *)0x000001023f213e00)->map
$34 = {
  holders = {
    counter = 0x2
  }, 
  depth = 0x1, 
  counts = {0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 
  index = {0xffffff000003c000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 
  num_targets = 0x3, 
  num_allocated = 0x10, 
  highs = 0xffffff000003c000, 
  targets = 0xffffff000003c080, 
  mode = 0x3, 
  devices = {
    next = 0x101f9374ac0, 
    prev = 0x10235bb3e80
  }, 
  limits = {
    max_sectors = 0xffff, 
    max_phys_segments = 0x80, 
    max_hw_segments = 0xff, 
    hardsect_size = 0x200, 
    max_segment_size = 0x10000, 
    seg_boundary_mask = 0xffffffff
  }, 
  event_fn = 0xffffffffa00dd404 <event_callback>, 
  event_context = 0x1023f213e00, 
  md = 0x1023f213e00
}

crash> px ((struct mapped_device *)0x000001023f213e00)->map->targets[0]->type->name
$38 = 0xffffffffa00e53c2 "linear"

crash> px ((struct mapped_device *)0x000001023f213e00)->map->targets[0]->private
$39 = (void *) 0x101f93edfa0

crash> px *((struct linear_c *)0x101f93edfa0)->dev
$41 = {
  list = {
    next = 0x1023c09cce8, 
    prev = 0x101f9374ac0
  }, 
  count = {
    counter = 0x2
  }, 
  mode = 0x3, 
  bdev = 0x101dfa6ca00, 
  name = "253:15\000\000\000\000\000\000\000\000\000"
}

crash> block_device 0x101dfa6ca00
struct block_device {
  bd_dev = 265289743, 
  bd_inode = 0x101dfa6cb10, 
  bd_openers = 3, 
  bd_sem = {
    count = {
      counter = 1
    }, 
    sleepers = 0, 
    wait = {
      lock = {
        magic = 488917820, 
        lock = 0, 
        babble = 10, 
        module = 0xffffffff803758bd "include/linux/wait.h", 
        owner = 0x0, 
        oline = 0
      }, 
      task_list = {
        next = 0x101dfa6ca50, 
        prev = 0x101dfa6ca50
      }
    }
  }, 
  bd_mount_sem = {
    count = {
      counter = 1
    }, 
    sleepers = 0, 
    wait = {
      lock = {
        magic = 488917820, 
        lock = 0, 
        babble = 10, 
        module = 0xffffffff803758bd "include/linux/wait.h", 
        owner = 0x0, 
        oline = 0
      }, 
      task_list = {
        next = 0x101dfa6ca98, 
        prev = 0x101dfa6ca98
      }
    }
  }, 
  bd_inodes = {
    next = 0x1022bb68970, 
    prev = 0x1022bb68970
  }, 
  bd_holder = 0xffffffffa00e524e, 
  bd_holders = 4, 
  bd_contains = 0x101dfa6ca00, 
  bd_block_size = 4096, 
  bd_part = 0x0, 
  bd_part_count = 0, 
  bd_invalidated = 0, 
  bd_disk = 0x1023c09ec00, 
  bd_list = {
    next = 0x10006be7270, 
    prev = 0x1023c1d5570
  }, 
  bd_inode_backing_dev_info = 0x0, 
  bd_private = 0
}

crash> gendisk 0x1023c09ec00
struct gendisk {
  major = 253, 
  first_minor = 15, 
  minors = 1, 
  disk_name = "dm-15\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000", 
  part = 0x0, 
  fops = 0xffffffffa00eef00, 
  queue = 0x1023e379810, 
  private_data = 0x1023c09ce00, 
  capacity = 3276800, 
  flags = 16, 
  devfs_name = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000", 
  number = 0, 
  driverfs_dev = 0x0, 
  kobj = {
    k_name = 0x1023c09ecb0 "dm-15", 
    name = "dm-15\000\000\000\000\000\000\000\000\000\000\000\000\000\000", 
    kref = {
      refcount = {
        counter = 3
      }
    }, 
    entry = {
      next = 0xffffffff804646d0, 
      prev = 0x100bfc97ec8
    }, 
    parent = 0xffffffff804646e0, 
    kset = 0xffffffff804646c0, 
    ktype = 0x0, 
    dentry = 0x1020d87a908
  }, 
  random = 0x101f93edf80, 
  policy = 0, 
  sync_io = {
    counter = 0
  }, 
  stamp = 4296507683, 
  stamp_idle = 4296507683, 
  in_flight = 0, 
  dkstats = {
    read_sectors = 40, 
    write_sectors = 504856, 
    reads = 5, 
    writes = 63107, 
    read_merges = 0, 
    write_merges = 0, 
    read_ticks = 79, 
    write_ticks = 55902, 
    io_ticks = 28089, 
    time_in_queue = 55982
  }
}

in_flight is 0.
Device is 253:15 (dm-15) so that's another dm device.
this dm's mapped_device is in private_data

crash> mapped_device 0x1023c09ce00
struct mapped_device {
  io_lock = {
    activity = 0, 
    wait_lock = {
      magic = 488917820, 
      lock = 0, 
      babble = 5, 
      module = 0xffffffff80384ccd "lib/rwsem-spinlock.c", 
      owner = 0xffffffff80384ccd "lib/rwsem-spinlock.c", 
      oline = 302
    }, 
    wait_list = {
      next = 0x1023c09ce38, 
      prev = 0x1023c09ce38
    }
  }, 
  suspend_lock = {
    count = {
      counter = 1
    }, 
    sleepers = 0, 
    wait = {
      lock = {
        magic = 488917820, 
        lock = 0, 
        babble = 10, 
        module = 0xffffffffa00e5161 "include/linux/wait.h", 
        owner = 0x0, 
        oline = 0
      }, 
      task_list = {
        next = 0x1023c09ce80, 
        prev = 0x1023c09ce80
      }
    }
  }, 
  pushback_lock = {
    magic = 488917820, 
    lock = 0, 
    babble = 5, 
    module = 0xffffffffa00e4f96 "drivers/md/dm.c", 
    owner = 0x0, 
    oline = 0
  }, 
  map_lock = {<No data fields>}, 
  holders = {
    counter = 4
  }, 
  open_count = {
    counter = 3
  }, 
  flags = 3, 
  queue = 0x1023e379810, 
  disk = 0x1023c09ec00, 
  interface_ptr = 0x1022e2c7f40, 
  pending = {
    counter = 0
  }, 
  wait = {
    lock = {
      magic = 488917820, 
      lock = 0, 
      babble = 10, 
      module = 0xffffffffa00e5161 "include/linux/wait.h", 
      owner = 0xffffffff8037aba9 "kernel/fork.c", 
      oline = 131
    }, 
    task_list = {
      next = 0x1023c09cf20, 
      prev = 0x1023c09cf20
    }
  }, 
  deferred = {
    head = 0x1023f264980, 
    tail = 0x1020aa44800
  }, 
  pushback = {
    head = 0x0, 
    tail = 0x0
  }, 
  map = 0x1023d05d000, 
  io_pool = 0x1022eff23c0, 
  tio_pool = 0x1022eff22c0, 
  bs = 0x101f36e4300, 
  event_nr = {
    counter = 1
  }, 
  eventq = {
    lock = {
      magic = 488917820, 
      lock = 0, 
      babble = 10, 
      module = 0xffffffffa00e5161 "include/linux/wait.h", 
      owner = 0xffffffff8037a9ff "kernel/sched.c", 
      oline = 2894
    }, 
    task_list = {
      next = 0x1023c09cfa8, 
      prev = 0x1023c09cfa8
    }
  }, 
  frozen_sb = 0x0, 
  suspended_bdev = 0x101dfa6ca00
}

pending is 0.
flags is 3 so that's DMF_BLOCK_IO and DMF_SUSPENDED.

Aha! The underlying device is suspended.  The pvmove is stuck trying to suspend dm-14 because an I/O is blocked by the underlying device dm-15 because it is suspended - dm-15 will need to resumed before dm-14 can continue.

--- Additional comment from lmcilroy on 2010-08-26 03:02:05 EDT ---

I've managed to reproduce this problem but I had to mimic the customer's lvm config exactly to get pvmove to hang.  Here's what I did:

- Installed a VM with RHEL4.6
- Created 4 x 2GB files on the host and attached them to the VM as SCSI disks

$ pvcreate /dev/sda /dev/sdb /dev/sdc /dev/sdd

Must use a 64MB extent size:

$ vgcreate -s 64M vgtest /dev/sda /dev/sdb /dev/sdc /dev/sdd
 
This where it got confusing - the volume Ftest is made up of 2 segments and spans two disks with 16 extents on the first disk and 8 on the second.  The volume Ftestdb is made up of 3 segments and spans the same two disks with 16 extents on the second disk, then 15 extents on the first and then 1 extent on the second disk again.  The only way I know to create that is with this:

$ lvcreate -n Ftest vgtest -l 16 /dev/sda
$ lvcreate -n Ftestdb vgtest -l 16 /dev/sdb
$ lvextend -l 31 /dev/vgtest/Ftestdb /dev/sda
$ lvextend -l 24 /dev/vgtest/Ftest /dev/sdb
$ lvextend -l 32 /dev/vgtest/Ftestdb /dev/sdb

This is what it all looks like after that.

vgtest-Ftest: 0 2097152 linear 8:0 384
vgtest-Ftest: 2097152 1048576 linear 8:16 384

vgtest-Ftestdb: 0 2097152 linear 8:16 1048960
vgtest-Ftestdb: 2097152 1966080 linear 8:0 2097536
vgtest-Ftestdb: 4063232 131072 linear 8:16 3146112


  --- Logical volume ---
  LV Name                /dev/vgtest/Ftest
  VG Name                vgtest
  LV UUID                XaxY0e-TNVq-gwzS-2scb-yA6L-lZ8u-atAnHm
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                1.50 GB
  Current LE             24
  Segments               2
  Allocation             inherit
  Read ahead sectors     0
  Block device           253:3
   
  --- Logical volume ---
  LV Name                /dev/vgtest/Ftestdb
  VG Name                vgtest
  LV UUID                2a5AaI-enRN-DdAk-eqCK-JMJ5-Jm4p-0lAzdg
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                2.00 GB
  Current LE             32
  Segments               3
  Allocation             inherit
  Read ahead sectors     0
  Block device           253:4


  --- Volume group ---
  VG Name               vgtest
  System ID             
  Format                lvm2
  Metadata Areas        4
  Metadata Sequence No  72
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                2
  Open LV               0
  Max PV                0
  Cur PV                4
  Act PV                4
  VG Size               7.75 GB
  PE Size               64.00 MB
  Total PE              124
  Alloc PE / Size       56 / 3.50 GB
  Free  PE / Size       68 / 4.25 GB
  VG UUID               42GW3U-1l0X-Sb5I-jD5b-S3Gy-w8gc-LtAlRB


  --- Physical volume ---
  PV Name               /dev/sda
  VG Name               vgtest
  PV Size               1.95 GB / not usable 16.00 MB
  Allocatable           yes (but full)
  PE Size (KByte)       65536
  Total PE              31
  Free PE               0
  Allocated PE          31
  PV UUID               gCJW0L-Lqrd-snDX-aswE-rLJH-wvlc-I73604
   
  --- Physical volume ---
  PV Name               /dev/sdb
  VG Name               vgtest
  PV Size               1.95 GB / not usable 16.00 MB
  Allocatable           yes 
  PE Size (KByte)       65536
  Total PE              31
  Free PE               6
  Allocated PE          25
  PV UUID               cvzwHb-vYqS-uTz0-cclI-P1Nt-N35H-R11i8x
   
  --- Physical volume ---
  PV Name               /dev/sdc
  VG Name               vgtest
  PV Size               1.95 GB / not usable 16.00 MB
  Allocatable           yes 
  PE Size (KByte)       65536
  Total PE              31
  Free PE               31
  Allocated PE          0
  PV UUID               yDuC5g-Pfla-TUvF-kGk3-zPpg-f4o3-AEFc3p
   
  --- Physical volume ---
  PV Name               /dev/sdd
  VG Name               vgtest
  PV Size               1.95 GB / not usable 16.00 MB
  Allocatable           yes 
  PE Size (KByte)       65536
  Total PE              31
  Free PE               31
  Allocated PE          0
  PV UUID               DwVjpk-XvvM-Q0q3-0VPJ-lOGu-Ycze-CcDVAd

Make filesystems and mount both volumes:

$ mkfs.ext3 /dev/vgtest/Ftest
$ mkfs.ext3 /dev/vgtest/Ftestdb
$ mount /dev/vgtest/Ftest /mnt/Ftest
$ mount /dev/vgtest/Ftestdb /mnt/Ftestdb

Add an I/O load to the Ftestdb volume:

$ fsstress -d /mnt/Ftestdb/fsstress -n 1000000 -l 0 -p 1

And fire off pvmove, this one should succeed:

$ pvmove -v /dev/sda /dev/sdc

This one should hang (if it doesn't hang keep following the steps):

$ pvmove -v /dev/sdb /dev/sdd

This will succeed:

$ pvmove -v /dev/sdc /dev/sda

And this one should hang:

$ pvmove -v /dev/sdd /dev/sdb

The output from pvmove before it hangs shows that it is trying to suspend a device (vgtest-Ftestdb) that sits on top on another already suspended device (vgtest-pvmove0):

    Suspending vgtest-Ftest (253:2) without device flush
    Suspending vgtest-pvmove0 (253:4) without device flush
    Found volume group "vgtest"
    Found volume group "vgtest"
    Suspending vgtest-Ftestdb (253:3) without device flush

--- Additional comment from dejohnso on 2011-02-16 15:28:36 EST ---

Information that might prove helpful that one customer has noticed.

Scenario.
New bigger volume is added and data moved on it via pvmove.
pvmove at some time stops in D state.

dmsetup info will show suspended devices
dmsetup resume <map> - on the suspended devices unblocks pvmove and it continues.

This only happens with pvmove.

--- Additional comment from lmcilroy on 2011-02-16 19:29:39 EST ---

Manually resuming the devices will clear the problem but it may also undermine the integrity of pvmove since the devices are being resumed before it expects them to be.

This problem is due to pvmove suspending a device (A) that sits on top of another device (B) that is already suspended.  Any I/O sent to device B while it is suspended gets queued and wont complete until device B is resumed.  For device A to be suspended it must wait for all I/O it has already issued to complete but device A is waiting for an I/O that's in device B's queue.

The fix for this problem will need to be done in pvmove.  It will need to avoid the sequence "suspend B, ... suspend A" by either always suspending A first whenever suspending B or by converting it to "suspend B, ... resume B, suspend A, suspend B".

Comment 1 hank 2011-05-26 06:00:05 UTC

Created attachment 500986 [details]
a patch for solve pvmove hung problem

Summary the problem:

when doing pvmove, the lv tree is as below:

+--------------+
| vgtest-Ftest |
+--------------+
/    / \      \
+-----+ +----------------+
| sdg | |      pvm0      |
+-----+ +----------------+


 +----------------+
 | vgtest-Ftestdb |
 +----------------+
  /  / |   |   \   \
+----+ +-----+ +----------------+
|pvm0| | sdg | |       pvm0     |
+----+ +-----+ +----------------+

pvmove will update metadata periodic.
When doing this, pvmove will suspend all the lv,
1, it suspend Ftest and its underlying lv: pvm0
2, suspend Ftestdb (pvm0 is also Ftestdb's underlying lv, but
   if it is already been suspended, the program will skip it)

After step 1, an io is written to pvm0 through Ftestdb, and because
pvm0 is already been suspended, the io will not be implement. So there
is an in-flight io on Ftestdb, it will alwasy in-flight unless the pvm0
is resume.
Then do step 2, pvmove try to suspend Ftestdb. Before suspend it, kernel will
wait until all the in-flight io complete. Because pvm0 is already been suspended,
io on it will nevery been implement, so there is alwasy an io in-flight, and
the pvmove program is hung.

To solve this problem, I resume all the underlying lv before try to suspend a lv.

Comment 3 Corey Marthaler 2011-05-27 19:10:40 UTC

Is there potentially more needed to reproduce this? I've been unable to reproduce a hang while running fs I/O during multiple pvmoves.

2.6.18-261.el5

lvm2-2.02.84-3.el5    BUILT: Wed Apr 27 03:42:24 CDT 2011
lvm2-cluster-2.02.84-3.el5    BUILT: Wed Apr 27 03:42:43 CDT 2011
device-mapper-1.02.63-3.el5    BUILT: Thu May 19 08:09:22 CDT 2011
device-mapper-event-1.02.63-3.el5    BUILT: Thu May 19 08:09:22 CDT 2011
cmirror-1.1.39-10.el5    BUILT: Wed Sep  8 16:32:05 CDT 2010
kmod-cmirror-0.1.22-3.el5    BUILT: Tue Dec 22 13:39:47 CST 2009

Comment 4 Lachlan McIlroy 2011-05-30 00:53:44 UTC

(In reply to comment #3)
> Is there potentially more needed to reproduce this? I've been unable to
> reproduce a hang while running fs I/O during multiple pvmoves.

Corey, do you have multiple logical volumes sharing the same physical disk?  I think that's a requirement in order to reproduce the bug.

Comment 5 hank 2011-05-30 01:02:54 UTC

(In reply to comment #3)
> Is there potentially more needed to reproduce this? I've been unable to
> reproduce a hang while running fs I/O during multiple pvmoves.
> 
> 2.6.18-261.el5
> 
> lvm2-2.02.84-3.el5    BUILT: Wed Apr 27 03:42:24 CDT 2011
> lvm2-cluster-2.02.84-3.el5    BUILT: Wed Apr 27 03:42:43 CDT 2011
> device-mapper-1.02.63-3.el5    BUILT: Thu May 19 08:09:22 CDT 2011
> device-mapper-event-1.02.63-3.el5    BUILT: Thu May 19 08:09:22 CDT 2011
> cmirror-1.1.39-10.el5    BUILT: Wed Sep  8 16:32:05 CDT 2010
> kmod-cmirror-0.1.22-3.el5    BUILT: Tue Dec 22 13:39:47 CST 2009

This bug is cloned from rhel4.

In the original bug, Lachlan provide a method to reprudoce the bug:

- Installed a VM with RHEL4.6
- Created 4 x 2GB files on the host and attached them to the VM as SCSI disks

$ pvcreate /dev/sda /dev/sdb /dev/sdc /dev/sdd

Must use a 64MB extent size:

$ vgcreate -s 64M vgtest /dev/sda /dev/sdb /dev/sdc /dev/sdd

This where it got confusing - the volume Ftest is made up of 2 segments and
spans two disks with 16 extents on the first disk and 8 on the second.  The
volume Ftestdb is made up of 3 segments and spans the same two disks with 16
extents on the second disk, then 15 extents on the first and then 1 extent on
the second disk again.  The only way I know to create that is with this:

$ lvcreate -n Ftest vgtest -l 16 /dev/sda
$ lvcreate -n Ftestdb vgtest -l 16 /dev/sdb
$ lvextend -l 31 /dev/vgtest/Ftestdb /dev/sda
$ lvextend -l 24 /dev/vgtest/Ftest /dev/sdb
$ lvextend -l 32 /dev/vgtest/Ftestdb /dev/sdb

......

Make filesystems and mount both volumes:

$ mkfs.ext3 /dev/vgtest/Ftest
$ mkfs.ext3 /dev/vgtest/Ftestdb
$ mount /dev/vgtest/Ftest /mnt/Ftest
$ mount /dev/vgtest/Ftestdb /mnt/Ftestdb

Add an I/O load to the Ftestdb volume:

$ fsstress -d /mnt/Ftestdb/fsstress -n 1000000 -l 0 -p 1

And fire off pvmove, this one should succeed:

$ pvmove -v /dev/sda /dev/sdc

This one should hang (if it doesn't hang keep following the steps):

$ pvmove -v /dev/sdb /dev/sdd

This will succeed:

$ pvmove -v /dev/sdc /dev/sda

And this one should hang:

$ pvmove -v /dev/sdd /dev/sdb

Comment 6 Alasdair Kergon 2011-06-02 12:23:21 UTC

Lots of information to plough through here, but it's not obvious to me yet: are you attempting to run two pvmoves simultaneously?  Or waiting for one to finish before starting the next?

Comment 7 Alasdair Kergon 2011-06-02 12:25:35 UTC

In other words, what is the *simplest* setup you have that recreates the problem?

Comment 8 Alasdair Kergon 2011-06-02 12:27:59 UTC

(And if they don't have to run simultaneously, is it necessary to run two pvmoves to see the problem or can it be seen in some other way by just running one pvmove?  The write-up here looks rather confused.)

Comment 9 Alasdair Kergon 2011-06-02 13:03:40 UTC

Regarding the attached patch:  If the devices are suspended, that's been done for a reason, and resuming them forcibly to release trapped i/o isn't a good idea.  Better to make sure no i/o can become trapped in the first place.

Comment 10 Milan Broz 2011-06-02 13:32:04 UTC

(Also always think how how it works in cluster - you cannot force resume just on local node. And this code is below cluster locking which singalise suspend/resume to other nodes, so it will cause inconsistent state on cluster.)

Comment 11 hank 2011-06-03 01:35:24 UTC

Reply to Alasdair:

To trigger this bug, need only one pvmove. Just follow Comment 5 can
reproduce it.
When do this step:
pvmove -v /dev/sdb /dev/sdd
And when pvmove update metadata, it will suspend all lvs on sdb/sdd,
they are: Ftest and Ftestdb.
When suspend Ftest, it will also suspend its underlying lv: pvmove0.
pvmove0 is also an underlying device of Ftestdb.
After Ftest and pvmove0 are suspended, and before Ftestdb is suspended,
some io write through Ftestdb to pvmove0, these io will be pending on
Ftestdb.
Then suspend Ftestdb, it need wait until all Ftestdb's in-flight io
complete. But because some io are write through Ftestdb to pvmove0,
and pvmove0 is already been suspended, these io will not complete
forever. Then pvmove hung.

About resuming all the underlying lv forcibly, I agree it is not a
good idea. Maybe we need add two flags.
One for command context, indicate this suspend is called by pvmove for
update metadata, only this kind of suspend trigger resume.
Another flag for temp lv created by pvmove, when do resume, only this
kind of lv will be resume.

Comment 12 hank 2011-06-03 01:37:59 UTC

Reply to Milan:

Thanks for your remind. I don't consider carefully for cluster condition.

I need sometime to consider how to deal with it.

Comment 13 Alasdair Kergon 2011-06-06 10:47:59 UTC

I am working on a fix.

In the meantime, as a workaround, try using the -n option of pvmove to move only one LV at once.

List of LVS in VG:  lvs --noheadings -o name $vg
Move one LV:  pvmove -i0 -n $lvname

Comment 16 Alasdair Kergon 2011-06-17 11:34:20 UTC

I found a series of problems with pvmove which I have fixed in the upstream repository.  However, I would like to wait some time to see that no unintended side-effects show up before seeing the change back-ported to RHEL.  An audit for similar bugs found a problem in lvremove, which I have also fixed.  lvconvert still needs to be audited.

Comment 17 hank 2011-07-01 09:27:42 UTC

The same problem is also on rhel4 and rhel6, so if the problem is fixed, please also back-port to rhel4 and rhel6.

Comment 19 Alasdair Kergon 2011-07-06 17:02:26 UTC

This passed the upstream test suite for the first time last night.  However, due to the complexity of the change and the amount of regression testing I believe it needs, I am not offering this as a Z-stream release, but only releasing it as part of the next scheduled update, viz. 5.8.  In the meantime, I'm afraid the  workaround of running pvmove on one LV at a time is the best I can offer.  That can be scripted, and will not have a significant impact on performance because the same amount of data is still being moved, just in a different order.

Comment 20 Alasdair Kergon 2011-07-08 21:35:48 UTC

Upstream release 2.02.86 include in Fedora rawhide.  Please test.

Comment 23 Milan Broz 2011-10-17 21:36:02 UTC

Fixed in lvm2-2.02.88-1.el5.

Comment 27 Corey Marthaler 2011-11-22 21:22:06 UTC

Fix verified in the latest rpms.

2.6.18-274.el5

lvm2-2.02.88-4.el5    BUILT: Wed Nov 16 09:40:55 CST 2011
lvm2-cluster-2.02.88-4.el5    BUILT: Wed Nov 16 09:46:51 CST 2011
device-mapper-1.02.67-2.el5    BUILT: Mon Oct 17 08:31:56 CDT 2011
device-mapper-event-1.02.67-2.el5    BUILT: Mon Oct 17 08:31:56 CDT 2011
cmirror-1.1.39-10.el5    BUILT: Wed Sep  8 16:32:05 CDT 2010
kmod-cmirror-0.1.22-3.el5    BUILT: Tue Dec 22 13:39:47 CST 2009


[root@grant-01 tmp]# pvcreate /dev/sd[bc][12]
  Writing physical volume data to disk "/dev/sdb1"
  Physical volume "/dev/sdb1" successfully created
  Writing physical volume data to disk "/dev/sdb2"
  Physical volume "/dev/sdb2" successfully created
  Writing physical volume data to disk "/dev/sdc1"
  Physical volume "/dev/sdc1" successfully created
  Writing physical volume data to disk "/dev/sdc2"
  Physical volume "/dev/sdc2" successfully created
[root@grant-01 tmp]# vgcreate -s 64M vgtest  /dev/sdb1 /dev/sdb2 /dev/sdc1 /dev/sdc2
  Volume group "vgtest" successfully created

[root@grant-01 tmp]# lvcreate -n Ftest vgtest -l 16 /dev/sdb1
  Logical volume "Ftest" created
[root@grant-01 tmp]# lvcreate -n Ftestdb vgtest -l 16 /dev/sdb2
  Logical volume "Ftestdb" created
[root@grant-01 tmp]# lvextend -l 31 /dev/vgtest/Ftestdb /dev/sdb1
  Extending logical volume Ftestdb to 1.94 GB
  Logical volume Ftestdb successfully resized
[root@grant-01 tmp]# lvextend -l 24 /dev/vgtest/Ftest /dev/sdb2
  Extending logical volume Ftest to 1.50 GB
  Logical volume Ftest successfully resized
[root@grant-01 tmp]# lvextend -l 32 /dev/vgtest/Ftestdb /dev/sdb2
  Extending logical volume Ftestdb to 2.00 GB
  Logical volume Ftestdb successfully resized

[root@grant-01 tmp]# lvs -a -o +devices
  LV       VG         Attr   LSize  Devices        
  Ftest    vgtest     -wi-a-  1.50G /dev/sdb1(0)
  Ftest    vgtest     -wi-a-  1.50G /dev/sdb2(16)  
  Ftestdb  vgtest     -wi-a-  2.00G /dev/sdb2(0)   
  Ftestdb  vgtest     -wi-a-  2.00G /dev/sdb1(16)  
  Ftestdb  vgtest     -wi-a-  2.00G /dev/sdb2(24)

[root@grant-01 tmp]# mkfs.ext3 /dev/vgtest/Ftest
[root@grant-01 tmp]# mkfs.ext3 /dev/vgtest/Ftestdb
[root@grant-01 tmp]# mkdir /mnt/Ftest
[root@grant-01 tmp]# mkdir /mnt/Ftestdb
[root@grant-01 tmp]# mount /dev/vgtest/Ftest /mnt/Ftest
[root@grant-01 tmp]# mount /dev/vgtest/Ftestdb /mnt/Ftestdb


[root@grant-01 tmp]# fsstress -d /mnt/Ftestdb/fsstress -n 1000000 -l 0 -p 1
seed = 1322167840

[root@grant-01 ~]# pvmove -v /dev/sdb1 /dev/sdc1
    Finding volume group "vgtest"
    Archiving volume group "vgtest" metadata (seqno 6).
    Creating logical volume pvmove0
    Moving 16 extents of logical volume vgtest/Ftest
    Moving 15 extents of logical volume vgtest/Ftestdb
    Found volume group "vgtest"
    activation/volume_list configuration setting not defined, checking only host tags for vgtest/Ftest
    Found volume group "vgtest"
    activation/volume_list configuration setting not defined, checking only host tags for vgtest/Ftestdb
    Updating volume group metadata
    Found volume group "vgtest"
    Found volume group "vgtest"
    Creating vgtest-pvmove0
    Loading vgtest-pvmove0 table (253:4)
    Loading vgtest-Ftest table (253:2)
    Loading vgtest-Ftestdb table (253:3)
    Suspending vgtest-Ftest (253:2) with device flush
    Suspending vgtest-pvmove0 (253:4) with device flush
    Suspending vgtest-Ftestdb (253:3) with device flush
    Found volume group "vgtest"
    Found volume group "vgtest"
    Found volume group "vgtest"
    activation/volume_list configuration setting not defined, checking only host tags for vgtest/pvmove0
    Resuming vgtest-pvmove0 (253:4)
    Found volume group "vgtest"
    Loading vgtest-pvmove0 table (253:4)
    Suppressed vgtest-pvmove0 identical table reload.
    Resuming vgtest-Ftest (253:2)
    Resuming vgtest-Ftestdb (253:3)
    Found volume group "vgtest"
    Creating volume group backup "/etc/lvm/backup/vgtest" (seqno 7).
    Checking progress before waiting every 15 seconds
  /dev/sdb1: Moved: 3.2%
  /dev/sdb1: Moved: 51.6%
    Updating volume group metadata
    Found volume group "vgtest"
    Found volume group "vgtest"
    Loading vgtest-pvmove0 table (253:4)
    Suspending vgtest-pvmove0 (253:4) with device flush
    Found volume group "vgtest"
    Resuming vgtest-pvmove0 (253:4)
    Creating volume group backup "/etc/lvm/backup/vgtest" (seqno 8).
  /dev/sdb1: Moved: 100.0%
    Found volume group "vgtest"
    Found volume group "vgtest"
    Loading vgtest-Ftest table (253:2)
    Loading vgtest-Ftestdb table (253:3)
    Loading vgtest-pvmove0 table (253:4)
    Suspending vgtest-Ftest (253:2) with device flush
    Suspending vgtest-Ftestdb (253:3) with device flush
    Suspending vgtest-pvmove0 (253:4) with device flush
    Found volume group "vgtest"
    Found volume group "vgtest"
    Found volume group "vgtest"
    Resuming vgtest-pvmove0 (253:4)
    Found volume group "vgtest"
    Resuming vgtest-Ftest (253:2)
    Found volume group "vgtest"
    Resuming vgtest-Ftestdb (253:3)
    Found volume group "vgtest"
    Removing vgtest-pvmove0 (253:4)
    Removing temporary pvmove LV
    Writing out final volume group after pvmove
    Creating volume group backup "/etc/lvm/backup/vgtest" (seqno 10).


[root@grant-01 ~]# pvmove -v /dev/sdb2 /dev/sdc2
    Finding volume group "vgtest"
    Archiving volume group "vgtest" metadata (seqno 7).
    Creating logical volume pvmove1
  Skipping locked LV Ftest
  Skipping locked LV Ftestdb
  Skipping mirror LV pvmove0
  All data on source PV skipped. It contains locked, hidden or non-top level LVs only.
  No data to move for vgtest

Comment 28 errata-xmlrpc 2012-02-21 06:04:22 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0161.html

Note You need to log in before you can comment on or make changes to this bug.