+++ This bug was initially created as a clone of Bug #602516 +++

Description of problem:

A customer reported that when they run pvmove while file I/O and database I/O
are active on the LV, the pvmove and that I/O may hang. They have seen this on
their production system, and they set up a test system with the following
structure:

    LV:Ftest     LV:Ftestdb
        |             |
        +------+------+
               |
        +------+------+
        |  VG:vgtest  |
        +-+---+---+--+-+
          |   |   |  |
         sdb sdd sdg sdi

They run the test like this:

1. Create vgtest with 4 PVs.
2. Create LVs "Ftest" and "Ftestdb" on the VG, specifying sdb and sdd as the PVs.
3. Create ext3 filesystems on Ftest and Ftestdb and mount them.
4. Run a file I/O workload on Ftest and an Oracle workload on Ftestdb.
5. pvmove from sdb to sdg: finishes successfully in 5 minutes (see pvmove_1).
6. pvmove from sdd to sdi: appears to hang at this line (see pvmove_2):

       Suspending vgtest-Ftestdb (253:14) without device flush

The customer waited a few hours and it made no progress. Moreover, all I/O on
the LVs stopped and the related processes hung in "D" state. We asked the
customer to check with "iostat -x 1": there was no I/O traffic on the disks,
and Ctrl+C could not interrupt the pvmove process.

Version-Release number of selected component (if applicable):
kernel-2.6.9-67.EL

How reproducible:
Every time the above test is run.

--- Additional comment from lmcilroy on 2010-06-09 23:06:25 EDT ---

$ cat pvmove_1
root-root@hklxdv55-DR:/$pvmove -v /dev/sdb /dev/sdg
    Wiping cache of LVM-capable devices
    Finding volume group "vgtest"
    Archiving volume group "vgtest" metadata (seqno 13).
    Creating logical volume pvmove0
    Moving 16 extents of logical volume vgtest/Ftest
    Moving 15 extents of logical volume vgtest/Ftestdb
    Found volume group "vgtest"
    Found volume group "vgtest"
    Updating volume group metadata
    Creating volume group backup "/etc/lvm/backup/vgtest" (seqno 14).
    Found volume group "vgtest"
    Found volume group "vgtest"
    Suspending vgtest-Ftest (253:13) without device flush
    Found volume group "vgtest"
    Found volume group "vgtest"
    Suspending vgtest-Ftestdb (253:14) without device flush
    Found volume group "vgtest"
    Creating vgtest-pvmove0
    Loading vgtest-pvmove0 table
    Resuming vgtest-pvmove0 (253:15)
    Found volume group "vgtest"
    Loading vgtest-pvmove0 table
    Suppressed vgtest-pvmove0 identical table reload.
    Loading vgtest-Ftest table
    Resuming vgtest-Ftest (253:13)
    Found volume group "vgtest"
    Loading vgtest-pvmove0 table
    Suppressed vgtest-pvmove0 identical table reload.
    Loading vgtest-Ftestdb table
    Resuming vgtest-Ftestdb (253:14)
    Checking progress every 15 seconds
  /dev/sdb: Moved: 16.1%
  /dev/sdb: Moved: 32.3%
  /dev/sdb: Moved: 51.6%
    Updating volume group metadata
    Creating volume group backup "/etc/lvm/backup/vgtest" (seqno 15).
    Found volume group "vgtest"
    Found volume group "vgtest"
    Suspending vgtest-Ftest (253:13) without device flush
    Suspending vgtest-pvmove0 (253:15) without device flush
    Found volume group "vgtest"
    Found volume group "vgtest"
    Suspending vgtest-Ftestdb (253:14) without device flush
    Found volume group "vgtest"
    Found volume group "vgtest"
    Found volume group "vgtest"
    Loading vgtest-pvmove0 table
    Resuming vgtest-pvmove0 (253:15)
    Found volume group "vgtest"
    Loading vgtest-pvmove0 table
    Suppressed vgtest-pvmove0 identical table reload.
    Loading vgtest-Ftest table
    Suppressed vgtest-Ftest identical table reload.
    Resuming vgtest-Ftest (253:13)
    Found volume group "vgtest"
    Loading vgtest-pvmove0 table
    Suppressed vgtest-pvmove0 identical table reload.
    Loading vgtest-Ftestdb table
    Suppressed vgtest-Ftestdb identical table reload.
    Resuming vgtest-Ftestdb (253:14)
  /dev/sdb: Moved: 77.4%
  /dev/sdb: Moved: 96.8%
  /dev/sdb: Moved: 100.0%
    Found volume group "vgtest"
    Found volume group "vgtest"
    Loading vgtest-Ftest table
    Suspending vgtest-Ftest (253:13) without device flush
    Suspending vgtest-pvmove0 (253:15) without device flush
    Found volume group "vgtest"
    Found volume group "vgtest"
    Resuming vgtest-pvmove0 (253:15)
    Loading vgtest-Ftestdb table
    Suspending vgtest-Ftestdb (253:14) without device flush
    Suspending vgtest-pvmove0 (253:15) without device flush
    Found volume group "vgtest"
    Found volume group "vgtest"
    Found volume group "vgtest"
    Resuming vgtest-pvmove0 (253:15)
    Found volume group "vgtest"
    Resuming vgtest-Ftest (253:13)
    Found volume group "vgtest"
    Resuming vgtest-Ftestdb (253:14)
    Found volume group "vgtest"
    Removing vgtest-pvmove0 (253:15)
    Found volume group "vgtest"
    Removing temporary pvmove LV
    Writing out final volume group after pvmove
    Creating volume group backup "/etc/lvm/backup/vgtest" (seqno 17).

$ cat pvmove_2
root-root@hklxdv55-DR:/$pvmove -v /dev/sdd /dev/sdi
    Wiping cache of LVM-capable devices
    Finding volume group "vgtest"
    Archiving volume group "vgtest" metadata (seqno 17).
    Creating logical volume pvmove0
    Moving 8 extents of logical volume vgtest/Ftest
    Moving 17 extents of logical volume vgtest/Ftestdb
    Found volume group "vgtest"
    Found volume group "vgtest"
    Updating volume group metadata
    Creating volume group backup "/etc/lvm/backup/vgtest" (seqno 18).
    Found volume group "vgtest"
    Found volume group "vgtest"
    Suspending vgtest-Ftest (253:13) without device flush
    Found volume group "vgtest"
    Found volume group "vgtest"
    Suspending vgtest-Ftestdb (253:14) without device flush
    Found volume group "vgtest"
    Creating vgtest-pvmove0
    Loading vgtest-pvmove0 table
    Resuming vgtest-pvmove0 (253:15)
    Found volume group "vgtest"
    Loading vgtest-pvmove0 table
    Suppressed vgtest-pvmove0 identical table reload.
    Loading vgtest-Ftest table
    Resuming vgtest-Ftest (253:13)
    Found volume group "vgtest"
    Loading vgtest-pvmove0 table
    Suppressed vgtest-pvmove0 identical table reload.
    Loading vgtest-Ftestdb table
    Resuming vgtest-Ftestdb (253:14)
    Checking progress every 15 seconds
  /dev/sdd: Moved: 16.0%
  /dev/sdd: Moved: 32.0%
    Updating volume group metadata
    Creating volume group backup "/etc/lvm/backup/vgtest" (seqno 19).
    Found volume group "vgtest"
    Found volume group "vgtest"
    Suspending vgtest-Ftest (253:13) without device flush
    Suspending vgtest-pvmove0 (253:15) without device flush
    Found volume group "vgtest"
    Found volume group "vgtest"
    Suspending vgtest-Ftestdb (253:14) without device flush

--- Additional comment from lmcilroy on 2010-06-09 23:15:05 EDT ---

Looking at the vmcore, the pvmove process is blocked attempting to suspend the
device - it hasn't run for 13 minutes:

PID: 7373   TASK: 1020e063110       CPU: 0   COMMAND: "pvmove"
 #0 [101e05d5c48] schedule at ffffffff8035fb2c
 #1 [101e05d5cd0] io_schedule at ffffffff80360fcb
 #2 [101e05d5ce0] dm_suspend at ffffffffa00de43d
 #3 [101e05d5d80] dev_suspend at ffffffffa00e11a1
 #4 [101e05d5db0] ctl_ioctl at ffffffffa00e2383
 #5 [101e05d5f40] sys_ioctl at ffffffff801a668a
    RIP: 0000003185dc25d9  RSP: 0000007fbfffb688  RFLAGS: 00000206
    RAX: 0000000000000010  RBX: ffffffff80110a92  RCX: 000000000000000a
    RDX: 000000000066c920  RSI: 00000000c138fd06  RDI: 0000000000000008
    RBP: 0000000000000000   R8: 0000000000000005   R9: 000000000066c950
    R10: 0000003186114700  R11: 0000000000000246  R12: 000000000066ca58
    R13: 0000000000000000  R14: 00000000000000fd  R15: 000000000066c920
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b

I found this oracle process waiting for I/O to complete; according to its
last_ran time it got stuck just before the pvmove process, so it may be of
interest:

PID: 7240   TASK: 10222fbf650       CPU: 0   COMMAND: "oracle"
 #0 [10222fc1b38] schedule at ffffffff8035fb2c
 #1 [10222fc1bc0] io_schedule at ffffffff80360fcb
 #2 [10222fc1bd0] wait_on_page_bit at ffffffff801688ae
 #3 [10222fc1c70] wait_on_page_writeback_range at ffffffff80168935
 #4 [10222fc1d40] sync_page_range_nolock at ffffffff80168aa2
 #5 [10222fc1d70] generic_file_aio_write_nolock at ffffffff8016b159
 #6 [10222fc1da0] generic_file_aio_write at ffffffff8016b1eb
 #7 [10222fc1df0] ext3_file_write at ffffffffa010b2f9
 #8 [10222fc1e10] do_sync_write at ffffffff8018f412
 #9 [10222fc1f10] vfs_write at ffffffff8018f50d
#10 [10222fc1f40] sys_pwrite64 at ffffffff8018f6ea
#11 [10222fc1f80] system_call at ffffffff80110a92
    RIP: 000000318660bf75  RSP: 0000007fbfffdda8  RFLAGS: 00010246
    RAX: 0000000000000012  RBX: ffffffff80110a92  RCX: 000000002ae835f7
    RDX: 0000000000000400  RSI: 000000006071ec00  RDI: 0000000000000016
    RBP: 00048753a5be767c   R8: 0000000000092abc   R9: 000000000020e0c2
    R10: 000000000238e000  R11: 0000000000000246  R12: 0000000000000200
    R13: 0000007fbfffd8f0  R14: 0000002a97d56c48  R15: 0000000000000b6f
    ORIG_RAX: 0000000000000012  CS: 0033  SS: 002b

And this oracle process got stuck just after the pvmove process:

PID: 7238   TASK: 10223ba4530       CPU: 0   COMMAND: "oracle"
 #0 [1022302db38] schedule at ffffffff8035fb2c
 #1 [1022302dbc0] io_schedule at ffffffff80360fcb
 #2 [1022302dbd0] wait_on_page_bit at ffffffff801688ae
 #3 [1022302dc70] wait_on_page_writeback_range at ffffffff80168935
 #4 [1022302dd40] sync_page_range_nolock at ffffffff80168aa2
 #5 [1022302dd70] generic_file_aio_write_nolock at ffffffff8016b159
 #6 [1022302dda0] generic_file_aio_write at ffffffff8016b1eb
 #7 [1022302ddf0] ext3_file_write at ffffffffa010b2f9
 #8 [1022302de10] do_sync_write at ffffffff8018f412
 #9 [1022302df10] vfs_write at ffffffff8018f50d
#10 [1022302df40] sys_pwrite64 at ffffffff8018f6ea
#11 [1022302df80] system_call at ffffffff80110a92
    RIP: 000000318660bf75  RSP: 0000007fbfffd808  RFLAGS: 00010212
    RAX: 0000000000000012  RBX: ffffffff80110a92  RCX: 000000008c663000
    RDX: 00000000000a0000  RSI: 000000008c75f000  RDI: 0000000000000014
    RBP: 00048753a5c610a1   R8: 00000000000182a1   R9: 000000000020e4a6
    R10: 0000000029108000  R11: 0000000000000246  R12: 0000000000002000
    R13: 0000007fbfffd750  R14: 0000002a975bf4b0  R15: 000000009758e4c8
    ORIG_RAX: 0000000000000012  CS: 0033  SS: 002b

The pvmove process appears to be stuck in this code in dm_suspend():

	/*
	 * Then we wait for the already mapped ios to
	 * complete.
	 */
	while (1) {
		set_current_state(TASK_INTERRUPTIBLE);

		if (!atomic_read(&md->pending) || signal_pending(current))
			break;

		io_schedule();
	}

The suspend will only continue if pvmove receives a signal or the count of
pending I/Os drops to zero. Let's check both cases:

static inline int signal_pending(struct task_struct *p)
{
	return unlikely(test_tsk_thread_flag(p, TIF_SIGPENDING));
}

crash> task 1020e063110
PID: 7373   TASK: 1020e063110       CPU: 0   COMMAND: "pvmove"
struct task_struct {
  state = 1,
  thread_info = 0x101e05d4000,
  ...

crash> thread_info 0x101e05d4000
struct thread_info {
  task = 0x1020e063110,
  exec_domain = 0xffffffff8043d0c0,
  flags = 0,
  ...

The TIF_SIGPENDING flag is not set, so it hasn't received a signal.

The dm_suspend() function leaves the mapped_device pointer in %rbp, which we
can see here:

crash> dis dm_suspend
0xffffffffa00de24f <dm_suspend>:        push   %r14
0xffffffffa00de251 <dm_suspend+2>:      xor    %eax,%eax
0xffffffffa00de253 <dm_suspend+4>:      mov    $0xa,%ecx
0xffffffffa00de258 <dm_suspend+9>:      cld
0xffffffffa00de259 <dm_suspend+10>:     push   %r13
0xffffffffa00de25b <dm_suspend+12>:     mov    %esi,%edx
0xffffffffa00de25d <dm_suspend+14>:     push   %r12
0xffffffffa00de25f <dm_suspend+16>:     push   %rbp
0xffffffffa00de260 <dm_suspend+17>:     mov    %rdi,%rbp
...
0xffffffffa00de41a <dm_suspend+459>:    mov    0xe8(%rbp),%eax   # md->pending
0xffffffffa00de420 <dm_suspend+465>:    test   %eax,%eax
0xffffffffa00de422 <dm_suspend+467>:    je     0xffffffffa00de43f
0xffffffffa00de424 <dm_suspend+469>:    mov    %gs:0x0,%rax
0xffffffffa00de42d <dm_suspend+478>:    mov    0x8(%rax),%rax
0xffffffffa00de431 <dm_suspend+482>:    mov    0x10(%rax),%eax
0xffffffffa00de434 <dm_suspend+485>:    test   $0x4,%al
0xffffffffa00de436 <dm_suspend+487>:    jne    0xffffffffa00de43f
0xffffffffa00de438 <dm_suspend+489>:    callq  0xffffffff80360fbc <io_schedule>

And io_schedule() pushes it onto the stack here:

crash> dis io_schedule
0xffffffff80360fbc <io_schedule>:       push   %rbp
0xffffffff80360fbd <io_schedule+1>:     mov    %rsp,%rbp
0xffffffff80360fc0 <io_schedule+4>:     incl   0x166d1e(%rip)   # 0xffffffff804c7ce4 <per_cpu__runqueues+4676>
0xffffffff80360fc6 <io_schedule+10>:    callq  0xffffffff8035f58b <schedule>
0xffffffff80360fcb <io_schedule+15>:    decl   0x166d13(%rip)   # 0xffffffff804c7ce4 <per_cpu__runqueues+4676>
0xffffffff80360fd1 <io_schedule+21>:    leaveq
0xffffffff80360fd2 <io_schedule+22>:    retq

crash> bt -f 7373
PID: 7373   TASK: 1020e063110       CPU: 0   COMMAND: "pvmove"
 #0 [101e05d5c48] schedule at ffffffff8035fb2c
    101e05d5c50: 0000000000000006 00000101e05d5d28
    101e05d5c60: 0000007300000046 000001020e063110
    101e05d5c70: 00000000000bc4cd 000001022e9d3110
    101e05d5c80: 000001020e0633f8 000001022e2c7200
    101e05d5c90: ffffffffa00dfa35 00000101e05d5cd4
    101e05d5ca0: 000001023c09cc00 0000000000000000
    101e05d5cb0: 00000000ffffffea 0000000000000000
    101e05d5cc0: ffffffffa00e1040 00000101e05d5cd8
    101e05d5cd0: ffffffff80360fcb
 #1 [101e05d5cd0] io_schedule at ffffffff80360fcb
    101e05d5cd8: 000001023f213e00 ffffffffa00de43d
    ...
crash> px *(struct mapped_device *)0x000001023f213e00
$7 = {
  io_lock = {
    activity = 0x0,
    wait_lock = {magic = 0x1d244b3c, lock = 0x0, babble = 0x5,
      module = 0xffffffff80384ccd "lib/rwsem-spinlock.c",
      owner = 0xffffffff80384ccd "lib/rwsem-spinlock.c", oline = 0x12e},
    wait_list = {next = 0x1023f213e38, prev = 0x1023f213e38}
  },
  suspend_lock = {
    count = {counter = 0x0},
    sleepers = 0x0,
    wait = {
      lock = {magic = 0x1d244b3c, lock = 0x0, babble = 0xa,
        module = 0xffffffffa00e5161 "include/linux/wait.h", owner = 0x0, oline = 0x0},
      task_list = {next = 0x1023f213e80, prev = 0x1023f213e80}
    }
  },
  pushback_lock = {magic = 0x1d244b3c, lock = 0x0, babble = 0x5,
    module = 0xffffffffa00e4f96 "drivers/md/dm.c", owner = 0x0, oline = 0x0},
  map_lock = {<No data fields>},
  holders = {counter = 0x3},
  open_count = {counter = 0x1},
  flags = 0x1,
  queue = 0x1023e379420,
  disk = 0x100bfc97e00,
  interface_ptr = 0x1023ee29200,
  pending = {counter = 0x1},
  wait = {
    lock = {magic = 0x1d244b3c, lock = 0x0, babble = 0xa,
      module = 0xffffffffa00e5161 "include/linux/wait.h",
      owner = 0xffffffff8037aba9 "kernel/fork.c", oline = 0x6c},
    task_list = {next = 0x101e05d5d40, prev = 0x101e05d5d40}
  },
  deferred = {head = 0x101e2a69980, tail = 0x1020289a380},
  pushback = {head = 0x0, tail = 0x0},
  map = 0x1023c09cc00,
  io_pool = 0x1023d338b80,
  tio_pool = 0x1023d338c80,
  bs = 0x1023ee29cc0,
  event_nr = {counter = 0x0},
  eventq = {
    lock = {magic = 0x1d244b3c, lock = 0x0, babble = 0xa,
      module = 0xffffffffa00e5161 "include/linux/wait.h", owner = 0x0, oline = 0x0},
    task_list = {next = 0x1023f213fa8, prev = 0x1023f213fa8}
  },
  frozen_sb = 0x0,
  suspended_bdev = 0x1023c89a300
}

And here we have md->pending == 1, so it's still waiting for a device mapped
I/O to complete and call dec_pending()->end_io_acct(), which will wake up the
pvmove process. So we need to work out what is holding up that last I/O.

--- Additional comment from lmcilroy on 2010-06-09 23:51:22 EDT ---

Some analysis from Bryn:

PV table:

pv0  8  16   /dev/sdb  KccCqL-ngfW-l0s9-EJZ0-SsGa-rL0Q-VZbUKe
pv1  8  48   /dev/sdd  jIXNDc-yBd3-dRKg-8m20-qWRV-jyhr-d5DTu4
pv2  8  96   /dev/sdg  iqCYlL-Hl0j-SyCN-XwKF-dxgE-UwXx-VfSU22
pv3  8  128  /dev/sdi  RGoqRm-BZ3H-b1sP-8JJ9-ijgi-gJW2-bV0a0c

According to the dm data in the lvmdump we have two suspended devices:

$ grep '\ L.s.\ ' lvmdump-hklxdv55-2010051831514/dmsetup_info
vgtest-Ftest    253 13 L-sw 3 2 0 LVM-53JBzdDpdYHnBaKBdbaLXmiZ7VdTG606m6TeYUkWcl1U3kfqjIh5zJEhorK53Emr
vgtest-pvmove0  253 15 L-sw 3 3 1 LVM-53JBzdDpdYHnBaKBdbaLXmiZ7VdTG6069vX6BhJ0Z7etvouBH8VZM82I3GnhRXu5

This is consistent with the state recorded in pvmove_2:

root-root@hklxdv55-DR:/$pvmove -v /dev/sdd /dev/sdi
    Wiping cache of LVM-capable devices
    Finding volume group "vgtest"
    Archiving volume group "vgtest" metadata (seqno 17).
    Creating logical volume pvmove0
[...]
    Suspending vgtest-Ftest (253:13) without device flush
    Suspending vgtest-pvmove0 (253:15) without device flush
    Found volume group "vgtest"
    Found volume group "vgtest"
    Suspending vgtest-Ftestdb (253:14) without device flush

We can see that the temporary vgtest-pvmove0 LV has been created:

vgtest-pvmove0: 0 1048576 mirror core 1 1024 2 8:48 2097536 8:128 384
vgtest-pvmove0: 1048576 2097152 linear 8:48 384
vgtest-pvmove0: 3145728 131072 linear 8:48 3146112

Looking at the device stack at this point:

vgtest-Ftest: 0 2097152 linear 8:96 384
vgtest-Ftest: 2097152 1048576 linear 253:15 0
vgtest-Ftestdb: 0 2097152 linear 253:15 1048576
vgtest-Ftestdb: 2097152 1966080 linear 8:96 2097536
vgtest-Ftestdb: 4063232 131072 linear 253:15 3145728

vgtest-Ftest is a two-segment volume consisting of one region on /dev/sdg and
one region on the temporary mirror:

              +--------------+
              | vgtest-Ftest |  [S]
              +--------------+
                /          \
          +-----+    +----------------+
          | sdg |    | vgtest-pvmove0 |  [S]
          +-----+    +----------------+

vgtest-Ftestdb is a three-segment volume consisting of one region on /dev/sdg
and two regions on the temporary mirror:

             +----------------+
             | vgtest-Ftestdb |  [*]
             +----------------+
               /      |      \
        +------+   +-----+   +------+
        | pvm0 |   | sdg |   | pvm0 |  [S]
        +------+   +-----+   +------+

The pvmove command running on the system is blocked attempting to suspend
vgtest-Ftestdb:

kernel: pvmove        S 00004aae9fdf7c94     0 25787  24814 (NOTLB)
kernel: 0000010061f4dcc8 0000000000000006 0000010061f4dd28 0000007300000046
kernel: 00000101ad8b9110 00000000000c4342 0000010230b88230 00000101ad8b93f8
kernel: 0000010068967700 ffffffffa00dfa35
kernel: Call Trace:<ffffffffa00dfa35>{:dm_mod:dm_table_unplug_all+49}
kernel: <ffffffffa00e1040>{:dm_mod:dev_suspend+0} <ffffffff80360fcb>{io_schedule+15}
kernel: <ffffffffa00de43d>{:dm_mod:dm_suspend+494} <ffffffff80183b4b>{map_vm_area+447}
kernel: <ffffffff801355dd>{default_wake_function+0} <ffffffff801355dd>{default_wake_function+0}
kernel: <ffffffffa00e11a1>{:dm_mod:dev_suspend+353} <ffffffffa00e2383>{:dm_mod:ctl_ioctl+602}
kernel: <ffffffff801a668a>{sys_ioctl+1006} <ffffffff80110a92>{system_call+126}

The pvmove has made no progress at all between the three sets of call traces
(the register values are unchanged between the three sets). This data also
indicates that the first pvmove did indeed complete correctly as reported (all
segments have been transferred correctly to /dev/sdg and the updated metadata
reflects this).

--- Additional comment from lmcilroy on 2010-06-10 02:38:04 EDT ---

Tried to reproduce this issue in a VM using four physical volumes (each backed
by a 2GB file), combined them into one volume group, created logical volumes
over the first two disks and formatted them with ext3. I ran a load on the
filesystems (dd and fsstress) while pvmove'ing sda to sdc and sdb to sdd and
vice versa. I could not cause pvmove to get stuck but it did run very slowly.
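For reference, the load could have been generated along these lines (a sketch
only: the fsstress invocation matches the one quoted later in this report,
while the dd command and the /mnt/lv0 and /mnt/lv1 mount points are
illustrative assumptions, since the exact commands were not recorded):

# Illustrative load generator: one sequential writer plus fsstress,
# each on its own mounted LV, both in the background.
dd if=/dev/zero of=/mnt/lv0/bigfile bs=1M count=1024 &
fsstress -d /mnt/lv1/fsstress -n 1000000 -l 0 -p 1 &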
--- Volume group ---
VG Name               vgtest
System ID
Format                lvm2
Metadata Areas        4
Metadata Sequence No  11
VG Access             read/write
VG Status             resizable
MAX LV                0
Cur LV                2
Open LV               2
Max PV                0
Cur PV                4
Act PV                4
VG Size               7.75 GB
PE Size               64.00 MB
Total PE              124
Alloc PE / Size       62 / 3.88 GB
Free  PE / Size       62 / 3.88 GB
VG UUID               wWsTLq-NprO-E0Wn-OdYa-77Eq-MOtH-MrxxVf

--- Logical volume ---
LV Name                /dev/vgtest/lv0
VG Name                vgtest
LV UUID                2SIki7-aIKj-yJMd-4zej-YoRQ-SyBF-3R3wKr
LV Write Access        read/write
LV Status              available
# open                 1
LV Size                1.94 GB
Current LE             31
Segments               1
Allocation             inherit
Read ahead sectors     auto
- currently set to     256
Block device           253:2

--- Logical volume ---
LV Name                /dev/vgtest/lv1
VG Name                vgtest
LV UUID                iFfX3c-naiq-oPfo-bOxT-G3f8-hj0X-SLgvOG
LV Write Access        read/write
LV Status              available
# open                 1
LV Size                1.94 GB
Current LE             31
Segments               1
Allocation             inherit
Read ahead sectors     auto
- currently set to     256
Block device           253:3

--- Physical volumes ---
PV Name               /dev/sda
PV UUID               f4GWeC-f2zr-1wTm-dzlL-Q0ue-l4OY-SC2CC5
PV Status             allocatable
Total PE / Free PE    31 / 31

PV Name               /dev/sdb
PV UUID               aPwCGC-X4fl-sMvF-fkiW-PXOS-zGGr-eDyc3o
PV Status             allocatable
Total PE / Free PE    31 / 31

PV Name               /dev/sdc
PV UUID               F1e9Yb-IcHS-Ae3u-OmLB-lR14-ytqg-fEnGV2
PV Status             allocatable
Total PE / Free PE    31 / 0

PV Name               /dev/sdd
PV UUID               MhmBhu-Vk7q-g75v-mX8G-3lMP-b1RI-YthlbL
PV Status             allocatable
Total PE / Free PE    31 / 0

--- Additional comment from lmcilroy on 2010-06-28 02:02:39 EDT ---

Continuing the analysis:

crash> px *(struct mapped_device *)0x000001023f213e00
$32 = {
  io_lock = {
    activity = 0x0,
    wait_lock = {magic = 0x1d244b3c, lock = 0x0, babble = 0x5,
      module = 0xffffffff80384ccd "lib/rwsem-spinlock.c",
      owner = 0xffffffff80384ccd "lib/rwsem-spinlock.c", oline = 0x12e},
    wait_list = {next = 0x1023f213e38, prev = 0x1023f213e38}
  },
  suspend_lock = {
    count = {counter = 0x0},
    sleepers = 0x0,
    wait = {
      lock = {magic = 0x1d244b3c, lock = 0x0, babble = 0xa,
        module = 0xffffffffa00e5161 "include/linux/wait.h", owner = 0x0, oline = 0x0},
      task_list = {next = 0x1023f213e80, prev = 0x1023f213e80}
    }
  },
  pushback_lock = {magic = 0x1d244b3c, lock = 0x0, babble = 0x5,
    module = 0xffffffffa00e4f96 "drivers/md/dm.c", owner = 0x0, oline = 0x0},
  map_lock = {<No data fields>},
  holders = {counter = 0x3},
  open_count = {counter = 0x1},
  flags = 0x1,
  queue = 0x1023e379420,
  disk = 0x100bfc97e00,
  interface_ptr = 0x1023ee29200,
  pending = {counter = 0x1},
  wait = {
    lock = {magic = 0x1d244b3c, lock = 0x0, babble = 0xa,
      module = 0xffffffffa00e5161 "include/linux/wait.h",
      owner = 0xffffffff8037aba9 "kernel/fork.c", oline = 0x6c},
    task_list = {next = 0x101e05d5d40, prev = 0x101e05d5d40}
  },
  deferred = {head = 0x101e2a69980, tail = 0x1020289a380},
  pushback = {head = 0x0, tail = 0x0},
  map = 0x1023c09cc00,
  io_pool = 0x1023d338b80,
  tio_pool = 0x1023d338c80,
  bs = 0x1023ee29cc0,
  event_nr = {counter = 0x0},
  eventq = {
    lock = {magic = 0x1d244b3c, lock = 0x0, babble = 0xa,
      module = 0xffffffffa00e5161 "include/linux/wait.h", owner = 0x0, oline = 0x0},
    task_list = {next = 0x1023f213fa8, prev = 0x1023f213fa8}
  },
  frozen_sb = 0x0,
  suspended_bdev = 0x1023c89a300
}

pending is 1. flags is DMF_BLOCK_IO (stalling I/Os while trying to suspend).

What does the disk queue say?
crash> gendisk 0x100bfc97e00
struct gendisk {
  major = 253,
  first_minor = 14,
  minors = 1,
  disk_name = "dm-14\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
  part = 0x0,
  fops = 0xffffffffa00eef00,
  queue = 0x1023e379420,
  private_data = 0x1023f213e00,
  capacity = 4194304,
  flags = 16,
  devfs_name = "\000\000\000\000...",
  number = 0,
  driverfs_dev = 0x0,
  kobj = {
    k_name = 0x100bfc97eb0 "dm-14",
    name = "dm-14\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
    kref = {refcount = {counter = 3}},
    entry = {next = 0x1023c09ecc8, prev = 0x1023f213cc8},
    parent = 0xffffffff804646e0,
    kset = 0xffffffff804646c0,
    ktype = 0x0,
    dentry = 0x1023e05da20
  },
  random = 0x1023de20f00,
  policy = 0,
  sync_io = {counter = 0},
  stamp = 4296507683,
  stamp_idle = 4296507683,
  in_flight = 1,
  dkstats = {read_sectors = 435498, write_sectors = 5866608, reads = 12972,
    writes = 733326, read_merges = 0, write_merges = 0, read_ticks = 55488,
    write_ticks = 1125481, io_ticks = 1096527, time_in_queue = 1943931}
}

in_flight is 1, so that agrees. The device is 253:14 (dm-14).

Let's look at the dm map and find out what's below this device.

crash> px *((struct mapped_device *)0x000001023f213e00)->map
$34 = {
  holders = {counter = 0x2},
  depth = 0x1,
  counts = {0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
  index = {0xffffff000003c000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
  num_targets = 0x3,
  num_allocated = 0x10,
  highs = 0xffffff000003c000,
  targets = 0xffffff000003c080,
  mode = 0x3,
  devices = {next = 0x101f9374ac0, prev = 0x10235bb3e80},
  limits = {max_sectors = 0xffff, max_phys_segments = 0x80, max_hw_segments = 0xff,
    hardsect_size = 0x200, max_segment_size = 0x10000, seg_boundary_mask = 0xffffffff},
  event_fn = 0xffffffffa00dd404 <event_callback>,
  event_context = 0x1023f213e00,
  md = 0x1023f213e00
}

crash> px ((struct mapped_device *)0x000001023f213e00)->map->targets[0]->type->name
$38 = 0xffffffffa00e53c2 "linear"

crash> px ((struct mapped_device *)0x000001023f213e00)->map->targets[0]->private
$39 = (void *) 0x101f93edfa0

crash> px *((struct linear_c *)0x101f93edfa0)->dev
$41 = {
  list = {next = 0x1023c09cce8, prev = 0x101f9374ac0},
  count = {counter = 0x2},
  mode = 0x3,
  bdev = 0x101dfa6ca00,
  name = "253:15\000\000\000\000\000\000\000\000\000"
}

crash> block_device 0x101dfa6ca00
struct block_device {
  bd_dev = 265289743,
  bd_inode = 0x101dfa6cb10,
  bd_openers = 3,
  bd_sem = {
    count = {counter = 1},
    sleepers = 0,
    wait = {
      lock = {magic = 488917820, lock = 0, babble = 10,
        module = 0xffffffff803758bd "include/linux/wait.h", owner = 0x0, oline = 0},
      task_list = {next = 0x101dfa6ca50, prev = 0x101dfa6ca50}
    }
  },
  bd_mount_sem = {
    count = {counter = 1},
    sleepers = 0,
    wait = {
      lock = {magic = 488917820, lock = 0, babble = 10,
        module = 0xffffffff803758bd "include/linux/wait.h", owner = 0x0, oline = 0},
      task_list = {next = 0x101dfa6ca98, prev = 0x101dfa6ca98}
    }
  },
  bd_inodes = {next = 0x1022bb68970, prev = 0x1022bb68970},
  bd_holder = 0xffffffffa00e524e,
  bd_holders = 4,
  bd_contains = 0x101dfa6ca00,
  bd_block_size = 4096,
  bd_part = 0x0,
  bd_part_count = 0,
  bd_invalidated = 0,
  bd_disk = 0x1023c09ec00,
  bd_list = {next = 0x10006be7270, prev = 0x1023c1d5570},
  bd_inode_backing_dev_info = 0x0,
  bd_private = 0
}

crash> gendisk 0x1023c09ec00
struct gendisk {
  major = 253,
  first_minor = 15,
  minors = 1,
  disk_name = "dm-15\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
  part = 0x0,
  fops = 0xffffffffa00eef00,
  queue = 0x1023e379810,
  private_data = 0x1023c09ce00,
  capacity = 3276800,
  flags = 16,
  devfs_name = "\000\000\000\000...",
  number = 0,
  driverfs_dev = 0x0,
  kobj = {
    k_name = 0x1023c09ecb0 "dm-15",
    name = "dm-15\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
    kref = {refcount = {counter = 3}},
    entry = {next = 0xffffffff804646d0, prev = 0x100bfc97ec8},
    parent = 0xffffffff804646e0,
    kset = 0xffffffff804646c0,
    ktype = 0x0,
    dentry = 0x1020d87a908
  },
  random = 0x101f93edf80,
  policy = 0,
  sync_io = {counter = 0},
  stamp = 4296507683,
  stamp_idle = 4296507683,
  in_flight = 0,
  dkstats = {read_sectors = 40, write_sectors = 504856, reads = 5,
    writes = 63107, read_merges = 0, write_merges = 0, read_ticks = 79,
    write_ticks = 55902, io_ticks = 28089, time_in_queue = 55982}
}

in_flight is 0. The device is 253:15 (dm-15), so that's another dm device.
This dm's mapped_device is in private_data:

crash> mapped_device 0x1023c09ce00
struct mapped_device {
  io_lock = {
    activity = 0,
    wait_lock = {magic = 488917820, lock = 0, babble = 5,
      module = 0xffffffff80384ccd "lib/rwsem-spinlock.c",
      owner = 0xffffffff80384ccd "lib/rwsem-spinlock.c", oline = 302},
    wait_list = {next = 0x1023c09ce38, prev = 0x1023c09ce38}
  },
  suspend_lock = {
    count = {counter = 1},
    sleepers = 0,
    wait = {
      lock = {magic = 488917820, lock = 0, babble = 10,
        module = 0xffffffffa00e5161 "include/linux/wait.h", owner = 0x0, oline = 0},
      task_list = {next = 0x1023c09ce80, prev = 0x1023c09ce80}
    }
  },
  pushback_lock = {magic = 488917820, lock = 0, babble = 5,
    module = 0xffffffffa00e4f96 "drivers/md/dm.c", owner = 0x0, oline = 0},
  map_lock = {<No data fields>},
  holders = {counter = 4},
  open_count = {counter = 3},
  flags = 3,
  queue = 0x1023e379810,
  disk = 0x1023c09ec00,
  interface_ptr = 0x1022e2c7f40,
  pending = {counter = 0},
  wait = {
    lock = {magic = 488917820, lock = 0, babble = 10,
      module = 0xffffffffa00e5161 "include/linux/wait.h",
      owner = 0xffffffff8037aba9 "kernel/fork.c", oline = 131},
    task_list = {next = 0x1023c09cf20, prev = 0x1023c09cf20}
  },
  deferred = {head = 0x1023f264980, tail = 0x1020aa44800},
  pushback = {head = 0x0, tail = 0x0},
  map = 0x1023d05d000,
  io_pool = 0x1022eff23c0,
  tio_pool = 0x1022eff22c0,
  bs = 0x101f36e4300,
  event_nr = {counter = 1},
  eventq = {
    lock = {magic = 488917820, lock = 0, babble = 10,
      module = 0xffffffffa00e5161 "include/linux/wait.h",
      owner = 0xffffffff8037a9ff "kernel/sched.c", oline = 2894},
    task_list = {next = 0x1023c09cfa8, prev = 0x1023c09cfa8}
  },
  frozen_sb = 0x0,
  suspended_bdev = 0x101dfa6ca00
}

pending is 0. flags is 3, which is DMF_BLOCK_IO and DMF_SUSPENDED. Aha! The
underlying device is suspended. The pvmove is stuck trying to suspend dm-14
because an I/O is blocked by the underlying device dm-15, which is suspended -
dm-15 will need to be resumed before dm-14 can continue.

--- Additional comment from lmcilroy on 2010-08-26 03:02:05 EDT ---

I've managed to reproduce this problem, but I had to mimic the customer's lvm
config exactly to get pvmove to hang.
Here's what I did:

- Installed a VM with RHEL4.6
- Created 4 x 2GB files on the host and attached them to the VM as SCSI disks

$ pvcreate /dev/sda /dev/sdb /dev/sdc /dev/sdd

Must use a 64MB extent size:

$ vgcreate -s 64M vgtest /dev/sda /dev/sdb /dev/sdc /dev/sdd

This is where it got confusing - the volume Ftest is made up of 2 segments and
spans two disks, with 16 extents on the first disk and 8 on the second. The
volume Ftestdb is made up of 3 segments and spans the same two disks, with 16
extents on the second disk, then 15 extents on the first, and then 1 extent on
the second disk again. The only way I know to create that is with this:

$ lvcreate -n Ftest vgtest -l 16 /dev/sda
$ lvcreate -n Ftestdb vgtest -l 16 /dev/sdb
$ lvextend -l 31 /dev/vgtest/Ftestdb /dev/sda
$ lvextend -l 24 /dev/vgtest/Ftest /dev/sdb
$ lvextend -l 32 /dev/vgtest/Ftestdb /dev/sdb

This is what it all looks like after that:

vgtest-Ftest: 0 2097152 linear 8:0 384
vgtest-Ftest: 2097152 1048576 linear 8:16 384
vgtest-Ftestdb: 0 2097152 linear 8:16 1048960
vgtest-Ftestdb: 2097152 1966080 linear 8:0 2097536
vgtest-Ftestdb: 4063232 131072 linear 8:16 3146112

--- Logical volume ---
LV Name                /dev/vgtest/Ftest
VG Name                vgtest
LV UUID                XaxY0e-TNVq-gwzS-2scb-yA6L-lZ8u-atAnHm
LV Write Access        read/write
LV Status              available
# open                 0
LV Size                1.50 GB
Current LE             24
Segments               2
Allocation             inherit
Read ahead sectors     0
Block device           253:3

--- Logical volume ---
LV Name                /dev/vgtest/Ftestdb
VG Name                vgtest
LV UUID                2a5AaI-enRN-DdAk-eqCK-JMJ5-Jm4p-0lAzdg
LV Write Access        read/write
LV Status              available
# open                 0
LV Size                2.00 GB
Current LE             32
Segments               3
Allocation             inherit
Read ahead sectors     0
Block device           253:4

--- Volume group ---
VG Name               vgtest
System ID
Format                lvm2
Metadata Areas        4
Metadata Sequence No  72
VG Access             read/write
VG Status             resizable
MAX LV                0
Cur LV                2
Open LV               0
Max PV                0
Cur PV                4
Act PV                4
VG Size               7.75 GB
PE Size               64.00 MB
Total PE              124
Alloc PE / Size       56 / 3.50 GB
Free  PE / Size       68 / 4.25 GB
VG UUID               42GW3U-1l0X-Sb5I-jD5b-S3Gy-w8gc-LtAlRB

--- Physical volume ---
PV Name               /dev/sda
VG Name               vgtest
PV Size               1.95 GB / not usable 16.00 MB
Allocatable           yes (but full)
PE Size (KByte)       65536
Total PE              31
Free PE               0
Allocated PE          31
PV UUID               gCJW0L-Lqrd-snDX-aswE-rLJH-wvlc-I73604

--- Physical volume ---
PV Name               /dev/sdb
VG Name               vgtest
PV Size               1.95 GB / not usable 16.00 MB
Allocatable           yes
PE Size (KByte)       65536
Total PE              31
Free PE               6
Allocated PE          25
PV UUID               cvzwHb-vYqS-uTz0-cclI-P1Nt-N35H-R11i8x

--- Physical volume ---
PV Name               /dev/sdc
VG Name               vgtest
PV Size               1.95 GB / not usable 16.00 MB
Allocatable           yes
PE Size (KByte)       65536
Total PE              31
Free PE               31
Allocated PE          0
PV UUID               yDuC5g-Pfla-TUvF-kGk3-zPpg-f4o3-AEFc3p

--- Physical volume ---
PV Name               /dev/sdd
VG Name               vgtest
PV Size               1.95 GB / not usable 16.00 MB
Allocatable           yes
PE Size (KByte)       65536
Total PE              31
Free PE               31
Allocated PE          0
PV UUID               DwVjpk-XvvM-Q0q3-0VPJ-lOGu-Ycze-CcDVAd

Make filesystems and mount both volumes:

$ mkfs.ext3 /dev/vgtest/Ftest
$ mkfs.ext3 /dev/vgtest/Ftestdb
$ mount /dev/vgtest/Ftest /mnt/Ftest
$ mount /dev/vgtest/Ftestdb /mnt/Ftestdb

Add an I/O load to the Ftestdb volume:

$ fsstress -d /mnt/Ftestdb/fsstress -n 1000000 -l 0 -p 1

And fire off pvmove. This one should succeed:

$ pvmove -v /dev/sda /dev/sdc

This one should hang (if it doesn't hang, keep following the steps):

$ pvmove -v /dev/sdb /dev/sdd

This will succeed:

$ pvmove -v /dev/sdc /dev/sda

And this one should hang:

$ pvmove -v /dev/sdd /dev/sdb

The output from pvmove before it hangs shows that it is trying to suspend a
device (vgtest-Ftestdb) that sits on top of another, already suspended device
(vgtest-pvmove0):

    Suspending vgtest-Ftest (253:2) without device flush
    Suspending vgtest-pvmove0 (253:4) without device flush
    Found volume group "vgtest"
    Found volume group "vgtest"
    Suspending vgtest-Ftestdb (253:3) without device flush

--- Additional comment from dejohnso on 2011-02-16 15:28:36 EST ---

Information that might prove helpful, noticed by one customer.

Scenario: a new, bigger volume is added and data is moved onto it via pvmove.
At some point pvmove stops in D state. "dmsetup info" shows suspended devices,
and running "dmsetup resume <map>" on the suspended devices unblocks pvmove so
that it continues. This only happens with pvmove.

--- Additional comment from lmcilroy on 2011-02-16 19:29:39 EST ---

Manually resuming the devices will clear the problem, but it may also
undermine the integrity of pvmove since the devices are being resumed before
it expects them to be.

This problem is due to pvmove suspending a device (A) that sits on top of
another device (B) that is already suspended. Any I/O sent to device B while
it is suspended gets queued and won't complete until device B is resumed. For
device A to be suspended it must wait for all I/O it has already issued to
complete, but device A is waiting for an I/O that's in device B's queue.

The fix for this problem will need to be done in pvmove. It will need to avoid
the sequence "suspend B, ... suspend A", either by always suspending A first
whenever suspending B, or by converting it to "suspend B, ... resume B,
suspend A, suspend B".
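For reference, the manual unblocking described in the comment above can be
expressed as a short script (a sketch only; as noted, force-resuming devices
under a live pvmove risks its integrity, and the comments below point out that
it is not cluster-safe either):

# Find every dm device whose state is SUSPENDED and resume it (risky!).
dmsetup info | awk '/^Name:/ {name = $2} /^State:/ && /SUSPENDED/ {print name}' |
while read dev; do
    dmsetup resume "$dev"
done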
Created attachment 500986 [details]
a patch to solve the pvmove hang

Summary of the problem: while pvmove is running, the LV tree looks like this:

           +--------------+
           | vgtest-Ftest |
           +--------------+
             /          \
       +-----+      +------+
       | sdg |      | pvm0 |
       +-----+      +------+

          +----------------+
          | vgtest-Ftestdb |
          +----------------+
            /      |      \
     +------+   +-----+   +------+
     | pvm0 |   | sdg |   | pvm0 |
     +------+   +-----+   +------+

pvmove updates the metadata periodically. When it does, it suspends all the
LVs:

1. It suspends Ftest and Ftest's underlying LV, pvm0.
2. It suspends Ftestdb (pvm0 is also an underlying LV of Ftestdb, but since it
   has already been suspended, the program skips it).

After step 1, if an I/O is written to pvm0 through Ftestdb, it will not
complete because pvm0 is already suspended. So there is an in-flight I/O on
Ftestdb, and it will stay in flight until pvm0 is resumed. Then, in step 2,
pvmove tries to suspend Ftestdb. Before suspending it, the kernel waits until
all in-flight I/O completes. Because pvm0 is already suspended, I/O queued on
it will never complete, so there is always an I/O in flight and the pvmove
program hangs.

To solve this problem, the patch resumes all the underlying LVs before trying
to suspend an LV.
Is there potentially more needed to reproduce this? I've been unable to
reproduce a hang while running fs I/O during multiple pvmoves.

2.6.18-261.el5

lvm2-2.02.84-3.el5                 BUILT: Wed Apr 27 03:42:24 CDT 2011
lvm2-cluster-2.02.84-3.el5         BUILT: Wed Apr 27 03:42:43 CDT 2011
device-mapper-1.02.63-3.el5        BUILT: Thu May 19 08:09:22 CDT 2011
device-mapper-event-1.02.63-3.el5  BUILT: Thu May 19 08:09:22 CDT 2011
cmirror-1.1.39-10.el5              BUILT: Wed Sep  8 16:32:05 CDT 2010
kmod-cmirror-0.1.22-3.el5          BUILT: Tue Dec 22 13:39:47 CST 2009
(In reply to comment #3)
> Is there potentially more needed to reproduce this? I've been unable to
> reproduce a hang while running fs I/O during multiple pvmoves.

Corey, do you have multiple logical volumes sharing the same physical disk? I
think that's a requirement in order to reproduce the bug.
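One quick way to check this (a sketch; the vgtest name is from this report's
reproducer) is to list each LV's segments together with its backing devices
and look for two LVs whose segments land on the same PV, as Ftest and Ftestdb
do in the reproducer quoted below:

$ lvs -a -o +devices vgtest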
(In reply to comment #3)
> Is there potentially more needed to reproduce this? I've been unable to
> reproduce a hang while running fs I/O during multiple pvmoves.
>
> 2.6.18-261.el5
>
> lvm2-2.02.84-3.el5 BUILT: Wed Apr 27 03:42:24 CDT 2011
> lvm2-cluster-2.02.84-3.el5 BUILT: Wed Apr 27 03:42:43 CDT 2011
> device-mapper-1.02.63-3.el5 BUILT: Thu May 19 08:09:22 CDT 2011
> device-mapper-event-1.02.63-3.el5 BUILT: Thu May 19 08:09:22 CDT 2011
> cmirror-1.1.39-10.el5 BUILT: Wed Sep 8 16:32:05 CDT 2010
> kmod-cmirror-0.1.22-3.el5 BUILT: Tue Dec 22 13:39:47 CST 2009

This bug is cloned from the RHEL4 one. In the original bug, Lachlan provided a
method to reproduce it:

- Installed a VM with RHEL4.6
- Created 4 x 2GB files on the host and attached them to the VM as SCSI disks

$ pvcreate /dev/sda /dev/sdb /dev/sdc /dev/sdd

Must use a 64MB extent size:

$ vgcreate -s 64M vgtest /dev/sda /dev/sdb /dev/sdc /dev/sdd

This is where it got confusing - the volume Ftest is made up of 2 segments and
spans two disks, with 16 extents on the first disk and 8 on the second. The
volume Ftestdb is made up of 3 segments and spans the same two disks, with 16
extents on the second disk, then 15 extents on the first, and then 1 extent on
the second disk again. The only way I know to create that is with this:

$ lvcreate -n Ftest vgtest -l 16 /dev/sda
$ lvcreate -n Ftestdb vgtest -l 16 /dev/sdb
$ lvextend -l 31 /dev/vgtest/Ftestdb /dev/sda
$ lvextend -l 24 /dev/vgtest/Ftest /dev/sdb
$ lvextend -l 32 /dev/vgtest/Ftestdb /dev/sdb

......

Make filesystems and mount both volumes:

$ mkfs.ext3 /dev/vgtest/Ftest
$ mkfs.ext3 /dev/vgtest/Ftestdb
$ mount /dev/vgtest/Ftest /mnt/Ftest
$ mount /dev/vgtest/Ftestdb /mnt/Ftestdb

Add an I/O load to the Ftestdb volume:

$ fsstress -d /mnt/Ftestdb/fsstress -n 1000000 -l 0 -p 1

And fire off pvmove. This one should succeed:

$ pvmove -v /dev/sda /dev/sdc

This one should hang (if it doesn't hang, keep following the steps):

$ pvmove -v /dev/sdb /dev/sdd

This will succeed:

$ pvmove -v /dev/sdc /dev/sda

And this one should hang:

$ pvmove -v /dev/sdd /dev/sdb
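For convenience, the quoted steps can be bundled into a single script (a
sketch under the same assumptions as above: the same four disks, the mount
points created beforehand, and fsstress killed manually once pvmove hangs):

# One-shot reproducer assembled from the steps quoted above.
pvcreate /dev/sda /dev/sdb /dev/sdc /dev/sdd
vgcreate -s 64M vgtest /dev/sda /dev/sdb /dev/sdc /dev/sdd
lvcreate -n Ftest vgtest -l 16 /dev/sda
lvcreate -n Ftestdb vgtest -l 16 /dev/sdb
lvextend -l 31 /dev/vgtest/Ftestdb /dev/sda
lvextend -l 24 /dev/vgtest/Ftest /dev/sdb
lvextend -l 32 /dev/vgtest/Ftestdb /dev/sdb
mkfs.ext3 /dev/vgtest/Ftest
mkfs.ext3 /dev/vgtest/Ftestdb
mkdir -p /mnt/Ftest /mnt/Ftestdb
mount /dev/vgtest/Ftest /mnt/Ftest
mount /dev/vgtest/Ftestdb /mnt/Ftestdb
fsstress -d /mnt/Ftestdb/fsstress -n 1000000 -l 0 -p 1 &
pvmove -v /dev/sda /dev/sdc    # should succeed
pvmove -v /dev/sdb /dev/sdd    # should hang; if not, continue:
pvmove -v /dev/sdc /dev/sda    # should succeed
pvmove -v /dev/sdd /dev/sdb    # should hang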
Lots of information to plough through here, but it's not obvious to me yet: are you attempting to run two pvmoves simultaneously? Or waiting for one to finish before starting the next?
In other words, what is the *simplest* setup you have that recreates the problem?
(And if they don't have to run simultaneously, is it necessary to run two pvmoves to see the problem or can it be seen in some other way by just running one pvmove? The write-up here looks rather confused.)
Regarding the attached patch: If the devices are suspended, that's been done for a reason, and resuming them forcibly to release trapped i/o isn't a good idea. Better to make sure no i/o can become trapped in the first place.
(Also, always think about how this works in a cluster - you cannot force a
resume on just the local node. And this code sits below the cluster locking
that signals suspend/resume to the other nodes, so it would cause an
inconsistent state across the cluster.)
Reply to Alasdair:

Only one pvmove is needed to trigger this bug; just following comment 5
reproduces it. During this step:

$ pvmove -v /dev/sdb /dev/sdd

when pvmove updates the metadata, it suspends all the LVs on sdb/sdd, namely
Ftest and Ftestdb. When it suspends Ftest, it also suspends Ftest's underlying
LV, pvmove0. pvmove0 is also an underlying device of Ftestdb.

After Ftest and pvmove0 are suspended, and before Ftestdb is suspended, some
I/O is written through Ftestdb to pvmove0; that I/O remains pending on
Ftestdb. Then, when suspending Ftestdb, the kernel has to wait until all of
Ftestdb's in-flight I/O completes. But because that I/O was written through
Ftestdb to pvmove0, and pvmove0 is already suspended, it will never complete.
So pvmove hangs.

About resuming all the underlying LVs forcibly: I agree it is not a good idea.
Maybe we need to add two flags. One for the command context, indicating that
the suspend was issued by pvmove to update the metadata; only this kind of
suspend would trigger a resume. Another flag for the temporary LV created by
pvmove, so that on resume only this kind of LV is resumed.
Reply to Milan:

Thanks for the reminder. I had not considered the cluster case carefully. I
need some time to work out how to deal with it.
I am working on a fix. In the meantime, as a workaround, try using the -n
option of pvmove to move only one LV at once.

List the LVs in the VG:

  lvs --noheadings -o name $vg

Move one LV:

  pvmove -i0 -n $lvname
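Scripted, the workaround might look like this (a minimal sketch, assuming
bash; $vg, $src_pv and the optional $dst_pv stand for the volume group, the
source PV and the destination PV):

# Move extents off $src_pv one LV at a time, so pvmove never has to
# suspend more than one LV tree per metadata update.
for lv in $(lvs --noheadings -o name "$vg"); do
    pvmove -i0 -n "$lv" "$src_pv" "$dst_pv"
done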
I found a series of problems with pvmove which I have fixed in the upstream repository. However, I would like to wait some time to see that no unintended side-effects show up before seeing the change back-ported to RHEL. An audit for similar bugs found a problem in lvremove, which I have also fixed. lvconvert still needs to be audited.
The same problem also exists on RHEL4 and RHEL6, so once the fix is ready,
please back-port it to RHEL4 and RHEL6 as well.
This passed the upstream test suite for the first time last night. However, due to the complexity of the change and the amount of regression testing I believe it needs, I am not offering this as a Z-stream release, but only releasing it as part of the next scheduled update, viz. 5.8. In the meantime, I'm afraid the workaround of running pvmove on one LV at a time is the best I can offer. That can be scripted, and will not have a significant impact on performance because the same amount of data is still being moved, just in a different order.
Upstream release 2.02.86 is included in Fedora rawhide. Please test.
Fixed in lvm2-2.02.88-1.el5.
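Before retesting, a quick sanity check that the fixed build is installed (a
sketch; the expected NVR is the one from the comment above):

$ rpm -q lvm2    # expect lvm2-2.02.88-1.el5 or later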
Fix verified in the latest rpms.

2.6.18-274.el5

lvm2-2.02.88-4.el5                 BUILT: Wed Nov 16 09:40:55 CST 2011
lvm2-cluster-2.02.88-4.el5         BUILT: Wed Nov 16 09:46:51 CST 2011
device-mapper-1.02.67-2.el5        BUILT: Mon Oct 17 08:31:56 CDT 2011
device-mapper-event-1.02.67-2.el5  BUILT: Mon Oct 17 08:31:56 CDT 2011
cmirror-1.1.39-10.el5              BUILT: Wed Sep  8 16:32:05 CDT 2010
kmod-cmirror-0.1.22-3.el5          BUILT: Tue Dec 22 13:39:47 CST 2009

[root@grant-01 tmp]# pvcreate /dev/sd[bc][12]
  Writing physical volume data to disk "/dev/sdb1"
  Physical volume "/dev/sdb1" successfully created
  Writing physical volume data to disk "/dev/sdb2"
  Physical volume "/dev/sdb2" successfully created
  Writing physical volume data to disk "/dev/sdc1"
  Physical volume "/dev/sdc1" successfully created
  Writing physical volume data to disk "/dev/sdc2"
  Physical volume "/dev/sdc2" successfully created
[root@grant-01 tmp]# vgcreate -s 64M vgtest /dev/sdb1 /dev/sdb2 /dev/sdc1 /dev/sdc2
  Volume group "vgtest" successfully created
[root@grant-01 tmp]# lvcreate -n Ftest vgtest -l 16 /dev/sdb1
  Logical volume "Ftest" created
[root@grant-01 tmp]# lvcreate -n Ftestdb vgtest -l 16 /dev/sdb2
  Logical volume "Ftestdb" created
[root@grant-01 tmp]# lvextend -l 31 /dev/vgtest/Ftestdb /dev/sdb1
  Extending logical volume Ftestdb to 1.94 GB
  Logical volume Ftestdb successfully resized
[root@grant-01 tmp]# lvextend -l 24 /dev/vgtest/Ftest /dev/sdb2
  Extending logical volume Ftest to 1.50 GB
  Logical volume Ftest successfully resized
[root@grant-01 tmp]# lvextend -l 32 /dev/vgtest/Ftestdb /dev/sdb2
  Extending logical volume Ftestdb to 2.00 GB
  Logical volume Ftestdb successfully resized
[root@grant-01 tmp]# lvs -a -o +devices
  LV      VG     Attr   LSize Devices
  Ftest   vgtest -wi-a- 1.50G /dev/sdb1(0)
  Ftest   vgtest -wi-a- 1.50G /dev/sdb2(16)
  Ftestdb vgtest -wi-a- 2.00G /dev/sdb2(0)
  Ftestdb vgtest -wi-a- 2.00G /dev/sdb1(16)
  Ftestdb vgtest -wi-a- 2.00G /dev/sdb2(24)
[root@grant-01 tmp]# mkfs.ext3 /dev/vgtest/Ftest
[root@grant-01 tmp]# mkfs.ext3 /dev/vgtest/Ftestdb
[root@grant-01 tmp]# mkdir /mnt/Ftest
[root@grant-01 tmp]# mkdir /mnt/Ftestdb
[root@grant-01 tmp]# mount /dev/vgtest/Ftest /mnt/Ftest
[root@grant-01 tmp]# mount /dev/vgtest/Ftestdb /mnt/Ftestdb
[root@grant-01 tmp]# fsstress -d /mnt/Ftestdb/fsstress -n 1000000 -l 0 -p 1
seed = 1322167840

[root@grant-01 ~]# pvmove -v /dev/sdb1 /dev/sdc1
    Finding volume group "vgtest"
    Archiving volume group "vgtest" metadata (seqno 6).
    Creating logical volume pvmove0
    Moving 16 extents of logical volume vgtest/Ftest
    Moving 15 extents of logical volume vgtest/Ftestdb
    Found volume group "vgtest"
    activation/volume_list configuration setting not defined, checking only host tags for vgtest/Ftest
    Found volume group "vgtest"
    activation/volume_list configuration setting not defined, checking only host tags for vgtest/Ftestdb
    Updating volume group metadata
    Found volume group "vgtest"
    Found volume group "vgtest"
    Creating vgtest-pvmove0
    Loading vgtest-pvmove0 table (253:4)
    Loading vgtest-Ftest table (253:2)
    Loading vgtest-Ftestdb table (253:3)
    Suspending vgtest-Ftest (253:2) with device flush
    Suspending vgtest-pvmove0 (253:4) with device flush
    Suspending vgtest-Ftestdb (253:3) with device flush
    Found volume group "vgtest"
    Found volume group "vgtest"
    Found volume group "vgtest"
    activation/volume_list configuration setting not defined, checking only host tags for vgtest/pvmove0
    Resuming vgtest-pvmove0 (253:4)
    Found volume group "vgtest"
    Loading vgtest-pvmove0 table (253:4)
    Suppressed vgtest-pvmove0 identical table reload.
    Resuming vgtest-Ftest (253:2)
    Resuming vgtest-Ftestdb (253:3)
    Found volume group "vgtest"
    Creating volume group backup "/etc/lvm/backup/vgtest" (seqno 7).
    Checking progress before waiting every 15 seconds
  /dev/sdb1: Moved: 3.2%
  /dev/sdb1: Moved: 51.6%
    Updating volume group metadata
    Found volume group "vgtest"
    Found volume group "vgtest"
    Loading vgtest-pvmove0 table (253:4)
    Suspending vgtest-pvmove0 (253:4) with device flush
    Found volume group "vgtest"
    Resuming vgtest-pvmove0 (253:4)
    Creating volume group backup "/etc/lvm/backup/vgtest" (seqno 8).
  /dev/sdb1: Moved: 100.0%
    Found volume group "vgtest"
    Found volume group "vgtest"
    Loading vgtest-Ftest table (253:2)
    Loading vgtest-Ftestdb table (253:3)
    Loading vgtest-pvmove0 table (253:4)
    Suspending vgtest-Ftest (253:2) with device flush
    Suspending vgtest-Ftestdb (253:3) with device flush
    Suspending vgtest-pvmove0 (253:4) with device flush
    Found volume group "vgtest"
    Found volume group "vgtest"
    Found volume group "vgtest"
    Resuming vgtest-pvmove0 (253:4)
    Found volume group "vgtest"
    Resuming vgtest-Ftest (253:2)
    Found volume group "vgtest"
    Resuming vgtest-Ftestdb (253:3)
    Found volume group "vgtest"
    Removing vgtest-pvmove0 (253:4)
    Removing temporary pvmove LV
    Writing out final volume group after pvmove
    Creating volume group backup "/etc/lvm/backup/vgtest" (seqno 10).
[root@grant-01 ~]# pvmove -v /dev/sdb2 /dev/sdc2
    Finding volume group "vgtest"
    Archiving volume group "vgtest" metadata (seqno 7).
    Creating logical volume pvmove1
    Skipping locked LV Ftest
    Skipping locked LV Ftestdb
    Skipping mirror LV pvmove0
  All data on source PV skipped. It contains locked, hidden or non-top level LVs only.
  No data to move for vgtest
Since the problem described in this bug report should be resolved in a recent
advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow
the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0161.html