Bug 602516
Summary:          pvmove stuck waiting for I/O to complete
Product:          Red Hat Enterprise Linux 4
Component:        lvm2
Version:          4.6
Hardware:         All
OS:               Linux
Status:           CLOSED NEXTRELEASE
Severity:         high
Priority:         high
Reporter:         Lachlan McIlroy <lmcilroy>
Assignee:         Alasdair Kergon <agk>
QA Contact:       Cluster QE <mspqa-list>
CC:               agk, dejohnso, dwysocha, heinzm, jbrassow, mbroz, prajnoha, prockai, pyu, samuli.kaski, thornber, vgaikwad
Target Milestone: rc
Doc Type:         Bug Fix
Cloned as:        684083, 706036 (view as bug list)
Bug Depends On:   684083, 706036
Last Closed:      2012-03-22 12:22:56 UTC
Description
Lachlan McIlroy
2010-06-10 03:03:57 UTC
Looking at the vmcore, the pvmove process is blocked attempting to suspend the device - it hasn't run for 13 minutes:

PID: 7373   TASK: 1020e063110   CPU: 0   COMMAND: "pvmove"
 #0 [101e05d5c48] schedule at ffffffff8035fb2c
 #1 [101e05d5cd0] io_schedule at ffffffff80360fcb
 #2 [101e05d5ce0] dm_suspend at ffffffffa00de43d
 #3 [101e05d5d80] dev_suspend at ffffffffa00e11a1
 #4 [101e05d5db0] ctl_ioctl at ffffffffa00e2383
 #5 [101e05d5f40] sys_ioctl at ffffffff801a668a
    RIP: 0000003185dc25d9  RSP: 0000007fbfffb688  RFLAGS: 00000206
    RAX: 0000000000000010  RBX: ffffffff80110a92  RCX: 000000000000000a
    RDX: 000000000066c920  RSI: 00000000c138fd06  RDI: 0000000000000008
    RBP: 0000000000000000  R8:  0000000000000005  R9:  000000000066c950
    R10: 0000003186114700  R11: 0000000000000246  R12: 000000000066ca58
    R13: 0000000000000000  R14: 00000000000000fd  R15: 000000000066c920
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b

I found this oracle process waiting for I/O to complete, and according to the last_ran time it got stuck just before the pvmove process, so it may be of interest:

PID: 7240   TASK: 10222fbf650   CPU: 0   COMMAND: "oracle"
 #0 [10222fc1b38] schedule at ffffffff8035fb2c
 #1 [10222fc1bc0] io_schedule at ffffffff80360fcb
 #2 [10222fc1bd0] wait_on_page_bit at ffffffff801688ae
 #3 [10222fc1c70] wait_on_page_writeback_range at ffffffff80168935
 #4 [10222fc1d40] sync_page_range_nolock at ffffffff80168aa2
 #5 [10222fc1d70] generic_file_aio_write_nolock at ffffffff8016b159
 #6 [10222fc1da0] generic_file_aio_write at ffffffff8016b1eb
 #7 [10222fc1df0] ext3_file_write at ffffffffa010b2f9
 #8 [10222fc1e10] do_sync_write at ffffffff8018f412
 #9 [10222fc1f10] vfs_write at ffffffff8018f50d
#10 [10222fc1f40] sys_pwrite64 at ffffffff8018f6ea
#11 [10222fc1f80] system_call at ffffffff80110a92
    RIP: 000000318660bf75  RSP: 0000007fbfffdda8  RFLAGS: 00010246
    RAX: 0000000000000012  RBX: ffffffff80110a92  RCX: 000000002ae835f7
    RDX: 0000000000000400  RSI: 000000006071ec00  RDI: 0000000000000016
    RBP: 00048753a5be767c  R8:  0000000000092abc  R9:  000000000020e0c2
    R10: 000000000238e000  R11: 0000000000000246  R12: 0000000000000200
    R13: 0000007fbfffd8f0  R14: 0000002a97d56c48  R15: 0000000000000b6f
    ORIG_RAX: 0000000000000012  CS: 0033  SS: 002b

And this oracle process got stuck just after the pvmove process:

PID: 7238   TASK: 10223ba4530   CPU: 0   COMMAND: "oracle"
 #0 [1022302db38] schedule at ffffffff8035fb2c
 #1 [1022302dbc0] io_schedule at ffffffff80360fcb
 #2 [1022302dbd0] wait_on_page_bit at ffffffff801688ae
 #3 [1022302dc70] wait_on_page_writeback_range at ffffffff80168935
 #4 [1022302dd40] sync_page_range_nolock at ffffffff80168aa2
 #5 [1022302dd70] generic_file_aio_write_nolock at ffffffff8016b159
 #6 [1022302dda0] generic_file_aio_write at ffffffff8016b1eb
 #7 [1022302ddf0] ext3_file_write at ffffffffa010b2f9
 #8 [1022302de10] do_sync_write at ffffffff8018f412
 #9 [1022302df10] vfs_write at ffffffff8018f50d
#10 [1022302df40] sys_pwrite64 at ffffffff8018f6ea
#11 [1022302df80] system_call at ffffffff80110a92
    RIP: 000000318660bf75  RSP: 0000007fbfffd808  RFLAGS: 00010212
    RAX: 0000000000000012  RBX: ffffffff80110a92  RCX: 000000008c663000
    RDX: 00000000000a0000  RSI: 000000008c75f000  RDI: 0000000000000014
    RBP: 00048753a5c610a1  R8:  00000000000182a1  R9:  000000000020e4a6
    R10: 0000000029108000  R11: 0000000000000246  R12: 0000000000002000
    R13: 0000007fbfffd750  R14: 0000002a975bf4b0  R15: 000000009758e4c8
    ORIG_RAX: 0000000000000012  CS: 0033  SS: 002b

The pvmove process appears to be stuck in this code in dm_suspend():

	/*
	 * Then we wait for the already mapped ios to
	 * complete.
	 */
	while (1) {
		set_current_state(TASK_INTERRUPTIBLE);

		if (!atomic_read(&md->pending) || signal_pending(current))
			break;

		io_schedule();
	}

The suspend will only continue if pvmove receives a signal or the count of pending I/Os drops to zero.
Let's check both cases:

static inline int signal_pending(struct task_struct *p)
{
	return unlikely(test_tsk_thread_flag(p,TIF_SIGPENDING));
}

crash> task 1020e063110
PID: 7373   TASK: 1020e063110   CPU: 0   COMMAND: "pvmove"
struct task_struct {
  state = 1,
  thread_info = 0x101e05d4000,
  ...

crash> thread_info 0x101e05d4000
struct thread_info {
  task = 0x1020e063110,
  exec_domain = 0xffffffff8043d0c0,
  flags = 0,
  ...

The TIF_SIGPENDING flag is not set, so it hasn't received a signal.

The dm_suspend() function leaves the mapped_device pointer in %rbp, which we can see here:

crash> dis dm_suspend
0xffffffffa00de24f <dm_suspend>:        push   %r14
0xffffffffa00de251 <dm_suspend+2>:      xor    %eax,%eax
0xffffffffa00de253 <dm_suspend+4>:      mov    $0xa,%ecx
0xffffffffa00de258 <dm_suspend+9>:      cld
0xffffffffa00de259 <dm_suspend+10>:     push   %r13
0xffffffffa00de25b <dm_suspend+12>:     mov    %esi,%edx
0xffffffffa00de25d <dm_suspend+14>:     push   %r12
0xffffffffa00de25f <dm_suspend+16>:     push   %rbp
0xffffffffa00de260 <dm_suspend+17>:     mov    %rdi,%rbp
...
0xffffffffa00de41a <dm_suspend+459>:    mov    0xe8(%rbp),%eax   # md->pending
0xffffffffa00de420 <dm_suspend+465>:    test   %eax,%eax
0xffffffffa00de422 <dm_suspend+467>:    je     0xffffffffa00de43f
0xffffffffa00de424 <dm_suspend+469>:    mov    %gs:0x0,%rax
0xffffffffa00de42d <dm_suspend+478>:    mov    0x8(%rax),%rax
0xffffffffa00de431 <dm_suspend+482>:    mov    0x10(%rax),%eax
0xffffffffa00de434 <dm_suspend+485>:    test   $0x4,%al
0xffffffffa00de436 <dm_suspend+487>:    jne    0xffffffffa00de43f
0xffffffffa00de438 <dm_suspend+489>:    callq  0xffffffff80360fbc <io_schedule>

And io_schedule() pushes it onto the stack here:

crash> dis io_schedule
0xffffffff80360fbc <io_schedule>:       push   %rbp
0xffffffff80360fbd <io_schedule+1>:     mov    %rsp,%rbp
0xffffffff80360fc0 <io_schedule+4>:     incl   0x166d1e(%rip)   # 0xffffffff804c7ce4 <per_cpu__runqueues+4676>
0xffffffff80360fc6 <io_schedule+10>:    callq  0xffffffff8035f58b <schedule>
0xffffffff80360fcb <io_schedule+15>:    decl   0x166d13(%rip)   # 0xffffffff804c7ce4 <per_cpu__runqueues+4676>
0xffffffff80360fd1 <io_schedule+21>:    leaveq
0xffffffff80360fd2 <io_schedule+22>:    retq

crash> bt -f 7373
PID: 7373   TASK: 1020e063110   CPU: 0   COMMAND: "pvmove"
 #0 [101e05d5c48] schedule at ffffffff8035fb2c
    101e05d5c50: 0000000000000006 00000101e05d5d28
    101e05d5c60: 0000007300000046 000001020e063110
    101e05d5c70: 00000000000bc4cd 000001022e9d3110
    101e05d5c80: 000001020e0633f8 000001022e2c7200
    101e05d5c90: ffffffffa00dfa35 00000101e05d5cd4
    101e05d5ca0: 000001023c09cc00 0000000000000000
    101e05d5cb0: 00000000ffffffea 0000000000000000
    101e05d5cc0: ffffffffa00e1040 00000101e05d5cd8
    101e05d5cd0: ffffffff80360fcb
 #1 [101e05d5cd0] io_schedule at ffffffff80360fcb
    101e05d5cd8: 000001023f213e00 ffffffffa00de43d
    ...
crash> px *(struct mapped_device *)0x000001023f213e00 $7 = { io_lock = { activity = 0x0, wait_lock = { magic = 0x1d244b3c, lock = 0x0, babble = 0x5, module = 0xffffffff80384ccd "lib/rwsem-spinlock.c", owner = 0xffffffff80384ccd "lib/rwsem-spinlock.c", oline = 0x12e }, wait_list = { next = 0x1023f213e38, prev = 0x1023f213e38 } }, suspend_lock = { count = { counter = 0x0 }, sleepers = 0x0, wait = { lock = { magic = 0x1d244b3c, lock = 0x0, babble = 0xa, module = 0xffffffffa00e5161 "include/linux/wait.h", owner = 0x0, oline = 0x0 }, task_list = { next = 0x1023f213e80, prev = 0x1023f213e80 } } }, pushback_lock = { magic = 0x1d244b3c, lock = 0x0, babble = 0x5, module = 0xffffffffa00e4f96 "drivers/md/dm.c", owner = 0x0, oline = 0x0 }, map_lock = {<No data fields>}, holders = { counter = 0x3 }, open_count = { counter = 0x1 }, flags = 0x1, queue = 0x1023e379420, disk = 0x100bfc97e00, interface_ptr = 0x1023ee29200, pending = { counter = 0x1 }, wait = { lock = { magic = 0x1d244b3c, lock = 0x0, babble = 0xa, module = 0xffffffffa00e5161 "include/linux/wait.h", owner = 0xffffffff8037aba9 "kernel/fork.c", oline = 0x6c }, task_list = { next = 0x101e05d5d40, prev = 0x101e05d5d40 } }, deferred = { head = 0x101e2a69980, tail = 0x1020289a380 }, pushback = { head = 0x0, tail = 0x0 }, map = 0x1023c09cc00, io_pool = 0x1023d338b80, tio_pool = 0x1023d338c80, bs = 0x1023ee29cc0, event_nr = { counter = 0x0 }, eventq = { lock = { magic = 0x1d244b3c, lock = 0x0, babble = 0xa, module = 0xffffffffa00e5161 "include/linux/wait.h", owner = 0x0, oline = 0x0 }, task_list = { next = 0x1023f213fa8, prev = 0x1023f213fa8 } }, frozen_sb = 0x0, suspended_bdev = 0x1023c89a300 } And here we have md->pending == 1 so it's still waiting for a device mapped I/O to complete and call dec_pending()->end_io_acct() which will wake up the pvmove process. So we need to work out what is holding up that last I/O. 
Some analysis from Bryn:

PV table:

pv0  8  16   /dev/sdb  KccCqL-ngfW-l0s9-EJZ0-SsGa-rL0Q-VZbUKe
pv1  8  48   /dev/sdd  jIXNDc-yBd3-dRKg-8m20-qWRV-jyhr-d5DTu4
pv2  8  96   /dev/sdg  iqCYlL-Hl0j-SyCN-XwKF-dxgE-UwXx-VfSU22
pv3  8  128  /dev/sdi  RGoqRm-BZ3H-b1sP-8JJ9-ijgi-gJW2-bV0a0c

According to the dm data in the lvmdump we have two suspended devices:

$ grep '\ L.s.\ ' lvmdump-hklxdv55-2010051831514/dmsetup_info
vgtest-Ftest    253 13 L-sw 3 2 0 LVM-53JBzdDpdYHnBaKBdbaLXmiZ7VdTG606m6TeYUkWcl1U3kfqjIh5zJEhorK53Emr
vgtest-pvmove0  253 15 L-sw 3 3 1 LVM-53JBzdDpdYHnBaKBdbaLXmiZ7VdTG6069vX6BhJ0Z7etvouBH8VZM82I3GnhRXu5

This is consistent with the state recorded in pvmove_2:

root-root@hklxdv55-DR:/$ pvmove -v /dev/sdd /dev/sdi
    Wiping cache of LVM-capable devices
    Finding volume group "vgtest"
    Archiving volume group "vgtest" metadata (seqno 17).
    Creating logical volume pvmove0
    [...]
    Suspending vgtest-Ftest (253:13) without device flush
    Suspending vgtest-pvmove0 (253:15) without device flush
    Found volume group "vgtest"
    Found volume group "vgtest"
    Suspending vgtest-Ftestdb (253:14) without device flush

We can see that the temporary vgtest-pvmove0 LV has been created:

vgtest-pvmove0: 0 1048576 mirror core 1 1024 2 8:48 2097536 8:128 384
vgtest-pvmove0: 1048576 2097152 linear 8:48 384
vgtest-pvmove0: 3145728 131072 linear 8:48 3146112

Looking at the device stack at this point:

vgtest-Ftest: 0 2097152 linear 8:96 384
vgtest-Ftest: 2097152 1048576 linear 253:15 0
vgtest-Ftestdb: 0 2097152 linear 253:15 1048576
vgtest-Ftestdb: 2097152 1966080 linear 8:96 2097536
vgtest-Ftestdb: 4063232 131072 linear 253:15 3145728

vgtest-Ftest is a two-segment volume consisting of one region on /dev/sdg and one region on the temporary mirror:

          +--------------+
          | vgtest-Ftest | [S]
          +--------------+
            /  /    \  \
     +-----+         +----------------+
     | sdg |         | vgtest-pvmove0 | [S]
     +-----+         +----------------+

vgtest-Ftestdb is a three-segment volume consisting of one region on /dev/sdg and two regions on the temporary mirror:

          +----------------+
          | vgtest-Ftestdb | [*]
          +----------------+
           /  /    |  |    \  \
   +----+      +-----+      +----------------+
   |pvm0|      | sdg |      |      pvm0      | [S]
   +----+      +-----+      +----------------+

The pvmove command running on the system is blocked attempting to suspend vgtest-Ftestdb:

kernel: pvmove S 00004aae9fdf7c94 0 25787 24814 (NOTLB)
kernel: 0000010061f4dcc8 0000000000000006 0000010061f4dd28 0000007300000046
kernel: 00000101ad8b9110 00000000000c4342 0000010230b88230 00000101ad8b93f8
kernel: 0000010068967700 ffffffffa00dfa35
kernel: Call Trace:<ffffffffa00dfa35>{:dm_mod:dm_table_unplug_all+49}
kernel: <ffffffffa00e1040>{:dm_mod:dev_suspend+0} <ffffffff80360fcb>{io_schedule+15}
kernel: <ffffffffa00de43d>{:dm_mod:dm_suspend+494} <ffffffff80183b4b>{map_vm_area+447}
kernel: <ffffffff801355dd>{default_wake_function+0} <ffffffff801355dd>{default_wake_function+0}
kernel: <ffffffffa00e11a1>{:dm_mod:dev_suspend+353} <ffffffffa00e2383>{:dm_mod:ctl_ioctl+602}
kernel: <ffffffff801a668a>{sys_ioctl+1006} <ffffffff80110a92>{system_call+126}

The pvmove has made no progress at all between the three sets of call traces (register values are unchanged between the 3 sets). This data also indicates that the first pvmove did indeed complete correctly as reported (all segments have been transferred correctly to /dev/sdg and the updated metadata reflects this).

I tried to reproduce this issue in a VM using four physical volumes (each backed by a 2GB file), combined them into one volume group, created logical volumes over the first two disks and formatted them with ext3. I ran a load on the filesystems (dd and fsstress) while pvmove'ing sda to sdc and sdb to sdd and vice versa. I could not cause pvmove to get stuck, but it did run very slowly.
--- Volume group --- VG Name vgtest System ID Format lvm2 Metadata Areas 4 Metadata Sequence No 11 VG Access read/write VG Status resizable MAX LV 0 Cur LV 2 Open LV 2 Max PV 0 Cur PV 4 Act PV 4 VG Size 7.75 GB PE Size 64.00 MB Total PE 124 Alloc PE / Size 62 / 3.88 GB Free PE / Size 62 / 3.88 GB VG UUID wWsTLq-NprO-E0Wn-OdYa-77Eq-MOtH-MrxxVf --- Logical volume --- LV Name /dev/vgtest/lv0 VG Name vgtest LV UUID 2SIki7-aIKj-yJMd-4zej-YoRQ-SyBF-3R3wKr LV Write Access read/write LV Status available # open 1 LV Size 1.94 GB Current LE 31 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 253:2 --- Logical volume --- LV Name /dev/vgtest/lv1 VG Name vgtest LV UUID iFfX3c-naiq-oPfo-bOxT-G3f8-hj0X-SLgvOG LV Write Access read/write LV Status available # open 1 LV Size 1.94 GB Current LE 31 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 253:3 --- Physical volumes --- PV Name /dev/sda PV UUID f4GWeC-f2zr-1wTm-dzlL-Q0ue-l4OY-SC2CC5 PV Status allocatable Total PE / Free PE 31 / 31 PV Name /dev/sdb PV UUID aPwCGC-X4fl-sMvF-fkiW-PXOS-zGGr-eDyc3o PV Status allocatable Total PE / Free PE 31 / 31 PV Name /dev/sdc PV UUID F1e9Yb-IcHS-Ae3u-OmLB-lR14-ytqg-fEnGV2 PV Status allocatable Total PE / Free PE 31 / 0 PV Name /dev/sdd PV UUID MhmBhu-Vk7q-g75v-mX8G-3lMP-b1RI-YthlbL PV Status allocatable Total PE / Free PE 31 / 0 Continuing the analysis: crash> px *(struct mapped_device *)0x000001023f213e00 $32 = { io_lock = { activity = 0x0, wait_lock = { magic = 0x1d244b3c, lock = 0x0, babble = 0x5, module = 0xffffffff80384ccd "lib/rwsem-spinlock.c", owner = 0xffffffff80384ccd "lib/rwsem-spinlock.c", oline = 0x12e }, wait_list = { next = 0x1023f213e38, prev = 0x1023f213e38 } }, suspend_lock = { count = { counter = 0x0 }, sleepers = 0x0, wait = { lock = { magic = 0x1d244b3c, lock = 0x0, babble = 0xa, module = 0xffffffffa00e5161 "include/linux/wait.h", owner = 0x0, oline = 0x0 }, task_list = { next = 0x1023f213e80, 
prev = 0x1023f213e80 } } }, pushback_lock = { magic = 0x1d244b3c, lock = 0x0, babble = 0x5, module = 0xffffffffa00e4f96 "drivers/md/dm.c", owner = 0x0, oline = 0x0 }, map_lock = {<No data fields>}, holders = { counter = 0x3 }, open_count = { counter = 0x1 }, flags = 0x1, queue = 0x1023e379420, disk = 0x100bfc97e00, interface_ptr = 0x1023ee29200, pending = { counter = 0x1 }, wait = { lock = { magic = 0x1d244b3c, lock = 0x0, babble = 0xa, module = 0xffffffffa00e5161 "include/linux/wait.h", owner = 0xffffffff8037aba9 "kernel/fork.c", oline = 0x6c }, task_list = { next = 0x101e05d5d40, prev = 0x101e05d5d40 } }, deferred = { head = 0x101e2a69980, tail = 0x1020289a380 }, pushback = { head = 0x0, tail = 0x0 }, map = 0x1023c09cc00, io_pool = 0x1023d338b80, tio_pool = 0x1023d338c80, bs = 0x1023ee29cc0, event_nr = { counter = 0x0 }, eventq = { lock = { magic = 0x1d244b3c, lock = 0x0, babble = 0xa, module = 0xffffffffa00e5161 "include/linux/wait.h", owner = 0x0, oline = 0x0 }, task_list = { next = 0x1023f213fa8, prev = 0x1023f213fa8 } }, frozen_sb = 0x0, suspended_bdev = 0x1023c89a300 } pending is 1 flags is DMF_BLOCK_IO (stalling I/Os while trying to suspend) what does the disk queue say? 
crash> gendisk 0x100bfc97e00 struct gendisk { major = 253, first_minor = 14, minors = 1, disk_name = "dm-14\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000", part = 0x0, fops = 0xffffffffa00eef00, queue = 0x1023e379420, private_data = 0x1023f213e00, capacity = 4194304, flags = 16, devfs_name = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000", number = 0, driverfs_dev = 0x0, kobj = { k_name = 0x100bfc97eb0 "dm-14", name = "dm-14\000\000\000\000\000\000\000\000\000\000\000\000\000\000", kref = { refcount = { counter = 3 } }, entry = { next = 0x1023c09ecc8, prev = 0x1023f213cc8 }, parent = 0xffffffff804646e0, kset = 0xffffffff804646c0, ktype = 0x0, dentry = 0x1023e05da20 }, random = 0x1023de20f00, policy = 0, sync_io = { counter = 0 }, stamp = 4296507683, stamp_idle = 4296507683, in_flight = 1, dkstats = { read_sectors = 435498, write_sectors = 5866608, reads = 12972, writes = 733326, read_merges = 0, write_merges = 0, read_ticks = 55488, write_ticks = 1125481, io_ticks = 1096527, time_in_queue = 1943931 } }

in_flight is 1, so that agrees. The device is 253:14 (dm-14). Let's look at the dm map and find out what's below this device.
crash> px *((struct mapped_device *)0x000001023f213e00)->map $34 = { holders = { counter = 0x2 }, depth = 0x1, counts = {0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, index = {0xffffff000003c000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, num_targets = 0x3, num_allocated = 0x10, highs = 0xffffff000003c000, targets = 0xffffff000003c080, mode = 0x3, devices = { next = 0x101f9374ac0, prev = 0x10235bb3e80 }, limits = { max_sectors = 0xffff, max_phys_segments = 0x80, max_hw_segments = 0xff, hardsect_size = 0x200, max_segment_size = 0x10000, seg_boundary_mask = 0xffffffff }, event_fn = 0xffffffffa00dd404 <event_callback>, event_context = 0x1023f213e00, md = 0x1023f213e00 } crash> px ((struct mapped_device *)0x000001023f213e00)->map->targets[0]->type->name $38 = 0xffffffffa00e53c2 "linear" crash> px ((struct mapped_device *)0x000001023f213e00)->map->targets[0]->private $39 = (void *) 0x101f93edfa0 crash> px *((struct linear_c *)0x101f93edfa0)->dev $41 = { list = { next = 0x1023c09cce8, prev = 0x101f9374ac0 }, count = { counter = 0x2 }, mode = 0x3, bdev = 0x101dfa6ca00, name = "253:15\000\000\000\000\000\000\000\000\000" } crash> block_device 0x101dfa6ca00 struct block_device { bd_dev = 265289743, bd_inode = 0x101dfa6cb10, bd_openers = 3, bd_sem = { count = { counter = 1 }, sleepers = 0, wait = { lock = { magic = 488917820, lock = 0, babble = 10, module = 0xffffffff803758bd "include/linux/wait.h", owner = 0x0, oline = 0 }, task_list = { next = 0x101dfa6ca50, prev = 0x101dfa6ca50 } } }, bd_mount_sem = { count = { counter = 1 }, sleepers = 0, wait = { lock = { magic = 488917820, lock = 0, babble = 10, module = 0xffffffff803758bd "include/linux/wait.h", owner = 0x0, oline = 0 }, task_list = { next = 0x101dfa6ca98, prev = 0x101dfa6ca98 } } }, bd_inodes = { next = 0x1022bb68970, prev = 0x1022bb68970 }, bd_holder = 0xffffffffa00e524e, bd_holders = 4, bd_contains = 0x101dfa6ca00, bd_block_size = 4096, bd_part = 
0x0, bd_part_count = 0, bd_invalidated = 0, bd_disk = 0x1023c09ec00, bd_list = { next = 0x10006be7270, prev = 0x1023c1d5570 }, bd_inode_backing_dev_info = 0x0, bd_private = 0 } crash> gendisk 0x1023c09ec00 struct gendisk { major = 253, first_minor = 15, minors = 1, disk_name = "dm-15\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000", part = 0x0, fops = 0xffffffffa00eef00, queue = 0x1023e379810, private_data = 0x1023c09ce00, capacity = 3276800, flags = 16, devfs_name = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000", number = 0, driverfs_dev = 0x0, kobj = { k_name = 0x1023c09ecb0 "dm-15", name = "dm-15\000\000\000\000\000\000\000\000\000\000\000\000\000\000", kref = { refcount = { counter = 3 } }, entry = { next = 0xffffffff804646d0, prev = 0x100bfc97ec8 }, parent = 0xffffffff804646e0, kset = 0xffffffff804646c0, ktype = 0x0, dentry = 0x1020d87a908 }, random = 0x101f93edf80, policy = 0, sync_io = { counter = 0 }, stamp = 4296507683, stamp_idle = 4296507683, in_flight = 0, dkstats = { read_sectors = 40, write_sectors = 504856, reads = 5, writes = 63107, read_merges = 0, write_merges = 0, read_ticks = 79, write_ticks = 55902, io_ticks = 28089, time_in_queue = 55982 } } in_flight is 0. Device is 253:15 (dm-15) so that's another dm device. 
This dm's mapped_device is in private_data:

crash> mapped_device 0x1023c09ce00 struct mapped_device { io_lock = { activity = 0, wait_lock = { magic = 488917820, lock = 0, babble = 5, module = 0xffffffff80384ccd "lib/rwsem-spinlock.c", owner = 0xffffffff80384ccd "lib/rwsem-spinlock.c", oline = 302 }, wait_list = { next = 0x1023c09ce38, prev = 0x1023c09ce38 } }, suspend_lock = { count = { counter = 1 }, sleepers = 0, wait = { lock = { magic = 488917820, lock = 0, babble = 10, module = 0xffffffffa00e5161 "include/linux/wait.h", owner = 0x0, oline = 0 }, task_list = { next = 0x1023c09ce80, prev = 0x1023c09ce80 } } }, pushback_lock = { magic = 488917820, lock = 0, babble = 5, module = 0xffffffffa00e4f96 "drivers/md/dm.c", owner = 0x0, oline = 0 }, map_lock = {<No data fields>}, holders = { counter = 4 }, open_count = { counter = 3 }, flags = 3, queue = 0x1023e379810, disk = 0x1023c09ec00, interface_ptr = 0x1022e2c7f40, pending = { counter = 0 }, wait = { lock = { magic = 488917820, lock = 0, babble = 10, module = 0xffffffffa00e5161 "include/linux/wait.h", owner = 0xffffffff8037aba9 "kernel/fork.c", oline = 131 }, task_list = { next = 0x1023c09cf20, prev = 0x1023c09cf20 } }, deferred = { head = 0x1023f264980, tail = 0x1020aa44800 }, pushback = { head = 0x0, tail = 0x0 }, map = 0x1023d05d000, io_pool = 0x1022eff23c0, tio_pool = 0x1022eff22c0, bs = 0x101f36e4300, event_nr = { counter = 1 }, eventq = { lock = { magic = 488917820, lock = 0, babble = 10, module = 0xffffffffa00e5161 "include/linux/wait.h", owner = 0xffffffff8037a9ff "kernel/sched.c", oline = 2894 }, task_list = { next = 0x1023c09cfa8, prev = 0x1023c09cfa8 } }, frozen_sb = 0x0, suspended_bdev = 0x101dfa6ca00 }

pending is 0. flags is 3, so that's DMF_BLOCK_IO and DMF_SUSPENDED. Aha! The underlying device is suspended. The pvmove is stuck trying to suspend dm-14 because an I/O is blocked by the underlying device dm-15, which is suspended - dm-15 will need to be resumed before dm-14 can continue.
I've managed to reproduce this problem, but I had to mimic the customer's lvm config exactly to get pvmove to hang. Here's what I did:

- Installed a VM with RHEL4.6
- Created 4 x 2GB files on the host and attached them to the VM as SCSI disks

$ pvcreate /dev/sda /dev/sdb /dev/sdc /dev/sdd

Must use a 64MB extent size:

$ vgcreate -s 64M vgtest /dev/sda /dev/sdb /dev/sdc /dev/sdd

This is where it got confusing - the volume Ftest is made up of 2 segments and spans two disks, with 16 extents on the first disk and 8 on the second. The volume Ftestdb is made up of 3 segments and spans the same two disks, with 16 extents on the second disk, then 15 extents on the first, and then 1 extent on the second disk again. The only way I know to create that is with this:

$ lvcreate -n Ftest vgtest -l 16 /dev/sda
$ lvcreate -n Ftestdb vgtest -l 16 /dev/sdb
$ lvextend -l 31 /dev/vgtest/Ftestdb /dev/sda
$ lvextend -l 24 /dev/vgtest/Ftest /dev/sdb
$ lvextend -l 32 /dev/vgtest/Ftestdb /dev/sdb

This is what it all looks like after that.
vgtest-Ftest: 0 2097152 linear 8:0 384 vgtest-Ftest: 2097152 1048576 linear 8:16 384 vgtest-Ftestdb: 0 2097152 linear 8:16 1048960 vgtest-Ftestdb: 2097152 1966080 linear 8:0 2097536 vgtest-Ftestdb: 4063232 131072 linear 8:16 3146112 --- Logical volume --- LV Name /dev/vgtest/Ftest VG Name vgtest LV UUID XaxY0e-TNVq-gwzS-2scb-yA6L-lZ8u-atAnHm LV Write Access read/write LV Status available # open 0 LV Size 1.50 GB Current LE 24 Segments 2 Allocation inherit Read ahead sectors 0 Block device 253:3 --- Logical volume --- LV Name /dev/vgtest/Ftestdb VG Name vgtest LV UUID 2a5AaI-enRN-DdAk-eqCK-JMJ5-Jm4p-0lAzdg LV Write Access read/write LV Status available # open 0 LV Size 2.00 GB Current LE 32 Segments 3 Allocation inherit Read ahead sectors 0 Block device 253:4 --- Volume group --- VG Name vgtest System ID Format lvm2 Metadata Areas 4 Metadata Sequence No 72 VG Access read/write VG Status resizable MAX LV 0 Cur LV 2 Open LV 0 Max PV 0 Cur PV 4 Act PV 4 VG Size 7.75 GB PE Size 64.00 MB Total PE 124 Alloc PE / Size 56 / 3.50 GB Free PE / Size 68 / 4.25 GB VG UUID 42GW3U-1l0X-Sb5I-jD5b-S3Gy-w8gc-LtAlRB --- Physical volume --- PV Name /dev/sda VG Name vgtest PV Size 1.95 GB / not usable 16.00 MB Allocatable yes (but full) PE Size (KByte) 65536 Total PE 31 Free PE 0 Allocated PE 31 PV UUID gCJW0L-Lqrd-snDX-aswE-rLJH-wvlc-I73604 --- Physical volume --- PV Name /dev/sdb VG Name vgtest PV Size 1.95 GB / not usable 16.00 MB Allocatable yes PE Size (KByte) 65536 Total PE 31 Free PE 6 Allocated PE 25 PV UUID cvzwHb-vYqS-uTz0-cclI-P1Nt-N35H-R11i8x --- Physical volume --- PV Name /dev/sdc VG Name vgtest PV Size 1.95 GB / not usable 16.00 MB Allocatable yes PE Size (KByte) 65536 Total PE 31 Free PE 31 Allocated PE 0 PV UUID yDuC5g-Pfla-TUvF-kGk3-zPpg-f4o3-AEFc3p --- Physical volume --- PV Name /dev/sdd VG Name vgtest PV Size 1.95 GB / not usable 16.00 MB Allocatable yes PE Size (KByte) 65536 Total PE 31 Free PE 31 Allocated PE 0 PV UUID DwVjpk-XvvM-Q0q3-0VPJ-lOGu-Ycze-CcDVAd Make 
filesystems and mount both volumes:

$ mkfs.ext3 /dev/vgtest/Ftest
$ mkfs.ext3 /dev/vgtest/Ftestdb
$ mount /dev/vgtest/Ftest /mnt/Ftest
$ mount /dev/vgtest/Ftestdb /mnt/Ftestdb

Add an I/O load to the Ftestdb volume:

$ fsstress -d /mnt/Ftestdb/fsstress -n 1000000 -l 0 -p 1

And fire off pvmove. This one should succeed:

$ pvmove -v /dev/sda /dev/sdc

This one should hang (if it doesn't hang, keep following the steps):

$ pvmove -v /dev/sdb /dev/sdd

This will succeed:

$ pvmove -v /dev/sdc /dev/sda

And this one should hang:

$ pvmove -v /dev/sdd /dev/sdb

The output from pvmove before it hangs shows that it is trying to suspend a device (vgtest-Ftestdb) that sits on top of another already suspended device (vgtest-pvmove0):

Suspending vgtest-Ftest (253:2) without device flush
Suspending vgtest-pvmove0 (253:4) without device flush
Found volume group "vgtest"
Found volume group "vgtest"
Suspending vgtest-Ftestdb (253:3) without device flush

Can you please also attach "lvmdump -m" from the system where you are able to reproduce it?

Created attachment 441128 [details]
lvmdump of reproducer system
Created attachment 479206 [details]
lvmdump of new customer's system experiencing problem
Some information from one customer that might prove helpful. Scenario: a new, bigger volume is added and data is moved onto it via pvmove. At some point pvmove stops in D state. dmsetup info shows suspended devices, and running dmsetup resume <map> on the suspended devices unblocks pvmove so that it continues. This only happens with pvmove.

Manually resuming the devices will clear the problem, but it may also undermine the integrity of pvmove since the devices are being resumed before it expects them to be. This problem is due to pvmove suspending a device (A) that sits on top of another device (B) that is already suspended. Any I/O sent to device B while it is suspended gets queued and won't complete until device B is resumed. For device A to be suspended it must wait for all I/O it has already issued to complete, but device A is waiting for an I/O that's in device B's queue. The fix for this problem will need to be done in pvmove. It will need to avoid the sequence "suspend B, ... suspend A" by either always suspending A first whenever suspending B, or by converting it to "suspend B, ... resume B, suspend A, suspend B".

OK - I may have found a simple way to reproduce this now, and a logic error in the code. Summary: if you are using pvmove to move more than one LV sharing the same underlying PV at once, you are vulnerable to this problem. Simple case: PV1 is used by LV1 and LV2, and you move PV1 to PV2. At the end of moving the first segment, the code suspends the devices in an incorrect order, which allows I/O to become trapped between them.

I was just facing this very same problem. It was in fact the LV used for moving (pvmove) which caused the system to hang. I manually resumed pvmove0 twice; doing so I was able to complete the operation. This was on a 2.6.32 kernel (x86_64) with LVM2 tools 2.02.39.

The same bug is reported in RHEL5, https://bugzilla.redhat.com/show_bug.cgi?id=706036, and someone is working on fixing it.
In comment 13, Alasdair provided a workaround:

In the meantime, as a workaround, try using the -n option of pvmove to move only one LV at once.

List the LVs in a VG:

lvs --noheadings -o name $vg

Move one LV:

pvmove -i0 -n $lvname

So, the patched upstream tools passed the upstream test suite for the first time last night. However, a backport to RHEL4 is looking unlikely at this stage: RHEL5 and RHEL6 are the priority, and there is an adequate workaround on RHEL4.

Upstream release 2.02.86 is included in Fedora rawhide. Please test.

RHEL4 is no longer actively developed; let's track this bug for more recent releases. AFAIK the bug is already fixed there.