Bug 684083

Summary: pvmove stuck waiting for I/O to complete
Product: Red Hat Enterprise Linux 6 Reporter: Lachlan McIlroy <lmcilroy>
Component: lvm2 Assignee: Alasdair Kergon <agk>
Status: CLOSED ERRATA QA Contact: Corey Marthaler <cmarthal>
Severity: medium Docs Contact:
Priority: high    
Version: 6.0 CC: agk, dejohnso, dwysocha, heinzm, jbrassow, mbroz, prajnoha, prockai, pyu, thornber, vgaikwad, zkabelac
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: lvm2-2.02.86-1.el6 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 602516 Environment:
Last Closed: 2011-12-06 16:54:39 UTC Type: ---
Bug Depends On:    
Bug Blocks: 602516, 706036    

Comment 1 Lachlan McIlroy 2011-03-11 04:19:28 UTC
Customer has reproduced this bug on RHEL6.

"We are trying to move the content of a physical disk in a volume group with 20 open logical volumes via the command "pvmove" to another physical disk freshly added to this volume group. To simulate database I/O we started two parallel iozone programs on two different logical volumes of the mentioned 20 logical volumes which are all mounted. We can reproduce that the pvmove command hangs after some time and the two iozone processes are stalled, too.

We would expect that the pvmove command moves the physical volume even when there is some load on the volumes as we often have to move disks when there is a database accessing this volume."

$ grep -E 'Suspend|Resume' pvmove_verbose_1.txt 
#libdm-deptree.c:1077     Suspending TEST1-test.1 (253:22) with device flush
#libdm-deptree.c:1077     Suspending TEST1-test.2 (253:23) with device flush
#libdm-deptree.c:1077     Suspending TEST1-test.3 (253:24) with device flush
#libdm-deptree.c:1077     Suspending TEST1-test.4 (253:25) with device flush
#libdm-deptree.c:1077     Suspending TEST1-test.5 (253:26) with device flush
#libdm-deptree.c:1077     Suspending TEST1-test.6 (253:27) with device flush
#libdm-deptree.c:1077     Suspending TEST1-test.7 (253:28) with device flush
#libdm-deptree.c:1077     Suspending TEST1-test.8 (253:29) with device flush
#libdm-deptree.c:1077     Suspending TEST1-test.9 (253:30) with device flush
#libdm-deptree.c:1077     Suspending TEST1-test.10 (253:31) with device flush
#libdm-deptree.c:1077     Suspending TEST1-test.11 (253:32) with device flush
#libdm-deptree.c:1077     Suspending TEST1-test.12 (253:33) with device flush
#libdm-deptree.c:1077     Suspending TEST1-test.13 (253:34) with device flush
#libdm-deptree.c:1077     Suspending TEST1-test.14 (253:35) with device flush
#libdm-deptree.c:1077     Suspending TEST1-test.15 (253:36) with device flush
#libdm-deptree.c:1077     Suspending TEST1-test.16 (253:37) with device flush
#libdm-deptree.c:1077     Suspending TEST1-test.17 (253:38) with device flush
#libdm-deptree.c:1077     Suspending TEST1-test.18 (253:39) with device flush
#libdm-deptree.c:1077     Suspending TEST1-test.19 (253:40) with device flush
#libdm-deptree.c:1077     Suspending TEST1-test.20 (253:41) with device flush
#libdm-deptree.c:1077     Suspending TEST1-test.1 (253:22) with device flush
#libdm-deptree.c:1077     Suspending TEST1-pvmove0 (253:42) with device flush    <---- pvmove0 suspended
#libdm-deptree.c:1077     Suspending TEST1-test.2 (253:23) with device flush
#libdm-deptree.c:1077     Suspending TEST1-test.3 (253:24) with device flush
#libdm-deptree.c:1077     Suspending TEST1-test.4 (253:25) with device flush
#libdm-deptree.c:1077     Suspending TEST1-test.5 (253:26) with device flush     <---- TEST1-test.5 is waiting for I/O to complete that's stuck in pvmove0
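
A minimal sketch for reading a log like the one above, assuming every suspend line has the form shown ("... Suspending NAME (MAJ:MIN) ..."). The function names last_suspended and count_suspends are made up for illustration and are not part of lvm2. The last "Suspending" line names the device whose suspend had been requested when the log ends, and the per-device counts show how far the second round of suspends got.

```shell
#!/bin/sh
# Helpers for a pvmove verbose log. Hypothetical names; the only assumed
# input format is the "Suspending NAME (MAJ:MIN) ..." line quoted above.

# Print the name and (major:minor) from the last "Suspending" line.
last_suspended() {
    awk '$2 == "Suspending" { last = $3 " " $4 }
         END { if (last != "") print last }' "$@"
}

# Print "COUNT NAME" for each device, most frequently suspended first.
count_suspends() {
    awk '$2 == "Suspending" { n[$3]++ }
         END { for (d in n) print n[d], d }' "$@" |
        sort -k1,1nr -k2,2
}
```

Possible usage: last_suspended pvmove_verbose_1.txt, which should name TEST1-test.5 if the file contains the suspend lines exactly as listed above.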



Mar  8 10:20:57 degtlun1843 kernel: pvmove        S ffff8801a7828800     0 17762  17761 0x00000001
Mar  8 10:20:57 degtlun1843 kernel: ffff88018011dcb8 0000000000000082 0000000000000000 ffff88019258ec00
Mar  8 10:20:57 degtlun1843 kernel: ffff88018011dc38 ffffffff8123b274 ffff88019bc58ec0 0000000103f42701
Mar  8 10:20:57 degtlun1843 kernel: ffff88019d116678 ffff88018011dfd8 0000000000010518 ffff88019d116678
Mar  8 10:20:57 degtlun1843 kernel: Call Trace:
Mar  8 10:20:57 degtlun1843 kernel: [<ffffffff8123b274>] ? blk_unplug+0x34/0x70
Mar  8 10:20:57 degtlun1843 kernel: [<ffffffff814c9533>] io_schedule+0x73/0xc0
Mar  8 10:20:57 degtlun1843 kernel: [<ffffffffa000298b>] dm_wait_for_completion+0x9b/0x100 [dm_mod]
Mar  8 10:20:57 degtlun1843 kernel: [<ffffffff8105c530>] ? default_wake_function+0x0/0x20
Mar  8 10:20:57 degtlun1843 kernel: [<ffffffffa0002af8>] dm_suspend+0x108/0x1f0 [dm_mod]
Mar  8 10:20:57 degtlun1843 kernel: [<ffffffffa00085a6>] dev_suspend+0x76/0x240 [dm_mod]
Mar  8 10:20:57 degtlun1843 kernel: [<ffffffffa0008530>] ? dev_suspend+0x0/0x240 [dm_mod]
Mar  8 10:20:57 degtlun1843 kernel: [<ffffffffa0008fc3>] ctl_ioctl+0x1a3/0x240 [dm_mod]
Mar  8 10:20:57 degtlun1843 kernel: [<ffffffffa0009073>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
Mar  8 10:20:57 degtlun1843 kernel: [<ffffffff8117fa12>] vfs_ioctl+0x22/0xa0
Mar  8 10:20:57 degtlun1843 kernel: [<ffffffff810c711c>] ? utrace_stop+0x12c/0x1e0
Mar  8 10:20:57 degtlun1843 kernel: [<ffffffff8117fbb4>] do_vfs_ioctl+0x84/0x580
Mar  8 10:20:57 degtlun1843 kernel: [<ffffffff810c865e>] ? utrace_report_syscall_entry+0x10e/0x160
Mar  8 10:20:57 degtlun1843 kernel: [<ffffffff81180131>] sys_ioctl+0x81/0xa0
Mar  8 10:20:57 degtlun1843 kernel: [<ffffffff81013387>] tracesys+0xd9/0xde

Comment 3 RHEL Program Management 2011-04-04 02:03:34 UTC
Since RHEL 6.1 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 5 Corey Marthaler 2011-06-03 18:04:24 UTC
Adding QA ack for 6.2.

However, Devel will need to provide unit testing results before this bug can
ultimately be verified by QA.

Comment 6 hank 2011-07-01 09:33:19 UTC
The same bug is reported against RHEL 5:
https://bugzilla.redhat.com/show_bug.cgi?id=706036

Someone is working on a fix.

In comment 13, Alasdair provides a workaround:

In the meantime, as a workaround, try using the -n option of pvmove to move
only one LV at once.

List of LVs in VG:  lvs --noheadings -o name $vg
Move one LV:  pvmove -i0 -n $lvname
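
The two commands above could be combined into a loop along these lines. This is only a sketch under assumptions: the function name move_lvs_one_by_one, the LVS_CMD and PVMOVE_CMD override variables, and the source and destination PV arguments are not part of the quoted workaround and are added here for illustration.

```shell
#!/bin/sh
# Sketch of the workaround: run pvmove once per LV instead of once per PV.
# Hypothetical helper; LVS_CMD and PVMOVE_CMD exist only so the commands
# can be swapped out (for example for a dry run).

move_lvs_one_by_one() {
    vg=$1          # volume group name
    src_pv=$2      # physical volume to move data off (assumed argument)
    dst_pv=${3:-}  # optional destination physical volume (assumed argument)
    status=0

    # LV names in the VG; word splitting strips the leading whitespace
    # that lvs prints before each name.
    lvnames=$(${LVS_CMD:-lvs} --noheadings -o name "$vg") || return 1

    for lvname in $lvnames; do
        # -i0 and -n are the options quoted in the workaround.
        if ! ${PVMOVE_CMD:-pvmove} -i0 -n "$lvname" "$src_pv" \
                ${dst_pv:+"$dst_pv"} </dev/null; then
            echo "pvmove failed for $lvname" >&2
            status=1
        fi
    done
    return $status
}
```

Possible usage, with placeholder device names: move_lvs_one_by_one TEST1 /dev/sdb /dev/sdc. The loop continues after a failed move and returns non-zero at the end, since an LV with no extents on the source PV may make pvmove report an error.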

Comment 7 Alasdair Kergon 2011-07-06 16:57:56 UTC
This passed the upstream test suite for the first time last night.  However, due to the complexity of the change and the amount of regression testing I believe it needs, I am not offering this as a Z-stream release, but only releasing it as part of the next scheduled update, viz. 6.2.  In the meantime, I'm afraid the above workaround is the best I can offer.

Comment 8 Alasdair Kergon 2011-07-08 21:36:10 UTC
Upstream release 2.02.86 is included in Fedora rawhide.  Please test.

Comment 10 Corey Marthaler 2011-10-07 20:17:25 UTC
I added a basic pvmove-during-I/O regression test case. I didn't see any issues while running it on the latest RPMs. Marking this verified (SanityOnly).

SCENARIO - [pvmove_during_io]
Pvmove a volume during active I/O
grant-01: lvcreate -n move_during_io -L 800M mirror_sanity
Starting io to linear to be pvmoved
Attempting pvmove of /dev/sdc6 on grant-01
Deactivating mirror move_during_io... and removing


2.6.32-203.el6.x86_64

lvm2-2.02.87-3.el6    BUILT: Wed Sep 21 09:54:55 CDT 2011
lvm2-libs-2.02.87-3.el6    BUILT: Wed Sep 21 09:54:55 CDT 2011
lvm2-cluster-2.02.87-3.el6    BUILT: Wed Sep 21 09:54:55 CDT 2011
udev-147-2.40.el6    BUILT: Fri Sep 23 07:51:13 CDT 2011
device-mapper-1.02.66-3.el6    BUILT: Wed Sep 21 09:54:55 CDT 2011
device-mapper-libs-1.02.66-3.el6    BUILT: Wed Sep 21 09:54:55 CDT 2011
device-mapper-event-1.02.66-3.el6    BUILT: Wed Sep 21 09:54:55 CDT 2011
device-mapper-event-libs-1.02.66-3.el6    BUILT: Wed Sep 21 09:54:55 CDT 2011
cmirror-2.02.87-3.el6    BUILT: Wed Sep 21 09:54:55 CDT 2011

Comment 11 errata-xmlrpc 2011-12-06 16:54:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1522.html