Bug 706036
Summary: pvmove stuck waiting for I/O to complete
Product: Red Hat Enterprise Linux 5
Component: lvm2
Version: 5.6
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Priority: high
Severity: high
Target Milestone: rc
Reporter: hank <pyu>
Assignee: Alasdair Kergon <agk>
QA Contact: Corey Marthaler <cmarthal>
CC: agk, bmr, dejohnso, dwysocha, esammons, heinzm, jbrassow, lmcilroy, mbroz, nmurray, prajnoha, prockai, pyu, rdassen, thornber, walter, zkabelac
Fixed In Version: lvm2-2.02.88-1.el5
Doc Type: Bug Fix
Clone Of: 602516
Clones: 1020385 (view as bug list)
Last Closed: 2012-02-21 06:04:22 UTC
Bug Depends On: 684083
Bug Blocks: 602516, 668957
Description
hank
2011-05-19 09:13:57 UTC

Created attachment 500986 [details]
a patch to solve the pvmove hang problem
Summary of the problem:

When doing pvmove, the LV tree is as below:

          +--------------+
          | vgtest-Ftest |
          +--------------+
             /        \
        +-----+    +------+
        | sdg |    | pvm0 |
        +-----+    +------+

         +----------------+
         | vgtest-Ftestdb |
         +----------------+
          /       |       \
    +------+   +-----+   +------+
    | pvm0 |   | sdg |   | pvm0 |
    +------+   +-----+   +------+
pvmove updates the metadata periodically. While doing so, it suspends all the affected LVs:

1. It suspends Ftest and its underlying LV, pvm0.
2. It suspends Ftestdb (pvm0 is also one of Ftestdb's underlying LVs, but since it has already been suspended, the program skips it).

After step 1, if an I/O is written to pvm0 through Ftestdb, it cannot be serviced because pvm0 is already suspended. So there is an in-flight I/O on Ftestdb, and it stays in flight until pvm0 is resumed.

Then in step 2, pvmove tries to suspend Ftestdb. Before suspending it, the kernel waits until all in-flight I/O completes. Because pvm0 is already suspended, I/O queued on it never completes, so there is always an I/O in flight and pvmove hangs.

To solve this problem, my patch resumes all the underlying LVs before trying to suspend an LV.
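The failing order of operations can be sketched as follows. This is only a simulation of the sequence described above: `suspend` and `is_suspended` are hypothetical stub functions that track state in a shell variable, standing in for the real device-mapper suspend, so the ordering problem can be demonstrated without any LVM devices.

```shell
#!/bin/sh
# Hypothetical stubs: track which devices are "suspended" in a variable.
suspended=""
suspend() { suspended="$suspended $1"; }
is_suspended() { case " $suspended " in *" $1 "*) return 0 ;; *) return 1 ;; esac; }

# Step 1: pvmove suspends Ftest and its underlying pvmove LV.
suspend vgtest-Ftest
suspend vgtest-pvmove0

# I/O submitted through Ftestdb now lands on the suspended pvmove0 and is trapped:
if is_suspended vgtest-pvmove0; then
    io_state="stuck-in-flight"
fi

# Step 2 would wait for that in-flight I/O before suspending Ftestdb; it can
# never complete, which is where the real pvmove hangs.
echo "Ftestdb I/O: $io_state"
```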
Is there potentially more needed to reproduce this? I've been unable to reproduce a hang while running fs I/O during multiple pvmoves.

2.6.18-261.el5

lvm2-2.02.84-3.el5                 BUILT: Wed Apr 27 03:42:24 CDT 2011
lvm2-cluster-2.02.84-3.el5         BUILT: Wed Apr 27 03:42:43 CDT 2011
device-mapper-1.02.63-3.el5        BUILT: Thu May 19 08:09:22 CDT 2011
device-mapper-event-1.02.63-3.el5  BUILT: Thu May 19 08:09:22 CDT 2011
cmirror-1.1.39-10.el5              BUILT: Wed Sep 8 16:32:05 CDT 2010
kmod-cmirror-0.1.22-3.el5          BUILT: Tue Dec 22 13:39:47 CST 2009

(In reply to comment #3)
> Is there potentially more needed to reproduce this? I've been unable to
> reproduce a hang while running fs I/O during multiple pvmoves.

Corey, do you have multiple logical volumes sharing the same physical disk? I think that's a requirement in order to reproduce the bug.

(In reply to comment #3)
> Is there potentially more needed to reproduce this? I've been unable to
> reproduce a hang while running fs I/O during multiple pvmoves.

This bug is cloned from RHEL 4. In the original bug, Lachlan provided a method to reproduce it:

- Installed a VM with RHEL 4.6
- Created 4 x 2GB files on the host and attached them to the VM as SCSI disks

$ pvcreate /dev/sda /dev/sdb /dev/sdc /dev/sdd

Must use a 64MB extent size:

$ vgcreate -s 64M vgtest /dev/sda /dev/sdb /dev/sdc /dev/sdd

This is where it got confusing: the volume Ftest is made up of 2 segments and spans two disks, with 16 extents on the first disk and 8 on the second.
The volume Ftestdb is made up of 3 segments and spans the same two disks, with 16 extents on the second disk, then 15 extents on the first, and then 1 extent on the second disk again. The only way I know to create that is with this:

$ lvcreate -n Ftest vgtest -l 16 /dev/sda
$ lvcreate -n Ftestdb vgtest -l 16 /dev/sdb
$ lvextend -l 31 /dev/vgtest/Ftestdb /dev/sda
$ lvextend -l 24 /dev/vgtest/Ftest /dev/sdb
$ lvextend -l 32 /dev/vgtest/Ftestdb /dev/sdb

......

Make filesystems and mount both volumes:

$ mkfs.ext3 /dev/vgtest/Ftest
$ mkfs.ext3 /dev/vgtest/Ftestdb
$ mount /dev/vgtest/Ftest /mnt/Ftest
$ mount /dev/vgtest/Ftestdb /mnt/Ftestdb

Add an I/O load to the Ftestdb volume:

$ fsstress -d /mnt/Ftestdb/fsstress -n 1000000 -l 0 -p 1

And fire off pvmove. This one should succeed:

$ pvmove -v /dev/sda /dev/sdc

This one should hang (if it doesn't hang, keep following the steps):

$ pvmove -v /dev/sdb /dev/sdd

This will succeed:

$ pvmove -v /dev/sdc /dev/sda

And this one should hang:

$ pvmove -v /dev/sdd /dev/sdb

Lots of information to plough through here, but it's not obvious to me yet: are you attempting to run two pvmoves simultaneously, or waiting for one to finish before starting the next? In other words, what is the *simplest* setup you have that recreates the problem? (And if they don't have to run simultaneously, is it necessary to run two pvmoves to see the problem, or can it be seen in some other way by just running one pvmove? The write-up here looks rather confused.)

Regarding the attached patch: if the devices are suspended, that's been done for a reason, and resuming them forcibly to release trapped I/O isn't a good idea. Better to make sure no I/O can become trapped in the first place.

(Also, always think about how this works in a cluster: you cannot force a resume on just the local node. And this code sits below the cluster locking which signals suspend/resume to the other nodes, so it would cause inconsistent state across the cluster.)
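As noted above, reproducing the hang needs multiple LVs sharing the same physical disk. A quick way to spot such PVs is to count distinct LVs per device in `lvs` output. In this sketch, `lvs_output` is a hypothetical stub echoing the segment layout from this report, so the pipeline runs without any LVM setup; on a real system you would pipe `lvs --noheadings -o lv_name,devices` instead.

```shell
#!/bin/sh
# Stubbed 'lvs --noheadings -o lv_name,devices' output (hypothetical; it mirrors
# the Ftest/Ftestdb layout described above).
lvs_output() {
    printf 'Ftest /dev/sdb1(0)\n'
    printf 'Ftest /dev/sdb2(16)\n'
    printf 'Ftestdb /dev/sdb2(0)\n'
    printf 'Ftestdb /dev/sdb1(16)\n'
    printf 'Ftestdb /dev/sdb2(24)\n'
}

# PVs used by more than one LV are the candidates for triggering the hang.
shared=$(lvs_output | awk '{
        pv = $2; sub(/\(.*/, "", pv)          # strip the (extent) suffix
        if (!seen[pv, $1]++) count[pv]++      # count each LV once per PV
    }
    END { for (p in count) if (count[p] > 1) print p }' | sort)
echo "$shared"
```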
Reply to Alasdair:

To trigger this bug, only one pvmove is needed. Just following comment 5 will reproduce it. When you do this step:

$ pvmove -v /dev/sdb /dev/sdd

then when pvmove updates the metadata, it suspends all LVs on sdb/sdd: Ftest and Ftestdb. When it suspends Ftest, it also suspends Ftest's underlying LV, pvmove0. pvmove0 is also an underlying device of Ftestdb. After Ftest and pvmove0 are suspended, and before Ftestdb is suspended, some I/O is written through Ftestdb to pvmove0; this I/O stays pending on Ftestdb. Then, to suspend Ftestdb, the kernel has to wait until all of Ftestdb's in-flight I/O completes. But because that I/O was written through Ftestdb to pvmove0, and pvmove0 is already suspended, it will never complete. So pvmove hangs.

About resuming all the underlying LVs forcibly: I agree it is not a good idea. Maybe we need to add two flags. One on the command context, indicating that the suspend was called by pvmove to update metadata; only that kind of suspend would trigger a resume. Another flag on the temporary LV created by pvmove; when resuming, only that kind of LV would be resumed.

Reply to Milan:

Thanks for the reminder. I had not considered the cluster case carefully. I need some time to think about how to deal with it.

I am working on a fix. In the meantime, as a workaround, try using the -n option of pvmove to move only one LV at a time.

List the LVs in the VG:

lvs --noheadings -o name $vg

Move one LV:

pvmove -i0 -n $lvname

I found a series of problems with pvmove which I have fixed in the upstream repository. However, I would like to wait some time to see that no unintended side-effects show up before seeing the change back-ported to RHEL. An audit for similar bugs found a problem in lvremove, which I have also fixed. lvconvert still needs to be audited.

The same problem is also present on RHEL 4 and RHEL 6, so once the problem is fixed, please also back-port the fix to RHEL 4 and RHEL 6.

This passed the upstream test suite for the first time last night.
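The one-LV-at-a-time workaround above can be scripted as a simple loop. In this sketch, `lvs` and `pvmove` are hypothetical stubs that just echo what would run, so the loop is demonstrable anywhere; on a real system you would delete the two stub definitions and use the actual VG and PV names.

```shell
#!/bin/sh
# Hypothetical stubs standing in for the real LVM commands (remove on a real
# system). The stub 'lvs' ignores its arguments and echoes the LV names from
# this report; the stub 'pvmove' prints the command it would have run.
lvs() { printf 'Ftest\nFtestdb\n'; }
pvmove() { echo "pvmove $*"; }

vg=vgtest
# Move the LVs one at a time so only one LV's devices are suspended at once.
out=$(for lv in $(lvs --noheadings -o name "$vg"); do
    pvmove -i0 -n "$lv" /dev/sdb /dev/sdd
done)
echo "$out"
```

As the follow-up comment notes, this has no significant performance cost: the same amount of data is moved, just in a different order.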
However, due to the complexity of the change and the amount of regression testing I believe it needs, I am not offering this as a Z-stream release, but only releasing it as part of the next scheduled update, viz. 5.8. In the meantime, I'm afraid the workaround of running pvmove on one LV at a time is the best I can offer. That can be scripted, and will not have a significant impact on performance, because the same amount of data is still being moved, just in a different order.

Upstream release 2.02.86 is included in Fedora rawhide. Please test.

Fixed in lvm2-2.02.88-1.el5.

Fix verified in the latest rpms.

2.6.18-274.el5

lvm2-2.02.88-4.el5                 BUILT: Wed Nov 16 09:40:55 CST 2011
lvm2-cluster-2.02.88-4.el5         BUILT: Wed Nov 16 09:46:51 CST 2011
device-mapper-1.02.67-2.el5        BUILT: Mon Oct 17 08:31:56 CDT 2011
device-mapper-event-1.02.67-2.el5  BUILT: Mon Oct 17 08:31:56 CDT 2011
cmirror-1.1.39-10.el5              BUILT: Wed Sep 8 16:32:05 CDT 2010
kmod-cmirror-0.1.22-3.el5          BUILT: Tue Dec 22 13:39:47 CST 2009

[root@grant-01 tmp]# pvcreate /dev/sd[bc][12]
  Writing physical volume data to disk "/dev/sdb1"
  Physical volume "/dev/sdb1" successfully created
  Writing physical volume data to disk "/dev/sdb2"
  Physical volume "/dev/sdb2" successfully created
  Writing physical volume data to disk "/dev/sdc1"
  Physical volume "/dev/sdc1" successfully created
  Writing physical volume data to disk "/dev/sdc2"
  Physical volume "/dev/sdc2" successfully created
[root@grant-01 tmp]# vgcreate -s 64M vgtest /dev/sdb1 /dev/sdb2 /dev/sdc1 /dev/sdc2
  Volume group "vgtest" successfully created
[root@grant-01 tmp]# lvcreate -n Ftest vgtest -l 16 /dev/sdb1
  Logical volume "Ftest" created
[root@grant-01 tmp]# lvcreate -n Ftestdb vgtest -l 16 /dev/sdb2
  Logical volume "Ftestdb" created
[root@grant-01 tmp]# lvextend -l 31 /dev/vgtest/Ftestdb /dev/sdb1
  Extending logical volume Ftestdb to 1.94 GB
  Logical volume Ftestdb successfully resized
[root@grant-01 tmp]# lvextend -l 24 /dev/vgtest/Ftest /dev/sdb2
  Extending logical volume Ftest to 1.50 GB
  Logical volume Ftest successfully resized
[root@grant-01 tmp]# lvextend -l 32 /dev/vgtest/Ftestdb /dev/sdb2
  Extending logical volume Ftestdb to 2.00 GB
  Logical volume Ftestdb successfully resized
[root@grant-01 tmp]# lvs -a -o +devices
  LV      VG     Attr   LSize Devices
  Ftest   vgtest -wi-a- 1.50G /dev/sdb1(0)
  Ftest   vgtest -wi-a- 1.50G /dev/sdb2(16)
  Ftestdb vgtest -wi-a- 2.00G /dev/sdb2(0)
  Ftestdb vgtest -wi-a- 2.00G /dev/sdb1(16)
  Ftestdb vgtest -wi-a- 2.00G /dev/sdb2(24)
[root@grant-01 tmp]# mkfs.ext3 /dev/vgtest/Ftest
[root@grant-01 tmp]# mkfs.ext3 /dev/vgtest/Ftestdb
[root@grant-01 tmp]# mkdir /mnt/Ftest
[root@grant-01 tmp]# mkdir /mnt/Ftestdb
[root@grant-01 tmp]# mount /dev/vgtest/Ftest /mnt/Ftest
[root@grant-01 tmp]# mount /dev/vgtest/Ftestdb /mnt/Ftestdb
[root@grant-01 tmp]# fsstress -d /mnt/Ftestdb/fsstress -n 1000000 -l 0 -p 1
seed = 1322167840
[root@grant-01 ~]# pvmove -v /dev/sdb1 /dev/sdc1
    Finding volume group "vgtest"
    Archiving volume group "vgtest" metadata (seqno 6).
    Creating logical volume pvmove0
    Moving 16 extents of logical volume vgtest/Ftest
    Moving 15 extents of logical volume vgtest/Ftestdb
    Found volume group "vgtest"
    activation/volume_list configuration setting not defined, checking only host tags for vgtest/Ftest
    Found volume group "vgtest"
    activation/volume_list configuration setting not defined, checking only host tags for vgtest/Ftestdb
    Updating volume group metadata
    Found volume group "vgtest"
    Found volume group "vgtest"
    Creating vgtest-pvmove0
    Loading vgtest-pvmove0 table (253:4)
    Loading vgtest-Ftest table (253:2)
    Loading vgtest-Ftestdb table (253:3)
    Suspending vgtest-Ftest (253:2) with device flush
    Suspending vgtest-pvmove0 (253:4) with device flush
    Suspending vgtest-Ftestdb (253:3) with device flush
    Found volume group "vgtest"
    Found volume group "vgtest"
    Found volume group "vgtest"
    activation/volume_list configuration setting not defined, checking only host tags for vgtest/pvmove0
    Resuming vgtest-pvmove0 (253:4)
    Found volume group "vgtest"
    Loading vgtest-pvmove0 table (253:4)
    Suppressed vgtest-pvmove0 identical table reload.
    Resuming vgtest-Ftest (253:2)
    Resuming vgtest-Ftestdb (253:3)
    Found volume group "vgtest"
    Creating volume group backup "/etc/lvm/backup/vgtest" (seqno 7).
    Checking progress before waiting every 15 seconds
  /dev/sdb1: Moved: 3.2%
  /dev/sdb1: Moved: 51.6%
    Updating volume group metadata
    Found volume group "vgtest"
    Found volume group "vgtest"
    Loading vgtest-pvmove0 table (253:4)
    Suspending vgtest-pvmove0 (253:4) with device flush
    Found volume group "vgtest"
    Resuming vgtest-pvmove0 (253:4)
    Creating volume group backup "/etc/lvm/backup/vgtest" (seqno 8).
  /dev/sdb1: Moved: 100.0%
    Found volume group "vgtest"
    Found volume group "vgtest"
    Loading vgtest-Ftest table (253:2)
    Loading vgtest-Ftestdb table (253:3)
    Loading vgtest-pvmove0 table (253:4)
    Suspending vgtest-Ftest (253:2) with device flush
    Suspending vgtest-Ftestdb (253:3) with device flush
    Suspending vgtest-pvmove0 (253:4) with device flush
    Found volume group "vgtest"
    Found volume group "vgtest"
    Found volume group "vgtest"
    Resuming vgtest-pvmove0 (253:4)
    Found volume group "vgtest"
    Resuming vgtest-Ftest (253:2)
    Found volume group "vgtest"
    Resuming vgtest-Ftestdb (253:3)
    Found volume group "vgtest"
    Removing vgtest-pvmove0 (253:4)
    Removing temporary pvmove LV
    Writing out final volume group after pvmove
    Creating volume group backup "/etc/lvm/backup/vgtest" (seqno 10).
[root@grant-01 ~]# pvmove -v /dev/sdb2 /dev/sdc2
    Finding volume group "vgtest"
    Archiving volume group "vgtest" metadata (seqno 7).
    Creating logical volume pvmove1
    Skipping locked LV Ftest
    Skipping locked LV Ftestdb
    Skipping mirror LV pvmove0
  All data on source PV skipped. It contains locked, hidden or non-top level LVs only.
  No data to move for vgtest

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0161.html