Escalated to Bugzilla from IssueTracker
Event posted on 04-08-2010 07:37am EDT by edamato

All Issues: Problem Description
---------------------------------------------------
1. Time and date of problem:
   Reproducible any time.

2. System architecture(s):
   Any, RHEL5

3. Provide a clear and concise problem description as it is understood at the time of escalation. Please be as specific as possible in your description.

   It is possible to run 'lvconvert -m0' on a volume that is still being converted to a mirror by an ongoing 'lvconvert -m1'. The downconvert returns success immediately, the mirror sync is halted, and the LV becomes linear.

   This is a problem if the downconvert excludes the original PV of the mirror, which is the only one containing the full data. For instance, lv1 is linear on mpath1 and is upconverted to mpath1 (image0), mpath2 (image1), mpath3 (log). Shortly after the sync has started, and before it completes, one can do a downconvert that excludes mpath1 from the mirror, which leaves the LV linear on mpath2 only. mpath2 does not contain the full data from the original PV. If there is a filesystem mounted on top of this LV, it will encounter problems.

   Observed behavior: It is possible to do a downconvert that removes the source PV of a mirror before the sync is finished.

   Desired behavior: LVM should not allow the original PV (or PE ranges) to be removed from the LV before the sync is finished.

4. Specific action requested of SEG:
   Fix the bug.

5. Is a defect (bug) in the product suspected? yes/no
   None found.

6. Does a proposed patch exist? yes/no
   No.

7. What is the impact to the customer when they experience this problem?
   High. Has caused data loss. It is arguable that one should not do a downconvert before LVs are fully in sync, but LVM should also not allow the original PV to be removed before the sync is finished.

This event sent from IssueTracker by dejohnso [Support Engineering Group] issue 737643
Event posted on 04-08-2010 07:38am EDT by edamato

Example of reproducer:

shell1:

lvm> lvs
  LV       VG         Attr   LSize   Origin Snap%  Move Log Copy%  Convert
  LogVol00 VolGroup00 -wi-ao 132.69G
  LogVol01 VolGroup00 -wi-ao   3.91G
  rootvol  rootvg     -wi-a-   2.91G
  varvol   rootvg     -wi-a- 992.00M
  lvol0    tst        -wi-a-   5.00G
lvm> lvconvert -m1 tst/lvol0
  tst/lvol0: Converted: 7.0%
  tst/lvol0: Converted: 13.2%
  tst/lvol0: Converted: 20.0%
  (during this, run the command on shell2)
  tst/lvol0: Converted: 100.0%
  Logical volume lvol0 converted.
lvm> lvs
  LV       VG         Attr   LSize   Origin Snap%  Move Log Copy%  Convert
  LogVol00 VolGroup00 -wi-ao 132.69G
  LogVol01 VolGroup00 -wi-ao   3.91G
  rootvol  rootvg     -wi-a-   2.91G
  varvol   rootvg     -wi-a- 992.00M
  lvol0    tst        -wi-a-   5.00G

shell2:

lvm> lvconvert -m0 /dev/dm-3
  "/dev/dm-3": Invalid path for Logical Volume
  Please provide a valid volume group name
lvm> lvconvert -m0 tst/lvol0 /dev/dm-3
  Logical volume lvol0 converted.
lvm>

edamato assigned to issue for EMEA Production Escalation.
Status set to: Waiting on Tech

This event sent from IssueTracker by dejohnso [Support Engineering Group] issue 737643
Event posted on 04-08-2010 07:39am EDT by edamato

More detailed reproducer:

1- Create the LV as linear:

    # lvcreate -L 5G tst
      Logical volume "lvol0" created
    # lvs -a --segments -o +seg_pe_ranges
      LV    VG   Attr   #Str Type   SSize PE Ranges
      lvol0 tst  -wi-a-    1 linear 5.00G /dev/dm-2:0-1279

2- Zero IO on the PVs for this VG (sampled every 2 seconds with 'iostat -x 2 dm-2 dm-3 dm-4'):

    Device:   rrqm/s  wrqm/s     r/s     w/s    rsec/s    wsec/s  avgrq-sz  avgqu-sz   await   svctm   %util
    dm-2        0.00    0.00    0.00    0.00      0.00      0.00      0.00      0.00    0.00    0.00    0.00
    dm-3        0.00    0.00    0.00    0.00      0.00      0.00      0.00      0.00    0.00    0.00    0.00
    dm-4        0.00    0.00    0.00    0.00      0.00      0.00      0.00      0.00    0.00    0.00    0.00

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               0.00    0.00    0.00    0.75    0.00   99.25

    Device:   rrqm/s  wrqm/s     r/s     w/s    rsec/s    wsec/s  avgrq-sz  avgqu-sz   await   svctm   %util
    dm-2        0.00    0.00    0.00    0.00      0.00      0.00      0.00      0.00    0.00    0.00    0.00
    dm-3        0.00    0.00    0.00    0.00      0.00      0.00      0.00      0.00    0.00    0.00    0.00
    dm-4        0.00    0.00    0.00    0.00      0.00      0.00      0.00      0.00    0.00    0.00    0.00

3- Start the convert, sleep, show progress, and then suddenly convert it back to linear while the first convert is still running, removing the source data PV (dm-2) from the mirror:

    # lvconvert -m1 tst/lvol0 ; sleep 10; lvs; sleep 10; lvs; lvconvert -m0 tst/lvol0 /dev/dm-2

3.1- ^C to background the convert (this can also be achieved by doing the commands in separate shells. I wanted to keep them all together so that it became clear).
3.2- copy is running:

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               2.26    0.00    1.25    6.27    0.00   90.23

    Device:   rrqm/s  wrqm/s     r/s     w/s    rsec/s    wsec/s  avgrq-sz  avgqu-sz   await   svctm   %util
    dm-2        0.00    0.00    9.50    1.50     76.00     12.00      8.00      0.02    1.41    1.41    1.55
    dm-3        0.00    0.00    9.50    1.50     76.00     12.00      8.00      0.01    0.95    0.95    1.05
    dm-4        0.00    0.00   26.00   17.50    208.00    140.00      8.00      0.07    1.71    1.71    7.45

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               1.00    0.00    4.98   24.38    0.00   69.65

    Device:   rrqm/s  wrqm/s     r/s     w/s    rsec/s    wsec/s  avgrq-sz  avgqu-sz   await   svctm   %util
    dm-2        0.00    0.00  338.00    1.50  41584.00     12.00    122.52      0.95    2.80    0.95   32.15
    dm-3        0.00    0.00   14.00  325.50    112.00  41484.00    122.52      1.16    3.40    1.13   38.40
    dm-4        0.00    0.00  510.50  538.50   4082.50   4186.50      7.88      0.42    0.40    0.40   42.40

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               0.00    0.00    0.75    5.01    0.00   94.24

    Device:   rrqm/s  wrqm/s     r/s     w/s    rsec/s    wsec/s  avgrq-sz  avgqu-sz   await   svctm   %util
    dm-2        0.00    0.00  432.00    0.00  55296.00      0.00    128.00      2.00    4.60    1.40   60.30
    dm-3        0.00    0.00    0.00  430.00      0.00  55040.00    128.00      1.54    3.58    1.17   50.25
    dm-4        0.00    0.00    0.00   54.00      0.00    270.00      5.00      0.10    1.77    1.77    9.55

3.3- we can see progress:

      tst/lvol0: Converted: 6.9%

3.4- copy still running and progress:

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               0.00    0.00    1.00    2.00    0.00   97.01

    Device:   rrqm/s  wrqm/s     r/s     w/s    rsec/s    wsec/s  avgrq-sz  avgqu-sz   await   svctm   %util
    dm-2        0.00    0.00  360.00    0.00  46080.00      0.00    128.00      2.16    5.82    1.95   70.35
    dm-3        0.00    0.00    0.00  360.50      0.00  46144.00    128.00      1.37    3.80    1.27   45.95
    dm-4        0.00    0.00    0.00   45.00      0.00    225.00      5.00      0.03    0.57    0.57    2.55

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               0.00    0.00    0.50    1.50    0.00   97.99

    Device:   rrqm/s  wrqm/s     r/s     w/s    rsec/s    wsec/s  avgrq-sz  avgqu-sz   await   svctm   %util
    dm-2        0.00    0.00  401.49    0.00  51391.04      0.00    128.00      1.94    5.03    1.72   69.10
    dm-3        0.00    0.00    0.00  402.49      0.00  51518.41    128.00      1.46    3.62    1.22   49.30
    dm-4        0.00    0.00    0.00   50.25      0.00    251.24      5.00      0.03    0.54    0.54    2.74

      tst/lvol0: Converted: 13.1%

3.5- copy running and status of LV:

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               0.00    0.00    0.75    1.00    0.00   98.25

    Device:   rrqm/s  wrqm/s     r/s     w/s    rsec/s    wsec/s  avgrq-sz  avgqu-sz   await   svctm   %util
    dm-2        0.00    0.00  288.00    0.00  36864.00      0.00    128.00      2.18    7.56    2.44   70.25
    dm-3        0.00    0.00    0.00  288.00      0.00  36864.00    128.00      1.34    4.64    1.48   42.60
    dm-4        0.00    0.00    0.00   35.50      0.00    177.50      5.00      0.02    0.59    0.59    2.10

      LV       VG         Attr   LSize   Origin Snap%  Move Log        Copy%  Convert
      LogVol00 VolGroup00 -wi-ao 132.69G
      LogVol01 VolGroup00 -wi-ao   3.91G
      rootvol  rootvg     -wi-a-   2.91G
      varvol   rootvg     -wi-a- 992.00M
      lvol0    tst        mwi-a-   5.00G                    lvol0_mlog  17.89

3.6- Copy running, status of LV, downconvert, and IO becomes ZERO immediately:

    (copy still running)

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               0.00    0.00    0.75    1.50    0.00   97.75

    Device:   rrqm/s  wrqm/s     r/s     w/s    rsec/s    wsec/s  avgrq-sz  avgqu-sz   await   svctm   %util
    dm-2        0.00    0.00  356.00    0.00  45568.00      0.00    128.00      2.19    5.70    2.01   71.55
    dm-3        0.00    0.00    0.00  356.00      0.00  45568.00    128.00      1.32    3.70    1.25   44.60
    dm-4        0.00    0.00    0.00   44.50      0.00    222.50      5.00      0.03    0.62    0.62    2.75

    (LV 22.27% in sync)

      LV       VG         Attr   LSize   Origin Snap%  Move Log        Copy%  Convert
      LogVol00 VolGroup00 -wi-ao 132.69G
      LogVol01 VolGroup00 -wi-ao   3.91G
      rootvol  rootvg     -wi-a-   2.91G
      varvol   rootvg     -wi-a- 992.00M
      lvol0    tst        mwi-a-   5.00G                    lvol0_mlog  22.27

    (downconvert takes place)

      Logical volume lvol0 converted.
    (IO is flushed and becomes ZERO, showing that sync has not continued before LV became linear)

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               3.24    0.00    4.24   22.94    0.00   69.58

    Device:   rrqm/s  wrqm/s     r/s     w/s    rsec/s    wsec/s  avgrq-sz  avgqu-sz   await   svctm   %util
    dm-2        0.00    0.00  333.50    7.50  35340.00     68.00    103.84      1.72    5.57    1.82   62.10
    dm-3        0.00    0.00   61.50  281.50    524.00  35140.00    103.98      1.06    3.09    1.12   38.35
    dm-4        0.00    0.00   61.50   42.00    524.00    240.50      7.39      0.05    0.50    0.45    4.70

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               0.00    0.00    0.25    0.00    0.00   99.75

    Device:   rrqm/s  wrqm/s     r/s     w/s    rsec/s    wsec/s  avgrq-sz  avgqu-sz   await   svctm   %util
    dm-2        0.00    0.00    0.00    0.00      0.00      0.00      0.00      0.00    0.00    0.00    0.00
    dm-3        0.00    0.00    0.00    0.00      0.00      0.00      0.00      0.00    0.00    0.00    0.00
    dm-4        0.00    0.00    0.00    0.00      0.00      0.00      0.00      0.00    0.00    0.00    0.00

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               0.25    0.00    0.25    0.00    0.00   99.50

    Device:   rrqm/s  wrqm/s     r/s     w/s    rsec/s    wsec/s  avgrq-sz  avgqu-sz   await   svctm   %util
    dm-2        0.00    0.00    0.00    0.00      0.00      0.00      0.00      0.00    0.00    0.00    0.00
    dm-3        0.00    0.00    0.00    0.00      0.00      0.00      0.00      0.00    0.00    0.00    0.00
    dm-4        0.00    0.00    0.00    0.00      0.00      0.00      0.00      0.00    0.00    0.00    0.00

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               0.00    0.00    0.00    0.75    0.00   99.25

    Device:   rrqm/s  wrqm/s     r/s     w/s    rsec/s    wsec/s  avgrq-sz  avgqu-sz   await   svctm   %util
    dm-2        0.00    0.00    0.00    0.00      0.00      0.00      0.00      0.00    0.00    0.00    0.00
    dm-3        0.00    0.00    0.00    0.00      0.00      0.00      0.00      0.00    0.00    0.00    0.00
    dm-4        0.00    0.00    0.00    0.00      0.00      0.00      0.00      0.00    0.00    0.00    0.00

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle

4- LV becomes linear:

    # lvs -a --segments -o +seg_pe_ranges
      LV    VG   Attr   #Str Type   SSize PE Ranges
      lvol0 tst  -wi-a-    1 linear 5.00G /dev/dm-3:0-1279

5- The initial PV of the LV was dm-2 and the sync did not finish, but the LV is now linear and contains only dm-3. Its data is potentially incomplete, which can lead to kernel panics if there is an FS mounted on top.

This event sent from IssueTracker by dejohnso [Support Engineering Group] issue 737643
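Until a fixed package is installed, an administrator can avoid the situation in the reproducer by checking the Copy% column before downconverting. The following is a minimal sketch of such a guard, not part of LVM: the function names copy_percent_complete, mirror_copy_percent and downconvert_when_synced are hypothetical, and it assumes that 'lvs --noheadings -o copy_percent' prints the same figure shown under Copy% in the listings above.

```shell
#!/bin/sh
# Sketch only: these helper names are hypothetical, not LVM commands.

# copy_percent_complete VALUE
# Succeeds (exit 0) only when VALUE, a Copy% figure such as "22.27" or
# "100.00", indicates a fully synchronised mirror. An empty value
# (for example from a linear LV) is treated as not complete.
copy_percent_complete() {
    case "$1" in
        100|100.0|100.00) return 0 ;;
        *) return 1 ;;
    esac
}

# mirror_copy_percent VG/LV
# Prints the Copy% value reported by lvs, with padding removed.
# Assumption: the copy_percent report field is available in this lvm2.
mirror_copy_percent() {
    lvs --noheadings -o copy_percent "$1" | tr -d ' '
}

# downconvert_when_synced VG/LV [PV...]
# Runs 'lvconvert -m0' only if the mirror reports 100% copied;
# otherwise prints a message and returns 1 without touching the LV.
downconvert_when_synced() {
    lv=$1
    shift
    pct=$(mirror_copy_percent "$lv")
    if copy_percent_complete "$pct"; then
        lvconvert -m0 "$lv" "$@"
    else
        echo "refusing: $lv reports Copy% '${pct:-unknown}', not 100" >&2
        return 1
    fi
}
```

With these definitions loaded, 'downconvert_when_synced tst/lvol0 /dev/dm-2' would refuse at step 3.6 above, where Copy% was 22.27. The check and the lvconvert are separate commands, so this is only a user-side precaution and does not replace a fix in LVM itself.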
I'm able to reproduce this issue, and will add a test case for this in the future.

shell1:

[root@hayes-01 ~]# lvconvert -m 1 hayes/lv
  hayes/lv: Converted: 5.8%
  [ down convert on shell2 ]
  hayes/lv: Converted: 100.0%
  hayes/lv: Converted: 100.0%
  hayes/lv: Converted: 100.0%
  hayes/lv: Converted: 100.0%
  hayes/lv: Converted: 100.0%
  hayes/lv: Converted: 100.0%
  hayes/lv: Converted: 100.0%
  hayes/lv: Converted: 100.0%
  hayes/lv: Converted: 100.0%
  ( STUCK ATTEMPTING TO SYNC )

shell2:

[root@hayes-01 ~]# lvconvert -m 0 hayes/lv
  Logical volume lv converted.
[root@hayes-01 ~]# lvs -a -o +devices
  LV   VG    Attr   LSize Move Log Copy%  Convert Devices
  lv   hayes -wi-a- 8.00G                         /dev/etherd/e1.1p1(0)
There are unfortunately several other related problems in this code which I discovered last week - all must be fixed together. See bug 582251.
Created attachment 407207: preliminary patch

This patch fixes the problem, but I'd like to test a bit more and see what the other issues are in 582251 before checking in.
Created attachment 407679: RHEL5 version of fix

Found a couple of bugs with the last patch, fixed them and posted it to lvm-devel. This attachment contains the RHEL5 version of that patch.
Upstream patch has been checked in.
Fix in lvm2-2.02.56-10.el5.
This bug doesn't appear to be fixed yet.

SCENARIO - [down_convert_to_linear_during_up_convert_to_mirror]
Create a linear and attempt to upconvert it to a mirror, then attempt to down convert it back with the up convert still running
taft-01: lvcreate -n multi_convert -L 4G mirror_sanity
taft-01: lvconvert -m 1 mirror_sanity/multi_convert &
sleeping a bit...
current copy percent: 8.09
attempting down convert with up convert in progress
should not have been able to downconvert with an upconvert in progress

Now the original lvconvert is stuck like in comment #4.

[root@taft-01 ~]# ps -ef | grep lvconvert
root      7266  7265  0 16:08 ?        00:00:00 lvconvert -m 1 mirror_sanity/multi_convert
root      7415  6867  0 16:09 pts/0    00:00:00 grep lvconvert
[root@taft-01 ~]# strace -p 7266
Process 7266 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...> <unfinished ...>
Process 7266 detached
[root@taft-01 ~]# rpm -q lvm2
lvm2-2.02.56-10.el5
[root@taft-01 ~]# rpm -q device-mapper
device-mapper-1.02.39-2.el5
[root@taft-01 ~]#
The hang in comment #4 - yes, it is another problem. It probably needs a new bug...

But I think you are testing something different - you can downconvert during an upconvert IF the remaining image is fully synced. You cannot downconvert if the remaining image is not yet synced.

I see this (with lvm2-2.02.56-8.el5_5.2):

# lvs -o +devices
  LV   VG      Attr   LSize  Origin Snap%  Move Log      Copy%  Convert Devices
  lv1  vg_test -wi-a- 80.00G                                            /dev/sdc(0)
# lvconvert -b -m1 vg_test/lv1
  Logical volume lv1 converted.
# lvs -o +devices
  LV   VG      Attr   LSize  Origin Snap%  Move Log      Copy%  Convert Devices
  lv1  vg_test mwi-a- 80.00G                    lv1_mlog   0.07         lv1_mimage_0(0),lv1_mimage_1(0)
# lvconvert -m0 vg_test/lv1 /dev/sdc
  Unable to remove primary mirror image while mirror is not in-sync
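The rule described in this comment can be written down as a small decision function. This is only an illustrative model of the behaviour reported here, not the lvm2 code: the name downconvert_allowed and its two arguments are hypothetical, and the assumption that the primary image is the only one guaranteed complete during a resync is taken from the problem description.

```shell
#!/bin/sh
# Model of the reported behaviour only; not taken from the lvm2 sources.

# downconvert_allowed COPY_PERCENT REMOVES_PRIMARY
#   COPY_PERCENT     Copy% as shown by lvs, e.g. "0.07" or "100.00"
#   REMOVES_PRIMARY  "yes" if the PV to drop holds the primary image
# Exit status 0 means the downconvert may proceed, 1 means refuse.
downconvert_allowed() {
    case "$1" in
        # Mirror is in sync: every image is complete, any may be removed.
        100|100.0|100.00) return 0 ;;
    esac
    if [ "$2" = "yes" ]; then
        # Not in sync and the only complete image would be removed.
        echo "Unable to remove primary mirror image while mirror is not in-sync" >&2
        return 1
    fi
    # Not in sync, but the complete (primary) image is the one that stays.
    return 0
}
```

For the transcript above, 'downconvert_allowed 0.07 yes' returns 1, matching the refusal of 'lvconvert -m0 vg_test/lv1 /dev/sdc', while 'downconvert_allowed 0.07 no' returns 0 because the remaining image is the fully populated one.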
I cannot reproduce the polling (lvconvert hangs) issue. Corey, would you be willing to create a new bug on this issue with steps to reproduce? My method is perhaps overly simplistic:

while lvremove -ff vg; do lvcreate -L 500M -n lv vg; lvconvert -m1 vg/lv; done

... and wait for hang.
It seems a little shady to change the subject of a bug in the middle of an errata test and build, especially when it changes the meaning of what was originally ack'ed. The original bug said "lvm allows a mirror down convert on an ongoing upconvert", which to me says the fix should no longer allow a down convert on an ongoing upconvert. Granted, the hang may be a different issue, and I'll file a different bug for that, but the fact remains that lvm still shouldn't allow a user to downconvert during an upconvert.
Filed bug 585328 for the hang mentioned above.
FWIW, I can't reproduce the case with the old rpm where the primary device even gets deleted.

# primary would be sdb1
  lv   taft  -wi-a- 800.00M   /dev/sdb1(0)

[root@taft-04 tmp]# lvconvert -m 1 taft/lv
  taft/lv: Converted: 65.5%

[root@taft-04 ~]# lvconvert -m 0 taft/lv
  Logical volume lv converted.

# still sdb1
  lv   taft  -wi-a- 800.00M   /dev/sdb1(0)
So this bugzilla has not quite fixed all of these related problems with lvconvert. Perhaps I should have changed the title in the first place to make it clear I was hoping all the related lvconvert problems could be fixed together, including those referenced on the Fedora bugzilla above.
Fix verified in lvm2-2.02.56-10.el5.

FYI - all that was tested wrt this bug was that down converting (by specifying the primary leg for removal) with an in-progress up convert is no longer allowed:

"Unable to remove primary mirror image while mirror is not in-sync"
An advisory has been issued which should help with the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0052.html