Bug 463431
| Summary: | [RHEL5.3] Excessive LVM volume alignment for MD device | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Jeff Burke <jburke> |
| Component: | lvm2 | Assignee: | Milan Broz <mbroz> |
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 5.3 | CC: | agk, atodorov, borgan, bpeck, ddumas, duck, dwysocha, dzickus, edamato, hdegoede, heinzm, jbrassow, k.georgiou, lwang, mbroz, pbunyan, prockai, pvrabec, rvykydal, syeghiay, vijay.majagaonkar, wenzhuo |
| Target Milestone: | beta | Keywords: | TestBlocker |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| URL: | http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=4380587 | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2009-01-20 21:34:48 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Jeff Burke, 2008-09-23 13:03:19 UTC
Can you remove cmdline from your kickstart file, re-run the test, and see what question it's trying to ask you?

Chris, unfortunately I can't. These systems are part of RHTS. To modify the kickstart template files you need root privileges, which I do not have. I think the only thing we can do at this point is grab the kickstart and try to reproduce it locally in the anaconda test network, unless you can work with the engineering operations folks to hack up the kickstart on the RHTS server.

Chris, does this log help? http://rhts.redhat.com/testlogs/30544/110317/937822/anaconda.log Looking at that I see a lot of the following repeated:

    18:52:26 DEBUG : self.driveList(): ['hda', 'sda', 'sdb']
    18:52:26 DEBUG : DiskSet.skippedDisks: []
    18:52:26 DEBUG : DiskSet.skippedDisks: []
    18:52:26 DEBUG : done starting mpaths. Drivelist: ['hda', 'sda', 'sdb']
    18:52:26 DEBUG : adding drive hda to disk list
    18:52:26 DEBUG : adding drive sda to disk list
    18:52:26 DEBUG : adding drive sdb to disk list
    18:52:26 DEBUG : no preexisting size for volume group VolGroup00
    18:52:26 DEBUG : got pv.size of 7.84423828125, clamped to 0
    18:52:26 DEBUG : got pv.size of 7.8134765625, clamped to 0
    18:52:26 DEBUG : got pv.size of 7.8134765625, clamped to 0
    18:52:26 DEBUG : total space: 0
    18:52:26 DEBUG : no preexisting size for volume group VolGroup00
    18:52:26 DEBUG : got pv.size of 7.84423828125, clamped to 0
    18:52:26 DEBUG : got pv.size of 7.8134765625, clamped to 0
    18:52:26 DEBUG : got pv.size of 5122.25683594, clamped to 5120
    18:52:26 DEBUG : total space: 5120
    18:52:26 DEBUG : no preexisting size for volume group VolGroup00
    18:52:26 DEBUG : got pv.size of 7.84423828125, clamped to 0
    18:52:26 DEBUG : got pv.size of 7.8134765625, clamped to 0
    18:52:26 DEBUG : got pv.size of 7679.47851562, clamped to 7648
    18:52:26 DEBUG : total space: 7648
    18:52:26 DEBUG : no preexisting size for volume group VolGroup00
    18:52:26 DEBUG : got pv.size of 7.84423828125, clamped to 0
    18:52:26 DEBUG : got pv.size of 7.8134765625, clamped to 0
    18:52:26 DEBUG : got pv.size of 8958.08935547, clamped to 8928
    18:52:26 DEBUG : total space: 8928
    18:52:26 DEBUG : no preexisting size for volume group VolGroup00
    18:52:26 DEBUG : got pv.size of 7.84423828125, clamped to 0
    18:52:26 DEBUG : got pv.size of 7.8134765625, clamped to 0
    18:52:26 DEBUG : got pv.size of 9593.47265625, clamped to 9568
    18:52:26 DEBUG : total space: 9568

Well, the real error here is found in the lvmout.log file:

    Physical volume '/dev/sdb1' listed more than once.
    Unable to add physical volume '/dev/sdb1' to volume group 'VolGroup00'.

Then when we hit that error, we usually bring up a messageWindow. In cmdline mode, messageWindow just displays the error message and then says "You can't have a question in command line mode!" because there's nothing more we can do from that situation. So the question is why we're seeing that lvm error.

> Then when we hit that error, we usually bring up a messageWindow. In cmdline
> mode, messageWindow just displays the error message and then says "You can't
> have a question in command line mode!" because there's nothing more we can do
> from that situation.

Don't you think you could print the error even in cmdline mode?

We did:

    LVM operation failed
    lvcreate failed for swap0

That's the same information you would get in the usual graphical installer too. The other information to be found is in anaconda.log and lvmout.log.
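As an aside, the "clamped to" lines in the anaconda.log excerpt above are simply each physical volume's size being rounded down to a whole number of volume group extents (32 MB here, from --pesize=32768 in the kickstart). A minimal sketch of that rounding, using a hypothetical helper name rather than anaconda's actual code, reproduces the figures in the log:

```python
def clamp_pv_size(pv_size_mb, pe_size_mb=32):
    """Round a PV size in MB down to a whole number of physical extents.

    Hypothetical helper illustrating the 'got pv.size of X, clamped to Y'
    lines in anaconda.log above; the real installer code may differ.
    """
    return int(pv_size_mb // pe_size_mb) * pe_size_mb

# Values taken from the log excerpt above:
for size in (7.84423828125, 5122.25683594, 7679.47851562, 9593.47265625):
    print(size, "->", clamp_pv_size(size))
# 7.844... -> 0, 5122.25... -> 5120, 7679.47... -> 7648, 9593.47... -> 9568
```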
ok, after further investigation I think we have two bugs here. The lvmout.log for lvcreate for swap0 is this:

    Insufficient free extents (60) in volume group VolGroup00: 64 required

The problem referenced in comment 3 has to do with multipath: both sda and sdb are the same disk and anaconda does not handle this correctly.

*** Bug 460602 has been marked as a duplicate of this bug. ***

The link to the ks.cfg file in the original description is no longer working (none of the links are); next time please attach files instead of putting in links to volatile locations. Can you attach ks.cfg? I would like to take a look at what the ks is doing with regards to partition creation. I think that it is trying to fit more on the disk than will fit. Assuming this worked before, I guess we got stricter with regards to this. Maybe you can even do another test run with the same ks with a somewhat smaller swap0?

sorry about that. to save space a cron job gzips all the log files.

    zerombr
    clearpart --all --initlabel
    #PART_DETAILS#
    part /boot/efi --fstype vfat --size=100 --ondisk=sda --asprimary
    part raid.9 --size=100 --grow --ondisk=sdb
    part raid.8 --size=100 --grow --ondisk=sda
    raid pv.10 --fstype "physical volume (LVM)" --level=RAID0 --device=md0 raid.8 raid.9
    volgroup VolGroup00 --pesize=32768 pv.10
    logvol swap --fstype swap --name=swap0 --vgname=VolGroup00 --size=2048
    logvol / --fstype ext3 --name=LogVol00 --vgname=VolGroup00 --size=1024 --grow

That's from the ks.cfg file.

    <5>SCSI device sda: 71132960 512-byte hdwr sectors (36420 MB)
    <5>SCSI device sdb: 71132960 512-byte hdwr sectors (36420 MB)
    <6>HP CISS Driver (v 3.6.20-RH2)
    <6>ACPI: PCI Interrupt 0000:08:00.0[A] -> GSI 63 (level, low) -> IRQ 55
    <6>cciss0: <0x3230> at PCI 0000:08:00.0 IRQ 69 using DAC
    <6>      blocks= 234281760 block_size= 512
    <6>      heads= 255, sectors= 32, cylinders= 28711
    <4>
    <6>      blocks= 143305920 block_size= 512
    <6>      heads= 255, sectors= 32, cylinders= 17562
    <4>
    <6>      blocks= 234281760 block_size= 512
    <6>      heads= 255, sectors= 32, cylinders= 28711
    <4>
    <6> cciss/c0d0:
    <6>      blocks= 143305920 block_size= 512
    <6>      heads= 255, sectors= 32, cylinders= 17562
    <4>
    <6> cciss/c0d1:

Hm, removing RAID from the kickstart file makes it work fine. Can I get a list of which nightlies worked and which failed this test? Trying to narrow down the anaconda versions where it changed.

Chris, unfortunately I can not give you an exact list. We had other issues that caused us not to get even this far with some other distros. Also, if the RHTS scheduler selected a machine that did not reproduce this issue (i.e. a machine that did not have raid in its kickstart), then that distro would show as passed but would still have the issue. I think you will have to do some testing with system(s) that are "known" to fail, binary searching the nightly trees until we find the tree where it started.
Job 31001 is in process: http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=31001

    RHEL5.3-Server-20080917.nightly
    RHEL5.3-Server-20080918.nightly
    RHEL5.3-Server-20080919.nightly
    RHEL5.3-Server-20080924.nightly
    RHEL5.3-Server-20080925.nightly
    RHEL5.3-Server-20080926.nightly
    RHEL5.3-Server-20080922.0
    RHEL5.3-Server-20080919.1
    RHEL5.3-Server-20080912.1

I reproduced the bug, using the ks below (similar to that from comment #14, only using one physical drive):

    zerombr
    clearpart --all --initlabel
    #PART_DETAILS#
    part /boot --ondisk hda --fstype ext3 --size=00100 --asprimary
    part raid.9 --size=100 --grow --ondisk=hda
    part raid.8 --size=100 --grow --ondisk=hda
    raid pv.10 --fstype "physical volume (LVM)" --level=RAID0 --device=md0 raid.8 raid.9
    volgroup VolGroup00 --pesize=32768 pv.10
    logvol swap --fstype swap --name=swap0 --vgname=VolGroup00 --size=2048
    logvol / --fstype ext3 --name=LogVol00 --vgname=VolGroup00 --size=1024 --grow

Which gave the same result:

    Insufficient free extents (60) in volume group VolGroup00: 64 required

... 4 PE missing, which I found in pvdisplay output as "not usable":

    --- Physical volume ---
    PV Name               /dev/md0
    VG Name               VolGroup00
    PV Size               9.90 GB / not usable 150.50 MB
    Allocatable           yes
    PE Size (KByte)       32768
    Total PE              312
    Free PE               60
    Allocated PE          252
    PV UUID               V2EVjW-zrwZ-ksk1-pCGl-GHfi-nFGh-pPr6GX

I wonder what the "not usable" means and why that is; should our getActualSize take it into account?

According to RHTS job 31001:

    RHEL5.3-Server-20080917.nightly   InProcess (I expect it to pass or SEGV)
    RHEL5.3-Server-20080918.nightly   Success
    RHEL5.3-Server-20080919.nightly   Success
    RHEL5.3-Server-20080924.nightly   Fails
    RHEL5.3-Server-20080925.nightly   Fails
    RHEL5.3-Server-20080926.nightly   Fails
    RHEL5.3-Server-20080922.0         Fails
    RHEL5.3-Server-20080919.1         Fails
    RHEL5.3-Server-20080912.1         Fails - Different known failure SEGV

So 0919.nightly works but 0919.1 does not. Should be easy to tell what changed there?

RHEL5.3-Server-20080919.nightly has lvm2-2.02.32-4.el5.i386.rpm, whereas RHEL5.3-Server-20080919.1 has lvm2-2.02.40-2.el5.i386.rpm. These two trees also have different versions of anaconda, but the only thing we did was move from a per-device encryption passphrase to a system-wide one, and nothing in that patch looks suspect. I can reproduce this problem after stripping all the RAID out of the original kickstart file, so this could be a problem with the rebase of the LVM tools. Thoughts?

Maybe it's not so much a problem with the new LVM tools as it is a problem in the interaction between anaconda and those tools? IOW, maybe the output of the lvm command has changed (subtly) and that is biting us?
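For orientation, the figures in the pvdisplay output above fit together as a simple extent calculation. This is a back-of-the-envelope check, assuming the ~9.90 GB /dev/md0 (its array size in KiB is reported later in the thread) and the 32 MB extent size from the kickstart; anaconda's internal computation may differ slightly:

```python
# Rough arithmetic behind "Insufficient free extents (60) ...: 64 required",
# using the pvdisplay figures above (values are assumptions for illustration).
pe_size_mb = 32                      # --pesize=32768 (KB) in the kickstart
pv_size_mb = 10377728 / 1024         # /dev/md0 array size reported later: ~10134.5 MB

planned_extents = int(pv_size_mb // pe_size_mb)   # 316 extents the installer can plan for
actual_extents = 312                 # "Total PE" reported by pvdisplay

swap_extents = 2048 // pe_size_mb    # swap0 --size=2048 -> 64 extents
root_extents = planned_extents - swap_extents     # / is --grow, gets the rest: 252

print(planned_extents - actual_extents)   # 4 extents (128 MB) "missing"
print(actual_extents - root_extents)      # only 60 extents left for swap0
```

In other words, 4 extents (128 MB) of the PV went "not usable", so after the grown root LV took 252 extents only 60 remained for the 64-extent swap LV, which is exactly the error above.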
(In reply to comment #19) I am attaching some more log info. I don't see anything significant here, but someone else might.

----------------------------------------------

After running "lvm pvcreate -ff -y -v /dev/md0" by anaconda:

stdout:

    Physical volume "/dev/md0" successfully created

stderr:

    WARNING: Forcing physical volume creation on /dev/md0 of volume group "VolGroup00"
    Set up physical volume for "/dev/md0" with 20755456 available sectors
    Zeroing start of device /dev/md0

pvdisplay output:

    "/dev/md0" is a new physical volume of "9.90 GB"
    --- NEW Physical volume ---
    PV Name               /dev/md0
    VG Name
    PV Size               9.90 GB
    Allocatable           NO
    PE Size (KByte)       0
    Total PE              0
    Free PE               0
    Allocated PE          0
    PV UUID               O7G0bY-8JxY-90Tb-uSxV-gO8N-3OPu-SsTmkc

-----------------------------------------------

After running "lvm vgcreate -v -An -s 32768 VolGroup00" by anaconda:

stdout:

    Volume group "VolGroup00" successfully created

stderr:

    Wiping cache of LVM-capable devices
    Adding physical volume '/dev/md0' to volume group 'VolGroup00'
    WARNING: This metadata update is NOT backed up

pvdisplay output:

    --- Physical volume ---
    PV Name               /dev/md0
    VG Name               VolGroup00
    PV Size               9.90 GB / not usable 150.50 MB
    Allocatable           yes
    PE Size (KByte)       32768
    Total PE              312
    Free PE               312
    Allocated PE          0
    PV UUID               O7G0bY-8JxY-90Tb-uSxV-gO8N-3OPu-SsTmkc

vgdisplay output:

    --- Volume group ---
    VG Name               VolGroup00
    System ID
    Format                lvm2
    Metadata Areas        1
    Metadata Sequence No  1
    VG Access             read/write
    VG Status             resizable
    MAX LV                0
    Cur LV                0
    Open LV               0
    Max PV                0
    Cur PV                1
    Act PV                1
    VG Size               9.75 GB
    PE Size               32.00 MB
    Total PE              312
    Alloc PE / Size       0 / 0
    Free PE / Size        312 / 9.75 GB
    VG UUID               tQJGOU-FCJS-J0iN-9OUg-QSmA-i2gH-aTwbqW

----------------------------------------------------------

fdisk -l output:

    Disk /dev/hda: 10.7 GB, 10737418240 bytes
    255 heads, 63 sectors/track, 1305 cylinders
    Units = cylinders of 16065 * 512 = 8225280 bytes

       Device Boot      Start         End      Blocks   Id  System
    /dev/hda1   *           1          13      104391   83  Linux
    /dev/hda2              14         659     5188995   fd  Linux raid autodetect
    /dev/hda3             660        1305     5188995   fd  Linux raid autodetect

    Disk /dev/md0: 10.6 GB, 10626793472 bytes
    2 heads, 4 sectors/track, 2594432 cylinders
    Units = cylinders of 8 * 512 = 4096 bytes

    Disk /dev/md0 doesn't contain a valid partition table

Try adding --config 'device { md_chunk_alignment = 0 }' to the vgcreate command as a temporary workaround. We added a performance tweak when LVM volumes are above MD volumes to align the I/O through the stack. This may involve a small effective size reduction to achieve better alignment. The above lvm.conf setting disables the tweak. Worth investigating further though - 128MB lost sounds rather a lot. (It may be that a smaller extent size should be chosen in the kickstart file.)

To comment #24: Adding --config 'devices { md_chunk_alignment = 0 }' to vgcreate (note that the option is 'devices', not 'device' as in comment #24) worked (installed successfully), with "not usable" reduced to 22.50 MB:

    --- Physical volume ---
    PV Name               /dev/md0
    VG Name               VolGroup00
    PV Size               9.90 GB / not usable 22.50 MB
    Allocatable           yes (but full)
    PE Size (KByte)       32768
    Total PE              316
    Free PE               0
    Allocated PE          316
    PV UUID               BpbZGx-IDY0-ff4g-UT78-OY4W-EpIc-ue0C8N

To comment #21:

> I can reproduce this problem after stripping all the RAID out of the original
> kickstart file,

I can't; without RAID the ks works for me.
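For reference, the workaround from comments #24 and #26 can be passed straight on the vgcreate command line. Below is a minimal sketch of wrapping that call the way an installer might; the flags match the thread (-An, -s, --config 'devices { md_chunk_alignment = 0 }'), but the wrapper function itself is hypothetical and is not anaconda's actual code:

```python
import subprocess

def vgcreate_without_md_alignment(vg_name, pv_path, pe_size_kb=32768):
    """Create a VG with the MD chunk alignment tweak disabled (illustrative only)."""
    cmd = [
        "lvm", "vgcreate", "-v", "-An", "-s", str(pe_size_kb),
        "--config", "devices { md_chunk_alignment = 0 }",
        vg_name, pv_path,
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True)

# e.g. vgcreate_without_md_alignment("VolGroup00", "/dev/md0")
```

The same devices { md_chunk_alignment = 0 } setting can equally be written into lvm.conf, which is the route anaconda takes later in this thread.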
(In reply to comment #26) The "not usable" 22.50 MB seems to be just due to the extent size now, and it is clamped accordingly during anaconda's actual pv size computations.

What's the value of /sys/block/md0/md/chunk_size? (And what do the md tools say the chunk size is, i.e. to confirm lvm2 gets the units right.)

chunk size is 256K

    /sys/block/md0/md/chunk_size = 262144

    /dev/md0:
            Version : 00.90.03
      Creation Time : Thu Oct  2 13:49:49 2008
         Raid Level : raid0
         Array Size : 10377728 (9.90 GiB 10.63 GB)
       Raid Devices : 2
      Total Devices : 2
    Preferred Minor : 0
        Persistence : Superblock is persistent

        Update Time : Thu Oct  2 13:49:49 2008
              State : clean
     Active Devices : 2
    Working Devices : 2
     Failed Devices : 0
      Spare Devices : 0

         Chunk Size : 256K

               UUID : d91ac49c:17c55c1c:d2df1b05:4f6ae343
             Events : 0.1

        Number   Major   Minor   RaidDevice State
           0       3        2        0      active sync   /dev/hda2
           1       3        3        1      active sync   /dev/hda3

Note the LV is using extents of 32 *megabytes* each though, which perfectly explains the wasted 28 MB when using the lvm cmdline option to revert to the old behavior. These 32 MB extents may also be the cause of the "large" loss of 150 MB of usable space with the new lvm behavior. I think the biggest problem here, though, is that anaconda gets the volgroup size wrong with the new lvm behavior.

I've added that to our lvm.conf blurb that we write out, so anaconda-11.1.2.135-1 should have this workaround included. That should help out testing while we figure out what we should do for real.

Well, I reckon there's a conversion from bytes to sectors missing, so the alignment boundary is 512 times larger than intended. Need to test the patch I have for this, then build a new lvm2 package.

Note to lvm2 developers: By default all sizes in lvm2 code are in sectors - exceptions to that should be obvious, e.g. if the variable name says bytes. A variable should never be sometimes size-in-sectors and sometimes size-in-bytes depending where you are in the function or calculation; avoid that e.g. by using two variables.

Fix in lvm2-2.02.40-4.el5, lvm2-cluster-2.02.40-4.el5.

I suggest we need a release note like this:

For performance reasons, LVM2 Logical Volumes are now aligned to the MD (Multiple Device) chunk size. This means that a Logical Volume will always start at an offset which is a multiple of the MD chunk size. To use the previous mode of alignment, set the md_chunk_alignment variable to 0 in lvm.conf.

---

More info for discussion:

The previous LVM2 version aligns Logical Volumes to 64k (or to the pagesize if the pagesize is greater than 64k; note also that there is a metadata area at the beginning of the PV).

Example: /dev/md0 has chunk size 512k.

Without md alignment, pe_start is at sector 384 (see it in the metadata or using "dmsetup table"):

    # dmsetup table
    vg_test-lv: 0 204800 linear 9:0 384

With md alignment on, it changes pe_start to 1024 (i.e. 512k - the value is in 512-byte sectors):

    # dmsetup table
    vg_test-lv: 0 204800 linear 9:0 1024

This means that in some situations there can be some unused space (up to the MD chunk size per LV), and creating a volume which was previously misaligned (but fit into the space) can fail (at most 1 extent missing because of the offset increase). Usually the LVM extent is a multiple of the MD chunk size, so the real problem is the offset of the first volume.

The importance of alignment is that higher-level code (like ext3) usually optimizes writes according to the MD chunk; if the LVM layer doesn't respect this alignment, IO requests are split into pieces, the underlying RAID has to compute another XOR for the next chunk, and more IO requests run than necessary.

The question is whether anaconda needs to change its partitioning code (which computes the size itself) or will disable this behaviour for the next release...
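A small sketch of the alignment rule described above, and of the effect of the missing bytes-to-sectors conversion mentioned earlier in the thread; this illustrates the arithmetic only and is not the actual lvm2 code:

```python
SECTOR = 512  # bytes

def align_up(value, boundary):
    """Round value up to the next multiple of boundary (both in sectors)."""
    return ((value + boundary - 1) // boundary) * boundary

default_pe_start = 384                  # sectors (192k), as in the 512k-chunk example above
chunk_sectors = 512 * 1024 // SECTOR    # 512k chunk -> 1024 sectors

# Intended md_chunk_alignment behaviour: pe_start lands on a chunk boundary.
print(align_up(default_pe_start, chunk_sectors))            # 1024, matching "dmsetup table"

# The bug: the chunk size in bytes was used where sectors were expected,
# making the boundary 512x larger (512k * 512 = 256 MB here; 128 MB for the
# 256k-chunk /dev/md0 in this report), which is what ate whole extents.
print(align_up(default_pe_start, chunk_sectors * SECTOR))   # 524288 sectors = 256 MB
```

With the 256k chunk of /dev/md0, the buggy boundary works out to 128 MB, which together with the ~22.5 MB extent-rounding remainder accounts for the "not usable 150.50 MB" seen earlier.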
With the fixed lvm2 package (lvm2-2.02.40-4.el5) used with the ks from comment #19, I got the expected results. With chunks of size 256k and chunk alignment on, the offset is 512 sectors (256k).

    # dmsetup table
    VolGroup00-LogVol00: 0 16515072 linear 9:0 512
    VolGroup00-swap0: 0 4194304 linear 9:0 16515584

    # vgdisplay
    --- Volume group ---
    VG Name               VolGroup00
    System ID
    Format                lvm2
    Metadata Areas        1
    Metadata Sequence No  3
    VG Access             read/write
    VG Status             resizable
    MAX LV                0
    Cur LV                2
    Open LV               2
    Max PV                0
    Cur PV                1
    Act PV                1
    VG Size               9.88 GB
    PE Size               32.00 MB
    Total PE              316
    Alloc PE / Size       316 / 9.88 GB
    Free PE / Size        0 / 0
    VG UUID               ATATgq-XxRg-QD3V-sX0n-5SqA-MRtx-7qMcKJ

    # pvdisplay
    --- Physical volume ---
    PV Name               /dev/md0
    VG Name               VolGroup00
    PV Size               9.90 GB / not usable 22.50 MB
    Allocatable           yes (but full)
    PE Size (KByte)       32768
    Total PE              316
    Free PE               0
    Allocated PE          316
    PV UUID               xCGS42-x1Gr-BrBX-dAt9-cggS-IXBV-fPo1gL

(In reply to comment #36)

> The question is whether anaconda needs to change its partitioning code (which
> computes the size itself) or will disable this behaviour for the next release...

I think that if the change to the anaconda code would only be something like counting the chunk size in when computing the actual (available) size of a pv above raid, it can be easy to make.

(In reply to comment #38)

> (In reply to comment #36)
> > The question is whether anaconda needs to change its partitioning code (which
> > computes the size itself) or will disable this behaviour for the next release...
>
> I think that if the change to the anaconda code would only be something like
> counting the chunk size in when computing the actual (available) size of
> a pv above raid, it can be easy to make.

I think it can be as easy as just subtracting 1 LVM extent size from the computed VG size if it's on top of raid, at least if Milan Broz is correct that we lose at most 1 extent. Milan, what happens if I have 4 disks and create 2 raid0 pairs using these 4 disks and then do one volumegroup over those 2 raid "arrays" - can we then still lose at most 1 extent compared to the old situation?

(In reply to comment #39)

> I think it can be as easy as just subtracting 1 LVM extent size from the computed
> VG size if it's on top of raid, at least if Milan Broz is correct that we lose
> at most 1 extent. Milan, what happens if I have 4 disks and create 2 raid0 pairs
> using these 4 disks and then do one volumegroup over those 2 raid "arrays" - can
> we then still lose at most 1 extent compared to the old situation?

Well, I expect that the LVM extent size is a multiple of the MD chunk size - so the problem is *only* with the first offset on the PV (all subsequent LVs are aligned automatically - no space lost). pe_start is now a property of every PV - so if there are more underlying MD PVs, each of them can have an aligned offset. So if I count correctly, in the worst case we can lose at most 1 LVM extent per underlying MD device.
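Below is a minimal sketch of the per-PV computation being discussed here (reserve roughly one chunk for the aligned pe_start of each MD-backed PV, then clamp to whole extents). The helper and its argument layout are hypothetical and are not the actual anaconda patch:

```python
def usable_vg_size_mb(pvs, pe_size_mb=32):
    """Estimate usable VG size by summing per-PV usable space.

    pvs: list of (pv_size_mb, md_chunk_kb) tuples; md_chunk_kb is None for
    PVs that are not on top of MD RAID.  Hypothetical helper for illustration.
    """
    total = 0
    for pv_size_mb, md_chunk_kb in pvs:
        usable = pv_size_mb
        if md_chunk_kb:
            # Reserve up to one chunk for the chunk-aligned pe_start
            # (at most one extent lost per underlying MD PV, per the comment above).
            usable -= md_chunk_kb / 1024.0
        total += int(usable // pe_size_mb) * pe_size_mb
    return total

# /dev/md0 from this report: ~10134.5 MB PV with a 256k chunk and 32 MB extents
print(usable_vg_size_mb([(10134.5, 256)]))   # 10112 MB = 316 extents, matching pvdisplay
```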
(In reply to comment #39)

> (In reply to comment #38)
> > (In reply to comment #36)
> > > The question is whether anaconda needs to change its partitioning code (which
> > > computes the size itself) or will disable this behaviour for the next release...
> >
> > I think that if the change to the anaconda code would only be something like
> > counting the chunk size in when computing the actual (available) size of
> > a pv above raid, it can be easy to make.
>
> I think it can be as easy as just subtracting 1 LVM extent size from the computed
> VG size if it's on top of raid, at least if Milan Broz is correct that we lose
> at most 1 extent.

Comparing the default PE size and chunk size, the case where we lose 1 PE due to chunk alignment shouldn't be too frequent, so perhaps it is overkill to always reduce the available space by one PE; we can just subtract 1 chunk size (perhaps with some reserve) before aligning (clamping) the available space to the PE size. That is easy too.

> Milan, what happens if I have 4 disks and create 2 raid0 pairs
> using these 4 disks and then do one volumegroup over those 2 raid "arrays" - can
> we then still lose at most 1 extent compared to the old situation?

In anaconda we compute the actual available size of a VG (e.g. with reductions due to PE size) as a result of computing it for each PV of the VG, and so we would in the case of chunk alignment too (which is a property of the physical volume, as Milan Broz said above).

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0164.html

Hi all, I am new to this anaconda; can somebody please tell me the problem or point me the way to fix this? Here is my .ks file.

anaconda version : anaconda-11.1.2.209-1.el5
lvm2 version : lvm2-2.02.56-8.el5.x86_64

[ks]
    # zerombr removes invalid partition tables which may exist
    zerombr
    clearpart --all --initlabel
    # /maint is the maintenance partition
    partition /maint --asprimary --fstype=ext3 --size=5120
    partition /boot --asprimary --fstype=ext3 --size=128
    partition pv.01 --size=1 --grow
    volgroup system_vg pv.01
    logvol / --vgname=system_vg --fstype=ext3 --size=2048 --name=root_vol
    logvol /tmp --vgname=system_vg --fstype=ext3 --size=2048 --name=tmp_vol
    logvol /var --vgname=system_vg --fstype=ext3 --size=2048 --name=var_vol
    logvol swap --vgname=system_vg --fstype=swap --recommended --name=swap_vol
    logvol /opt --vgname=system_vg --fstype=ext3 --size=2048 --name=opt_vol
[/ks]

[LOG]
    Running pre-install scripts
    Retrieving installation information...
    In progress...
    Completed
    Completed
    Checking dependencies in packages selected for installation...
    In progress...
    Can't have a question in command line mode!
    LVM operation failed
    lvcreate failed for tmp_vol
    The installer will now exit...
    custom ['_Reboot']
[/LOG]

I am not able to find out the root cause of this issue, but it was solved after increasing the HDD size for the VM. Note: the same size worked with a lower version of anaconda.