Description of problem:
Using LVM on top of mdraid (or dmraid, as confirmed by David Neilsen) drastically reduces performance in my case, i.e. by 50%.

Version-Release number of selected component (if applicable):
2.6.20-1.2990.fc7

This is an F7/rawhide machine with 6x400GB SATA II disks, all partitioned as 100MB + 399.9GB. /boot is on /dev/sda1 (it would be RAID1 across the /dev/sd[abc]1 partitions except for mkinitrd raid1.ko breakage), with /dev/md1 as RAID5 on /dev/sd[abcdef]2. Then there is a single LVM vg00 on top of /dev/md1, with root and swap as LVs within vg00 and plenty of spare space.

Doing some timings of the various block devices (so far just roughly with hdparm; bonnie is installed for more detail later). Results of "hdparm -Tt" averaged over a few runs, showing cached and buffered speeds:

/dev/sda1 gives 1995MB/s and 72MB/s, which seems quite good for a single spindle.
/dev/md1 gives 2040MB/s and 260MB/s, also fairly good (I had hoped with six spindles and parity to get closer to 5x performance than 3.5x).
/dev/mapper/vg00-lv01 (my root fs) gives 2100MB/s and 135MB/s, which is a little disappointing.

Is there any slow debug code currently in rawhide that might explain it? Could I have some bad choices of block sizes between the RAID/LVM layers which are reducing throughput by splitting reads? Anything else?
Created attachment 150488 [details]
Try running this script...

I'd like to see the attached program run on both the md device and through the LVM device. Please capture the output and attach it to this bug. Example:

# Warning, this will destroy data on the MD device
prompt> perf_matrix.pl if=/dev/md1 iter=5 > md-read-results.txt
prompt> perf_matrix.pl of=/dev/md1 iter=5 > md-write-results.txt

# remake your LVM objects on top of the MD device again
prompt> perf_matrix.pl if=/dev/<VG>/<LV> iter=5 > lvm-read-results.txt
prompt> perf_matrix.pl of=/dev/<VG>/<LV> iter=5 > lvm-write-results.txt

To see the performance difference for yourself, you could run:

prompt> perf_matrix.pl diff md-read-results.txt lvm-read-results.txt
OK, I'll need to reformat and lay out partitions so I have enough space outside the md device for my root and swap, but that is easy enough; this is purely a test box. Will report back in a few hours.
I've noticed when installing lmbench (a package required by the script) that the location for some of the binaries is strange. I usually install the lmbench rpm like this:

prompt> rpm -Uvh /pub/lmbench-3.0-0.a7.1.el4.rf.i386.rpm --relocate /usr/lib/lmbench/bin/i686-pc-linux-gnu=/usr/bin
Please note that a small readahead can cause some performance problems too. You can try to increase it using the blockdev command - see bug 147679. (Check whether the /dev/mdX and /dev/mapper/<your device> readahead values differ.)

Also please put your md (/proc/mdstat) and dm (dmsetup table) configuration used in the test above here - or you can use the lvmdump command to collect the information automatically.
Should readahead values for /dev/sdX, /dev/mdX and LVM volumes be the same? What if the values differ?
(In reply to comment #4)
> Please note that small readahead can cause some performance problems too.
> You can try to increase it using blockdev command - see bug 147679.
> (Check if /dev/mdX and /dev/mapper/<your device> readahead differ.)

The original partitions are now blown away and I'm re-installing with root on a separate disk to allow the perf_matrix.pl tests.

> Also please put here your md (/proc/mdstat) and dm configuraton (dmsetup table)

All history now; IIRC /proc/mdstat showed a 256k chunk size on the raid, and the LVM was 32MB blocks per the anaconda default.
(In reply to comment #5)
> Should readahead values for /dev/sdX, /dev/mdX and LVM volumes be the same?

It depends - if you have e.g. md raid0 and then a simple lvm linear mapping over it, they should be the same. If the lvm volume (on top of md) has a smaller readahead, sequential reads will be slower...

> What if the values differ?

Then you can try to set them to the same value with blockdev --setra. LVM tools should set it automatically for particular mappings, but that is not implemented yet.

(I am trying to find out how the readahead is involved in this low performance... maybe it is not important at all in this case.)
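Until that automatic setting exists, the rule above (never leave the LV with a smaller readahead than the device beneath it) can be sketched as a tiny helper; the 256/5120 values are just the ones seen in this report, used as placeholders, and the max_ra helper is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: pick the larger of two readahead values
# (blockdev readahead values are counted in 512-byte sectors).
max_ra() {
    if [ "$1" -ge "$2" ]; then echo "$1"; else echo "$2"; fi
}

lv_ra=256     # default readahead the LV came up with (from this report)
md_ra=5120    # readahead of the underlying md device (from this report)
max_ra "$lv_ra" "$md_ra"
# One would then apply the result with something like:
#   blockdev --setra 5120 /dev/dm-0
```
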
OK, that all took a little longer than planned, partly due to the time to finish parity generation on the array, and partly because anaconda seems very reluctant to create raid md devices or LVM VGs when root, boot and swap are not destined for them.

Re-partitioned the drives so that /dev/sd[abcdef]1 are all 4GB and contain /boot, root and swap, leaving ~396GB per disk for raid, created as follows. I picked the same chunk size that anaconda did, rather than mdadm's default.

# mdadm --create /dev/md0 --level=5 --chunk=256 --raid-devices=6 --spare-devices=0 --verbose /dev/sd[abcdef]2

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sda2[0] sdf2[5] sde2[4] sdd2[3] sdc2[2] sdb2[1]
      1932578560 blocks level 5, 256k chunk, algorithm 2 [6/6] [UUUUUU]

# hdparm -Tt /dev/md0
/dev/md0:
 Timing cached reads:   5082 MB in 1.99 seconds = 2549.01 MB/sec
 Timing buffered disk reads:  780 MB in 3.01 seconds = 259.30 MB/sec

So the speed of the array is the same as before, which I'd expect.
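As a side note (a general RAID5 rule of thumb, not something verified on this box): a full stripe is chunk size times the number of data disks, where data disks = total disks - 1 for RAID5, and reads smaller than a stripe cannot keep all spindles busy at once. For the array above:

```shell
# Full-stripe arithmetic for the 6-disk, 256k-chunk RAID5 above.
chunk_kb=256
disks=6
data_disks=$(( disks - 1 ))            # RAID5 spends one disk's worth on parity
stripe_kb=$(( chunk_kb * data_disks ))
echo "$stripe_kb"                      # full stripe size in KiB
# blockdev --setra counts 512-byte sectors, so as a readahead value:
echo "$(( stripe_kb * 2 ))"
```

For these numbers that comes to a 1280 KiB stripe, i.e. a readahead of 2560 sectors as a plausible lower bound.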
# vgs
  VG         #PV #LV #SN Attr   VSize VFree
  VolGroup00   1   0   0 wz--n- 1.80T 1.80T
# pvs
  PV       VG         Fmt  Attr PSize PFree
  /dev/md0 VolGroup00 lvm2 a-   1.80T 1.80T
# lvcreate --name lv00_test --size 20G VolGroup00
  Logical volume "lv00_test" created

# hdparm -Tt /dev/mapper/VolGroup00-lv00_test
/dev/mapper/VolGroup00-lv00_test:
 Timing cached reads:   4310 MB in 1.99 seconds = 2160.97 MB/sec
 Timing buffered disk reads:  374 MB in 3.01 seconds = 124.07 MB/sec

So the LVM LV is still about 50% of the speed of the md array. I must confess I'd never heard of blockdev before; here is what it shows:

# blockdev --report
RO    RA   SSZ   BSZ   StartSec     Size    Device
rw   256   512  4096          0  781422768  /dev/sda
rw   256   512  4096         63    8385867  /dev/sda1
rw   256   512   512    8385930  773031735  /dev/sda2
rw   256   512  4096          0  781422768  /dev/sdb
rw   256   512  4096         63    8385867  /dev/sdb1
rw   256   512   512    8385930  773031735  /dev/sdb2
rw   256   512  4096          0  781422768  /dev/sdc
rw   256   512  4096         63    8385867  /dev/sdc1
rw   256   512   512    8385930  773031735  /dev/sdc2
rw   256   512  4096          0  781422768  /dev/sdd
rw   256   512   512    8385930  773031735  /dev/sdd2
rw   256   512  4096          0  781422768  /dev/sde
rw   256   512   512    8385930  773031735  /dev/sde2
rw   256   512  4096          0  781422768  /dev/sdf
rw   256   512   512    8385930  773031735  /dev/sdf2
rw  5120   512  4096          0 3865157120  /dev/md0
rw   256   512  4096          0   41943040  /dev/dm-0

Again, I've never encountered dmsetup; I hope this is the information you wanted:

# dmsetup table /dev/mapper/VolGroup00-lv00_test
0 41943040 linear 9:0 384

Noticing the small readahead value, I increased it:

# blockdev --setra 5120 /dev/dm-0

This increased the speed of the LV a little:

# hdparm -Tt /dev/mapper/VolGroup00-lv00_test
/dev/mapper/VolGroup00-lv00_test:
 Timing cached reads:   4258 MB in 1.99 seconds = 2134.95 MB/sec
 Timing buffered disk reads:  528 MB in 3.01 seconds = 175.33 MB/sec

but it's still only 70%. What would be considered good - >90%?
Increasing the readahead of /dev/dm-0 above the size for /dev/md0 (by a factor of two) didn't further increase the speed.
After the hdparm tests, I removed the LVM configs with lvremove/vgremove/pvremove. Couldn't find lmbench built for F7; tried to build v3 from the bitmover source, but get:

gmake[2]: *** No rule to make target `../SCCS/s.ChangeSet', needed by `bk.ver'.

Does any repo have an F7 build?
OK, rebuilt from the srpm at ftp://rpmfind.net/linux/dag/redhat/el5/en/x86_64/SRPMS.dag/lmbench-3.0-0.a7.1.rf.src.rpm and installed with:

rpm -Uvh /usr/src/redhat/RPMS/x86_64/lmbench-3.0-0.a7.1.rf.x86_64.rpm --relocate /usr/lib/lmbench/bin/x86_64-linux-gnu=/usr/bin

Had to remove ImageMagick and a2ps for conflicts. Tests running now.
Created attachment 150566 [details]
md read results

The md-read test completed and is attached. The md-write test failed with a series of:

Argument "write:" isn't numeric in sort at ./perf_matrix.pl line 191.

but I've attached what output it did create.
Created attachment 150567 [details] (partial?) md write results
Created attachment 150569 [details] lvm read results
Created attachment 150578 [details]
lvm write results

The LV created was 100GB; should it have been the whole size of the raid device for a valid comparison? I re-adjusted the /dev/md-0 readahead to 5120 after re-creating it, before running the tests. I had to refer to it as /dev/mapper/VG0-LV1 rather than /dev/VG0/LV1 as suggested.

The same errors did NOT re-occur on the lvm-write test as on the md-write test, so I'll re-run the latter just in case...
Created attachment 150588 [details]
corrected md-write results

OK, bash history reveals I'd fumbled the command to refer to a non-existent device; here is the corrected version.

The diff results between the md-read and lvm-read show nearly all red results, with 3 green results. The diff results between the md-write and lvm-write show pretty much the same thing. So apart from the "sweet spots", lvm performance seems to be 30% down on md performance.

Interesting (but likely not related to this problem) is the red/green split between the md-read and md-write: in about 20% of cases writing is faster than reading. If only I could figure out what the rows/columns signify :-) Looks like a nice little script, that.
Oh, sorry... The script is varying request sizes and transfer sizes. Request sizes (the size of the chunks that are used to write to disk) increase by powers of 2 from left to right starting at 1kiB. Transfer sizes (the file size) increase by powers of 2 from top to bottom starting at 1MiB. So, top-left would simulate small writes to small(ish) files; bottom-right would be large writes to large files. You can set tmin/tmax and rmin/rmax to adjust the default test ranges.
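The row/column labels are therefore just two doubling sequences and can be regenerated with a couple of loops; the 64k/64M upper bounds here are arbitrary illustration, since the real limits come from rmin/rmax and tmin/tmax:

```shell
# Column labels: request sizes, powers of 2 starting at 1 KiB.
r=1
while [ "$r" -le 64 ]; do printf '%dk ' "$r"; r=$(( r * 2 )); done
printf '\n'
# Row labels: transfer (file) sizes, powers of 2 starting at 1 MiB.
t=1
while [ "$t" -le 64 ]; do printf '%dM ' "$t"; t=$(( t * 2 )); done
printf '\n'
```
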
(In reply to comment #17)
> The script is varying request sizes and transfer sizes.

I've sucked the files into OOo Calc and made 3D graphs out of them. The trends per file/block size seem to follow what you'd expect and look similar in "shape" to the graphs that Bonnie produces. In general the "shape" of the MD and LVM graphs is the same, just on a different scale; as I mentioned, there are three points where LVM beats MD.

Overall it seems that, apart from the occasional corner case, LVM performance is still 30% down compared to MD across the board, even after increasing the LVM readahead. This seems quite a penalty to me; any further comments/suggestions?
I don't like to sound like I'm expecting someone to come up with an answer for me, but having piqued interest initially this one has gone a bit quiet. Has anyone got any more comments? After adjusting the /dev/md-0 readahead, is 30% seen as an acceptable overhead for LVM? Any more good/bad datapoints from anyone else? Have I overlooked supplying any information requested earlier?
Sorry, one more thing: please adjust the readahead on the LVM device itself, not only the underlying md device. That said, I have been testing the scenario on a two-disk setup and I could not reproduce any of this slowness. Can you try bumping the LVM readahead to be the same as the md0 readahead and try hdparm again? Also, does hdparm --direct -tT give significantly different results?
(In reply to comment #20)
> Sorry, one more thing, please adjust the readahead on the LVM device itself,
> not only the underlying md device.

That was a typo on my behalf; the md device gets a readahead of 20x the individual disks. I meant I'd increased the readahead on /dev/dm-0, not /dev/md0.

> Although i have been testing the scenario
> on a two-disk setup and i could not reproduce any of this slowness.

Are your two disks raid0 or raid1? I'm quite happy to pull the array apart and test it in different configurations, rather than just raid5.

> Can you
> try bumping LVM readahead to be same as md0 readahead and try hdparm again?
> Also, does hdparm --direct -tT give significantly different results?

My current setup:
/dev/sda1 is boot
/dev/sdb1 is swap
/dev/sd[cdef]1 is md0 as RAID0 for root
/dev/sd[abcdef]2 is md1 as RAID5 for the LVM PV

# blockdev --report
RO    RA   SSZ   BSZ   StartSec     Size    Device
rw   256   512  4096          0  781422768  /dev/sd[abcdef]
rw   256   512  4096         63    8385867  /dev/sd[abcdef]1
rw   256   512   512    8385930  773031735  /dev/sd[abcdef]2
rw  4096   512  4096          0   33542144  /dev/md0
rw  5120   512  4096          0 3865157120  /dev/md1

# hdparm -Tt /dev/md0
/dev/md0:
 Timing cached reads:   3748 MB in 2.00 seconds = 1878.47 MB/sec
 Timing buffered disk reads:  862 MB in 3.00 seconds = 287.33 MB/sec
# hdparm -Tt /dev/md1
/dev/md1:
 Timing cached reads:   3994 MB in 2.00 seconds = 2001.82 MB/sec
 Timing buffered disk reads:  776 MB in 3.00 seconds = 258.42 MB/sec
# hdparm --direct -Tt /dev/md0
/dev/md0:
 Timing O_DIRECT cached reads:   870 MB in 2.00 seconds = 434.77 MB/sec
 Timing O_DIRECT disk reads:  782 MB in 3.00 seconds = 260.47 MB/sec
# hdparm --direct -Tt /dev/md1
/dev/md1:
 Timing O_DIRECT cached reads:   340 MB in 2.01 seconds = 169.57 MB/sec
 Timing O_DIRECT disk reads:  802 MB in 3.01 seconds = 266.52 MB/sec

Then I pvcreate/vgcreate/lvcreate on top of /dev/md1 to create /dev/dm-0, which has an initial readahead of 256:

# hdparm -Tt /dev/dm-0
/dev/dm-0:
 Timing cached reads:   3788 MB in 2.00 seconds = 1898.27 MB/sec
 Timing buffered disk reads:  370 MB in 3.01 seconds = 123.00 MB/sec
# hdparm --direct -Tt /dev/dm-0
/dev/dm-0:
 Timing O_DIRECT cached reads:   210 MB in 2.00 seconds = 104.87 MB/sec
 Timing O_DIRECT disk reads:  620 MB in 3.01 seconds = 206.09 MB/sec

After increasing the readahead on /dev/dm-0 to 5120 (same as /dev/md1):

# hdparm -Tt /dev/dm-0
/dev/dm-0:
 Timing cached reads:   3752 MB in 2.00 seconds = 1880.47 MB/sec
 Timing buffered disk reads:  520 MB in 3.00 seconds = 173.24 MB/sec
# hdparm --direct -Tt /dev/dm-0
/dev/dm-0:
 Timing O_DIRECT cached reads:   216 MB in 2.02 seconds = 107.03 MB/sec
 Timing O_DIRECT disk reads:  622 MB in 3.01 seconds = 206.73 MB/sec

Within one or two MB/s these numbers are consistent with previous runs, though I didn't try the --direct tests before.

Tried increasing the readahead on /dev/sd[abcdef]2 to 512 and on /dev/md1 and /dev/dm-0 to 10240; this made no significant difference from the final numbers shown above.

Need any further details of the hardware?
It would be nice to know whether this is a 32-bit or 64-bit installation; I have skimmed through the report and comments and can't see that info.

It seems your problem is not related to readahead after all. My test environment was (md) raid0 (mdadm defaults) on top of 2 SCSI drives with a device-mapper linear mapping (same as yours). I used a different kernel though, so I can try to reproduce with the rawhide version; I just need to know the architecture. Thanks.
This is Intel x86_64.

I was just waiting for new raid arrays to initialise: I created /dev/md2 as an 800GB raid0 and /dev/md3 as a 400GB raid1, then layered PV/VG/LV on top of each of them. These got speeds of ~140MB/s and ~70MB/s from the md *and* the dm devices, so it seems the speed loss only occurs with LVM on top of RAID5.

Are you able to test with three disks? I'll try with 3 instead of 6 to see if the slowdown still exists, or is less pronounced.
Yes, if you can, please try a 3 disk array too. You are using the standard rawhide x86_64 kernel (2.6.20), right? I think this could be an x86_64-only problem, so I will focus on that arch. And thanks for all the testing!
Yep, 2.6.20-1.3023.fc7 x86_64. Just got a 3 disk raid5 initialising now.
Results with a three disk raid5. The md device was created with readahead=512; the LV was created with readahead=256, so I increased that to 512. It gets ~140MB/s on all hdparm tests, which is fine.

# hdparm -Tt /dev/md4
/dev/md4:
 Timing cached reads:   4024 MB in 1.99 seconds = 2017.05 MB/sec
 Timing buffered disk reads:  430 MB in 3.00 seconds = 143.22 MB/sec
# hdparm -Tt /dev/dm-0
/dev/dm-0:
 Timing cached reads:   4074 MB in 1.99 seconds = 2042.14 MB/sec
 Timing buffered disk reads:  430 MB in 3.00 seconds = 143.14 MB/sec
# hdparm --direct -Tt /dev/md4
/dev/md4:
 Timing O_DIRECT cached reads:   650 MB in 2.00 seconds = 324.24 MB/sec
 Timing O_DIRECT disk reads:  430 MB in 3.00 seconds = 143.10 MB/sec
# hdparm --direct -Tt /dev/dm-0
/dev/dm-0:
 Timing O_DIRECT cached reads:   168 MB in 2.00 seconds = 83.81 MB/sec
 Timing O_DIRECT disk reads:  430 MB in 3.00 seconds = 143.21 MB/sec

When I was using the six disk array, the md device got a default readahead of 5120 and I set the LV to the same; could this somehow be _too_ big? Does your x86_64-specific theory still hold up?
I have simulated a significant slowdown on virtual machines with 6 discs & md raid5, on both arches (i386 & x86_64), with O_DIRECT and zero readahead. There is a problem with big block read operations only (hdparm -tT internally uses reads of 2MB buffers). Further investigation will follow.
The slowdown is caused by an explicit restriction to one-page io requests in the dm code; the MD layer can process bigger requests directly. Changes are needed in core device mapper to remove this safe restriction completely. An attempt to solve this are these patches in the dm devel queue:

http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/dm-introduce-merge_bvec_fn.patch
http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/dm-linear-add-merge.patch
http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/dm-table-remove-merge_bvec-sector-restriction.patch

Please note that these patches are under review and are not suitable for a stable system yet.
Doing some testing using RAID0 instead of RAID5 on the same hardware, ignoring LVM so far, just testing the /dev/md1 speed:

# hdparm -Tt --direct /dev/md1
/dev/md1:
 Timing O_DIRECT cached reads:   1144 MB in 2.00 seconds = 571.63 MB/sec
 Timing O_DIRECT disk reads:  1160 MB in 3.00 seconds = 386.26 MB/sec
# hdparm -Tt /dev/md1
/dev/md1:
 Timing cached reads:   3956 MB in 1.99 seconds = 1983.03 MB/sec
 Timing buffered disk reads:  826 MB in 3.00 seconds = 275.20 MB/sec

I notice in this case the direct I/O speed is considerably higher; in previous tests, runs with and without --direct have given very similar numbers on the md devices.

The readahead that got assigned to md1 (6 x 396GB partitions, raid0) was smaller than that assigned to md0 (4 x 4GB partitions, raid0):

# cat /proc/mdstat
Personalities : [raid0] [raid6] [raid5] [raid4]
md1 : active raid0 sdf2[5] sde2[4] sdd2[3] sdc2[2] sdb2[1] sda2[0]
      2319094656 blocks 64k chunks
md0 : active raid0 sdf1[3] sde1[2] sdd1[1] sdc1[0]
      16771072 blocks 256k chunks

# blockdev --report
RO    RA   SSZ   BSZ   StartSec     Size    Device
rw  4096   512  4096          0   33542144  /dev/md0
rw  1536   512  4096          0 4638189312  /dev/md1

though the chunk size is different between the two arrays. What units is the RA measured in? Is the difference in speed with and without --direct a concern?
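(A note for later readers on the units question: per blockdev(8), the RA column is counted in 512-byte sectors, so the conversion to KiB is simple arithmetic:)

```shell
# blockdev's RA column is in 512-byte sectors (per blockdev(8)).
ra_sectors=5120                  # the value assigned to md0 above
ra_bytes=$(( ra_sectors * 512 ))
echo "$(( ra_bytes / 1024 ))"    # readahead in KiB
```

So RA=5120 means 2560 KiB (2.5 MiB) of readahead, and RA=256 means 128 KiB.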
With the F7 release coming up and this bug still appearing not to have been fixed, the QA team is labelling this as FC7Target with the intention of making it an FC8Blocker post-release. For now this will be noted as a known issue.
+1. I experienced this on my system, an x86_64 machine with 5 x 500 GB devices in a RAID5. /dev/md0 had decent numbers, but the lvm came with a significant performance hit. Unfortunately the machine is in production, so I can't perform any tests on it. But Andy is by no means alone in experiencing this.
Milan, did these patches ever make it upstream?
There is currently discussion about submitting this (at least the patches in comment #28 are ready in the DM devel queue, but not yet upstream), and there are possible changes in the block layer which could make these patches obsolete. Anyway, this is an important performance problem which must be solved...
Did the previously-mentioned patches make it into 2.6.23? If not, we'll probably need to defer this to Fedora 9.
I'm interested to know too; sometime within the next fortnight or so I will have a small window of opportunity to re-format and test the same machine/disks on which I reported this issue...
The userspace readahead patches are under review. The merge patches (comment #28) are ready but were just deferred to 2.6.25 (by the block layer maintainer).
I knew there was a small performance hit when using LVM over MD raid, but I didn't expect such a *huge* gap. I was only getting ~80MB/s with LVM over a 4xRaid5 array which itself gets ~275MB/s, and 80MB/s is slower than a SINGLE component drive @ ~110MB/s. Obviously unacceptable. Tweaking the LV readahead fixed the problem. Will keep an eye out for these fixes to be made *DEFAULT*. (I realize this is probably a low priority as software raid isn't considered 'enterprise'. rahrah 3ware, etc.)

Below are my benchmarks demonstrating the readahead changes:
-------------------------------------------------------------
[root@nano media]# blockdev --getra /dev/sda5 /dev/md3 /dev/vgr5/home
256
3072
256
[root@nano media]# sync ; echo 1 > /proc/sys/vm/drop_caches ; dd if=/dev/sda5 of=/dev/null bs=4096 count=$((2**30*4/4096))
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 41.0892 s, 105 MB/s
[root@nano media]# sync ; echo 1 > /proc/sys/vm/drop_caches ; dd if=/dev/sda5 of=/dev/null bs=4096 count=$((2**30*4/4096))
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 40.5985 s, 106 MB/s
[root@nano media]# sync ; echo 1 > /proc/sys/vm/drop_caches ; dd if=/dev/md3 of=/dev/null bs=4096 count=$((2**30*4/4096))
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 15.7997 s, 272 MB/s
[root@nano media]# sync ; echo 1 > /proc/sys/vm/drop_caches ; dd if=/dev/md3 of=/dev/null bs=4096 count=$((2**30*4/4096))
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 15.6459 s, 275 MB/s
[root@nano media]# sync ; echo 1 > /proc/sys/vm/drop_caches ; dd if=/dev/vgr5/home of=/dev/null bs=4096 count=$((2**30*4/4096))
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 52.0912 s, 82.5 MB/s
[root@nano media]# sync ; echo 1 > /proc/sys/vm/drop_caches ; dd if=/dev/vgr5/home of=/dev/null bs=4096 count=$((2**30*4/4096))
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 51.8481 s, 82.8 MB/s
[root@nano media]# blockdev --setra 3072 /dev/vgr5/home
[root@nano media]# sync ; echo 1 > /proc/sys/vm/drop_caches ; dd if=/dev/vgr5/home of=/dev/null bs=4096 count=$((2**30*4/4096))
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 16.0691 s, 267 MB/s
[root@nano media]# sync ; echo 1 > /proc/sys/vm/drop_caches ; dd if=/dev/vgr5/home of=/dev/null bs=4096 count=$((2**30*4/4096))
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 16.1722 s, 266 MB/s
Time for a status update perhaps? 2.6.25 is almost out, but are the patches in there?
Changing version to '9' as part of upcoming Fedora 9 GA. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
This is still an issue with the 2.6.25.3 kernel in Fedora 9. I get a paltry 83MB/s with the default readahead of 256, and ~290MB/s after manually setting the higher readahead it should have.

The workaround is to add "blockdev --setra 4096 /dev/dm-*" to /etc/rc.local.
What's the status of these patches? I cannot see them in Linus' HEAD and I cannot find the discussion explaining the holdup either.
- The patches mentioned in comment #28 are now in the upstream kernel.
- The readahead setting should work for lvm stripes (see bug 450922); we should probably add some MD readahead tuning too.

There is still a problem: if an LV over MD is not aligned to the md chunk size (iow the beginning of the LV is shifted compared to the start sector of an md chunk), some IOs are split, which slows down performance. (This is probably the most common situation for performance loss now. It can be fixed manually when creating the VG - just align the metadata to end on the md chunk size, "pvcreate --metadatasize ...". Unfortunately it cannot be fixed if the VG is already created...)

The LV metadata area should probably be aligned by default if the underlying md device has a different chunk size - please report a new bug if you want to track this issue (it is on my todo list). Also readahead should be optimized here. I think we can modify pvcreate to detect the md chunk size.

Closing this against rawhide (where the 2.6.27-rc kernel contains the merge patches above).
Any chance of seeing this in F9? (or F8 for that matter) And a big thanks for finally fixing this. :)

For the second problem, should something like the following be done:

c = md chunk size
pvcreate --metadatasize c*n /dev/md#
vgcreate --physicalextentsize c*m foo /dev/md#

? And what is the size of the metadata? I.e. what's the minimum value of c*n?
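If I understand the arithmetic, it's just rounding the PV data-area start up to a chunk boundary (the exact pvcreate rounding behaviour isn't verified here, so treat the c*n choice as a placeholder). The dmsetup table earlier in this bug showed the LV data starting at sector 384 (192 KiB) against a 256 KiB chunk:

```shell
# Round a PV data-area start up to the next md-chunk boundary.
# Values from this report: default data start = 384 sectors (192 KiB),
# md chunk = 256 KiB = 512 sectors.
pe_start=384
chunk=512
aligned=$(( ( (pe_start + chunk - 1) / chunk ) * chunk ))
echo "$aligned"    # aligned data start, in sectors
```

With these numbers the data start rounds up from 384 to 512 sectors, i.e. 256 KiB, which lands exactly on a chunk boundary.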
I submitted a new bug for the MD alignment issue - please add yourself to the cc if you want to track it:

Bug 460796 - lvm2 should try to align new PV to MD chunksize if possible

(I have a simple patch reading the chunksize from the sysfs MD interface; will try to test it and submit it soon.)
FWIW - this bug may be closed, but the problem is still present in Fedora 10. Reproducing the same steps (from comment #37 above), on the same hardware, on kernel-2.6.27.5-117.fc10.x86_64 produces the same pitifully slow LVM-over-MD results, until I manually increase the readahead for each LV to match the MD.

# default readaheads
[root@nano ~]# blockdev --getra /dev/sda5 /dev/md3 /dev/vgr5/home
256
3072
256

# raw bottom-layer disk device speed is ~106 MB/s
[root@nano ~]# sync ; echo 1 > /proc/sys/vm/drop_caches ; dd if=/dev/sda5 of=/dev/null bs=4096 count=$((2**30*4/4096))
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 40.6367 s, 106 MB/s

# raw middle-layer MD device read rate is ~289 MB/s
[root@nano ~]# sync ; echo 1 > /proc/sys/vm/drop_caches ; dd if=/dev/md3 of=/dev/null bs=4096 count=$((2**30*4/4096))
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 14.8595 s, 289 MB/s

# raw top-layer LVM device read rate is only ~83 MB/s
[root@nano ~]# sync ; echo 1 > /proc/sys/vm/drop_caches ; dd if=/dev/vgr5/home of=/dev/null bs=4096 count=$((2**30*4/4096))
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 52.1008 s, 82.4 MB/s

# increase the LV readahead to match the underlying MD readahead:
[root@nano ~]# blockdev --setra 3072 /dev/vgr5/home

# raw LVM device now comes very close to matching the MD device @ 288 MB/s
[root@nano ~]# sync ; echo 1 > /proc/sys/vm/drop_caches ; dd if=/dev/vgr5/home of=/dev/null bs=4096 count=$((2**30*4/4096))
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 14.9148 s, 288 MB/s

So it seems I'll still have to add "blockdev --setra 4096 /dev/dm-*" to /etc/rc.local as a workaround.
F10 still has the old lvm2 package, which does not align LVs to the MD chunksize. Rawhide already has a new build which should help here (but you have to create the whole mapping from scratch...). (For some reason it was too late to rebase for F10.)
(And you are right that the readahead is not properly increased for LV over MD devices yet; please leave this bug open, we'll eventually fix that :-)
I created new bug #473273 for improving the lvm2 readahead setting (LVM over MD). This bug covered the kernel patches, which are already merged.