Bug 232843

Summary: LVM + mdraid/dmraid = low LVM performance
Product: [Fedora] Fedora
Reporter: Andy Burns <fedora>
Component: kernel
Assignee: Milan Broz <mbroz>
Status: CLOSED RAWHIDE
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 10
CC: agk, cra, dwysocha, farrellj, gilboad, gnomeuser, jbrassow, k.georgiou, mattdm, mbroz, mishu, pierre-bugzilla, prockai, pvrabec, redhat, tmraz, wwoods
Keywords: Reopened
Hardware: All
OS: Linux
URL: https://www.redhat.com/archives/fedora-devel-list/2007-March/msg00884.html
Doc Type: Bug Fix
Last Closed: 2008-11-27 08:31:59 EST
Bug Blocks: 235705
Attachments:
Description                      Flags
Try running this script...       none
md read results                  none
(partial?) md write results      none
lvm read results                 none
lvm write results                none
corrected md-write results       none

Description Andy Burns 2007-03-18 14:57:17 EDT
Description of problem:

Using LVM on top of mdraid (my case), or on top of dmraid (confirmed by David
Nielsen), drastically reduces performance, i.e. by about 50%.

Version-Release number of selected component (if applicable):
2.6.20-1.2990.fc7

an F7/rawhide machine with 6x400GB SATA II disks all
partitioned as 100MB + 399.9GB

/boot is on /dev/sda1 (would be RAID1 across /dev/sd[abc]1 partitions
except for mkinitrd raid1.ko breakage) with /dev/md1 as RAID5 on
/dev/sd[abcdef]2

Then a single LVM vg00 on top of /dev/md1

root and swap as LVs within vg00 and plenty of spare space.

Doing some timings of the various block devices (so far just roughly
with hdparm, bonnie installed for more detail later). Results of
"hdparm -Tt" averaged over a few runs, showing cached and buffered
speeds

/dev/sda1 gives 1995MB/s and 72MB/s which seems quite good for a single spindle
/dev/md1 gives 2040MB/s and 260MB/s also fairly good (I had hoped with
six spindles and parity to get closer to 5x performance than 3.5x)
/dev/mapper/vg00-lv01 (my root fs) gives 2100MB/s and 135MB/s which is
a little disappointing

Is there any slow debug code currently in rawhide that might explain it?
Could I have some bad choices of block sizes between RAID/LVM layers
which are reducing throughput by splitting reads? Anything  else?
Comment 1 Jonathan Earl Brassow 2007-03-20 10:44:35 EDT
Created attachment 150488 [details]
Try running this script...

I'd like to see the attached program run on both the md device and through the
LVM device.  Please capture the output and attach it to this bug.

Example:
# Warning, this will destroy data on the MD device
prompt> perf_matrix.pl if=/dev/md1 iter=5 > md-read-results.txt
prompt> perf_matrix.pl of=/dev/md1 iter=5 > md-write-results.txt
# remake your LVM objects on top of the MD device again
prompt> perf_matrix.pl if=/dev/<VG>/<LV> iter=5 > lvm-read-results.txt
prompt> perf_matrix.pl of=/dev/<VG>/<LV> iter=5 > lvm-write-results.txt

To see performance difference for yourself, you could run:
prompt> perf_matrix.pl diff md-read-results.txt lvm-read-results.txt
Comment 2 Andy Burns 2007-03-20 12:24:30 EDT
Ok, I'll need to reformat and lay out partitions so I have enough space outside
the md device for my root and swap, but that is easy enough as this is purely a
test box; will report back in a few hours.

Comment 3 Jonathan Earl Brassow 2007-03-20 12:45:21 EDT
I've noticed when installing lmbench (a package required by the script), that
the location for some of the binaries is strange.  I usually install the lmbench
rpm like this:

prompt> rpm -Uvh /pub/lmbench-3.0-0.a7.1.el4.rf.i386.rpm --relocate \
    /usr/lib/lmbench/bin/i686-pc-linux-gnu=/usr/bin

Comment 4 Milan Broz 2007-03-20 13:32:36 EDT
Please note that small readahead can cause some performance problems too.
You can try to increase it using blockdev command - see bug 147679.
(Check if /dev/mdX and /dev/mapper/<your device> readahead differ.)

Also please post your md (/proc/mdstat) and dm (dmsetup table) configuration
used in the test above - or you can use the lvmdump command to collect the
information automatically.
Comment 5 Pasi Karkkainen 2007-03-20 17:45:07 EDT
Should readahead values for /dev/sdX, /dev/mdX and LVM volumes be the same? 

What if the values differ? 
Comment 6 Andy Burns 2007-03-20 18:58:59 EDT
(In reply to comment #4)

> Please note that small readahead can cause some performance problems too.
> You can try to increase it using blockdev command - see bug 147679.
> (Check if /dev/mdX and /dev/mapper/<your device> readahead differ.)

original partitions now blown away and re-installing with root on separate disk
to allow the perf_matrix.pl tests
 
> Also please put here your md (/proc/mdstat) and dm configuration (dmsetup table)

all history now, IIRC /proc/mdstat showed 256k chunk size on the raid, and the
LVM was 32MB blocks per anaconda default.

Comment 7 Milan Broz 2007-03-21 05:31:34 EDT
(In reply to comment #5)
> Should readahead values for /dev/sdX, /dev/mdX and LVM volumes be the same? 

It depends - if you have e.g. md raid0 and then simple lvm linear mapping over
it, it should be the same. If the lvm volume (on top of md) has smaller
readahead, sequential read will be slower...
 
> What if the values differ? 
Then you can try to set it to the same value with blockdev --setra.
LVM tools should set it automatically for particular mappings - but this is
not implemented yet.
(I am trying to find out how the readahead is involved in this low
performance... maybe it is not important at all in this case)

Comment 8 Andy Burns 2007-03-21 06:52:12 EDT
OK, that all took a little longer than planned, partly due to time to finish
parity generation on the array and partly because anaconda seems very reluctant
to create raid md devices or LVM VGs when root, boot and swap are not destined
for them.

Re-partitioned the drives so that /dev/sd[abcdef]1 are all 4GB and contain /boot,
root and swap, leaving ~396GB per disk for raid, created as follows. I picked
the same chunk size that anaconda did, rather than mdadm's default.

# mdadm --create /dev/md0 --level=5 --chunk=256 --raid-devices=6 \
    --spare-devices=0 --verbose /dev/sd[abcdef]2

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sda2[0] sdf2[5] sde2[4] sdd2[3] sdc2[2] sdb2[1]
      1932578560 blocks level 5, 256k chunk, algorithm 2 [6/6] [UUUUUU]

# hdparm -Tt /dev/md0

/dev/md0:
 Timing cached reads:   5082 MB in  1.99 seconds = 2549.01 MB/sec
 Timing buffered disk reads:  780 MB in  3.01 seconds = 259.30 MB/sec

So the speed of the array is the same as before, which I'd expect.

# vgs
  VG         #PV #LV #SN Attr   VSize VFree
  VolGroup00   1   0   0 wz--n- 1.80T 1.80T

# pvs
  PV         VG         Fmt  Attr PSize PFree
  /dev/md0   VolGroup00 lvm2 a-   1.80T 1.80T

# lvcreate --name lv00_test --size 20G VolGroup00
  Logical volume "lv00_test" created

# hdparm -Tt /dev/mapper/VolGroup00-lv00_test

/dev/mapper/VolGroup00-lv00_test:
 Timing cached reads:   4310 MB in  1.99 seconds = 2160.97 MB/sec
 Timing buffered disk reads:  374 MB in  3.01 seconds = 124.07 MB/sec

So the LVM LV is still about 50% of the speed of the md array. I must confess
I'd never heard of blockdev before; here is what it shows

# blockdev --report
RO    RA   SSZ   BSZ   StartSec     Size    Device
rw   256   512  4096          0  781422768  /dev/sda
rw   256   512  4096         63    8385867  /dev/sda1
rw   256   512   512    8385930  773031735  /dev/sda2
rw   256   512  4096          0  781422768  /dev/sdb
rw   256   512  4096         63    8385867  /dev/sdb1
rw   256   512   512    8385930  773031735  /dev/sdb2
rw   256   512  4096          0  781422768  /dev/sdc
rw   256   512  4096         63    8385867  /dev/sdc1
rw   256   512   512    8385930  773031735  /dev/sdc2
rw   256   512  4096          0  781422768  /dev/sdd
rw   256   512   512    8385930  773031735  /dev/sdd2
rw   256   512  4096          0  781422768  /dev/sde
rw   256   512   512    8385930  773031735  /dev/sde2
rw   256   512  4096          0  781422768  /dev/sdf
rw   256   512   512    8385930  773031735  /dev/sdf2
rw  5120   512  4096          0 3865157120  /dev/md0
rw   256   512  4096          0   41943040  /dev/dm-0

Again I've never encountered dmsetup, I hope this is the information you wanted.

# dmsetup table /dev/mapper/VolGroup00-lv00_test
0 41943040 linear 9:0 384

Noticing the small readahead value I increased it

# blockdev --setra 5120 /dev/dm-0

This increased the speed of the LV a little

# hdparm -Tt /dev/mapper/VolGroup00-lv00_test

/dev/mapper/VolGroup00-lv00_test:
 Timing cached reads:   4258 MB in  1.99 seconds = 2134.95 MB/sec
 Timing buffered disk reads:  528 MB in  3.01 seconds = 175.33 MB/sec

but it's still only 70%. What would be considered good, >90%?
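
The readahead mismatch in the report above can be picked out mechanically. A minimal sketch, assuming the column layout shown in this comment (RA in the second column, device name in the last); the function name low_ra is made up for illustration:

```shell
#!/bin/sh
# low_ra REF: read `blockdev --report` output on stdin and print every
# device whose readahead (column 2) is below the reference value REF.
# Assumes the column layout shown above; the header line is skipped.
low_ra() {
    awk -v ref="$1" 'NR > 1 && $2 + 0 < ref + 0 { print $NF, $2 }'
}

# Sample lines taken from the report above.
sample='RO    RA   SSZ   BSZ   StartSec     Size    Device
rw  5120   512  4096          0 3865157120  /dev/md0
rw   256   512  4096          0   41943040  /dev/dm-0'

printf '%s\n' "$sample" | low_ra 5120
```

On a live system the same filter would presumably be fed directly, e.g. `blockdev --report | low_ra 5120`, with the md device's readahead as the reference.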

Comment 9 Andy Burns 2007-03-21 06:58:36 EDT
Increasing the readahead of /dev/dm-0 above the size for /dev/md0 (by a factor
of two) didn't further increase the speed.

Comment 10 Andy Burns 2007-03-21 08:21:08 EDT
After the hdparm tests, I removed the LVM configs with lvremove/vgremove/pvremove

couldn't find lmbench built for f7, tried to install v3 from the bitmover
source, but get 

gmake[2]: *** No rule to make target `../SCCS/s.ChangeSet', needed by `bk.ver'. 

any repo have an F7 build?

Comment 11 Andy Burns 2007-03-21 08:49:18 EDT
ok rebuilt from the srpm at
ftp://rpmfind.net/linux/dag/redhat/el5/en/x86_64/SRPMS.dag/lmbench-3.0-0.a7.1.rf.src.rpm

installed with
rpm -Uvh /usr/src/redhat/RPMS/x86_64/lmbench-3.0-0.a7.1.rf.x86_64.rpm --relocate \
    /usr/lib/lmbench/bin/x86_64-linux-gnu=/usr/bin

had to remove ImageMagick and a2ps for conflicts.

tests running now.
Comment 12 Andy Burns 2007-03-21 09:11:46 EDT
Created attachment 150566 [details]
md read results

The md-read test completed and is attached
the md-write test failed with a series of 

Argument "write:" isn't numeric in sort at ./perf_matrix.pl line 191.

but I've attached what output it did create
Comment 13 Andy Burns 2007-03-21 09:12:27 EDT
Created attachment 150567 [details]
(partial?) md write results
Comment 14 Andy Burns 2007-03-21 09:35:14 EDT
Created attachment 150569 [details]
lvm read results
Comment 15 Andy Burns 2007-03-21 10:13:40 EDT
Created attachment 150578 [details]
lvm write results

The LV created was 100GB, should it have been the whole size of the raid
device for a valid comparison?

I re-adjusted the /dev/md-0 readahead to 5120 after re-creating it, before
running the tests, I had to refer to it as /dev/mapper/VG0-LV1 rather than
/dev/VG0/LV1 as suggested.

The same errors did NOT re-occur on the lvm-write test as the md-write test,
I'll re-run the latter just in case ...
Comment 16 Andy Burns 2007-03-21 11:03:17 EDT
Created attachment 150588 [details]
corrected md-write results

OK, bash history reveals I'd fumbled the command to refer to a non-existent
device, here is the corrected version

The diff results between the md-read and lvm-read show nearly all red results,
with 3 green results

The diff results between the md-write and lvm-write show pretty much the same thing,
so apart from the "sweet spots" lvm performance seems to be 30% down on md
performance.

Interesting (but likely not-related to this problem) is the red/green split
between the md-read and md-write, in about 20% of cases writing is faster than
reading, if only I could figure out what the rows/columns signify :-)

Looks like a nice little script that.
Comment 17 Jonathan Earl Brassow 2007-03-21 11:17:36 EDT
Oh, sorry...

The script is varying request sizes and transfer sizes.  Request sizes (the size
of the chunks that are used to write to disk) increase by powers of 2 from left
to right starting at 1kiB.  Transfer sizes (the file size) increase by powers of
2 from top to bottom starting at 1MiB.

So, top-left would simulate small writes to small(ish) files; bottom-right would
be large writes to large files.

You can set tmin/tmax and rmin/rmax to adjust the default test ranges.
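
The axes of the result matrix described above can be sketched as follows. Only the starting values (1kiB and 1MiB) and the doubling come from the description; the upper bounds used here are arbitrary illustration values, not the script's real defaults, and the helper name is made up.

```shell
#!/bin/sh
# powers_of_two START MAX: print START, 2*START, 4*START, ... up to MAX.
powers_of_two() {
    n=$1
    while [ "$n" -le "$2" ]; do
        printf '%s ' "$n"
        n=$((n * 2))
    done
    echo
}

# Columns: request sizes, left to right, starting at 1 KiB.
echo "request sizes (KiB):  $(powers_of_two 1 64)"
# Rows: transfer sizes, top to bottom, starting at 1 MiB.
echo "transfer sizes (MiB): $(powers_of_two 1 16)"
```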
Comment 18 Andy Burns 2007-03-23 05:34:21 EDT
(In reply to comment #17)

> The script is varying request sizes and transfer sizes. 

I've sucked the files into OOo calc and made 3D graphs out of them. The trends
per file/block size seem to follow what you'd expect and have a similar "shape" to
the graphs that Bonnie produces. In general the "shape" of the MD and LVM graphs
is the same, just on a different scale; as I mentioned, there are three points
where LVM beats MD.

Overall it seems that apart from the occasional corner-case, LVM performance is
still 30% down compared to MD across the board, after increasing the LVM
readahead, this seems quite a penalty to me, any further comments/suggestions?

Comment 19 Andy Burns 2007-03-27 09:10:56 EDT
I don't like to sound like I'm expecting someone to come up with an answer for
me, but having piqued interest initially this one has gone a bit quiet, has
anyone got any more comments? 

After adjusting the /dev/md-0 readahead is 30% seen as acceptable overhead for
LVM? Any more good/bad datapoints from anyone else? Have I overlooked supplying
any information requested earlier?

Comment 20 Petr Rockai 2007-03-27 11:43:36 EDT
Sorry, one more thing, please adjust the readahead on the LVM device itself, 
not only the underlying md device. Although i have been testing the scenario 
on a two-disk setup and i could not reproduce any of this slowness. Can you 
try bumping LVM readahead to be same as md0 readahead and try hdparm again? 
Also, does hdparm --direct -tT give significantly different results?
Comment 21 Andy Burns 2007-03-27 13:11:13 EDT
(In reply to comment #20)

> Sorry, one more thing, please adjust the readahead on the LVM device itself, 
> not only the underlying md device. 

That was a typo on my behalf, the md device gets a readahead of 20x the
individual disks, I meant I'd increased the readahead on the /dev/dm-0 not /dev/md0

> Although i have been testing the scenario 
> on a two-disk setup and i could not reproduce any of this slowness. 

Are your two disks raid0 or raid1? I'm quite happy to pull the array apart and
test it in different configurations, rather than just raid5.

> Can you 
> try bumping LVM readahead to be same as md0 readahead and try hdparm again? 
> Also, does hdparm --direct -tT give significantly different results?

My current setup 
/dev/sda1 is boot
/dev/sdb1 is swap
/dev/sd[cdef]1 is md0 as RAID0 for root 
/dev/sd[abcdef]2 is md1 as RAID5 for LVM PV

# blockdev --report
RO    RA   SSZ   BSZ   StartSec     Size    Device
rw   256   512  4096          0  781422768  /dev/sd[abcdef]
rw   256   512  4096         63    8385867  /dev/sd[abcdef]1
rw   256   512   512    8385930  773031735  /dev/sd[abcdef]2
rw  4096   512  4096          0   33542144  /dev/md0
rw  5120   512  4096          0 3865157120  /dev/md1

# hdparm -Tt /dev/md0

/dev/md0:
 Timing cached reads:   3748 MB in  2.00 seconds = 1878.47 MB/sec
 Timing buffered disk reads:  862 MB in  3.00 seconds = 287.33 MB/sec

# hdparm -Tt /dev/md1

/dev/md1:
 Timing cached reads:   3994 MB in  2.00 seconds = 2001.82 MB/sec
 Timing buffered disk reads:  776 MB in  3.00 seconds = 258.42 MB/sec

# hdparm --direct -Tt /dev/md0

/dev/md0:
 Timing O_DIRECT cached reads:   870 MB in  2.00 seconds = 434.77 MB/sec
 Timing O_DIRECT disk reads:  782 MB in  3.00 seconds = 260.47 MB/sec

# hdparm --direct -Tt /dev/md1

/dev/md1:
 Timing O_DIRECT cached reads:   340 MB in  2.01 seconds = 169.57 MB/sec
 Timing O_DIRECT disk reads:  802 MB in  3.01 seconds = 266.52 MB/sec

Then I pvcreate/vgcreate/lvcreate on top of /dev/md1 to create /dev/dm-0 which
has initial readahead of 256

# hdparm -Tt /dev/dm-0

/dev/dm-0:
 Timing cached reads:   3788 MB in  2.00 seconds = 1898.27 MB/sec
 Timing buffered disk reads:  370 MB in  3.01 seconds = 123.00 MB/sec

# hdparm --direct -Tt /dev/dm-0

/dev/dm-0:
 Timing O_DIRECT cached reads:   210 MB in  2.00 seconds = 104.87 MB/sec
 Timing O_DIRECT disk reads:  620 MB in  3.01 seconds = 206.09 MB/sec

After increasing the readahead on /dev/dm-0 to 5120 (same as /dev/md1)

# hdparm -Tt /dev/dm-0

/dev/dm-0:
 Timing cached reads:   3752 MB in  2.00 seconds = 1880.47 MB/sec
 Timing buffered disk reads:  520 MB in  3.00 seconds = 173.24 MB/sec

# hdparm --direct -Tt /dev/dm-0

/dev/dm-0:
 Timing O_DIRECT cached reads:   216 MB in  2.02 seconds = 107.03 MB/sec
 Timing O_DIRECT disk reads:  622 MB in  3.01 seconds = 206.73 MB/sec

Within one or two MB/s these numbers are consistent with previous runs, though I
didn't try the --direct tests before.

Tried increasing the readahead on /dev/sd[abcdef]2 to 512
and on /dev/md1 and /dev/dm-0 to 10240; this made no significant difference from the
final numbers shown above.

Need any further details of the hardware?

Comment 22 Petr Rockai 2007-03-27 17:22:28 EDT
Would be nice to know if this is 32-bit or 64-bit installation. I have skimmed 
through the report and comments and can't see that info. It seems your problem 
is not related to readahead, after all.

My test environment was (md) raid0 (mdadm defaults) on top of 2 SCSI drives 
and device-mapper linear mapping (same as yours). I have used different kernel 
though, so i can try to reproduce with rawhide version, just need to know 
architecture.

Thanks.
Comment 23 Andy Burns 2007-03-27 17:47:05 EDT
this is intel x86_64

I was just waiting for new raid arrays to initialise,

I created /dev/md2 as 800GB raid0 and /dev/md3 as 400GB raid1
then layered PV/VG/LV on top of each of them,

these got speeds of ~140MB/s and ~70MB/s from the md *and* the dm devices,
it seems the speed loss only occurs with LVM on top of RAID5

Are you able to test with three disks? I'll try with 3 instead of 6 to see if
the slowdown still exists, or is less pronounced.
Comment 24 Milan Broz 2007-03-27 18:16:34 EDT
Yes, if you can, please try 3 disk array too.
You are using standard rawhide x86_64 kernel (2.6.20), right ?
I think that this can be x86_64 problem only, so I will focus on this arch.

And thanks for all testing ! 
Comment 25 Andy Burns 2007-03-27 19:30:42 EDT
yep, 2.6.20-1.3023.fc7 x86_64

just got a 3 disk raid5 initialising now
Comment 26 Andy Burns 2007-03-28 01:52:03 EDT
Results with three disk raid5.

The md device was created with readahead=512, the LV was created with
readahead=256, so I increased that to 512. It gets ~140MB/s on all hdparm tests,
which is fine.

# hdparm -Tt /dev/md4

/dev/md4:
 Timing cached reads:   4024 MB in  1.99 seconds = 2017.05 MB/sec
 Timing buffered disk reads:  430 MB in  3.00 seconds = 143.22 MB/sec

# hdparm -Tt /dev/dm-0

/dev/dm-0:
 Timing cached reads:   4074 MB in  1.99 seconds = 2042.14 MB/sec
 Timing buffered disk reads:  430 MB in  3.00 seconds = 143.14 MB/sec

# hdparm --direct -Tt /dev/md4

/dev/md4:
 Timing O_DIRECT cached reads:   650 MB in  2.00 seconds = 324.24 MB/sec
 Timing O_DIRECT disk reads:  430 MB in  3.00 seconds = 143.10 MB/sec

# hdparm --direct -Tt /dev/dm-0

/dev/dm-0:
 Timing O_DIRECT cached reads:   168 MB in  2.00 seconds =  83.81 MB/sec
 Timing O_DIRECT disk reads:  430 MB in  3.00 seconds = 143.21 MB/sec

When I was using the six disk array, the md device got a default readahead of
5120, I set the LV to the same, could this somehow be _too_ big?

Does your x86_64 specific theory still hold up?

Comment 27 Milan Broz 2007-03-29 08:29:16 EDT
I have simulated significant slowdown on virtual machines with 6 discs & md
raid5 on both archs i386 & x86_64 with O_DIRECT and zero readahead.

There is problem with big block read operations only (hdparm -tT internally uses
read of 2MB buffers).

Further investigation will follow.
Comment 28 Milan Broz 2007-04-04 06:08:34 EDT
The slowdown is caused by an explicit restriction to one-page I/O requests in the dm code.
The MD layer can process bigger requests directly.

Changes are needed in core device mapper to remove this safe restriction
completely; these patches in the dm devel queue are an attempt to solve this:
http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/dm-introduce-merge_bvec_fn.patch
http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/dm-linear-add-merge.patch
http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/dm-table-remove-merge_bvec-sector-restriction.patch

Please note that these patches are under review and are not suitable for stable
system yet.
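
To get a feel for the scale of the one-page restriction, here is a small arithmetic sketch. It assumes a 4096-byte page (usual on i386 and x86_64, but an assumption here) and uses the 2MB hdparm buffer from comment 27 and the 256k md chunk from comment 8. The chunk-sized figure is only a point of comparison; the request sizes md actually issues are not stated in this report.

```shell
#!/bin/sh
# requests_per_read READ_BYTES REQUEST_BYTES: number of requests needed
# when one read is split into fixed-size pieces (rounded up).
requests_per_read() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

page=4096                       # assumed page size
read_size=$((2 * 1024 * 1024))  # 2 MiB buffer, per comment 27
chunk=$((256 * 1024))           # 256k md chunk, per comment 8

echo "page-sized requests:  $(requests_per_read $read_size $page)"
echo "chunk-sized requests: $(requests_per_read $read_size $chunk)"
```

Under these assumptions a single 2 MiB read becomes 512 page-sized requests, against 8 if it were split on chunk boundaries.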
Comment 29 Andy Burns 2007-04-06 14:55:24 EDT
Doing some testing using RAID0 instead of RAID5 on the same hardware, ignoring
LVM so far, just testing the /dev/md1 speed

# hdparm -Tt --direct /dev/md1

/dev/md1:
 Timing O_DIRECT cached reads:   1144 MB in  2.00 seconds = 571.63 MB/sec
 Timing O_DIRECT disk reads:  1160 MB in  3.00 seconds = 386.26 MB/sec

# hdparm -Tt /dev/md1

/dev/md1:
 Timing cached reads:   3956 MB in  1.99 seconds = 1983.03 MB/sec
 Timing buffered disk reads:  826 MB in  3.00 seconds = 275.20 MB/sec

I notice in this case the direct I/O speed is considerably higher; in previous
tests, running with and without --direct has given very similar numbers on the md devices

The readahead that got assigned to the md1 (6 x 396GB partitions raid0) device
was smaller than that assigned to md0 (4 x 4GB partitions raid0)

# cat /proc/mdstat
Personalities : [raid0] [raid6] [raid5] [raid4]
md1 : active raid0 sdf2[5] sde2[4] sdd2[3] sdc2[2] sdb2[1] sda2[0]
      2319094656 blocks 64k chunks

md0 : active raid0 sdf1[3] sde1[2] sdd1[1] sdc1[0]
      16771072 blocks 256k chunks

# blockdev --report
RO    RA   SSZ   BSZ   StartSec     Size    Device
rw  4096   512  4096          0   33542144  /dev/md0
rw  1536   512  4096          0 4638189312  /dev/md1

though the chunk size is different between the two arrays, what units is the RA
measured in?

Is the difference in speed with and without --direct a concern?
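
On the units question: blockdev reports and sets readahead in 512-byte sectors. The sketch below converts the values seen in this report to KiB. The second helper computes a readahead covering two full stripes; that md chooses exactly this value is an inference from the numbers in this thread (they all match), not something stated in it, and both function names are made up for illustration.

```shell
#!/bin/sh
# ra_to_kib SECTORS: convert a readahead value in 512-byte sectors to KiB.
ra_to_kib() {
    echo $(( $1 * 512 / 1024 ))
}

# two_stripes_ra DATA_DISKS CHUNK_KIB: readahead, in 512-byte sectors,
# that covers two full stripes (inferred rule, see the note above).
two_stripes_ra() {
    echo $(( 2 * $1 * $2 * 1024 / 512 ))
}

echo "256 sectors  = $(ra_to_kib 256) KiB"
echo "5120 sectors = $(ra_to_kib 5120) KiB"
echo "6-disk raid5, 256k chunk (5 data disks): $(two_stripes_ra 5 256)"
echo "6-disk raid0, 64k chunk:                 $(two_stripes_ra 6 64)"
echo "4-disk raid0, 256k chunk:                $(two_stripes_ra 4 256)"
```

The last three lines give 5120, 1536 and 4096 sectors, which are the RA values blockdev reported for those arrays in comments 8 and 29.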
Comment 30 David Nielsen 2007-05-11 12:26:46 EDT
With the F7 release coming up and this bug still appearing not to have been
fixed, the QA team is labeling this as FC7Target with the intention to make it an
FC8Blocker post release. For now this will be noted as a known issue.
Comment 31 Pierre Ossman 2007-07-12 00:05:33 EDT
+1

I experienced this on my system. A x86_64 machine with 5 x 500 GB devices in a
RAID5. /dev/md0 had decent numbers, but the lvm came with a significant
performance hit.

Unfortunately the machine is in production, so I can't perform any tests on it.
But Andy is by no means alone in experiencing this.
Comment 32 Will Woods 2007-09-12 16:59:40 EDT
Milan, did these patches ever make it upstream?
Comment 33 Milan Broz 2007-09-12 17:59:47 EDT
Currently there is discussion about submitting this (at least patches in comment
#28 are ready in DM devel queue but not yet upstream).
And there are possible changes in block layer which can make these patches obsolete.

Anyway, this is important performance problem which must be solved...


Comment 34 Will Woods 2007-10-18 11:31:07 EDT
Did the previously-mentioned patches make it into 2.6.23? If not, we'll probably
need to defer this to Fedora 9.
Comment 35 Andy Burns 2007-10-18 15:24:37 EDT
I'm interested to know too, sometime within the next fortnight or so I will have
a small window of opportunity to re-format and test the same machine/disks on
which I reported this issue ...

Comment 36 Milan Broz 2007-10-18 15:57:16 EDT
Userspace readahead patches are under review.

Merge patches (comment #28) are ready but were just deferred to 2.6.25 (by block
layer maintainer).
Comment 37 Jason Farrell 2008-03-15 23:03:49 EDT
I knew there was a small performance hit when using LVM over MD raid, but I
didn't expect such a *huge* gap. I was only getting ~80MB/s with LVM over a
4xRaid5 array which itself gets ~275MB/s, and 80MB/s was slower than a SINGLE
component drive @ ~110MB/s. Obviously unacceptable.

Tweaking the LV readahead fixed the problem. Will keep an eye out for these
fixes to be made *DEFAULT*. (I realize this is probably a low priority as
software raid isn't considered 'enterprise'. rahrah 3ware, etc.)

below are my benchmarks demonstrating the readahead changes:
-------------------------------------------------------------
[root@nano media]# blockdev --getra /dev/sda5 /dev/md3 /dev/vgr5/home
256
3072
256
 
[root@nano media]# sync ; echo 1 > /proc/sys/vm/drop_caches ; dd if=/dev/sda5 of=/dev/null bs=4096 count=$((2**30*4/4096))
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 41.0892 s, 105 MB/s
[root@nano media]# sync ; echo 1 > /proc/sys/vm/drop_caches ; dd if=/dev/sda5 of=/dev/null bs=4096 count=$((2**30*4/4096))
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 40.5985 s, 106 MB/s
[root@nano media]# sync ; echo 1 > /proc/sys/vm/drop_caches ; dd if=/dev/md3 of=/dev/null bs=4096 count=$((2**30*4/4096))
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 15.7997 s, 272 MB/s
[root@nano media]# sync ; echo 1 > /proc/sys/vm/drop_caches ; dd if=/dev/md3 of=/dev/null bs=4096 count=$((2**30*4/4096))
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 15.6459 s, 275 MB/s
[root@nano media]# sync ; echo 1 > /proc/sys/vm/drop_caches ; dd if=/dev/vgr5/home of=/dev/null bs=4096 count=$((2**30*4/4096))
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 52.0912 s, 82.5 MB/s
[root@nano media]# sync ; echo 1 > /proc/sys/vm/drop_caches ; dd if=/dev/vgr5/home of=/dev/null bs=4096 count=$((2**30*4/4096))
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 51.8481 s, 82.8 MB/s
 
[root@nano media]# blockdev --setra 3072 /dev/vgr5/home
[root@nano media]# sync ; echo 1 > /proc/sys/vm/drop_caches ; dd if=/dev/vgr5/home of=/dev/null bs=4096 count=$((2**30*4/4096))
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 16.0691 s, 267 MB/s
[root@nano media]# sync ; echo 1 > /proc/sys/vm/drop_caches ; dd if=/dev/vgr5/home of=/dev/null bs=4096 count=$((2**30*4/4096))
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 16.1722 s, 266 MB/s
Comment 38 Pierre Ossman 2008-04-13 11:30:54 EDT
Time for a status update perhaps? 2.6.25 is almost out, but are the patches in
there?
Comment 39 Bug Zapper 2008-05-13 22:40:36 EDT
Changing version to '9' as part of upcoming Fedora 9 GA.
More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Comment 40 Jason Farrell 2008-05-15 16:52:25 EDT
This is still an issue with the 2.6.25.3 kernel in Fedora 9. I get a paltry
83MB/s with the default readahead of 256, and ~290MB/s after manually setting
the higher readahead it should have.

Workaround is to add "blockdev --setra 4096 /dev/dm-*" to /etc/rc.local
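
A slightly more general form of this workaround might look like the sketch below: it applies one readahead value to a list of dm devices and has a dry-run mode so it can be checked without touching anything. The function name and the DRY_RUN variable are made up for illustration; only `blockdev --setra` and `blockdev --getra` are real commands used elsewhere in this report.

```shell
#!/bin/sh
# set_dm_readahead RA DEVICE...: set the readahead of each device to RA
# (in 512-byte sectors). With DRY_RUN=1 the commands are only printed.
set_dm_readahead() {
    ra=$1
    shift
    for dev in "$@"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "blockdev --setra $ra $dev"
        else
            blockdev --setra "$ra" "$dev"
        fi
    done
}

# Dry run against example device names.
DRY_RUN=1
set_dm_readahead 3072 /dev/dm-0 /dev/dm-1
```

In /etc/rc.local the value could presumably be taken from the underlying array instead of being hard-coded, e.g. `set_dm_readahead "$(blockdev --getra /dev/md3)" /dev/dm-*`, so the LVs follow whatever readahead md picked.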
Comment 41 Pierre Ossman 2008-07-15 15:43:30 EDT
What's the status of these patches? I cannot see them in Linus' HEAD and I
cannot find the discussion explaining the holdup either.
Comment 42 Milan Broz 2008-08-26 07:05:19 EDT
- Patches mentioned in comment #28 are now in upstream kernel.

- readahead setting should work for lvm stripe (see bug 450922); we should probably add some MD readahead tuning setting too.

There is still a problem: if the LV over MD is not aligned to the md chunk size
(in other words, the beginning of the LV is shifted relative to the start sector of an md chunk), some IOs are split, which slows down performance.
(This is probably the most common cause of performance loss now.
It can be fixed manually when creating the VG - just align the metadata to end on an md chunk boundary with "pvcreate --metadatasize ...". Unfortunately it cannot be fixed if the VG is already created...)

The LV metadata area should probably be aligned by default if the underlying md device has a different chunk size - please report a new bug if you want to track this issue (it is on my todo list). Also readahead should be optimized here.
I think we can modify pvcreate to detect the md chunk size.

Closing this as rawhide (which has the 2.6.27-rc kernel containing the merge patches above).
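
The alignment condition can be checked with the numbers already in this report: the dmsetup table in comment 8 shows the LV starting at sector 384 of md0, and the array uses a 256k chunk. A minimal sketch (the function name is made up for illustration):

```shell
#!/bin/sh
# chunk_offset START_SECTOR CHUNK_KIB: distance, in 512-byte sectors, of a
# start sector from the preceding md chunk boundary (0 means aligned).
chunk_offset() {
    chunk_sectors=$(( $2 * 1024 / 512 ))
    echo $(( $1 % chunk_sectors ))
}

# Values from comment 8: LV data starts at sector 384, chunk size is 256k.
off=$(chunk_offset 384 256)
if [ "$off" -eq 0 ]; then
    echo "aligned"
else
    echo "misaligned by $off sectors ($((off / 2)) KiB)"
fi
```

With those values the LV begins 192 KiB into a 256 KiB chunk, so it is not aligned; a start at sector 512 (or any multiple of it) would be.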
Comment 43 Pierre Ossman 2008-08-26 11:40:28 EDT
Any chance of seeing this in F9? (or F8 for that matter)

And a big thanks for finally fixing this. :)


For the second problem, should something like the following be done:

c = md chunk size

pvcreate --metadatasize c*n /dev/md#
vgcreate --physicalextentsize c*m foo /dev/md#

?

And what is the size of the metadata? I.e. what's the minimum value of c*n?
Comment 44 Milan Broz 2008-09-01 04:44:32 EDT
I submitted a new bug for the MD alignment issue - please add yourself to the CC list if you want to track it.

Bug 460796 - lvm2 should try to align new PV to MD chunksize if possible

(I have simple patch reading chunksize from sysfs MD interface, will try to test it and submit it soon.)
Comment 45 Jason Farrell 2008-11-26 22:28:43 EST
FWIW - this bug may be closed, but this problem is still present in Fedora 10.

Reproducing the same steps (from comment #37 above), on the same hardware, on kernel-2.6.27.5-117.fc10.x86_64 produces the same pitifully slow LVM over MD results, until I manually increase the readahead for each LV to match the MD.

# default readaheads
[root@nano ~]# blockdev --getra /dev/sda5 /dev/md3 /dev/vgr5/home
256
3072
256

# raw bottom-layer disk device speed is ~106 MB/s
[root@nano ~]# sync ; echo 1 > /proc/sys/vm/drop_caches ; dd if=/dev/sda5 of=/dev/null bs=4096 count=$((2**30*4/4096))
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 40.6367 s, 106 MB/s

# raw middle-layer MD device read rate is ~289 MB/s
[root@nano ~]# sync ; echo 1 > /proc/sys/vm/drop_caches ; dd if=/dev/md3 of=/dev/null bs=4096 count=$((2**30*4/4096))
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 14.8595 s, 289 MB/s

# raw top-layer LVM device read rate is only ~83 MB/s
[root@nano ~]# sync ; echo 1 > /proc/sys/vm/drop_caches ; dd if=/dev/vgr5/home of=/dev/null bs=4096 count=$((2**30*4/4096))
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 52.1008 s, 82.4 MB/s


# increase the LV readahead to match the underlying MD readahead:
[root@nano ~]# blockdev --setra 3072 /dev/vgr5/home

# raw LVM device now comes very close to matching the MD device @ 288 MB/s
[root@nano ~]# sync ; echo 1 > /proc/sys/vm/drop_caches ; dd if=/dev/vgr5/home of=/dev/null bs=4096 count=$((2**30*4/4096))
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 14.9148 s, 288 MB/s


so, it seems I'll still be having to add  "blockdev --setra 4096 /dev/dm-*"  to /etc/rc.local as a workaround.
Comment 46 Milan Broz 2008-11-27 05:13:32 EST
F10 still has the old lvm2 package, which does not align LV volumes to the MD chunk size.

rawhide already has a new build which should help here (but you have to create the whole mapping from scratch...).

(For some reason it was too late to rebase for F10.)
Comment 47 Milan Broz 2008-11-27 05:19:20 EST
(and you are right that the readahead is not properly increased for LV over MD devices yet; please leave this bug open, we will eventually fix that :-)
Comment 48 Milan Broz 2008-11-27 08:31:59 EST
I created new bug #473273 for improving lvm2 readahead setting (LVM over MD).

This bug covered kernel patches which are already merged.