Bug 510861

Summary:          Storage performance regression between RHEL 4 Update 3 and RHEL 5 Update 3
Product:          Red Hat Enterprise Linux 5
Component:        kernel
Version:          5.3
Hardware:         All
OS:               Linux
Status:           CLOSED INSUFFICIENT_DATA
Severity:         urgent
Priority:         low
Reporter:         Jean Blouin <blouin>
Assignee:         Jeff Moyer <jmoyer>
QA Contact:       Red Hat Kernel QE team <kernel-qe>
CC:               agk, cluster-maint, dwysocha, edamato, heinzm, jbrassow, jmoyer, mbroz, msnitzer, prockai
Target Milestone: rc
Target Release:   ---
Keywords:         Reopened
Doc Type:         Bug Fix
Last Closed:      2009-11-29 19:47:47 UTC

Description Jean Blouin 2009-07-11 19:00:08 UTC
Description of problem: Storage performance regression between RHEL 4 Update 3 and RHEL 5 Update 3

We are noticing a large slowdown in performance with a Fibre Channel disk array configured with LVM. When we compare performance on RHEL 4 Update 3 to RHEL 5 Update 3, we see a large regression.

Performance goes from 678 MB/s to 372 MB/s.

Our configuration uses 4 hardware RAID physical volumes (Xyratek 6412 storage with a 4Gb ATTO Fibre Channel HBA); we configure these devices as one logical volume.
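
For reference, a striped logical volume over four physical volumes like this would typically be created along the following lines; the device names and stripe size below are placeholders, not our exact commands:

  # create physical volumes on the four hardware RAID LUNs
  pvcreate /dev/sdb /dev/sdc /dev/sdd /dev/sde
  # group them into one volume group
  vgcreate vg00 /dev/sdb /dev/sdc /dev/sdd /dev/sde
  # create a single logical volume striped across all four PVs
  # (-i 4 = four stripes, -I 256 = 256 KB stripe size)
  lvcreate -i 4 -I 256 -l 100%FREE -n lvol1 vg00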

Here are the results 


I did some tests on an xw8600 configured with 2 loops XR+XE comparing what "dd" could read directly from the devices.  The goal was to eliminate LVM (and mdadm) as well as xfs from the equation.

I believe the results indicate that we are experiencing an issue due to changes in LVM in RHEL5u3.

Here's the data:


Under RHEL4u3, running four dd processes, each reading a separate underlying LUN directly, and then running a single dd reading the LVM raid0 device, we see the following:
> (FOUR TIMES)  dd if=/dev/sdb of=/dev/null iflag=direct,nonblock bs=512K
= 824 MB/s
> iostat -x
avg-cpu:  %user   %nice    %sys %iowait   %idle
           0.00    0.00    1.50   23.60   74.91
Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb          0.00   0.00 470.00  0.00 481280.00    0.00 240640.00     0.00  1024.00     0.97    2.07   2.07  97.20
sdc          0.00   0.00 313.00  0.00 320512.00    0.00 160256.00     0.00  1024.00     0.98    3.14   3.14  98.40
sdd          0.00   0.00 312.00  0.00 319488.00    0.00 159744.00     0.00  1024.00     0.98    3.13   3.13  97.60
sde          0.00   0.00 468.00  0.00 479232.00    0.00 239616.00     0.00  1024.00     0.98    2.09   2.09  97.80
> dd if=/dev/vg00/lvol1 of=/dev/null iflag=direct,nonblock bs=4096K
= 678 MB/s
> iostat -x
avg-cpu:  %user   %nice    %sys %iowait   %idle
           0.12    0.00    2.25   10.49   87.14
Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb        4830.00   0.00 322.00  0.00 329728.00    0.00 164864.00     0.00  1024.00     1.86    5.80   2.99  96.20
sdc        4830.00   0.00 322.00  0.00 329728.00    0.00 164864.00     0.00  1024.00     1.57    4.88   2.96  95.30
sdd        4830.00   0.00 322.00  0.00 329728.00    0.00 164864.00     0.00  1024.00     1.85    5.77   2.98  95.80
sde        4830.00   0.00 322.00  0.00 329728.00    0.00 164864.00     0.00  1024.00     1.54    4.81   2.94  94.60


Under RHEL5u3, again running four dd processes, each reading a separate underlying LUN directly, and then running a single dd reading the LVM raid0 device, we see the following:
> (FOUR TIMES) dd if=/dev/sdd of=/dev/null iflag=direct,nonblock bs=512K
= 824 MB/s
> iostat -x
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    1.50   23.47    0.00   75.03
Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdc               0.00     0.00 310.00  0.00 317440.00     0.00  1024.00     0.98    3.17   3.16  98.00
sdd               0.00     0.00 463.00  0.00 474112.00     0.00  1024.00     0.98    2.12   2.12  98.00
sde               0.00     0.00 417.00  0.00 427008.00     0.00  1024.00     0.98    2.36   2.36  98.30
sdf               0.00     0.00 364.00  0.00 372736.00     0.00  1024.00     0.98    2.68   2.68  97.60
> dd if=/dev/vg00/lvol1 of=/dev/null iflag=direct,nonblock bs=2048K
= 372 MB/s
> iostat -x
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    1.75   11.86    0.00   86.39
Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdc            2313.00     0.00 550.00  0.00 182720.00     0.00   332.22     0.83    1.51   1.05  57.60
sdd            2374.00     0.00 489.00  0.00 182336.00     0.00   372.88     0.96    1.95   1.37  66.90
sde            2299.00     0.00 564.00  0.00 182912.00     0.00   324.31     0.86    1.54   1.09  61.50
sdf            2303.00     0.00 559.00  0.00 182400.00     0.00   326.30     0.97    1.72   1.06  59.50


Interpretation:
In both RHEL4u3 and RHEL5u3, we see that dd reading directly from each LUN device gives pretty much identical results.
However, in RHEL5u3, when dd reads from the LVM device, performance is more than 40% slower.
This indicates to me that we can concentrate our efforts on finding what changed in LVM when upgrading to RHEL5u3.


Please note that these numbers do not correspond to what we should expect when reading from a real filesystem using our benchmarking tools, but they do show that there seems to be a problem with LVM.


Also, note that if I increase the block size used by dd in the tests where we are reading from the LVM device, we do see very high throughput.  E.g., if I set bs=4Gb, we see amazing bandwidth.  I do not understand what is going on in this case, however; a block size of 4Gb is ridiculously high and must be getting split into smaller chunks.  Any theories on this are welcome.



Version-Release number of selected component (if applicable):
system-config-lvm-1.1.5-1.0.el5
lvm2-2.02.40-6.el5

Comment 1 Jean Blouin 2009-07-15 15:02:09 UTC
We have found a workaround for the above problem. By switching the I/O scheduler from CFQ to deadline, we were able to improve read performance on both RHEL 4 Update 3 and RHEL 5 Update 3.
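
For reference, the scheduler can be switched per device at runtime or set globally at boot; sdb below is just a placeholder for each LUN, not our exact device:

  # show the available schedulers; the active one is shown in brackets
  cat /sys/block/sdb/queue/scheduler
  # switch this device to the deadline scheduler at runtime (repeat for each LUN)
  echo deadline > /sys/block/sdb/queue/scheduler
  # or set it for all devices at boot by appending this to the kernel line in grub.conf:
  #   elevator=deadline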

The problem, as we understand it now, is that the Completely Fair Queuing (CFQ) I/O scheduler in RHEL 4 Update 3 did a much better job of merging disk requests than the version in RHEL 5 Update 3.

The following knowledge base article helped us resolve the issue:

http://kbase.redhat.com/faq/docs/DOC-15355

The Completely Fair Queuing (cfq) scheduler in RHEL5 appears to have worse I/O read performance than RHEL4.

This bug can be closed

Comment 2 Mike Snitzer 2009-08-06 16:22:29 UTC

*** This bug has been marked as a duplicate of bug 448130 ***

Comment 3 Jeff Moyer 2009-08-06 16:50:57 UTC
(In reply to comment #0)

> I did some tests on an xw8600 configured with 2 loops XR+XE comparing what "dd"
> could read directly from the devices.  The goal was to eliminate LVM (and
> mdadm) as well as xfs from the equation.
> 
> I believe the results indicate that we are experiencing an issue due to changes
> in LVM in RHEL5u3.
> 
> Here's the data:
> 
> 
> Under RHEL4u3, running four dd processes, each reading a separate underlying LUN
> directly, and then running a single dd reading the LVM raid0 device, we see the
> following:
> > (FOUR TIMES)  dd if=/dev/sdb of=/dev/null iflag=direct,nonblock bs=512K

iflag=direct isn't supported by the dd shipped with RHEL 4.  Did you update your coreutils package to a version that does support this flag?

Is your LVM a striped logical volume (I presume)?

Comment 4 Mike Snitzer 2009-08-06 18:14:46 UTC
Not closing as a dup of 448130 ... DM uses a single queue to dispatch IO to the N devices in a striped LV.  448130 is concerned with using multiple threads that are interleaving i/o on behalf of a common task.  

I'm not convinced this issue isn't somehow related to 448130 but I'll separate the bugs regardless.  However, this bug seems like a kernel (cfq) bug not an lvm2 bug.  Switching from cfq to deadline apparently resolves the reporter's issue.

Comment 5 Jeff Moyer 2009-08-06 18:26:20 UTC
Jean,

If it wouldn't be an inconvenience, could you please re-run your single dd over the logical volume and get me blktrace data?  You'd run something like:

blktrace -d /dev/<logical-volume>

in a directory not on said volume.  Then run your test.  When the test is complete, kill off blktrace and upload the resulting data files somewhere (if they're not too big, you can attach them to this bugzilla).
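
For example, a complete capture might look something like the following (assuming the logical volume is /dev/vg00/lvol1 and /tmp is not on that volume):

  cd /tmp
  # start tracing the logical volume; data files are written as lvol1.blktrace.* here
  blktrace -d /dev/vg00/lvol1 -o lvol1 &
  # run the test while the trace is running
  dd if=/dev/vg00/lvol1 of=/dev/null iflag=direct,nonblock bs=2048K
  # stop blktrace and, optionally, generate a human-readable summary
  kill %1
  blkparse -i lvol1 > lvol1.txt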

If you can't do this for some reason, I'll try to reproduce here, but that will take some extra time.

Thanks!

Comment 6 Jeff Moyer 2009-08-06 18:50:26 UTC
(In reply to comment #5)
> Jean,
> 
> If it wouldn't be an inconvenience, could you please re-run your single dd over
> the logical volume and get me blktrace data?  You'd run something like:
> 
> blktrace -d /dev/<logical-volume>

Actually, it would be better if you just got the blktrace data for a single one of the underlying LUNs while running the dd against the whole raid.

Thanks!

Comment 7 Jean Blouin 2009-08-08 00:35:56 UTC
(In reply to comment #3)
> (In reply to comment #0)
> 
> > I did some tests on an xw8600 configured with 2 loops XR+XE comparing what "dd"
> > could read directly from the devices.  The goal was to eliminate LVM (and
> > mdadm) as well as xfs from the equation.
> > 
> > I believe the results indicate that we are experiencing an issue due to changes
> > in LVM in RHEL5u3.
> > 
> > Here's the data:
> > 
> > 
> > Under RHEL4u3, running four dd processes, each reading a separate underlying LUN
> > directly, and then running a single dd reading the LVM raid0 device, we see the
> > following:
> > > (FOUR TIMES)  dd if=/dev/sdb of=/dev/null iflag=direct,nonblock bs=512K
> 
> iflag=direct isn't supported by the dd shipped with RHEL 4.  Did you update
> your coreutils package to a version that does support this flag?
> 
> Is your LVM a striped logical volume (I presume)?  

Hi Jeffrey,

Yes, we updated dd to a version that supports the direct flag.
I will be out of the office for the next 2 weeks, so I have forwarded your questions to the engineer who actually ran the tests. He will be able to answer them more precisely.

As I mentioned above, by switching the I/O scheduler from CFQ to deadline we effectively resolved the issue for us. You are right that the problem is most likely related to a bug in the kernel CFQ scheduler.

Thanks,

Jean

Comment 8 Jeff Moyer 2009-08-24 14:40:43 UTC
I'm still waiting for blktrace data and details on the lvm volume used for testing.

Comment 9 Jeff Moyer 2009-09-10 20:14:56 UTC
I have one further question.  What is the value for /sys/block/sdX/queue/max_sectors_kb for RHEL 4 and RHEL 5?
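
For example, something like this on each system would show the value for all of the LUNs (device names are placeholders):

  for dev in sdb sdc sdd sde; do
      echo -n "$dev: "
      cat /sys/block/$dev/queue/max_sectors_kb
  done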

Comment 10 Jeff Moyer 2009-09-30 17:18:08 UTC
I'm still waiting for information here.  If you can't provide any data, then I'm not going to be able to help you!

Comment 11 Jeff Moyer 2009-11-02 16:25:47 UTC
I've posted a test kernel to the following location:
  http://people.redhat.com/jmoyer/cfq-cc

My test system is bandwidth limited, but it did show a speedup for CFQ when striping over two disks.  Could you please give this kernel a try with the CFQ I/O scheduler and report your results?  I'd appreciate it.

Comment 12 Jeff Moyer 2009-11-29 19:47:47 UTC
Given that there have been no updates from the reporter in the past several months, I'm closing this bug.  In the event that the requested information is provided, I'll reopen the bug and we can take it from there.