Dump and other large-file operations are up to 7 times slower than under previous kernels in RHEL 4.x. This is due to a bug in the kernel I/O scheduler's "cfq" implementation. Please see the following two Kernel.org bugs for details on this issue:

Dump of ext3 runs very slowly: http://bugzilla.kernel.org/show_bug.cgi?id=8636
Unusable system (i.e. slow) when copying large files: http://bugzilla.kernel.org/show_bug.cgi?id=7372

Until our heroes at Kernel.org correct this problem, would you consider implementing the following workaround in our kernels? Modify the kernel's .config from:

CONFIG_DEFAULT_IOSCHED="cfq"

to:

CONFIG_DEFAULT_IOSCHED="anticipatory"

Many thanks,
-T
Hi,

Found an easier workaround. Just add "elevator=as" to the end of the "kernel" line in grub.conf. For example:

kernel /boot/vmlinuz-2.6.18-53.1.14.el5 ro root=LABEL=/1 rhgb quiet elevator=as

This sped up my "dump" backups by a factor of four, and should be a lot easier to implement than recompiling the kernel.

Many thanks,
-T
I'll take this bug.
I've attached a preliminary patch to bug 456181 for this: https://bugzilla.redhat.com/attachment.cgi?id=319934
Please backport the following patch, which resolves the problem, from the community kernel to the RHEL 5 kernel.

> Subject: [PATCH] cfq-iosched: fix queue depth detection
> From: Aaron Carroll <aaronc.edu.au>
> Date: Fri, 22 Aug 2008 16:42:42 +1000
> To: Jens Axboe <jens.axboe>
> CC: LKML <linux-kernel.org>
>
> Hi Jens,
>
> This patch fixes a bug in the hw_tag detection logic causing a huge performance
> hit under certain workloads on real queuing devices. For example, an FIO load
> of 16k direct random reads on an 8-disk hardware RAID yields about 2 MiB/s on
> default CFQ, while noop achieves over 20 MiB/s.
>
> While the solution is pretty ugly, it does have the advantage of adapting to
> queue depth changes. Such a situation might occur if the queue depth is
> configured in userspace late in the boot process.
>
> Thanks,
> Aaron.
>
> --
>
> CFQ's detection of queueing devices assumes a non-queuing device and detects
> if the queue depth reaches a certain threshold. Under some workloads (e.g.
> synchronous reads), CFQ effectively forces a unit queue depth, thus defeating
> the detection logic. This leads to poor performance on queuing hardware,
> since the idle window remains enabled.
>
> This patch inverts the sense of the logic: assume a queuing-capable device,
> and detect if the depth does not exceed the threshold.
>
> Signed-off-by: Aaron Carroll <aaronc.edu.au>
> ---
>  block/cfq-iosched.c |   47 ++++++++++++++++++++++++++++++++++++++---------
>  1 files changed, 38 insertions(+), 9 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 1e2aff8..01ebb75 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -39,6 +39,7 @@ static int cfq_slice_idle = HZ / 125;
>  #define CFQ_MIN_TT      (2)
>
>  #define CFQ_SLICE_SCALE (5)
> +#define CFQ_HW_QUEUE_MIN (5)
>
>  #define RQ_CIC(rq) \
>      ((struct cfq_io_context *) (rq)->elevator_private)
> @@ -86,7 +87,14 @@ struct cfq_data {
>
>      int rq_in_driver;
>      int sync_flight;
> +
> +    /*
> +     * queue-depth detection
> +     */
> +    int rq_queued;
>      int hw_tag;
> +    int hw_tag_samples;
> +    int rq_in_driver_peak;
>
>      /*
>       * idle window management
> @@ -654,15 +662,6 @@ static void cfq_activate_request(struct request_queue *q, struct request *rq)
>      cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "activate rq, drv=%d",
>                   cfqd->rq_in_driver);
>
> -    /*
> -     * If the depth is larger 1, it really could be queueing. But lets
> -     * make the mark a little higher - idling could still be good for
> -     * low queueing, and a low queueing number could also just indicate
> -     * a SCSI mid layer like behaviour where limit+1 is often seen.
> -     */
> -    if (!cfqd->hw_tag && cfqd->rq_in_driver > 4)
> -        cfqd->hw_tag = 1;
> -
>      cfqd->last_position = rq->hard_sector + rq->hard_nr_sectors;
>  }
>
> @@ -686,6 +685,7 @@ static void cfq_remove_request(struct request *rq)
>      list_del_init(&rq->queuelist);
>      cfq_del_rq_rb(rq);
>
> +    cfqq->cfqd->rq_queued--;
>      if (rq_is_meta(rq)) {
>          WARN_ON(!cfqq->meta_pending);
>          cfqq->meta_pending--;
> @@ -1833,6 +1833,7 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>  {
>      struct cfq_io_context *cic = RQ_CIC(rq);
>
> +    cfqd->rq_queued++;
>      if (rq_is_meta(rq))
>          cfqq->meta_pending++;
>
> @@ -1880,6 +1881,31 @@ static void cfq_insert_request(struct request_queue *q, struct request *rq)
>      cfq_rq_enqueued(cfqd, cfqq, rq);
>  }
>
> +/*
> + * Update hw_tag based on peak queue depth over 50 samples under
> + * sufficient load.
> + */
> +static void cfq_update_hw_tag(struct cfq_data *cfqd)
> +{
> +    if (cfqd->rq_in_driver > cfqd->rq_in_driver_peak)
> +        cfqd->rq_in_driver_peak = cfqd->rq_in_driver;
> +
> +    if (cfqd->rq_queued <= CFQ_HW_QUEUE_MIN &&
> +        cfqd->rq_in_driver <= CFQ_HW_QUEUE_MIN)
> +        return;
> +
> +    if (cfqd->hw_tag_samples++ < 50)
> +        return;
> +
> +    if (cfqd->rq_in_driver_peak >= CFQ_HW_QUEUE_MIN)
> +        cfqd->hw_tag = 1;
> +    else
> +        cfqd->hw_tag = 0;
> +
> +    cfqd->hw_tag_samples = 0;
> +    cfqd->rq_in_driver_peak = 0;
> +}
> +
>  static void cfq_completed_request(struct request_queue *q, struct request *rq)
>  {
>      struct cfq_queue *cfqq = RQ_CFQQ(rq);
> @@ -1890,6 +1916,8 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
>      now = jiffies;
>      cfq_log_cfqq(cfqd, cfqq, "complete");
>
> +    cfq_update_hw_tag(cfqd);
> +
>      WARN_ON(!cfqd->rq_in_driver);
>      WARN_ON(!cfqq->dispatched);
>      cfqd->rq_in_driver--;
> @@ -2200,6 +2228,7 @@ static void *cfq_init_queue(struct request_queue *q)
>      cfqd->cfq_slice[1] = cfq_slice_sync;
>      cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
>      cfqd->cfq_slice_idle = cfq_slice_idle;
> +    cfqd->hw_tag = 1;
>
>      return cfqd;
>  }
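For anyone following along, the sampling scheme the patch uses can be illustrated with a small user-space sketch (this is a simulation for explanation only, not the kernel code; the struct and function names here are mine): assume the device queues, track the peak in-driver depth over 50 sufficiently-loaded samples, and only clear hw_tag if the peak never reaches the threshold.

```c
#include <string.h>

#define CFQ_HW_QUEUE_MIN 5

struct depth_detect {
    int hw_tag;            /* current verdict: 1 = queuing-capable hardware */
    int hw_tag_samples;    /* samples seen under sufficient load */
    int rq_in_driver_peak; /* peak in-driver depth in the current window */
};

static void depth_detect_init(struct depth_detect *d)
{
    memset(d, 0, sizeof(*d));
    d->hw_tag = 1; /* start by assuming a queuing device, as the patch does */
}

/* Called once per request completion with the current counts. */
static void depth_detect_sample(struct depth_detect *d,
                                int rq_queued, int rq_in_driver)
{
    if (rq_in_driver > d->rq_in_driver_peak)
        d->rq_in_driver_peak = rq_in_driver;

    /* Skip samples where the load is too light to be meaningful. */
    if (rq_queued <= CFQ_HW_QUEUE_MIN && rq_in_driver <= CFQ_HW_QUEUE_MIN)
        return;

    if (d->hw_tag_samples++ < 50)
        return;

    /* Window full: the verdict is whether the peak ever hit the threshold. */
    d->hw_tag = (d->rq_in_driver_peak >= CFQ_HW_QUEUE_MIN);
    d->hw_tag_samples = 0;
    d->rq_in_driver_peak = 0;
}
```

The key property, as the changelog notes, is that the verdict is re-evaluated every window, so it adapts if the queue depth is reconfigured from userspace later in boot.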
This work will not make the 5.4 release.

When the problem was initially reported, I talked to Jens Axboe about it, and he seemed receptive to the idea of adding some code to CFQ to detect processes interleaving I/Os. When I came up with a first patch for this, he suggested that we would be better off solving the problem in the applications themselves, by having the applications explicitly share I/O contexts (using sys_clone and the CLONE_IO flag*). I wrote a patch for dump to do this very thing, and it did solve the problem. However, the list of applications suffering from this kept growing. The applications I know of that perform interleaved reads between multiple processes include:

- dump
- nfsd
- qemu's posix aio backend
- one of the iSCSI target mode implementations
- a third-party volume manager

Evidently this is not an uncommon programming paradigm, so Jens decided to take the close cooperator patch set into 2.6.30. However, the implementation he merged was not quite ready, as it can cause some processes to be starved. I've been working with him to fix the problem properly while preserving fairness. In the end, the solution may involve a combination of detecting cooperating processes and sharing I/O contexts between them automatically.

This issue is my number one priority, and I will keep this bugzilla updated as progress is made.

* Note that shared I/O contexts (and the CLONE_IO flag) are not supported in RHEL 5; otherwise, I would have made that fix available for the 5.4 release.
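For reference, the application-side approach (explicitly sharing an I/O context via CLONE_IO) might look roughly like the sketch below. This is illustrative only, not the actual dump patch: the worker body, stack size, and function names are my own, and it requires a kernel with CLONE_IO support (2.6.25+), which RHEL 5 does not have.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#ifndef CLONE_IO
#define CLONE_IO 0x80000000 /* value from linux/sched.h; older glibc may not define it */
#endif

static int io_worker(void *arg)
{
    /* In dump, each worker would read its slice of the disk here.
     * Because of CLONE_IO, its reads are charged to the parent's
     * io_context, so CFQ sees one stream instead of competing ones. */
    (void)arg;
    return 0;
}

/* Spawn a worker that shares our I/O context; returns its exit
 * status, or -1 on failure. */
static int spawn_shared_io_worker(void)
{
    const size_t stack_sz = 64 * 1024;
    char *stack = malloc(stack_sz);
    int status;

    if (!stack)
        return -1;
    /* The stack grows down, so pass the top of the allocation. */
    pid_t pid = clone(io_worker, stack + stack_sz, CLONE_IO | SIGCHLD, NULL);
    if (pid < 0 || waitpid(pid, &status, 0) < 0) {
        free(stack);
        return -1;
    }
    free(stack);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The downside, as noted above, is that every affected application would need a change like this, which is why detecting cooperating processes in CFQ itself became the preferred direction.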
I put together another test kernel that implements close cooperator detection logic and merges the cfq_queues associated with cooperating processes. The result is a good speedup. In 100 runs of the read-test2 program (written to simulate the I/O pattern of the dump utility), these are the throughput numbers in MB/s:

Deadline:  Avg: 101.26907  Std. Dev.: 17.59767
CFQ:       Avg: 100.14914  Std. Dev.: 17.42747

Most of the runs saw 105 MB/s, but there were some outliers in the 28-30 MB/s range. I looked into those cases and found that processes were being scheduled in just the wrong order, introducing seeks into the workload. Unfortunately, I haven't come up with a good solution for that particular problem, though I'll note that it affects other I/O schedulers as well. Upstream does not exhibit this behaviour; I believe that may be due to the rewritten readahead code, but I can't be certain without further investigation. Without the patch set applied, the numbers for CFQ were in the 7-10 MB/s range.

I wasn't able to test NFS server performance as my test lab was experiencing some networking issues. I'll get that testing underway once that problem is resolved.

I've uploaded a test kernel here: http://people.redhat.com/jmoyer/cfq-cc/ Please take it for a spin and report your results. If you'd like to test on an architecture other than x86_64, just let me know and I'll kick off a build for whatever architecture is required.
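read-test2 itself isn't attached here, but the dump-style access pattern it simulates can be sketched as follows. Names and the chunk size are illustrative assumptions, and for brevity the readers run sequentially in one process, whereas the real test forks one process per reader; the point is that each reader touches every Nth chunk, so per-process access looks seeky even though the aggregate access is sequential.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK 4096

/* One reader's share: chunks idx, idx+nprocs, idx+2*nprocs, ...
 * Returns bytes read, or -1 on error. */
static long read_interleaved(const char *path, int idx, int nprocs)
{
    char buf[CHUNK];
    long total = 0;
    ssize_t n = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    for (off_t chunk = idx; ; chunk += nprocs) {
        n = pread(fd, buf, CHUNK, chunk * CHUNK);
        if (n <= 0)
            break;
        total += n;
    }
    close(fd);
    return n < 0 ? -1 : total;
}

/* Run nprocs interleaved readers over the file; returns the total
 * bytes read, or -1 on error. */
static long run_readers(const char *path, int nprocs)
{
    long total = 0;
    for (int i = 0; i < nprocs; i++) {
        long got = read_interleaved(path, i, nprocs);
        if (got < 0)
            return -1;
        total += got;
    }
    return total;
}
```

With one cfq_queue per reader, CFQ idles on each queue waiting for I/O that will never be sequential from that process's point of view; merging the cooperating queues restores the sequential aggregate stream.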
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
in kernel-2.6.18-173.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.
I posted one additional patch for this to rhkernel-list for review.
in kernel-2.6.18-177.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Some applications (including dump and nfsd) try to improve disk I/O performance by distributing I/O requests to multiple processes or threads. When using the CFQ I/O scheduler, this application design actually hurt performance, as the I/O scheduler would try to provide fairness between the processes or threads. This kernel contains a fix for this problem by detecting cooperating queues and merging them together. If the queues stop issuing requests close to one another, then they are broken apart again.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. Diffed Contents: @@ -1 +1 @@ -Some applications (including dump and nfsd) try to improve disk I/O performance by distributing I/O requests to multiple processes or threads. When using the CFQ I/O scheduler, this application design actually hurt performance, as the I/O scheduler would try to provide fairness between the processes or threads. This kernel contains a fix for this problem by detecting cooperating queues and merging them together. If the queues stop issuing requests close to one another, then they are broken apart again.+Some applications (e.g. dump and nfsd) try to improve disk I/O performance by distributing I/O requests to multiple processes or threads. However, when using the Completely Fair Queuing (CFQ) I/O scheduler, this application design negatively affected I/O performance. In Red Hat Enterprise Linux 5.5, the kernel can now detect and merge cooperating queues. Additionally, the kernel can detect if the queues stop cooperating, and split them apart again.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0178.html
Got a note that this had an info request. Couldn't find the request, but I am happy with what has been done. So, I guess ...