Dump and other large-file operations are up to 7 times slower than under previous kernels in RHEL 4.x. This is due to a bug in the kernel I/O scheduler's "cfq" implementation. Please see the following two Kernel.org bugs for details on this issue:

Dump of ext3 runs very slowly: http://bugzilla.kernel.org/show_bug.cgi?id=8636
Unusable system (i.e. slow) when copying large files: http://bugzilla.kernel.org/show_bug.cgi?id=7372

Until our heroes at Kernel.org correct this problem, would you consider implementing the following workaround in our kernels? Modify the kernel's .config from:

CONFIG_DEFAULT_IOSCHED="cfq"

to:

CONFIG_DEFAULT_IOSCHED="anticipatory"

Many thanks,
-T
Hi,

Found an easier workaround. Just add "elevator=as" to the end of the "kernel" line in grub.conf. For example:

kernel /boot/vmlinuz-2.6.18-53.1.14.el5 ro root=LABEL=/1 rhgb quiet elevator=as

This sped up my "dump" backups by a factor of four, and should be a lot easier to implement than recompiling the kernel.

Many thanks,
-T
I'll take this bug.
I've attached a preliminary patch to bug 456181 for this: https://bugzilla.redhat.com/attachment.cgi?id=319934
Please backport the following patch, which resolves the problem, from the community kernel to the RHEL 5 kernel.

> Subject: [PATCH] cfq-iosched: fix queue depth detection
> From: Aaron Carroll <aaronc.edu.au>
> Date: Fri, 22 Aug 2008 16:42:42 +1000
> To: Jens Axboe <jens.axboe>
> CC: LKML <linux-kernel.org>
>
> Hi Jens,
>
> This patch fixes a bug in the hw_tag detection logic causing a huge performance
> hit under certain workloads on real queuing devices. For example, an FIO load
> of 16k direct random reads on an 8-disk hardware RAID yields about 2 MiB/s on
> default CFQ, while noop achieves over 20 MiB/s.
>
> While the solution is pretty ugly, it does have the advantage of adapting to
> queue depth changes. Such a situation might occur if the queue depth is
> configured in userspace late in the boot process.
>
> Thanks,
> Aaron.
>
> --
>
> CFQ's detection of queueing devices assumes a non-queuing device and detects
> if the queue depth reaches a certain threshold. Under some workloads (e.g.
> synchronous reads), CFQ effectively forces a unit queue depth, thus defeating
> the detection logic. This leads to poor performance on queuing hardware,
> since the idle window remains enabled.
>
> This patch inverts the sense of the logic: assume a queuing-capable device,
> and detect if the depth does not exceed the threshold.
>
> Signed-off-by: Aaron Carroll <aaronc.edu.au>
> ---
>  block/cfq-iosched.c |   47 ++++++++++++++++++++++++++++++++++++++---------
>  1 files changed, 38 insertions(+), 9 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 1e2aff8..01ebb75 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -39,6 +39,7 @@ static int cfq_slice_idle = HZ / 125;
>  #define CFQ_MIN_TT      (2)
>
>  #define CFQ_SLICE_SCALE (5)
> +#define CFQ_HW_QUEUE_MIN (5)
>
>  #define RQ_CIC(rq) \
>      ((struct cfq_io_context *) (rq)->elevator_private)
> @@ -86,7 +87,14 @@ struct cfq_data {
>
>      int rq_in_driver;
>      int sync_flight;
> +
> +    /*
> +     * queue-depth detection
> +     */
> +    int rq_queued;
>      int hw_tag;
> +    int hw_tag_samples;
> +    int rq_in_driver_peak;
>
>      /*
>       * idle window management
> @@ -654,15 +662,6 @@ static void cfq_activate_request(struct request_queue *q, struct request *rq)
>      cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "activate rq, drv=%d",
>                   cfqd->rq_in_driver);
>
> -    /*
> -     * If the depth is larger 1, it really could be queueing. But lets
> -     * make the mark a little higher - idling could still be good for
> -     * low queueing, and a low queueing number could also just indicate
> -     * a SCSI mid layer like behaviour where limit+1 is often seen.
> -     */
> -    if (!cfqd->hw_tag && cfqd->rq_in_driver > 4)
> -        cfqd->hw_tag = 1;
> -
>      cfqd->last_position = rq->hard_sector + rq->hard_nr_sectors;
>  }
>
> @@ -686,6 +685,7 @@ static void cfq_remove_request(struct request *rq)
>      list_del_init(&rq->queuelist);
>      cfq_del_rq_rb(rq);
>
> +    cfqq->cfqd->rq_queued--;
>      if (rq_is_meta(rq)) {
>          WARN_ON(!cfqq->meta_pending);
>          cfqq->meta_pending--;
> @@ -1833,6 +1833,7 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>  {
>      struct cfq_io_context *cic = RQ_CIC(rq);
>
> +    cfqd->rq_queued++;
>      if (rq_is_meta(rq))
>          cfqq->meta_pending++;
>
> @@ -1880,6 +1881,31 @@ static void cfq_insert_request(struct request_queue *q, struct request *rq)
>      cfq_rq_enqueued(cfqd, cfqq, rq);
>  }
>
> +/*
> + * Update hw_tag based on peak queue depth over 50 samples under
> + * sufficient load.
> + */
> +static void cfq_update_hw_tag(struct cfq_data *cfqd)
> +{
> +    if (cfqd->rq_in_driver > cfqd->rq_in_driver_peak)
> +        cfqd->rq_in_driver_peak = cfqd->rq_in_driver;
> +
> +    if (cfqd->rq_queued <= CFQ_HW_QUEUE_MIN &&
> +        cfqd->rq_in_driver <= CFQ_HW_QUEUE_MIN)
> +        return;
> +
> +    if (cfqd->hw_tag_samples++ < 50)
> +        return;
> +
> +    if (cfqd->rq_in_driver_peak >= CFQ_HW_QUEUE_MIN)
> +        cfqd->hw_tag = 1;
> +    else
> +        cfqd->hw_tag = 0;
> +
> +    cfqd->hw_tag_samples = 0;
> +    cfqd->rq_in_driver_peak = 0;
> +}
> +
>  static void cfq_completed_request(struct request_queue *q, struct request *rq)
>  {
>      struct cfq_queue *cfqq = RQ_CFQQ(rq);
> @@ -1890,6 +1916,8 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
>      now = jiffies;
>      cfq_log_cfqq(cfqd, cfqq, "complete");
>
> +    cfq_update_hw_tag(cfqd);
> +
>      WARN_ON(!cfqd->rq_in_driver);
>      WARN_ON(!cfqq->dispatched);
>      cfqd->rq_in_driver--;
> @@ -2200,6 +2228,7 @@ static void *cfq_init_queue(struct request_queue *q)
>      cfqd->cfq_slice[1] = cfq_slice_sync;
>      cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
>      cfqd->cfq_slice_idle = cfq_slice_idle;
> +    cfqd->hw_tag = 1;
>
>      return cfqd;
>  }
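For anyone following along, the sampling scheme the patch uses can be illustrated with a small user-space sketch (this is a simulation for explanation only, not the kernel code; the struct and function names here are mine): assume the device queues, track the peak in-driver depth over 50 sufficiently-loaded samples, and only clear hw_tag if the peak never reaches the threshold.

```c
#include <string.h>

#define CFQ_HW_QUEUE_MIN 5

struct depth_detect {
    int hw_tag;            /* current verdict: 1 = queuing-capable hardware */
    int hw_tag_samples;    /* samples seen under sufficient load */
    int rq_in_driver_peak; /* peak in-driver depth in the current window */
};

static void depth_detect_init(struct depth_detect *d)
{
    memset(d, 0, sizeof(*d));
    d->hw_tag = 1; /* start by assuming a queuing device, as the patch does */
}

/* Called once per request completion with the current counts. */
static void depth_detect_sample(struct depth_detect *d,
                                int rq_queued, int rq_in_driver)
{
    if (rq_in_driver > d->rq_in_driver_peak)
        d->rq_in_driver_peak = rq_in_driver;

    /* Skip samples where the load is too light to be meaningful. */
    if (rq_queued <= CFQ_HW_QUEUE_MIN && rq_in_driver <= CFQ_HW_QUEUE_MIN)
        return;

    if (d->hw_tag_samples++ < 50)
        return;

    /* Window full: the verdict is whether the peak ever hit the threshold. */
    d->hw_tag = (d->rq_in_driver_peak >= CFQ_HW_QUEUE_MIN);
    d->hw_tag_samples = 0;
    d->rq_in_driver_peak = 0;
}
```

The key property, as the changelog notes, is that the verdict is re-evaluated every window, so it adapts if the queue depth is reconfigured from userspace later in boot.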
This work will not make the 5.4 release.

When the problem was initially reported, I talked to Jens Axboe about it, and he seemed receptive to the idea of adding some code to CFQ to detect processes interleaving I/Os. When I came up with a first patch for this, he suggested that we would be better off solving the problem in the applications themselves, by having the applications explicitly share I/O contexts (using sys_clone and the CLONE_IO flag*). I wrote a patch for dump to do this very thing, and it did solve the problem. However, the list of applications suffering from this kept growing. The applications I know of that perform interleaved reads between multiple processes include:

- dump
- nfsd
- qemu's posix aio backend
- one of the iSCSI target mode implementations
- a third-party volume manager

Evidently this is not an uncommon programming paradigm, so Jens decided to take the close cooperator patch set into 2.6.30. However, the implementation he merged was not quite ready, as it can cause some processes to be starved. I've been working with him to fix the problem properly while preserving fairness. In the end, the solution may involve a combination of detecting cooperating processes and sharing I/O contexts between them automatically.

This issue is my number one priority, and I will keep this bugzilla updated as progress is made.

* Note that shared I/O contexts (and the CLONE_IO flag) are not supported in RHEL 5; otherwise, I would have made that fix available for the 5.4 release.
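For reference, the application-side approach (explicitly sharing an I/O context via CLONE_IO) might look roughly like the sketch below. This is illustrative only, not the actual dump patch: the worker body, stack size, and function names are my own, and it requires a kernel with CLONE_IO support (2.6.25+), which RHEL 5 does not have.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#ifndef CLONE_IO
#define CLONE_IO 0x80000000 /* value from linux/sched.h; older glibc may not define it */
#endif

static int io_worker(void *arg)
{
    /* In dump, each worker would read its slice of the disk here.
     * Because of CLONE_IO, its reads are charged to the parent's
     * io_context, so CFQ sees one stream instead of competing ones. */
    (void)arg;
    return 0;
}

/* Spawn a worker that shares our I/O context; returns its exit
 * status, or -1 on failure. */
static int spawn_shared_io_worker(void)
{
    const size_t stack_sz = 64 * 1024;
    char *stack = malloc(stack_sz);
    int status;

    if (!stack)
        return -1;
    /* The stack grows down, so pass the top of the allocation. */
    pid_t pid = clone(io_worker, stack + stack_sz, CLONE_IO | SIGCHLD, NULL);
    if (pid < 0 || waitpid(pid, &status, 0) < 0) {
        free(stack);
        return -1;
    }
    free(stack);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The downside, as noted above, is that every affected application would need a change like this, which is why detecting cooperating processes in CFQ itself became the preferred direction.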
I put together another test kernel that implements close cooperator detection logic and merges the cfq_queues associated with cooperating processes. The result is a good speedup. In 100 runs of the read-test2 program (written to simulate the I/O pattern of the dump utility), these are the throughput numbers in MB/s:

Deadline:  Avg: 101.26907  Std. Dev.: 17.59767
CFQ:       Avg: 100.14914  Std. Dev.: 17.42747

Most of the runs saw 105 MB/s, but there were some outliers in the 28-30 MB/s range. I looked into those cases and found that processes were being scheduled in just the wrong order, introducing seeks into the workload. Unfortunately, I haven't come up with a good solution for that particular problem, though I'll note that it affects other I/O schedulers as well. Upstream does not exhibit this behaviour; I believe that may be due to the rewritten readahead code, but I can't be certain without further investigation. Without the patch set applied, the numbers for CFQ were in the 7-10 MB/s range.

I wasn't able to test NFS server performance as my test lab was experiencing some networking issues. I'll get that testing underway once that problem is resolved.

I've uploaded a test kernel here: http://people.redhat.com/jmoyer/cfq-cc/ Please take it for a spin and report your results. If you'd like to test on an architecture other than x86_64, just let me know and I'll kick off a build for whatever architecture is required.
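read-test2 itself isn't attached here, but the dump-style access pattern it simulates can be sketched as follows. Names and the chunk size are illustrative assumptions, and for brevity the readers run sequentially in one process, whereas the real test forks one process per reader; the point is that each reader touches every Nth chunk, so per-process access looks seeky even though the aggregate access is sequential.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK 4096

/* One reader's share: chunks idx, idx+nprocs, idx+2*nprocs, ...
 * Returns bytes read, or -1 on error. */
static long read_interleaved(const char *path, int idx, int nprocs)
{
    char buf[CHUNK];
    long total = 0;
    ssize_t n = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    for (off_t chunk = idx; ; chunk += nprocs) {
        n = pread(fd, buf, CHUNK, chunk * CHUNK);
        if (n <= 0)
            break;
        total += n;
    }
    close(fd);
    return n < 0 ? -1 : total;
}

/* Run nprocs interleaved readers over the file; returns the total
 * bytes read, or -1 on error. */
static long run_readers(const char *path, int nprocs)
{
    long total = 0;
    for (int i = 0; i < nprocs; i++) {
        long got = read_interleaved(path, i, nprocs);
        if (got < 0)
            return -1;
        total += got;
    }
    return total;
}
```

With one cfq_queue per reader, CFQ idles on each queue waiting for I/O that will never be sequential from that process's point of view; merging the cooperating queues restores the sequential aggregate stream.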
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
in kernel-2.6.18-173.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.
I posted one additional patch for this to rhkernel-list for review.
in kernel-2.6.18-177.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Some applications (including dump and nfsd) try to improve disk I/O performance by distributing I/O requests to multiple processes or threads. When using the CFQ I/O scheduler, this application design actually hurt performance, as the I/O scheduler would try to provide fairness between the processes or threads. This kernel contains a fix for this problem by detecting cooperating queues and merging them together. If the queues stop issuing requests close to one another, then they are broken apart again.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. Diffed Contents: @@ -1 +1 @@ -Some applications (including dump and nfsd) try to improve disk I/O performance by distributing I/O requests to multiple processes or threads. When using the CFQ I/O scheduler, this application design actually hurt performance, as the I/O scheduler would try to provide fairness between the processes or threads. This kernel contains a fix for this problem by detecting cooperating queues and merging them together. If the queues stop issuing requests close to one another, then they are broken apart again.+Some applications (e.g. dump and nfsd) try to improve disk I/O performance by distributing I/O requests to multiple processes or threads. However, when using the Completely Fair Queuing (CFQ) I/O scheduler, this application design negatively affected I/O performance. In Red Hat Enterprise Linux 5.5, the kernel can now detect and merge cooperating queues. Additionally, the kernel can detect if the queues stop cooperating, and split them apart again.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0178.html
Got a note that this had an info request. Couldn't find the request, but I am happy with what has been done. So, I guess ...