Bug 448130 - 50-75 % drop in cfq read performance compared to rhel 4.6+
Summary: 50-75 % drop in cfq read performance compared to rhel 4.6+
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Jeff Moyer
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On: 436004
Blocks: 391501 483701 499522 525215 533192 5.5TechNotes-Updates 557926
 
Reported: 2008-05-23 17:06 UTC by Eric Sandeen
Modified: 2018-10-27 15:32 UTC
CC: 39 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Some applications (e.g. dump and nfsd) try to improve disk I/O performance by distributing I/O requests to multiple processes or threads. However, when using the Completely Fair Queuing (CFQ) I/O scheduler, this application design negatively affected I/O performance. In Red Hat Enterprise Linux 5.5, the kernel can now detect and merge cooperating queues. Additionally, the kernel can detect if the queues stop cooperating, and split them apart again.
Clone Of:
Environment:
Last Closed: 2010-03-30 07:18:43 UTC
Target Upstream Version:
Embargoed:


Attachments
seekwatcher graph of 8 vs. 1 nfsd thread (72.93 KB, image/png), 2008-05-23 18:20 UTC, Eric Sandeen
Systemtap trace with one nfsd process (9.37 KB, text/plain), 2008-05-29 20:15 UTC, Steve Dickson
Systemtap trace with two nfsd processes (9.37 KB, text/plain), 2008-05-29 20:16 UTC, Steve Dickson
Systemtap trace of ext3 with two nfsd processes (9.37 KB, text/plain), 2008-05-29 20:42 UTC, Steve Dickson
Systemtap trace of ext3 with one nfsd process (9.37 KB, text/plain), 2008-05-29 20:44 UTC, Steve Dickson
updated seekwatcher trace (79.19 KB, image/png), 2008-05-30 21:10 UTC, Eric Sandeen


Links
Red Hat Product Errata RHSA-2010:0178 (normal, SHIPPED_LIVE): Important: Red Hat Enterprise Linux 5.5 kernel security and bug fix update. Last updated 2010-03-29 12:18:21 UTC.

Description Eric Sandeen 2008-05-23 17:06:05 UTC
+++ This bug was initially created as a clone of Bug #436004 +++

This bug specifically covers the read performance regression.  Please see that bug for the
original partner comments & other info.

On the read side...

On my test setup, with 8 nfsd threads I am seeing about 20MB/s for read and
reread, and a fairly high seek rate, to the tune of around 200 seeks/s.

If I restrict to only 1 nfsd thread, I get 55MB/s and the seek rate is
substantially lower.

Additionally, if we look at the block IO stats for 1 thread:


Total (iozone_xfs_read_1thread_full):
 Reads Queued:      31,254,    4,000MiB	 Writes Queued:           0,        0KiB
 Read Dispatches:   31,115,    4,000MiB	 Write Dispatches:        0,        0KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:   31,115,    4,000MiB	 Writes Completed:        0,        0KiB
 Read Merges:          139,   17,792KiB	 Write Merges:            0,        0KiB
 IO unplugs:        25,777        	 Timer unplugs:           0

Throughput (R/W): 74,414KiB/s / 0KiB/s

vs. 8 threads:

Total (iozone_xfs_read_full):
 Reads Queued:     121,516,    4,000MiB	 Writes Queued:           0,        0KiB
 Read Dispatches:   65,893,    4,000MiB	 Write Dispatches:        0,        0KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:   65,893,    4,000MiB	 Writes Completed:        0,        0KiB
 Read Merges:       55,503,    1,768MiB	 Write Merges:            0,        0KiB
 IO unplugs:       125,270        	 Timer unplugs:           0

Throughput (R/W): 32,108KiB/s / 0KiB/s

we can see that this results in a very different IO pattern... with 1 thread
doing larger IOs.
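
(For reference, totals like the ones above come from blktrace/blkparse run on the server while the client executes the read test. A rough sketch; the device name and output prefix are illustrative, not taken from this report:)

  # blktrace needs debugfs mounted (RHEL 5 does not mount it by default)
  mount -t debugfs debugfs /sys/kernel/debug
  # trace the device backing the export for the duration of the client run
  blktrace -d /dev/sdb -o iozone_read_trace &
  # ... run the iozone read test from the client, then stop the trace ...
  kill -INT %1
  # fold the per-CPU trace files into events plus the summary totals shown above
  blkparse -i iozone_read_trace > iozone_read_trace.txt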

-- Additional comment from esandeen on 2008-05-21 17:17 EST --
I'm going to hazard a guess that on the read side, sharing read requests across
the nfsds is defeating readahead.

With 8 threads, for a block range I see requests issued like:

  8,21   1        6     0.000019677  4059  D   R 630773105 + 64 [nfsd]
  8,21   1       16     0.019685010  4059  D   R 630773297 + 256 [nfsd]
  8,21   1       28     0.049531279  4059  D   R 630773553 + 32 [nfsd]
  8,21   1       34     0.049584655  4060  D   R 630773585 + 64 [nfsd]
  8,21   1       40     0.049614579  4061  D   R 630773649 + 64 [nfsd]
  8,21   1       46     0.049652358  4058  D   R 630773713 + 64 [nfsd]
.... 
and more.

With 1 thread:

  8,21   1        6     0.635870590  4309  D   R 630773105 + 64 [nfsd]
  8,21   1       14     0.662722860  4309  D   R 630773169 + 384 [nfsd]
  8,21   1       23     0.684639922  4309  D   R 630773553 + 512 [nfsd]

this looks like a growing readahead window.

-Eric

-- Additional comment from esandeen on 2008-05-21 17:40 EST --
Hmm, that might have been slightly anomalous, but I do still see the single-thread
case consistently issuing larger IOs, usually to the tune of 256 sectors vs. 64.

-Eric

-- Additional comment from sandeep_k_shandilya on 2008-05-22 06:39 EST --
(In reply to comment #52)
> I'm going to hazard a guess that on the read side, sharing read requests across
> the nfsds is defeating readahead.

Yes, I have confirmed this, with one thread, rhel5 server performance is equal
to rhel 4 performance.

sandeep

-- Additional comment from jskrabal on 2008-05-22 10:34 EST --
Proposing for rhel-5.2.z. Still needs SEG approval.

-- Additional comment from esandeen on 2008-05-22 11:03 EST --
There seem to be at least two issues here: one regarding rewrite, which I think I
have identified and solved, and the other regarding read performance, which is
somewhat narrowed down now but not solved.

Would it be worth filing 1 or 2 sub-bugs to this bug, one for each issue, to
track them separately?

The rewrite issue makes sense to me for a z-stream since it's a pretty
obvious fix, but the read issue still needs investigation.

-Eric

-- Additional comment from bmarson on 2008-05-22 14:33 EST --
Here's the matrix of nfsd thread counts I have come up with.  These were run with
my simzone tool (100 lines vs. 3000 lines of iozone).

nfsd    +--------- RHEL4 -67 ---------+-------- RHEL5  -88 ---------
threads | iwrite rewrite read  reread | iwrite rewrite read   reread
--------+-----------------------------+-----------------------------
1         47427  40593   75514 75764    40963  40465   75229  75697
2         41241  38589   75411 75526    41434  41385   13113  13218
4         42007  38706   70648 69157    46322  38657   16706  16715
8         36787  39707   56489 56650    43524  42155   31778  32141
16                                      44875  39903   45675  45682
32                                      42315  39434   45942  46185

As you can see, with a single nfsd thread RHEL5 read performance matches RHEL4.
The biggest disparity is in read performance, especially at 2 threads.
Cranking up the thread count in RHEL5 does improve read performance, but it
never reaches RHEL4's performance.

I'm testing the 2-nfsd-thread case with the server booted with 1 CPU.  I see
similarly low read performance (~15 MB/s) as with the server booted with 8 CPUs (~13 MB/s).

Barry

Comment 1 Eric Sandeen 2008-05-23 18:20:14 UTC
Created attachment 306534 [details]
seekwatcher graph of 8 vs. 1 nfsd thread

Here's a graph showing seeks & throughput for running:

iozone -s 2000000 -r 64 -f /mnt/nfs/testfile -i 1 -w

from a client, with the server running 8 and 1 nfsd threads, just to show at a
high level what the seek & throughput situation looks like.

Comment 2 Barry Marson 2008-05-23 20:15:20 UTC
To summarize the simplest configuration that recreates this read issue:
run a RHEL5 server booted with 1 CPU and 2 nfsds.  This yields a 6x drop in
read performance.

Barry
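
(For anyone reproducing this: on RHEL 5 the server can be pinned to that configuration roughly as follows. The maxcpus boot parameter and RPCNFSDCOUNT are the standard knobs; the values are just the ones Barry describes.)

  # limit the server to one CPU by appending maxcpus=1 to the kernel line in
  # /boot/grub/grub.conf and rebooting
  # then run only two nfsd threads, either persistently...
  echo "RPCNFSDCOUNT=2" >> /etc/sysconfig/nfs
  service nfs restart
  # ...or just for the current boot
  rpc.nfsd 2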

Comment 4 Steve Dickson 2008-05-29 20:13:28 UTC
Attached below are two systemtap traces that show the
NFS server is spending 2 to 3 times longer in reads
when there are two nfsd threads versus one.

Comment 5 Steve Dickson 2008-05-29 20:15:14 UTC
Created attachment 307128 [details]
Systemtap trace with one nfsd process

Comment 6 Steve Dickson 2008-05-29 20:16:11 UTC
Created attachment 307130 [details]
Systemtap trace with two nfsd processes

Comment 7 Steve Dickson 2008-05-29 20:42:43 UTC
Created attachment 307134 [details]
Systemtap trace of ext3 with two nfsd processes

Comment 8 Steve Dickson 2008-05-29 20:44:27 UTC
Created attachment 307135 [details]
Systemtap trace of ext3 with one nfsd process

Comment 9 Ric Wheeler 2008-05-30 13:36:10 UTC
(In reply to comment #1)
> Created an attachment (id=306534) [edit]
> seekwatcher graph of 8 vs. 1 nfsd thread
> 
> Here's a graph showing seeks & throughput for running:
> 
> iozone -s 2000000 -r 64 -f /mnt/nfs/testfile -i 1 -w
> 
> from a client, with the server running 8 and 1 nfsd threads, just to show at a
> high level what the seek & throughput situation looks like.

One interesting note is that the writes are much more spread out in the 8 thread
case if I read the graph correctly.

Can we dump the list of blocks allocated for /mnt/nfs/testfile? Bad layout in
the write phase can show up as a huge drop in the read throughput...


Comment 10 Eric Sandeen 2008-05-30 14:15:52 UTC
Ric: re: comment #9; the file I was reading here was actually created on the
server, so it was fairly contiguous (about as contiguous as ext3 could make it,
anyway...)  I can double check the layout if you like.  But, the 1- and 8-thread
graphs are actually reading the very same file with the very same layout; I used
the -w switch to keep the previously-written file in place through both runs. 
So both cases saw the same degree of fragmentation.

-Eric
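
(Re: dumping the block list from comment #9, a quick sketch of how the layout can be checked on the server; the path is illustrative since the export path isn't given in this report.)

  # extent/block layout of the test file on ext3
  filefrag -v /path/to/export/testfile
  # extent map of the same file if it lives on XFS
  xfs_bmap -v /path/to/export/testfile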

Comment 11 Ric Wheeler 2008-05-30 14:23:29 UTC
From the seek watcher plot, it did look like there were distinct write pattern
differences in the green and light blue plot lines, but if the test reuses the
same file, I must be misreading ;-)

thanks,

ric


Comment 12 Eric Sandeen 2008-05-30 21:10:21 UTC
Created attachment 307238 [details]
updated seekwatcher trace

The previous seekwatcher graph was buggy; the stair-steppiness was an artifact
of a too-coarse grid for the graph generation.

Just to completely rule out any server-side fragmentation as an issue, I tested
reading the file from xfs which had the 2G file in a single contiguous extent. 
This is the resulting (fixed) seekwatcher graph for 8 vs. 1 nfsd threads.

Thanks,
-Eric

Comment 13 Steve Dickson 2008-06-03 19:59:08 UTC
Putting a systemtap probe in mpage_readpages (which
is what ext3_readpages calls), you can see a very
different pattern in how many pages are requested.

The following tables show the number of requested pages (#pages), how
many times mpage_readpages was called (#Calls), and the
total amount of time (in nanoseconds) spent.

One NFSd process
               (#pages)  # Calls   Total ns
mpage_readpages(32)      32767   3776096302
mpage_readpages(16)          2     18598287
Total Calls: 32769

Two NFSd processes:
mpage_readpages(32)       6783    134188214
mpage_readpages(31)          9       171383
mpage_readpages(30)          1         8091
mpage_readpages(29)          2        38305
mpage_readpages(28)         14      8638196
mpage_readpages(27)          1         7189
mpage_readpages(24)        404      8373816
mpage_readpages(23)          1         6801
mpage_readpages(20)          3     24430710
mpage_readpages(16)       4836     59248733
mpage_readpages(15)         10        98711
mpage_readpages(12)          9      8650390
mpage_readpages(11)          1         1285
mpage_readpages(10)          4        23062
mpage_readpages(9)          3        18580
mpage_readpages(8)      90658   2789966293
mpage_readpages(7)         74       445143
mpage_readpages(6)         13        58773
mpage_readpages(5)         12        39802
mpage_readpages(4)         52    274642178
mpage_readpages(3)         10     17733454
mpage_readpages(2)         15        43661
mpage_readpages(1)         16        44958
Total Calls: 102931

So there is quite a different pattern when one or
two nfsds are used... This may be how pages were always
allocated and we never noticed it before since it didn't
cause a problem... but now it seems to...
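
(The script used to gather these numbers isn't attached; a probe along these lines can produce the same kind of per-call-size breakdown. This is a rough sketch, with the nr_pages parameter name taken from the mpage_readpages() prototype, not the script Steve actually ran.)

  stap -e '
  global nr, start, calls, total_ns
  probe kernel.function("mpage_readpages") {
      # remember the requested page count and entry time for this thread
      nr[tid()] = $nr_pages; start[tid()] = gettimeofday_ns(); calls[$nr_pages]++
  }
  probe kernel.function("mpage_readpages").return {
      if (tid() in start) {
          total_ns[nr[tid()]] += gettimeofday_ns() - start[tid()]
          delete start[tid()]; delete nr[tid()]
      }
  }
  probe end {
      foreach (n- in calls)
          printf("mpage_readpages(%d)  %8d calls  %12d ns\n", n, calls[n], total_ns[n])
  }'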




Comment 14 John Feeney 2008-06-04 14:02:56 UTC
Dell has found that the workaround proposed in the original bugzilla (bz436004
comment #67) decreases the effects of this read-side problem. Thus, I have
copied the comment that details the workaround here for posterity. Thanks
go out to Ben England at IBRIX for providing it.


"I have been using a workaround described below, and have observed no regression
in RHEL5.1 single-threaded NFS reads when using this workaround.  This seems
consistent with the preceding results in this bug report -- i.e. 1 nfsd thread
is much faster.

The workaround is to add this line to /etc/rc.local boot script and then to run
that script:

# for n in /sys/block/sd*/queue/iosched/slice_idle ; do echo 1 > $n ; done

This parameter did not exist in the RHEL4 CFQ I/O scheduler.  A similar effect
can be achieved with use of the deadline or noop scheduler, but for writes we
have seen better results with CFQ.

The purpose of this workaround is to minimize overhead imposed by CFQ when
multiple threads are reading from the same file.  NFS uses a thread pool to
service RPCs, so that a sequential single-thread read at the application layer
becomes a multi-thread read at the NFS server.  CFQ treats threads as if they
were application processes, but in fact they are not here so the default delay
of 8 ms between switching to a different thread’s requests, represented by the
slice_idle block device tuning parameter, is unreasonable.    Others have seen
this problem, including the author of CFQ.

http://linux.derkeiler.com/Mailing-Lists/Kernel/2008-05/msg05066.html

More research needs to be done on the effect of setting this parameter to zero,
but until we do a systematic test of all known workloads with this value I would
not recommend it as a general solution.  

Reproducer: A 43% improvement, from 24.7 to 35.4 MB/s, was observed using this
simple test done with 2 hosts running RHEL5.1 connected by a 1-Gb Ethernet
link.  The NFS server exported a partition on the system disk, /dev/sda3,
mounted as an ext3 file system.  No NFS or ext3 tuning was used.  The workload was:

# dd of=/dev/null bs=64k count=16k if=/mnt/nfsext3/f"
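
(For reference, the setting from the workaround above can be checked and applied per device; a small sketch, with sdb as a stand-in device name:)

  # default is 8 (milliseconds); the workaround above sets it to 1
  cat /sys/block/sdb/queue/iosched/slice_idle
  echo 1 > /sys/block/sdb/queue/iosched/slice_idle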




Comment 16 RHEL Program Management 2008-07-09 19:01:37 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 18 Eric Sandeen 2008-07-10 22:18:39 UTC
FWIW:  I ran a test with a rhel5 server, F9 client.

Created a file on the server with:

iozone -s 2000000 -r 64 -f /mnt/nfs/testfile -i 0 -w

and read from the client with:

iozone -s 2000000 -r 64 -f /mnt/nfs/testfile -i 1 -w

dropped caches on the server in between tests.  Based on comment #67 in the
parent bug 436004 I decided to try different io schedulers on the server:

              KB  reclen   write rewrite    read    reread
cfq:     2000000      64                    26694    98303
noop:    2000000      64                    43093    98142
anticip: 2000000      64                    43409    98202
deadline:2000000      64                    43423    98372

... so this is certainly looking cfq-related.

-Eric
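
(For reference, the per-scheduler runs above amount to switching the elevator on the server's data device and dropping caches between runs; a sketch, with sdb as a stand-in device name:)

  # pick the elevator for the device backing the export; the active one shows in [brackets]
  echo cfq > /sys/block/sdb/queue/scheduler      # or noop / anticipatory / deadline
  cat /sys/block/sdb/queue/scheduler
  # drop the page cache so each read test starts cold
  echo 3 > /proc/sys/vm/drop_caches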


Comment 19 Eric Sandeen 2008-07-10 22:39:44 UTC
Doing the same test locally on the server, not over nfs:

              KB  reclen   write rewrite    read    reread
cfq:     2000000      64                    56761  1945684
noop:    2000000      64                    88780  1890856
anticip: 2000000      64                    89262  1906614
deadline:2000000      64                    90563  1893523

I suppose the next logical test would be to compare nfs perf w/ one of the other
schedulers between RHEL4 and RHEL5, to see if there is any nfs component of the
problem at all.

-Eric


Comment 20 Peter Staubach 2008-07-11 11:49:30 UTC
Thanx for doing this work, Eric!

I wouldn't be surprised if there is an NFS component, but I think that
it would be good to get the cfq issues addressed because that's the
default.

Comment 23 Eric Sandeen 2008-08-08 17:38:36 UTC
I ran local & nfs tests on the same box as previous, but with a RHEL4 kernel installed (on a RHEL5 root).  Summary follows, with RHEL5 results restated.

-RHEL4-

Local
              KB  reclen   write rewrite    read    reread
cfq:     2000000      64                    76965  2019404
noop:    2000000      64                    76461  2040349
anticip: 2000000      64                    76728  2030559
deadline:2000000      64                    78802  2082506

NFS
              KB  reclen   write rewrite    read    reread
cfq:     2000000      64                    44627   100130
noop:    2000000      64                    44510    97712
anticip: 2000000      64                    43739    98699
deadline:2000000      64                    43937    99337

-RHEL5-

Local
              KB  reclen   write rewrite    read    reread
cfq:     2000000      64                    56761  1945684
noop:    2000000      64                    88780  1890856
anticip: 2000000      64                    89262  1906614
deadline:2000000      64                    90563  1893523

NFS
              KB  reclen   write rewrite    read    reread
cfq:     2000000      64                    26694    98303
noop:    2000000      64                    43093    98142
anticip: 2000000      64                    43409    98202
deadline:2000000      64                    43423    98372

At least from this test, there does not seem to be an NFS performance regression; it appears to be all CFQ's doing.

Comment 25 Eric Sandeen 2008-08-14 14:33:12 UTC
CFQ is pretty clearly hurting here.

For a large/fast array I'd suggest noop in any case.  But we need to get this fixed, I will try to find time to look into it soon.

Comment 26 Shyam kumar Iyer 2008-08-22 09:46:14 UTC
Please have a look at the code path below that is employed by cfq, which makes me believe that this is by cfq design.

"cfq_select_queue" calls the following snippet of code, so if there are no other further requests then it will wait on idle, expecting some other back-to-back requests.

        /*
         * if queue has requests, dispatch one. if not, check if
         * enough slice is left to wait for one
         */
        if (!RB_EMPTY_ROOT(&cfqq->sort_list))
                goto keep_queue;
        else if (cfq_cfqq_dispatched(cfqq)) {
                cfqq = NULL;
                goto keep_queue;
        } else if (cfq_cfqq_class_sync(cfqq)) {
                if (cfq_arm_slice_timer(cfqd, cfqq))
                        return NULL;
        }


After completing a request in "cfq_completed_request" it again waits on idle if there are no further requests until the timer expires.

 /*
         * If this is the active queue, check if it needs to be expired,
         * or if we want to idle in case it has no pending requests.
         */
        if (cfqd->active_queue == cfqq) {
                if (time_after(now, cfqq->slice_end))
                        cfq_slice_expired(cfqd, 0);
                else if (sync && RB_EMPTY_ROOT(&cfqq->sort_list))
                        cfq_arm_slice_timer(cfqd, cfqq);
        }


And inside cfq_arm_slice_timer is the snippet of code below, where it decides how long to idle (capped for seeky processes).
        sl = min(cfqq->slice_end - 1, (unsigned long) cfqd->cfq_slice_idle);

        /*
         * we don't want to idle for seeks, but we do want to allow
         * fair distribution of slice time for a process doing back-to-back
         * seeks. so allow a little bit of time for him to submit a new rq
         */
        if (sample_valid(cic->seek_samples) && CIC_SEEKY(cic))
                sl = min(sl, msecs_to_jiffies(2));

        mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
        return 1;


Now, all of this is per the design of the algorithm (fair queuing), which assumes that there may be back-to-back I/O requests even if there are no pending requests currently.

Since nfsd spawns multiple kernel threads, they all have their own request queues. So if the I/O scheduler is servicing one nfsd thread, all the other threads may be waiting on I/O.

We see that the problem goes away if slice_idle is reduced, because that makes the algorithm not wait on one thread for idle time.

This is a good algorithm for not switching context between, say, a process A and a process B (completely different processes).

But in the case of nfs, which spawns multiple nfsd threads that are doing the same task through different queues, this algorithm defeats the purpose of keeping the idle time.

I am thinking that the cfq algorithm probably needs to maintain thread groups (groups that perform the same function and use multiple threads to increase parallelism) to decide on idle time.

Please review and check whether the analysis needs to be taken upstream.

Comment 27 Eric Sandeen 2008-08-22 16:32:51 UTC
Shyam, that seems like a reasonable assessment, however: in comment #23 I've shown that even local, single-threaded IO is suffering under cfq.  I think that before we get to the multi-threaded nfsd issues, we need to sort out what's going on with simple, local IO ...

Comment 28 Shyam kumar Iyer 2008-08-23 12:26:09 UTC
I am sure that it is still going to wait on I/O even if there is only one I/O thread.

slice_idle will cause the thread to wait on I/O, and does the same with other kernel threads. It is just that we reduced the number of kernel threads waiting on I/O by making the environment simple, local I/O.

Comment 29 Eric Sandeen 2008-08-23 15:40:00 UTC
Shyam, out of curiosity, can I ask what sort of storage you've been testing on?

Thanks,
-Eric

Comment 30 Shyam kumar Iyer 2008-09-17 11:57:42 UTC
The testing is on SAS disks connected to a PERC 5/i.

Comment 31 Ben England 2008-10-21 17:24:59 UTC
Based on additional testing and code analysis similar to Shyam's, IBRIX is currently recommending slice_idle=0 for IBRIX FS devices.  When we use this setting, there is no degradation in read throughput from RHEL4, and it is about 20% better than slice_idle=1 (previous recommendation above).  

It appears that EMC PowerPath V5 also used this for all /dev/emcpower* devices on RHEL5, but you'd have to ask them.  

This does not mean that slice_idle should be removed from the scheduler.

Comment 32 Jeff Moyer 2008-11-10 18:30:02 UTC
Can I get testing feedback on the following patch, please?  If you require a kernel build, please let me know against which kernel version you would like the patch applied.

https://bugzilla.redhat.com/attachment.cgi?id=319934

Cheers!

Comment 33 Jeff Moyer 2008-11-10 18:36:54 UTC
(In reply to comment #31)
> Based on additional testing and code analysis similar to Shyam's, IBRIX is
> currently recommending slice_idle=0 for IBRIX FS devices.  When we use this
> setting, there is no degradation in read throughput from RHEL4, and it is about
> 20% better than slice_idle=1 (previous recommendation above).  
> 
> It appears that EMC PowerPath V5 also used this for all /dev/emcpower* devices
> on RHEL5, but you'd have to ask them.  
> 
> This does not mean that slice_idle should be removed from the scheduler.

RHEL 4's cfq implementation was quantum-based, not time-based.  The algorithm went like so:

o Pick the most important queue with I/O pending
o submit X number of I/Os
o repeat

So, after a queue goes empty, the next queue is selected;  there is no idle window.  Setting slice_idle to zero, therefore, gets you very close to the behaviour that was witnessed under RHEL 4.

Now, idle slices do help a number of workloads.  In this case, however, it sounds like the nfsd threads are interleaving I/O to the same file, and thus the idle slice is really hurting performance (as you're now waiting up to 8ms between each I/O).  One solution to this problem is to detect multiple processes issuing I/O in this manner, and to switch queues instead of waiting for more requests on the current active queue.  That is precisely what the patch in comment #32 implements.

Comment 34 Jeff Moyer 2008-11-24 19:29:58 UTC
I'm still waiting for testing feedback.

Comment 35 Ric Wheeler 2008-11-25 18:33:26 UTC
Moving to 5.4

Comment 36 Sandeep K. Shandilya 2008-11-28 05:34:20 UTC
(In reply to comment #34)
> I'm still waiting for testing feedback.

Here is the feedback: I see a slight improvement... but not yet to the same performance level as the deadline/noop scheduler.
Here are the results with a RHEL 5.2 client, a RHEL 5.2 server, and a RHEL 5.2 cfq-patched server (filesize=4G, ramsize=2G). Client- and server-side caching has been taken into account by rebooting the server between every run.

patch applied iosched=cfq
-------------------------
[root@RB-C2 BUILD]# dd if=/mnt/nfs/bigfile of=/dev/null
8388608+0 records in
8388608+0 records out
4294967296 bytes (4.3 GB) copied, 97.1749 seconds, 44.2 MB/s

patch applied iosched=deadline
------------------------------
[root@RB-C2 BUILD]# dd if=/mnt/nfs/bigfile of=/dev/null
8388608+0 records in
8388608+0 records out
4294967296 bytes (4.3 GB) copied, 73.773 seconds, 58.2 MB/s

patch NOT applied iosched=cfq rhel 5.2 kernel.
----------------------------------------------
[root@RB-C2 BUILD]# dd if=/mnt/nfs/bigfile of=/dev/null
8388608+0 records in
8388608+0 records out
4294967296 bytes (4.3 GB) copied, 154.909 seconds, 27.7 MB/s

patch NOT applied iosched=deadline rhel 5.2 kernel.
---------------------------------------------------
[root@RB-C2 BUILD]# dd if=/mnt/nfs/bigfile of=/dev/null
8388608+0 records in
8388608+0 records out
4294967296 bytes (4.3 GB) copied, 76.7484 seconds, 56.0 MB/s

Comment 39 RHEL Program Management 2009-02-16 15:22:40 UTC
Updating PM score.

Comment 41 Jeff Moyer 2009-05-02 00:21:09 UTC
This work will not make the 5.4 release.

When the problem was initially reported, I talked to Jens Axboe about it, and he seemed receptive to the idea of adding some code to CFQ to detect processes interleaving I/Os.  When I came up with a first patch for this, he then suggested that we would be better off solving the problem in the applications themselves, by having the applications explicitly share I/O contexts (using sys_clone and the CLONE_IO flag*).  I wrote a patch for dump(8) to do this very thing, and it did solve the problem.  I also have preliminary patches for nfsd.  However, the list of applications suffering from this kept growing.  The applications I know of that perform interleaved reads between multiple processes include:

dump
nfsd
qemu's posix aio backend
one of the iSCSI target mode implementations
a third-party volume manager

It is evident that this is not too uncommon of a programming paradigm, so Jens decided to take the close cooperator patch set into 2.6.30.  However, the implementation he merged was not quite ready for merging as it can cause some processes to be starved.  I've been working with him to fix the problem properly while preserving fairness.  In the end, the solution may involve a combination of detecting cooperating processes and sharing I/O contexts between them automatically.

This issue is my number one priority, and I will keep this bugzilla updated as progress is made.

* Note that shared I/O contexts (and the CLONE_IO flag) are not supported in RHEL 5, otherwise I would have made that fix available for the 5.4 release.

Comment 42 Janet 2009-05-05 11:44:42 UTC
I just wanted to mention something I have yet to see in this thread. I had the same, if not worse, results for both local and network processing on my new RHEL 5.2 server builds. A lot of time was spent acquiring test results and comparison testing against older RHEL 5.2 client machines that were performing at least twice as fast.
The differences were these: the seemingly faster machine was using the Client (or Desktop, if you prefer) variant, while the new machines were server-class machines running RHEL 5.2 Server software.
The other difference was RAID. The client machines did not have either hardware or software RAID.
The new machines were a RAID1 (mirror) with two 1TB SATA hard drives. We eliminated the RAID array from the equation and noticed immediate improvement. No real surprise there, but I did think it worth mentioning in case changing the slice_idle count or any of the other suggestions do NOT make a significant improvement for other sufferers of this bug.
It is only because I use clonezilla on a regular imaging schedule that I am comfortable removing the RAID from my configuration, as it was strictly for redundancy and to reduce downtime in case of drive failure.
My technician is still compiling his findings; I can post them if anyone is interested when he is done.

Comment 43 Menny Hamburger 2009-08-06 13:17:26 UTC
Has anyone thought about backporting CFQ-V1 over the RHEL5 elevator?

Comment 44 Menny Hamburger 2009-08-06 13:29:07 UTC
We have up to 15% degradation in read performance on RHEL5 CFQ compared to CFQ V1 (when concurrency is 8) when running on the same machine, from the same load generators.

Since this issue is so problematic, isn't it logical to try and port CFQ-V1 into RHEL5?

Comment 45 Jeff Moyer 2009-08-06 14:41:53 UTC
(In reply to comment #44)
> We have up to 15% degradation in read performance on RHEL5 CFQ compared to CFQ
> V1 (when concurrency is 8) when running on the same machine, from the same load
> generators. 
> 
> Since this issue is so problematic, isn't it logical to try an port CFQ-V1 into
> RHEL5?  

Moving backwards is not a solution with which I would be comfortable.  Making improvements upstream to take these workloads into account, and backporting those changes is the proper way forward.

Please understand that while CFQ v1 may work better for your workload, it may be worse for others.  Quantifying this is often difficult, as you don't hear about problems until they are encountered.

If you could give some specific information about your workload it will help me in verifying forthcoming solutions.

Thanks!

Comment 46 Mike Snitzer 2009-08-06 16:22:29 UTC
*** Bug 510861 has been marked as a duplicate of this bug. ***

Comment 47 Menny Hamburger 2009-08-09 09:40:59 UTC
Hi Jeffrey,

Each one of the I/O scheduling algorithms is good for different workloads.
Adding CFQ V1 is just like adding an additional scheduling algorithm to the list (you can even call it by another name) - it gives you another option for some workloads. I am not saying we should go backwards - on the contrary!! I think the code should support as many workloads as possible; if we cannot support some workload with the current code (in case the kernel misses some functionality in 2.6.18), I think we can provide an option that can.
Another option would be to add a parameter to CFQ that makes it work more like V1.

Best Regards,
Menny

Comment 48 Menny Hamburger 2009-09-03 09:29:03 UTC
I do not know if this is enough for emulating V1 over 2.6.18, but I have tested this issue with a degenerated version of CFQ V3, removing the CFQ io_context and slice code and leaving only what's necessary to make it work.
After setting slice_idle to 0, the results showed an insignificant (3%) increase in read performance for my workload, which uses vectored AIO heavily.
So I wonder if the problem is really in the scheduling algorithm.

M.

Comment 49 Jeff Moyer 2009-09-03 12:01:06 UTC
Hi, Menny,

Thanks for your testing.  I verified that the scheduling behaviour is the cause of the problem by analyzing blktrace data from test runs.  It's quite clear what is going on, actually.  Providing a CFQv1 scheduler is not something I will promote.  We really need to just address the problems in the current incarnation of CFQ.

Thanks,
Jeff

Comment 50 Menny Hamburger 2009-09-03 13:14:57 UTC
Hi Jeff,

Can you please elaborate on your findings, maybe using a blktrace as
an example?

Thanks,
Menny

Comment 51 Jeff Moyer 2009-09-03 13:23:57 UTC
Hi, Menny,

Forgive me, but I don't have a trace handy.  It's quite simple to reproduce using the directions above.  One other indication that the idling logic is the culprit is that if you set the elevator's slice_idle to 0, the problem goes away.

Now, if you collect a blktrace yourself (with slice_idle set to its default of 8), you will see that the biggest cause of delay is the 8ms latency between switching queues.  You may also notice that one or two threads may make much more progress than the others, which will introduce a lot of seeks and associated penalties.  However, the real issue is that 8ms delay between I/Os.

I hope this is helpful.

Cheers,
Jeff

Comment 52 Menny Hamburger 2009-09-03 13:36:17 UTC
Hey Jeff,

Thnx for the quick response; however as I said above, setting slice_idle to 0 only cleared out most of the problem - there is still a 15-20% reduction in performance compared to CFQ V1. 

Best Regards,
Menny

Comment 53 Jeff Moyer 2009-09-03 13:43:58 UTC
I'm sorry, Menny, somehow I missed that.  Could you provide more specifics regarding your workload?  It would be best if you could attach a reproducer.  If that's not possible, then blktrace output, both with slice_idle set to 8 and 0, should be enough to get me going.  Armed with that output, I should be able to let you know what I think the problem is.

Thanks!

Comment 54 Menny Hamburger 2009-09-08 12:20:49 UTC
Hi Jeff,

I ran a performance comparison between the noop scheduler in RHEL4 and RHEL5 and got the same 15-20% performance degradation - so I started going up the stack to ll_rw_blk.c (the blkdev code).
I saw that RHEL5 introduced a NAPI-like mechanism for bio that activates ksoftirqd to handle completion requests.
I did some blktraces and saw that when I do streaming, too many request completions come unordered - so I did some "googling".
I saw the following patch that eventually found its way into the vanilla kernel:
http://www.mail-archive.com/git-commits-head@vger.kernel.org/msg26151.html

This fix gives me an additional ~10% more performance in streaming.

I will continue the tests and let you know.

Cheers,
Menny

Comment 55 Jeff Moyer 2009-09-08 13:55:50 UTC
Hi, Menny,

Thanks for your continued testing on this.  It's certainly interesting to see that noop has also regressed.  I'm having a hard time figuring out how that patch could have made any difference in performance.  Its title indicates that it is purely cosmetic, and I can't see a change in the logic.  Did you also run blktrace on the patched kernel to see if the request completions are in order?  What I've noticed from run to run is that the ordering of completions may not be predictable.  So I wonder if you just got lucky.  Maybe this is an area worth exploring further, though.

Also, what is your workload?  The git commit listed above deals with barrier requests (well, ordered requests), so unless this is a file system workload, it wouldn't make any difference at all, I think.

Thanks again, Menny!

Jeff

Comment 56 Menny Hamburger 2009-09-09 06:40:28 UTC
Hi Jeff,

I think that the fact that noop also regressed points the finger at a component higher up the stack.
I have not run blktrace on the patched kernel yet, so the fact that it worked still needs some exploring - I am still trying to sort things out on my side to further understand this issue.
I am doing the tests over our file system (ExaStore), which is a distributed userland FS. The tests run on a machine with 8 processes doing mostly vectored AIO. The data is distributed evenly over all the LUNs, and the system includes its own read-ahead algorithm - so I suppose what you say regarding the barrier requests could be correct.

Another performance issue:
During my work on this issue, I stumbled over another performance regression that only added to the complexity - so maybe another bug can be opened on this too:
We are using Emulex (lpfc) HBAs, and the driver was changed to include a maximum scatter-gather segment threshold, which defaults to 64 (lpfc_sg_seg_cnt). Running the default LPFC driver on RHEL4 shows that sg_seg_cnt from the layer above does reach 256, which means that this default is too low.
Changing this to 256 did show a 5-8% performance boost on the same configuration and tests.

Happy to be of assistance,
Menny
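
(For anyone who wants to experiment with the lpfc setting Menny describes, the usual RHEL 5 route is a module option in /etc/modprobe.conf; a sketch, with 256 being the value reported to help:)

  # raise the lpfc scatter-gather segment limit; takes effect when the module is
  # next loaded, so rebuild the initrd and reboot (or unload/reload lpfc)
  echo "options lpfc lpfc_sg_seg_cnt=256" >> /etc/modprobe.conf
  mkinitrd -f /boot/initrd-$(uname -r).img $(uname -r)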

Comment 57 Menny Hamburger 2009-09-09 09:41:43 UTC
Hi Jeff,

Really sorry for the hassle!!!

It turns out that the main issue is really the LPFC parameter (described above) and not the completion issue (which had a much smaller effect on performance). The existence of many unordered completion requests may be worth examining, but it is not the real reason for our performance degradation.
In addition, noop performance did not degrade; it increased considerably, mainly because of the use of ksoftirqd, which eased the load on the processes doing the I/O.

Best Regards,
Menny

Comment 58 Jeff Moyer 2009-09-09 17:24:16 UTC
Hi, Menny,

I'm glad to hear that you found your main performance problem.  Would you mind filing a bug for the LPFC issue?  I was going to do it for you, but I don't have enough information to file a good report.  Could you be sure to include the sort of I/O you're driving and the sort of backend storage you think would be necessary to reproduce the performance drop?

Thanks!
Jeff

Comment 59 Ross Walker 2009-09-11 18:55:37 UTC
I know I am late to the party here...

It seems to me that the fix with the best outcome with the least modifications would be to have threads inherit the io context of the parent process.

As 99% of overlapping IO comes from threads of a parent, this should overcome the problem for most userland applications. It does mean that all kernel-based threads will be in the same context, so it would need to be load tested to make sure user I/O doesn't starve kernel I/O. If so, then set the default kernel parent context priority to the highest level (if it isn't already).

-Ross

Comment 60 Jeff Moyer 2009-10-30 21:20:48 UTC
I put together another test kernel that implements close cooperator detection logic, and merges the cfq_queue's associated with cooperating processes.  The result is that we get a good speedup.  In 100 runs of the read-test2 program (written to simulate the I/O pattern of the dump utility), these are the throughput numbers in MB/s:

Deadline:
Avg:       101.26907
Std. Dev.:  17.59767

CFQ:
Avg:       100.14914
Std. Dev.:  17.42747

Most of the runs saw 105MB/s, but there were some outliers in the 28-30MB/s range.  I looked into those cases, and found that the cause was that processes were scheduled in just the wrong order, introducing seeks into the workload.  Unfortunately, I haven't come up with a good solution for that particular problem, though I'll note that it affects other I/O schedulers as well.  Upstream does not exhibit this behaviour, and I believe it may be due to the rewritten readahead code, but I can't be certain without further investigation.

Without the patch set applied, the numbers for cfq were in the 7-10MB/s range.

I wasn't able to test nfs server performance as my test lab was experiencing some networking issue.  I'll get that testing underway once that problem is resolved.

I've uploaded a test kernel here:
  http://people.redhat.com/jmoyer/cfq-cc/

Please take it for a spin and report your results.  If you'd like to test on an architecture other than x86_64, just let me know and I'll kick off a build for whatever architecture is required.

Comment 61 Don Zickus 2009-11-10 16:50:09 UTC
in kernel-2.6.18-173.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 63 Jeff Moyer 2009-11-25 20:14:11 UTC
I posted one additional patch for this to rhkernel-list for review.

Comment 64 Don Zickus 2009-12-04 18:58:28 UTC
in kernel-2.6.18-177.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 67 Jeff Moyer 2010-01-11 14:44:53 UTC
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Some applications (including dump and nfsd) try to improve disk I/O performance by distributing I/O requests to multiple processes or threads.  When using the CFQ I/O scheduler, this application design actually hurt performance, as the I/O scheduler would try to provide fairness between the processes or threads.  This kernel contains a fix for this problem by detecting cooperating queues and merging them together.  If the queues stop issuing requests close to one another, then they are broken apart again.

Comment 69 Ryan Lerch 2010-02-02 04:49:57 UTC
Technical note updated. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-Some applications (including dump and nfsd) try to improve disk I/O performance by distributing I/O requests to multiple processes or threads.  When using the CFQ I/O scheduler, this application design actually hurt performance, as the I/O scheduler would try to provide fairness between the processes or threads.  This kernel contains a fix for this problem by detecting cooperating queues and merging them together.  If the queues stop issuing requests close to one another, then they are broken apart again.+Some applications (e.g. dump and nfsd) try to improve disk I/O performance by distributing I/O requests to multiple processes or threads. However, when using the Completely Fair Queuing (CFQ) I/O scheduler, this application design negatively affected I/O performance. In Red Hat Enterprise Linux 5.5, the kernel can now detect and merge cooperating queues, Additionally, the kernel can also detect if the queues stop cooperating, and split them apart again.

Comment 70 Chris Ward 2010-02-11 10:15:16 UTC
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this 
release that addresses your request. Please test and report back results 
here, by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update 
the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set 
this bug into NEED_INFO. If you encounter new defects or have additional 
patch(es) to request for inclusion, please clone this bug per each request
and escalate through your support representative.

Comment 71 Chris Ward 2010-02-24 14:58:15 UTC
@Reporters. The RHEL 5.5 Test Phase is coming to an end very soon. We would greatly appreciate your confirmation that the latest RHEL 5.5 Beta resolves this issue.

Please report back here as soon as possible. By Feb 26th would be most appreciated.

Thanks!

Comment 74 errata-xmlrpc 2010-03-30 07:18:43 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html

Comment 75 Jonathan Peatfield 2010-03-30 10:46:59 UTC
Trying to follow the link given in #74 results in a 404 not found error.  Is that a typo or is the errata not actually ready yet?

