Bug 448130
Summary: | 50-75% drop in CFQ read performance compared to RHEL 4.6+ | |
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Eric Sandeen <esandeen> |
Component: | kernel | Assignee: | Jeff Moyer <jmoyer> |
Status: | CLOSED ERRATA | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
Severity: | urgent | Docs Contact: | |
Priority: | high | ||
Version: | 5.3 | CC: | anders.furuhed, bengland, blouin, bmarson, bradleyjr, ctatman, cward, dmair, dshaks, dzickus, esandeen, geert.nijpels, jens.axboe, jfeeney, jlayton, jpirko, j.s.peatfield, jturner, juanino, k.georgiou, lwang, lwoodman, martinez, mennyh, mrkfact, pasteur, rick.beldin, rlerch, rwalker, rwheeler, sandeep_k_shandilya, sghosh, shyam_iyer, sputhenp, steved, syeghiay, tao, vvaldez, wwlinuxengineering |
Target Milestone: | rc | Keywords: | OtherQA |
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Some applications (e.g. dump and nfsd) try to improve disk I/O performance by distributing I/O requests to multiple processes or threads. However, when using the Completely Fair Queuing (CFQ) I/O scheduler, this application design negatively affected I/O performance. In Red Hat Enterprise Linux 5.5, the kernel can now detect and merge cooperating queues. Additionally, the kernel can detect when the queues stop cooperating and split them apart again.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2010-03-30 07:18:43 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 436004 | ||
Bug Blocks: | 391501, 483701, 499522, 525215, 533192, 541103, 557926 | ||
Attachments: |
Description
Eric Sandeen
2008-05-23 17:06:05 UTC
Created attachment 306534 [details]
seekwatcher graph of 8 vs. 1 nfsd thread
Here's a graph showing seeks & throughput for running:
iozone -s 2000000 -r 64 -f /mnt/nfs/testfile -i 1 -w
from a client, with the server running 8 and 1 nfsd threads, just to show at a
high level what the seek & throughput situation looks like.
So just a summary of the simplest configuration to recreate this read issue: run a RHEL5 server booted with 1 CPU and 2 nfsd's. This yields a 6X drop in read performance.

Barry

Attached below are two systemtap traces that show the NFS server is spending 2 to 3 times longer in reads when there are two nfsd threads versus one.

Created attachment 307128 [details]
Systemtap trace with one nfsd process
Created attachment 307130 [details]
Systemtap trace with two nfsd processes
Created attachment 307134 [details]
Systemtap trace of ext3 with two nfsd process
Created attachment 307135 [details]
Systemtap trace of ext3 with one nfsd process
(In reply to comment #1)
> Created an attachment (id=306534) [edit]
> seekwatcher graph of 8 vs. 1 nfsd thread
>
> Here's a graph showing seeks & throughput for running:
>
> iozone -s 2000000 -r 64 -f /mnt/nfs/testfile -i 1 -w
>
> from a client, with the server running 8 and 1 nfsd threads, just to show at a
> high level what the seek & throughput situation looks like.

One interesting note is that the writes are much more spread out in the 8 thread case, if I read the graph correctly. Can we dump the list of blocks allocated for /mnt/nfs/testfile? Bad layout in the write phase can show up as a huge drop in the read throughput...

Ric: re: comment #9; the file I was reading here was actually created on the server, so it was fairly contiguous (about as contiguous as ext3 could make it, anyway...). I can double-check the layout if you like. But the 1- and 8-thread graphs are actually reading the very same file with the very same layout; I used the -w switch to keep the previously-written file in place through both runs. So both cases saw the same degree of fragmentation.

-Eric

From the seekwatcher plot, it did look like there were distinct write pattern differences in the green and light blue plot lines, but if the test reuses the same file, I must be misreading ;-)

thanks,
ric

Created attachment 307238 [details]
updated seekwatcher trace
The previous seekwatcher graph was buggy; the stair-steppiness was an artifact
of a too-coarse grid for the graph generation.
Just to completely rule out any server-side fragmentation as an issue, I tested
reading the file from xfs which had the 2G file in a single contiguous extent.
This is the resulting (fixed) seekwatcher graph for 8 vs. 1 nfsd threads.
Thanks,
-Eric
Putting a systemtap probe in mpage_readpages (which is what ext3_readpages calls), you can see a very different pattern in how many pages are requested. The following tables show the number of requested pages (#pages), how many times mpage_readpages was called (#Calls), and the total amount of time (in nanoseconds) spent.

One NFSd process:

    (#pages)             #Calls   Total ns
    mpage_readpages(32)  32767    3776096302
    mpage_readpages(16)  2        18598287
    Total Calls: 32769

Two NFSd processes:

    (#pages)             #Calls   Total ns
    mpage_readpages(32)  6783     134188214
    mpage_readpages(31)  9        171383
    mpage_readpages(30)  1        8091
    mpage_readpages(29)  2        38305
    mpage_readpages(28)  14       8638196
    mpage_readpages(27)  1        7189
    mpage_readpages(24)  404      8373816
    mpage_readpages(23)  1        6801
    mpage_readpages(20)  3        24430710
    mpage_readpages(16)  4836     59248733
    mpage_readpages(15)  10       98711
    mpage_readpages(12)  9        8650390
    mpage_readpages(11)  1        1285
    mpage_readpages(10)  4        23062
    mpage_readpages(9)   3        18580
    mpage_readpages(8)   90658    2789966293
    mpage_readpages(7)   74       445143
    mpage_readpages(6)   13       58773
    mpage_readpages(5)   12       39802
    mpage_readpages(4)   52       274642178
    mpage_readpages(3)   10       17733454
    mpage_readpages(2)   15       43661
    mpage_readpages(1)   16       44958
    Total Calls: 102931

So there is quite a different pattern when one or two nfsd's are used... This may be how pages were always allocated and we never noticed it before since it didn't cause a problem... but now it seems to...
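Read in page-size terms, the dominant rows of those two tables tell the story. A small worked example (the 4 KB page size is an assumption, the x86_64 default; the call counts are copied from the systemtap output above):

    /* Worked numbers for the dominant rows of the two tables above.
     * Assumes 4 KB pages (x86_64 default); counts from the stap data. */
    #include <stdio.h>

    int main(void)
    {
        const int page_kb = 4;

        /* one nfsd: 32767 of 32769 calls asked for 32 pages */
        printf("one nfsd:  typical read = %3d KB\n", 32 * page_kb);
        /* two nfsds: 90658 of 102931 calls asked for only 8 pages */
        printf("two nfsds: typical read = %3d KB\n", 8 * page_kb);
        return 0;
    }

If that reading is right, splitting the stream across two nfsd threads shrinks the typical readpages request from 128 KB to 32 KB, roughly quadrupling the number of requests the I/O scheduler has to arbitrate.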
Dell has found that the workaround proposed in the original bugzilla (bz436004 comment #67) decreases the effects of this read side problem. Thus, I have copied the comment that details the workaround here for posterity. Thanks go out to Ben England at ibrix for providing it.

"I have been using a workaround described below, and have observed no regression in RHEL5.1 single-threaded NFS reads when using this workaround. This seems consistent with the preceding results in this bug report -- i.e. 1 nfsd thread is much faster.

The workaround is to add this line to the /etc/rc.local boot script and then to run that script:

    # for n in /sys/block/sd*/queue/iosched/slice_idle ; do echo 1 > $n ; done

This parameter did not exist in the RHEL4 CFQ I/O scheduler. A similar effect can be achieved with use of the deadline or noop scheduler, but for writes we have seen better results with CFQ.

The purpose of this workaround is to minimize overhead imposed by CFQ when multiple threads are reading from the same file. NFS uses a thread pool to service RPCs, so that a sequential single-thread read at the application layer becomes a multi-thread read at the NFS server. CFQ treats threads as if they were application processes, but in fact they are not here, so the default delay of 8 ms between switching to a different thread's requests, represented by the slice_idle block device tuning parameter, is unreasonable. Others have seen this problem, including the author of CFQ: http://linux.derkeiler.com/Mailing-Lists/Kernel/2008-05/msg05066.html

More research needs to be done on the effect of setting this parameter to zero, but until we do a systematic test of all known workloads with this value I would not recommend it as a general solution.

Reproducer: A 43% improvement, from 24.7 to 35.4 MB/s, was observed using this simple test done with 2 hosts running RHEL5.1 connected by a 1-Gb Ethernet link. The NFS server exported a partition on the system disk, /dev/sda3, mounted as an ext3 file system. No NFS or ext3 tuning was used. The workload was:

    # dd of=/dev/null bs=64k count=16k if=/mnt/nfsext3/f"

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

FWIW: I ran a test with a rhel5 server, F9 client. Created a file on the server with:

    iozone -s 2000000 -r 64 -f /mnt/nfs/testfile -i 0 -w

and read from the client with:

    iozone -s 2000000 -r 64 -f /mnt/nfs/testfile -i 1 -w

dropped caches on the server in between tests. Based on comment #67 in the parent bug 436004 I decided to try different io schedulers on the server:

                    KB  reclen    read   reread
    cfq:       2000000      64   26694    98303
    noop:      2000000      64   43093    98142
    anticip:   2000000      64   43409    98202
    deadline:  2000000      64   43423    98372

... so this is certainly looking cfq-related.

-Eric

Doing the same test locally on the server, not over nfs:

                    KB  reclen    read   reread
    cfq:       2000000      64   56761  1945684
    noop:      2000000      64   88780  1890856
    anticip:   2000000      64   89262  1906614
    deadline:  2000000      64   90563  1893523

I suppose the next logical test would be to compare nfs perf w/ one of the other schedulers between RHEL4 and RHEL5, to see if there is any nfs component of the problem at all.

-Eric

Thanx for doing this work, Eric! I wouldn't be surprised if there is an NFS component, but I think that it would be good to get the cfq issues addressed because that's the default.

I ran local & nfs tests on the same box as previous, but with a RHEL4 kernel installed (on a RHEL5 root). Summary follows, with RHEL5 results restated.

-RHEL4-

Local:
                    KB  reclen    read   reread
    cfq:       2000000      64   76965  2019404
    noop:      2000000      64   76461  2040349
    anticip:   2000000      64   76728  2030559
    deadline:  2000000      64   78802  2082506

NFS:
                    KB  reclen    read   reread
    cfq:       2000000      64   44627   100130
    noop:      2000000      64   44510    97712
    anticip:   2000000      64   43739    98699
    deadline:  2000000      64   43937    99337

-RHEL5-

Local:
                    KB  reclen    read   reread
    cfq:       2000000      64   56761  1945684
    noop:      2000000      64   88780  1890856
    anticip:   2000000      64   89262  1906614
    deadline:  2000000      64   90563  1893523

NFS:
                    KB  reclen    read   reread
    cfq:       2000000      64   26694    98303
    noop:      2000000      64   43093    98142
    anticip:   2000000      64   43409    98202
    deadline:  2000000      64   43423    98372

at least from this test, there does not seem to be an NFS performance regression, it appears to be all CFQ's doing.

CFQ is pretty clearly hurting here. For a large/fast array I'd suggest noop in any case. But we need to get this fixed, I will try to find time to look into it soon.

Please have a look at the below code path that is employed by cfq, which makes me believe that this is a cfq design issue. "cfq_select_queue" calls the following snippet of code, so if there are no further requests then it will wait on idle, expecting some other back-to-back requests.

    /*
     * if queue has requests, dispatch one. if not, check if
     * enough slice is left to wait for one
     */
    if (!RB_EMPTY_ROOT(&cfqq->sort_list))
            goto keep_queue;
    else if (cfq_cfqq_dispatched(cfqq)) {
            cfqq = NULL;
            goto keep_queue;
    } else if (cfq_cfqq_class_sync(cfqq)) {
            if (cfq_arm_slice_timer(cfqd, cfqq))
                    return NULL;
    }

After completing a request in "cfq_completed_request", it again waits on idle if there are no further requests, until the timer expires.
    /*
     * If this is the active queue, check if it needs to be expired,
     * or if we want to idle in case it has no pending requests.
     */
    if (cfqd->active_queue == cfqq) {
            if (time_after(now, cfqq->slice_end))
                    cfq_slice_expired(cfqd, 0);
            else if (sync && RB_EMPTY_ROOT(&cfqq->sort_list))
                    cfq_arm_slice_timer(cfqd, cfqq);
    }

And inside cfq_arm_slice_timer, here is the below snippet of code where it wants to idle for seeks.

    sl = min(cfqq->slice_end - 1, (unsigned long) cfqd->cfq_slice_idle);

    /*
     * we don't want to idle for seeks, but we do want to allow
     * fair distribution of slice time for a process doing back-to-back
     * seeks. so allow a little bit of time for him to submit a new rq
     */
    if (sample_valid(cic->seek_samples) && CIC_SEEKY(cic))
            sl = min(sl, msecs_to_jiffies(2));

    mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
    return 1;

Now, all of this is per the design of the algorithm (fair queueing), which assumes that there may be back-to-back I/O requests even if there are no pending requests currently. Since nfsd spawns multiple kernel threads, each thread has its own request queue. So, if the I/O scheduler is scheduling one nfsd thread, all the other threads may be waiting on I/O. We see that the problem goes away if the slice idle is reduced, because that makes the algorithm not wait on one thread for idle time. This is a good algorithm for not switching context between, say, a process A and a process B (completely different processes). But in the case of nfs, which spawns multiple nfsd threads that are doing the same task through different queues, this algorithm defeats the purpose of keeping the idle time. I am thinking that probably the cfq algorithm needs to maintain thread groups (that perform the same function and use multiple threads to increase parallelism) to decide on idle time. Please review and check if the analysis needs to be taken to upstream.
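A quick back-of-the-envelope number supports this analysis. The figures below are assumptions, not measurements: one 64 KB request per dispatch turn (as in the dd and iozone tests above), with the full 8 ms default slice_idle wasted between turns because the "next" sequential request always belongs to the other nfsd thread:

    /* Idle-bound throughput ceiling under the stated assumptions. */
    #include <stdio.h>

    int main(void)
    {
        double bytes_per_turn = 64.0 * 1024;   /* one 64 KB request */
        double idle_seconds   = 0.008;         /* slice_idle default: 8 ms */

        printf("ceiling ~= %.1f MB/s\n",
               bytes_per_turn / idle_seconds / 1e6);   /* ~8.4 MB/s */
        return 0;
    }

An idle-bound ceiling of roughly 8 MB/s is in the same range as the 7-10 MB/s figures reported later in this bug for unpatched CFQ.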
Shyam, that seems like a reasonable assessment, however: in comment #23 I've shown that even local, single-threaded IO is suffering under cfq. I think that before we get to the multi-threaded nfsd issues, we need to sort out what's going on with simple, local IO ...

I am sure that it is going to still wait on I/O even if there is only one I/O thread. slice_idle will cause the thread to wait on I/O, and do the same with other kernel threads. It is just that we reduced the number of kernel threads waiting on I/O by making the environment simple, local I/O based.

Shyam, out of curiosity, can I ask what sort of storage you've been testing on? Thanks,
-Eric

The testing is on SAS disks connected to a PERC 5/i.

Based on additional testing and code analysis similar to Shyam's, IBRIX is currently recommending slice_idle=0 for IBRIX FS devices. When we use this setting, there is no degradation in read throughput from RHEL4, and it is about 20% better than slice_idle=1 (previous recommendation above). It appears that EMC PowerPath V5 also used this for all /dev/emcpower* devices on RHEL5, but you'd have to ask them. This does not mean that slice_idle should be removed from the scheduler.

Can I get testing feedback on the following patch, please? If you require a kernel build, please let me know against which kernel version you would like the patch applied. https://bugzilla.redhat.com/attachment.cgi?id=319934 Cheers!

(In reply to comment #31)
> Based on additional testing and code analysis similar to Shyam's, IBRIX is
> currently recommending slice_idle=0 for IBRIX FS devices. When we use this
> setting, there is no degradation in read throughput from RHEL4, and it is about
> 20% better than slice_idle=1 (previous recommendation above).
>
> It appears that EMC PowerPath V5 also used this for all /dev/emcpower* devices
> on RHEL5, but you'd have to ask them.
>
> This does not mean that slice_idle should be removed from the scheduler.

RHEL 4's cfq implementation was quantum-based, not time-based. The algorithm went like so:

  o Pick the most important queue with I/O pending
  o submit X number of I/Os
  o repeat

So, after a queue goes empty, the next queue is selected; there is no idle window. Setting slice_idle to zero, therefore, gets you very close to the behaviour that was witnessed under RHEL 4.

Now, idle slices do help a number of workloads. In this case, however, it sounds like the nfsd threads are interleaving I/O to the same file, and thus the idle slice is really hurting performance (as you're now waiting up to 8ms between each I/O). One solution to this problem is to detect multiple processes issuing I/O in this manner, and to switch queues instead of waiting for more requests on the current active queue. That is precisely what the patch in comment #31 implements. I'm still waiting for testing feedback.

Moving to 5.4

(In reply to comment #34)
> I'm still waiting for testing feedback.

here is the feedback, I see a slight improvement... but not yet to the same performance level as the deadline/noop scheduler. Here are the results with a rhel 5.2 client, a rhel 5.2 server, and a rhel 5.2 cfq-patched server. filesize=4G, ramsize=2G; client and server side caching has been taken into account by rebooting the server over every run.

patch applied, iosched=cfq:

    [root@RB-C2 BUILD]# dd if=/mnt/nfs/bigfile of=/dev/null
    8388608+0 records in
    8388608+0 records out
    4294967296 bytes (4.3 GB) copied, 97.1749 seconds, 44.2 MB/s

patch applied, iosched=deadline:

    [root@RB-C2 BUILD]# dd if=/mnt/nfs/bigfile of=/dev/null
    8388608+0 records in
    8388608+0 records out
    4294967296 bytes (4.3 GB) copied, 73.773 seconds, 58.2 MB/s

patch NOT applied, iosched=cfq, rhel 5.2 kernel:

    [root@RB-C2 BUILD]# dd if=/mnt/nfs/bigfile of=/dev/null
    8388608+0 records in
    8388608+0 records out
    4294967296 bytes (4.3 GB) copied, 154.909 seconds, 27.7 MB/s

patch NOT applied, iosched=deadline, rhel 5.2 kernel:

    [root@RB-C2 BUILD]# dd if=/mnt/nfs/bigfile of=/dev/null
    8388608+0 records in
    8388608+0 records out
    4294967296 bytes (4.3 GB) copied, 76.7484 seconds, 56.0 MB/s

Updating PM score.

This work will not make the 5.4 release. When the problem was initially reported, I talked to Jens Axboe about it, and he seemed receptive to the idea of adding some code to CFQ to detect processes interleaving I/Os. When I came up with a first patch for this, he then suggested that we would be better off solving the problem in the applications themselves, by having the applications explicitly share I/O contexts (using sys_clone and the CLONE_IO flag*). I wrote a patch for dump(8) to do this very thing, and it did solve the problem. I also have preliminary patches for nfsd. However, the list of applications suffering from this kept growing. The applications I know of that perform interleaved reads between multiple processes include:

  o dump
  o nfsd
  o qemu's posix aio backend
  o one of the iSCSI target mode implementations
  o a third-party volume manager

It is evident that this is not too uncommon of a programming paradigm, so Jens decided to take the close cooperator patch set into 2.6.30. However, the implementation he merged was not quite ready, as it can cause some processes to be starved. I've been working with him to fix the problem properly while preserving fairness. In the end, the solution may involve a combination of detecting cooperating processes and sharing I/O contexts between them automatically. This issue is my number one priority, and I will keep this bugzilla updated as progress is made.

* Note that shared I/O contexts (and the CLONE_IO flag) are not supported in RHEL 5, otherwise I would have made that fix available for the 5.4 release.
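To make the shared-I/O-context idea concrete, here is a minimal, hypothetical sketch; it is illustrative only, not the actual dump(8) or nfsd patch, and it assumes a kernel with CLONE_IO support (2.6.25+ upstream), which, per the footnote above, RHEL 5 lacks:

    /* Two processes read alternating 64 KB chunks of one file, sharing
     * an I/O context via CLONE_IO so CFQ sees a single queue instead of
     * two "cooperating" ones.  Illustrative sketch only. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/wait.h>

    #ifndef CLONE_IO
    #define CLONE_IO 0x80000000   /* from linux/sched.h, kernel >= 2.6.25 */
    #endif

    #define CHUNK (64 * 1024)

    static const char *path;

    static int reader(void *arg)
    {
            long stride = (long) arg;   /* 0 = even chunks, 1 = odd chunks */
            off_t n;
            char *buf = malloc(CHUNK);
            int fd = open(path, O_RDONLY);

            if (fd < 0 || !buf)
                    return 1;
            /* read every other chunk: this process's half of the file */
            for (n = stride; pread(fd, buf, CHUNK, n * CHUNK) > 0; n += 2)
                    ;
            close(fd);
            return 0;
    }

    int main(int argc, char **argv)
    {
            char *stack = malloc(64 * 1024);

            if (argc < 2 || !stack)
                    return 1;
            path = argv[1];
            /* CLONE_IO shares the parent's io_context with the child */
            if (clone(reader, stack + 64 * 1024,
                      CLONE_IO | SIGCHLD, (void *) 1) < 0)
                    return 1;
            reader((void *) 0);
            wait(NULL);
            return 0;
    }

With both readers in one io_context, the intent is that CFQ accounts their interleaved requests to a single queue and services them as one sequential stream, instead of idling up to 8 ms between queue switches.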
I just wanted to mention something I have yet to see in this thread. I had the same, if not worse, results for both local and network processing on my new RHEL 5.2 server builds. A lot of time was spent acquiring test results and comparison testing against older RHEL 5.2 client machines that were performing at least twice as fast. The differences were these: the seemingly faster machines were running Client (or Desktop, if you prefer), while the new machines were server-class machines running RHEL 5.2 server software. The other difference was RAID. The client machines did not have either hardware or software RAID; the new machines were a RAID1 (mirror) with 2 1TB SATA hard drives. We eliminated the RAID array from the equation and noticed immediate improvement. No real surprise there, but I did think it worth mentioning in case changing the slice_idle count or any of the other suggestions do NOT make a significant improvement for other sufferers of this bug. It is only because I use clonezilla on a regular imaging schedule that I am comfortable in removing the RAID from my configuration, as it was strictly for redundancy and to reduce downtime in case of drive failure. My technician is still compiling his findings; I can post them if anyone is interested when he is done.

Has anyone thought about backporting CFQ-V1 over the RHEL5 elevator? We have up to 15% degradation in read performance on RHEL5 CFQ compared to CFQ V1 (when concurrency is 8) when running on the same machine, from the same load generators. Since this issue is so problematic, isn't it logical to try and port CFQ-V1 into RHEL5?

(In reply to comment #44)
> We have up to 15% degradation in read performance on RHEL5 CFQ compared to CFQ
> V1 (when concurrency is 8) when running on the same machine, from the same load
> generators.
>
> Since this issue is so problematic, isn't it logical to try and port CFQ-V1 into
> RHEL5?

Moving backwards is not a solution with which I would be comfortable. Making improvements upstream to take these workloads into account, and backporting those changes, is the proper way forward. Please understand that while CFQ v1 may work better for your workload, it may be worse for others. Quantifying this is often difficult, as you don't hear about problems until they are encountered. If you could give some specific information about your workload, it will help me in verifying forthcoming solutions. Thanks!

*** Bug 510861 has been marked as a duplicate of this bug. ***

Hi Jeffrey, Each one of the I/O scheduling algorithms is good for different workloads.
Adding CFQ V1 is just like adding an additional scheduling algorithm to the list (you can even call it by another name) - it gives you another option on some workloads. I am not saying we should go backwards - on the contrary!! I think the code should support as many workloads as possible - if we cannot support some workload with the current code (in case the kernel misses some functionality in 2.6.18), I think we can provide an option that can. Another option can be to add a parameter to CFQ that will make it work more like V1. Best Regards, Menny

I do not know if this is enough for emulating V1 over 2.6.18, but I have tested this issue through a degenerated version of CFQ V3, removing the CFQ io_context and slice code, leaving only what's necessary to make it work. After setting slice_idle to 0, the results showed an insignificant (3%) increase in read performance for my workload, which uses vectored AIO heavily. So I wonder if the problem is really in the scheduling algorithm. M.

Hi, Menny, Thanks for your testing. I verified that the scheduling behaviour is the cause of the problem by analyzing blktrace data from test runs. It's quite clear what is going on, actually. Providing a CFQv1 scheduler is not something I will promote. We really need to just address the problems in the current incarnation of CFQ. Thanks, Jeff

Hi Jeff, Can you please elaborate on your findings? Maybe using a blktrace as an example. Thanks, Menny

Hi, Menny, Forgive me, but I don't have a trace handy. It's quite simple to reproduce using the directions above. One other indication that the idling logic is the culprit is that if you set the elevator's slice_idle to 0, the problem goes away. Now, if you collect a blktrace yourself (with slice_idle set to its default of 8), you will see that the biggest cause of delay is the 8ms latency between switching queues. You may also notice that one or two threads may make much more progress than the others, which will introduce a lot of seeks and associated penalties. However, the real issue is that 8ms delay between I/Os. I hope this is helpful. Cheers, Jeff

Hey Jeff, Thnx for the quick response; however, as I said above, setting slice_idle to 0 only cleared out most of the problem - there is still a 15-20% reduction in performance compared to CFQ V1. Best Regards, Menny

I'm sorry, Menny, somehow I missed that. Could you provide more specifics regarding your workload? It would be best if you could attach a reproducer. If that's not possible, then blktrace output, both with slice_idle set to 8 and 0, should be enough to get me going. Armed with that output, I should be able to let you know what I think the problem is. Thanks!

Hi Jeff, I ran a performance comparison between the noop scheduler in RHEL4 and RHEL5 and got the same 15-20% performance degradation - so I started going up the stack to ll_rw_blk.c (the blkdev code). I saw that RHEL5 introduced a NAPI-like mechanism for bio that activates ksoftirqd to handle completion requests. I did some blktraces and saw that when I do streaming, too many request completions come unordered - so I did some "googling". I saw the following patch that eventually found its way into the Vanilla: http://www.mail-archive.com/git-commits-head@vger.kernel.org/msg26151.html This fix gives me an additional ~10% more performance in streaming. I will continue the tests and let you know. Cheers, Menny

Hi, Menny, Thanks for your continued testing on this. It's certainly interesting to see that noop has also regressed.
I'm having a hard time figuring out how that patch could have made any difference in performance. Its title indicates that it is purely cosmetic, and I can't see a change in the logic. Did you also run blktrace on the patched kernel to see if the request completions are in order? What I've noticed from run to run is that the ordering of completions may not be predictable. So I wonder if you just got lucky. Maybe this is an area worth exploring further, though. Also, what is your workload? The git commit listed above deals with barrier requests (well, ordered requests), so unless this is a file system workload, it wouldn't make any difference at all, I think. Thanks again, Menny! Jeff

Hi Jeff, I think that the fact that noop also regressed points the finger at a component higher in the stack. I have not run blktrace on the patch yet, so the fact that it worked still needs some exploring - I am still trying to sort things out on my side to further understand this issue. I am doing the tests over our file system (ExaStore), which is a distributed userland FS. The test runs over a machine that includes 8 processes, doing mostly vectored AIO. The data is distributed evenly over all the LUNs, and the system includes its own read-ahead algorithm - so I suppose what you say regarding the barrier requests could be correct.

Another performance issue: during my work on this issue, I stumbled over another performance regression that only added to the complexity - so maybe another bug can be opened on this too. We are using Emulex (lpfc) HBAs, and the driver was changed to include a maximum scatter-gather segment threshold, which defaults to 64 (lpfc_sg_seg_cnt). Running the default LPFC driver on RHEL4 shows that sg_seg_cnt from the above layer does reach 256, which means that this default is too low. Changing this to 256 did show a 5-8% performance boost on the same configuration and tests. Happy to be of assistance, Menny

Hi Jeff, Really sorry for the hassle!!! It turns out that the main issue is really the LPFC parameter (described above) and not the completion issue (which had a much smaller effect on the performance). The existence of many unordered completion requests may be examined, but is not the real reason for our performance degradation. In addition - noop performance did not degrade, it increased considerably, mainly because of the use of ksoftirqd, which eased the load on the processes doing the I/O. Best Regards, Menny

Hi, Menny, I'm glad to hear that you found your main performance problem. Would you mind filing a bug for the LPFC issue? I was going to do it for you, but I don't have enough information to file a good report. Could you be sure to include the sort of I/O you're driving and the sort of backend storage you think would be necessary to reproduce the performance drop? Thanks! Jeff

I know I am late to the party here... It seems to me that the fix with the best outcome and the least modifications would be to have threads inherit the I/O context of the parent process. As 99% of overlapping IO comes from threads of a parent, this should overcome the problem for most userland applications. It does mean that all kernel-based threads will be in the same context, so it would need to be load tested to make sure user-based IO doesn't starve kernel IO. If so, then set the default kernel parent context priority to the highest level (if it isn't already).

-Ross
I put together another test kernel that implements close cooperator detection logic and merges the cfq_queues associated with cooperating processes. The result is that we get a good speedup. In 100 runs of the read-test2 program (written to simulate the I/O pattern of the dump utility), these are the throughput numbers in MB/s:

    Deadline:  Avg: 101.26907   Std. Dev.: 17.59767
    CFQ:       Avg: 100.14914   Std. Dev.: 17.42747

Most of the runs saw 105MB/s, but there were some outliers in the 28-30MB/s range. I looked into those cases, and found that the cause was that processes were scheduled in just the wrong order, introducing seeks into the workload. Unfortunately, I haven't come up with a good solution for that particular problem, though I'll note that it affects other I/O schedulers as well. Upstream does not exhibit this behaviour, and I believe it may be due to the rewritten readahead code, but I can't be certain without further investigation. Without the patch set applied, the numbers for cfq were in the 7-10MB/s range.

I wasn't able to test nfs server performance as my test lab was experiencing some networking issues. I'll get that testing underway once that problem is resolved. I've uploaded a test kernel here: http://people.redhat.com/jmoyer/cfq-cc/ Please take it for a spin and report your results. If you'd like to test on an architecture other than x86_64, just let me know and I'll kick off a build for whatever architecture is required.
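read-test2 itself is not attached to this bug; as a rough stand-in, the access pattern it is described as simulating can be sketched like this (a hypothetical program with made-up names and parameters, not the actual test source): several processes read one file in alternating chunks, so the file is consumed sequentially overall while each process on its own looks gappy to the scheduler.

    /* Hypothetical stand-in for a dump-style interleaved read test
     * (NOT the actual read-test2 source): NPROC forked readers each
     * take every NPROC-th 64 KB chunk of the file named in argv[1]. */
    #define _FILE_OFFSET_BITS 64
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/wait.h>

    #define NPROC 4
    #define CHUNK (64 * 1024)

    int main(int argc, char **argv)
    {
            int i;

            if (argc < 2)
                    return 1;
            for (i = 0; i < NPROC; i++) {
                    if (fork() == 0) {
                            char *buf = malloc(CHUNK);
                            int fd = open(argv[1], O_RDONLY);
                            off_t n;

                            if (fd < 0 || !buf)
                                    _exit(1);
                            /* reader i handles chunks i, i+NPROC, ... */
                            for (n = i;
                                 pread(fd, buf, CHUNK, n * CHUNK) > 0;
                                 n += NPROC)
                                    ;
                            _exit(0);
                    }
            }
            for (i = 0; i < NPROC; i++)
                    wait(NULL);
            return 0;
    }

Without the close cooperator logic, each reader sits in its own cfq_queue and CFQ idles on a queue whose next sequential chunk actually belongs to a sibling; with the queues merged, the combined stream can be serviced as one sequential read.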
in kernel-2.6.18-173.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.

I posted one additional patch for this to rhkernel-list for review.

in kernel-2.6.18-177.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Some applications (including dump and nfsd) try to improve disk I/O performance by distributing I/O requests to multiple processes or threads. When using the CFQ I/O scheduler, this application design actually hurt performance, as the I/O scheduler would try to provide fairness between the processes or threads. This kernel contains a fix for this problem by detecting cooperating queues and merging them together. If the queues stop issuing requests close to one another, then they are broken apart again.

Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-Some applications (including dump and nfsd) try to improve disk I/O performance by distributing I/O requests to multiple processes or threads. When using the CFQ I/O scheduler, this application design actually hurt performance, as the I/O scheduler would try to provide fairness between the processes or threads. This kernel contains a fix for this problem by detecting cooperating queues and merging them together. If the queues stop issuing requests close to one another, then they are broken apart again.
+Some applications (e.g. dump and nfsd) try to improve disk I/O performance by distributing I/O requests to multiple processes or threads. However, when using the Completely Fair Queuing (CFQ) I/O scheduler, this application design negatively affected I/O performance. In Red Hat Enterprise Linux 5.5, the kernel can now detect and merge cooperating queues. Additionally, the kernel can detect when the queues stop cooperating and split them apart again.

~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here, by March 3rd 2010 (2010-03-03) or sooner. Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value. If you encounter any issues while testing, please describe them and set this bug into NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug per each request and escalate through your support representative.

@Reporters. RHEL 5.5 Test Phase is coming to an end very soon. We would greatly appreciate your confirmation that the latest RHEL 5.5 Beta resolves this issue. Please report back here as soon as possible. By Feb 26th would be most appreciated. Thanks!

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0178.html

Trying to follow the link given in #74 results in a 404 not found error. Is that a typo or is the errata not actually ready yet?