Bug 674147
Summary: SPECsfs NFS V3 workload on RHEL6 running kernels 2.6.32-85 have a massive performance regression due to compact-kswap behavior

Product: Red Hat Enterprise Linux 6
Component: kernel (sub component: NFS)
Version: 6.1
Target Milestone: rc
Target Release: 6.1
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Keywords: Regression
Reporter: Barry Marson <bmarson>
Assignee: Andrea Arcangeli <aarcange>
QA Contact: Filesystem QE <fs-qe>
CC: aarcange, dshaks, eteo, jmalanik, kzhang, lwoodman, riel, xzhou
Fixed In Version: kernel-2.6.32-124.el6
Doc Type: Bug Fix
Last Closed: 2011-05-19 12:53:39 UTC
Bug Blocks: 678209
Ok, reverting it is no big deal. One reason I pushed it is that this is what is in 2.6.38. The compaction code of 2.6.38 supports an async and optimistic mode that may avoid the problem. But if something's wrong in the current compaction, there's a chance it's also wrong in 2.6.38. I'd like compaction not to lead to this worse behavior. So there are various possibilities:

1) kswapd is the problem
2) the congestion_wait is the problem
3) compaction is the problem and it's too aggressive, increasing latency and looping too long before returning
4) migrate lock_page without async-migrate is the problem (misses upstream logic)

I'll create 4 kernels to test (the 4th would be the closest to upstream 2.6.38) to find out what the culprit is. If we have any doubt, we can always decide to back out the patch and have peace of mind and no hurry, but I will need to find exactly what it is for future kernels, to be sure it doesn't happen again.

Just as a sanity checkpoint, lwoodman built me a 109 kernel without your patch and there was no regression. Just let me know when they are ready.

Barry

This is the first kernel to test (let's call it "K1"): 114 plus the compaction async updates matching upstream and a few other pending fixes. This should be closest to what I intend to submit for merging in 6.1 (no new features, just fixes from upstream). NOTE: it's not supposed to fix the problem, though it may mitigate it; I'd like to be sure the problem also happens with compaction async and with the compaction code closer to current upstream: https://brewweb.devel.redhat.com/taskinfo?taskID=3095996

If you can test async-aio as well, that would help, as I included the pgd_lock deadlock fix for bug #671477. This is the kernel for comment #3: https://brewweb.devel.redhat.com/taskinfo?taskID=3096320

This is the kernel we can call K2. It is like K1 in comment #7 (described in comment #3), but with compaction disabled for jumbo frames in direct reclaim.
https://brewweb.devel.redhat.com/taskinfo?taskID=3101916

This we can call K3; it is like K2 in comment #8, but with compaction disabled for jumbo frames in kswapd too (I hope there are no harmless gcc warnings that will cause the build to fail; I hand-edited stuff, as this is a bug-localizing effort): https://brewweb.devel.redhat.com/taskinfo?taskID=3101941

This is K4, which is like K1 but with compaction disabled for jumbo frames only in kswapd (if K3 builds, this should build too; otherwise I'll resubmit both): http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3101963

I have begun testing K1. I should have results in the next 3 hours.

Barry

This is K5. It's like K1, but it disables sync migration in direct compaction, so it should provide better latency than K1; I'd like to know if that is measurable. http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3102168

The one that should work for sure, as well as RHEL6, is K3. K1 and K5 hopefully will already work better than the current 6.1 thanks to the async_migrate backport from upstream. We need to find out if K1 (or K5) are still slower than the original 6.0 performance (6.0 didn't use compaction for the jumbo frames).

Created attachment 478116 [details]
profiling on BIG with -115-K1 kernel using ext4.
K1 testing completed ... ran with profiling and got the following numbers
Req Actual Resp
Rate Rate Time msec/op
----------------------------
2000 1981 0.5
4000 4033 0.6
6000 6086 0.7
8000 8097 1.2
10000 10097 1.9
12000 12187 3.7
14000 14229 3.8
16000 16257 5.4
18000 18349 5.3
Attached is the profiling information >= 1% cycles from each point.
Looks like the K2 through K5 kernels didn't build. Awaiting new kernels :(

Barry

Seeing the results of K1, it looks like the problem was the synchronous migration, so I'm hopeful that K5 will work ok, and with compaction fully enabled for jumbo frames (compaction was disabled for jumbo frames in RHEL6.0; my patch enabled it for both kswapd and jumbo frames, but it was only synchronous migration in all 6.1 kernels so far!). The changes in K1 make kswapd use only async migration during kswapd-compaction, but migration still becomes synchronous at times in direct compaction (not in the first pass, but in subsequent passes: if there's memory pressure, the sync flag goes to 1). That is what I tried to address in K5 by making everything asynchronous and never waiting for I/O in compaction, regardless of whether it's kswapd or direct compaction.

K2 will hopefully also work as well as 6.0, because there the synchronous direct compaction is disabled (only kswapd compaction is enabled, but it uses async migration only; plus, even if kswapd stalls, it shouldn't stall the userland programs that measure the latency). K4 will likely perform the same as K1, because synchronous direct compaction is still enabled (disabling kswapd compaction shouldn't help). K3 should work identically to 6.0 (compaction disabled for both kswapd and direct reclaim). K5 should hopefully also work as well as 6.0 by using async compaction/migration for both direct compaction and kswapd, in theory allowing for higher performance thanks to a higher ratio of jumbo frames used.

This is K5 resubmitted; I'm submitting them in reverse order now. I hope it builds and that it was a buildsystem error; it's the same srpm as before.
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3102828

This is a resubmit of K4: https://brewweb.devel.redhat.com/taskinfo?taskID=3105022

This is a resubmit of K3: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3105045

This is a resubmit of K2: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3105053

After we sort this out for RHEL6.1, it'd be very useful to also test this on the upstream kernels, which are even more aggressive in trying to free enough memory for compaction to be able to complete, inside shrink_zone. But despite the more aggressive shrink_zone, in practice I expect current upstream to behave like the K1 that you already tested.

Created attachment 478298 [details]
profiling on BIG with -115-K5 kernel using ext4.
K5 testing completed ... ran with profiling and got the following numbers. I don't think this is what you expected. Not sure which kernel to try next.
Req Actual Resp
Rate Rate Time msec/op
----------------------------
2000 1981 0.5
4000 4030 0.6
6000 6066 0.7
8000 8068 1.3
10000 10097 1.9
12000 12188 3.6
14000 14277 3.8
16000 16285 4.9
18000 18333 5.6
If possible, feel free to uniquely name each kernel for easier comparison.
Barry
So maybe kswapd is the problem, I don't know. Can you try K3 next? That should work exactly like 6.0. And then K4, which only disables compaction in kswapd. Thanks!

Created attachment 478516 [details]
profiling on BIG with -115-K2 kernel using ext4.
K2 testing completed ... ran with profiling and got the following numbers. Response time got worse more slowly, but eventually was just as bad. I don't think this is what you expected. Moving to K3.
Req Actual Resp
Rate Rate Time msec/op
----------------------------
2000 1981 0.5
4000 4030 0.6
6000 6066 0.7
8000 8068 1.2
10000 10097 2.0
12000 12188 2.7
14000 14277 3.3
16000 16285 4.9
18000 18333 5.1
Created attachment 478542 [details]
profiling on BIG with -115-K3 kernel using ext4.
K3 testing completed ... ran with profiling and got the following numbers. Clearly the best so far but not quite as good as the original results.
Req Actual Resp
Rate Rate Time msec/op
----------------------------
2000 1981 0.5
4000 4036 0.6
6000 6066 0.7
8000 8097 0.8
10000 10053 0.8
12000 12143 0.9
14000 14187 1.0
16000 16180 1.1
18000 18187 1.3
The profiling shows vmlinux:tg_shares_up is busiest, relatively speaking, on the lightest of workloads, and drops in % as the workload increases.
I'll set up for a K4 run in the morning. I'm fading too fast here :)
Thanks,
Barry
-84 was this:

Req Actual Resp
Rate Rate Time msec/op
----------------------------
2000 1983 0.5
4000 4037 0.6
6000 6065 0.6
8000 8097 0.7
10000 10054 0.7
12000 12134 0.8
14000 14132 0.9
16000 16214 0.9
18000 18202 1.1

K3 is:

Req Actual Resp
Rate Rate Time msec/op
----------------------------
2000 1981 0.5
4000 4036 0.6
6000 6066 0.7
8000 8097 0.8
10000 10053 0.8
12000 12143 0.9
14000 14187 1.0
16000 16180 1.1
18000 18187 1.3

Are you sure the 0.2 difference isn't in the measurement error range? K3 was supposed to perform identically to the original as far as compaction is concerned (I don't think your workload is invoking compaction in any significant way in K3 or -84). If this is not measurement error (I didn't see two consecutive runs, so I can't guess what the error could be), it might be related to some other update. I doubt it's related to the backport of async compaction, which doesn't seem to have any significant impact on your workload, but it still looks like a good idea to have (even if it's not high-priority material).

The bug we have to fix is to reduce the cost of compaction when __GFP_REPEAT is set. Either that, or to shut down compaction for small allocations that may be extremely short-lived. We're doing a lot of CPU work to create a big frame, but then what we gain from the big frame is very little compared to the work we did to allocate a big frame instead of a small one.

I just verified that the runs from comment #1 were unprofiled. My profiled comparison of -84 only went up to 12000 Ops, since performance was terrible by then. The profiled results for -84 to that point were:

Req Actual Resp
Rate Rate Time msec/op
----------------------------
2000 1981 0.5
4000 4034 0.6
6000 6086 0.6
8000 8095 0.7
10000 10085 0.8
12000 12104 0.8

I'm rerunning K3 again without profiling and will have results this eve.
Of note, these are the results for RHEL 6.0 -71:

Req Actual Resp
Rate Rate Time msec/op
----------------------------
2000 1983 0.5
4000 4031 0.6
6000 6081 0.6
8000 8096 0.7
10000 10087 0.7
12000 12135 0.8
14000 14177 0.9
16000 16173 1.0
18000 18232 1.1
20000 20270 1.4
22000 22360 1.8
24000 24343 2.7
26000 26409 4.5
28000 28444 6.7

Barry

The full non-profiled run of K3 is:

Req Actual Resp
Rate Rate Time msec/op
----------------------------
2000 1983 0.5
4000 4034 0.6
6000 6080 0.6
8000 8097 0.7
10000 10087 0.8
12000 12104 0.9
14000 14131 1.0
16000 16174 1.1
18000 18247 1.2
20000 20238 1.5
22000 22280 1.9
24000 24407 4.0
26000 26435 4.3
28000 28412 4.6
30000 30484 6.1

In some ways we are better than -71; in some ways worse. The good: we achieve a higher peak Ops/sec rate. This is great, since we had regressed compared to RHEL5, which still does 10% better. The not so good: response time between 10000-24000 is higher. Operations like getattr, lookup, write, and commit are higher.

Barry

K8 performs nearly identically to K3. These were tested up to 20K Ops. It's still off from the -84 results. At 20K, K3/K8 is at 1.5 response time, where -84 is at 1.2. There may be other factors, though. We are certainly far better than where we were at -85.

Barry

K9 (compaction-no-kswapd + compaction_allocate_lowlat): http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3137227
K10 (only compaction_allocate_lowlat): http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3137217
K11 (compaction-kswapd + compaction_allocate_lowlat): http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3137209

I had to respin a new K11 version because, while backporting from upstream to rhel6, one s/order/last_order/ went missing, and it would make the compaction logic in kswapd almost like it was before. Good thing I noticed it before you started testing K11!
This also replaces my kswapd-high-wmark fix with the equivalent one that went into -mm, which adds a min(1% of RAM, lowwmark) "balance" (my balance was 0, and I still think it was better as 0, but this will be easier to get acked for inclusion considering it's in -mm, and it satisfies me enough). This change can't possibly affect performance.

I also included the latest pgd_lock deadlock fix that I posted for review, so that gets tested too (in the previous K11 I had the one I posted to rhkernel-list through the proper bugzilla, which is a bit more risky). I removed the old fix, which is still a good cleanup but only relevant for upstream if this new fix is approved by Xen. These changes can't affect performance (they only affect vmalloc_sync_all, which isn't called in any fast path).

So this is K12, and it replaces K11. (K9 and K10 were all good; no need to respin a build for those two.) http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3140763

I think you can go ahead testing K12 if K9 completed.

Created attachment 481120 [details]
profiling on BIG with -118-K12 kernel using ext4.
The K9 kernel was the best-running kernel (which appears to have been expected by Andrea). Response time was better than the -84 kernel, even better than RHEL6 GA -71. Peak ops was up 2K (though still 10% off from RHEL5 -- that's another bz).
Req Actual Resp
Rate Rate Time msec/op
----------------------------
2000 1982 0.5
4000 4031 0.6
6000 6080 0.6
8000 8097 0.7
10000 10053 0.7
12000 12123 0.8
14000 14178 0.9
16000 16224 0.9
18000 18243 1.1
20000 20294 1.2
22000 22304 1.5
24000 24406 1.9
26000 26438 3.6
28000 28463 4.3
30000 30438 5.5
32000 29261 8.5
34000 26555 9.4
36000 24199 10.3
Ran the K12 kernel, and while peak ops was still better than RHEL6 GA, response time was worse starting at around 10K. I profiled the K12 kernel and am attaching the results. Except for one point which didn't profile (not sure why), it's clear the compaction code is way too busy starting around 8K ops.
Profiled results
Req Actual Resp
Rate Rate Time msec/op
----------------------------
2000 1982 0.5
4000 4030 0.6
6000 6081 0.6
8000 8069 0.7
10000 10054 0.8 < start of large compaction
12000 12106 0.9
14000 14139 1.0
16000 16207 1.1
18000 18194 1.4
20000 20345 2.1
22000 22397 3.8
24000 24546 4.0
26000 26654 4.4
Barry
Ok. I submitted a new K9 with the updated pgd_lock deadlock fix and updated kswapd-high-wmark patches. This should run the same as K9. I'll submit the patches of this kernel (K13) to rhkernel-list next; I'm in an airport right now. No more testing needed after this for RHEL6.1, I think. Thanks a lot! http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3143355

The K13 kernel is as good as the K9 kernel. So we have comparable response time, and 30K ops where GA didn't get there.

Barry

All patches in K13 submitted to rhkernel-list as a series of 22 patches, with patch 0/22 carrying Message-Id: <20110301000318.429428052>. The other two patches present in K13 were submitted separately through
https://bugzilla.redhat.com/show_bug.cgi?id=670063#c3
https://bugzilla.redhat.com/show_bug.cgi?id=671477#c10

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

*** Bug 616600 has been marked as a duplicate of this bug. ***

I have verified that ext2/ext2/xfs all perform much better with K13. In fact, they exceed RHEL6.0 GA -71. gfs2 has other issues which still need to be addressed. But that's another bz ;)

Barry

Patch(es) available on kernel-2.6.32-124.el6

*** Bug 678213 has been marked as a duplicate of this bug. ***

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0542.html
Created attachment 476254 [details] oprofile analysis of the -84 and -85 kernels

Description of problem:
Running the classic SPECsfs on my BIGI testbed has shown a massive response time and peak performance regression with the -99 kernel. Triage showed the problem occurring all the way back to the -85 kernel. Speculation was that it was this patch:

[mm] use compaction for GFP_ATOMIC order > 0 (Andrea Arcangeli) [622327 642570]

Essentially this code gets very busy as the workload increases and causes dramatic response time increases. Additionally, peak SPECsfs NFS Op/sec performance is degraded against all served file systems.

Version-Release number of selected component (if applicable):
2.6.32-85 and later

How reproducible:
every time

Steps to Reproduce:
1. run the SPECsfs workload on the BIGI testbed in RDU (bmarson)

Actual results:
Req Actual Resp
Rate Rate Time msec/op
----------------------------
2000 1983 0.5
4000 4030 0.6
6000 6081 0.7
8000 8099 1.0
10000 10096 1.4
12000 12172 2.7
14000 14178 4.0
16000 16088 5.0
18000 18353 5.4
20000 20202 6.3
22000 22272 6.5
24000 24415 4.6
26000 26439 4.4
28000 28409 4.8

Expected results:
Req Actual Resp
Rate Rate Time msec/op
----------------------------
2000 1983 0.5
4000 4037 0.6
6000 6065 0.6
8000 8097 0.7
10000 10054 0.7
12000 12134 0.8
14000 14132 0.9
16000 16214 0.9
18000 18202 1.1
20000 20222 1.2
22000 22357 1.5
24000 24412 2.0
26000 26377 3.9
28000 28458 4.9

Additional info:
The clients are networked with e1000 NICs that have jumbo frames enabled (MTU=9000). Testing with standard frames showed no improvement. Attaching notes regarding profiling of each of the -84 and -85 kernels.