Bug 674147

Summary: SPECsfs NFS v3 workload on RHEL6 running kernels 2.6.32-85 and later has a massive performance regression due to compaction/kswapd behavior
Product: Red Hat Enterprise Linux 6
Reporter: Barry Marson <bmarson>
Component: kernel
Kernel sub component: NFS
Assignee: Andrea Arcangeli <aarcange>
QA Contact: Filesystem QE <fs-qe>
Status: CLOSED ERRATA
Severity: high
Priority: high
CC: aarcange, dshaks, eteo, jmalanik, kzhang, lwoodman, riel, xzhou
Version: 6.1
Keywords: Regression
Target Milestone: rc
Target Release: 6.1
Hardware: x86_64
OS: Linux
Fixed In Version: kernel-2.6.32-124.el6
Doc Type: Bug Fix
Last Closed: 2011-05-19 12:53:39 UTC
Bug Blocks: 678209    
Attachments:
oprofile analysis of the -84 and -85 kernels
profiling on BIG with -115-K1 kernel using ext4
profiling on BIG with -115-K5 kernel using ext4
profiling on BIG with -115-K2 kernel using ext4
profiling on BIG with -115-K3 kernel using ext4
profiling on BIG with -118-K12 kernel using ext4

Description Barry Marson 2011-01-31 19:03:40 UTC
Created attachment 476254 [details]
oprofile analysis of the -84 and -85 kernels

Description of problem: Running the classic SPECsfs benchmark on my BIGI testbed has shown a massive response-time and peak-performance regression with the -99 kernel.  Triage showed the problem occurring all the way back to the -85 kernel.  Speculation was that the culprit was this patch:

[mm] use compaction for GFP_ATOMIC order > 0 (Andrea Arcangeli) [622327 642570]

Essentially this code gets very busy as the workload increases, causing dramatic response-time increases.  Additionally, peak SPECsfs NFS op/sec performance is degraded on all served file systems.
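For context on why this workload exercises that path: an MTU=9000 jumbo frame does not fit in a 4 KiB (or even 8 KiB) buffer, so the driver's GFP_ATOMIC receive allocation becomes an order-2 request, which with the patch above can now trigger compaction. A minimal illustrative sketch (not the actual e1000 code; __netdev_alloc_skb is a real kernel API, the wrapper function is hypothetical):

/* Illustrative sketch only -- not actual e1000 driver code. */
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static struct sk_buff *alloc_jumbo_rx_skb(struct net_device *dev)
{
    /*
     * 9000 bytes of frame data plus skb overhead exceeds 8192,
     * so on 4 KiB-page x86_64 this needs an order-2 (16 KiB)
     * physically contiguous buffer.
     */
    struct sk_buff *skb = __netdev_alloc_skb(dev, 9000, GFP_ATOMIC);

    /*
     * With the [mm] patch above, a failing order > 0 GFP_ATOMIC
     * request can now kick memory compaction, which is where the
     * extra CPU time and latency show up.
     */
    return skb;
}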



Version-Release number of selected component (if applicable):
2.6.32-85 and later

How reproducible:
Every time.

Steps to Reproduce:
1. run the SPECsfs workload on the BIGI testbed in RDU (bmarson)
  
Actual results:

Req     Actual  Resp
Rate    Rate    Time msec/op
----------------------------
2000    1983    0.5
4000    4030    0.6
6000    6081    0.7
8000    8099    1.0
10000   10096   1.4
12000   12172   2.7
14000   14178   4.0
16000   16088   5.0
18000   18353   5.4
20000   20202   6.3
22000   22272   6.5
24000   24415   4.6
26000   26439   4.4
28000   28409   4.8


Expected results:

Req     Actual  Resp
Rate    Rate    Time msec/op
----------------------------
2000    1983    0.5
4000    4037    0.6
6000    6065    0.6
8000    8097    0.7
10000   10054   0.7
12000   12134   0.8
14000   14132   0.9
16000   16214   0.9
18000   18202   1.1
20000   20222   1.2
22000   22357   1.5
24000   24412   2.0
26000   26377   3.9
28000   28458   4.9


Additional info:

The clients are networked with e1000 NICs that have jumbo frames enabled (MTU=9000).  Testing with standard frames showed no improvement.

Attaching notes regarding the profiling of each of the -84 and -85 kernels.

Comment 1 Andrea Arcangeli 2011-02-01 00:18:46 UTC
Ok, reverting it is no big deal. One reason I pushed it is that this is also what is in 2.6.38. The compaction code in 2.6.38 supports an async, optimistic mode that may avoid the problem. But if something's wrong in the current compaction code, there's a chance it's also wrong in 2.6.38, and I'd like compaction not to lead to this worse behavior.

So there are various possibilities:

1) kswapd is the problem
2) the congestion_wait is the problem
3) compaction is the problem: it's too aggressive, increasing latency and looping too long before returning
4) migration taking lock_page without async-migrate is the problem (it misses the upstream logic)



I'll create 4 kernels to test (the 4th would be the closest to upstream 2.6.38) to find out what the culprit is.

If we have any doubt, we can always decide to back out the patch and have peace of mind and no hurry, but I will need to find out exactly what the culprit is, to be sure it doesn't happen again in future kernels.
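To make the four suspects concrete, here is a heavily simplified sketch of where each one sits in the order > 0 allocation slow path. wake_all_kswapd and congestion_wait are real kernel symbols, but the signatures and control flow here are simplified for illustration and the *_sketch helpers are hypothetical:

/* Heavily simplified sketch of the allocation slow path;
 * real signatures and control flow differ. */
static struct page *alloc_pages_slowpath_sketch(gfp_t gfp, int order)
{
    struct page *page;

    /* Suspect 1: kswapd, which with the patch also runs
     * compaction for order > 0 wakeups. */
    wake_all_kswapd(order);

    /* Suspect 3: direct compaction, which may loop too long
     * before returning. Suspect 4: inside migration, pages are
     * taken with a synchronous lock_page() because the upstream
     * async-migrate logic was not backported. */
    page = try_to_compact_pages_sketch(gfp, order);
    if (page)
        return page;

    page = direct_reclaim_sketch(gfp, order);
    if (!page)
        /* Suspect 2: the allocator stalls here waiting for
         * writeback congestion to clear. */
        congestion_wait(BLK_RW_ASYNC, HZ / 50);

    return page;
}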

Comment 2 Barry Marson 2011-02-01 15:49:52 UTC
Just as a sanity checkpoint, lwoodman built me a -109 kernel without your patch and there was no regression.

Just let me know when they are ready

Barry

Comment 3 Andrea Arcangeli 2011-02-08 17:13:19 UTC
This is the first kernel to test (let's call it "K1"): -114 plus the compaction async updates matching upstream and a few other pending fixes. This should be closest to what I intend to submit for merging in 6.1 (no new features, just fixes from upstream). NOTE: it's not supposed to fix the problem, though it may mitigate it; I'd like to be sure the problem also happens with async compaction and with the compaction code closer to current upstream:

https://brewweb.devel.redhat.com/taskinfo?taskID=3095996

If you can test async-aio as well that would help, as I included the pgd_lock deadlock fix for bug #671477.

Comment 4 Barry Marson 2011-02-08 17:37:00 UTC
Looks like the build in comment 3 failed

Barry

Comment 7 Andrea Arcangeli 2011-02-09 00:14:45 UTC
This is the kernel for comment #3.

https://brewweb.devel.redhat.com/taskinfo?taskID=3096320

Comment 8 Andrea Arcangeli 2011-02-10 14:35:13 UTC
This is a kernel we can call K2; it is like K1 in comment #7 (described in comment #3), but with compaction disabled for jumbo frames in direct reclaim.

https://brewweb.devel.redhat.com/taskinfo?taskID=3101916

Comment 9 Andrea Arcangeli 2011-02-10 14:39:40 UTC
This one we can call K3; it is like K2 in comment #8, but with compaction disabled for jumbo frames in kswapd too (I hope no harmless gcc warnings cause the build to fail; I hand-edited things, as this is a bug-localizing effort):

https://brewweb.devel.redhat.com/taskinfo?taskID=3101941

Comment 10 Andrea Arcangeli 2011-02-10 14:43:00 UTC
This is K4, which is like K1 but with compaction disabled for jumbo frames only in kswapd (if K3 builds, this should build too; otherwise I'll resubmit both):

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3101963

Comment 11 Barry Marson 2011-02-10 15:54:14 UTC
I have begun testing K1.  I should have results in the next 3 hours.

Barry

Comment 12 Andrea Arcangeli 2011-02-10 16:45:52 UTC
This is K5; it's like K1 but disables sync migration in direct compaction, so it should provide better latency than K1. I'd like to know if that is measurable.

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3102168

The one that should work for sure, as well as RHEL6.0, is K3. K1 and K5 will hopefully already work better than the current 6.1 kernels thanks to the async_migrate backport from upstream. We need to find out whether K1 (or K5) is still slower than the original 6.0 performance (6.0 didn't use compaction for jumbo frames).
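The knob separating these kernels is, roughly, the sync flag that compaction passes down to page migration. A hedged sketch of the idea (lock_page/trylock_page are real kernel APIs; the function itself is illustrative, and the real migrate_pages() code is considerably more involved):

#include <linux/pagemap.h>

/* Illustrative sketch of sync vs. async page migration. */
static int migrate_one_page_sketch(struct page *page, bool sync)
{
    if (sync) {
        /* Sync migration: may sleep here waiting for page I/O
         * to complete -- the suspected latency source. */
        lock_page(page);
    } else if (!trylock_page(page)) {
        /* Async migration: the page is busy, so skip it and
         * move on, trading compaction success for latency. */
        return -EAGAIN;
    }
    /* ... unmap the page, copy it to its new location,
     * fix up the mappings ... */
    unlock_page(page);
    return 0;
}

/* K1: kswapd compaction always passes sync=false, but direct
 * compaction can retry with sync=true under pressure.
 * K5: every caller passes sync=false. */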

Comment 13 Barry Marson 2011-02-10 20:45:42 UTC
Created attachment 478116 [details]
profiling on BIG with -115-K1 kernel using ext4.

K1 testing completed ... ran with profiling and got the following numbers

Req     Actual  Resp
Rate    Rate    Time msec/op
----------------------------
 2000    1981     0.5
 4000    4033     0.6
 6000    6086     0.7
 8000    8097     1.2
10000   10097     1.9
12000   12187     3.7
14000   14229     3.8
16000   16257     5.4
18000   18349     5.3

Attached is the profiling information (>= 1% of cycles) from each point.

Comment 14 Barry Marson 2011-02-10 20:48:00 UTC
Looks like the K2 through K5 kernels didn't build.

Awaiting new kernels :(

Barry

Comment 16 Andrea Arcangeli 2011-02-10 21:27:40 UTC
Seeing the results of K1, it looks like the problem was the synchronous migration, so I'm hopeful that K5 will work OK with compaction fully enabled for jumbo frames (compaction was disabled for jumbo frames in RHEL6.0; my patch enabled it for both kswapd and jumbo frames, but migration was only synchronous in all 6.1 kernels so far!). The changes in K1 make kswapd use only async migration during kswapd-compaction, but migration still becomes synchronous at times in direct compaction (not in the first pass, but in subsequent passes under memory pressure the sync flag goes to 1). That is what I tried to address in K5, by making everything asynchronous and never waiting for I/O in compaction, regardless of whether it's kswapd or direct compaction.

K2 will hopefully also work as well as 6.0, because synchronous direct compaction is disabled there (only kswapd compaction is enabled, and it uses async migration only; besides, even if kswapd stalls, it shouldn't stall the userland programs that measure the latency).

K4 will likely perform the same as K1 because synchronous direct compaction is still enabled (disabling kswapd compaction shouldn't help).

K3 should work identically to 6.0 (compaction disabled for both kswapd and direct reclaim).

K5 should hopefully also work as well as 6.0, by using async compaction/migration for both direct compaction and kswapd, in theory allowing higher performance thanks to a higher ratio of jumbo frames used...
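Summarizing the five test kernels described in comments 3, 8, 9, 10 and 12, for jumbo-frame allocations:

Kernel  Direct compaction          kswapd compaction
------  -------------------------  ------------------------
K1      on, can turn synchronous   on, async migration only
K2      off                        on, async migration only
K3      off                        off (matches 6.0)
K4      on, can turn synchronous   off
K5      on, async migration only   on, async migration only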

Comment 17 Andrea Arcangeli 2011-02-10 22:52:02 UTC
This is K5 resubmitted; I'm submitting them in reverse order now. I hope it builds and that the earlier failure was a buildsystem error; it's the same srpm as before.

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3102828

Comment 19 Andrea Arcangeli 2011-02-11 10:57:33 UTC
This is a resubmit of K4:

https://brewweb.devel.redhat.com/taskinfo?taskID=3105022

Comment 20 Andrea Arcangeli 2011-02-11 11:06:12 UTC
This is a resubmit of K3:

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3105045

Comment 21 Andrea Arcangeli 2011-02-11 11:09:01 UTC
This is a resubmit of K2:

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3105053

Comment 22 Andrea Arcangeli 2011-02-11 11:12:35 UTC
After we sort this out for RHEL6.1, it'd be very useful to also test this on the upstream kernels, which are even more aggressive in trying to free enough memory inside shrink_zone for compaction to be able to complete. But despite the more aggressive shrink_zone, in practice I expect current upstream to behave like the K1 you already tested.

Comment 23 Barry Marson 2011-02-11 19:43:27 UTC
Created attachment 478298 [details]
profiling on BIG with -115-K5 kernel using ext4.

K5 testing completed ... ran with profiling and got the following numbers.  I don't think this is what you expected.  Not sure which kernel to try next.

Req     Actual  Resp
Rate    Rate    Time msec/op
----------------------------
 2000    1981     0.5
 4000    4030     0.6
 6000    6066     0.7
 8000    8068     1.3
10000   10097     1.9
12000   12188     3.6
14000   14277     3.8
16000   16285     4.9
18000   18333     5.6

If possible, feel free to uniquely name each kernel for easier comparison.

Barry

Comment 24 Andrea Arcangeli 2011-02-11 20:27:47 UTC
So maybe kswapd is the problem; I don't know. Can you try K3 next? That should work exactly like 6.0. And then K4, which only disables compaction in kswapd. Thanks!

Comment 25 Barry Marson 2011-02-13 21:03:28 UTC
Created attachment 478516 [details]
profiling on BIG with -115-K2 kernel using ext4.

K2 testing completed ... ran with profiling and got the following numbers.  Response time got worse more slowly, but eventually it was just as bad.  I don't think this is what you expected.  Moving to K3.

Req     Actual  Resp
Rate    Rate    Time msec/op
----------------------------
 2000    1981     0.5
 4000    4030     0.6
 6000    6066     0.7
 8000    8068     1.2
10000   10097     2.0
12000   12188     2.7
14000   14277     3.3
16000   16285     4.9
18000   18333     5.1

Comment 26 Barry Marson 2011-02-14 04:22:55 UTC
Created attachment 478542 [details]
profiling on BIG with -115-K3 kernel using ext4.

K3 testing completed ... ran with profiling and got the following numbers.  Clearly the best so far, but not quite as good as the original results.

Req     Actual  Resp
Rate    Rate    Time msec/op
----------------------------
 2000    1981     0.5
 4000    4036     0.6
 6000    6066     0.7
 8000    8097     0.8
10000   10053     0.8
12000   12143     0.9
14000   14187     1.0
16000   16180     1.1
18000   18187     1.3

The profiling shows vmlinux:tg_shares_up is busiest, relatively speaking, on the lightest workloads, and drops in % as the workload increases.

I'll set up a K4 run in the morning.  I'm fading too fast here :)

Thanks,
Barry

Comment 27 Andrea Arcangeli 2011-02-14 17:06:59 UTC
-84 was this:

2000    1983    0.5
4000    4037    0.6
6000    6065    0.6
8000    8097    0.7
10000   10054   0.7
12000   12134   0.8
14000   14132   0.9
16000   16214   0.9
18000   18202   1.1

K3 is:

 2000    1981     0.5
 4000    4036     0.6
 6000    6066     0.7
 8000    8097     0.8
10000   10053     0.8
12000   12143     0.9
14000   14187     1.0
16000   16180     1.1
18000   18187     1.3

Are you sure the 0.2 difference isn't within the measurement error range? K3 was supposed to perform identically to the original as far as compaction is concerned (I don't think your workload is invoking compaction in any significant way in K3 or -84). If it is not measurement error (I didn't see two consecutive runs, so I can't guess what the error could be), it might be related to some other update. I doubt it's related to the backport of async compaction, which doesn't seem to have any significant impact on your workload, but it still looks like a good idea to have (even if it's not high-priority material).

The bug we have to fix is to reduce the cost of compaction when __GFP_REPEAT is set, or else to shut down compaction for small allocations that may be extremely short-lived. We're doing a lot of CPU work to create a big frame, but what we gain from the big frame is very little compared to the extra work we did to allocate a big frame instead of a small one.
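One possible shape of that second option, purely as an illustration (the fix that eventually shipped in kernel-2.6.32-124.el6 may look quite different; __GFP_REPEAT is a real gfp flag, the helper is hypothetical):

/* Hypothetical sketch only; the shipped fix may differ. */
static bool should_compact_sketch(gfp_t gfp_mask, int order)
{
    if (order == 0)
        return false;    /* nothing for compaction to gain */

    /*
     * Short-lived high-order allocations (e.g. jumbo-frame
     * buffers) don't set __GFP_REPEAT; skipping compaction for
     * them avoids burning CPU to build a big frame whose payoff
     * is small compared to the allocation cost.
     */
    if (!(gfp_mask & __GFP_REPEAT))
        return false;

    return true;
}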

Comment 28 Barry Marson 2011-02-14 18:58:54 UTC
I just verified that the runs from comment #1 were unprofiled.  My profiled comparison of -84 only went up to 12000 Ops, since performance was terrible by then.  The profiled results for -84 to that point were:

Req     Actual  Resp
Rate    Rate    Time msec/op
----------------------------
 2000    1981     0.5  
 4000    4034     0.6  
 6000    6086     0.6  
 8000    8095     0.7  
10000   10085     0.8  
12000   12104     0.8  

I'm rerunning K3 without profiling and will have results this evening.

Of note, these are the results for RHEL 6.0 (-71):

Req     Actual  Resp
Rate    Rate    Time msec/op
----------------------------
 2000    1983     0.5  
 4000    4031     0.6  
 6000    6081     0.6  
 8000    8096     0.7  
10000   10087     0.7  
12000   12135     0.8  
14000   14177     0.9  
16000   16173     1.0  
18000   18232     1.1  
20000   20270     1.4  
22000   22360     1.8  
24000   24343     2.7  
26000   26409     4.5  
28000   28444     6.7  

Barry

Comment 29 Barry Marson 2011-02-15 18:08:28 UTC
The full non profiled run of K3 is:

Req     Actual  Resp
Rate    Rate    Time msec/op
----------------------------
 2000    1983     0.5  
 4000    4034     0.6  
 6000    6080     0.6  
 8000    8097     0.7  
10000   10087     0.8  
12000   12104     0.9  
14000   14131     1.0  
16000   16174     1.1  
18000   18247     1.2  
20000   20238     1.5  
22000   22280     1.9  
24000   24407     4.0  
26000   26435     4.3  
28000   28412     4.6  
30000   30484     6.1

In some ways we are better than -71; in some ways worse.

The Good:  We achieve a higher peak ops/sec rate.  This is great, since we had regressed compared to RHEL5, which still does 10% better.

The Not So Good:  Response time between 10000 and 24000 ops is higher.  Operations like getattr, lookup, write, and commit are higher.


Barry

Comment 33 Barry Marson 2011-02-23 15:12:52 UTC
K8 performs nearly identically to K3.  These were tested up to 20K ops.

It's still off from the -84 results: at 20K ops, K3/K8 shows a 1.5 msec/op response time where -84 showed 1.2.  There may be other factors, though.  We are certainly far better than where we were at -85.

Barry

Comment 34 Andrea Arcangeli 2011-02-24 03:49:54 UTC
K9 (compaction-no-kswapd + compaction_allocate_lowlat)

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3137227

K10 (only compaction_allocate_lowlat)

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3137217

K11 (compaction-kswapd + compaction_allocate_lowlat)

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3137209

Comment 35 Andrea Arcangeli 2011-02-25 03:06:53 UTC
I had to respin a new K11 version because, while backporting from upstream to RHEL6, one s/order/last_order/ substitution went missing, which would have made the compaction logic in kswapd almost like it was before. Good thing I noticed it before you started testing K11!

This also replaces my kswapd-high-wmark fix with the equivalent one that went into -mm, which adds a min(1% of RAM, low wmark) "balance" (my balance was 0, and I still think 0 was better, but this will be easier to get acked for inclusion considering it's in -mm, and it satisfies me enough). This change can't possibly affect performance.

I also included the latest pgd_lock deadlock fix that I posted for review, so that gets tested too (the previous K11 had the one I posted to rhkernel-list through the proper bugzilla, which is a bit riskier). I removed the old fix, which is still a good cleanup but only relevant upstream if this new fix is approved by Xen. These changes can't affect performance (they only affect vmalloc_sync_all, which isn't called in any fast path).

So this is K12, and it replaces K11. (K9 and K10 were all good; no need to respin builds for those two.)

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3140763

I think you can go ahead and test K12 once K9 completes.

Comment 36 Barry Marson 2011-02-26 05:04:12 UTC
Created attachment 481120 [details]
profiling on BIG with -118-K12 kernel using ext4.

The K9 kernel was the best-running kernel (which Andrea appears to have expected).  Response time was better than with the -84 kernel, even better than RHEL6 GA (-71).  Peak ops was up 2K (though still 10% off from RHEL5 -- that's another bz).

Req     Actual  Resp
Rate    Rate    Time msec/op
----------------------------
 2000    1982     0.5  
 4000    4031     0.6  
 6000    6080     0.6  
 8000    8097     0.7  
10000   10053     0.7  
12000   12123     0.8  
14000   14178     0.9  
16000   16224     0.9  
18000   18243     1.1  
20000   20294     1.2  
22000   22304     1.5  
24000   24406     1.9  
26000   26438     3.6  
28000   28463     4.3  
30000   30438     5.5  
32000   29261     8.5  
34000   26555     9.4  
36000   24199    10.3  


Ran the K12 kernel, and while peak ops was still better than RHEL6 GA, response time was worse starting at around 10K ops.  I profiled the K12 kernel and am attaching the results.  Except for one point which didn't profile (not sure why), it's clear the compaction code is way too busy starting around 8K ops.

Profiled results

Req     Actual  Resp
Rate    Rate    Time msec/op
----------------------------
 2000    1982     0.5  
 4000    4030     0.6
 6000    6081     0.6  
 8000    8069     0.7  
10000   10054     0.8  < start of large compaction
12000   12106     0.9  
14000   14139     1.0  
16000   16207     1.1  
18000   18194     1.4  
20000   20345     2.1  
22000   22397     3.8  
24000   24546     4.0  
26000   26654     4.4  

Barry

Comment 37 Andrea Arcangeli 2011-02-26 06:36:32 UTC
OK. I submitted a new K9 with the updated pgd_lock deadlock fix and the updated kswapd-high-wmark patches; this should run the same as K9. I'll submit the patches of this kernel (K13) to rhkernel-list next; I'm at the airport right now. No more testing needed after this for RHEL6.1, I think. Thanks a lot!

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3143355

Comment 38 Barry Marson 2011-02-27 22:44:03 UTC
The K13 kernel is as good as the K9 kernel, so we have comparable response time and reach 30K ops where GA didn't.

Barry

Comment 39 Andrea Arcangeli 2011-03-01 00:45:39 UTC
All patches in K13 were submitted to rhkernel-list as a series of 22 patches, with patch 0/22 having Message-Id: <20110301000318.429428052>

The other two patches present in K13 were submitted separately through
https://bugzilla.redhat.com/show_bug.cgi?id=670063#c3 and https://bugzilla.redhat.com/show_bug.cgi?id=671477#c10

Comment 40 RHEL Program Management 2011-03-01 22:19:54 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 41 Rik van Riel 2011-03-02 18:27:24 UTC
*** Bug 616600 has been marked as a duplicate of this bug. ***

Comment 42 Barry Marson 2011-03-04 14:00:58 UTC
I have verified that ext2/ext2/xfs all perform much better with K13.  In fact they exceed RHEL6.0 GA -71.

gfs2 has other issues which still need to be addressed.  But that's another bz ;)

Barry

Comment 52 Aristeu Rozanski 2011-03-21 12:38:49 UTC
Patch(es) available on kernel-2.6.32-124.el6

Comment 59 Eugene Teo (Security Response) 2011-04-20 01:01:22 UTC
*** Bug 678213 has been marked as a duplicate of this bug. ***

Comment 61 errata-xmlrpc 2011-05-19 12:53:39 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html