Bug 631833
Summary: | Big performance regression found on connect/request/response test through IPSEC (openswan) transport | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Adam Okuliar <aokuliar> |
Component: | kernel | Assignee: | Herbert Xu <herbert.xu> |
Status: | CLOSED ERRATA | QA Contact: | Network QE <network-qe> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 6.0 | CC: | aokuliar, jolsa, jpirko, jwest, kzhang, nhorman, rmusil, sgrubb, tgraf |
Target Milestone: | rc | Keywords: | Regression, ZStream |
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | kernel-2.6.32-130.el6 | Doc Type: | Bug Fix |
Doc Text: | The XFRM_SUB_POLICY feature causes all bundles to be at the finest granularity possible. As a result of the data structure used to implement this, the system performance would drop considerably. This update disables a part of XFRM_SUB_POLICY, eliminating the poor performance at the cost of sub-IP address selection granularity in the policy. ||
Story Points: | --- |
Clone Of: | | Environment: |
Last Closed: | 2011-05-23 20:51:37 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 710645 | ||
Attachments: |
Description
Adam Okuliar
2010-09-08 13:45:04 UTC
Created attachment 446951 [details]
ipsec barf output to print rhel6 IKE establishment time.
Created attachment 446952 [details]
ipsec barf output to print rhel55 IKE establishment time.
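The attached dumps are Openswan's standard debug bundle; a dump like these can typically be regenerated on either host with Openswan's own collection command (the output file name below is just an example):

# collect Openswan/IKE state, logs, and configuration into one debug dump
ipsec barf > /tmp/ipsec-barf.txt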
Hi all, sorry for the long delay - we have new hardware for testing and I was dealing with hardware issues. Results from the new hardware are very similar to those from the old one - we still have a big performance loss, please see http://download.englab.brq.redhat.com/perf-results//netperf/rhel6_rc4/#crr4_172-16-20-11_to_172-16-20-21

I'm running netperf at a point where the SA is already established, and no test takes longer than 3600 seconds, so there is no need to renegotiate the SA. I suppose the problem will probably be in the NETKEY part. If you need any other info please let me know. Cheers! Adam

Can I get access to the setup to find out what is going on on the NETKEY side? I am not sure right now where the actual problem is. As I reported in comment #5, I do not see this performance issue being caused by Openswan's IKE protocol. One way to ensure that the problem is in NETKEY (kernel-side IPsec) is to set up the IPsec policies manually using "ip xfrm" commands and then do the performance verification. This eliminates the need for Openswan, and if you still see the performance degradation we can say for sure that NETKEY is the issue and transfer the bug to the kernel people to have a look. So please either do this yourself and let me know, or give me access to your hardware so that I can try it myself - I have tried several times on my local VMs but that is not useful at all here.

Hi, I need a little bit of help with this. I am using the following commands:

On machine A:

ip xfrm policy update dir in src 172.16.20.21/32 dst 172.16.20.11/32 proto any action allow priority 31798 tmpl src 172.16.20.21 dst 172.16.20.11 proto esp mode transport reqid 1 level required
ip xfrm policy update dir out src 172.16.20.11/32 dst 172.16.20.21/32 proto any action allow priority 31798 tmpl src 172.16.20.11 dst 172.16.20.21 proto esp mode transport reqid 1 level required
ip xfrm state add src 172.16.20.11 dst 172.16.20.21 proto esp spi 0x00000301 mode transport auth md5 0x96358c90783bbfa3d7b196ceabe0536b enc des3_ede 0xf6ddb555acfd9d77b03ea3843f2653255afe8eb5573965df
ip xfrm state add src 172.16.20.21 dst 172.16.20.11 proto esp spi 0x00000302 mode transport auth md5 0x96358c90783bbfa3d7b196ceabe0536b enc des3_ede 0xf6ddb555acfd9d77b03ea3843f2653255afe8eb5573965df

On machine B:

ip xfrm policy update dir in src 172.16.20.11/32 dst 172.16.20.21/32 proto any action allow priority 31798 tmpl src 172.16.20.11 dst 172.16.20.21 proto esp mode transport reqid 1 level required
ip xfrm policy update dir out src 172.16.20.21/32 dst 172.16.20.11/32 proto any action allow priority 31798 tmpl src 172.16.20.21 dst 172.16.20.11 proto esp mode transport reqid 1 level required
ip xfrm state add src 172.16.20.11 dst 172.16.20.21 proto esp spi 0x00000301 mode transport auth md5 0x96358c90783bbfa3d7b196ceabe0536b enc des3_ede 0xf6ddb555acfd9d77b03ea3843f2653255afe8eb5573965df
ip xfrm state add src 172.16.20.21 dst 172.16.20.11 proto esp spi 0x00000302 mode transport auth md5 0x96358c90783bbfa3d7b196ceabe0536b enc des3_ede 0xf6ddb555acfd9d77b03ea3843f2653255afe8eb5573965df

When trying to ping from A to B I always get:

ping: sendmsg: No such process

Do you have any idea what I am doing wrong? Thanks a lot, Adam

A few things you can try:
1. Check your firewall settings.
2. You have set the same keys (for both md5 and des3) for both directions (in and out); try using different keys.
3. Also check that the output of "ip xfrm policy" and "ip xfrm state" is the same as when you use IKE (Openswan), except for the keys, to make sure that your commands are fine (see the example below).
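For what it's worth, the check in point 3 is just a matter of dumping the SPD and SAD on both machines and comparing them against what Openswan installs; a minimal sketch (nothing below is specific to this setup, run as root):

# dump the installed policies and SAs for comparison with the IKE-negotiated ones
ip xfrm policy
ip xfrm state
# per-SA statistics (byte/packet counters) show whether traffic actually hits the SAs
ip -s xfrm state
# start from a clean slate before re-adding the policies/SAs by hand
ip xfrm state flush
ip xfrm policy flush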
Hi, I can give you access to those machines. Could you please investigate this? Thanks, Adam

Yes please, I can try.

Hi, any updates? I need those systems now, but if you need them for experimentation feel free to mail me or ping me on IRC and I can give you access right away. Thanks a lot, Adam

Not really. Yes, you can take them. Thanks.

Created attachment 474357 [details]
tcpdump of one test run
Hi, I posted a tcpdump of one run of the benchmark. There is only one ISAKMP negotiation, so I assume the problem is in the kernel IPsec implementation. Cheers, Adam

I just had a look and this test is good. You are right: because there is just one ISAKMP negotiation, Openswan is only creating the connection and doing nothing else on top of that, and this overhead is negligible. So this must be related to kernel IPsec or somewhere else. Because of this, I am now changing its component so that the kernel people can have a look at it.

Triage assignment. If you feel this bug doesn't belong to you, or that it cannot be handled in a timely fashion, please contact me for re-assignment.

Adam, do you have a comparative tcpdump from the RHEL5 system? The fastest way to see what's slowing the RHEL6 system down is to compare the same test on RHEL5 against the test on RHEL6. Looking at the RHEL6 tcpdump, I agree it looks like we only have one ISAKMP negotiation phase, but it appears to take at least ~10 seconds to complete. If this tcpdump records the entire test, then out of a 32-second test we're spending almost half the test time waiting to start transmitting ESP frames (I don't see the first ESP frame until 17 seconds into the trace).

Hi, this first negotiation actually happened before the test. It takes ~10 seconds because I ran 'service ipsec restart' manually on both machines after starting the capture, and there are a few seconds before the test included in this dump. I created another dump with the ISAKMP security association already established; in that dump there are no ISAKMP messages, only encrypted ESP test packets. Sorry for the confusion. I can also provide a comparative RHEL5 dump if you want.

Created attachment 475149 [details]
tcpdump of one test run with isakmp session established
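For reference, a capture like the attached ones - ESP traffic only, with any IKE exchange showing up separately - can be taken with something along these lines (the interface name and output file are placeholders, not taken from this setup):

# capture ESP plus IKE (ISAKMP, UDP port 500) during one netperf run;
# with a stable SA, only ESP packets should appear in the trace
tcpdump -i eth0 -w crr-run.pcap 'esp or udp port 500'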
Yes, please attach the RHEL5 trace for comparison, I'd appreciate it. Created attachment 475215 [details]
tcpdump of one test run with isakmp session established on rhel 5.5
Well, the tcpdumps make it pretty clear, I think. The packet rate on the RHEL6 box is really only about 30% of that of the RHEL5.5 box. That's bad. Of course I can't tell what the root cause is from that alone (not sure if the crypto layer is taking longer to encrypt packets, if the TCP layer is slower, or if the application just isn't running as often and is generating less data). To determine that, I'll really just need to poke about on these systems. Adam, is it possible for you to loan these systems to me so that I can dig into where our bottleneck is?

Will do, I'll start poking at this asap.

Interesting note to self: I ran the test in comment 29 iteratively a few times and came up with this:

[root@hp-dl380g7-01 ~]# netperf -H 172.16.20.21 -L 172.16.20.11 -t TCP_CRR
TCP Connect/Request/Response TEST from 172.16.20.11 (172.16.20.11) port 0 AF_INET to 172.16.20.21 (172.16.20.21) port 0 AF_INET : histogram : interval
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.00    1155.90
16384  87380

[root@hp-dl380g7-01 ~]# netperf -H 172.16.20.21 -L 172.16.20.11 -t TCP_CRR
TCP Connect/Request/Response TEST from 172.16.20.11 (172.16.20.11) port 0 AF_INET to 172.16.20.21 (172.16.20.21) port 0 AF_INET : histogram : interval
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.00    493.00
16384  87380

[root@hp-dl380g7-01 ~]# netperf -H 172.16.20.21 -L 172.16.20.11 -t TCP_CRR
TCP Connect/Request/Response TEST from 172.16.20.11 (172.16.20.11) port 0 AF_INET to 172.16.20.21 (172.16.20.21) port 0 AF_INET : histogram : interval
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.00    357.50
16384  87380

[root@hp-dl380g7-01 ~]# netperf -H 172.16.20.21 -L 172.16.20.11 -t TCP_CRR
TCP Connect/Request/Response TEST from 172.16.20.11 (172.16.20.11) port 0 AF_INET to 172.16.20.21 (172.16.20.21) port 0 AF_INET : histogram : interval
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.00    292.40
16384  87380

It seems that as I run the test, performance is collapsing rapidly. I have no idea what would cause behavior like that. Looking further.

perf data indicates the following:

46.81%  netperf  [kernel.kallsyms]  [k] xfrm_bundle_ok
38.94%  netperf  [kernel.kallsyms]  [k] __xfrm4_find_bundle

We're spending about 80% of our time in xfrm_bundle_ok and __xfrm4_find_bundle. I'd be willing to bet that that number (if I had previous perf samples) gets larger as time goes on. I expect we're looping over some list that has stale entries hanging around on it forever, or some such. I'll start digging in tomorrow and see if I can't find the root cause. Adam, feel free to reclaim the machines for testing purposes. I'll update the bz when I have a patch/theory that I need them to explore again.
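For anyone wanting to reproduce the profile above, the recipe is just repeated TCP_CRR runs with a system-wide perf sample underneath; a rough sketch using the same addresses as in comment 29 (the exact perf options are my choice, not necessarily what was used here):

# sample the whole system while one CRR run executes, then inspect the hot symbols
perf record -a -g -- netperf -H 172.16.20.21 -L 172.16.20.11 -t TCP_CRR
perf report --sort symbol
# run the test several times back to back; the run-over-run drop in transaction
# rate is the collapsing behaviour described above
for i in 1 2 3 4; do netperf -H 172.16.20.21 -L 172.16.20.11 -t TCP_CRR; done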
Note to self: based on the above testing, I think we need this commit: 80c802f3073e84c956846e921e8a0b02dfa3755f. I'm constructing a kernel with this patch backported now.

Ugh, looks like that helped but not enough. Given that xfrm only has 60 changelog entries and only a few minor ABI breakers, I'm inclined to just sync us with upstream.

Neil, I'm confused here. Did we test with the latest upstream kernel to conclude that it doesn't suffer from this regression? Also, any chance I can get access to these machines for a couple of days? Thanks!

We did not. Based on my testing in comments 31, 32 and 33, we had an upstream patch that fit the bill for what this bug looked like, so I figured it was faster to backport the patch than to grab the full upstream kernel, but the patch turned out to require more backporting than I initially thought. As for the machines, I can't give you access to them. Comment 29 has the system details, but they are owned/managed by Adam and shared between several people, so you'll need to ask him for time on them.

Thanks Neil! Adam, please let me know when it's convenient for me to test on those machines.

Hi Herbert, there are some tests scheduled on these systems today and over the weekend, but I can give you access next week. Thanks a lot, Adam

Hi Herbert, the machines are now ready - sorry for the delay, we were dealing with another urgent issue. Please use these systems:

hp-dl380g7-01.lab.eng.brq.redhat.com
hp-dl385g7-01.lab.eng.brq.redhat.com

The systems are provisioned with the latest 6.1, which also suffers from this regression. To set up the configuration, please run /root/prepare_sys.py. To test the actual performance, please use

netperf -L 172.16.20.11 -H 172.16.20.21 -t TCP_CRR

on hp-dl380g7-01.lab.eng.brq.redhat.com. Please notify me before you are going to use these systems - I just want to know when you are logged in, only because I don't want to interrupt your work with my experiments. Thanks, Adam

Hi Adam: Can you use those systems now?

Hi Herbert, in a few minutes they can be prepared for you. Please tell me your IRC login and I'll ping you when they are ready. Adam

It's "herbert". Thanks, Adam.

Created attachment 485560 [details]
Fix runaway bundles
OK, the problem is that we're creating a new bundle every time and the bundle list gets longer and longer.
Please apply this patch and try again.
Thanks!
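One way to see the runaway bundles directly is to watch the xfrm bundle slab while the CRR test loops; a sketch, assuming the xfrm_dst_cache slab name used by the xfrm code and a readable /proc/slabinfo (run as root):

# with the bug, the active-object count for the bundle slab keeps climbing
# for as long as the test runs; with the fix it should stay roughly flat
watch -n1 'grep xfrm_dst_cache /proc/slabinfo'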
Hi Herbert, I applied your patch on 2.6.32-112.el6 (https://brewweb.devel.redhat.com/buildinfo?buildID=159163). Unfortunately there is no improvement at all; CRR performance is the same as with the original kernel. I can provide you with the patched .srpm and the built kernel RPM, as well as access to our systems, for further investigation. Thanks, Adam

Thanks Adam. What I will try to do now is reproduce the problem on my machines, as what I've seen on yours has eliminated the hardware as the cause. I'll let you know how I go.

OK, I've tracked it down to CONFIG_XFRM_SUB_POLICY. Apparently we've completely broken it upstream during the policy/bundle cache rework, while RHEL6 still retains the original semantics, which are also stuffed. So my recommendation for RHEL6 is to disable this option. I just tried it here and it solved the problem.

Created attachment 487257 [details]
Disable granular bundles
Adam, please try this patch (without disabling XFRM_SUB_POLICY) and let me know if it fixes the problem.
Thanks!
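For the record, whether a given kernel build has the sub-policy code compiled in at all can be checked from the installed config (path assumes the usual /boot/config-<version> layout on RHEL):

# CONFIG_XFRM_SUB_POLICY=y means the granular-bundle code is built in;
# the patch above keeps the option enabled but disables the per-bundle granularity
grep XFRM_SUB_POLICY /boot/config-$(uname -r)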
Hi Herbert, this helped a lot. With this patch we get ~91% of the RHEL5.5 performance. There is still a ~9% regression, but I believe that is because of Bug 652311, which is a plain-text CRR regression via bnx2. I believe your patch resolved this IPsec problem. Thanks a lot, Adam

Patch(es) available on kernel-2.6.32-130.el6

Reproduced on: 2.6.32-125.el6.x86_64
Verified on: 2.6.32-130.el6.x86_64
Changing to verified status.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
The XFRM_SUB_POLICY feature causes all bundles to be at the finest granularity possible. As a result of the data structure used to implement this, the system performance would drop considerably. This update disables a part of XFRM_SUB_POLICY, eliminating the poor performance at the cost of sub-IP address selection granularity in the policy.