Bug 499887 - IPSEC doesn't work with a big SPD/SAD
Status: CLOSED INSUFFICIENT_DATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.1
Hardware: All Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Neil Horman
QA Contact: Red Hat Kernel QE team
Blocks: 533192
 
Reported: 2009-05-08 14:35 EDT by Marc Milgram
Modified: 2010-10-23 05:30 EDT
CC List: 4 users

Doc Type: Bug Fix
Last Closed: 2009-10-21 15:57:58 EDT


Attachments: None

Description Marc Milgram 2009-05-08 14:35:44 EDT
Description of problem:
Same description as https://trac.ipsec-tools.net/ticket/1
The ipsec-tools suite simply does not work when the SPD and/or SAD becomes big (problems can start at around 100 tunnels, which is not that big).

The main problem behind all of this is in the PFKey interface:
when userland (racoon / setkey) sends a SADB_DUMP or a SADB_X_SPDDUMP, it sends a single PFKey message, but the kernel replies with one PFKey message per entry.

Those messages are sent through a UNIX socket, and the socket's buffer quickly fills up.

Userland tools have no chance to drain it, as almost all kernels process the whole PFKey request before giving any CPU time back to userland.
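To make the failure mode above concrete, here is a rough model of it in Python. It is only a sketch: the message size and buffer size are hypothetical round numbers, not values measured on the affected system, and the real kernel charges more than the payload size against the socket buffer.

```python
def simulate_dump(entries, msg_size, rcvbuf):
    """Count how many per-entry dump replies fit in the receive buffer.

    Models the kernel answering a single dump request with `entries`
    replies of `msg_size` bytes each, while userland gets no chance to
    read from the socket in between. All sizes are hypothetical.
    """
    queued = 0
    used = 0
    for _ in range(entries):
        if used + msg_size > rcvbuf:
            break  # buffer is full: this reply and all later ones are lost
        used += msg_size
        queued += 1
    return queued


if __name__ == "__main__":
    # Hypothetical numbers: 192 SPD entries, 512-byte replies, 64 KiB buffer.
    print(simulate_dump(entries=192, msg_size=512, rcvbuf=65536))  # prints 128
```

With these assumed sizes only the first 128 entries reach userland. The point is that the cutoff depends on the buffer size and the per-entry message size, not on anything racoon or setkey does.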



Version-Release number of selected component (if applicable):
kernel-2.6.18-53

How reproducible:
Very

Steps to Reproduce:
1. Load many rules into the SPD (e.g. 192)
2. Validate that all rules are loaded into racoon correctly
3. Validate links.
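For step 1, a large rule set can be generated rather than written by hand. The sketch below prints setkey `spdadd` lines for outbound tunnel policies. Every address in it is a made-up placeholder (the customer's real rules are not in this report), and the helper name `spd_rules` is hypothetical.

```python
import ipaddress


def spd_rules(count,
              local_net="10.0.0.0/24",
              remote_base="10.1.0.0",
              gw_local="192.0.2.1",
              gw_remote_base="198.51.100.0"):
    """Return `count` setkey spdadd lines, one outbound tunnel policy each.

    All addresses are placeholders chosen for illustration only.
    """
    remote0 = ipaddress.ip_address(remote_base)
    gw0 = ipaddress.ip_address(gw_remote_base)
    rules = []
    for i in range(1, count + 1):
        remote = remote0 + i   # protected remote host, e.g. 10.1.0.1
        gw_remote = gw0 + i    # remote tunnel endpoint, e.g. 198.51.100.1
        rules.append(
            f"spdadd {local_net} {remote}/32 any -P out ipsec "
            f"esp/tunnel/{gw_local}-{gw_remote}/require;"
        )
    return rules


if __name__ == "__main__":
    # Redirect the output to a file and load it with: setkey -f <file>
    print("\n".join(spd_rules(192)))
```

A full reproduction would also need the matching inbound policies and a racoon configuration for each peer; this only covers bulk creation of SPD entries.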
  
Actual results:
Not all rules are loaded (the customer indicated that the first 170 were loaded).

Expected results:
All rules loaded.

Additional info:
This may be fixed by the following two patches:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=4c563f7669c10a12354b72b518c2287ffc6ebfb3
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=83321d6b9872b94604e481a79dc2c8acbe4ece31
Comment 1 Neil Horman 2009-05-17 14:01:30 EDT
Looks like you might be right about the git commits.  Do you already have this set up to reproduce on a set of systems somewhere, or do I need to set it up myself?
Comment 2 Marc Milgram 2009-05-18 08:50:18 EDT
I don't have a setup to test this.
Comment 3 Neil Horman 2009-05-18 11:34:36 EDT
So, I don't have enough hosts to actually validate all the links, but I set up a few hundred SA and SPD entries (one 24-bit subnet's worth of each) and started racoon in the foreground with debug maxed out.  I was able to observe racoon read all of the resulting entries produced by the SPDDUMP it issues at startup, so I'm not sure what's going on here.  I agree the above commits look like they might improve performance in reading those entries, but I'm hesitant to take them just because something might be a bit slow.  That said, I am using a newer kernel that has some fixes for UNIX sockets in it (although I could have sworn setkey uses the PF_KEY address family rather than PF_UNIX; I'll need to check on that).  Anywho, can the customer try this with the latest kernel?
Comment 6 Neil Horman 2009-05-18 12:54:33 EDT
I can read the bugzilla, I'm saying I can't reproduce the problem.  I loaded 254 SPD rules and 254 associated SA rules on a system, and started racoon in the foreground with full debug.  Parsing through the output, I see that racoon reads all 254 entries from the X_SPDDUMP request (the results are on amd-toonie2-01.rhts.bos.redhat.com:/root/resutls if you want to see, just grep sub: results).  Anywho, it obviously works for me on the system I'm using.  Looking at the patches above, I see how they might help, but the first looks like an ABI breaker, so it's out.  The second looks doable, but before we take it, I'd really like to see the problem occur consistently, and then cease to occur when we take the patch.  My guess is that there is a load aspect to the problem that isn't being considered here.  As an alternative, the second patch should apply pretty cleanly to the RHEL 5 kernel, I think.  Can the customer try out a test kernel with the second of the above patches included to confirm the fix?
Comment 7 Marc Milgram 2009-05-18 13:04:13 EDT
The customer is willing to try a test kernel in order to confirm the fix.
Comment 8 Neil Horman 2009-05-18 16:41:52 EDT
Gah, the second patch is pretty nondescript, but it requires the first patch to work properly, which makes the whole thing an ABI breaker.  I'm going to try to hack something together to make this work, but I can't promise anything.

In the meantime, I expect that the customer can work around this issue (assuming the problem is what we think it is) by setting /proc/sys/net/core/rmem_default and rmem_max to very large numbers.  If racoon doesn't explicitly reset those values on any sockets that it opens, that should prevent blocks/drops on the pf_key protocol and avoid this issue.  If so, the racoon startup script can be adjusted to ensure that it starts with a sufficiently large buffer space to avoid the problem.  Please relay that to the customer and let me know how that works out.
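To check what the workaround changes, the two sysctl values can be read back and compared with the receive buffer a socket actually gets. The Python below is a sketch only: it uses an ordinary UDP socket as a stand-in, since opening a PF_KEY socket needs root, and the helper names are hypothetical. It reads the settings but does not change them.

```python
import socket
from pathlib import Path

PROC_NET_CORE = Path("/proc/sys/net/core")


def read_rmem(proc_dir=PROC_NET_CORE):
    """Return the current rmem_default and rmem_max values in bytes."""
    values = {}
    for name in ("rmem_default", "rmem_max"):
        values[name] = int((Path(proc_dir) / name).read_text().split()[0])
    return values


def request_rcvbuf(sock, size):
    """Ask for a receive buffer of `size` bytes; return what the kernel granted."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


if __name__ == "__main__":
    try:
        print(read_rmem())
    except OSError as exc:
        print("could not read sysctl values:", exc)
    # A UDP socket stands in for the PF_KEY socket racoon would open.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        print("default SO_RCVBUF:", s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
        print("after asking for 4 MiB:", request_rcvbuf(s, 4 * 1024 * 1024))
```

A socket that never sets SO_RCVBUF starts out with rmem_default, and an explicit request is capped by rmem_max, which is why the workaround raises both. On Linux the value read back is normally double the granted size, because the kernel reserves space for its own bookkeeping.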
