Red Hat Bugzilla – Bug 499887
IPSEC doesn't work with a big SPD/SAD
Last modified: 2010-10-23 05:30:04 EDT
Description of problem:
Same description as https://trac.ipsec-tools.net/ticket/1
The ipsec-tools suite stops working when the SPD and/or SAD becomes big (problems can start at around 100 tunnels, which is not so "big"!).
The main problem lies in the PFKey interface:
when userland (racoon / setkey) sends a SADB_DUMP or a SADB_X_SPDDUMP, it sends a single PFKey message, but the kernel replies with one PFKey message per entry.
Those messages are sent through a UNIX socket, and the socket's buffer quickly fills up.
Userland tools have no chance to drain it, as almost all kernels process the whole PFKey request before giving any CPU time back to userland.
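To make the failure mode concrete, here is a toy model of it in Python. It is only a sketch: the message size and buffer size are made-up numbers, and real kernels account for socket buffer usage differently (per-message overhead is not modelled). The function name `dump_into_buffer` is hypothetical.

```python
def dump_into_buffer(entries, msg_size, rcvbuf_bytes):
    """Toy model of a dump: the kernel queues one reply per entry
    before the reader gets a chance to run and drain the buffer."""
    queued = 0      # bytes currently sitting in the receive buffer
    delivered = 0   # replies that fit into the buffer
    dropped = 0     # replies lost because the buffer was already full
    for _ in range(entries):
        if queued + msg_size <= rcvbuf_bytes:
            queued += msg_size
            delivered += 1
        else:
            dropped += 1
    return delivered, dropped


# Assumed, illustrative sizes only: 192 entries, 600 bytes per reply,
# and a receive buffer of 100000 bytes.
delivered, dropped = dump_into_buffer(entries=192, msg_size=600, rcvbuf_bytes=100_000)
print(f"delivered={delivered} dropped={dropped}")
```

With these assumed sizes the tail of the dump is lost, which matches the shape of the symptom (the first N rules arrive, the rest do not). Making the buffer larger than `entries * msg_size` makes the drops disappear in the model.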
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Load many rules into the SPD (e.g. 192)
2. Validate that all rules are loaded into racoon correctly
3. Validate links.
Actual results:
Not all rules are loaded (the customer indicated that the first 170 were loaded).
Expected results:
All rules loaded.
This may be fixed by the following two patches:
Looks like you might be right about the git commits. Do you already have this set up to reproduce on a set of systems somewhere, or do I need to set it up myself?
I don't have a setup to test this.
So, I don't have enough hosts to actually validate all the links, but I set up a few hundred SA and SPD entries (one 24-bit subnet's worth of each) and started racoon in the foreground with debug maxed out. I was able to observe racoon read all of the resultant entries produced by the SPDDUMP it issues at startup, so I'm not sure what's going on here. I agree the above commits look like they might improve performance in reading those entries, but I'm hesitant to take them just because something might be a bit slow. That said, I am using a newer kernel that has some fixes for UNIX sockets in it (although I could have sworn setkey uses the PF_KEY address family rather than PF_UNIX, I'll need to check on that). Anywho, can the customer try this with the latest kernel?
I can read the bugzilla, I'm saying I can't reproduce the problem. I loaded 254 SPD rules and 254 associated SA rules on a system, and started racoon in the foreground with full debug. Parsing through the output, I see that racoon reads all 254 entries from the X_SPDDUMP request. (The results are on amd-toonie2-01.rhts.bos.redhat.com:/root/resutls if you want to see, just grep sub: results). Anywho, it obviously works for me on the system I'm using. Looking at the patches above, I see how they might help, but the first looks like an ABI breaker, so it's out. The second looks doable, but before we take it, I'd really like to see the problem occur consistently, and then cease to occur when we take the patch. My guess is that there is a load aspect to the problem that isn't being considered here. As an alternative, the second patch should apply pretty cleanly to the rhel5 kernel I think. Can the customer try a test kernel out with the second of the above patches included to confirm the fix?
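For anyone who wants to build a similar test load, a sketch along these lines generates one SPD rule per host of a /24 (254 rules) in setkey syntax. This is not the exact setup used above: the subnets, the tunnel endpoints, and the function name `spd_rules` are all assumptions chosen for illustration, and only the SPD side is covered (no matching SA entries).

```python
import ipaddress


def spd_rules(local_net="10.1.0.0/24", remote_net="10.2.0.0/24",
              gw_local="192.0.2.1", gw_remote="192.0.2.2"):
    """Return setkey commands: one outbound tunnel policy per host pair.

    All addresses are placeholders, not values taken from this report.
    """
    local_hosts = ipaddress.ip_network(local_net).hosts()
    remote_hosts = ipaddress.ip_network(remote_net).hosts()
    lines = ["spdflush;"]  # start from an empty SPD
    for src, dst in zip(local_hosts, remote_hosts):
        lines.append(
            f"spdadd {src}/32 {dst}/32 any -P out ipsec "
            f"esp/tunnel/{gw_local}-{gw_remote}/require;"
        )
    return lines


rules = spd_rules()
print(len(rules) - 1, "spdadd lines generated")
```

Writing `"\n".join(rules)` to a file and feeding it to `setkey -f`, or piping it into `setkey -c`, should load the policies on a test box. Adjust the addresses to something that makes sense for the hosts involved first.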
The customer is willing to try a test kernel in order to confirm the fix.
Gah, the second patch is pretty nondescript, but it requires the first patch to work properly, which makes the whole thing an ABI breaker. I'm going to try to hack something together to make this work, but I can't promise anything.
In the meantime, I expect that the customer can work around this issue (assuming the problem is what we assume it is) by setting /proc/sys/net/core/rmem_default and rmem_max to very large numbers. If racoon doesn't explicitly reset those values on any sockets that it opens, that should prevent blocks/drops on the pf_key protocol and avoid this issue. If so, the racoon startup script can be adjusted to ensure that it starts with a sufficiently large buffer space to make the problem avoidable. Please relay that to the customer and let me know how that works out.
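A rough sketch of that workaround as a small Python helper, in case it is easier to hand over than prose. It raises the two sysctls only if they are currently lower than the requested size. The helper name `raise_rmem` and the 4 MiB figure in the usage note are assumptions for illustration, not tested recommendations, and writing to the real /proc files requires root.

```python
import os


def raise_rmem(wanted_bytes, base="/proc/sys/net/core"):
    """Raise rmem_max and rmem_default to at least wanted_bytes.

    Returns the values in effect afterwards. The `base` parameter exists
    only so the helper can be tried against a scratch directory.
    """
    result = {}
    for name in ("rmem_max", "rmem_default"):
        path = os.path.join(base, name)
        with open(path) as f:
            current = int(f.read().strip())
        if current < wanted_bytes:
            # Only ever increase the value; never shrink an existing setting.
            with open(path, "w") as f:
                f.write(f"{wanted_bytes}\n")
            current = wanted_bytes
        result[name] = current
    return result
```

Calling `raise_rmem(4 * 1024 * 1024)` as root before racoon starts would be one way to do it. The same effect can be had by echoing values into the two /proc files from the startup script. Either way this only helps if racoon does not set its own receive buffer size on the socket.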