Red Hat Bugzilla – Bug 235680
LSPP: racoon is unable to open files after running for 17 hours.
Last modified: 2007-11-30 17:07:43 EST
Description of problem:
I started a labeled ipsec stress test Friday afternoon and let it run over the
weekend on lspp 72 kernel with ipsec-tools-0.6.5-6.2. On Saturday morning the
following was logged:
Apr 7 08:10:35 racoon: INFO: respond new phase 2 negotiation:
Apr 7 08:10:35 racoon: ERROR: racoon: MLS support is not enabled.
Apr 7 08:10:35 racoon: ERROR: within_range: couldn't initialize the AVC
Apr 7 08:10:35 racoon: ERROR: no policy found: 9.3.xx.xx/32
9.3.xx.xx/32 proto=any dir=in
Apr 7 08:10:35 racoon: ERROR: failed to get proposal for responder.
Apr 7 08:10:35 racoon: ERROR: failed to pre-process packet.
I looked at the code, and is_selinux_mls_enabeled() is called to determine if
there is MLS support. Also, avc_init() fails...
I also noticed following in log file:
Apr 7 08:17:14 racoon: INFO: respond new phase 1 negotiation:
9.3.xx.xx<=>9.3.xx.xx Apr 7 08:17:14 hvracer1 racoon: INFO: begin
Identity Protection mode.
Apr 7 08:17:14 racoon: INFO: received Vendor ID: DPD Apr 7 08:17:14 hvracer1
racoon: ERROR: failed to open pre_share_key file /etc/racoon/psk.txt
Apr 7 08:17:14 racoon: ERROR: couldn't find the pskey for 9.3.xx.xx.
Apr 7 08:17:14 racoon: ERROR: failed to process packet. Apr 7 08:17:14
hvracer1 racoon: ERROR: phase1 negotiation failed.
The inability to get a file descriptor led George to suggest we had used up all
the file descriptors. We did an "lsof | grep racoon" and there were 1041
different entries. 1017 of these were for a socket whose protocol was unknown.
I did an lsof on responder machine and only saw 179 entries of which 155 were a
socket with unknown protocol.
Next step is to look through code and see where initiator is perhaps not closing
Otherwise, racoon is unable to negotiate any new SAs and all the old ones expire.
Version-Release number of selected component (if applicable):
Appears reproduceable after racoon has been running over 17 hours.
Steps to Reproduce:
1.setup labeled ipsec by adding ipsec policy with a label
3.start netperf into sending a stream of packets for 48 hours.
All the SAs expire since racoon is unable to negotiae new ones.
That also means the netperf stream stops running.
Should not see logged error messages. And racoon should be running and
Joy Latten is actively looking at this bug.
A labeled ipsec stress test has been running for over 20 hours so far and I do
not see this problem occurring. will continue to let run for another 24 hours. I
have been checking the amount of opened file/socket descriptors by ls -l
/proc/[racoon's pid]/fd and "lsof | grep racoon".
On my machine that racoon inititated the negotiation, after 20 hours ls -l in
/proc/pid/fd has 129 open socket descriptors.
On my machine that was the responder, after 20 hours, ls -l in /proc/pid/fd has
209 open socket descriptors.
A netstat -ae accounts for 5 of these sockets. Where in the world are the rest
of them coming from??
I also added debug messages for each time racoon opens a socket.
I have debug messages to account for the above 5 and 3 more for
getting interfaces out of kernel, etc... in grabmyaddrs.c. So again, where are
all these other open socket descriptors from?
Will query the ipsec-tools mailing list for help.
I ran a set of stress tests between 2 machines for over 48 hours and I did not
see the above problem.
However, after stopping netperf after the 48 hours. I did ls -l on responder's
/proc/pid/fd and there were 461 open socket descriptors. Again, only a handful
could be accounted for. On inititiating machine, ls -l in /proc/pid/fd listed
129 open socket descriptors. Again, only a handful accounted for. Where are all
these descriptors coming from. Just can't help wonder where they are coming
from. I noticed clusters of the sockets have same date and time. Like 36 of them
have same time.
I am currently running stress tests between 2 machines, this time stressing
labeled ipsec over both loopback and point-to-point. So far has been running
almost 24 hours with no sign of problem of this bug report.
I did notice that there were a bunch of open sockets before leaving yesterday.
When I looked this morning, they had all been cleaned up and I only saw about 10
open sockets on each machine. Right now I see about 500 on one machine (the on
with loopback and eth0 being stressed) and only about 10 on the other (only eth0
Created attachment 152625 [details]
patch addressing socket leak
Based on findings reported by Joe Nall on bz 235475, I created this patch. It
calls init_avc during daemon init rather than each SA negotiation.
Created attachment 152626 [details]
This patch is a little less harsh to nonselinux systems.
Created attachment 152627 [details]
avc_init check was backwards. Should be correct now.
Built ipsec-tools-0.6.5-6.5 with the above patch. It could use a re-test to make
sure the problem is indeed fixed. Thanks.
Ran all night with no increase in number of file descriptors. Looks good. Thanks.
Removing LSPP tracker. Also built a new ipsec-tools-0.6.5-6.6 which adds audit
system call backs so that racoon can correctly act as a user space object manager.