Bug 235680

Summary: LSPP: racoon is unable to open files after running for 17 hours.
Product: Red Hat Enterprise Linux 5 Reporter: Joy Latten <latten>
Component: ipsec-toolsAssignee: Steve Conklin <sconklin>
Status: CLOSED CURRENTRELEASE QA Contact: David Lawrence <dkl>
Severity: medium Docs Contact:
Priority: medium    
Version: 5.0CC: eparis, iboverma, joe, krisw, linda.knippers, paul.moore, sgrubb
Target Milestone: ---   
Target Release: ---   
Hardware: powerpc   
OS: Linux   
Whiteboard:
Fixed In Version: RHSA-2007-0342 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2007-06-27 14:16:36 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 224041    
Attachments:
Description Flags
patch addressing socket leak
none
updated patch
none
updated patch none

Description Joy Latten 2007-04-09 15:43:46 UTC
Description of problem:
I started a labeled ipsec stress test Friday afternoon and let it run over the
weekend on lspp 72 kernel with ipsec-tools-0.6.5-6.2. On Saturday morning the
following was logged:
Apr  7 08:10:35 racoon: INFO: respond new phase 2 negotiation:
9.3.xx.xx[500]<=>9.3.xx.xx[500]
Apr  7 08:10:35 racoon: ERROR: racoon: MLS support is not enabled.
Apr  7 08:10:35 racoon: ERROR: within_range: couldn't initialize the AVC
Apr  7 08:10:35 racoon: ERROR: no policy found: 9.3.xx.xx/32[500]
9.3.xx.xx/32[500] proto=any dir=in
sec_ctx:doi=1,alg=1,len=43,str=system_u:object_r:unlabeled_t:s15:c0.c1023
Apr  7 08:10:35 racoon: ERROR: failed to get proposal for responder.
Apr  7 08:10:35 racoon: ERROR: failed to pre-process packet.

I looked at the code, and is_selinux_mls_enabeled() is called to determine if
there is MLS support. Also, avc_init() fails... 

I also noticed following in log file:
Apr  7 08:17:14 racoon: INFO: respond new phase 1 negotiation:
9.3.xx.xx[500]<=>9.3.xx.xx[500] Apr  7 08:17:14 hvracer1 racoon: INFO: begin
Identity Protection mode.
Apr  7 08:17:14 racoon: INFO: received Vendor ID: DPD Apr  7 08:17:14 hvracer1
racoon: ERROR: failed to open pre_share_key file /etc/racoon/psk.txt
Apr  7 08:17:14 racoon: ERROR: couldn't find the pskey for 9.3.xx.xx.
Apr  7 08:17:14 racoon: ERROR: failed to process packet. Apr  7 08:17:14
hvracer1 racoon: ERROR: phase1 negotiation failed.

The inability to get a file descriptor led George to suggest we had used up all
the file descriptors. We did an "lsof | grep racoon" and there were 1041
different entries. 1017 of these were for a socket whose protocol was unknown.

I did an lsof on responder machine and only saw 179 entries of which 155 were a
socket with unknown protocol.

Next step is to look through code and see where initiator is perhaps not closing
socket descriptors.  

Otherwise, racoon is unable to negotiate any new SAs and all the old ones expire.

Version-Release number of selected component (if applicable):
ipsec-tools-0.6.5-6.2

How reproducible:
Appears reproduceable after racoon has been running over 17 hours. 

Steps to Reproduce:
1.setup labeled ipsec by adding ipsec policy with a label
2.start racoon
3.start netperf into sending a stream of packets for 48 hours.
  
Actual results:
All the SAs expire since racoon is unable to negotiae new ones.
That also means the netperf stream stops running.

Expected results:
Should not see logged error messages. And racoon should be running and
SAD populated.

Additional info:

Comment 1 George C. Wilson 2007-04-09 20:53:13 UTC
Joy Latten is actively looking at this bug.

Comment 2 Joy Latten 2007-04-11 20:17:08 UTC
A labeled ipsec stress test has been running for over 20 hours so far and I do
not see this problem occurring. will continue to let run for another 24 hours. I
have been checking the amount of opened file/socket descriptors by ls -l
/proc/[racoon's pid]/fd and "lsof | grep racoon".

On my machine that racoon inititated the negotiation, after 20 hours ls -l in
/proc/pid/fd has 129 open socket descriptors.

On my machine that was the responder, after 20 hours, ls -l in /proc/pid/fd has 
209 open socket descriptors.

A netstat -ae accounts for 5 of these sockets. Where in the world are the rest
of them coming from?? 

I also added debug messages for each time racoon opens a socket.
I have debug messages to account for the above 5 and 3 more for
getting interfaces out of kernel, etc... in grabmyaddrs.c. So again, where are
all these other open socket descriptors from?

Will query the ipsec-tools mailing list for help. 

Comment 3 Joy Latten 2007-04-13 19:53:56 UTC
I ran a set of stress tests between 2 machines for over 48 hours and I did not
see the above problem.

However, after stopping netperf after the 48 hours. I did ls -l on responder's
/proc/pid/fd and there were 461 open socket descriptors. Again, only a handful
could be accounted for. On inititiating machine, ls -l in /proc/pid/fd listed
129 open socket descriptors. Again, only a handful accounted for. Where are all
these descriptors coming from.  Just can't help wonder where they are coming
from. I noticed clusters of the sockets have same date and time. Like 36 of them
have same time. 
 
I am currently running stress tests between 2 machines, this time stressing
labeled ipsec over both loopback and point-to-point. So far has been running
almost 24 hours with no sign of problem of this bug report.   
I did notice that there were a bunch of open sockets before leaving yesterday.
When I looked this morning, they had all been cleaned up and I only saw about 10
open sockets on each machine. Right now I see about 500 on one machine (the on
with loopback and eth0 being stressed) and only about 10 on the other (only eth0
being stressed). 

Comment 4 Steve Grubb 2007-04-14 20:19:19 UTC
Created attachment 152625 [details]
patch addressing socket leak

Based on findings reported by Joe Nall on bz 235475, I created this patch. It
calls init_avc during daemon init rather than each SA negotiation.

Comment 5 Steve Grubb 2007-04-14 20:57:18 UTC
Created attachment 152626 [details]
updated patch

This patch is a little less harsh to nonselinux systems.

Comment 6 Steve Grubb 2007-04-14 21:04:16 UTC
Created attachment 152627 [details]
updated patch

avc_init check was backwards. Should be correct now.

Comment 8 Steve Grubb 2007-04-15 13:07:22 UTC
Built ipsec-tools-0.6.5-6.5 with the above patch. It could use a re-test to make
sure the problem is indeed fixed.  Thanks.

Comment 9 Joe Nall 2007-04-16 14:51:36 UTC
Ran all night with no increase in number of file descriptors. Looks good. Thanks.

Comment 10 Steve Grubb 2007-04-16 17:49:39 UTC
Removing LSPP tracker. Also built a new ipsec-tools-0.6.5-6.6 which adds audit
system call backs so that racoon can correctly act as a user space object manager.