Description of problem: I started a labeled ipsec stress test Friday afternoon and let it run over the weekend on lspp 72 kernel with ipsec-tools-0.6.5-6.2. On Saturday morning the following was logged: Apr 7 08:10:35 racoon: INFO: respond new phase 2 negotiation: 9.3.xx.xx[500]<=>9.3.xx.xx[500] Apr 7 08:10:35 racoon: ERROR: racoon: MLS support is not enabled. Apr 7 08:10:35 racoon: ERROR: within_range: couldn't initialize the AVC Apr 7 08:10:35 racoon: ERROR: no policy found: 9.3.xx.xx/32[500] 9.3.xx.xx/32[500] proto=any dir=in sec_ctx:doi=1,alg=1,len=43,str=system_u:object_r:unlabeled_t:s15:c0.c1023 Apr 7 08:10:35 racoon: ERROR: failed to get proposal for responder. Apr 7 08:10:35 racoon: ERROR: failed to pre-process packet. I looked at the code, and is_selinux_mls_enabeled() is called to determine if there is MLS support. Also, avc_init() fails... I also noticed following in log file: Apr 7 08:17:14 racoon: INFO: respond new phase 1 negotiation: 9.3.xx.xx[500]<=>9.3.xx.xx[500] Apr 7 08:17:14 hvracer1 racoon: INFO: begin Identity Protection mode. Apr 7 08:17:14 racoon: INFO: received Vendor ID: DPD Apr 7 08:17:14 hvracer1 racoon: ERROR: failed to open pre_share_key file /etc/racoon/psk.txt Apr 7 08:17:14 racoon: ERROR: couldn't find the pskey for 9.3.xx.xx. Apr 7 08:17:14 racoon: ERROR: failed to process packet. Apr 7 08:17:14 hvracer1 racoon: ERROR: phase1 negotiation failed. The inability to get a file descriptor led George to suggest we had used up all the file descriptors. We did an "lsof | grep racoon" and there were 1041 different entries. 1017 of these were for a socket whose protocol was unknown. I did an lsof on responder machine and only saw 179 entries of which 155 were a socket with unknown protocol. Next step is to look through code and see where initiator is perhaps not closing socket descriptors. Otherwise, racoon is unable to negotiate any new SAs and all the old ones expire. Version-Release number of selected component (if applicable): ipsec-tools-0.6.5-6.2 How reproducible: Appears reproduceable after racoon has been running over 17 hours. Steps to Reproduce: 1.setup labeled ipsec by adding ipsec policy with a label 2.start racoon 3.start netperf into sending a stream of packets for 48 hours. Actual results: All the SAs expire since racoon is unable to negotiae new ones. That also means the netperf stream stops running. Expected results: Should not see logged error messages. And racoon should be running and SAD populated. Additional info:
Joy Latten is actively looking at this bug.
A labeled ipsec stress test has been running for over 20 hours so far and I do not see this problem occurring. will continue to let run for another 24 hours. I have been checking the amount of opened file/socket descriptors by ls -l /proc/[racoon's pid]/fd and "lsof | grep racoon". On my machine that racoon inititated the negotiation, after 20 hours ls -l in /proc/pid/fd has 129 open socket descriptors. On my machine that was the responder, after 20 hours, ls -l in /proc/pid/fd has 209 open socket descriptors. A netstat -ae accounts for 5 of these sockets. Where in the world are the rest of them coming from?? I also added debug messages for each time racoon opens a socket. I have debug messages to account for the above 5 and 3 more for getting interfaces out of kernel, etc... in grabmyaddrs.c. So again, where are all these other open socket descriptors from? Will query the ipsec-tools mailing list for help.
I ran a set of stress tests between 2 machines for over 48 hours and I did not see the above problem. However, after stopping netperf after the 48 hours. I did ls -l on responder's /proc/pid/fd and there were 461 open socket descriptors. Again, only a handful could be accounted for. On inititiating machine, ls -l in /proc/pid/fd listed 129 open socket descriptors. Again, only a handful accounted for. Where are all these descriptors coming from. Just can't help wonder where they are coming from. I noticed clusters of the sockets have same date and time. Like 36 of them have same time. I am currently running stress tests between 2 machines, this time stressing labeled ipsec over both loopback and point-to-point. So far has been running almost 24 hours with no sign of problem of this bug report. I did notice that there were a bunch of open sockets before leaving yesterday. When I looked this morning, they had all been cleaned up and I only saw about 10 open sockets on each machine. Right now I see about 500 on one machine (the on with loopback and eth0 being stressed) and only about 10 on the other (only eth0 being stressed).
Created attachment 152625 [details] patch addressing socket leak Based on findings reported by Joe Nall on bz 235475, I created this patch. It calls init_avc during daemon init rather than each SA negotiation.
Created attachment 152626 [details] updated patch This patch is a little less harsh to nonselinux systems.
Created attachment 152627 [details] updated patch avc_init check was backwards. Should be correct now.
Built ipsec-tools-0.6.5-6.5 with the above patch. It could use a re-test to make sure the problem is indeed fixed. Thanks.
Ran all night with no increase in number of file descriptors. Looks good. Thanks.
Removing LSPP tracker. Also built a new ipsec-tools-0.6.5-6.6 which adds audit system call backs so that racoon can correctly act as a user space object manager.