Red Hat Bugzilla – Bug 11320
kernel-2.2.14-12 crashes after multiple failed connect() calls on same socket
Last modified: 2008-05-01 11:37:55 EDT
This is a severe bug that crashes kernel-2.2.14-12 (SMP and UP), as well as
its predecessors. It shows up when a program makes multiple calls to
connect(), using the same socket for each call. If the connect() calls
fail repeatedly with "Connection refused", the kernel will crash out after
just a few calls.
The problem is aggravated if other system calls are made between calls to
connect(). I use a call to sleep(1) in my example program. In this case
the kernel will usually fail after three to six cycles.
I also tested this on a kernel 2.2.14 I built myself from the tarball, and
it did not crash. It returned "Connection refused" the first time, and
"Invalid argument" after that. (The connect() man page suggests that
this is incorrect as well, but that's a different story.)
For example code, see http://www.cpons.com/linux-bug/bomb_kernel.c
And sync your disks before running it...
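For readers who cannot reach the URL, the pattern described above can be sketched roughly as follows. This is NOT the attached bomb_kernel.c; the function name reuse_socket_connect and its parameters are made up for illustration, and the output format is only modelled on the "Trial: N connect: ..." lines quoted later in this report. On an affected kernel this pattern is what reportedly triggers the crash, so the same warning about syncing disks applies.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical sketch: call connect() up to `trials` times on ONE socket,
 * sleeping `delay_sec` seconds between attempts.
 * Returns the number of failed connect() calls, or -1 on setup error.
 */
int reuse_socket_connect(const char *ip, unsigned short port,
                         int trials, unsigned int delay_sec)
{
    struct sockaddr_in addr;
    int failures = 0;
    int trial;
    int fd;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
        return -1;                      /* not a valid IPv4 address */

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    for (trial = 1; trial <= trials; trial++) {
        /* Same fd every time: this reuse is what the report is about. */
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
            break;                      /* something is listening after all */
        printf("Trial: %d connect: %s\n", trial, strerror(errno));
        failures++;
        if (delay_sec > 0)
            sleep(delay_sec);           /* the other syscall between connects */
    }

    close(fd);
    return failures;
}
```

Calling reuse_socket_connect("127.0.0.1", port, 10, 1) with a port that nothing listens on should exercise the path described here. Which errno the second and later calls report differs between kernels, so the sketch only counts failures.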
Created attachment 230 [details]
Test program to show the bug.
This is a duplicate of the bug reported 4/22 on rawhide, #10986. FYI, example
code is included there as well.
Verified this is the case. It's also not the known bug we found with
daddr=0. There is something more fundamentally [word not allowed on a public web
site] with this. The trace shows the daddr-based check doesn't seem to be working
at all. Might be a bug in the connect-in-progress code path?
Cause identified. I'm working on what needs to get fixed, and checking with
people so I don't break something else in the process.
I have been having occasional (roughly weekly) system crashes with the
2.2.14-5.0 kernel which seem to match the above description. Not surprisingly,
no info about the crash could be found in any of the system logs.
I decided to test the kernel bug reported here and of course the test program
bomb_kernel.c does the job of killing off the 2.2.14-5.0 kernel after 3-6
I also tried bomb_kernel.c on my custom built 2.2.16 kernel and everything seems
fine now. I get
Trial: 1 connect: Connection refused
Trial: 2 connect: Connection refused
up to the 411th time it tries to make the connection, whereafter I get
Trial: 411 connect: No buffer space available
Trial: 412 connect: No buffer space available
and so on.
In any case the kernel didn't crash, which was nice, and hasn't so far. I'm just
waiting to see if this has cleared up my almost weekly system crashes....
I ran into this truly annoying problem and found a way around it.
You can work around this bug in user-level code by doing this:
0. Open a socket.
1. Call connect() on that newly opened fd ONCE. If it succeeds, you are done.
2. If it fails for whatever reason, close the socket.
3. Go back to step 0, as many times as you want to try connecting to the service.
Normally, people open a socket, keep it alive across every connect attempt,
and close it only if all of the attempts fail. With this workaround, you close
the fd and open a new one for every connect attempt.
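The steps above might look something like this in C. It is only a sketch under a few assumptions: IPv4, a blocking TCP socket, and a helper name (connect_with_fresh_socket) and parameters invented here for illustration, not taken from any program in this report.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to connect to ip:port up to `max_attempts`
 * times, using a brand-new socket for every attempt.
 * Returns a connected fd, or -1 if every attempt failed.
 */
int connect_with_fresh_socket(const char *ip, unsigned short port,
                              int max_attempts, unsigned int delay_sec)
{
    struct sockaddr_in addr;
    int attempt;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1) {
        errno = EINVAL;                 /* not a valid IPv4 address */
        return -1;
    }

    for (attempt = 1; attempt <= max_attempts; attempt++) {
        int saved_errno;
        int fd = socket(AF_INET, SOCK_STREAM, 0);       /* step 0 */
        if (fd < 0)
            return -1;

        /* step 1: exactly one connect() per socket */
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
            return fd;                  /* caller owns and closes the fd */

        saved_errno = errno;
        fprintf(stderr, "Trial: %d connect: %s\n",
                attempt, strerror(saved_errno));
        close(fd);                      /* step 2: never reuse a failed fd */
        errno = saved_errno;            /* keep the connect() error for the caller */

        if (attempt < max_attempts && delay_sec > 0)
            sleep(delay_sec);           /* step 3: back to step 0 */
    }
    return -1;
}
```

A caller would do something like fd = connect_with_fresh_socket("127.0.0.1", 80, 5, 1) and close(fd) when finished. The cost is one extra socket()/close() pair per retry.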
Hope this helps.....
Alan, was this fixed in our 2.2.16-3 kernel as I suspect?
DaveM fixed it yes