From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041224
Description of problem:
(This bug does not happen with any released FC3 kernels.)
With rawhide kernels (1125 and up, although I didn't try anything between 1063 and 1125), I can remotely kill my home router. This is the same box as bug 144324, still running FC3, except with a rawhide kernel.
I'm not absolutely certain that 1125 is as easily killable as 113x and 1143, but it might be. With 1125/1126, the box would oops whether I installed the _FC4 kernel from rawhide or recompiled on FC3 (with a spec file change to work around bug 147281). For 1143 I have only tried the FC4 kernel; I haven't tried recompiling on FC3.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Turn router computer on.
2. From a client computer (behind the router), spend a few minutes web browsing. (I don't know how much HTTP activity is needed. It seems to be a non-zero amount, but approx. 5 minutes of casual web browsing always does it.)
3. Try to ssh from the client computer into the router. (SSHing from a third computer, on the outside Internet rather than behind the router, also works. I don't know if that is as consistently reproducible though. It's consistent enough that it makes telecommuting effectively impossible, however.)
Actual Results: Router goes into super-weird recursive oops. These are the weirdest oopses I've ever seen. Perhaps they won't be the weirdest you've ever seen, but they're still undeniably weird.
Expected Results: Router should not crash, and ssh on client should get a chance to ask the user for the password. The router actually happens to die before the ssh client gets that far.
I will be attaching a few oopses to this bug. I'll have to truncate the longer ones to the first several hundred KB; my serial console box has a failing Ethernet controller and no way of adding a new one so I have to keep the oopses down to floppy-disk size, and I have a hard time imagining that the oopses have any interesting information that far into them anyway.
I'm not sure whether I should set the bug's severity to Security or High. This thing *is* effectively a remote DoS, so I set it to Security, but if you drop the severity to High then that's OK with me.
Perhaps I'll try to dig into the problem this weekend and see if I can narrow things down. OTOH this is a production system so I may decide to move back to a released FC3 kernel, or I may even reinstall with RHEL 4...
Created attachment 111224
one oops with kernel 1143_FC4
This is probably about as sane-looking as the oopses from this bug get.
Created attachment 111225
another 1143_FC4 oops
This oops is a bit more insane (ever seen an oops with this many "=" characters
in it?). It seemed like it was going to go on forever, so this is only the
On an unrelated note, I have no idea how this bug ended up being so wide. :(
Hitting "reload" once on this web page (i.e. this bug report) in Mozilla 1.7.5
seems to be sufficient network activity for step #2 for reproducing this bug.
If I recompile for FC3 with patches 1020 through 10001 disabled, the problem
looks like a stack overflow in the networking layer to me. Davem? Ingo?
I actually don't need any HTTP activity before doing the SSH. I need to just
wait 1-2 minutes or so.
If I reduce the kernel down to the following patches:
450 (Removing Xen completely breaks the build for some reason)
Then I get some message about a double-fault. I only tried this kernel once, so
I don't know how reproducible it is. Unfortunately I didn't have my serial
console up-and-running at that point. I still have the binaries from this kernel
so I can run it again if you wish.
If I reduce it further, to these patches:
Then it's *always* an automatic reboot, rather than super-weird oopses. This is
CPU is a 1GHz Pentium III "Coppermine", if that matters. IIRC the BIOS is
sufficiently recent to update it with the latest microcode. Right now I'm going
to see if I can figure out how to build without patch 450. At that point it
would be the absolute minimum for a recent kernel (1143) being built via RPM
with slab debugging.
1 + 2 + 450 + 500 + 1010 sometimes (if not always, I'm not sure) rapidly prints
the first line of the double-fault message to the serial and VGA consoles before
the spontaneous reboot happens.
I managed to get 1 + 2 + 500 + 1010 to build. That no longer double-faults.
Instead, I get tons of warning messages -- first a multi-screenful blast, and
then a trickle afterward. The trickle varies in speed -- sometimes 2 or 3 lines
a second, sometimes one line every 10-15 seconds or so (very rough estimate,
just to give you an idea).
Attachment coming up...
Created attachment 111237
"Warning: kfree_skb on hard IRQ c0XXXXXX" logs
This is from kernel 1143, recompiled on FC3 and reduced down to patches 1 + 2 +
500 + 1010.
Oops, I clicked on "Submit" instead of inside the text pane. :( What I meant to
say (in addition) is that the machine froze up shortly after these messages.
(There were more messages of the same form that I didn't get to capture on my
serial console, as I had stopped the capture in order to copy it onto an
BTW I know that 1143 isn't the latest anymore, but it's still pretty recent, and
I don't want to change too many variables at once. Perhaps trying with 1146 (or
whatever the latest is now) will be my next step.
If I compile only with patches 1 + 2 + 500, with both slab and pagealloc debug
disabled, I still see the same behavior (i.e. the "Warning: kfree_skb on hard
IRQ c0XXXXXX" messages -- but I haven't yet left it up long enough to see if it
All 4 NICs in this system are pcnet32, if that is relevant.
Created attachment 111239
oops from 2.6.10-1.1148_FC4
I decided to try figuring out when this problem started. So far I've managed to
reproduce this (the oopses) going back to 2.6.11-rc3. That doesn't mean it
didn't start earlier, just that I haven't tried anything between 2.6.10-ac and
Update on my progress:
2.6.11-rc1 oopses. 2.6.10 doesn't.
I'm now going to binary-search to see which 2.6.10-bk patch introduced the oopses...
2.6.10-bk13 is OK. 2.6.10-bk14 oopses.
I'll see if I can narrow it down any further...
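The search described here is an ordinary bisection over an ordered series of snapshots. A minimal sketch in C, assuming snapshots are numbered so that index 0 is plain 2.6.10 and index n is 2.6.10-bkn. The names `first_bad` and `demo_oopses` are hypothetical, and `demo_oopses` merely stands in for "build that snapshot, boot it, and run the steps to reproduce":

```c
#include <stdbool.h>

/* Hypothetical stand-in for the real test (build, boot, try to ssh in).
 * It hardcodes the result reported in this bug: bk13 is OK, bk14 oopses. */
bool demo_oopses(int bk)
{
    return bk >= 14;
}

/* Bisection: 'good' is a snapshot known to work, 'bad' is one known to oops.
 * Each pass tests the midpoint and halves the interval.
 * Returns the index of the first snapshot that oopses. */
int first_bad(int good, int bad, bool (*is_bad)(int))
{
    while (bad - good > 1) {
        int mid = good + (bad - good) / 2;
        if (is_bad(mid))
            bad = mid;   /* the regression is at mid or earlier */
        else
            good = mid;  /* the regression is after mid */
    }
    return bad;
}
```

With index 0 as the known-good end and an assumed known-bad upper bound of 16 (an illustrative number, not taken from this report), `first_bad(0, 16, demo_oopses)` tests snapshots 8, 12, 14, and 13 and returns 14, so four builds are enough.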
Ok, I've narrowed it to a single ChangeSet:
1.1938.474.1 [TCP] merge tcp_sock with tcp_opt
2.6.10-bk13 by itself works. 2.6.10-bk13 + this one changeset oopses.
BTW I need to add a step 4 to the steps-to-reproduce:
4. If the router does not crash after step 3, then log out of the SSH session
and repeat step 3 one more time.
It turns out that, without step 4, my procedure isn't 100% reproducible.
With the new step 4 it seems to be.
2.6.10-bk13 + the changeset stops oopsing if I disable CONFIG_4KSTACKS. I know
that's not a fix, but maybe this is useful information.
Thanks for tracking this down.
The problem is tcp_v4_conn_request(), it puts a struct tcp_sock
on the stack, and this structure is huge. It wasn't so bad when
it was just tcp_opt (before Arnaldo's change), but now it's much
larger. And as you determined, this overflows 4KSTACKS.
I'm discussing possible fixes with Arnaldo.
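To illustrate why a large local variable matters here, the following is a minimal userspace sketch in C. The struct `big_tmp_sock`, its field sizes, and both function names are hypothetical and invented for illustration; they are not the kernel's definitions. `calloc` is only a userspace stand-in for dynamic allocation, not what kernel code would call.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a large per-connection structure.
 * The sizes are made up; the point is only that it is a sizable
 * fraction of a 4096-byte stack. */
struct big_tmp_sock {
    unsigned char  base_state[1024];
    unsigned short mss_clamp;
    unsigned char  more_state[512];
};

/* Risky pattern: the whole structure lives in this function's stack
 * frame. Returns the number of bytes the local variable occupies. */
size_t parse_on_stack(unsigned short mss)
{
    struct big_tmp_sock tmp;

    memset(&tmp, 0, sizeof(tmp));
    tmp.mss_clamp = mss;
    return sizeof(tmp);
}

/* Alternative: only a pointer lives on the stack; the object itself is
 * allocated dynamically. Returns the size of that pointer, or 0 if the
 * allocation failed. */
size_t parse_on_heap(unsigned short mss)
{
    struct big_tmp_sock *tmp = calloc(1, sizeof(*tmp));

    if (tmp == NULL)
        return 0;
    tmp->mss_clamp = mss;
    free(tmp);
    return sizeof(tmp);
}
```

In this sketch the first function spends more than 1.5 KB of its frame on one temporary, which leaves little headroom once the rest of a deep call chain is added on a 4 KB stack; the second spends only a pointer's worth.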
Damn, I just changed my password on this bugzilla to ask if ipv6 is enabled.
It probably is. Could you test with a kernel with ipv6 completely disabled?
That is, it is not enough to just not load ipv6.
Could you please try the attached patch? It's one of the possible
solutions for this problem. It's completely untested and may be suboptimal,
but I'd like to see test results if possible. It's too late here in Brazil and
I'm off to bed; I'll look at this again tomorrow. Thanks.
Created attachment 111285
allocate the tmp sock from TCP slab cache, not on the stack.
The patch to test.
Yes, ipv6 is enabled in the kernels I've been testing.
I'll test your patch now.
Created attachment 111287
oops with 2.6.10-1.1148 + the patch
If you want me to try against a mainline kernel, I will, but it appears that
the patch didn't fully fix it.
I am now recompiling without the patch and with ipv6 disabled. Then I will try
again, with the patch but ipv6 still disabled.
Oh, by the way, it looks to me like tcp_timewait_state_process() and
tcp_check_req() in net/ipv4/tcp_minisocks.c are also allocating tcp_socks on the
I think I found the bug in the patch. I'll have a new patch shortly (it'll
probably be a patch that applies on top of the first one).
Never mind, I didn't find the bug in the patch, I just got confused. Sorry about
With ipv6 disabled, I can't reproduce the bug anymore, even with 4KSTACKS
enabled. (This is without the patch.)
OK, thanks for the tests so far. I'll now take a look at the other sites where
tcp_socks are being allocated on the stack.
Created attachment 111290
an untested attempt to fix the first patch
This patch applies on top of the first patch. It's currently untested, not even
compile-tested. Right now I'm compiling, and after that I'll run it.
Created attachment 111291
move things related to options received from tcp_sock to tcp_received_options and make tcp_sock use it
Could you please test this fix? It touches a lot of places, but it seems to be
the right thing as per what I discussed with David.
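The idea named in the attachment title (move the received-options fields into their own structure and embed it in the socket) can be sketched roughly as follows. This is a userspace illustration in C under stated assumptions, not the actual patch: every identifier (`demo_received_options`, `demo_sock`, `demo_parse_options`, and so on) and every field is hypothetical.

```c
#include <string.h>

/* Hypothetical: only the fields filled in while parsing received
 * options, split out into a small structure of their own. */
struct demo_received_options {
    unsigned int   ts_recent;
    unsigned short mss_clamp;
    unsigned char  saw_tstamp;
};

/* Hypothetical full socket: embeds the small structure, so existing
 * state stays in one object. The 1500-byte array is an invented size. */
struct demo_sock {
    unsigned char                other_state[1500];
    struct demo_received_options rx_opt;
};

/* The parser needs only the small structure, not the whole socket. */
void demo_parse_options(struct demo_received_options *opt,
                        unsigned short peer_mss)
{
    memset(opt, 0, sizeof(*opt));
    opt->mss_clamp = peer_mss;
}

/* Connection-request path: the temporary on the stack is now a few
 * bytes instead of a complete socket. */
unsigned short demo_conn_request(unsigned short peer_mss)
{
    struct demo_received_options tmp;

    demo_parse_options(&tmp, peer_mss);
    return tmp.mss_clamp;
}

/* Established-socket path: the same parser works on the embedded copy. */
void demo_sock_update(struct demo_sock *sk, unsigned short peer_mss)
{
    demo_parse_options(&sk->rx_opt, peer_mss);
}
```

The design point of such a split is that callers needing scratch space for option parsing no longer have to instantiate the full socket structure, which would explain why it touches many call sites.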
Comment on attachment 111290
an untested attempt to fix the first patch
marking my patch as obsolete because of a stupid typo
I'll try the new patch right after I hit "Submit" now.
Ok, with the patch from comment #28, I can't reproduce the problem anymore, even
with ipv6 and 4KSTACKS enabled.
Excellent news! Thank you for the hard work reporting, narrowing down the
problem, and testing my patches. It is great to work with people like you!
Dave, when you wake up please see if everything is OK and please push this
Sure thing, working on that now.
Linus has taken in the fix upstream, so it will be in
2.6.11-final whenever that comes out.
The fix should also be in 2.6.11-rc4-bk11 and 2.6.11-rc5. This also implies that
it should be in 2.6.10-1.1151_FCx and later.
I say "should," because I haven't had a chance to try running any of those
kernels yet, to verify that they really don't crash with my testcase anymore.
If anyone (with sufficient privs) is less paranoid than me and wants to close
this as RAWHIDE right this instant, that's OK with me. Otherwise, I guess I'll
wait for 1151 or later to hit public rawhide, and then I'll test that before
closing this bug.
I just tested 2.6.10-1.1155_FC4 and the bug is gone. Closing.