Bug 149143
Description
Barry K. Nathan
2005-02-19 09:16:26 UTC
Created attachment 111224 [details]
one oops with kernel 1143_FC4
This is probably about as sane-looking as the oopses from this bug get.
Created attachment 111225 [details]
another 1143_FC4 oops
This oops is a bit more insane (ever seen an oops with this many "=" characters
in it?). It seemed like it was going to go on forever, so this is only the
first 500-600KB.
On an unrelated note, I have no idea how this bug ended up being so wide. :(
Hitting "reload" once on this web page (i.e. this bug report) in Mozilla 1.7.5 seems to be sufficient network activity for step #2 for reproducing this bug.

If I recompile for FC3 with patches 1020 through 10001 disabled, the problem still occurs.

Looks like a stack overflow in the networking layer to me. Davem? Ingo?

I actually don't need any HTTP activity before doing the SSH. I just need to wait 1-2 minutes or so.

If I reduce the kernel down to the following patches:
1
2
212
450 (Removing Xen completely breaks the build for some reason)
500
511
512
513
1010
Then I get some message about a double-fault. I only tried this kernel once, so I don't know how reproducible it is. Unfortunately I didn't have my serial console up-and-running at that point. I still have the binaries from this kernel so I can run it again if you wish.

If I reduce it further, to these patches:
1
2
450
500
1010
Then it's *always* an automatic reboot, rather than super-weird oopses. This is 100% reproducible.

CPU is a 1GHz Pentium III "Coppermine", if that matters. IIRC the BIOS is sufficiently recent to update it with the latest microcode.

Right now I'm going to see if I can figure out how to build without patch 450. At that point it would be the absolute minimum for a recent kernel (1143) being built via RPM with slab debugging.

1 + 2 + 450 + 500 + 1010 sometimes (if not always, I'm not sure) rapidly prints the first line of the double-fault message to the serial and VGA consoles before the spontaneous reboot happens.

I managed to get 1 + 2 + 500 + 1010 to build. That no longer double-faults. Instead, I get tons of warning messages -- first a multi-screenful blast, and then a trickle afterward. The trickle varies in speed -- sometimes 2 or 3 lines a second, sometimes one line every 10-15 seconds or so (very rough estimate, just to give you an idea). Attachment coming up...

Created attachment 111237 [details]
"Warning: kfree_skb on hard IRQ c0XXXXXX" logs
This is from kernel 1143, recompiled on FC3 and reduced down to patches 1 + 2 +
500 + 1010.
Oops, I clicked on "Submit" instead of inside the text pane. :( What I meant to say (in addition) is that the machine froze up shortly after these messages. (There were more messages of the same form that I didn't get to capture on my serial console, as I had stopped the capture in order to copy it onto an Internet-connected computer.)

BTW I know that 1143 isn't the latest anymore, but it's still pretty recent, and I don't want to change too many variables at once. Perhaps trying with 1146 (or whatever the latest is now) will be my next step.

If I compile only with patches 1 + 2 + 500, with both slab and pagealloc debug disabled, I still see the same behavior (i.e. the "Warning: kfree_skb on hard IRQ c0XXXXXX" messages -- but I haven't yet left it up long enough to see if it freezes too).

All 4 NICs in this system are pcnet32, if that is relevant.

Created attachment 111239 [details]
oops from 2.6.10-1.1148_FC4
I decided to try figuring out when this problem started. So far I've managed to reproduce this (the oopses) going back to 2.6.11-rc3. That doesn't mean it didn't start earlier, just that I haven't tried anything between 2.6.10-ac and 2.6.11-rc3 yet.

Update on my progress: 2.6.11-rc1 oopses. 2.6.10 doesn't. I'm now going to binary-search to see which 2.6.10-bk patch introduced the oopses...

2.6.10-bk13 is OK. 2.6.10-bk14 oopses. I'll see if I can narrow it down any further...

Ok, I've narrowed it to a single ChangeSet:
1.1938.474.1 [TCP] merge tcp_sock with tcp_opt
http://linux.bkbits.net:8080/linux-2.5/cset@41d1a480O2yDOgBi0vPoTs0torY1Tw
2.6.10-bk13 by itself works. 2.6.10-bk13 + this one changeset oopses.

BTW I need to add a step 4 to the steps-to-reproduce:
4. If the router does not crash after step 3, then log out of the SSH session and repeat step 3 one more time.
It turns out that, without step 4, my procedure isn't fully 100% reproducible. With the new step 4 it seems to be.

2.6.10-bk13 + the changeset stops oopsing if I disable CONFIG_4KSTACKS. I know that's not a fix, but maybe this is useful information.

Thanks for tracking this down. The problem is tcp_v4_conn_request(): it puts a struct tcp_sock on the stack, and this structure is huge. It wasn't so bad when it was just tcp_opt (before Arnaldo's change), but now it's much larger. And as you determined, this overflows the stack with 4KSTACKS. I'm discussing possible fixes with Arnaldo.

Damn, I just changed my password on this bugzilla to ask whether ipv6 is enabled. It probably is; could you test with a kernel with ipv6 completely disabled? I.e. it is not enough to not load ipv6.

Could you please try with the attached patch? It's one of the possible solutions for this problem, but it's completely untested and may be suboptimal. I'd like to see test results if possible. It's too late here in Brazil and I'm off to bed; tomorrow I'll look at this again. Thanks.

Created attachment 111285 [details]
allocate the tmp sock from TCP slab cache, not on the stack.
The patch to test.
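
For anyone following along, here is a rough user-space C sketch of why moving the temporary off the stack helps under 4KSTACKS. Everything in it is an assumption made up for illustration: the 1536-byte `big_sock`, the 64 bytes of per-frame overhead used in the usage note, and the helper names `frames_fit` and `handle_request_heap` are not kernel values or kernel APIs, and `malloc` merely stands in for an allocation from the TCP slab cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Total stack available per thread with 4K stacks. */
#define STACK_BUDGET 4096u

/* Hypothetical stand-in for a full tcp_sock; the real size differs. */
struct big_sock { char state[1536]; };

/* Returns 1 if `depth` nested frames of `frame_bytes` each fit in `budget`,
 * 0 otherwise. Integer division avoids overflow in depth * frame_bytes. */
int frames_fit(size_t budget, size_t frame_bytes, unsigned depth)
{
    assert(frame_bytes > 0);
    return depth <= budget / frame_bytes;
}

/* Heap variant: only a pointer lives on the stack.
 * Returns 0 on success, -1 if the allocation fails. */
int handle_request_heap(void)
{
    struct big_sock *tmp = malloc(sizeof(*tmp));

    if (tmp == NULL)
        return -1;
    memset(tmp, 0, sizeof(*tmp));
    /* ... fill in and use the temporary socket here ... */
    free(tmp);
    return 0;
}
```

With these made-up numbers, `frames_fit(STACK_BUDGET, sizeof(struct big_sock) + 64, 3)` is 0 (three nested frames carrying the big struct do not fit in 4 KB), while `frames_fit(STACK_BUDGET, sizeof(struct big_sock *) + 64, 3)` is 1. The trade-off is that the allocating version has a failure path to handle, which an on-stack temporary does not.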
Yes, ipv6 is enabled in the kernels I've been testing. I'll test your patch now. Created attachment 111287 [details]
oops with 2.6.10-1.1148 + the patch
If you want me to try against a mainline kernel, I will, but it appears that
the patch didn't fully fix it.
I am now recompiling without the patch and with ipv6 disabled. Then I will try
again, with the patch but ipv6 still disabled.
Oh, by the way, it looks to me like tcp_timewait_state_process() and tcp_check_req() in net/ipv4/tcp_minisocks.c are also allocating tcp_socks on the stack.

I think I found the bug in the patch. I'll have a new patch shortly (it'll probably be a patch that applies on top of the first one).

Never mind, I didn't find the bug in the patch, I just got confused. Sorry about that.

With ipv6 disabled, I can't reproduce the bug anymore, even with 4KSTACKS enabled. (This is without the patch.)

OK, thanks for the tests so far. I'll now take a look at the other sites where tcp_socks are being allocated on the stack.

Created attachment 111290 [details]
an untested attempt to fix the first patch
This patch applies on top of the first patch. It's currently untested, not even
compile-tested. Right now I'm compiling, and after that I'll run it.
Created attachment 111291 [details]
move things related to options received from tcp_sock to tcp_received_options and make tcp_sock use it
Could you please test this fix? It touches a lot of places, but seems to be
the right thing as per what I discussed with David.
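
A minimal sketch of the idea in that patch title, for readers who haven't opened the attachment: group the received-option fields into their own small struct, embed it in the big socket struct, and let code that only needs a temporary use the small struct instead of a whole socket. The struct names, field names, and sizes below are hypothetical placeholders and are not taken from the actual patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical grouping of the fields filled in while parsing options. */
struct received_options {
    unsigned short mss_clamp;
    unsigned char  saw_tstamp;
    unsigned char  wscale_ok;
    unsigned int   rcv_tsval;
    unsigned int   rcv_tsecr;
};

/* Hypothetical full socket: lots of other state plus the embedded options. */
struct big_sock {
    char other_state[1500];
    struct received_options rx_opt;
};

/* The parser only needs the small struct, not a whole socket. */
void parse_options(struct received_options *opt, unsigned short advertised_mss)
{
    assert(opt != NULL);
    memset(opt, 0, sizeof(*opt));
    opt->mss_clamp = advertised_mss;
}

/* Connection-request path: the on-stack temporary is now a few bytes. */
unsigned short conn_request_mss(unsigned short advertised_mss)
{
    struct received_options tmp;

    parse_options(&tmp, advertised_mss);
    return tmp.mss_clamp;
}
```

An established socket would call `parse_options(&sk->rx_opt, ...)` on its embedded member, so both paths share the same parser while only the connection-request path pays for a small stack temporary.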
Comment on attachment 111290 [details]
an untested attempt to fix the first patch
marking my patch as obsolete because of a stupid typo
I'll try the new patch right after I hit "Submit" now.
Ok, with the patch from comment #28, I can't reproduce the problem anymore, even with ipv6 and 4KSTACKS enabled.

Excellent news. Thank you for the hard work reporting, narrowing down the problem, and testing my patches; it is great to work with people like you! Dave, when you wake up please see if everything is OK and please push this to Linus.

Sure thing, working on that now. Thanks everyone.

Linus has taken in the fix upstream, so it will be in 2.6.11-final whenever that comes out.

The fix should also be in 2.6.11-rc4-bk11 and 2.6.11-rc5. This also implies that it should be in 2.6.10-1.1151_FCx and later. I say "should," because I haven't had a chance to try running any of those kernels yet, to verify that they really don't crash with my testcase anymore.

If anyone (with sufficient privs) is less paranoid than me and wants to close this as RAWHIDE right this instant, that's OK with me. Otherwise, I guess I'll wait for 1151 or later to hit public rawhide, and then I'll test that before closing this bug.

I just tested 2.6.10-1.1155_FC4 and the bug is gone. Closing.