From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041224

Description of problem:
(This bug does not happen with any released FC3 kernels.) With rawhide kernels (1125 and up, although I didn't try anything between 1063 and 1125), I can remotely kill my home router. This is the same box as bug 144324, still running FC3, except with a rawhide kernel. I'm not absolutely certain that 1125 is as easily killable as 113x and 1143, but it might be. With 1125/1126, the box would oops whether I installed the _FC4 kernel from rawhide or if I recompiled on FC3 (with a spec file change to work around bug 147281). For 1143 I just tried the FC4 kernel, I haven't tried recompiling on FC3.

Version-Release number of selected component (if applicable):
kernel-2.6.10-1.1143_FC4

How reproducible:
Always

Steps to Reproduce:
1. Turn router computer on.
2. From a client computer (behind the router), spend a few minutes web browsing. (I don't know how much HTTP activity is needed. It seems to be a non-zero amount, but approx. 5 minutes of casual web browsing always does it.)
3. Try to ssh from the client computer into the router. (SSHing from a third computer, on the outside Internet rather than behind the router, also works. I don't know if that is as consistently reproducible though. It's consistent enough that it makes telecommuting effectively impossible, however.)

Actual Results:
Router goes into super-weird recursive oops. These are the weirdest oopses I've ever seen. Perhaps they won't be the weirdest you've ever seen, but they're still undeniably weird.

Expected Results:
Router should not crash, and ssh on client should get a chance to ask the user for the password. The router actually happens to die before the ssh client gets that far.

Additional info:
I will be attaching a few oopses to this bug.
I'll have to truncate the longer ones to the first several hundred KB; my serial console box has a failing Ethernet controller and no way of adding a new one so I have to keep the oopses down to floppy-disk size, and I have a hard time imagining that the oopses have any interesting information that far into them anyway. I'm not sure whether I should set the bug's severity to Security or High. This thing *is* effectively a remote DoS, so I set it to Security, but if you drop the severity to High then that's OK with me. Perhaps I'll try to dig into the problem this weekend and see if I can narrow things down. OTOH this is a production system so I may decide to move back to a released FC3 kernel, or I may even reinstall with RHEL 4...
Created attachment 111224 [details] one oops with kernel 1143_FC4 This is probably about as sane-looking as the oopses from this bug get.
Created attachment 111225 [details] another 1143_FC4 oops This oops is a bit more insane (ever seen an oops with this many "=" characters in it?). It seemed like it was going to go on forever, so this is only the first 500-600KB. On an unrelated note, I have no idea how this bug ended up being so wide. :(
Hitting "reload" once on this web page (i.e. this bug report) in Mozilla 1.7.5 seems to be sufficient network activity for step #2 for reproducing this bug. If I recompile for FC3 with patches 1020 through 10001 disabled, the problem still occurs.
Looks like a stack overflow in the networking layer to me. Davem? Ingo?
I actually don't need any HTTP activity before doing the SSH. I need to just wait 1-2 minutes or so.

If I reduce the kernel down to the following patches:
  1 2 212 450 (Removing Xen completely breaks the build for some reason) 500 511 512 513 1010
Then I get some message about a double-fault. I only tried this kernel once, so I don't know how reproducible it is. Unfortunately I didn't have my serial console up-and-running at that point. I still have the binaries from this kernel so I can run it again if you wish.

If I reduce it further, to these patches:
  1 2 450 500 1010
Then it's *always* an automatic reboot, rather than super-weird oopses. This is 100% reproducible.

CPU is a 1GHz Pentium III "Coppermine", if that matters. IIRC the BIOS is sufficiently recent to update it with the latest microcode.

Right now I'm going to see if I can figure out how to build without patch 450. At that point it would be the absolute minimum for a recent kernel (1143) being built via RPM with slab debugging.
1 + 2 + 450 + 500 + 1010 sometimes (if not always, I'm not sure) rapidly prints the first line of the double-fault message to the serial and VGA consoles before the spontaneous reboot happens. I managed to get 1 + 2 + 500 + 1010 to build. That no longer double-faults. Instead, I get tons of warning messages -- first a multi-screenful blast, and then a trickle afterward. The trickle varies in speed -- sometimes 2 or 3 lines a second, sometimes one line every 10-15 seconds or so (very rough estimate, just to give you an idea). Attachment coming up...
Created attachment 111237 [details] "Warning: kfree_skb on hard IRQ c0XXXXXX" logs This is from kernel 1143, recompiled on FC3 and reduced down to patches 1 + 2 + 500 + 1010.
Oops, I clicked on "Submit" instead of inside the text pane. :( What I meant to say (in addition) is that the machine froze up shortly after these messages. (There were more messages of the same form that I didn't get to capture on my serial console, as I had stopped the capture in order to copy it onto an Internet-connected computer.) BTW I know that 1143 isn't the latest anymore, but it's still pretty recent, and I don't want to change too many variables at once. Perhaps trying with 1146 (or whatever the latest is now) will be my next step.
If I compile only with patches 1 + 2 + 500, with both slab and pagealloc debug disabled, I still see the same behavior (i.e. the "Warning: kfree_skb on hard IRQ c0XXXXXX" messages -- but I haven't yet left it up long enough to see if it freezes too). All 4 NICs in this system are pcnet32, if that is relevant.
Created attachment 111239 [details] oops from 2.6.10-1.1148_FC4
I decided to try figuring out when this problem started. So far I've managed to reproduce this (the oopses) going back to 2.6.11-rc3. That doesn't mean it didn't start earlier, just that I haven't tried anything between 2.6.10-ac and 2.6.11-rc3 yet.
Update on my progress: 2.6.11-rc1 oopses. 2.6.10 doesn't. I'm now going to binary-search to see which 2.6.10-bk patch introduced the oopses...
2.6.10-bk13 is OK. 2.6.10-bk14 oopses. I'll see if I can narrow it down any further...
Ok, I've narrowed it to a single ChangeSet: 1.1938.474.1 [TCP] merge tcp_sock with tcp_opt http://linux.bkbits.net:8080/linux-2.5/cset@41d1a480O2yDOgBi0vPoTs0torY1Tw 2.6.10-bk13 by itself works. 2.6.10-bk13 + this one changeset oopses. BTW I need to add a step 4 to the steps-to-reproduce: 4. If the router does not crash after step 3, then log out of the SSH session and repeat step 3 one more time. It turns out that, without step 4, my procedure isn't fully 100% reproducible. With the new step 4 it seems to be.
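The narrowing-down described above (known-good 2.6.10, known-bad 2.6.11-rc1, then halving the range of -bk snapshots) is an ordinary binary search for the first bad build. A minimal sketch in C, assuming the snapshots can be numbered in order and that the bug stays present once introduced. The names `first_bad_snapshot`, `is_bad_fn`, and `threshold_oracle` are hypothetical; in practice the "oracle" is a person building each kernel and running the steps to reproduce.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: returns nonzero if the build made from snapshot
 * number `index` shows the bug. In real life this is a manual test. */
typedef int (*is_bad_fn)(size_t index, void *ctx);

/*
 * Find the first bad snapshot by bisection.
 * Preconditions (assumed, not checked beyond good < bad):
 *   - snapshot `good` does not show the bug,
 *   - snapshot `bad` does show the bug,
 *   - every snapshot after the first bad one is also bad.
 */
static size_t first_bad_snapshot(size_t good, size_t bad,
                                 is_bad_fn is_bad, void *ctx)
{
    assert(good < bad);
    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;
        if (is_bad(mid, ctx))
            bad = mid;   /* bug already present: look earlier */
        else
            good = mid;  /* still fine: look later */
    }
    return bad;
}

/* Illustrative oracle for experimenting with the search: every snapshot
 * at or after *ctx is treated as bad. */
static int threshold_oracle(size_t index, void *ctx)
{
    return index >= *(const size_t *)ctx;
}
```

Each test roughly halves the remaining range, so even a few dozen snapshots need only a handful of build-and-boot cycles. The same procedure applies to the individual changesets inside the first bad snapshot.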
2.6.10-bk13 + the changeset stops oopsing if I disable CONFIG_4KSTACKS. I know that's not a fix, but maybe this is useful information.
Thanks for tracking this down. The problem is tcp_v4_conn_request(), it puts a struct tcp_sock on the stack, and this structure is huge. It wasn't so bad when it was just tcp_opt (before Arnaldo's change), but now it's much larger. And as you determined, this overflows 4KSTACKS. I'm discussing possible fixes with Arnaldo.
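To illustrate why a large on-stack structure matters with a 4 KB stack, here is a user-space sketch in C. Everything in it is hypothetical: `struct demo_sock` and its 1536-byte size are stand-ins, not the real `struct tcp_sock` or its real size, and `calloc()` stands in for whatever allocator kernel code would use. It only contrasts the share of a small stack taken by one local instance with the pointer-sized cost of allocating the temporary elsewhere.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_STACK_BYTES 4096  /* stack budget assumed for illustration */
#define DEMO_SOCK_BYTES  1536  /* made-up size, NOT the real tcp_sock size */

/* Hypothetical stand-in for a large per-connection structure. */
struct demo_sock {
    unsigned char payload[DEMO_SOCK_BYTES];
};

/* Percentage of the stack budget that a single local variable of type
 * struct demo_sock would occupy (integer division, rounds down). */
static unsigned demo_stack_share_percent(void)
{
    return (unsigned)(sizeof(struct demo_sock) * 100 / DEMO_STACK_BYTES);
}

/* Heap-based variant: only a pointer lives in this function's frame.
 * Returns 0 on success, -1 if the allocation fails. */
static int demo_handle_request(unsigned char first_byte, unsigned char *out)
{
    struct demo_sock *tmp = calloc(1, sizeof(*tmp));

    if (tmp == NULL)
        return -1;
    tmp->payload[0] = first_byte;  /* use the temporary as scratch space */
    *out = tmp->payload[0];
    free(tmp);
    return 0;
}
```

With these made-up numbers a single local instance would take more than a third of the stack before any caller or interrupt frames are counted. The trade-off of moving it off the stack is an extra allocation and a failure path that must be handled.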
Damn, I just changed my password on this bugzilla to ask whether ipv6 is enabled. It probably is. If so, could you test with a kernel with ipv6 completely disabled? That is, it is not enough to not load ipv6.
Could you please try the attached patch? It's one of the possible solutions for this problem. It's completely untested and may be suboptimal, but I'd like to see test results if possible. It's too late here in Brazil and I'm off to bed; tomorrow I'll look at this again. Thanks.
Created attachment 111285 [details] allocate the tmp sock from TCP slab cache, not on the stack. The patch to test.
Yes, ipv6 is enabled in the kernels I've been testing. I'll test your patch now.
Created attachment 111287 [details] oops with 2.6.10-1.1148 + the patch If you want me to try against a mainline kernel, I will, but it appears that the patch didn't fully fix it. I am now recompiling without the patch and with ipv6 disabled. Then I will try again, with the patch but ipv6 still disabled.
Oh, by the way, it looks to me like tcp_timewait_state_process() and tcp_check_req() in net/ipv4/tcp_minisocks.c are also allocating tcp_socks on the stack.
I think I found the bug in the patch. I'll have a new patch shortly (it'll probably be a patch that applies on top of the first one).
Never mind, I didn't find the bug in the patch, I just got confused. Sorry about that.
With ipv6 disabled, I can't reproduce the bug anymore, even with 4KSTACKS enabled. (This is without the patch.)
OK, thanks for the tests so far. I'll now take a look at the other sites where tcp_socks are being allocated on the stack.
Created attachment 111290 [details] an untested attempt to fix the first patch This patch applies on top of the first patch. It's currently untested, not even compile-tested. Right now I'm compiling, and after that I'll run it.
Created attachment 111291 [details] move things related to options received from tcp_sock to tcp_received_options and make tcp_sock use it Could you please test this fix? It touches a lot of places, but seems to be the right thing as per what I discussed with David.
Comment on attachment 111290 [details] an untested attempt to fix the first patch Marking my patch as obsolete because of a stupid typo. I'll try the new patch right after I hit "Submit" now.
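The idea in this patch's title, as far as it can be read from the description alone, is to pull the received-options fields out into their own small structure so that code needing only those fields can keep a small temporary on the stack. A rough user-space sketch in C follows. All names and fields here (`demo_received_options`, `demo_big_sock`, `mss_clamp`, and so on) are hypothetical and are not taken from the actual patch.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical small structure holding only the received options. */
struct demo_received_options {
    unsigned short mss_clamp;
    unsigned char  wscale_ok;
    unsigned char  sack_ok;
};

/* Hypothetical large socket structure that embeds the small one. */
struct demo_big_sock {
    struct demo_received_options rx_opt;
    unsigned char other_state[1024];  /* everything else, size made up */
};

/* The parser needs only the options, so it takes the small type. It
 * works equally on a stack temporary or on &sock->rx_opt. */
static void demo_parse_options(struct demo_received_options *opt,
                               unsigned short mss)
{
    memset(opt, 0, sizeof(*opt));
    opt->mss_clamp = mss;
}

/* Request path: the temporary is a few bytes, not the whole socket. */
static unsigned short demo_conn_request(unsigned short mss)
{
    struct demo_received_options tmp;

    demo_parse_options(&tmp, mss);
    return tmp.mss_clamp;
}
```

Compared with allocating a full temporary socket elsewhere, this keeps the request path free of an extra allocation, at the cost of touching every place that used the moved fields.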
Ok, with the patch from comment #28, I can't reproduce the problem anymore, even with ipv6 and 4KSTACKS enabled.
Excellent news. Thank you for the hard work reporting, narrowing down the problem, and testing my patches. It is great to work with people like you!
Dave, when you wake up please see if everything is OK and please push this to Linus.
Sure thing, working on that now. Thanks everyone.
Linus has taken in the fix upstream, so it will be in 2.6.11-final whenever that comes out.
The fix should also be in 2.6.11-rc4-bk11 and 2.6.11-rc5. This also implies that it should be in 2.6.10-1.1151_FCx and later. I say "should," because I haven't had a chance to try running any of those kernels yet, to verify that they really don't crash with my testcase anymore. If anyone (with sufficient privs) is less paranoid than me and wants to close this as RAWHIDE right this instant, that's OK with me. Otherwise, I guess I'll wait for 1151 or later to hit public rawhide, and then I'll test that before closing this bug.
I just tested 2.6.10-1.1155_FC4 and the bug is gone. Closing.