149143 – I can remotely recursive-oops the computer at will, 100% reproducible

Summary: I can remotely recursive-oops the computer at will, 100% reproducible

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	rawhide
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	David Miller
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2005-02-19 09:16 UTC by Barry K. Nathan
Modified:	2007-11-30 22:11 UTC
CC List:	4 users
Fixed In Version:	2.6.10-1.1155_FC4
Clone Of:
Environment:
Last Closed:	2005-02-28 03:51:38 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
one oops with kernel 1143_FC4 (7.15 KB, text/plain) 2005-02-19 09:21 UTC, Barry K. Nathan
another 1143_FC4 oops (580.09 KB, text/plain) 2005-02-19 09:31 UTC, Barry K. Nathan
"Warning: kfree_skb on hard IRQ c0XXXXXX" logs (16.25 KB, text/plain) 2005-02-20 07:26 UTC, Barry K. Nathan
oops from 2.6.10-1.1148_FC4 (13.90 KB, text/plain) 2005-02-20 16:21 UTC, Barry K. Nathan
allocate the tmp sock from TCP slab cache, not on the stack. (3.92 KB, patch) 2005-02-22 07:18 UTC, acme
oops with 2.6.10-1.1148 + the patch (2.34 KB, text/plain) 2005-02-22 09:55 UTC, Barry K. Nathan
an untested attempt to fix the first patch (918 bytes, patch) 2005-02-22 12:29 UTC, Barry K. Nathan
move things related to options received from tcp_sock to tcp_received_options and make tcp_sock use it (51.57 KB, patch) 2005-02-22 12:35 UTC, acme

Description Barry K. Nathan 2005-02-19 09:16:26 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041224

Description of problem:
(This bug does not happen with any released FC3 kernels.)

With rawhide kernels (1125 and up, although I didn't try anything between 1063 and 1125), I can remotely kill my home router. This is the same box as bug 144324, still running FC3, except with a rawhide kernel.

I'm not absolutely certain that 1125 is as easily killable as 113x and 1143, but it might be. With 1125/1126, the box would oops whether I installed the _FC4 kernel from rawhide or recompiled on FC3 (with a spec file change to work around bug 147281). For 1143 I have only tried the FC4 kernel; I haven't tried recompiling on FC3.

Version-Release number of selected component (if applicable):
kernel-2.6.10-1.1143_FC4

How reproducible:
Always

Steps to Reproduce:
1. Turn router computer on.
2. From a client computer (behind the router), spend a few minutes web browsing. (I don't know how much HTTP activity is needed. It seems to be a non-zero amount, but approx. 5 minutes of casual web browsing always does it.)
3. Try to ssh from the client computer into the router. (SSHing from a third computer, on the outside Internet rather than behind the router, also works. I don't know if that is as consistently reproducible though. It's consistent enough that it makes telecommuting effectively impossible, however.)

Actual Results: Router goes into super-weird recursive oops. These are the weirdest oopses I've ever seen. Perhaps they won't be the weirdest you've ever seen, but they're still undeniably weird.

Expected Results: Router should not crash, and ssh on client should get a chance to ask the user for the password. The router actually happens to die before the ssh client gets that far.

Additional info:

I will be attaching a few oopses to this bug. I'll have to truncate the longer ones to the first several hundred KB; my serial console box has a failing Ethernet controller and no way of adding a new one so I have to keep the oopses down to floppy-disk size, and I have a hard time imagining that the oopses have any interesting information that far into them anyway.

I'm not sure whether I should set the bug's severity to Security or High. This thing *is* effectively a remote DoS, so I set it to Security, but if you drop the severity to High then that's OK with me.

Perhaps I'll try to dig into the problem this weekend and see if I can narrow things down. OTOH this is a production system so I may decide to move back to a released FC3 kernel, or I may even reinstall with RHEL 4...

Comment 1 Barry K. Nathan 2005-02-19 09:21:11 UTC

Created attachment 111224
one oops with kernel 1143_FC4

This is probably about as sane-looking as the oopses from this bug get.

Comment 2 Barry K. Nathan 2005-02-19 09:31:34 UTC

Created attachment 111225
another 1143_FC4 oops

This oops is a bit more insane (ever seen an oops with this many "=" characters
in it?). It seemed like it was going to go on forever, so this is only the
first 500-600KB.

On an unrelated note, I have no idea how this bug ended up being so wide. :(

Comment 3 Barry K. Nathan 2005-02-19 17:00:46 UTC

Hitting "reload" once on this web page (i.e. this bug report) in Mozilla 1.7.5
seems to be sufficient network activity for step #2 for reproducing this bug.

If I recompile for FC3 with patches 1020 through 10001 disabled, the problem
still occurs.

Comment 4 Dave Jones 2005-02-19 18:31:26 UTC

Looks like a stack overflow in the networking layer to me. Davem? Ingo?

Comment 5 Barry K. Nathan 2005-02-20 04:03:50 UTC

I actually don't need any HTTP activity before doing the SSH. I need to just
wait 1-2 minutes or so.

If I reduce the kernel down to the following patches:
1
2
212
450 (Removing Xen completely breaks the build for some reason)
500
511
512
513
1010

Then I get some message about a double-fault. I only tried this kernel once, so
I don't know how reproducible it is. Unfortunately I didn't have my serial
console up-and-running at that point. I still have the binaries from this kernel
so I can run it again if you wish.

If I reduce it further, to these patches:

1
2
450
500
1010

Then it's *always* an automatic reboot, rather than super-weird oopses. This is
100% reproducible.

CPU is a 1GHz Pentium III "Coppermine", if that matters. IIRC the BIOS is
sufficiently recent to update it with the latest microcode. Right now I'm going
to see if I can figure out how to build without patch 450. At that point it
would be the absolute minimum for a recent kernel (1143) being built via RPM
with slab debugging.

Comment 6 Barry K. Nathan 2005-02-20 07:21:42 UTC

1 + 2 + 450 + 500 + 1010 sometimes (if not always, I'm not sure) rapidly prints
the first line of the double-fault message to the serial and VGA consoles before
the spontaneous reboot happens.

I managed to get 1 + 2 + 500 + 1010 to build. That no longer double-faults.
Instead, I get tons of warning messages -- first a multi-screenful blast, and
then a trickle afterward. The trickle varies in speed -- sometimes 2 or 3 lines
a second, sometimes one line every 10-15 seconds or so (very rough estimate,
just to give you an idea).

Attachment coming up...

Comment 7 Barry K. Nathan 2005-02-20 07:26:12 UTC

Created attachment 111237
"Warning: kfree_skb on hard IRQ c0XXXXXX" logs

This is from kernel 1143, recompiled on FC3 and reduced down to patches 1 + 2 +
500 + 1010.

Comment 8 Barry K. Nathan 2005-02-20 07:42:20 UTC

Oops, I clicked on "Submit" instead of inside the text pane. :( What I meant to
say (in addition) is that the machine froze up shortly after these messages.
(There were more messages of the same form that I didn't get to capture on my
serial console, as I had stopped the capture in order to copy it onto an
Internet-connected computer.)

BTW I know that 1143 isn't the latest anymore, but it's still pretty recent, and
I don't want to change too many variables at once. Perhaps trying with 1146 (or
whatever the latest is now) will be my next step.

Comment 9 Barry K. Nathan 2005-02-20 10:02:46 UTC

If I compile only with patches 1 + 2 + 500, with both slab and pagealloc debug
disabled, I still see the same behavior (i.e. the "Warning: kfree_skb on hard
IRQ c0XXXXXX" messages -- but I haven't yet left it up long enough to see if it
freezes too).

All 4 NICs in this system are pcnet32, if that is relevant.

Comment 10 Barry K. Nathan 2005-02-20 16:21:49 UTC

Created attachment 111239
oops from 2.6.10-1.1148_FC4

Comment 11 Barry K. Nathan 2005-02-21 08:07:46 UTC

I decided to try figuring out when this problem started. So far I've managed to
reproduce this (the oopses) going back to 2.6.11-rc3. That doesn't mean it
didn't start earlier, just that I haven't tried anything between 2.6.10-ac and
2.6.11-rc3 yet.

Comment 12 Barry K. Nathan 2005-02-21 14:42:10 UTC

Update on my progress:

2.6.11-rc1 oopses. 2.6.10 doesn't.

I'm now going to binary-search to see which 2.6.10-bk patch introduced the oopses...

Comment 13 Barry K. Nathan 2005-02-22 04:02:42 UTC

2.6.10-bk13 is OK. 2.6.10-bk14 oopses.

I'll see if I can narrow it down any further...
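
The binary search over snapshots described here can be sketched roughly as follows. This is only an illustration, not the procedure actually used: the snapshot numbering (0 standing for plain 2.6.10, 15 as an upper bound known to oops), the callback type, and the function names are all assumptions made for the example, and the real "test" is booting each kernel and trying to ssh into the router.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical callback: returns true if the given snapshot oopses. */
typedef bool (*is_bad_fn)(int snapshot);

/*
 * Return the first bad snapshot in the range (good, bad].
 * Preconditions: `good` is known to work, `bad` is known to oops, and
 * every snapshot after the first bad one is also bad.
 */
int first_bad_snapshot(int good, int bad, is_bad_fn is_bad)
{
    assert(good < bad);
    while (bad - good > 1) {
        int mid = good + (bad - good) / 2;
        if (is_bad(mid))
            bad = mid;   /* regression is at mid or earlier */
        else
            good = mid;  /* regression is after mid */
    }
    return bad;
}

/*
 * Stand-in for "boot the snapshot and run the steps to reproduce".
 * It simply hard-codes the result reported in this comment
 * (bk13 is OK, bk14 oopses).
 */
bool example_is_bad(int snapshot)
{
    return snapshot >= 14;
}
```

Calling first_bad_snapshot(0, 15, example_is_bad) tests snapshots 7, 11, 13 and 14 in that order, so four boots are enough instead of fourteen.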

Comment 14 Barry K. Nathan 2005-02-22 04:50:31 UTC

Ok, I've narrowed it to a single ChangeSet:
1.1938.474.1 	[TCP] merge tcp_sock with tcp_opt
http://linux.bkbits.net:8080/linux-2.5/cset@41d1a480O2yDOgBi0vPoTs0torY1Tw

2.6.10-bk13 by itself works. 2.6.10-bk13 + this one changeset oopses.


BTW I need to add a step 4 to the steps-to-reproduce:
4. If the router does not crash after step 3, then log out of the SSH session
and repeat step 3 one more time.

It turns out that, without step 4, my procedure isn't fully 100% reproducible.
With the new step 4 it seems to be.

Comment 15 Barry K. Nathan 2005-02-22 06:37:35 UTC

2.6.10-bk13 + the changeset stops oopsing if I disable CONFIG_4KSTACKS. I know
that's not a fix, but maybe this is useful information.

Comment 16 David Miller 2005-02-22 06:42:21 UTC

Thanks for tracking this down.

The problem is tcp_v4_conn_request(), it puts a struct tcp_sock
on the stack, and this structure is huge.  It wasn't so bad when
it was just tcp_opt (before Arnaldo's change), but now it's much
larger.  And as you determined, this overflows 4KSTACKS.

I'm discussing possible fixes with Arnaldo.
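
To make the stack problem concrete, here is a minimal userspace sketch. It does not reproduce the kernel code: the struct names, their fields, and the 1500-byte filler are invented for illustration, and calloc()/free() merely stand in for a kernel allocator. Only the 4096-byte figure reflects the CONFIG_4KSTACKS stack size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Kernel stack size with CONFIG_4KSTACKS; part of it is not usable. */
#define KSTACK_SIZE 4096

/* Hypothetical stand-in for the small option state (the old tcp_opt role). */
struct small_opts {
    unsigned int saw_tstamp;
    unsigned int mss_clamp;
};

/* Hypothetical stand-in for the merged, much larger socket structure. */
struct big_sock {
    struct small_opts opts;
    char other_state[1500];   /* made-up size, not the real one */
};

/* Risky pattern: the whole big structure lives in this stack frame. */
unsigned int parse_on_stack(unsigned int mss)
{
    struct big_sock tmp;
    memset(&tmp, 0, sizeof(tmp));
    tmp.opts.mss_clamp = mss;
    return tmp.opts.mss_clamp;
}

/* Alternative: take the temporary from the heap, keeping the frame small. */
int parse_on_heap(unsigned int mss, unsigned int *out)
{
    struct big_sock *tmp = calloc(1, sizeof(*tmp));
    if (tmp == NULL)
        return -1;            /* allocation can fail; the caller must cope */
    tmp->opts.mss_clamp = mss;
    *out = tmp->opts.mss_clamp;
    free(tmp);
    return 0;
}

/* Bytes of a KSTACK_SIZE stack left after one frame of the given size. */
size_t stack_left_after(size_t frame_bytes)
{
    return frame_bytes >= KSTACK_SIZE ? 0 : KSTACK_SIZE - frame_bytes;
}
```

In this sketch a single big_sock temporary uses up more than a third of a 4 KB stack before any other frame in the call chain is counted, which is the kind of pressure described above.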

Comment 17 acme 2005-02-22 06:45:45 UTC

Damn, I just changed my password on this bugzilla to ask whether ipv6 is
enabled. It probably is. Could you test with a kernel that has ipv6 completely
disabled? Just not loading ipv6 is not enough.

Comment 18 acme 2005-02-22 07:17:00 UTC

Could you please try the attached patch? It's one of the possible
solutions for this problem, but it's completely untested and may be suboptimal.
I'd like to see test results if possible. It's too late here in Brazil and
I'm off to bed; tomorrow I'll look at this again. Thanks.

Comment 19 acme 2005-02-22 07:18:25 UTC

Created attachment 111285
allocate the tmp sock from TCP slab cache, not on the stack.

The patch to test.

Comment 20 Barry K. Nathan 2005-02-22 08:30:20 UTC

Yes, ipv6 is enabled in the kernels I've been testing.

I'll test your patch now.

Comment 21 Barry K. Nathan 2005-02-22 09:55:31 UTC

Created attachment 111287
oops with 2.6.10-1.1148 + the patch

If you want me to try against a mainline kernel, I will, but it appears that
the patch didn't fully fix it.

I am now recompiling without the patch and with ipv6 disabled. Then I will try
again, with the patch but ipv6 still disabled.

Comment 22 Barry K. Nathan 2005-02-22 10:07:39 UTC

Oh, by the way, it looks to me like tcp_timewait_state_process() and
tcp_check_req() in net/ipv4/tcp_minisocks.c are also allocating tcp_socks on the
stack.

Comment 23 Barry K. Nathan 2005-02-22 10:14:00 UTC

I think I found the bug in the patch. I'll have a new patch shortly (it'll
probably be a patch that applies on top of the first one).

Comment 24 Barry K. Nathan 2005-02-22 10:23:07 UTC

Never mind, I didn't find the bug in the patch, I just got confused. Sorry about
that.

Comment 25 Barry K. Nathan 2005-02-22 10:34:42 UTC

With ipv6 disabled, I can't reproduce the bug anymore, even with 4KSTACKS
enabled. (This is without the patch.)

Comment 26 acme 2005-02-22 11:15:08 UTC

OK, thanks for the tests so far. I'll take a look at the other sites where
tcp_socks are being allocated on the stack now.

Comment 27 Barry K. Nathan 2005-02-22 12:29:35 UTC

Created attachment 111290
an untested attempt to fix the first patch

This patch applies on top of the first patch. It's currently untested, not even
compile-tested. Right now I'm compiling, and after that I'll run it.

Comment 28 acme 2005-02-22 12:35:49 UTC

Created attachment 111291
move things related to options received from tcp_sock to tcp_received_options and make tcp_sock use it

Could you please test this fix? It touches a lot of places, but it seems to be
the right thing as per what I discussed with David.
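
The idea named in the attachment title, splitting the received-option state into its own small structure that the socket embeds, might look roughly like the sketch below. Everything in it is an assumption made for illustration: the field names, the 1500-byte filler, and the function names are invented and are not taken from the patch itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical small structure holding only the received options. */
struct rx_opts_sketch {
    unsigned int saw_tstamp;
    unsigned int mss_clamp;
    unsigned int wscale_ok;
};

/* Hypothetical full socket: large, but it embeds the small structure. */
struct sock_sketch {
    char other_state[1500];          /* made-up size */
    struct rx_opts_sketch rx_opt;
};

/* The option parser only ever needs the small structure. */
void parse_options_sketch(struct rx_opts_sketch *opt, unsigned int mss)
{
    memset(opt, 0, sizeof(*opt));
    opt->mss_clamp = mss;
}

/* Connection-request path: the stack temporary is now only the options. */
unsigned int conn_request_sketch(unsigned int mss)
{
    struct rx_opts_sketch tmp;
    parse_options_sketch(&tmp, mss);
    return tmp.mss_clamp;
}

/* An established socket reuses the same parser through its embedded member. */
void sock_update_sketch(struct sock_sketch *sk, unsigned int mss)
{
    parse_options_sketch(&sk->rx_opt, mss);
}
```

With this layout the paths that only need a temporary for option parsing no longer pay for the whole socket on the stack, and no allocation (with its failure path) is needed either.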

Comment 29 Barry K. Nathan 2005-02-22 12:44:37 UTC

Comment on attachment 111290
an untested attempt to fix the first patch

marking my patch as obsolete because of a stupid typo

I'll try the new patch right after I hit "Submit" now.

Comment 30 Barry K. Nathan 2005-02-22 13:30:55 UTC

Ok, with the patch from comment #28, I can't reproduce the problem anymore, even
with ipv6 and 4KSTACKS enabled.

Comment 31 acme 2005-02-22 13:42:11 UTC

Excellent news. Thank you for the hard work reporting, narrowing down the
problem, and testing my patches. It is great to work with people like you!

Comment 32 acme 2005-02-22 17:12:28 UTC

Dave, when you wake up please see if everything is OK and please push this 
to Linus.

Comment 33 David Miller 2005-02-22 18:22:29 UTC

Sure thing, working on that now.

Thanks everyone.

Comment 34 David Miller 2005-02-23 03:37:01 UTC

Linus has taken in the fix upstream, so it will be in
2.6.11-final whenever that comes out.

Comment 35 Barry K. Nathan 2005-02-24 17:58:21 UTC

The fix should also be in 2.6.11-rc4-bk11 and 2.6.11-rc5. This also implies that
it should be in 2.6.10-1.1151_FCx and later.

I say "should," because I haven't had a chance to try running any of those
kernels yet, to verify that they really don't crash with my testcase anymore.

If anyone (with sufficient privs) is less paranoid than me and wants to close
this as RAWHIDE right this instant, that's OK with me. Otherwise, I guess I'll
wait for 1151 or later to hit public rawhide, and then I'll test that before
closing this bug.

Comment 36 Barry K. Nathan 2005-02-28 03:51:38 UTC

I just tested 2.6.10-1.1155_FC4 and the bug is gone. Closing.
