Bug 144324

Summary: networking-related (?) oopses on 2.6.10 Fedora kernels, more than once a day
Product: [Fedora] Fedora Reporter: Barry K. Nathan <barryn>
Component: kernel Assignee: Dave Jones <davej>
Status: CLOSED ERRATA QA Contact: Brian Brock <bbrock>
Severity: high Docs Contact:
Priority: medium    
Version: 3 CC: barryn, davem, nphilipp, pfrields, wtogami
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version: 2.6.10-1.741_FC3 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2005-01-15 14:12:47 UTC Type: ---
Attachments:
Description | Flags
kernel messages and oops from 2.6.10-1.727_FC3 | none
post-oops Alt-SysRQ-M output followed by Alt-SysRQ-T output, for 2.6.10-1.727_FC3 | none
another oops from 727_FC3 | none
sysrq-m and sysrq-t from the second oops | none

Description Barry K. Nathan 2005-01-05 22:47:26 UTC
Description of problem:
(This is probably going to be the most useless bug report you read all
day, initially, but I'll fill it in with details later -- hopefully
later today.)

If I run either of the kernel versions below, then the box eventually
falls over and oopses. These oopses invariably mention "ip" stuff.
However, I have been unable to capture a full oops to this point;
that's something for another Bugzilla when I get a chance, but I
expect to be able to work around it.

(See "additional info" below for some more information, although you
probably won't have enough to go on until further comments/attachments.)

Version-Release number of selected component (if applicable):
kernel-2.6.10-1.727_FC3
kernel-2.6.10-1.1063_FC4

How reproducible:
Almost always (much closer to 100% of the time than 50% of the time).
If I followed the "steps to reproduce" exactly as written here, it may
even be 100% reproducible.

Steps to Reproduce:
1. Boot computer into new kernel (with /etc/inittab configured for
runlevel 3).
2. Wait several hours (optionally, go to bed and have a nap or even a
full night's sleep).
3. Tap a key on the keyboard to wake up the screensaver (this still
works post-oops).
  
Actual results:
oops fills the screen, to the point that the top of the oops gets cut off

Expected results:
Text login prompt

Additional info:
This box is sort of the home router from hell (or perhaps from heaven,
when it works). It typically has masquerading, bridging, and QoS stuff
going. It also has four network cards (3 Ethernet and one HomePNA).

(I could get by well enough for the moment without the QoS, and maybe
without the bridging, but I'm not quite at that point in hunting this
bug yet.)

Oh, yeah, and it's running loop root -- the root filesystem is XFS,
inside a ~4GB file on a FAT32 filesystem. (Yes, FC3 actually runs well
this way.)

This box has been a little quirky (weird network log messages) since
2.6.6 or so. Maybe I'll elaborate on this later, if need be. With
2.6.10, all of the quirks suddenly disappeared; if the box would only
stop oopsing, it would finally be stable under 2.6.x.

Kernel 2.6.9-1.724_FC3 doesn't oops for me (at least, not so far) so I
can use that in the meantime. However, I will try as hard as I can to
get full oopses from 2.6.10. I have not tried mainline 2.6.10 at all,
and 2.6.10-ac2 only briefly, so I don't yet know if the problem is in
the upstream kernel.

Comment 1 Barry K. Nathan 2005-01-05 23:10:27 UTC
Arrrrrrgh... why didn't I think of using a serial console!? [Probably
because I've worn myself out looking at swsusp code.]

Now I just need to set one up (later today), then I should be able to
get full oopses...

Comment 2 Barry K. Nathan 2005-01-06 08:16:32 UTC
Make that "swsusp and ACPI code" in my last comment ;)

Anyway, a couple of other things I forgot to mention earlier:

2.6.9-1.715_FC3 ran for almost 16 days without oopsing.

Earlier today I ran 2.6.9-1.724_FC3 for several hours. No oops, so
far. However, I've rebooted it back into 2.6.10-1.727_FC3 now that I
have a serial console set up to capture the oops.

Looking back through my logs, it looks like the time to oops varies a
bit; sometimes it oopses after half an hour, sometimes it oopses after
maybe 10-11 hours. Not once has the box managed to survive 12 hours
without oopsing. It seems to me (but I can't really prove this so far)
that the oops happens faster if there is more filesystem activity on
the box. (But maybe that's just because there's also been more network
activity whenever there's been more filesystem activity. So, who knows...)

I'll attach an oops once I capture one.

Comment 3 Barry K. Nathan 2005-01-07 10:40:15 UTC
Created attachment 109459 [details]
kernel messages and oops from 2.6.10-1.727_FC3

This is (for the most part -- see my next attachment too) what the serial
console captured from 2.6.10-1.727_FC3. This log contains most of the kernel's
boot messages, followed by the oops. (Next time I think I will get the messages
with "dmesg" and capture the oops separately -- that would provide slightly
more complete output.)
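
[Editorial sketch, not part of the original comment.] The idea of keeping the boot messages and the oops separate could also be applied after the fact to a single serial capture. The snippet below is a minimal sketch under assumptions: the function name `split_console_log` is hypothetical, and the marker strings ("Unable to handle kernel" and "Oops:") are the usual first lines of an i386 oops on 2.6 kernels but may need adjusting for other architectures or versions. The sample text is invented for illustration and is not taken from the attachments.

```python
import re

# Assumed oops markers; adjust if your kernel prints something different.
OOPS_START = re.compile(r"(Unable to handle kernel|Oops: [0-9a-fA-F]+)")


def split_console_log(log_text):
    """Split a captured console log into (boot_messages, oops_text).

    Everything before the first line that looks like the start of an
    oops is treated as boot messages; the rest is treated as the oops.
    If no marker is found, the whole log is returned as boot messages.
    """
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if OOPS_START.search(line):
            return "\n".join(lines[:i]), "\n".join(lines[i:])
    return log_text, ""


# Invented placeholder log, only to show the call.
sample_log = (
    "boot message one (placeholder)\n"
    "boot message two (placeholder)\n"
    "Unable to handle kernel paging request (placeholder)\n"
    "Oops: 0000 [#1]\n"
    "trace line (placeholder)\n"
)
boot_part, oops_part = split_console_log(sample_log)
```

The two parts can then be saved as separate files, so the oops can be attached on its own.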

Note that this particular oops is not the only one I've seen -- that is, other
oopses have mentioned different functions, but they've still been
network-related. But this is the first oops since I set up the serial console.
(And FWIW it would have happened earlier today had I not been rebooting
continuously into new kernels in order to figure out what caused the infamous
ACPI shutdown bug.)

Comment 4 Barry K. Nathan 2005-01-07 10:43:21 UTC
Created attachment 109460 [details]
post-oops Alt-SysRQ-M output followed by Alt-SysRQ-T output, for 2.6.10-1.727_FC3

Note that this output may not be trustworthy -- I started holding down the
power button, then after a second or two I thought "hey, this is stupid, I
should see if I can get more info with SysRQ instead."

Comment 5 Barry K. Nathan 2005-01-07 10:46:15 UTC
Now I've booted into 2.6.10-ac2, to see if that oopses.

Comment 6 Nils Philippsen 2005-01-07 13:46:58 UTC
For the record, this seems to happen here as well:

- Athlon 1400, Abit KT7A mobo, 2 (yeah, junk) RTL 8139 NICs
- oopses with ipt_* symbols (don't know exactly and I don't have serial console
due to broken null modem cable)
- I use masquerading and used to use QOS but I don't know whether this (i.e. its
configuration) survived a severe root FS crash I had at the beginning of the
week (not with the 2.6.10 kernels in question).

Comment 7 Barry K. Nathan 2005-01-07 15:04:51 UTC
BTW, in my case I think the link may have been down on eth3 (which is bridged
with eth1) for the entire uptime of the computer, before each oops. However,
right now the link is definitely up; I don't know if that's going to affect the
likelihood of oopsing. (If ac2 does not oops, I guess I'll try 1.727_FC3 again,
but making sure that eth3's link is up, and I'll see if that oopses.)

Comment 8 Barry K. Nathan 2005-01-08 09:39:45 UTC
On 2.6.10-ac2, in the period of time that I would have expected the
kernel to oops, I got this instead (and the system kept running):

NETDEV WATCHDOG: eth3: transmit timed out
eth3: transmit timed out, status 0073, resetting.

Link was down on eth3 at the time that this happened.


I've now rebooted into 2.6.10-1.727_FC3, with the link up on eth3 this
time. I'm going to keep the link up overnight and see if it oopses by
early afternoon.

Comment 9 Barry K. Nathan 2005-01-08 10:12:23 UTC
Created attachment 109509 [details]
another oops from 727_FC3

This time the link was up on eth3. The oops also happened after less than an
hour of uptime.

Comment 10 Barry K. Nathan 2005-01-08 10:13:19 UTC
Created attachment 109510 [details]
sysrq-m and sysrq-t from the second oops

This time I made sure not to hit the power button first. :)

Comment 11 Barry K. Nathan 2005-01-08 10:34:30 UTC
I just now noticed that CONFIG_DEBUG_PAGEALLOC is enabled in the
1.727_FC3 kernel but disabled in the ac2 and 1.724_FC3 kernels.

My next step will probably be to recompile some kernel with
CONFIG_DEBUG_PAGEALLOC enabled (probably 1.724_FC3) and see if that
makes the bug appear in that kernel.
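
[Editorial sketch, not part of the original comment.] Comparing a debug option across several kernels can be done by reading each kernel's build config, which Fedora installs as /boot/config-<version>. The snippet below is a minimal sketch under assumptions: the function name `config_option_state` is hypothetical, and the two sample fragments are invented to show the two line formats a config file uses; they are not copied from the 727 or 724 kernels.

```python
import re


def config_option_state(config_text, option):
    """Report how a kernel config option is set.

    Returns the value after '=' (for example 'y' or 'm') if the option
    is set, 'n' if the file contains '# <option> is not set', and None
    if the option is not mentioned at all.
    """
    set_re = re.compile(r"^%s=(.*)$" % re.escape(option))
    unset_line = "# %s is not set" % option
    for line in config_text.splitlines():
        line = line.strip()
        match = set_re.match(line)
        if match:
            return match.group(1)
        if line == unset_line:
            return "n"
    return None


# Invented fragments illustrating the two forms a config line can take.
sample_enabled = "CONFIG_EXAMPLE_OPTION=m\nCONFIG_DEBUG_PAGEALLOC=y\n"
sample_disabled = "CONFIG_EXAMPLE_OPTION=m\n# CONFIG_DEBUG_PAGEALLOC is not set\n"

enabled_state = config_option_state(sample_enabled, "CONFIG_DEBUG_PAGEALLOC")
disabled_state = config_option_state(sample_disabled, "CONFIG_DEBUG_PAGEALLOC")
```

On a real system, `config_text` would be the contents of the relevant /boot/config-<version> file, read with `open(path).read()`.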

Comment 12 Barry K. Nathan 2005-01-09 12:04:58 UTC
After nearly 24 hours, 724_FC3 + CONFIG_DEBUG_PAGEALLOC still had not
oopsed. I'm now running 727_FC3 - CONFIG_DEBUG_PAGEALLOC; we'll see
what happens with that.

Comment 13 Barry K. Nathan 2005-01-10 23:06:38 UTC
2.6.10-1.727_FC3 without CONFIG_DEBUG_PAGEALLOC also appears not to oops,
after well over a day. When I get a chance, I'll try 2.6.10-1.736_FC3
and see whether its debugging code shows anything. (That will not be
for at least several hours however.)

Comment 14 Barry K. Nathan 2005-01-14 18:16:08 UTC
2.6.10-1.736_FC3 didn't oops after a couple days of continuous uptime.

I'm now running 2.6.10-1.741_FC3.barryn -- this is a recompile of
1.741_FC3 with pagealloc debug re-enabled. If it can stay up for a day
or so without oopsing, I'll close this bug. Otherwise, I'll attach the
new oops and figure out how to proceed.

Comment 15 Dave Jones 2005-01-14 20:44:15 UTC
be sure to disable the periodic slab checker patch if you enable
pagealloc debugging. the two aren't compatible.


Comment 16 Barry K. Nathan 2005-01-14 22:40:14 UTC
Right, I'm aware of that (I saw it in one of the changelogs).

BTW, the first time I booted into 741 + pagealloc debug, brctl oopsed
(near the end of boot, when it was executing stuff in
/etc/rc.d/rc.local) and that oops took the whole system down. I
captured the oops but I haven't been able to reproduce it. Should I
bother filing a bug and submitting the oops anyway?

(Maybe this weekend I'll try setting up a reboot loop, to see if that
reproduces it, but for now I want to keep it up to see if *this* oops
happens again.)

Comment 17 Dave Jones 2005-01-15 04:29:41 UTC
given the incompatibility has a tendency to throw some really weird
oopses, I'd not bother unless you can reproduce it on a regular FC kernel.

Comment 18 Barry K. Nathan 2005-01-15 14:12:47 UTC
Oops, once again I didn't make myself clear (-ENOSLEEP this time).

The periodic slab debug is disabled, and pagealloc debug is enabled.
That shouldn't have the incompatibility, should it?

Anyway, 741_FC3 + pagealloc debug *should* have oopsed by now if my
original bug was still present -- but it didn't. So it seems that
whatever caused it has been fixed.