Red Hat Bugzilla – Bug 144324
networking-related (?) oopses on 2.6.10 Fedora kernels, more than once a day
Last modified: 2015-01-04 17:14:51 EST
Description of problem:
(This is probably going to be the most useless bug report you read all
day, initially, but I'll fill it in with details later -- hopefully
If I run either of the kernel versions below, then the box eventually
falls over and oopses. These oopses invariably mention "ip" stuff.
However, I have been unable to capture a full oops to this point;
that's something for another Bugzilla when I get a chance, but I
expect to be able to work around it.
(See "additional info" below for some more information, although you
probably won't have enough to go on until further comments/attachments.)
Version-Release number of selected component (if applicable):
How reproducible:
Almost always (much closer to 100% of the time than 50% of the time).
If I followed the "steps to reproduce" exactly as written here, it may
even be 100% reproducible.
Steps to Reproduce:
1. Boot computer into new kernel (with /etc/inittab configured for
2. Wait several hours (optionally, go to bed and have a nap or even a
full night's sleep).
3. Tap a key on the keyboard to wake up the screensaver (this still
Actual results:
oops fills the screen, to the point that the top of the oops gets cut off
Expected results:
Text login prompt
Additional info:
This box is sort of the home router from hell (or perhaps from heaven,
when it works). It typically has masquerading, bridging, and QoS stuff
going. It also has four network cards (3 Ethernet and one HomePNA).
(I could get by well enough for the moment without the QoS, and maybe
without the bridging, but I'm not quite at that point in hunting this
Oh, yeah, and it's running loop root -- the root filesystem is XFS,
inside a ~4GB file on a FAT32 filesystem. (Yes, FC3 actually runs well
This box has been a little quirky (weird network log messages) since
2.6.6 or so. Maybe I'll elaborate on this later, if need be. With
2.6.10, all of the quirks suddenly disappeared; if the box would only
stop oopsing, it would finally be stable under 2.6.x.
Kernel 2.6.9-1.724_FC3 doesn't oops for me (at least, not so far) so I
can use that in the meantime. However, I will try as hard as I can to
get full oopses from 2.6.10. I have not tried mainline 2.6.10 at all,
and 2.6.10-ac2 only briefly, so I don't yet know if the problem is in
the upstream kernel.
Arrrrrrgh... why didn't I think of using a serial console!? [Probably
because I've worn myself out looking at swsusp code.]
Now I just need to set one up (later today), then I should be able to
get full oopses...
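For anyone else chasing a similar oops, here is a rough sketch of the logging side of a serial console. It assumes the crashing box is booted with something like "console=ttyS0,115200n8 console=tty0" on the kernel command line, and that a second machine sits on the other end of a null modem cable. The device path, the speed, the log file name and the function names are all assumptions for illustration, not what I actually used.

```python
import termios
import time


def log_console(stream, out, clock=time.time):
    """Copy lines from a binary stream to a text log, one timestamp per line."""
    count = 0
    for raw in iter(stream.readline, b""):
        line = raw.decode("ascii", errors="replace").rstrip("\r\n")
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(clock()))
        out.write(f"{stamp} {line}\n")
        out.flush()  # the oops may be the last thing the box ever prints
        count += 1
    return count


def open_serial(path="/dev/ttyS0", baud=termios.B115200):
    """Open a serial port in raw 8N1 mode (path and speed are assumptions)."""
    port = open(path, "rb", buffering=0)
    attrs = termios.tcgetattr(port.fileno())
    attrs[0] = 0  # iflag: no input translation
    attrs[1] = 0  # oflag: no output processing
    attrs[2] = termios.CS8 | termios.CREAD | termios.CLOCAL  # cflag: 8N1
    attrs[3] = 0  # lflag: non-canonical mode, no echo
    attrs[4] = baud  # input speed
    attrs[5] = baud  # output speed
    attrs[6][termios.VMIN] = 1  # block until at least one byte arrives
    attrs[6][termios.VTIME] = 0
    termios.tcsetattr(port.fileno(), termios.TCSANOW, attrs)
    return port


def main(log_path="oops-console.log"):
    """Log everything from the serial port until interrupted."""
    with open_serial() as port, open(log_path, "a") as log:
        log_console(port, log)
```

Calling main() on the logging machine (as a user with access to the serial device) appends everything the other box prints to the log file until interrupted with Ctrl-C. Flushing after every line matters here, since the logging process is often killed right after the interesting part arrives.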
Make that "swsusp and ACPI code" in my last comment ;)
Anyway, a couple of other things I forgot to mention earlier:
2.6.9-1.715_FC3 ran for almost 16 days without oopsing.
Earlier today I ran 2.6.9-1.724_FC3 for several hours. No oops, so
far. However, I've rebooted it back into 2.6.10-1.727_FC3 now that I
have a serial console set up to capture the oops.
Looking back through my logs, it looks like the time to oops varies a
bit; sometimes it oopses after half an hour, sometimes it oopses after
maybe 10-11 hours. Not once has the box managed to survive 12 hours
without oopsing. It seems to me (but I can't really prove this so far)
that the oops happens faster if there is more filesystem activity on
the box. (But maybe that's just because there's also been more network
activity whenever there's been more filesystem activity. So, who knows...)
I'll attach an oops once I capture one.
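The time-to-oops figures above came from eyeballing the logs. A small sketch of how that could be automated follows, assuming classic syslog lines ("Mon DD HH:MM:SS host kernel: ...") and assuming each boot starts with the kernel's "Linux version" banner. The uptime of a boot is approximated as the time between that banner and the last line logged before the next banner. The function names are hypothetical, and the year has to be supplied because this syslog format does not record it.

```python
from datetime import datetime

BOOT_MARKER = "kernel: Linux version"  # assumed first kernel message of a boot


def parse_stamp(line, year):
    """Parse the 'Mon DD HH:MM:SS' prefix of a classic syslog line."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def uptimes(lines, year):
    """Return a list of (kernel_version, uptime) tuples, oldest boot first."""
    boots = []  # each entry: [version, first_stamp, last_stamp]
    for line in lines:
        line = line.rstrip("\n")
        try:
            stamp = parse_stamp(line, year)
        except ValueError:
            continue  # not a syslog line, skip it
        if BOOT_MARKER in line:
            parts = line.split(BOOT_MARKER, 1)[1].split()
            version = parts[0] if parts else "unknown"
            boots.append([version, stamp, stamp])
        elif boots:
            boots[-1][2] = stamp  # latest line seen during this boot
    return [(version, last - first) for version, first, last in boots]
```

Feeding it the lines of the system log (for example, from open("/var/log/messages")) gives one entry per boot. Two caveats: a clean reboot looks the same as a crash, and the last entry is only the uptime logged so far, so the numbers still need a sanity check by hand.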
Created attachment 109459 [details]
kernel messages and oops from 2.6.10-1.727_FC3
This is (for the most part -- see my next attachment too) what the serial
console captured from 2.6.10-1.727_FC3. This log contains most of the kernel's
boot messages, followed by the oops. (Next time I think I will get the messages
with "dmesg" and capture the oops separately -- that would provide slightly
more complete output.)
Note that this particular oops is not the only one I've seen -- that is, other
oopses have mentioned different functions, but they've still been
network-related. But this is the first oops since I set up the serial console.
(And FWIW it would have happened earlier today had I not been rebooting
continuously into new kernels in order to figure out what caused the infamous
ACPI shutdown bug.)
Created attachment 109460 [details]
post-oops Alt-SysRQ-M output followed by Alt-SysRQ-T output, for 2.6.10-1.727_FC3
Note that this output may not be trustworthy -- I started holding down the
power button, then after a second or two I thought "hey, this is stupid, I
should see if I can get more info with SysRQ instead."
Now I've booted into 2.6.10-ac2, to see if that oopses.
For the record, this seems to happen here as well:
- Athlon 1400, Abit KT7A mobo, 2 (yeah, junk) RTL 8139 NICs
- oopses with ipt_* symbols (I don't know exactly which, and I don't have a
serial console due to a broken null modem cable)
- I use masquerading and used to use QoS, but I don't know whether this (i.e.,
its configuration) survived a severe root FS crash I had at the beginning of
the week (not with the 2.6.10 kernels in question).
BTW, in my case I think the link may have been down on eth3 (which is bridged
with eth1) for the entire uptime of the computer, before each oops. However,
right now the link is definitely up; I don't know if that's going to affect the
likelihood of oopsing. (If ac2 does not oops, I guess I'll try 1.727_FC3 again,
but making sure that eth3's link is up, and I'll see if that oopses.)
On 2.6.10-ac2, in the period of time that I would have expected the
kernel to oops, I got this instead (and the system kept running):
NETDEV WATCHDOG: eth3: transmit timed out
eth3: transmit timed out, status 0073, resetting.
Link was down on eth3 at the time that this happened.
I've now rebooted into 2.6.10-1.727_FC3, with the link up on eth3 this
time. I'm going to keep the link up overnight and see if it oopses by
Created attachment 109509 [details]
another oops from 727_FC3
This time the link was up on eth3. The oops also happened after less than an
hour of uptime.
Created attachment 109510 [details]
sysrq-m and sysrq-t from the second oops
This time I made sure not to hit the power button first. :)
I just now noticed that CONFIG_DEBUG_PAGEALLOC is enabled in the
1.727_FC3 kernel but disabled in the ac2 and 1.724_FC3 kernels.
My next step will probably be to recompile some kernel with
CONFIG_DEBUG_PAGEALLOC enabled (probably 1.724_FC3) and see if that
makes the bug appear in that kernel.
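Spotting a difference like CONFIG_DEBUG_PAGEALLOC by eye is easy to miss, so here is a minimal sketch of comparing two kernel config files. It relies only on the usual .config conventions ("CONFIG_FOO=y" for set options and "# CONFIG_FOO is not set" for disabled ones). The function names are hypothetical, and treating an option that is absent from a file as "n" is a simplifying assumption.

```python
def parse_config(text):
    """Map CONFIG_* option names to values; disabled options map to 'n'."""
    suffix = " is not set"
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            # "# CONFIG_FOO is not set" -> CONFIG_FOO
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def diff_configs(a, b):
    """Return {option: (value_in_a, value_in_b)} for options that differ."""
    result = {}
    for name in sorted(set(a) | set(b)):
        left, right = a.get(name, "n"), b.get(name, "n")
        if left != right:
            result[name] = (left, right)
    return result
```

On Fedora the configs of installed kernels normally live in /boot/config-<version>, so reading two of those files, passing each through parse_config() and then calling diff_configs() lists every option that differs between the two kernels, debugging options included.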
After nearly 24 hours, 724_FC3 + CONFIG_DEBUG_PAGEALLOC still had not
oopsed. I'm now running 727_FC3 - CONFIG_DEBUG_PAGEALLOC; we'll see
what happens with that.
2.6.10-1.727_FC3 without CONFIG_DEBUG_PAGEALLOC also appears not to oops,
after well over a day. When I get a chance, I'll try 2.6.10-1.736_FC3
and see whether its debugging code shows anything. (That will not be
for at least several hours however.)
2.6.10-1.736_FC3 didn't oops after a couple days of continuous uptime.
I'm now running 2.6.10-1.741_FC3.barryn -- this is a recompile of
1.741_FC3 with pagealloc debug re-enabled. If it can stay up for a day
or so without oopsing, I'll close this bug. Otherwise, I'll attach the
new oops and figure out how to proceed.
Be sure to disable the periodic slab checker patch if you enable
pagealloc debugging. The two aren't compatible.
Right, I'm aware of that (I saw it in one of the changelogs).
BTW, the first time I booted into 741 + pagealloc debug, brctl oopsed
(near the end of boot, when it was executing stuff in
/etc/rc.d/rc.local) and that oops took the whole system down. I
captured the oops but I haven't been able to reproduce it. Should I
bother filing a bug and submitting the oops anyway?
(Maybe this weekend I'll try setting up a reboot loop, to see if that
reproduces it, but for now I want to keep it up to see if *this* oops
Given that the incompatibility has a tendency to throw some really weird
oopses, I'd not bother unless you can reproduce it on a regular FC kernel.
Oops, once again I didn't make myself clear (-ENOSLEEP this time).
The periodic slab debug is disabled, and pagealloc debug is enabled.
That shouldn't have the incompatibility, should it?
Anyway, 741_FC3 + pagealloc debug *should* have oopsed by now if my
original bug was still present -- but it didn't. So it seems that
whatever caused it has been fixed.