Bug 144324
Summary: | networking-related (?) oopses on 2.6.10 Fedora kernels, more than once a day | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Barry K. Nathan <barryn> |
Component: | kernel | Assignee: | Dave Jones <davej> |
Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 3 | CC: | barryn, davem, nphilipp, pfrields, wtogami |
Hardware: | i686 | ||
OS: | Linux | ||
Fixed In Version: | 2.6.10-1.741_FC3 | Doc Type: | Bug Fix |
Last Closed: | 2005-01-15 14:12:47 UTC | Type: | --- |
Description
Barry K. Nathan
2005-01-05 22:47:26 UTC
Arrrrrrgh... why didn't I think of using a serial console!? [Probably because I've worn myself out looking at swsusp code.] Now I just need to set one up (later today), then I should be able to get full oopses...

Make that "swsusp and ACPI code" in my last comment ;)

Anyway, a couple of other things I forgot to mention earlier: 2.6.9-1.715_FC3 ran for almost 16 days without oopsing.

Earlier today I ran 2.6.9-1.724_FC3 for several hours. No oops, so far. However, I've rebooted it back into 2.6.10-1.727_FC3 now that I have a serial console set up to capture the oops.

Looking back through my logs, it looks like the time to oops varies a bit; sometimes it oopses after half an hour, sometimes it oopses after maybe 10-11 hours. Not once has the box managed to survive 12 hours without oopsing.

It seems to me (but I can't really prove this so far) that the oops happens faster if there is more filesystem activity on the box. (But maybe that's just because there's also been more network activity whenever there's been more filesystem activity. So, who knows...)

I'll attach an oops once I capture one.

Created attachment 109459 [details]
kernel messages and oops from 2.6.10-1.727_FC3
This is (for the most part -- see my next attachment too) what the serial
console captured from 2.6.10-1.727_FC3. This log contains most of the kernel's
boot messages, followed by the oops. (Next time I think I will get the messages
with "dmesg" and capture the oops separately -- that would provide slightly
more complete output.)
Note that this particular oops is not the only one I've seen -- that is, other
oopses have mentioned different functions, but they've still been
network-related. But this is the first oops since I set up the serial console.
(And FWIW it would have happened earlier today had I not been rebooting
continuously into new kernels in order to figure out what caused the infamous
ACPI shutdown bug.)
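For anyone who ends up with a single serial capture like this one (boot messages and oops in the same file), here is a minimal sketch of separating the two after the fact. The function name and the marker strings are assumptions for illustration; the markers are typical of i386 2.6-era oops output, but other oopses may start with different text, so adjust them to match the capture.

```python
# Hypothetical helper, not part of any kernel tooling.
# OOPS_MARKERS is an assumption about what the first line of an oops looks like.
OOPS_MARKERS = ("Unable to handle kernel", "Oops:", "kernel BUG at")


def split_console_log(text):
    """Split a console capture into (boot_lines, oops_lines).

    The split happens at the first line containing any marker string.
    If no marker is found, everything is treated as boot messages.
    """
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if any(marker in line for marker in OOPS_MARKERS):
            return lines[:index], lines[index:]
    return lines, []
```

The serial capture would be read into a string and passed to `split_console_log`; the second element of the result is then the part worth attaching as the oops.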
Created attachment 109460 [details]
post-oops Alt-SysRQ-M output followed by Alt-SysRQ-T output, for 2.6.10-1.727_FC3
Note that this output may not be trustworthy -- I started holding down the
power button, then after a second or two I thought "hey, this is stupid, I
should see if I can get more info with SysRQ instead."
Now I've booted into 2.6.10-ac2, to see if that oopses.

For the record, this seems to happen here as well:
- Athlon 1400, Abit KT7A mobo, 2 (yeah, junk) RTL 8139 NICs
- oopses with ipt_* symbols (I don't know exactly which, and I don't have a serial console due to a broken null modem cable)
- I use masquerading and used to use QoS, but I don't know whether this (i.e. its configuration) survived a severe root FS crash I had at the beginning of the week (not with the 2.6.10 kernels in question).

BTW, in my case I think the link may have been down on eth3 (which is bridged with eth1) for the entire uptime of the computer, before each oops. However, right now the link is definitely up; I don't know if that's going to affect the likelihood of oopsing. (If ac2 does not oops, I guess I'll try 1.727_FC3 again, but making sure that eth3's link is up, and I'll see if that oopses.)

On 2.6.10-ac2, in the period of time that I would have expected the kernel to oops, I got this instead (and the system kept running):

NETDEV WATCHDOG: eth3: transmit timed out
eth3: transmit timed out, status 0073, resetting.

Link was down on eth3 at the time that this happened.

I've now rebooted into 2.6.10-1.727_FC3, with the link up on eth3 this time. I'm going to keep the link up overnight and see if it oopses by early afternoon.

Created attachment 109509 [details]
another oops from 727_FC3
This time the link was up on eth3. The oops also happened after less than an
hour of uptime.
Created attachment 109510 [details]
sysrq-m and sysrq-t from the second oops
This time I made sure not to hit the power button first. :)
I just now noticed that CONFIG_DEBUG_PAGEALLOC is enabled in the 1.727_FC3 kernel but disabled in the ac2 and 1.724_FC3 kernels. My next step will probably be to recompile some kernel with CONFIG_DEBUG_PAGEALLOC enabled (probably 1.724_FC3) and see if that makes the bug appear in that kernel.

After nearly 24 hours, 724_FC3 + CONFIG_DEBUG_PAGEALLOC still had not oopsed. I'm now running 727_FC3 - CONFIG_DEBUG_PAGEALLOC; we'll see what happens with that.

2.6.10-1.727_FC3 without CONFIG_DEBUG_PAGEALLOC also appears not to oops, after well over a day. When I get a chance, I'll try 2.6.10-1.736_FC3 and see whether its debugging code shows anything. (That will not be for at least several hours however.)

2.6.10-1.736_FC3 didn't oops after a couple of days of continuous uptime. I'm now running 2.6.10-1.741_FC3.barryn -- this is a recompile of 1.741_FC3 with pagealloc debug re-enabled. If it can stay up for a day or so without oopsing, I'll close this bug. Otherwise, I'll attach the new oops and figure out how to proceed.

Be sure to disable the periodic slab checker patch if you enable pagealloc debugging. The two aren't compatible.

Right, I'm aware of that (I saw it in one of the changelogs). BTW, the first time I booted into 741 + pagealloc debug, brctl oopsed (near the end of boot, when it was executing stuff in /etc/rc.d/rc.local) and that oops took the whole system down. I captured the oops but I haven't been able to reproduce it. Should I bother filing a bug and submitting the oops anyway? (Maybe this weekend I'll try setting up a reboot loop, to see if that reproduces it, but for now I want to keep it up to see if *this* oops happens again.)

Given that the incompatibility has a tendency to throw some really weird oopses, I'd not bother unless you can reproduce it on a regular FC kernel.

Oops, once again I didn't make myself clear (-ENOSLEEP this time). The periodic slab debug is disabled, and pagealloc debug is enabled. That shouldn't have the incompatibility, should it?

Anyway, 741_FC3 + pagealloc debug *should* have oopsed by now if my original bug was still present -- but it didn't. So it seems that whatever caused it has been fixed.
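
Since much of the narrowing-down above came from noticing that CONFIG_DEBUG_PAGEALLOC differed between builds, here is a minimal sketch of checking one option across several kernel config files. The function names are hypothetical, and it assumes each config is in the usual `CONFIG_FOO=y` / `# CONFIG_FOO is not set` text format (on Fedora these files are normally installed as /boot/config-<version>).

```python
def parse_kernel_config(text):
    """Map CONFIG_* names to their values; '... is not set' options map to 'n'."""
    suffix = " is not set"
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            # Drop the leading "# " and the trailing " is not set".
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def compare_option(configs, option):
    """Return {label: value} for one option across several config texts.

    `configs` maps an arbitrary label (e.g. a kernel version string) to the
    text of that kernel's config file. Options that do not appear at all are
    reported as 'absent'.
    """
    return {
        label: parse_kernel_config(text).get(option, "absent")
        for label, text in configs.items()
    }
```

Passing the contents of each kernel's config file, keyed by version, to `compare_option(..., "CONFIG_DEBUG_PAGEALLOC")` shows at a glance which builds have page allocation debugging turned on.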