Bug 176997 - kernel-smp-2.6.14-1.1653_FC4smp locks up hard
Status: CLOSED CURRENTRELEASE
Product: Fedora
Classification: Fedora
Component: kernel
Version: 4
Hardware: i386 Linux
Priority: medium
Severity: medium
Assigned To: Dave Jones
QA Contact: Brian Brock

Reported: 2006-01-04 23:51 EST by Ralf Corsepius
Modified: 2015-01-04 17:24 EST (History)
Fixed In Version: 2.6.15-1.1824_FC4smp
Doc Type: Bug Fix
Last Closed: 2006-05-05 08:52:37 EDT

Attachments
dmesg of booting with 2.6.14-1.1644_FC4smp (15.21 KB, text/plain)
2006-01-15 12:31 EST, Ralf Corsepius
dmesg of booting 2.6.14-1.1653_FC4 with acpi=off (15.32 KB, text/plain)
2006-01-15 12:32 EST, Ralf Corsepius
dmesg of booting with 2.6.15-1.1824_FC4smp (15.24 KB, text/plain)
2006-01-15 12:33 EST, Ralf Corsepius

Description Ralf Corsepius 2006-01-04 23:51:20 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20050923 Fedora/1.7.12-1.5.1

Description of problem:
I am observing hard kernel hangs/lockups with kernel-smp-2.6.14-1.1653_FC4 shortly (<1min) after booting up the system.

Symptoms are: Shortly after booting, the system stops responding. Neither console input/logins nor remote logins are possible.

Unfortunately, /var/log/messages doesn't provide any helpful information related to the breakdown: no oops or any other indication of what might be going wrong.


Version-Release number of selected component (if applicable):
kernel-smp-2.6.14-1.1653_FC4

How reproducible:
Always

Steps to Reproduce:
1. boot
2. wait for a minute
=> system is inaccessible.


Additional info:

This is an old dual PII/266MHz, SCSI-only system.

All previous FC4 kernel-smp kernels I had installed, up to kernel-smp-2.6.14-1.1644_FC4, seemed to have worked flawlessly (currently running 2.6.14-1.1644_FC4).
Comment 1 Dave Jones 2006-01-05 20:30:03 EST
can you try booting with pci=noacpi ?  (or if that doesn't work, acpi=off)

any chance you can boot (making sure 'quiet' is removed from the boot command
line), and let us know the last few things that appear on the screen ? (Even a
digital camera pic would be useful.)
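
One way to confirm that the requested options actually reached the kernel is to inspect the command line it was booted with. The following is a minimal Python sketch, not part of the original request; it assumes /proc/cmdline is readable on the test box, and the helper names `check_cmdline` and `read_cmdline` are made up for illustration.

```python
def check_cmdline(cmdline, wanted=("pci=noacpi",), unwanted=("quiet",)):
    """Compare a kernel command line string against expected options.

    Returns (missing, leftover): options from `wanted` that are absent,
    and options from `unwanted` that are still present.
    """
    tokens = cmdline.split()
    missing = [opt for opt in wanted if opt not in tokens]
    leftover = [opt for opt in unwanted if opt in tokens]
    return missing, leftover


def read_cmdline(path="/proc/cmdline"):
    """Read the parameters the running kernel was booted with (Linux only)."""
    with open(path) as f:
        return f.read()


# Typical use on the affected machine (not executed here):
#   missing, leftover = check_cmdline(read_cmdline())
#   print("missing:", missing, "should be removed:", leftover)
```

For the fallback case, pass `wanted=("acpi=off",)` instead of the default.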
Comment 2 Ralf Corsepius 2006-01-06 02:24:37 EST
(In reply to comment #1)
> can you try booting with pci=noacpi ?
>  (or if that doesn't work, acpi=off)
Nope, neither pci=noacpi nor acpi=off seems to help.

> any chance you can boot (making sure 'quiet' is removed from the boot command
> line, and let us know the last few things that appear on the screen ? (Even a
> digital camera pic would be useful).
Well, I would have done so if there were anything useful to see.

In most cases, the system boots up normally and ends up with a normal console
login. Then, after some time of "seemingly normal operation", the system becomes
completely unresponsive. Everything seems "frozen"; not even a "3 finger
salute" works.

In some cases (less frequently), the system hangs while booting.

After rebooting into an older kernel, /var/log/messages doesn't show anything
unusual concerning the hang: no oops, no errors, no warnings, just normal logs.

However, I now suspect (beware: wild guess!) something related to
networking, because these hangs always seem to occur during network access.

In those cases where it hangs while booting, the last boot message is usually
the startup message of autofs or of some nfs or yp daemon.
In those cases where it hangs after a successful bootup, I can (almost)
deterministically cause the system to hang by logging in remotely and running
"yum update" as root.

Network driver problem? Compiler miscompiling driver?
Comment 3 Dave Jones 2006-01-12 19:04:17 EST
have you tried running memtest86 over this box for a while ? A large percentage
of hangs we get reported turn out to be bad ram.

something that may trigger a backtrace when it hangs is booting with nmi_watchdog=1
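
Whether the NMI watchdog is actually running can be checked from the NMI line in /proc/interrupts, whose per-CPU counters should grow over time when it is active. The sketch below is a hedged illustration in Python; the function names are hypothetical, and it assumes the usual layout where the line starts with "NMI:" followed by one counter per CPU.

```python
def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counters from /proc/interrupts-style text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            # Keep only numeric fields; newer kernels append a description.
            return [int(tok) for tok in fields[1:] if tok.isdigit()]
    return []


def watchdog_ticking(before, after):
    """True if any CPU's NMI counter grew between two samples."""
    old, new = nmi_counts(before), nmi_counts(after)
    if not old or len(old) != len(new):
        return False
    return any(n > o for o, n in zip(old, new))


# Typical use (not executed here): read /proc/interrupts twice, a few
# seconds apart, and pass both snapshots to watchdog_ticking().
```

If the counters stay at zero, the watchdog is probably not active on that hardware and no backtrace should be expected from it.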
Comment 4 Ralf Corsepius 2006-01-12 22:35:22 EST
(In reply to comment #3)
> have you tried running memtest86 over this box for a while ? 
Not recently. However, this machine (ca. 8 years old) has had almost every
Fedora and RHL kernel since RH-8.0 installed, and (except for occasional kernel
bugs) has so far been rock-solid.

It currently is running 2.6.14-1.1644_FC4smp without any problems ;)

> something that may trigger a backtrace when it hangs is booting with
> nmi_watchdog=1
I can give this a try.

For the record: 2.6.14-1.1656_FC4smp exposes this issue, too.
Comment 5 Dave Jones 2006-01-12 23:59:17 EST
could you try out the kernel just pushed out to updates-testing too please ?
(2.6.15-1.1824_FC4)
Comment 6 Ralf Corsepius 2006-01-13 01:41:26 EST
(In reply to comment #5)
> could you try out the kernel just pushed out to updates-testing too please ?
> (2.6.15-1.1824_FC4)

Initial results (uptime 1 hour) look promising: The box survived several
bootups, an e2fsck during bootup, a "yum update" and shoveling around several megs
of data over the network.

Diffing the dmesg of the *1644, *1653 and *1824 kernels shows some presumably
noteworthy differences related to DMA and APIC (This box is known to have a
"broken" APIC implementation - Yes, I mean APIC not ACPI).

The only thing related to the NIC that I can spot is this:
 8139too Fast Ethernet driver 0.9.27
-eth0: RealTek RTL8139 at 0xe800, 00:0b:2b:00:c0:9d, IRQ 185
+eth0: RealTek RTL8139 at 0xd0818000, 00:0b:2b:00:c0:9d, IRQ 185

Any explanation for the lockups with 165* ?
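
A comparison like the one quoted above can be reproduced with any diff tool. As a minimal sketch, here is a Python version built on the standard difflib module; the helper name `diff_dmesg` and the file names in the usage comment are assumptions for illustration, not the commands actually used in this report.

```python
import difflib


def diff_dmesg(old_lines, new_lines, old_name="old", new_name="new"):
    """Return a unified diff (as a list of lines) between two dmesg captures."""
    return list(
        difflib.unified_diff(
            old_lines,
            new_lines,
            fromfile=old_name,
            tofile=new_name,
            lineterm="",  # input lines are expected without trailing newlines
            n=1,          # one line of context around each change
        )
    )


# Typical use (hypothetical file names, not executed here):
#   old = open("dmesg-1644.txt").read().splitlines()
#   new = open("dmesg-1824.txt").read().splitlines()
#   print("\n".join(diff_dmesg(old, new, "1644", "1824")))
```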
Comment 7 Dave Jones 2006-01-13 16:29:07 EST
the RTL8139 diff is because we enabled memory mapped IO, which is faster, and
should be stable. I'd be surprised if that was causing lockups, though it's
always possible.
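
The address change in the diff can be read with a simple rule of thumb: the x86 I/O port space is only 16 bits wide, so a base address that fits in that range points to port I/O, while a larger one must be a memory address. The Python sketch below illustrates this heuristic only; the function name is hypothetical and it is not how the driver itself decides.

```python
IO_PORT_LIMIT = 0x10000  # x86 I/O ports are numbered 0x0000 through 0xFFFF


def access_kind(base_addr):
    """Guess the access method from the base address printed by the driver."""
    if base_addr < IO_PORT_LIMIT:
        return "port I/O"
    return "memory-mapped I/O"


# The two addresses reported for eth0 by the older and newer kernels.
for addr in (0xE800, 0xD0818000):
    print(hex(addr), "->", access_kind(addr))
```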

could you paste the other diffs from the two dmesg's ? There have been some
changes in the area of APIC, but unless you're passing boot command line
options, they should make no difference.
Comment 8 Ralf Corsepius 2006-01-15 12:31:06 EST
Created attachment 123221
dmesg of booting with 2.6.14-1.1644_FC4smp
Comment 9 Ralf Corsepius 2006-01-15 12:32:10 EST
Created attachment 123222
dmesg of booting 2.6.14-1.1653_FC4 with acpi=off
Comment 10 Ralf Corsepius 2006-01-15 12:33:48 EST
Created attachment 123223
dmesg of booting with 2.6.15-1.1824_FC4smp
Comment 11 Ralf Corsepius 2006-01-15 12:38:40 EST
(In reply to comment #7)
> could you paste the other diffs from the two dmesg's ? 

I've added the dmesgs from booting the system with the different kernels being
discussed. Attachment #123222 contains the dmesg of a boot that hung shortly
afterwards; the other two worked without major problems.

[BTW: Uptime with *1824_FC4smp now: 2.5 days]

 

Comment 12 Dave Jones 2006-02-03 02:03:12 EST
This is a mass-update to all currently open kernel bugs.

A new kernel update has been released (Version: 2.6.15-1.1830_FC4)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO_REPORTER state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

Thank you.
Comment 13 John Thacker 2006-05-05 08:52:37 EDT
Sounds like it was fixed with 2.6.15-1.1824_FC4smp.
Closing.
