Bug 169816 - SMP kernel randomly crashes when coming up on Pentium EM64T with hyperthreading enabled
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 4
Hardware: i386
OS: Linux
medium
high
Target Milestone: ---
Assignee: Dave Jones
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2005-10-03 23:38 UTC by Karl Auerbach
Modified: 2015-01-04 22:22 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-02-21 02:29:01 UTC
Type: ---
Embargoed:


Attachments
boot log with survived crash during module loading (33.38 KB, text/plain)
2005-12-30 19:42 UTC, Dan Horák

Description Karl Auerbach 2005-10-03 23:38:41 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20050922 Fedora/1.0.7-1.1.fc3 Firefox/1.0.7

Description of problem:
Machine: Supermicro 5014C-MF (P8SCT motherboard - http://www.supermicro.com/products/motherboard/P4/E7221/P8SCT.cfm )

CPU: Intel Pentium 4 640 EM64T, 3.2 GHz, 800 MHz FSB, 2 MB L2 cache

Memory: 1G non ECC
Drive: SATA

During booting the kernel crashes randomly - sometimes it even comes up and runs.

Crashes range from a simple hang to a stack trace (I don't know how to capture it during boot-up).  The stack traces appear to differ from crash to crash.

Only happens with the SMP kernel with hyperthreading enabled; the same kernel runs fine when hyperthreading is turned off in the BIOS.  The non-SMP kernel runs fine no matter what the BIOS settings are.

Version-Release number of selected component (if applicable):
kernel 2.6.13-1 (1526_FC4SMP)

How reproducible:
Always

Steps to Reproduce:
1. Ensure that hyperthreading on Pentium EM64T is enabled
2. Boot SMP kernel
3. Wait for crash or hang (happens 90%+ of time)
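
For step 1, one way to confirm from a running system that hyperthreading is actually active is to compare the "siblings" and "cpu cores" fields in /proc/cpuinfo. The sketch below is only an illustration, not part of the original report: the function name is hypothetical, the field names are those used by x86 Linux kernels, and treating a missing "cpu cores" field as a single core is an assumption for older kernels.

```python
def hyperthreading_active(cpuinfo_text):
    """Return True if any CPU entry reports more siblings than cores.

    cpuinfo_text is the content of /proc/cpuinfo, where each logical
    CPU is described by a block of "key : value" lines and blocks are
    separated by a blank line.
    """
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        try:
            siblings = int(fields["siblings"])
            # Assumption: if "cpu cores" is absent, treat the package
            # as single-core (as on a Pentium 4).
            cores = int(fields.get("cpu cores", "1"))
        except (KeyError, ValueError):
            continue  # entry lacks usable topology fields
        if siblings > cores:
            return True
    return False
```

On the affected machine this would be called as `hyperthreading_active(open("/proc/cpuinfo").read())` after a boot that survives.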
  

Additional info:

I've gone through the litany of BIOS settings - disabling legacy USB, etc. - to see if anything else changes the situation.  It seems entirely dependent on whether hyperthreading is enabled or not.

Comment 1 Dave Jones 2005-10-03 23:48:19 UTC
can you try running memtest86 overnight, and see if that picks up anything ?
Random crashes are often a sign of hardware problems.
Also check that there's sufficient cooling, and a strong enough power supply.


Comment 2 Karl Auerbach 2005-10-04 21:57:02 UTC
OK, I can try memtest - which version of memtest would you like me to run (and
might you have a bootable .iso for it [the machine does not have a floppy drive])?

However, given that the problem occurs reliably and during boot-up when
hyperthreading is enabled, it seems that it is probably unrelated to memory flaws.

Also I was a bit overbroad when I said that the failures were random - there are
several kinds of failures, but they seem to recur at the same several spots in
the boot sequence.

If memory were flakey then the system ought to be crashing in non-hyperthreading
mode as well.  However the box runs rock solid in non-hyperthread mode.

As for cooling - the CPU is running at about 40C, stable.  And the power supply
is pretty mongo.  This is a Supermicro server box so it's got reasonably studly
engineering margins.

Comment 3 Karl Auerbach 2005-10-04 22:38:52 UTC
I found a memtest ISO at http://www.memtest86.com/

It's running now with default settings - it's gone through one full pass without
errors so far.


Comment 4 Karl Auerbach 2005-10-05 23:16:46 UTC
OK, memtest has run overnight (with hyperthreading enabled) - 85 full passes. 
Zero errors.

So I think we can rule out memory and processor.

I believe that we've got a problem related to the kernel's handling of the
Pentium 4 EM64T with hyperthreading.

Comment 5 Karl Auerbach 2005-10-11 01:57:26 UTC
I was able to narrow the problem down to a small set of kernel config options.
When the following are built into the kernel rather than built as loadable modules, the system
comes up fine.  (I suspect that CONFIG_SCSI_QLA2XXX snuck in by accident.)

< CONFIG_SCSI=y
---
> CONFIG_SCSI=m
773c773
< CONFIG_SCSI_SATA=y
---
> CONFIG_SCSI_SATA=m
776c776
< CONFIG_SCSI_ATA_PIIX=y
---
> CONFIG_SCSI_ATA_PIIX=m
800c800
< CONFIG_SCSI_QLA2XXX=y
---
> CONFIG_SCSI_QLA2XXX=m
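
The comparison above can also be produced programmatically. The following is a minimal sketch, assuming two kernel .config files in the usual `NAME=value` format; the function names are hypothetical and not part of any kernel tooling. It lists the options that are built in (`=y`) in the working config but modular (`=m`) in the failing one.

```python
def parse_config(text):
    """Parse kernel .config text into a dict of option name -> value."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks and comments such as "# CONFIG_FOO is not set".
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep:
            options[name] = value
    return options


def builtin_vs_module(working_text, failing_text):
    """Return sorted option names that are =y in working and =m in failing."""
    working = parse_config(working_text)
    failing = parse_config(failing_text)
    return sorted(
        name
        for name, value in working.items()
        if value == "y" and failing.get(name) == "m"
    )
```

Given the two configs from this comment, the result would be the four SCSI-related options shown in the diff.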


Comment 6 Dave Jones 2005-11-10 19:22:37 UTC
2.6.14-1.1637_FC4 has been released as an update for FC4.
Please retest with this update, as a large amount of code has been changed in
this release, which may have fixed your problem.

Thank you.


Comment 7 Karl Auerbach 2005-11-11 05:51:18 UTC
It's not a happy camper.

On the intel P4/64 box the message "i8042.c: Can't read CTR while initializing
i8042" still pops out on some boots and not on others.

That same system still crashes frequently on the way down when rebooting
(assuming that it managed to successfully come up.)

Keyboard input from the hardware keyboard, both USB and PS2, seems lost on the
Intel P4/64 box under both the SMP and uniprocessor versions of 1637.  But on an
AMD/64 dual-core box I get intermittent massive key bounce from 1637 (Keyboard
operation returns to normal when I go back to the 1532 version kernel.)

I've had to resume 1532 non-smp on the P4/64 box and am putting up with the
keyboard bounce of 1637 for the moment on the AMD/64 dual-core box.




Comment 8 Karl Auerbach 2005-11-16 04:28:37 UTC
Things get amazingly better on the 1632 kernel when I change the BIOS setting
for the SATA to be AHCI rather than any of the other modes.

With AHCI/SATA the system seems to run reliably, there are no 8042 complaints,
the keyboard works.

(There is still an intermittent crash when halting the kernel in the NFS
unmount, but it seems to be in both the SMP and non-SMP kernels.)


Comment 9 Wu 2005-11-29 21:40:37 UTC
I can confirm the bug, completely the same behaviour. When HT enabled, the SMP
kernel (2.6.14-1.1637_FC4) crashes. 

Motherboard MSI 945P NEO F
CPU Intel Pentium 4 630 EM64T
SATA drive

I am not able to change SATA mode in BIOS (option is disabled), so I cannot
check comment #8.

Comment 10 Dan Horak 2005-12-30 13:28:05 UTC
I think that I see the same problem on my machine. When it crashes, it is
usually during the module loading (Detecting hardware ... sound, network, ...)

motherboard Intel 945PSN (sandusky)
CPU Pentium 4 640
SATA HDD, IDE HDD on Promise Ultra66

Comment 11 Dan Horák 2005-12-30 19:42:23 UTC
Created attachment 122644
boot log with survived crash during module loading

There are three variants of boot: no problem, a small crash after which boot continues, and
a crash with a kernel panic.

Comment 12 Dan Horák 2006-01-13 09:46:56 UTC
With the current FC4 kernel 2.6.14-1.1656_FC4smp the situation is still the same.

Comment 13 Dave Jones 2006-02-03 05:27:06 UTC
This is a mass-update to all currently open kernel bugs.

A new kernel update has been released (Version: 2.6.15-1.1830_FC4)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO_REPORTER state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

Thank you.


Comment 14 Dan Horák 2006-02-11 10:30:47 UTC
I don't want to be too optimistic, but it looks like I can boot without
problems with HT enabled in kernels 2.6.15-1.1830_FC4 and 2.6.15-1.1831_FC4.

Comment 15 Karl Auerbach 2006-02-11 19:22:42 UTC
I have not seen this bug for several FC4 kernel releases - the machine that
originally had the problem is happily running 1831 with hyperthreading enabled
without a hitch.

Comment 16 Dave Jones 2006-02-21 02:29:01 UTC
great, thanks for the update.

