Bug 55622 - Tyan Tiger SMP lockups
Status: CLOSED NOTABUG
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 7.2
Hardware: i386 Linux
Priority: medium
Severity: high
Assigned To: Ben LaHaise
QA Contact: David Lawrence
Reported: 2001-11-02 18:22 EST by Mike Zingale
Modified: 2007-04-18 12:38 EDT (History)
Last Closed: 2002-09-11 14:12:56 EDT

Attachments:
/var/log/messages (160.71 KB, text/plain), 2001-11-03 17:42 EST, Mike Zingale
/var/log/Xfree86.0.log (52.91 KB, text/plain), 2001-11-03 17:45 EST, Mike Zingale
/etc/X11/XF86Config-4 (1.82 KB, text/plain), 2001-11-04 22:42 EST, Mike Zingale
/etc/X11/XF86Config (14.24 KB, text/plain), 2001-11-04 22:43 EST, Mike Zingale
/var/log/XFree86.0.log (42.85 KB, text/plain), 2001-11-06 12:33 EST, Mike Zingale
cat /proc/cpuinfo (826 bytes, text/plain), 2001-11-06 15:51 EST, Mike Zingale
John's /var/log/messages (42.93 KB, text/plain), 2001-11-10 16:57 EST, John
cat /proc/sys/kernel/tainted (2 bytes, text/plain), 2001-11-10 16:58 EST, John

Description Mike Zingale 2001-11-02 18:22:30 EST
From Bugzilla Helper:
User-Agent: Mozilla/4.78 [en] (X11; U; Linux 2.4.7-10smp i686)

Description of problem:
About once a day, my X session locks up (mouse freezes, etc.), and I am
unable to switch to any virtual consoles or ssh into the machine (both of
these work under normal circumstances).  I have had my machine freeze like
this 3 times since installing 7.2 five days ago.  In each instance, I was
using mozilla.  I am aware of at least one other system that freezes
constantly with mozilla.
  

Version-Release number of selected component (if applicable):


How reproducible:
Sometimes

Steps to Reproduce:
1. Launch mozilla, browse, wait . . .

Actual Results:  After an unknown amount of time, the X session will
completely lock up.  I am unable to get access to the system in any of the
usual ways.	

Expected Results:  Nothing should lock a system up this hard.	

Additional info:

The system is a dual 1.2 GHz Athlon with 1 GB of RAM.  The motherboard is
a Tyan Tiger MP S2460.  The HD is an 80 GB Seagate IDE drive.  There is also
an IDE CD-RW/DVD drive on the other IDE channel.  The RH install was a
straight workstation install.  In all instances, I was using GNOME.  The
kernel is the RH 2.4.7 kernel, and all other errata/patches have been applied
from the updates section of the RH ftp site / up2date.
Comment 1 Mike Zingale 2001-11-02 18:25:53 EST
... I should add that after each instance, I looked in /var/log/messages, and
saw no messages at the time of the lockup.
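
When the logs are silent around a hard freeze, one low-tech way to pin down when it happened is to have the machine write a timestamp to disk at a fixed interval and look for the gap after the reboot.  The following is only a minimal sketch and was not used in this report: the 10-second interval, the slack factor, and both function names are assumptions chosen for illustration.

```python
import os
import time


def write_heartbeat(path, now=None):
    """Append one Unix timestamp to `path` and force it to disk."""
    stamp = time.time() if now is None else now
    with open(path, "a") as fh:
        fh.write("%.0f\n" % stamp)
        fh.flush()
        os.fsync(fh.fileno())  # a hard lockup would lose unflushed data


def find_gaps(stamps, interval=10, slack=3):
    """Return (last_seen, next_seen) pairs where heartbeats stopped.

    A gap is any pair of consecutive timestamps that are more than
    interval * slack seconds apart.
    """
    gaps = []
    for prev, cur in zip(stamps, stamps[1:]):
        if cur - prev > interval * slack:
            gaps.append((prev, cur))
    return gaps
```

Calling write_heartbeat() in a loop with a 10-second sleep, and running find_gaps() over the recorded timestamps after a forced reboot, would bracket the time of the freeze to within one interval.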
Comment 2 Mike A. Harris 2001-11-02 20:20:40 EST
What video card are you using?
Please attach your X server configuration, and your X server logs
using the link below.  Also attach your /var/log/messages file,
and the output of "cat /proc/sys/kernel/tainted".
Comment 3 Mike Zingale 2001-11-03 17:42:19 EST
Created attachment 36304 [details]
/var/log/messages
Comment 4 Mike Zingale 2001-11-03 17:45:07 EST
Created attachment 36305 [details]
/var/log/Xfree86.0.log
Comment 5 Mike Zingale 2001-11-03 17:45:41 EST
The video card is an nVidia GeForce2 64 MB card.

I've attached /var/log/messages and /var/log/Xfree86.0.log.  There is no
/proc/sys/kernel/tainted file to cat.

Please let me know what other information you need.


Comment 6 Mike Zingale 2001-11-04 22:42:27 EST
Created attachment 36423 [details]
/etc/X11/XF86Config-4
Comment 7 Mike Zingale 2001-11-04 22:43:24 EST
Created attachment 36424 [details]
/etc/X11/XF86Config
Comment 8 Mike Zingale 2001-11-04 22:46:05 EST
I added the XF86Config files (I believe XF86Config-4 is being used).  Also, the
problem recurred twice today, and in one instance mozilla was not running;
only GNOME and 4 terminals were.

-- Mike
Comment 9 Mike A. Harris 2001-11-05 23:11:31 EST
It seems the logfile attachment is a logfile from a running X server
rather than after a server crash.  If this is the case, can you
please attach a log file from after a crash?  Note that if you
restart XFree86 after a crash, the log will be overwritten.

One more thing: could you attach the output of "lspci -n"?

Also, can you test the Option "NoAccel" in your XF86Config-4 and see
if that causes the problem to disappear?  This option is documented
on the XF86Config manpage.  If it works, we can narrow down the
problem.
Comment 10 Mike Zingale 2001-11-06 12:27:12 EST
Below is the result of lspci:

[zingale@nan ~]$ /sbin/lspci -n
00:00.0 Class 0600: 1022:700c (rev 11)
00:01.0 Class 0604: 1022:700d
00:07.0 Class 0601: 1022:7410 (rev 02)
00:07.1 Class 0101: 1022:7411 (rev 01)
00:07.3 Class 0680: 1022:7413 (rev 01)
00:07.4 Class 0c03: 1022:7414 (rev 07)
00:0a.0 Class 0200: 10b7:9200 (rev 6c)
01:05.0 Class 0300: 10de:0152 (rev a4)


also for completeness, /proc/pci gives:

[zingale@nan ~]$ more /proc/pci
PCI devices found:
  Bus  0, device   0, function  0:
    Host bridge: PCI device 1022:700c (Advanced Micro Devices [AMD]) (rev 17).
      Master Capable.  Latency=64.  
      Prefetchable 32 bit memory at 0xec000000 [0xefffffff].
      Prefetchable 32 bit memory at 0xe8002000 [0xe8002fff].
      I/O at 0x1090 [0x1093].
  Bus  0, device   1, function  0:
    PCI bridge: PCI device 1022:700d (Advanced Micro Devices [AMD]) (rev 0).
      Master Capable.  Latency=99.  Min Gnt=12.
  Bus  0, device   7, function  0:
    ISA bridge: Advanced Micro Devices [AMD] AMD-765 [Viper] ISA (rev 2).
  Bus  0, device   7, function  1:
    IDE interface: Advanced Micro Devices [AMD] AMD-765 [Viper] IDE (rev 1).
      Master Capable.  Latency=64.  
      I/O at 0xf000 [0xf00f].
  Bus  0, device   7, function  3:
    Bridge: Advanced Micro Devices [AMD] AMD-765 [Viper] ACPI (rev 1).
      Master Capable.  Latency=64.  
  Bus  0, device   7, function  4:
    USB Controller: Advanced Micro Devices [AMD] AMD-765 [Viper] USB (rev 7).
      IRQ 11.
      Master Capable.  Latency=16.  Max Lat=80.
      Non-prefetchable 32 bit memory at 0xdc000 [0xdcfff].
  Bus  0, device  10, function  0:
    Ethernet controller: 3Com Corporation 3c905C-TX [Fast Etherlink] (rev 108).
      IRQ 10.
      Master Capable.  Latency=80.  Min Gnt=10.Max Lat=10.
      I/O at 0x1000 [0x107f].
      Non-prefetchable 32 bit memory at 0xe8001000 [0xe800107f].
  Bus  1, device   5, function  0:
    VGA compatible controller: nVidia Corporation NV15 Bladerunner (Geforce2 GTS) (rev 164).
      IRQ 10.
      Master Capable.  Latency=64.  Min Gnt=5.Max Lat=1.
      Non-prefetchable 32 bit memory at 0xe9000000 [0xe9ffffff].
      Prefetchable 32 bit memory at 0xf0000000 [0xf7ffffff].

I will test the noaccel option after the next hang.

I will also look for a different XFree log after a hang.

There are several similar problems reported in the tyan newsgroup
(alt.comp.periphs.mainboard.tyan) under the heading "Tiger S2460 'freezing'
under RedHat 7.2".  I am beginning to wonder whether this is a kernel issue
rather than an X issue, but it is difficult for me to differentiate between
the two right now, with the information I have.
Comment 11 Mike Zingale 2001-11-06 12:33:34 EST
Created attachment 36661 [details]
/var/log/XFree86.0.log
Comment 12 Mike Zingale 2001-11-06 12:35:07 EST
I've attached the XFree86 log again -- this is the only recent log I have in
/var/log/.
Comment 13 John 2001-11-06 13:01:10 EST
Hi, I'm one of the people who posted a message on the newsgroup.  I'll copy my
note to that group here.  I had assumed that it was a hardware conflict in my
system, or because maybe one of my XP cpus has bad smp logic.  But, since other
people are reporting freezes, I should add my experiences.

Btw, it hasn't crashed since I posted to the newsgroup, so I haven't had a
chance to ping the machine while it's frozen yet.

If you'd like me to test any fixes or workarounds, let me know.

John


Copied newsgroup post:
----------------------
Subject: Re: Tiger S2460 'freezing' under RedHat 7.2
From: "John" <john@crossbow.localdomain>
Newsgroups: alt.comp.periphs.mainboard.tyan
Date: Tue, 06 Nov 2001 01:40:57

I have the same problem.  (I hadn't assumed it was the board's fault,
because my machine is slightly unconventional.)

I've got:
Redhat 7.2
Tiger MP, v1.02 bios
2 _XP_ 1600s
Volcano 5 heat sink fans
512MB Crucial reg. ddr
Enermax 431 power supply
AudioPCI sound card
Matrox Millennium II pci video card
3c905B nic
Toshiba 24x ide cd reader
10GB Maxtor & 4GB Western Digital Caviar ide hdds

I don't know the temps... I don't have lm_sensors installed yet.
But, the case is a Lian-Li pc60 (don't ask why I didn't spend the $$$
on a video card :)  and the cables are all out of the way... the airflow
is great.

I haven't had an opportunity to ping it while it's frozen yet, so I don't
know whether it's a real crash or just the video locking up.  (I've tried
to look really quickly at the hdd light on the case to see if the kernel
is still alive after a freeze, and I think I saw it doing something once.
I'm not sure though.)

I have windows 2000 temporarily installed on the machine, too.  That
has frozen many times, both during installation and while running
software.  It had a day of continuous uptime yesterday, and froze like
mad today.  I haven't paid too much attention to the Windows crashes,
though...  this is the first time I've used Windows in a couple of years,
and I don't know how it's supposed to behave anymore.

Two days ago I pounded on the machine with cpuburn, with two processes
running for an hour.  The machine ran just fine.  Also, I ran some
multi-threaded test routines from the FFTW fast fourier transform library,
which ran just fine.  (I issued commands like
./fftw_threads_test 2 -s 4096x4096.)  I hoped that the big multithreaded
jobs might trip across any problems in the untested smp parts of the XP
cpus.  No crash, though... it ran beautifully.

(Btw, I am awed by the speed of this thing. :)

Every time it has locked up, it has been under almost zero load.  One
time, it locked up while unmounting a cdrom.  I don't know that that's
useful, though... the other times the cdrom was idle, as far as I can
remember.

Also, I did a cursory check for shared interrupts... according to
/proc/interrupts, only usb-ohci and eth0 share an interrupt, #11.
(There are no usb devices installed.)

I wish I had kept a log so far, but I believe a lock-up happened
while the network and sound cards were removed.

I'm sure there's more info than that, but this is getting long-winded.
I'd be happy to find out that a bios update is all it takes, :) but the
info on the patch didn't list any stability fixes.

If someone has any clues or suggestions, I'm all ears. :)  Personally,
I'm just waiting for it to crash again, so I can try to ping it. :)

Later,
John
Comment 14 Mike Zingale 2001-11-06 14:44:49 EST
Thank you for your suggestions.  

After the last hang, I disabled acceleration in XFree by adding the 

Option "NoAccel"

line to my XF86Config-4 file.  I verified that it was recognized by looking in
the XFree86.0.log file, and just by moving a window across my screen (it was
painfully slow).

After a few hours, the system locked up once again.  As usual I was unable to 
switch to a virtual console or ssh into the machine from another one.  I *was*
able to ping the machine though -- this is something that I must not have tried
before. 

Also, yesterday, based on some comments I read online about experiences with
this motherboard, I booted up with the 'noapic' parameter to the kernel (SMP),
to see if this fixed any of the instabilities -- it did not.

Presently, I am back to the normal session -- X acceleration enabled, apic
enabled.  I am still trying to find a reliable way to reproduce this freezing,
and thus far, I seem to be able to do it if I download a large (i.e. >~ 600 MB)
file.  Usually this, or doing it again immediately after completion, will lock
the machine.  I will continue to experiment to find a reliable way to hang the
system.

Please let me know any other suggestions you may have.

Mike
Comment 15 Mike A. Harris 2001-11-06 15:43:13 EST
From everything you've described so far, this looks much more like
a kernel bug than an XFree86 bug to me if the whole box locks up.  The
NV driver shouldn't really be able to lock up the whole box IMHO.  Failing
that, I would hypothesize it is a hardware issue, such as a buggy
motherboard chipset or BIOS.  Can you please provide the full output
of:  cat /proc/cpuinfo
Comment 16 Arjan van de Ven 2001-11-06 15:45:15 EST
Some revisions of the 760MP chipset require you to boot with "noapic" to run
stably. Could you try that?
Comment 17 Mike Zingale 2001-11-06 15:49:18 EST
I have already tried passing 'noapic' to the kernel at bootup.  I was able to
confirm that it was accepted by looking in /var/log/messages, but it did not
stop the freezing.

I agree that this sounds more like a kernel issue than an XFree issue.  When it
first started, I did not know as much as we've uncovered to date.

My cpuinfo is:

[zingale@nan ~]$ more /proc/cpuinfo 
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 6
model		: 6
model name	: AMD Athlon(tm) Processor
stepping	: 1
cpu MHz		: 1194.667
cache size	: 256 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips	: 2385.51

processor	: 1
vendor_id	: AuthenticAMD
cpu family	: 6
model		: 6
model name	: AMD Athlon(tm) Processor
stepping	: 1
cpu MHz		: 1194.667
cache size	: 256 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips	: 2385.51
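
On a dual-CPU board it can be worth checking mechanically that both processors report the same model and stepping.  A minimal sketch, assuming the usual "key : value" layout of /proc/cpuinfo with a blank line between processors; the function names and the default field list are made up for illustration.

```python
def parse_cpuinfo(text):
    """Split /proc/cpuinfo text into one dict per logical processor."""
    cpus = []
    current = {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends one processor's block.
            if current:
                cpus.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        cpus.append(current)
    return cpus


def mismatched_fields(cpus, fields=("model", "stepping", "cpu MHz")):
    """Return the names of fields whose values differ between CPUs."""
    return [f for f in fields if len({c.get(f) for c in cpus}) > 1]
```

Feeding it the text of /proc/cpuinfo and getting an empty list back from mismatched_fields() only says the two CPUs describe themselves identically; it says nothing about whether either one is faulty.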


I am willing to try any further suggestions you may have.

Mike
Comment 18 Ben LaHaise 2001-11-06 15:50:20 EST
What power supply is this system using?
Comment 19 Mike Zingale 2001-11-06 15:51:59 EST
Created attachment 36681 [details]
cat /proc/cpuinfo
Comment 20 Mike Zingale 2001-11-06 15:52:47 EST
/proc/cpuinfo did not seem to paste well into bugzilla, so I added an attachment
containing its output.

Mike
Comment 21 Mike Zingale 2001-11-06 16:25:44 EST
The power supply is an Athlon SPI Sparkle Pwr Supply, FSP400-60GN SN10923708.
The vendor claims that it exceeds AMD's requirements.  It is a 400W power supply.

Comment 22 Mike Zingale 2001-11-06 17:18:32 EST
I saw some comments in the release notes about using the noathlon parameter to
the kernel.  I did a clean install on this box, and it appears that the i686
kernel was installed -- not the athlon optimized kernel, so am I correct in
believing that this parameter would have no effect?

'uname -a' gives:

Linux nan.ucolick.org 2.4.7-10smp #1 SMP Thu Sep 6 16:16:16 EDT 2001 i686 unknown

and 'rpm -qa | grep kernel' gives:

kernel-headers-2.4.7-10
kernel-2.4.7-10
kernel-smp-2.4.7-10


Comment 23 Arjan van de Ven 2001-11-06 17:20:27 EST
uname says "i686" even for athlon kernels.
So "noathlon" is worth a shot, and so is updating to the 2.4.9-13 kernel we
released last week...
Comment 26 John 2001-11-06 22:48:25 EST
Ok, I can now repeatably lock up cpu1.  (Doh.)

Steps:
1.  Open up a terminal, and start up top.
2.  Open up another terminal, and fire up seti@home.
3.  Hold down the space bar in the top terminal to watch the screen update
    quickly.  In much less than a minute, cpu1 ends up fixed at 0% user,
    0% system, 0% nice, and 100% idle.

Afterwards, the system locks up when trying to log out of GNOME.
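
The same per-CPU numbers that top displays can be derived from two snapshots of /proc/stat, which makes it possible to log the moment a CPU goes fully idle instead of watching the screen.  This is a sketch under assumptions, not what was run here: it assumes per-CPU lines of the form "cpuN user nice system idle ..." with idle as the fourth counter, and the function names are hypothetical.

```python
def read_cpu_times(stat_text):
    """Map 'cpu0', 'cpu1', ... to their list of jiffy counters."""
    times = {}
    for line in stat_text.splitlines():
        parts = line.split()
        # Per-CPU lines look like "cpu0 user nice system idle ...";
        # the aggregate "cpu" line is skipped.
        if parts and parts[0].startswith("cpu") and parts[0] != "cpu":
            times[parts[0]] = [int(p) for p in parts[1:]]
    return times


def idle_percent(before, after):
    """Percent of time each CPU spent idle between two snapshots."""
    result = {}
    for cpu, new in after.items():
        old = before.get(cpu)
        if old is None:
            continue
        total = sum(new) - sum(old)
        idle = new[3] - old[3]  # fourth counter is idle time
        result[cpu] = 100.0 * idle / total if total else None
    return result
```

Reading /proc/stat twice a few seconds apart and passing both snapshots to idle_percent() gives one figure per CPU; a CPU that stays at 100 while a compute job is running would match the symptom described above.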

The system is a Tyan Tiger MP (S2460), 512MB Crucial DDR PC2100 ram,
and 2 Athlon _XP_ 1600s.  The os is Redhat 7.2, fully updated as per
the Redhat updates, including the kernel.

The newsgroup message I pasted above includes the rest of the system
info.  (The power supply is an Enermax 431, the case is a Lian-Li PC60,
and the heat sinks are Volcano 5s, so I don't think cooling or power is
the problem.)


Unless there's some obscure kernel bug I'm tripping on, (and I doubt
that) I think this is a sign that at least one of my XP 1600s has bad
smp logic.

I'd be very happy to hear if someone with MP rather than XP cpus could
reproduce this problem. :)!!!

Otherwise, I think I'm going to have to replace the cpus.

John

p.s. Sorry about the repeat, I hit tab then return.  Who would have thought I'd
tab out of the text entry window? :)
Comment 27 Mike A. Harris 2001-11-07 05:53:40 EST
Reassigning bug to kernel component, as the kernel or the hardware
is more likely at fault here.  Updated summary.

Also note that AMD does not support SMP operation with Athlon XP.
Red Hat does not support SMP configurations that the CPU manufacturer
does not support.
Comment 28 Mike Zingale 2001-11-07 12:25:18 EST
I just confirmed with the person who assembled my machine that they are in fact
1.2 GHz Athlon MP processors, not XP.  These are qualified to run on the Tyan board.

I passed the "noathlon" parameter to the kernel yesterday.  I will wait to see
if any freezes develop.  My next step is to upgrade to the 2.4.9-13 kernel
available in your updates section.  I will use the i686 version.

Mike
Comment 29 Mike Zingale 2001-11-08 14:00:42 EST
With 'noathlon' passed to the kernel, the machine stayed up for almost 36 hours
without hanging -- then it froze hard.  This is by far the longest that this
machine has stayed up without freezing; typically I was seeing 2-3 hangs per day.

I've rebooted with 'noathlon' passed to the kernel again, to see whether it
once more extends my sessions.

Can someone explain in more detail what noathlon does, and what it means if I
need this passed to my kernel to increase stability?

I will try upgrading the kernel next.

Please let me know if you have any further suggestions.

Mike
Comment 30 Arjan van de Ven 2001-11-08 14:55:43 EST
"noathlon" disables some athlon specific optimisations, specifically an
optimized memcopy function. Some machines seem to crash when high speed memory
copies are done. Unfortunately similar code can run in userspace and that also
kills the machine according to some reports. Some people have seen a bios
upgrade fix it.
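
As a rough illustration of the kind of userspace load meant here, the sketch below copies a large buffer over and over and verifies every copy.  It is an assumption-laden example, not a known reproducer: it exercises whatever copy routine the Python runtime and C library happen to use, not the kernel's Athlon-optimized function, and the function name, buffer size, and round count are arbitrary.

```python
import hashlib
import os


def copy_stress(size=64 * 1024 * 1024, rounds=100):
    """Copy a random buffer repeatedly and count corrupted copies."""
    source = os.urandom(size)
    expected = hashlib.sha256(source).digest()
    bad = 0
    for _ in range(rounds):
        copy = bytearray(source)  # one large memory-to-memory copy
        if hashlib.sha256(copy).digest() != expected:
            bad += 1
    return bad
```

A return value of 0 proves little on its own; a nonzero count, or a hang partway through, would point toward memory, chipset, or CPU trouble rather than toward any one application.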
Comment 31 Mike Zingale 2001-11-08 15:14:05 EST
Thank you for the explanation.  I am told that there is a BIOS upgrade
available for this system.  I will try to get that and see if it helps.

Mike
Comment 32 John 2001-11-09 20:26:29 EST
Hi everyone,

My cpu1 lockups are cured... switching from XP to MP cpus fixed the problem.

So, chalk this up as a confirmed failure of Athlon XP cpus in an smp
configuration. :)  (The first one I have heard of.)

Since the original reporter's system uses MP chips, though, this doesn't
help him.

John
Comment 33 John 2001-11-09 20:50:31 EST
I was wrong.  Even with the MP cpus, I can still cause the lockups.

The XP cpus were fine.  (What a waste of money.)

Doing more troubleshooting now...

John
Comment 34 John 2001-11-10 16:55:10 EST
I've upgraded the motherboard to the latest bios (v1.03).

That didn't help.

I've also run the latest version of memtest86 on the machine...
there were no memory errors.

Also, I've done more tests w/ seti@home and top.  As far as I can
tell, neither on its own can lock up cpu1.  Both have to be used
together for the freeze to occur.  (Hopefully that narrows things
down a bit.)

This problem is serious.  I need to figure out whether this is a
linux bug or a motherboard bug.  I'm going to use this box for
some hardcore thesis number crunching...  scientific codes crashing
the machine may put a damper on that. :)  I'll see if I can narrow
down the problem even more.

Mike, does seti@home+top cause a repeatable crash for you?  If it
doesn't, I should open a separate bug, since our two problems may be
less related than we originally thought.

John

p.s. I'm attaching tainted and /var/log/messages.

Comment 35 John 2001-11-10 16:57:33 EST
Created attachment 37201 [details]
John's /var/log/messages
Comment 36 John 2001-11-10 16:58:45 EST
Created attachment 37202 [details]
cat /proc/sys/kernel/tainted
Comment 37 Mike Zingale 2001-11-10 18:16:52 EST
I have not tried the seti@home yet -- I will try to get to that shortly,
although I am not sure if the network people here would frown upon this.

My current status is:

 2.4.7 RH kernel now locks up ~ every 24 hours, when booted with 'noathlon' and
 'noapic' parameters

 I've downloaded the i686 2.4.9 RH kernel (latest one available at the updates),
 and I will try this probably tomorrow.

 I can pretty much reliably get the system to hang when downloading moderately
 sized file(s) -- something in the range of 600 MB to 1 GB.  Usually I set up an
 scp grabbing a dataset and go about other work, and after some time, the system
 locks up hard.  I cannot recall an instance where the machine locked up when I
 was not copying some remote data onto my machine (either through the web or
 scp).  This is probably why I initially associated the bug with mozilla.

 I am unaware of any conflicts with my ethernet card, a 3c59x based one.

 I never see any incriminating messages in /var/log/messages.

My current game plan (after getting some real work done on this machine today :)
is to try out the 2.4.9 kernel, and see if that makes any difference.  The
person who built the machine for me is going to be around at the end of next
week, and we will try swapping out some parts (memory, etc.) to see if there is
perhaps some bad hardware.  I will also try the BIOS update (although there is
some evidence above that this does not help), and try booting up with the
non-SMP kernel.

As always, I am open to further suggestions.  Thank you all for your patience
and help,

Mike
Comment 38 Mike Zingale 2001-11-26 15:41:31 EST
Hello, I swapped out the motherboard tray on this system, replacing the
motherboard, CPUs, and memory.  The CPUs are now a later stepping (02), 
and there are 4 sticks of memory totalling 2 GB instead of 3 sticks totalling 1
GB before.  The machine has run for almost a week without locking up -- and this
is without passing 'noathlon' or 'noapic' to the kernel at boot up.  This
suggests that the culprit was hardware, although the exact part that was failing
is unknown.

One final question -- after the swap I have 2 GB of memory instead of 1 GB.  The
memory is all recognized by the system, but when the OS was originally
installed, the swap space was chosen based on 1 GB of memory.  Do I still have
enough swap space with the increased amount of memory?

Thank you for all of your help,

Mike
Comment 39 Need Real Name 2001-11-28 11:24:35 EST
I suppose I should chime in here - I've been plagued with lockups since
upgrading to RH7.2, on a machine which was, prior to that, rock solid on redhat-7.0.

Gigabyte GA6BXDS motherboard
768 MB RAM
two slot-I Pentium II 350 CPUs

Currently running 2.4.9, but the initially installed 2.4.7 kernel showed similar
lockups.

I have tried shuffling interrupts, since there IS some sharing going on, and
will probably try booting with NOAPIC later tonight, when I am near the machine
again (it's presently locked solid in a remote location).

-Darren
Comment 40 Need Real Name 2001-12-04 11:54:14 EST
seti on tiger s2460:

i have RH7.1 with 2x 1.2 MP chips on Tyan Tiger S2460; two setis will lock this
up in a few minutes or a couple of days, although recently it locks up soon
after the second seti starts; requires heavy manual fsck'ing on reboot;  ECC is
off; one seti appears to be fine.  only locks up with setis AFAIK.
The RH7.1 installation is entirely 'out of the box'; have not done any updates. 
i'm running for a while w/o setis and will see how it goes.

some specs:
coolers are SK6's with 31 cfms, arctic silver II, 
PC60, all fans at high;  
PSU is enermax whisper 365 (350 Watt); 
MSI TNT2 32 MB card
4 x crucial 256 MB RAM ECC/REG
Seagate 40 GB 7200 rpm
3Com 905cx-txnm 10/100
Comment 41 Need Real Name 2001-12-04 12:06:51 EST
UPDATE - Since booting RedHat-7.2's 2.4.9-13smp with "noapic", the machine's
been stable.

bash-2.05$ uptime
 12:00pm  up 5 days, 11:59,  7 users,  load average: 1.95, 2.03, 2.02

This was unheard of previously - it would be up a day at the most.

Let me know if you would like to know anything more about this system. I'm going
to stick with NOAPIC . . . I'm not at all clear why that would suddenly be
needed, but I'm not complaining with the uptime!

-d
Comment 42 Need Real Name 2001-12-05 11:37:57 EST
I am having similar problems with an Intel-based machine, a Dell 530
workstation with 1.5GHz Xeon, 1GB RAM, NVidia Quadro 64 MB and latest NVIDIA
drivers. 

Running 7.1, I had no problems. After a clean install of 7.2, I have a crash
(screen frozen, unresponsive to ping) about once a day. I have not found any
info in system logs. Kernel is redhat 2.4.9-13. Machine is fully up2date. 

Comment 43 Arjan van de Ven 2001-12-05 11:43:14 EST
tippett@iri.columbia.edu: nvidia driver bugreports should go to nvidia, not here.
Comment 44 Need Real Name 2001-12-05 11:43:21 EST
I'm sorry, but I must recant - the machine locked up overnight last night. So
noapic MIGHT have had an effect (5 days of uptime was a dramatic improvement)
but it was not a complete cure.

Here's some more detail:

[root@hewes darren]# cat /proc/cpuinfo 
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 5
model name	: Pentium II (Deschutes)
stepping	: 2
cpu MHz		: 349.070
cache size	: 512 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 2
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr
bogomips	: 696.32

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 5
model name	: Pentium II (Deschutes)
stepping	: 2
cpu MHz		: 349.070
cache size	: 512 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 2
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr
bogomips	: 697.95


[root@hewes darren]# lspci -n
00:00.0 Class 0600: 8086:7190 (rev 02)
00:01.0 Class 0604: 8086:7191 (rev 02)
00:07.0 Class 0601: 8086:7110 (rev 02)
00:07.1 Class 0101: 8086:7111 (rev 01)
00:07.2 Class 0c03: 8086:7112 (rev 01)
00:07.3 Class 0680: 8086:7113 (rev 02)
00:08.0 Class 0400: 109e:036e (rev 02)
00:08.1 Class 0480: 109e:0878 (rev 02)
00:09.0 Class 0200: 1011:0019 (rev 41)
00:0a.0 Class 0104: 105a:4d30 (rev 02)
00:0b.0 Class 0401: 1102:0002 (rev 05)
00:0b.1 Class 0980: 1102:7002 (rev 05)
00:0c.0 Class 0100: 9004:7895 (rev 04)
00:0c.1 Class 0100: 9004:7895 (rev 04)
01:00.0 Class 0300: 102b:0525 (rev 04)
Comment 45 Need Real Name 2001-12-05 12:00:09 EST
Arjan?

I don't think the change to "Short Summary" was entirely appropriate. My
motherboard is a Gigabyte GA6BXDS. I'm not buying the NVIDIA bug thing either,
since my card is a MATROX ;-)

-d
Comment 46 Arjan van de Ven 2001-12-05 12:04:09 EST
I'd really prefer if the problems which are not on Tyan Tiger boards get their
own bug; this bug is just dozens of different things mixed together and the
original description was WAY too vague and would fit just about every problem.
Comment 47 Need Real Name 2001-12-05 12:08:54 EST
Your point is well taken. I'll see if I can find the time to file a new one. Can
you guide me on what I should include with mine? Recall that I've got a
previously solid machine (was redhat-7.0) which is now locking up regularly (and
silently) since the upgrade to redhat-7.2.

If you can tell me the things you'd need to see, I'll see if we can fork to a
new bug which is sufficiently clear.

-d
Comment 48 Need Real Name 2001-12-10 09:19:13 EST
UPDATE - the issue is non-seti and continues to degrade;  the tiger crashed
overnight without any load at all and also began to crash under non-seti loads; 
on one occasion, the system only recognized one of four dimms on reboot.  i am
currently running with only two dimms and reduced loads.  so far there has been
no crash with this. a RH tech pointed me to memtest.  i will try this within the
next few days and report back.  my current guess: either one (or more) dimms
failed, or crucial's performance may not be up to snuff.   crucial represents
their dimms as DDR2100/ECC/REG/CAS2.5 which satisfies tyan's requirements,
although crucial has not to my knowledge been certified by tyan for this board.  

one poster on a newsgroup reported instabilities on a tyan tiger similar to
those reported here; he also had the crucial ram and his problems were resolved
by replacing with corsair ram.  there seem to be an equal number of people
reporting unsuccessful and successful experiences with crucial ram on these
boards; regardless the tiger is notoriously finicky on RAM and i am hopeful that
the ram is the source of my problems. i hope to provide better evidence for you
shortly.

best wishes, -David
Comment 49 John 2001-12-10 23:40:02 EST
I found a new way to freeze the second processor...  I tried building the ATLAS
blas tonight, and didn't make it very far.

I don't think this is a Linux problem.  A guy emailed me recently, and said he
tried my seti@home test on two Tiger MPs, with no problems.  I think it's either
my memory or my motherboard.  (I have two dimms of Crucial ram... I tried
running on each by itself, with no improvement.)

I've been trying to contact Tyan for a couple of weeks now.  They won't answer
email, so I'm going to call them tomorrow.

When I get replacement parts, I'll let everyone know the results.

John
Comment 50 John 2001-12-14 03:07:58 EST
Switching out the ram for Tyan-certified Corsair ram didn't help...

The motherboard is the only possibility left.
Comment 51 Need Real Name 2001-12-17 09:09:10 EST
jsk : sorry to hear about the RAM not helping; i have continued to operate
without a crash for 10 days now with only 2 dimms instead of 4; so i think my
problem is the memory, but i have encountered some people with bad tyan S2460's,
so it is reasonable to suspect that, as you appear to have ruled other things
out.

In particular, I had a bad mobo and documented it fairly carefully; it was
leaking current to its ground plane in a manner that could be easily measured
with a multimeter;  the vendor gladly replaced it and the new mobo resolved many
problems.   it appears that fixing the ram will resolve the remaining problems
(i hope).  best wishes with the mobo.
Comment 52 John 2001-12-17 09:41:41 EST
Thanks for the well-wishes. :)  I haven't actually pulled the motherboard tray
out and physically or electrically inspected it yet.  I'm getting a new mb cross
shipped from the vendor, which will arrive on Wednesday.  When it comes in, I'm
going to look more closely at the old one.  (It works for day-to-day tasks, but
the machine locks up when I try to build the ATLAS BLAS, so I can't do any of my
numerical work on the machine.)

I'll let you all know if the new mb solves the problems.

John
Comment 53 John 2001-12-19 16:57:07 EST
Cured!!! It's a miracle!!!

Actually, it's a new motherboard. :)  The dealer replaced the old one, and the
new Tiger MP works beautifully.


So, Linux worked perfectly, and I should have had faith in it from the start. :)

John
Comment 54 Need Real Name 2002-01-03 10:21:08 EST
congrats - i have not had time to run memtests, but i have not yet had a crash
in the 2-dimm config, a telling sign.  And I have been running it pretty hard so
far, although not yet with quite everything I've got.  I, too, must say that i
should have had more faith in linux....  best wishes all.
Comment 55 Mike Zingale 2002-01-21 12:39:39 EST
Hi, I've read some recent news about a hardware bug in some Athlons that
causes intermittent lock-ups.  This is posted at http://www.gentoo.org/, and
also discussed on /.
(http://slashdot.org/article.pl?sid=02/01/21/0750226&mode=nested)

The gist of the bug is:

``As you may know, x86 systems have traditionally
managed memory using 4K pages. However, with the introduction of the
Pentium processor, Intel added a new feature called extended paging,
which allows 4Mb pages to be used instead. Here's the problem -- many
Athlon and Duron CPUs experience memory corruption when extended
paging is used in conjunction with AGP. And, this problem hits us
because Linux 2.4 kernels compiled with a Pentium-Classic or higher
Processor family kernel configuration setting will automatically take
advantage of extended paging (for kernel hackers out there, this is
the X86_FEATURE_PSE constant defined in
include/asm-i386/cpufeature.h.) Fortunately, there is a quick and easy
fix for this problem. If you have been experiencing lockups on your
Athlon, Duron or Athlon MP system when using AGP video, try passing
the mem=nopentium option to your kernel (using GRUB or LILO) at
boot-time. This tells Linux to go back to using 4K pages, avoiding
this CPU bug. In addition, it should also be possible to avoid this
problem by not using AGP on affected systems. As soon as I discovered
that this CPU bug existed (which happened, unfortunately, because my
CPU has the bug), I informed kernel hacker Andrew Morton of the issue;
he put me in touch with Alan Cox. Alan is going to try to add some
kind of Athlon/AGP CPU bug detection code to the kernel so that it
will be able to auto-downgrade to 4K pages when necessary.''


Some reports indicate that later steppings of the Athlon chip fix this
problem.  Since I swapped hardware, my processors are of a stepping
that reportedly does not have this problem, and my machine has now been up
for 48 days straight.

It would seem that trying 'mem=nopentium' on the kernel line would be a 
good option for anyone still experiencing the problem.
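
With GRUB the option is appended to the `kernel` line in grub.conf; with LILO it goes in an `append=` entry. After rebooting, it is worth confirming that the option actually reached the kernel. A minimal sketch, assuming a POSIX shell; the function name `has_boot_option` is only illustrative:

```shell
# Report whether a boot option is present on a kernel command line.
# $1 = option to look for
# $2 = file holding the command line (defaults to /proc/cmdline,
#      which shows what the running kernel was booted with)
has_boot_option() {
    opt=$1
    file=${2:-/proc/cmdline}
    [ -r "$file" ] || return 1
    # Put each option on its own line, then look for an exact match.
    tr -s ' ' '\n' < "$file" | grep -qxF -- "$opt"
}

if has_boot_option mem=nopentium; then
    echo "mem=nopentium is active"
else
    echo "mem=nopentium is not set"
fi
```

If this reports "not set" after a reboot, the boot loader entry was probably edited for a different kernel image than the one that was started.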

Mike
Comment 56 Arjan van de Ven 2002-01-21 12:41:52 EST
This is unfortunately a load of bullshit from slashdot....
The kernel DOES use 4Mb pages but NEVER EVER does the thing that's broken
on athlons..... What Nvidia's module does is unknown but that I don't care
about.
Comment 57 Jim Meyer 2002-01-22 18:54:02 EST
We have 5 systems with the Tiger S2460 motherboard.
1 motherboard was DOA.  Of the remaining 4, all locked
up constantly with the RedHat 7.2 stock kernel (2.4.7).
Using the updated kernel (2.4.9-13), 1 of these 4 stopped
locking up, but the other three still freeze.  After 
about a week, the one that worked started exhibiting the 
memory detection failure that was noted here.  The 
replacement for that board has also worked flawlessly, 
but the remaining 3 still lock up.

The odd thing is this is virtually identical to the 
problems I had on Intel systems with bug #39233.  
In that instance, though, the problem turned out to
be a faulty kernel (2.4.2).  In this case the same 
kernel (2.4.9) that fixed all of our problems on 
Intel systems only fixed the problems on 1 of 4 
Athlon systems.  We will see how the replacement 
for the DOA system behaves.

One of our users reports that playing an mp3 by the
Counting Crows seems to trigger this system freeze.

Does anyone have any information about markings on the
motherboard that might distinguish a good Tiger motherboard
from a broken one?
Comment 58 David Krovich 2002-03-07 11:21:46 EST
I was having a lot of problems with a Dual Processor setup using the Tyan Tiger
S2466 motherboard with the AMD 760 MPX chipset.  I could make the machine hang
for about 3-5 seconds by running bonnie.  Basically, it seemed like anything
that did a flurry of disk activity would make the machine freeze.  It would
never crash, just hang for a while and then come back.  The fix for me was to
make sure DMA was turned on.  By default, with my 7.2 installation, it wasn't.
So I edited the /etc/sysconfig/harddisks file and set the USE_DMA flag to 1.  The
machine seems to be running really well now.
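
The DMA check described above can be sketched roughly as follows. This assumes the Red Hat file /etc/sysconfig/harddisks and a first IDE disk at /dev/hda (the device name is an assumption, as is the helper name `dma_setting`):

```shell
# Print the USE_DMA value set in a sysconfig-style file, or "unset".
# $1 = file to read (defaults to /etc/sysconfig/harddisks)
dma_setting() {
    file=${1:-/etc/sysconfig/harddisks}
    [ -r "$file" ] || { echo unset; return; }
    # Use the last USE_DMA= assignment that is not commented out.
    val=$(sed -n 's/^[[:space:]]*USE_DMA=//p' "$file" | tail -n 1)
    echo "${val:-unset}"
}

echo "USE_DMA in config: $(dma_setting)"

# To inspect the live setting on a drive, run as root:
#   hdparm -d /dev/hda      # reports using_dma = 0 (off) or 1 (on)
#   hdparm -d1 /dev/hda     # turns DMA on until the next reboot
```

The config file only takes effect at boot, so `hdparm -d` is the quicker way to see whether DMA is actually on for a given drive.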
Comment 59 Need Real Name 2002-06-05 16:06:04 EDT
I have a dual athlon 1800+ system with a Tyan Tiger MPX 2466 motherboard
and a VisionTek 2964 GeForce-3 based video card.  I am running RedHat 7.2
with the 2.4.9-31 update kernel, XFree86-4.1.0-15, and the NVIDIA 1.0-2960 
drivers.  I experience frequent lockups of the entire system.  Other
non-Athlon-based systems I have do not have freezing problems.  One scenario
looks like this:

- Run the AWadvs-04 SPECviewperf benchmark.
- After a while (3 seconds to 2 minutes or so), the display will freeze, but 
  I can still move the mouse.  The machine is still pingable, but I get no
  response from the machine from telnet, rsh, or ssh login requests.
- If I tap a key, the mouse freezes.  Sometimes the machine is still pingable 
  even at this point (but not always).

The crash happens regardless of whether I am using the NVIDIA AGP driver or
AGPGART.  Fast Writes and SBA are both disabled.

I can stop the crashes from happening by disabling AGP entirely (by setting 
Option "NvAGP" "0" in XF86Config-4).  Using the mem=nopentium kernel option
as suggested in the NVidia README does *not* prevent the system freezes.

I would like to try adjusting the AGP rate, but the module parameter
does not appear to be recognized, as I get an "invalid parameter"
message:


greg@spock:~ [12:17] <66> sudo insmod NVdriver NVreg_ReqAGPRate=1
Using /lib/modules/2.4.9-31smp/kernel/drivers/video/NVdriver
Warning: loading /lib/modules/2.4.9-31smp/kernel/drivers/video/NVdriver will
taint the kernel: non-GPL license - NVIDIA
/lib/modules/2.4.9-31smp/kernel/drivers/video/NVdriver: invalid parameter
parm_NVreg_ReqAGPRate


What is the proper parameter to issue to the driver in order to set 
the AGP rate?   Has any progress been made as to tracking down the 
source of this problem?  

I have also sent this message to linux-bugs@nvidia.com
Comment 60 Need Real Name 2002-06-18 15:38:40 EDT
Seeing as this appears to be related to the known Athlon/AGP cache 
coherency problem, here is an article about progress made on this 
issue.  They include a link to a kernel patch to work around the
problem:

http://www.linuxjournal.com/modules/NS-articles/lighter/6148s1.php

Any idea if/when this patch will be integrated into the RedHat kernel?
Comment 61 Jay F 2002-06-19 22:22:56 EDT
Hi there,

I have had a Tyan Tiger S2460 with two Athlon MP 1200 running almost
continuously, day and night, for six months. It will now no longer run for more
than six hours without a hard lock up.

I am running console only, no x-windows. Typical error message is
"Kernel Panic: Aiee, killing interrupt handler.
In interrupt handler - not syncing"

On one occasion, followed by
"<3> dpti0  Bad preserved MFA (c07d63c8) - dropping frame".

I can find no error entries in /var/log/messages.

I am running Redhat 7.2, and have tried both the 2.4.9-31 and -34 kernels, -smp 
and uniprocessor, with and without the 'noapic', 'noathlon', 
and 'mem=nopentium' options, all to no avail.

As an aside, how do you turn off the console 'screen saver'? When the machine 
locks up, the screensaver seems to prevent visible error messages.

Jay.
Comment 62 Mike A. Harris 2002-09-11 14:07:31 EDT
Since the original bug reporter seems to think this was a hardware
problem, and it certainly has sounded like it is a hardware problem
all along, zingale, do you think we can safely close this as NOTABUG
or somesuch now?

Also, as a note to all others who've posted comments as well - I think
most of your problems while similar perhaps are individual problems
which may or may not be hardware bugs/problems, or might be kernel bugs
or something else.

If the original poster thinks this bug is no longer an issue, or is/was
hardware flawed, and we close this bug, if anyone else has an issue
that they think is really a kernel bug, they should open a new bug
report.  One problem per bug report please.

Setting bug to NEEDINFO.
Comment 63 Mike Zingale 2002-09-11 14:12:50 EDT
Yes, I believe that this can be closed now.  Thank you for your assistance.
Comment 64 Mike A. Harris 2002-09-11 14:20:13 EDT
Ok, thanks - closing NOTABUG (not a software bug).
