116036 – (SMP NFS AUTOMOUNT) smp kernels crash on dual athlon mp and dual opteron boxes

Status:	CLOSED WONTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Version:	1
Hardware:	athlon
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Arjan van de Ven

Reported:	2004-02-17 19:14 UTC by Carlos A. Villegas
Modified:	2007-11-30 22:10 UTC
CC List:	4 users
Last Closed:	2004-09-29 20:05:58 UTC
Type:	---

Attachments
/proc/cpuinfo files from a system that crashes about once per day (889 bytes, text/plain) 2004-02-17 19:18 UTC, Carlos A. Villegas
/proc/cpuinfo of another system that also crashes very often (up to 2 times per day) (890 bytes, text/plain) 2004-02-17 19:19 UTC, Carlos A. Villegas
network dump described in previous entry (3.20 KB, text/plain) 2004-02-18 22:45 UTC, Carlos A. Villegas

Description Carlos A. Villegas 2004-02-17 19:14:34 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; Linux i686; U) Opera 7.21  [en]

Description of problem:

Our dual processor systems running Fedora Core 1 (32 bit version) are
showing horrible uptimes (about 1 day on some systems) before hanging
for no apparent reason. There are no logs that indicate the problem.
We've added the "debug" argument to the kernel to see if this helped,
but we still get no logs indicating a panic or any other kernel
event. Some systems have a tainted kernel, but some don't (and those
"clean kernels" only use standard modules), so I doubt the problem
lies outside the standard Fedora kernel.

The problem has been seen with several versions of the smp kernel
since 2135. We have tried disabling apm, apic and acpi, with no
improvement. The hardware varies, but all of the machines are dual
processor systems with either Athlon MP or Opteron processors
(running the 32 bit version of FC1).

We're testing some dual opterons with the UP kernel to see if the
behaviour is better. So far they've been up for over 24 hours, so
I'll update in a few days.




Version-Release number of selected component (if applicable):
2.4.22-1.2166.nptlsmp

How reproducible:
Always

Steps to Reproduce:
1. Set up a dual processor machine (athlon mp or opteron)
2. Install Fedora Core 1
3. Leave running for as long as possible.
    

Actual Results:  
Uptimes of up to 5 days, but on some systems less than a day.

Expected Results:  
Far longer uptimes

Additional info:


I'm attaching /proc/cpuinfo for the two systems that seem to crash
most often (they exhibit the worst uptimes with FC1, but had good
uptimes with RHL 9).

Comment 1 Carlos A. Villegas 2004-02-17 19:18:28 UTC

Created attachment 97762
/proc/cpuinfo files from a system that crashes about once per day

Comment 2 Carlos A. Villegas 2004-02-17 19:19:30 UTC

Created attachment 97763
/proc/cpuinfo of another system that also crashes very often (up to 2 times per day)

Comment 3 John Stokes 2004-02-17 23:38:50 UTC

We have a dozen FC1 workstations with AMD 2400+ MP on Tyan 2466. We 
have been experiencing the same issue.

We also have experienced lockups during shutdown at automount with 
nfs mounts. We are not sure if these are related, but we have noticed 
that FC1 is the worst stable release since RH5.

Comment 4 John Stokes 2004-02-17 23:41:10 UTC

We have tried a few solutions that others have suggested for similar
crashes on SMP Xeon systems, including adding the following to the
boot parameters: "noapic noacpi" and also "ACPI=force APIC"

None of these have worked.

Comment 5 Carlos A. Villegas 2004-02-18 00:20:06 UTC

I noticed that I didn't send the exact kernel options we've tried:

ro debug acpi=off apm=off root=LABEL=/ noapic

We have also seen the autofs lockups on shutdown, but not always. I'm
not certain, but I think we've seen them on single processor systems
too, which apparently don't suffer from this instability. I add this
just for completeness.
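
For anyone comparing setups: the options the running kernel was actually booted with can be read back from /proc/cmdline, which helps rule out a misspelled or ignored flag. A minimal Python sketch follows. The helper name `parse_cmdline` is made up for illustration, and the sample string is simply the option line quoted above.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {option: value}; bare flags map to True."""
    options = {}
    for token in cmdline.split():
        # Split on the first "=" only, so "root=LABEL=/" keeps "LABEL=/" as its value.
        key, sep, value = token.partition("=")
        options[key] = value if sep else True
    return options


# Option line quoted in this comment; on a live system the same text
# could be read with open("/proc/cmdline").read() instead.
sample = "ro debug acpi=off apm=off root=LABEL=/ noapic"
for name in ("acpi", "apm", "noapic"):
    print(name, "->", parse_cmdline(sample).get(name, "not set"))
```

Option names are compared exactly as typed here, so a differently spelled or differently capitalised flag would show up as "not set".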

Comment 6 Carlos A. Villegas 2004-02-18 22:37:33 UTC

Today when a system crashed, I noticed that I could still ping it and
that nmap -sT would show port 22 open and the rest closed. However,
ssh would fail. I thought this was interesting, since from the
network the machine seemed to be more or less ok, while it was
actually down; so I captured a network dump of the traffic from
another machine to that one. I'll attach its output. All I did was
ping it, then try to ssh to it, cancel the ssh attempt, and then ping
it again. The dump shows that the TCP session doesn't reach the
ESTABLISHED state on the crashed box, even though the handshake was
successful.

I think that the reason there are no kernel dumps is that the
kernel is actually running, but locked in some strange state. I'm not
sure if this is a cause or an effect of the problem, but it might also
be the reason for the lockups on autofs shutdown.
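
The symptom described above (ping answers, port 22 accepts the handshake, but ssh never proceeds) can be checked from a second machine without a packet capture. The sketch below is an illustration only and not part of the original report: it assumes Python on the probing host, and the function name `probe` and its three state names are invented here. It relies on the fact that an SSH server sends its identification string first, so a connection that completes but stays silent suggests the kernel is still answering while user space is stuck.

```python
import socket


def probe(host, port=22, timeout=5.0):
    """Classify a TCP service as 'unreachable', 'no-banner' or 'responsive'."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, timed out, or no route: the handshake never completed.
        return "unreachable"
    with sock:
        sock.settimeout(timeout)
        try:
            banner = sock.recv(256)
        except socket.timeout:
            # Handshake succeeded, but the service sent nothing in time.
            return "no-banner"
        except OSError:
            return "unreachable"
    # An empty read means the peer closed the connection without greeting us.
    return "responsive" if banner else "no-banner"
```

Calling `probe("hung-box.example")` (a placeholder host name) against a machine in the state described here would be expected to return "no-banner", though that has not been verified against an affected system.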

Comment 7 Carlos A. Villegas 2004-02-18 22:45:37 UTC

Created attachment 97814
network dump described in previous entry


I have munged the real IP addresses. I also have the full network dump (the actual
packets exchanged between the two systems); please let me know if you think it
might be useful.

Comment 8 John Stokes 2004-02-20 20:03:11 UTC

As time goes on, I'm really starting to think that this is related to
autofs (automount). To go one guess further, I would guess that we
are having issues with the kernel not handling NFS file locks very
well. Below is the basis of my assumptions:

-About two weeks ago one of our NFS servers went out to lunch. It
crashed nearly every Dual Athlon 2400+MP running FC1. This crash left
every machine ping-able, but non-responsive.

-Our process bound machines crash less than our network and disk I/O 
bound machines.

-Only one of our many Dual Athlons running FC1 is not in our
office and does not use automount or any NFS crossmounts. This
machine has NEVER crashed.

-Crashing occurs less if no user is logged in to the GUI console.
This points back to NFS file locking issues and how the kernel is not
handling them. When a user logs into Gnome, gconf (I think) places
several file locks that the NFS servers haven't liked in the past.
These file locking issues went away as our NFS servers and clients
matured, but I think we are revisiting them because there is
something wrong with the kernel.

-Logging into Gnome somehow causes the automounter to mount every
mount within /home instead of just the user's mount. This
characteristic is not present in non-smp kernels.

-Non smp kernels have not crashed and have not had the same issues 
with automount.

-On shutdown the system hangs (but is still pingable) at stopping the
automount daemon. On non-smp kernels this may take a moment to unload
each mount, but with smp kernels it is almost guaranteed to lock.
Maybe the kernel is handling too many of these at once and going into
a weird state?

-Even though FC2 test 1 is incomplete and has some GUI issues, we
have not seen any misbehavior in automounting of NFS.

In conclusion, all fingers point to how the kernel is handling (or
not handling) NFS crossmounts (file locking), and possibly to a
problem with its automounter unmounting shares, but most likely just
to the kernel handling (or deadlocking on) unmounting in general.
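
To check the observation about the automounter mounting everything under /home, the NFS mounts currently present can be listed from /proc/mounts before and after a Gnome login. A rough Python sketch, with a hypothetical helper name `nfs_mounts` and the assumption that each line has the usual device, mount point, filesystem type order:

```python
def nfs_mounts(mounts_text, prefix="/home"):
    """Return mount points of NFS filesystems at or below `prefix`."""
    below = prefix.rstrip("/") + "/"
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        # Only real NFS mounts count; the autofs trigger mount itself is ignored.
        if fs_type in ("nfs", "nfs4") and (
            mount_point == prefix or mount_point.startswith(below)
        ):
            found.append(mount_point)
    return found


def current_nfs_mounts(prefix="/home", path="/proc/mounts"):
    """Convenience wrapper that reads the live mount table (Linux only)."""
    with open(path) as handle:
        return nfs_mounts(handle.read(), prefix)
```

Comparing the result of `current_nfs_mounts()` on an smp and a non-smp kernel right after login would show whether every home directory really gets mounted, as described in the list above.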

Much help would be appreciated. Even more, an smp kernel that works.

Comment 9 David Jansen 2004-02-27 12:27:59 UTC

I can confirm many of these findings; we have a couple of dual athlon
2000+ machines that exhibit this same behaviour after upgrading to FC1
last week (they worked fine under Redhat 8.0).

Just a question: I note this bug entry is fairly recent (and it is the
only one when searching for "dual athlon"). Does this mean the problem
only occurs using the most recent kernels? Or is it the same bug as
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=109497 which only
talks about dual PIII and dual Xeon machines?

Comment 10 Damon Gunther 2004-03-24 14:31:24 UTC

I would also like to confirm this SMP bug.  I have two boxes that are
running 2174, one is dual Athlon, one is dual Xeon.  Neither is
running an automounter, but both have permanent NFS mounts.  Both
also run a nightly cron job that manually mounts an NFS mount.
Checking the sa logs, the dual Xeon box crashed just after the cron
job started during the last crash this weekend.

The crashes on both are erratic.  The dual Athlon box has been up 6
days now, but there are times it will crash on back-to-back nights.
The same goes for the dual Xeon.  Also, both have twin hardware
running RH 7.3 and they have no problems.

One interesting data point here is that I have at least three P4 boxes
with Hyperthreading, so the SMP kernel is running to take advantage of
it and they are all stable.  They also have the permanent NFS mounts
and the nightly cron.  One box has been up 25 days.

On a side note, the dual Xeon has HT on, so it shows up as having 4
processors.

Comment 11 John Stokes 2004-03-25 20:48:35 UTC

I built 2.4.25 vanilla using all of the same configs in the 
configs/kernel-2.4.22-athlon-smp.config and I have had zero crashes.

I'm curious as to what is holding back the 2.4.22 revisions from this?
WTF?

Comment 12 Damon Gunther 2004-03-29 14:45:55 UTC

I just had a new single Xeon system (but a dual CPU motherboard) with
hyperthreading on crash over the weekend with the SMP kernel.  The
time of the crash was the same time of the cron that uses an NFS mount.

The P4 hyperthreading boxes running SMP are still stable.

Comment 13 Damon Gunther 2004-04-06 15:52:38 UTC

OK, an update here on some testing, and I think I'm on to something. 
On my test dual Athlon box, I installed the 2.6 kernel from Core2
test2 and it went a week without a crash.  But it logged a lot of APIC
errors, which got me thinking about APIC.  I went back to the 2174
version of 2.4 with the noapic boot option and it was stable last
night.  It needs more uptime before I believe it fixes the problem.

However, this APIC business would explain why I only have the problem
on dual CPU motherboards, independent of whether I have one or two
processors or HT on.  It also would explain drop outs for certain
devices causing the machine to hang (network cards for instance in my
case).  If the APIC controller remapped those devices, then those
would be the problem devices.

Can any of you guys try noapic on your boxes to see if it helps?  I've
been googling and I haven't seen one way or the other if it is a good
thing or not, but it certainly seems to fix a lot of SMP problems.

Comment 14 Carlos A. Villegas 2004-04-06 16:30:05 UTC

I've been using noapic since before posting the bug (check comment 5 
on this thread), and it makes no difference... I've installed FC2 
test 2 on a couple of the problematic boxes and they seem to be 
stable, although it's too soon for me to be convinced. I read in some 
of the links for bug 109497 (read comment 9), that the problem seems 
to come from upstream, maybe moving the kernel version to 2.4.24 or 
so will fix it, but haven't had the time to try it (and probably 
won't have it anytime soon...).

Carlos

Comment 15 Damon Gunther 2004-04-06 20:23:08 UTC

It's been a couple weeks since I read the whole thing, I should have
rechecked to see your noapic comment.  Thanks for the link to 109497,
I missed that one in my search.

109497 looks like the same basic problem, so this bug should probably
be marked a dup.  It has a lot of info for those that may have not
seen it.

Is there anyone at RedHat that can comment on the progress of finding
this bug?  Is it elusive or too complicated a fix to apply a patch?  I
could build vanilla kernels (which others have said worked fine) but
letting yum/rpm manage my kernel patches sure is much easier,
especially after all the security bugs found last year.

Comment 16 Alan Cox 2004-05-03 19:45:43 UTC

APIC errors normally indicate hardware problems on the apic bus.
Actual copies of the APIC errors would be useful here to take a look at.
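
One way to collect the requested messages is to filter the kernel log for lines mentioning an APIC error. A minimal sketch, assuming the messages end up in a syslog file such as /var/log/messages (the path and the helper name `apic_error_lines` are assumptions), and matching on the phrase only rather than on any particular message layout:

```python
def apic_error_lines(log_lines):
    """Return the lines that mention an APIC error, without trailing newlines."""
    return [
        line.rstrip("\n")
        for line in log_lines
        if "apic error" in line.lower()  # case-insensitive phrase match
    ]


def collect_from_file(path="/var/log/messages"):
    """Read a log file (path is an assumption) and keep only APIC error lines."""
    with open(path, errors="replace") as handle:
        return apic_error_lines(handle)
```

The output of `collect_from_file()` could then be pasted into the bug as-is, timestamps included, so the errors can be lined up against the hang times.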

Autofs sounds like a possible candidate here, but not really NFS alone -
lots of very stable NFS dual xeons and athlons around.

The only other dual athlon weirdness to watch for is that they can hang
if using IDE and you have no PS/2 mouse attached. That's dual athlon
specific and is a chipset erratum.

Comment 17 Damon Gunther 2004-05-03 20:02:34 UTC

Thanks for the info, Alan.  It seems that the low latency patch was
doing it, because the 2179 release and later is as stable as can be. 
I was going to give it more time before I posted that, but your
response prompted me to do it now.  Release 2188 has the fix for
everyone on the list here.

See bugzilla IDs 109497 and 109962.  This can probably be considered a
dup of the latter.

Comment 18 Carlos A. Villegas 2004-05-05 18:09:41 UTC

 
The 2188 kernel seems to be stable (I never tried the 2179). A system that kept 
crashing once or twice per day has been up for a couple of days now. For the record: it 
didn't have a mouse...

Comment 19 David Lawrence 2004-09-29 20:05:58 UTC

Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/
