Red Hat Bugzilla – Bug 116036
(SMP NFS AUTOMOUNT) smp kernels crash on dual athlon mp and dual opteron boxes
Last modified: 2007-11-30 17:10:36 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; Linux i686; U) Opera 7.21 [en]
Description of problem:
Our dual processor systems running Fedora Core 1 (32 bit version) are
showing horrible uptimes (about 1 day on some systems) before hanging
with no apparent reason. There are no logs that indicate the problem.
We've added the "debug" argument to the kernel to see if this helped,
but we still get no logs indicating a panic or any other kernel
event. Some systems have a tainted kernel, but some don't (and those
"clean kernels" only use standard modules), so I doubt the problem
lies outside the standard Fedora kernel.
The problem has been seen with several versions of the smp kernel,
since 2135. We have tried disabling apm, apic and acpi, with no
improvements. The hardware varies, but all of them are either Athlon
MP or Opteron processors (Running the 32 bit version of FC1), all
dual processor systems.
We're testing some dual opterons with the UP kernel to see if the
behaviour is better, so far they've been up for over 24 hours, so
I'll update in a few days.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Set up a dual processor machine (athlon mp or opteron)
2. Install Fedora Core 1
3. Leave running for as long as possible.
Actual Results: Uptimes of up to 5 days, but on some systems less than a day.
Expected Results: Far longer uptimes
I'm attaching /proc/cpuinfo for the two systems that seem to crash
more often (exhibit the worst uptimes with FC1, but had good uptimes
with RHL 9).
Created attachment 97762 [details]
/proc/cpuinfo files from a system that crashes about once per day
Created attachment 97763 [details]
/proc/cpuinfo of another system that also crashes very often (up to 2 times per day)
We have a dozen FC1 workstations with AMD 2400+ MP on Tyan 2466. We
have been experiencing the same issue.
We also have experienced lockups during shutdown at automount with
nfs mounts. We are not sure if these are related, but we have noticed
that FC1 is the worst stable release since RH5.
We have tried a few solutions that others have suggested on similar
crashing for SMP Xeon including adding the following to the
boot parameters: "noapic noacpi" also "ACPI=force APIC"
None of these have worked.
I noticed that I didn't send the exact kernel option we've tried:
ro debug acpi=off apm=off root=LABEL=/ noapic
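To confirm that options like the ones above actually reached the running kernel, the command line can be read back from /proc/cmdline. What follows is only a minimal sketch in Python, assuming a Linux system. The helper names parse_cmdline and read_cmdline are hypothetical, and quoted values containing spaces are not handled.

```python
def parse_cmdline(text):
    """Split a kernel command line into {option: value}.

    Bare flags such as "noapic" map to True; "key=value" tokens are
    split on the first "=" only, so "root=LABEL=/" keeps "LABEL=/".
    """
    options = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        options[key] = value if sep else True
    return options


def read_cmdline(path="/proc/cmdline"):
    """Read and parse the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return parse_cmdline(f.read())
```

Calling read_cmdline() on an affected box and checking that "noapic" is present and that "acpi" and "apm" are "off" rules out a bootloader entry that was edited but never used.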
We have also seen the autofs lockups on shutdown, but not always. I'm
not sure if also on single processor systems or not, but I think
we've seen it on single processor systems, which apparently don't
suffer from this instability. I just add this for completeness.
Today when a system crashed, I noticed that I could still ping it and
that nmap -sT would show port 22 open and the rest closed. However
ssh would fail. I thought this was interesting, since from the
network the machine seemed to be more or less ok, while it was
actually down; so I got a network dump from another machine to that
one. I'll attach its output. All I did was ping it, then tried to
ssh to it, canceled the ssh attempt, and then pinged it again. The
dump shows that the TCP session doesn't reach the ESTABLISHED state
on the crashed box, even though the handshake was successful.
I think that the reason there are no kernel dumps is because the
kernel is actually running, but locked in some strange state. I'm not
sure if this is a cause or effect of the problem, but it might also
be the reason for the lockups on autofs shutdown.
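The state described above (handshake completes, but the service never answers) can be told apart from a fully dead host with a small probe. This is a rough sketch in Python, not a diagnostic tool. The function name probe_tcp_service and the three result labels are made up for illustration, and it assumes the service sends a banner first, as sshd normally does.

```python
import socket


def probe_tcp_service(host, port=22, timeout=5.0):
    """Classify a TCP service as 'down', 'handshake-only', or 'responsive'.

    'handshake-only' means connect() succeeded (the remote kernel completed
    the three-way handshake) but no data arrived before the timeout, or the
    peer closed without sending anything.
    """
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "down"  # refused, unreachable, or connect timed out
    with conn:
        conn.settimeout(timeout)
        try:
            banner = conn.recv(256)  # sshd normally sends "SSH-2.0-..." first
        except OSError:  # socket.timeout is a subclass of OSError
            return "handshake-only"
    return "responsive" if banner else "handshake-only"
```

A host that answers ping and returns "handshake-only" here matches the symptom in this report: the network stack in the kernel still runs, but the daemon is never scheduled to accept the connection.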
Created attachment 97814 [details]
network dump described in previous entry
I have munged the real IP addresses, I have the full network dump (the actual
packets exchanged between the two systems). Please let me know if you think it
might be useful.
As time goes on, I'm really starting to think that this is related to
autofs (automount). To go one guess further, I would guess that we
are having issues with the kernel not handling NFS file locks very
well. Below is the basis of my assumptions:
-About two weeks ago one of our NFS servers went out to lunch. It
nearly crashed every Dual Athlon 2400+MP running FC1. This crash left
every machine ping-able, but non-responsive.
-Our process bound machines crash less than our network and disk I/O
bound machines.
-Only one of our many Dual Athlons running FC1 is not in our
office and does not use automount or any NFS crossmounts. This
machine has NEVER crashed.
-Crashing occurs less if no user is logged in to the GUI console.
This routes back to NFS file locking issues and how the kernel is not
handling it. When a user logs into Gnome, gconf (I think) places
several file locks that the NFS servers haven't liked in the past.
These issues of file locking went away as our NFS servers and clients
matured. But I think we are revisiting this issue because there is
something wrong with the kernel.
-Logging into Gnome somehow causes the automounter to mount every
mount within /home instead of just the user's mount. This
characteristic is not present in non-smp kernels.
-Non smp kernels have not crashed and have not had the same issues
-On shutdown the system hangs (but is still pingable) at stopping the
automount daemon. On non-smp kernels this may take a moment to unload
each mount, but with smp kernels it is almost guaranteed to lock.
Maybe the kernel is handling too many of these at once and going into
a weird state?
-Even though FC2 test 1 is incomplete and has some GUI issues, we
have not seen any misbehavior in automount of NFS.
In conclusion, all fingers point to how the kernel is handling (or
not) NFS crossmounts (file locking) and possibly a problem with its
automounter unmounting shares, but most likely just the kernel
handling (or deadlocking) on unmounting in general.
Much help would be appreciated. Even more, an smp kernel that works.
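The observation that a Gnome login mounts everything under /home can be checked by listing the NFS and autofs entries in /proc/mounts before and after logging in. A minimal sketch in Python follows. The function name network_mounts is hypothetical, and the sample paths in the usage note are invented, not taken from any of the systems in this report.

```python
def network_mounts(mounts_text, fstypes=("nfs", "autofs")):
    """Return (mount_point, fstype) pairs from /proc/mounts-style text.

    Each line is expected to look like:
        device mount_point fstype options dump pass
    Lines with fewer than three fields are ignored.
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] in fstypes:
            found.append((fields[1], fields[2]))
    return found
```

Feeding it the contents of /proc/mounts once at the login screen and once after a user logs in, then comparing the two lists, would show whether only one home directory (for example a made-up /home/alice) was mounted or every entry in the map.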
I can confirm many of these findings, we have a couple of dual athlon
2000+ machines that exhibit this same behaviour after upgrading to FC1
last week (they worked fine under Redhat 8.0).
Just a question: I note this bug entry is fairly recent (and it is the
only one when searching for "dual athlon"). Does this mean the problem
only occurs using the most recent kernels? Or is it the same bug as
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=109497 which only
talks about dual PIII and dual Xeon machines?
I would also like to confirm this SMP bug. I have two boxes that are
running 2174, one is dual Athlon, one is dual Xeon. Both are not
running an automounter, but they do have permanent NFS mounts. Both
also run a nightly cron that manually mounts an NFS mount. Checking
the sa logs, the dual Xeon box crashed just after the cron started on
the last crash this weekend.
The crashes on both are erratic. The dual Athlon box has been up 6
days now, but there are times it will crash back to back nights. The
same goes for the dual Xeon. Also, both have twin hardware running RH
7.3 and they have no problems.
One interesting data point here is that I have at least three P4 boxes
with Hyperthreading, so the SMP kernel is running to take advantage of
it and they are all stable. They also have the permanent NFS mounts
and the nightly cron. One box has been up 25 days.
On a side note, the dual Xeon has HT on, so it shows up as having 4
I built 2.4.25 vanilla using all of the same configs in the
configs/kernel-2.4.22-athlon-smp.config and I have had zero crashes.
I'm curious as to what is holding back the 2.4.22 revisions from this?
I just had a new single Xeon system (but a dual CPU motherboard) with
hyperthreading on crash over the weekend with the SMP kernel. The
time of the crash was the same time of the cron that uses an NFS mount.
The P4 hyperthreading boxes running SMP are still stable.
OK, an update here on some testing, and I think I'm on to something.
On my test dual Athlon box, I installed the 2.6 kernel from Core2
test2 and it went a week without a crash. But it logged a lot of APIC
errors, which got me thinking about APIC. I went back to the 2174
version of 2.4 with the noapic boot option and it was stable last
night. It needs more uptime before I believe it fixes the problem.
However, this APIC business would explain why I only have the problem
on dual CPU motherboards, independent of whether I have one or two
processors or HT on. It also would explain drop outs for certain
devices causing the machine to hang (network cards for instance in my
case). If the APIC controller remapped those devices, then those
would be the problem devices.
Can any of you guys try noapic on your boxes to see if it helps? I've
been googling and I haven't seen one way or the other if it is a good
thing or not, but it certainly seems to fix a lot of SMP problems.
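For anyone trying this, one way to see whether noapic took effect is to look at which interrupt controller each IRQ is routed through in /proc/interrupts. The sketch below is in Python and assumes the 2.4-era layout, where the controller name (typically XT-PIC, IO-APIC-edge or IO-APIC-level) follows the per-CPU counters. Newer kernels format this file differently, and the function name interrupt_controllers is hypothetical.

```python
from collections import Counter


def interrupt_controllers(interrupts_text):
    """Count IRQ lines per controller type in /proc/interrupts-style text.

    Assumed line layout (2.4-era kernels):
        IRQ:  count_cpu0  [count_cpu1 ...]  controller  device(s)
    Lines without a controller column (NMI, LOC, ERR) are skipped.
    """
    counts = Counter()
    for line in interrupts_text.splitlines():
        fields = line.split()
        if not fields or not fields[0].endswith(":"):
            continue  # header row ("CPU0 CPU1") or blank line
        rest = fields[1:]
        i = 0
        while i < len(rest) and rest[i].isdigit():
            i += 1  # skip the per-CPU interrupt counters
        if i < len(rest):
            counts[rest[i]] += 1
    return counts
```

With noapic in effect one would expect only XT-PIC entries. Any remaining IO-APIC entries suggest the option was not applied.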
I've been using noapic since before posting the bug (check comment 5
on this thread), and it makes no difference... I've installed FC2
test 2 on a couple of the problematic boxes and they seem to be
stable, although it's too soon for me to be convinced. I read in some
of the links for bug 109497 (read comment 9), that the problem seems
to come from upstream, maybe moving the kernel version to 2.4.24 or
so will fix it, but haven't had the time to try it (and probably
won't have it anytime soon...).
It's been a couple weeks since I read the whole thing, I should have
rechecked to see your noapic comment. Thanks for the link to 109497,
I missed that one in my search.
109497 looks like the same basic problem, so this bug should probably
be marked a dup. It has a lot of info for those that may have not
Is there anyone at RedHat that can comment on the progress of finding
this bug? Is it elusive or too complicated a fix to apply a patch? I
could build vanilla kernels (which others have said worked fine) but
letting yum/rpm manage my kernel patches sure is much easier,
especially after all the security bugs found last year.
APIC errors normally indicate hardware problems on the apic bus.
Actual copies of the APIC errors would be useful here to take a look
Autofs sounds a possible candidate here, but not really NFS alone -
lots of very stable NFS dual xeons and athlons around.
The only other dual athlon weirdness to watch is they can hang if
using IDE and you have no PS/2 mouse attached. That's dual athlon
specific and is a chipset errata.
Thanks for the info, Alan. It seems that the low latency patch was
doing it, because the 2179 release and later is as stable as can be.
I was going to give it more time before I posted that, but your
response prompted me to do it now. Release 2188 has the fix for
everyone on the list here.
See bugzilla IDs 109497 and 109962. This can probably be considered a
dup of the latter.
The 2188 kernel seems to be stable (I never tried the 2179). A system that kept
crashing once or twice per day has been up for a couple of days now. For the record: it
didn't have a mouse...
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases,
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/