From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; Linux i686; U) Opera 7.21 [en]

Description of problem:
Our dual processor systems running Fedora Core 1 (32 bit version) are showing horrible uptimes (about 1 day on some systems) before hanging for no apparent reason. There are no logs that indicate the problem. We've added the "debug" argument to the kernel to see if this helped, but we still get no logs indicating a panic or any other kernel event. Some systems have a tainted kernel, but some don't (and those "clean kernels" only use standard modules), so I doubt the problem lies outside the standard Fedora kernel. The problem has been seen with several versions of the smp kernel, since 2135. We have tried disabling apm, apic and acpi, with no improvement. The hardware varies, but all of the machines are dual processor systems with either Athlon MP or Opteron processors (running the 32 bit version of FC1). We're testing some dual Opterons with the UP kernel to see if the behaviour is better. So far they've been up for over 24 hours, so I'll update in a few days.

Version-Release number of selected component (if applicable):
2.4.22-1.2166.nptlsmp

How reproducible:
Always

Steps to Reproduce:
1. Set up a dual processor machine (Athlon MP or Opteron)
2. Install Fedora Core 1
3. Leave running for as long as possible.

Actual Results: Uptimes of up to 5 days, but on some systems less than a day.

Expected Results: Far longer uptimes

Additional info:
I'm attaching /proc/cpuinfo for the two systems that seem to crash most often (they exhibit the worst uptimes with FC1, but had good uptimes with RHL 9).
Created attachment 97762 /proc/cpuinfo files from a system that crashes about once per day
Created attachment 97763 /proc/cpuinfo of another system that also crashes very often (up to 2 times per day)
We have a dozen FC1 workstations with AMD 2400+ MP on Tyan 2466. We have been experiencing the same issue. We have also experienced lockups during shutdown at automount with NFS mounts. We are not sure if these are related, but we have noticed that FC1 is the least stable release since RH5.
We have tried a few solutions that others have suggested for similar crashes on SMP Xeon systems, including adding the following to the boot parameters: "noapic noacpi" and also "ACPI=force APIC". None of these have worked.
I noticed that I didn't send the exact kernel options we've tried: ro debug acpi=off apm=off root=LABEL=/ noapic We have also seen the autofs lockups on shutdown, but not always. I'm not certain, but I think we've seen them on single processor systems as well, which apparently don't suffer from this instability. I just add this for completeness.
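For anyone who wants to double-check that options like these actually reached the running kernel, here is a minimal Python sketch that splits a kernel command line into bare flags and key=value pairs. On Linux the live command line can be read from /proc/cmdline. The function name parse_cmdline is only an illustration and is not part of any tool mentioned in this report.

```python
def parse_cmdline(text):
    """Split a kernel command line into {option: value}; bare flags map to True."""
    options = {}
    for token in text.split():
        # Split only on the first '=' so values such as LABEL=/ stay intact.
        key, sep, value = token.partition("=")
        options[key] = value if sep else True
    return options


if __name__ == "__main__":
    # The options quoted in the comment above; on a live system,
    # read the string from /proc/cmdline instead.
    opts = parse_cmdline("ro debug acpi=off apm=off root=LABEL=/ noapic")
    for name in ("noapic", "acpi", "apm"):
        print(name, "->", opts.get(name, "not set"))
```

This only shows what was passed at boot. It says nothing about whether the kernel honoured each option.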
Today when a system crashed, I noticed that I could still ping it and that nmap -sT would show port 22 open and the rest closed. However, ssh would fail. I thought this was interesting, since from the network the machine seemed to be more or less OK while it was actually down, so I captured a network dump from another machine talking to that one. I'll attach its output. All I did was ping it, then try to ssh to it, cancel the ssh attempt, and then ping it again. The dump shows that the TCP session doesn't reach the ESTABLISHED state on the crashed box, even though the handshake was successful. I think the reason there are no kernel dumps is that the kernel is actually running, but locked in some strange state. I'm not sure if this is a cause or an effect of the problem, but it might also be the reason for the lockups on autofs shutdown.
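The ping/nmap/ssh observation can be turned into a small check that others can run against a hung box. This is a rough Python sketch, not a diagnostic tool: it assumes that a healthy sshd sends a banner starting with "SSH-" shortly after the connection opens, and the function name probe_ssh and the three result strings are made up for illustration.

```python
import socket


def probe_ssh(host, port=22, timeout=5.0):
    """Classify a TCP service as 'unreachable', 'no-banner', or 'banner'."""
    try:
        # Succeeds as soon as the remote kernel completes the handshake,
        # even if no process ever accepts the connection.
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "unreachable"
    with sock:
        sock.settimeout(timeout)
        try:
            data = sock.recv(64)
        except OSError:
            # Covers timeouts and resets: connected, but nothing answered.
            return "no-banner"
    return "banner" if data.startswith(b"SSH-") else "no-banner"


if __name__ == "__main__":
    import sys
    print(probe_ssh(sys.argv[1]))
```

A "no-banner" result against a box that still answers ping would match the state described here: the network stack responds, but the service behind the port never does.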
Created attachment 97814 network dump described in previous entry. I have munged the real IP addresses. I also have the full network dump (the actual packets exchanged between the two systems). Please let me know if you think it might be useful.
As time goes on, I'm really starting to think that this is related to autofs (automount). To go one guess further, I would guess that the kernel is not handling NFS file locks very well. Here is the basis for my assumptions:
- About two weeks ago one of our NFS servers went out to lunch. It crashed nearly every dual Athlon 2400+ MP running FC1. The crash left every machine pingable but non-responsive.
- Our process-bound machines crash less often than our network- and disk-I/O-bound machines.
- Only one of our many dual Athlons running FC1 is outside our office and uses neither automount nor any NFS crossmounts. This machine has NEVER crashed.
- Crashing occurs less often if no user is logged in to the GUI console. This points back to NFS file locking and how the kernel is not handling it. When a user logs into Gnome, gconf (I think) places several file locks that the NFS servers haven't liked in the past. These file locking issues went away as our NFS servers and clients matured, but I think we are revisiting them because there is something wrong with the kernel.
- Logging into Gnome somehow causes the automounter to mount every mount within /home instead of just the user's mount. This characteristic is not present in non-SMP kernels.
- Non-SMP kernels have not crashed and have not had the same issues with automount.
- On shutdown the system hangs (but is still pingable) at stopping the automount daemon. On non-SMP kernels this may take a moment to unload each mount, but with SMP kernels it is almost guaranteed to lock. Maybe the kernel is handling too many of these at once and going into a weird state?
- Even though FC2 test 1 is incomplete and has some GUI issues, we have not seen any misbehavior in automount of NFS.
In conclusion, all fingers point to how the kernel is handling (or not handling) NFS crossmounts (file locking), and possibly to a problem with its automounter unmounting shares, but most likely just to the kernel handling (or deadlocking on) unmounting in general. Much help would be appreciated. Even more, an SMP kernel that works.
I can confirm many of these findings. We have a couple of dual Athlon 2000+ machines that exhibit this same behaviour after upgrading to FC1 last week (they worked fine under Red Hat 8.0). Just a question: I note this bug entry is fairly recent (and it is the only one that comes up when searching for "dual athlon"). Does this mean the problem only occurs with the most recent kernels? Or is it the same bug as https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=109497 which only talks about dual PIII and dual Xeon machines?
I would also like to confirm this SMP bug. I have two boxes that are running 2174: one is a dual Athlon, the other a dual Xeon. Neither runs an automounter, but both have permanent NFS mounts. Both also run a nightly cron job that manually mounts an NFS mount. Checking the sa logs, the dual Xeon box's most recent crash, this weekend, came just after the cron job started. The crashes on both are erratic. The dual Athlon box has been up 6 days now, but there are times it will crash on back-to-back nights. The same goes for the dual Xeon. Also, both have twin hardware running RH 7.3, and those twins have no problems. One interesting data point here is that I have at least three P4 boxes with Hyperthreading, so they run the SMP kernel to take advantage of it, and they are all stable. They also have the permanent NFS mounts and the nightly cron job. One box has been up 25 days. On a side note, the dual Xeon has HT on, so it shows up as having 4 processors.
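Since the thread mixes true dual-CPU boxes with Hyperthreading boxes that merely look like SMP, it may help to count both when reporting a system. Below is a minimal Python sketch that reads /proc/cpuinfo text and returns the number of logical processors and, where available, the number of distinct physical packages. It assumes the x86 layout with "processor" and "physical id" fields. Older kernels may not report "physical id" at all, in which case the second value is None. The name summarize_cpuinfo is hypothetical.

```python
def summarize_cpuinfo(text):
    """Return (logical_cpus, physical_packages or None) from /proc/cpuinfo text."""
    logical = 0
    packages = set()
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processor entries
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            packages.add(value)
    # No 'physical id' lines means the kernel did not report packages.
    return logical, (len(packages) if packages else None)


if __name__ == "__main__":
    with open("/proc/cpuinfo") as f:
        print(summarize_cpuinfo(f.read()))
```

On a dual Xeon with HT enabled, like the one described above, one would expect four logical processors across two packages, assuming the kernel exposes the package field.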
I built 2.4.25 vanilla using all of the same configs in the configs/kernel-2.4.22-athlon-smp.config and I have had zero crashes. I'm curious as to what is holding back the 2.4.22 revisions from this? WTF?
I just had a new single Xeon system (but with a dual CPU motherboard) with hyperthreading on crash over the weekend with the SMP kernel. The time of the crash was the same as the time of the cron job that uses an NFS mount. The P4 hyperthreading boxes running SMP are still stable.
OK, an update here on some testing, and I think I'm on to something. On my test dual Athlon box, I installed the 2.6 kernel from Core 2 test2 and it went a week without a crash. But it logged a lot of APIC errors, which got me thinking about APIC. I went back to the 2174 version of 2.4 with the noapic boot option and it was stable last night. It needs more uptime before I believe it fixes the problem. However, this APIC business would explain why I only have the problem on dual CPU motherboards, independent of whether I have one or two processors or HT on. It would also explain dropouts of certain devices causing the machine to hang (network cards, for instance, in my case). If the APIC controller remapped those devices, then those would be the problem devices. Can any of you guys try noapic on your boxes to see if it helps? I've been googling and I haven't found a clear answer either way on whether it is a good idea, but it certainly seems to fix a lot of SMP problems.
I've been using noapic since before posting the bug (check comment 5 on this thread), and it makes no difference... I've installed FC2 test 2 on a couple of the problematic boxes and they seem to be stable, although it's too soon for me to be convinced. I read in some of the links for bug 109497 (read comment 9), that the problem seems to come from upstream, maybe moving the kernel version to 2.4.24 or so will fix it, but haven't had the time to try it (and probably won't have it anytime soon...). Carlos
It's been a couple weeks since I read the whole thing, I should have rechecked to see your noapic comment. Thanks for the link to 109497, I missed that one in my search. 109497 looks like the same basic problem, so this bug should probably be marked a dup. It has a lot of info for those that may have not seen it. Is there anyone at RedHat that can comment on the progress of finding this bug? Is it elusive or too complicated a fix to apply a patch? I could build vanilla kernels (which others have said worked fine) but letting yum/rpm manage my kernel patches sure is much easier, especially after all the security bugs found last year.
APIC errors normally indicate hardware problems on the APIC bus. Actual copies of the APIC errors would be useful here to take a look at. Autofs sounds like a possible candidate here, but not really NFS alone: there are lots of very stable NFS dual Xeons and Athlons around. The only other dual Athlon weirdness to watch for is that they can hang if you are using IDE and have no PS/2 mouse attached. That's dual Athlon specific and is a chipset erratum.
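To collect the actual APIC error lines for a report, something along these lines could be used. This is only a sketch in Python: it keeps any log line containing the phrase "APIC error" (case-insensitive) and does not assume anything else about the message format. The helper name extract_apic_errors is made up, and the script expects kernel log text on standard input, for example piped from dmesg or from a saved messages file.

```python
import re
import sys

# Match on the phrase only; the rest of the line is passed through untouched.
APIC_ERROR = re.compile(r"APIC error", re.IGNORECASE)


def extract_apic_errors(lines):
    """Return the log lines that mention an APIC error, stripped, in order."""
    return [line.strip() for line in lines if APIC_ERROR.search(line)]


if __name__ == "__main__":
    # Usage (assumed): dmesg | python3 this_script.py
    for hit in extract_apic_errors(sys.stdin):
        print(hit)
```

Pasting the matching lines verbatim, along with the kernel version that produced them, gives a better basis for judging whether the APIC bus is involved.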
Thanks for the info, Alan. It seems that the low latency patch was doing it, because the 2179 release and later is as stable as can be. I was going to give it more time before I posted that, but your response prompted me to do it now. Release 2188 has the fix for everyone on the list here. See bugzilla IDs 109497 and 109962. This can probably be considered a dup of the latter.
The 2188 kernel seems to be stable (I never tried the 2179). A system that kept crashing once or twice per day has been up for a couple of days now. For the record: it didn't have a mouse...
Thanks for the bug report. However, Red Hat no longer maintains this version of the product. Please upgrade to the latest version and open a new bug if the problem persists. The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, and if you believe this bug is interesting to them, please report the problem in the bug tracker at: http://bugzilla.fedora.us/