Bug 109497 - Default Fedora Core 1 SMP Kernel Hang on Dual Xeon System
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 1
Hardware: i686
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Dave Jones
QA Contact:
URL:
Whiteboard:
Duplicates: 113148 118990
Depends On:
Blocks:
Reported: 2003-11-08 17:57 UTC by Ugo Viti
Modified: 2015-01-04 22:03 UTC (History)
CC List: 39 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2004-06-19 23:50:06 UTC
Type: ---
Embargoed:


Attachments
serial console logfile (24.48 KB, text/plain), 2003-12-04 12:35 UTC, David Alden
possible fix (1.60 KB, patch), 2004-01-14 21:07 UTC, David Woodhouse
messages log from crash and using alt-sysrq-p/alt-sysrq-t keys (29.64 KB, text/plain), 2004-01-17 19:52 UTC, Jerry DaSilva
sysrq-t trace of a frozen machine #1 (62.17 KB, text/plain), 2004-01-23 02:02 UTC, Lars Damerow
sysrq-t trace of a frozen machine #2 (67.81 KB, text/plain), 2004-01-23 02:02 UTC, Lars Damerow
sysrq-t trace of a frozen machine #3 (67.52 KB, text/plain), 2004-01-23 02:03 UTC, Lars Damerow
boot messages (15.62 KB, text/plain), 2004-02-04 00:46 UTC, Norman Gaywood
sysrq-M (970 bytes, text/plain), 2004-02-04 00:47 UTC, Norman Gaywood
sysrq-P (30.64 KB, text/plain), 2004-02-04 00:49 UTC, Norman Gaywood
sysrq-T (30.64 KB, text/plain), 2004-02-04 00:50 UTC, Norman Gaywood
sysrq-P (3.37 KB, text/plain), 2004-02-04 03:03 UTC, Norman Gaywood

Description Ugo Viti 2003-11-08 17:57:18 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5) Gecko/20031016

Description of problem:
I've installed Fedora Core 1 in a new Server System having the
following hardware configuration:

Motherboard: ASUS PR-DLS533/2GBL/SCSI1030
CPUs: 2 Intel Xeon @ 2.4 GHz (with Hyper-Threading enabled)
RAM: 512MBx2 ECC Memory (1024 MB Total)

For a full detailed MotherBoard hardware list:
http://www.asus.com/prog/spec.asp?m=PR-DLS533&langs=01

The system boots cleanly and seems to work correctly, but after some
minutes or hours (at random) it crashes completely. The server's
network IP address becomes unreachable. At the server console the
keyboard still seems to respond, but pressing Caps Lock, Num Lock or
Scroll Lock does not light the LED. I can log in, but as soon as I
launch a program (like top) the system hangs completely and a hard
reboot is needed.

If I boot the system using kernel-2.4.22-1.2115.nptl (the non-SMP
version), it doesn't have any stability problems.

Before posting this bug report I tried everything:
I installed kernel 2.6.0-test9 in SMP mode and used it for hours
without a crash or anything strange.
Then I tried the vanilla 2.4.22 kernel compiled with SMP support (it
shows 4 hyperthreaded CPUs).
I ran the stress test for 2 hours without any crash; the system ran
rock solid.
The stress command line used was:
# stress --cpu 8 --io 4 --vm 2 --vm-bytes 128M --hdd 2 --timeout 120m

Stress is a tool downloadable from:
http://weather.ou.edu/~apw/projects/stress/

I hope these descriptions help to solve this bug.

Version-Release number of selected component (if applicable):
kernel-smp-2.4.22-1.2115.nptl

How reproducible:
Always

Steps to Reproduce:
1. Boot using kernel-smp-2.4.22-1.2115.nptl
2. Use the system
3. After some minutes the system hangs completely

Additional info:

Comment 1 Ivo 2003-11-12 11:48:31 UTC
We've had similar freezes on 4 different SMP machines. These machines
have either dual Pentium III CPUs and a VIA 82C686 chipset, or dual
Pentium 4 Xeon CPUs and an Intel E7500 chipset. The freeze typically
occurs within half a day, unrelated to any particular activity. All
machines were running stably under various Red Hat 9 kernels.

Comment 2 Herbert Gasiorowski 2003-11-13 07:51:01 UTC
Here, 2 P4 Xeon machines (one CPU each, with hyperthreading) freeze
after some hours using Fedora 1 final.
One of these machines, now with hyperthreading disabled, has been
running for 20 hours without problems (including 5 hours of "stress"-ing).

I encountered such a random freeze with Fedora beta2 on one similar
machine. But with Fedora beta3 everything ran fine on 5 of these
machines for about 3 weeks!

Comment 3 Herbert Gasiorowski 2003-11-21 15:16:14 UTC
Just before I leave for the weekend:

One hyperthreading host has been running for more than 4 hours
with "stress" nearly all the time ...

Maybe the kernel switch "noapic" has caused this ...?

$ cat /proc/cmdline 
ro root=LABEL=/ hdd=ide-scsi rhgb noapic
$ uname -r
2.4.22-1.2115.nptlsmp
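
Checking /proc/cmdline by eye, as above, can be wrapped in a small
function when several machines need the same check. This is only a
sketch: has_kernel_param is a hypothetical helper name, not a tool
shipped with Fedora, and it matches whole words on the command line
(either "noapic" or the name part of "name=value").

```shell
#!/bin/bash
# Return 0 if PARAM appears on the kernel command line, 1 otherwise.
# The second argument is optional; when omitted, /proc/cmdline is read.
has_kernel_param() {
    local param=$1
    local cmdline=${2:-$(cat /proc/cmdline)}
    local words word
    read -r -a words <<< "$cmdline"
    for word in "${words[@]}"; do
        # Match a bare flag ("noapic") or the name of "name=value".
        if [ "$word" = "$param" ] || [ "${word%%=*}" = "$param" ]; then
            return 0
        fi
    done
    return 1
}
```

For example, `has_kernel_param noapic && echo "booted with noapic"`
reports on the running kernel.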


Comment 4 Nuno Higgs 2003-11-22 10:31:15 UTC
Dual P3 here, with 2 GB RAM, Tyan board.

SMP enabled, noapic not present, and the system hangs within about
2 days of a reboot.
It hangs especially when accessing Samba mount points exported from
other systems to this one.

Comment 5 Herbert Gasiorowski 2003-11-24 07:56:28 UTC
Back from the weekend, and after just 5 minutes of load the machine hangs.

Thus: "noapic" does not help.

Comment 6 Nuno Higgs 2003-11-24 12:19:28 UTC
w
12:19:32  up 1 day, 15:58,  6 users,  load average: 1.40, 1.54, 1.64

cat /proc/cmdline
ro root=/dev/hda2 noapic

And still up... probably will hang after sending this mail :(
Will keep reporting

Comment 7 Rick Weatherly 2003-11-24 14:18:41 UTC
I'm seeing the same problem with the nptl SMP kernel; it just
crashed about half an hour ago.  Dual Pentium III 650 MHz.  I thought
it seemed to be X related, since almost all of the hangs happened when
logging in under GNOME.  I switched to runlevel 3 last night, but did
have VNC running.  It hung right after I connected from work.

I will switch to non-SMP tonight and do some testing to see if it helps.

Comment 8 Nuno Higgs 2003-11-24 16:31:12 UTC
I've found this on the fedora mailing-list:


On Tue, 2003-11-18 at 04:43, Joseph M Bironas wrote:

> I'm not sure. I know that I'm getting two processors to boot now. Top
> reports both processors, and I get two little penguins on FB boot
> -always an important indicator.

I think this could be the CONFIG_NR_CPUS patch that's causing this.
It makes the assumption that APICs are contiguously numbered,
so if you have them sparsely numbered, you could end up with some
of them being unused.

This would explain several similar bugs in bugzilla too.
For those that want to test, setting this option to 32 should
restore to the previous behaviour.

        Dave
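
To see which value a given kernel was actually built with, the option
can be read from that kernel's config file. A minimal sketch, assuming
the config is installed as /boot/config-<version> (the usual place for
Red Hat and Fedora kernel packages, but an assumption here);
config_value is a hypothetical helper name.

```shell
#!/bin/bash
# Print the value of a "NAME=value" option from a kernel config file.
# Prints nothing when the option is absent or "is not set".
# FILE defaults to the config of the running kernel (assumed location).
config_value() {
    local name=$1
    local file=${2:-/boot/config-$(uname -r)}
    sed -n "s/^${name}=//p" "$file"
}
```

For example, `config_value CONFIG_NR_CPUS` prints the CPU limit
compiled into the running kernel, if the config file is present.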






Comment 9 Nuno Higgs 2003-11-26 16:03:23 UTC
My system hung again, 10 minutes after mounting a network filesystem.
It had been running for almost 4 days at 90% load.

Comment 10 Rick Weatherly 2003-12-02 03:56:20 UTC
I've built a stock 2.4.23 kernel with SMP support.

CONFIG_SMP=y
CONFIG_NR_CPUS=2

This was done on 11/30/2003, and the system was booted from this
kernel around 12:00pm.  I have several SMB mounts and also have X and
Xvnc running.  The system has been stable now for over 33 hours, which
is a first for an SMP kernel.  I ran stress today for quite a while,
and built a complete new kernel (in an xterm, under Xvnc) to give
the system a workout.  This was done with both X and Xvnc running as
well as the SMB mounts active.  I also tried stopping and starting X
several times, which would usually cause the freeze, but it has not
done so yet.

I've probably just jinxed myself.

-- Rick

Comment 11 Nuno Higgs 2003-12-02 20:31:46 UTC
Hmmm the problem is really at a netfs/kernel support level.
I'm only running local filesystems and the system seems rock solid

#w
 20:32:25  up 6 days,  4:35,  5 users,  load average: 2.16, 2.86, 2.34



Comment 12 Ivo 2003-12-03 07:59:03 UTC
Installed the new kernel-smp-2.4.22-1.2129.nptl yesterday, in the
morning the machine was frozen as usual. 
A problem with nfs seems likely to me too, we've had lots of freezes
overnight when the machine is not doing anything except for an
occasional mount/umount. Typically the last syslog message is from the
automounter.

Comment 13 David Alden 2003-12-04 12:35:22 UTC
Created attachment 96334 [details]
serial console logfile

Hi,
  Here's a me too.  Single 3.2GHz P4 w/hyperthreading enabled.  If I boot the
UP kernel, the system runs for days.  If I boot the SMP kernel, the system
doesn't last more than 1 day.  When it locks up, I can still ping it, but I
can't do anything else.  I attached a serial console; no error messages or
anything.  I'll attach the output of the SysRq keys for showMem, showPc and
showTasks; hopefully that'll help.
...dave

ps  This is with both the 2115 and 2129 kernels.

Comment 14 Roger Strandberg 2003-12-05 17:40:54 UTC
My system hangs after I change NetVault server settings; it hangs totally.
I also have NFS, and I'm not sure if it was running SMP, but that is
first in the boot list (the machine is at the customer's site).

/Roger

Comment 15 Nuno Higgs 2003-12-08 22:33:25 UTC
After updating to the new kernel, the problem still remains...
If I don't have any netfs mounts, the system runs like Gibraltar... a
rock ;)

Comment 16 Rick Weatherly 2003-12-08 23:54:00 UTC
It does appear to be netfs related, and in the RedHat specific
kernels.  I switched to a stock 2.4.23 smp kernel 8 days ago and the
system has been up continuously since then.  I use it to backup a
couple of other W2k boxes over SMB shares daily, so there has been
considerable network related file activity.  I have no NFS mounts.

Comment 17 Nuno Higgs 2003-12-09 00:11:03 UTC
I believe the problem is that Samba NFS mounts have a timeout
delay.
If a Samba mount has been inactive too long and you try to access
it, whether with df or anything else, the kernel will send a
connection retry. (You can see this using dmesg after the df in that
condition.)

I don't know why, but after a while the kernel is not able to
send this retry, and the system hangs in a kernel panic, as it
would if a physical disk failed.

Comment 18 Adam Bertsch 2003-12-11 22:34:47 UTC
I have a similar bug, filed as #111527, which after careful reading
of this bug is likely the same bug.  System is a Dell PowerEdge 1650
with dual 1.4GHz Pentium III CPUs.  Crashes happen on mount/unmount
during boot and shutdown.  Local filesystems only.  I'll try some
additional stressing of mount/unmount after business hours today. 
System has been rock stable on the non-SMP kernel.

Comment 19 Matthew Cormie 2003-12-12 00:22:08 UTC
I can duplicate this problem on a Dell PowerEdge 2650 Dual Xeon 2.4GHz.  The 
system hangs after 'Probing Modules' during bootup when using the smp kernels ( 
2.4.22-1.2115.nptlsmp and 2.4.22-1.2129.nptlsmp ).

The non-smp kernels ( 2.4.22-1.2115.nptl and 2.4.22-1.2129.nptl ) seem to be 
working well.

Comment 20 Joshua M. Thompson 2003-12-13 02:41:49 UTC
Same problem here with a fresh install of Fedora Core 1 + all updates.
System is a Dell PowerEdge 2650 with dual 3.06 GHz Xeon processors and
2 gig of RAM. Any attempt to boot SMP will freeze solid, usually at
"Mounting local filesystems" or "Enabling file system quotas."

This is a dev system so if there is a test kernel to try I will be
happy to give it a shot. When I get a chance I'll also see if I can
get some debug info using serial console.



Comment 21 MGeiger 2003-12-15 04:03:04 UTC
Some descriptions of this bug indicate it may also be related to
#109962, in which SMP kernel hangs during unmounting of filesystems in
shutdown.

Comment 22 Ivo 2003-12-18 08:34:43 UTC
Tried the 2332 and 2335 kernels from updates-testing.
2332 ran almost two days before hanging; 
2335 oopses (attempting to kill idle task) on boot.

Comment 23 Joshua M. Thompson 2003-12-22 14:18:11 UTC
Ok the problem is definitely USB, at least in my case. I've also
reproduced this on a much older Dell PowerEdge with dual 600 MHz P3
processors. The common link between the two machines is the USB chipset:

00:0f.2 USB Controller: ServerWorks OSB4/CSB5 OHCI USB Controller (rev
04) (prog-if 10 [OHCI])
        Subsystem: ServerWorks OSB4/CSB5 OHCI USB Controller
        Flags: bus master, medium devsel, latency 32, IRQ 11
        Memory at feb00000 (32-bit, non-prefetchable) [size=4K]

On both machines on which I've had the boot lockups the problem goes
away if I comment out all the usb controllers in /etc/modules.conf. So
it seems that on ServerWorks chipsets the OHCI driver is not SMP safe.
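
The workaround above (commenting out the USB controllers) can be
sketched as a one-line filter. This assumes the controllers are loaded
through "alias usb-controller..." lines in /etc/modules.conf, which is
an assumption about the file's layout and not something this comment
spells out. disable_usb_aliases is a hypothetical helper name, and it
writes to a separate output file so the original stays untouched.

```shell
#!/bin/bash
# Copy SRC to DST, prefixing every "alias usb-controller..." line with
# "#" so the module is no longer loaded. All other lines are unchanged.
disable_usb_aliases() {
    local src=$1 dst=$2
    sed 's/^[[:space:]]*alias[[:space:]]\{1,\}usb-controller/#&/' "$src" > "$dst"
}
```

For example, `disable_usb_aliases /etc/modules.conf /tmp/modules.conf.new`
produces a candidate file that can be reviewed before it replaces the
real one.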


Comment 24 Nuno Higgs 2003-12-23 10:34:42 UTC
I've been running almost for 19 days.... and no hangs... but no nfs on
any mount point....

Comment 25 Eric André 2003-12-23 12:15:05 UTC
Another me too (2x PIII 700 on a Tyan mobo). And yes, it definitely
seems SMB and/or NFS related
(the hang happens with both services).

Comment 26 Roland Roberts 2003-12-25 05:46:56 UTC
Another "me too" and like others I suspect NFS.  System is dual Xeon
2.4GHz, 2GB RAM, M/B is SuperMicro X5DAE.  Home directories are
mounted via NFS from elsewhere.  Samba is running, but nothing is
using it at present.  Nothing that looks like a "crash," i.e., no
"oopses" even when it hangs at the console.  Most recent hang was on
reboot; it hung while shutting down autofs.

I've just installed kernel 2135 and will see what happens now.

Oh, I've also usually had setiathome running, and one hang took place
within minutes of it being launched.  The setiathome binaries live in
my home directory which is mounted via NFS.

Comment 27 Nuno Higgs 2003-12-26 10:01:35 UTC
Just added the new kernel 2135 and will see what happens now.
The system is now fully loaded... if anybody has any problems please
let me know... i dont want the cursed machine crashing while i'm 50
miles from it....

Comment 28 Roland Roberts 2003-12-27 03:23:14 UTC
kernel 2135 has the same problem :-(

FWIW, I have another SMP machine that is mostly idle but which also
has *one* network mount (NFS) which was mounted manually.  It's in the
process of being set up as a new firewall.  Given that it is totally
idle, it's hard to really compare, but it *has* been up for 19 days.

Last item related to network mounts...I have had a large number of NFS
timeouts since installing FC1.  I get the message "nfs server foo not
responding" and then an okay message.  The NFS server is a RH8.0 host.
I never saw these with RH9 either.

Comment 29 Scott A. Friedman 2004-01-08 20:00:40 UTC
kernel 2140 freezes too

Dell 2550 - dual PIII, NFS, autofs and Samba in use. The last message in the
log is always the expiry of an NFS mount that Samba initiated.

Comment 30 Scott A. Friedman 2004-01-08 22:41:26 UTC
Update - moved my Samba server to a Dell 2650 (dual P4 Xeon) and it hung
within ten minutes using 2140. Both of these servers have been running
NFS and autofs. Seems to be related to Samba using NFS.

2.4.22-1.2140.nptlsmp
autofs-3.1.7-42
nfs-utils-1.0.6-1
samba-3.0.0-15

I have switched the 2650 to the single CPU 2140 kernel and will see
how things go.

Both of these machines have dozens of NFS exports which serve a bunch
of SGI workstations without problems (except when the server freezes)

It seems that when a Samba share uses an NFS/autofs mounted filesystem,
the mount times out, and Samba then tries to access the share again,
things go south.

As I said before the last message in my logs is always a mount expiring.

Comment 31 Tomasz Kepczynski 2004-01-09 13:26:23 UTC
Dual Pentium (old MMX from 1996) with 2135 smp kernel (compiled from
src rpm, no modifications to configuration) also fails. A few points here:
1. I've got USB UHCI so it is not OHCI driver as someone suggested.
2. I do not have any nfs or samba mounts.
3. I do have one NFS exported directory, but according to the log it was
   unmounted over 90 minutes before the machine froze.
4. 2 setiathome processes were running and probably were not yet
   ready to send work results (over 10% work left which on this
   machine means over 4hrs).
5. After freeze machine was fully responsive to pings (ping flood on
   local 100Mbit network didn't drop a single packet out of well over
   15000).
6. Initial tcp connection sequence seemed ok, the problem probably
   kicked in when application (httpd, sendmail or ssh) was supposed
   to do any work.
The machine survived over 9 days, but it is not stressed at all
(occasional email/www traffic and NFS access every few days). I don't
know if the UP kernel causes any problems as I haven't really tried it.
I also don't know if console access was possible (no console
connected and the server is far away from here).
Hope this helps.

Comment 32 Max Power 2004-01-09 19:39:19 UTC
Just tried the 2.4.22-1.2149.nptl kernel on my dual Xeon Dell
Precision 650n and the kernel hangs at boot time when loading modules.
 It has done this consistently for all the Fedora kernels.  I am
currently running the non-smp kernel and it works just fine.

Comment 33 Dave Jones 2004-01-09 20:45:29 UTC
Max, is this with acpi=on, or left at default ?

It'd be interesting to know if its the same module each time causing
the problem. Moving /sbin/modprobe to /sbin/modprobe.real and hacking
up a /sbin/modprobe script to..

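#!/bin/bash
# Log the full request, pause so the message is visible, then run the
# real modprobe with every argument passed through (not just the first).
echo "loading $*"
sleep 1
exec /sbin/modprobe.real "$@"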

might give us the culprit..


Comment 34 Wade Hampton 2004-01-12 15:19:09 UTC
I have a dual 2.2 XEON (hyperthread disabled) with FC1.0, kernel 2115.
It worked for over 25 days with NFS and SMB.  It finally hung last
night after I disconnected the cable to the NFS server and did a DF. 
Message was posted to the Fedora mailing list.  I am upgrading to 2140
kernel and will re-test (and will use Stress).  Please CC me on
resolution.



Comment 35 Max Power 2004-01-12 19:09:29 UTC
Dave, with the modprobe script and acpi=on, the SMP kernel
consistently hung at "Initializing USB Interface."  With the standard
modprobe executable it failed a few lines down at "Finding module
dependencies."  In the past it has hung at various other places as
well.  SMP works fine on this machine with Redhat 9.0, however, I'm
sticking with Fedora and currently can only run on one processor with
the non-SMP kernel (2.4.22-1.2149.nptl).

Comment 36 Scott A. Friedman 2004-01-12 20:40:57 UTC
Well, it's not samba related (no surprise) another server running 2140
hung over the weekend. My other 2140 machine has been running for four
days now with single processor kernel (with samba :-) )

Another thing, the load does not appear to matter. Since we are all
just coming back from the holiday here at school our server load
values are at or below 1.0 and yet we are still getting the freeze.

Also, hyperthreading does not seem to matter -- on or off.

Comment 37 Michael Metz 2004-01-13 14:38:03 UTC
perhaps #109463 gives a workaround? (disabling USB-Support via 
modules.conf)

Comment 38 Scott A. Friedman 2004-01-13 19:10:11 UTC
I tried disabling usb. First tried disabling via modules.conf and then
in the BIOS and it still froze. ;-(

My problem seems to have something to do with NFS or autofs. Is anyone
listening in who is using these (not someone with just one export) and
NOT having the system freeze problem? I ask because it seems that there
are at least two different problems being described here.


Comment 39 Dave Jones 2004-01-13 22:34:29 UTC
When these hangs occur, is it totally dead (as in even pushing numlock
doesn't change the keyboard LED)? If it isn't, getting backtraces will
be useful.

1) Turn on Magic System Request Keys....
(i.e. echo 1 > /proc/sys/kernel/sysrq)

2) Alt-SysRq-p to show where the processors are
3) Alt-SysRq-t to show process states

Additionally, ctrl-scrolllock will do backtraces, and shift scrolllock
will show current memory states


Comment 40 Dave Jones 2004-01-13 22:35:45 UTC
Oh, one other thing, the backtraces will produce a lot of output,
which won't fit on the screen.  alt-sysrq-s will sync the drives
afterwards
(to make sure its in the logs), alt-sysrq-u will umount the partitions
and alt-sysrq-b will then reboot.
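
The steps from comments 39 and 40 can be kept at hand as a small
script. This is a sketch only: enable_sysrq and sysrq_checklist are
hypothetical helper names, and the key descriptions are copied from the
two comments rather than from kernel documentation.

```shell
#!/bin/bash
# Enable the magic SysRq keys. The control file defaults to the path
# given in comment 39; pass another path to try the function safely.
enable_sysrq() {
    local ctl=${1:-/proc/sys/kernel/sysrq}
    echo 1 > "$ctl"
}

# Print the key sequence from comments 39 and 40 as a checklist.
# Each key is pressed together with Alt-SysRq on the hung console.
sysrq_checklist() {
    cat <<'EOF'
p  show where the processors are
t  show process states
s  sync the drives
u  unmount the partitions
b  reboot
EOF
}
```

Run `enable_sysrq` as root after boot, and print `sysrq_checklist`
somewhere visible before the machine hangs, since the console may not
be usable afterwards.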


Comment 41 Scott A. Friedman 2004-01-13 23:26:49 UTC
Okay, just restarted a couple of machines that have been running the
single processor 2140 kernel without problems. Well, for four days anyway.

Dell-2550 2xPIII w/Fedora-2140smp
Dell-2650 2xP4 Xeon w/HT enabled w/Fedora-2140smp

SysRq is enabled on both. Both are NFS clients and servers, and both
use autofs (version 3). The 2650 also has some Samba shares that
are actually NFS automounts.

Stay tuned...

Comment 42 Dave Jones 2004-01-14 04:28:15 UTC
booting with nmi_watchdog=1 may also be useful if the machines really
are stuck when they hang.
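
After booting with nmi_watchdog=1 it is worth confirming that the
watchdog is actually ticking, which shows up as a rising NMI count in
/proc/interrupts. A minimal sketch, with nmi_count as a hypothetical
helper name; it sums the per-CPU counters on the "NMI:" line.

```shell
#!/bin/bash
# Print the total of the per-CPU NMI counters from an interrupts file.
# FILE defaults to /proc/interrupts; non-numeric fields are ignored.
nmi_count() {
    local file=${1:-/proc/interrupts}
    awk '$1 == "NMI:" {
             for (i = 2; i <= NF; i++)
                 if ($i ~ /^[0-9]+$/) sum += $i
             print sum + 0
         }' "$file"
}
```

Calling `nmi_count` twice a few seconds apart and comparing the two
numbers shows whether the counter is moving; if it stays at zero, the
watchdog is probably not active on that machine.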


Comment 43 Aaron Belovsky 2004-01-14 09:11:06 UTC
SOLUTION!! -- I believe I have found the solution...

SYSTEM:
  Fedora Core 1
  Dual Xeon 2.4GHz
  2GB RAM

SOLUTION:
  I found that the issue was in the advanced power management for the
processors.  Leads on this path were made previously in this forum
discussion, but I have taken the full step.  There are two areas where
APIC (the advanced power management system) must be disabled.  The
first and foremost is in the BIOS.  The second is in the kernel load
sequence.  
  To disable APIC in BIOS, restart your system and hit whatever key
brings you into BIOS.  Then, just look around for APIC in all of the
settings and turn it off.
  Next, startup a working non-smp kernel and edit your grub config
file (/boot/grub/grub.conf).  On the smp enabled kernel, add noapic to
the end of the kernel line.  Example: kernel
/vmlinuz-2.4.22-1.2149.nptlsmp ro root=LABEL=/ hdd=ide-scsi rhgb noapic
  This repaired all issues for me.  I certainly hope that this will be
the solution for many others to come until they make whatever needs to
be compatible with APIC compatible.

  -Oz
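
The grub.conf edit described above can be sketched as a small filter.
add_kernel_param is a hypothetical helper name; it appends the
parameter to every "kernel" line that lacks it and writes the result to
a separate file, so a real edit would still need review (for example to
touch only the SMP entry). Note that comment 5 reports that noapic did
not help on another machine.

```shell
#!/bin/bash
# Copy SRC to DST, appending PARAM to each "kernel" line that does not
# already carry it. Comment lines and other directives are unchanged.
add_kernel_param() {
    local param=$1 src=$2 dst=$3
    awk -v p="$param" '
        $1 == "kernel" {
            found = 0
            for (i = 2; i <= NF; i++) if ($i == p) found = 1
            if (!found) $0 = $0 " " p
        }
        { print }
    ' "$src" > "$dst"
}
```

For example, `add_kernel_param noapic /boot/grub/grub.conf /tmp/grub.conf.new`
leaves the live configuration alone and produces a copy to compare
with diff.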


Comment 44 Tomasz Kepczynski 2004-01-14 10:35:55 UTC
Sorry, but there is something I don't understand. APIC stands for
Advanced Programmable Interrupt Controller as far as I know, opposed
to ACPI which is Advanced Configuration and Power Interface. Which one
do you mean? And I am quite sure you can't disable APIC in BIOS...
For the record - I have Pentium WITHOUT ACPI (but with APIC) and still
have problems.

Comment 45 Scott A. Friedman 2004-01-14 16:59:17 UTC
Back again...

One of my servers died within four hours using 2140smp. SysRq reports

SysRq<p>
.text.lock.inode [kernel]
nfsd_iget [nfsd]
find_fh_dentry [nfsd]
recalc_task_prio [kernel]
fh_verify [nfsd]
nfsd_access [nfsd]
svc_sock_enqueue [sunrpc]
nfsd3_proc_access [nfsd]
svc_udp_recvfrom [sunrpc]
nfsd_procedure3 [nfsd]
nfsd_dispatch [nfsd]
nfsd_version3 [nfsd]
nfsd_dispatch [nfsd]
...

SysRq<t>
S automount
pipe_wait [kernel]

R umount
update_process_times [kernel]
invalidate_inode_buffers [kernel]
invalidate_list [kernel]
nfs_fs_type [nfs]

Restarted using single processor kernel.

Then, when restarting my other test machine to go back to single
processor it HUNG also!! It stopped during the Stopping automounter
step of the shutdown sequence. The SysRq keys showed the hang in the
SAME PLACE as above.

The other thing was that the second server, which did not hang last
night (but did during the reboot), had a load average of over 200.
Normally it hovers around 1.

Hope this helps debug this particular problem.


Comment 46 Scott A. Friedman 2004-01-14 17:00:49 UTC
Also, disk sync, unmount and reboot magic keys did not work after the
hang. Both machines had to be powered off.



Comment 47 Wade Hampton 2004-01-14 18:33:14 UTC
Update on my system (dual xeon 2.2G, 2G RAM ATAPI soft RAID).  I
updated to a stock 2.4.24 kernel (made with gcc 3.3) and it has not
crashed, yet.  I am re-making the kernel using gcc32 (recommended in
bug 113148).

What steps should those of us following this bug take to try to gather
more data on it?   Should we each report our MB, processor, RAM,
kernel, etc., or does it seem to be a more generic bug?  Is anyone on
the linux kernel mailing list following this?  Does this bug seem
specific only to Fedora users?  Is it specific to only Fedora kernels
and not stock 2.4.20-24 or 2.6 kernels? 

Note:  I also posted this to bug 113148 which I recommended be marked
as a duplicate of this bug.

Comment 48 Dave Jones 2004-01-14 18:51:39 UTC
There has been something similar upstream, which could be the same
problem.

http://testing.lkml.org/slashdot.php?mid=443124


Comment 49 Dave Jones 2004-01-14 18:58:16 UTC
*** Bug 113148 has been marked as a duplicate of this bug. ***

Comment 50 Scott A. Friedman 2004-01-14 20:13:31 UTC
Do you think it would be useful/productive to try the proposed patch
discussed in that thread?

If so, what would be the best way to proceed - apply to 2140?


Comment 51 David Woodhouse 2004-01-14 21:07:42 UTC
Created attachment 96992 [details]
possible fix

Please try this version.

Comment 52 Scott A. Friedman 2004-01-14 21:53:25 UTC
Probably obvious...but attachment will not patch against the inode.c
that is part of the 2140 kernel. 

Comment 53 Dave Jones 2004-01-15 01:02:02 UTC
ah, it relies upon the recent refile_inodes change in 2.4.25pre.
which in turn, depends on the -aa VM changes.. needs rediffing.


Comment 54 Scott A. Friedman 2004-01-15 03:06:07 UTC
A little more legwork than I have time for. I think I am going to sit
tight on the single processor kernel until an updated kernel package
is ready.

Comment 55 Aaron Belovsky 2004-01-15 04:02:27 UTC
Sorry about the confusion,
ACPI must be disabled in BIOS and APIC must be disabled in the
bootloader config.  Disabling ACPI in BIOS made it so that the system
only saw two processors like it should instead of four like it was
doing.  Disabling APIC stopped the system from hanging.

Comment 56 Roger Strandberg 2004-01-15 09:59:09 UTC
I have tried kernels 2115, 2129 and now 2140.
2140 has not hung yet.
I have 2 almost identical systems; only the motherboard is different.
Both are ASUS boards from the same series, but one has the addition of
integrated Gigabit networking.
On both I run D-Link DFE-530TX network cards, and no Gigabit.

The machine without Gigabit is still running 2115, has NEVER HUNG and
is rock solid; it runs as an NFS SERVER.
This machine is always 99%-100% IDLE with almost no load.
(Both machines are NFS servers and serve two HP-UX machines in 2
different data halls.)

The Gigabit machine is always at 0% IDLE with a load of 1.00 or more,
even if I close every service and run only text mode with no network.
Sometimes "top" shows this:

CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total    0.0%    0.0%    0.0%  33.3%    33.4%   33.1%    0.0%

I have been looking at this bug since it started.

In one way it points to NFS, but then why do I have NO problem with
exactly the same HP-UX version on one Linux machine and problems with
the other?
Sometimes the Gigabit one hangs when I start or stop the NETVAULT
SERVICE, but the other never does.

To me it points to something connected to SMP functionality, even if
it is inside a single-processor kernel, or to something with USB.

A question: for those who have the problem, what does your TOP say?
Because this is the only place where I can see any difference.

[root@PBKSE-BS16 root]# uname -a
Linux PBKSE-BS16 2.4.22-1.2140.nptl #1 Tue Jan 6 20:20:43 EST 2004 i686 i686 i386 GNU/Linux

[root@js_volga root]# uname -a
Linux js_volga 2.4.22-1.2115.nptl #1 Wed Oct 29 15:42:51 EST 2003 i686 i686 i386 GNU/Linux


/Roger


Comment 57 David Woodhouse 2004-01-15 10:45:51 UTC
Disregard my patch. It fixes a problem in a 2.4.24 patch which isn't
actually included in the FC1 kernel anyway.

Comment 58 Jerry DaSilva 2004-01-15 13:43:04 UTC
I also have been having major problems with the Fedora SMP kernel.
I used to use RedHat 6.2 for many years and it was rock solid and 
stupid me decided it was time to upgrade since I had the machine down 
to fix some failed CPU fans and a hard drive anyways.

I have an AMI Goliath board with quad 200 MHz 256k L2 PPro processors
installed. It uses dual Orion chipsets. The system has 1 GB of ECC EDO
DRAM. There is an AMI MegaRAID Express 300 card running in native mode
(I2O mode crashes Fedora on bootup), an Intel Dual EEPro Server Net
adapter, and an N9 Imagine Series 2 video card. So it's not just new
fancy machines that have this issue. I have no USB, APM, ACPI, or any
of the latest interfaces, so those couldn't be the cause of this problem.

Yes, it's an old system, but with all 4 CPUs going it is very
responsive for my needs. Until I installed Fedora, that is. Now I
cannot seem to keep the machine up. It will always boot and work for
a while, and then some random time later I find the machine completely
locked up. No display, no keyboard response, unable to connect over
the network. It's completely frozen. None of the system logs show
anything that hints at what happened.

I did try running the single CPU kernel and everything has been 
stable for 2 days now, albeit quite a bit slower having only one CPU 
to run everything on. This machine is useless until this bug gets 
fixed.


Comment 59 Jerry DaSilva 2004-01-15 14:01:09 UTC
Forgot to add one thing. I did find a way to make the bug surface
extremely quickly on my system. Have an FTP server running (in this
case vsftpd), log into it from a remote machine, and proceed to upload
a huge file. The SMP kernel locks up completely, in exactly the same
way I have found my machine randomly locked up in the past: no video,
keyboard lights all off, Num Lock no longer works. The difference is
that I can get it to lock up within a few seconds every time,
repeatably, when uploading a large file through FTP.

The latest smp kernel version 2140 also crashes on my machine.
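
The FTP reproduction above can be scripted from the remote machine so
that others can try it the same way. This is a sketch under stated
assumptions: make_test_file and upload_file are hypothetical helper
names, the comment does not say how large "huge" is, and the server
URL and login are placeholders the caller must supply.

```shell
#!/bin/bash
# Create a zero-filled test file of SIZE_MB megabytes at PATH.
make_test_file() {
    local path=$1 size_mb=$2
    dd if=/dev/zero of="$path" bs=1048576 count="$size_mb" 2>/dev/null
}

# Upload FILE to an FTP URL with curl. URL and USER ("name:password")
# are placeholders, e.g. ftp://server.example/upload/ and test:secret.
upload_file() {
    local file=$1 url=$2 user=$3
    curl --user "$user" --upload-file "$file" "$url"
}
```

A run might look like `make_test_file /tmp/big.bin 2048` followed by
`upload_file /tmp/big.bin ftp://server.example/upload/ test:secret`,
with the SysRq keys already enabled on the server under test.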



Comment 60 Tomasz Kepczynski 2004-01-15 15:09:14 UTC
For those looking for quick fix - you may try the latest errata kernel
for RedHat 9 (2.4.20-28.9smp). Works for me for over 72hrs now (but
as I mentioned earlier - the machine is not very stressed).


Comment 61 Steve Dickson 2004-01-15 15:14:28 UTC
Jerry,

Could you please turn on the Magic System Request Keys
as described in comments #39 and #40 of this thread?

Then post the output of Alt-SysRq-p and Alt-SysRq-t commands
when the system hangs.


Comment 62 Need Real Name 2004-01-15 17:17:10 UTC
Hate to add a me too but I have 3 dual Xeon machines and they reach
the totally dead state within minutes of boot..... even the numlock
doesn't work.

Comment 63 Phil Randal 2004-01-15 18:23:22 UTC
Hang occurs on a single Xeon 2.4GHz Dell 2650 with hyperthreading
enabled and the SMP kernel.  The non-SMP kernel runs fine.

Comment 64 Roger Strandberg 2004-01-15 19:25:36 UTC
I read Jerry's comment about HUGE files on SMP.
I run a single-processor kernel, but the newer motherboard (or its
CPU) has some type of SMP functionality, because it wants to install
the SMP kernel.

I also have HUGE files. Via NetVault I run a Virtual Tape Library
where every file is 10GB; I have around 25 of them and 6 can be loaded
at the same time. Also, around 20-50GB of data is transferred as a
quick backup of an HP-UX system over NFS (around 120-150 files). With
the .2115 kernel it hangs randomly; with .2129 it hangs about every
10th time I restart the NetVault service. With .2140 I don't know; it
has not run for 2 days yet. I'll know a few weeks from now.

And as I wrote before, I have an exact duplicate system whose
motherboard is only 6 months older and has no integrated Gigabit card,
and that one is rock solid with .2115.

BTW I only run the single-processor kernel and not SMP, but the two
seem to be connected.

/Roger

Comment 65 Phil Randal 2004-01-16 09:31:53 UTC
So far so good...  My single Xeon 2.4GHz Dell 2650 (ServerWorks
chipset) has been running kernel-2.4.22-1.2149.nptlsmp with USB
disabled for 15 hours without problems.  See comment #23.

Comment 66 Jerry DaSilva 2004-01-16 15:52:53 UTC
I would capture the sysreq back traces if I could, but my system is 
in a hard locked state. The keyboard no longer works for me to press 
the keys. I wish there was some way of capturing the system state 
when it crashes, but I have found no way in doing this.

Tried the 2149 kernel, it crashes too.



Comment 67 Up2Long 2004-01-16 16:36:25 UTC
I just found a copy of the new kernel ... 
ftp://ftp.net.usf.edu/pub/fedora/linux/core/updates/testing/1/i386/kernel-smp-2.4.22-1.2154.nptl.i686.rpm

Has anyone tried it?

...2149 is buggy on the following server:

HP Proliant DL 580
4  2.5 GHz Xeon
2  Gig RAM
4  72 Gig HD

Problems with X Windows (erratic behavior) and unable to 
reboot/shutdown cleanly.

I have also disabled the ACPI in the services applet and still have 
issues.



Comment 68 Dave Jones 2004-01-16 22:26:59 UTC
There's a large number of changes in the -testing kernel (2163) which
may fix this. It's something of a sledgehammer to crack a walnut, but
that kernel puts us back in sync with mainline VM, and includes all
recent fixes there too.


Comment 69 Scott A. Friedman 2004-01-16 22:59:37 UTC
Will try it starting tonight - we have a long weekend ahead here to test.


Comment 70 Jerry DaSilva 2004-01-17 00:04:50 UTC
Just tried 2163 with my FTP test. Locked up immediately as before.
It looks like I had some keyboard control this time before it 
completely went to a blank screen. Tomorrow when I have more time, 
I'll try to get some backtraces.

-Jerry

Comment 71 Jerry DaSilva 2004-01-17 19:52:50 UTC
Created attachment 97075 [details]
messages log from crash and using alt-sysrq-p/alt-sysrq-t keys

Looks to me like the trace is not complete. Saw more info pop up on my terminal
display than got recorded into the messages log file. I'm guessing the extra
info got lost in a disk cache somewhere and never made it to the messages file.

Comment 72 Nuno Higgs 2004-01-19 15:34:14 UTC
Hello Guys,

New development here. It appears that the problem is also being
reported when using removable storage devices.
I have several Iomega Jaz devices which lock up the same way (on an
attempt to access the device after it has timed out).
So it now appears that the problem may not be tied directly to
NFS-style mountpoints, but to any removable or non-local storage
devices.


Comment 73 Scott A. Friedman 2004-01-19 20:20:51 UTC
Bad news, 2163 hung in the same spot as before. That is SysRq<p>
reports that cpu1 is sitting in .text.lock.inode

The task list (that I can see) shows umount running(heh) 

here is a call stack I managed to copy

do_IRQ [kernel]
invalidate_inode_buffers [kernel]
invalidate_list [kernel]
nfs_fs_type [nfs]
invalidate_inodes [kernel]
nfs_sops [nfs]
kill_super [kernel]
sys_umount [kernel]
filp_close [kernel]
sys_oldumount [kernel]
system_call [kernel]

Back to single processor kernel for now...


Comment 74 Up2Long 2004-01-20 15:45:29 UTC
More bad news  ..... 2163 is just as bad.  My test systems were all 
hung this morning when I came in.  They also still exhibit many of 
the behaviors as before. <sigh>

What can I do to better determine the issues with the kernel?  Can 
anyone e-mail some procedures I can do to try to narrow these issues 
down?  Thanks.



Comment 75 Dave Jones 2004-01-20 16:29:29 UTC
did you try the nmi watchdog as mentioned in #42 ?


Comment 76 Max Power 2004-01-20 20:52:38 UTC
Thanks for the hard work you fellows are putting in on this problem
but 2163 is still failing for me on a Dell Precision 650N (dual Xeon).
 On boot, it now gets past the USB init but either hangs on module
dependencies or setting local filesystem quotas.  On my single
processor Dell XPS system (Xeon), the SMP kernel fails with a DMA
timeout for my SATA drive.  Both systems work well with the non-SMP
kernel.

Comment 77 Roger Strandberg 2004-01-21 10:49:28 UTC
bug.cx wrote about problems with any mounts.

And it seems to come back to this again and again (plus some USB).

But I get the same with the single kernel too, and I run a totally
single-CPU system.

As I said before, I have 2 systems running NetVault.
Both mount logical devices.
I have a single kernel, but experience the same as all of you do when
you run SMP.
The machine that is stable as a rock once it finally starts runs
2115.
The new one hung today when I restarted NetVault for the second time;
this machine runs 2149.
Both machines are ASUS P800 Deluxe, but the one with 2149 is newer.

How can I get the same situation, with a TOTAL keyboard lock or
sometimes blinking caps lock and scroll lock lights, with the single
kernel?
If I try the SMP kernel, it goes dead directly.

It could also be the driver for the Adaptec SATA RAID, but then why
does the other machine (2115) not hang?

/Roger

Comment 78 Lars Damerow 2004-01-23 02:00:13 UTC
Sadly, I too am having this problem, even with the 2163 kernel. I have
a dual-proc Xeon 2.0GHz machine, hyperthreading disabled. I have
several traces made with sysrq-t while the machine is frozen; I'll
attach them here.

I've found that the best way for me to reproduce them is to run the
system monitor applet in my GNOME panel. It often seems like that's the
process sitting in .text.lock.inode (it's called multiload-applet).

For the record, we're having similar freezes on RedHat 7.1 machines
running a more recent 2.4.20 kernel--I have no idea if the cause is
related, but the symptoms are very similar. I don't have a sysrq-t
trace for one of those yet.

thanks for all of the work you're doing!
-lars

Comment 79 Lars Damerow 2004-01-23 02:02:10 UTC
Created attachment 97201 [details]
sysrq-t trace of a frozen machine #1

with 2129 kernel

Comment 80 Lars Damerow 2004-01-23 02:02:36 UTC
Created attachment 97202 [details]
sysrq-t trace of a frozen machine #2

with 2129 kernel

Comment 81 Lars Damerow 2004-01-23 02:03:23 UTC
Created attachment 97203 [details]
sysrq-t trace of a frozen machine #3

with 2163 kernel

Comment 82 Wade Hampton 2004-01-23 13:26:04 UTC
I upgraded to 2163 on my SuperMicro dual XEON (HT enabled) and it
seemed stable for nearly a week.  However, when I took a box to a
customer's to be integrated, after 1 day it promptly locked up (looks
quite bad). 

Do you think a stock 2.4.24 kernel would be better at this time? (I am
resisting moving to another distro this weekend.)

Comment 83 Nuno Higgs 2004-01-23 16:24:57 UTC
Another hang... again on removable media. Does anybody have the same
problem with removable media? The media had a journaling filesystem.
I will remove the journaling filesystem, format it with a
non-journaling filesystem, and see what happens.

Ah, by the way, it hung with NFS too ;)

Comment 84 Meelis Saar 2004-01-25 17:12:05 UTC
I have dual Xeon 2.8 4G RAM on Intel SE7501HG2 board. Fedora Core 1 
with LTSP 4.

I tried all Fedora kernels up to 2.4.22-1.2149.nptlsmp. All of them 
gave up without ANY error message after 6-12h uptime. Server load 
was not important: it can hang under heavy load and without any load.
I tried all suggestions from this thread (noapic, acpi, nousb, etc.) 
with no success. 

With self-compiled 2.4.23 kernel my server had 6 days uptime.

Now I am trying a self-compiled 2.6.1 SMP kernel. So far 21 hours
uptime. Better than the original kernel.

Comment 85 Nuno Higgs 2004-01-26 09:10:13 UTC
Over the weekend I tried to recreate the error that locks up my
machine with removable disks. If there is no logging, then the machine
does not lock up. So it would appear that the removable /
not-on-machine storage module in the kernel is having problems. 
It would appear that the kernel is not able to distinguish between
non-removable and removable storage and, for that reason, locks up as
it would if a physical disk went bad.

Comment 86 Eduardo Romero 2004-01-26 16:12:13 UTC
I updated to the 2149 version on a dual-CPU Dell PE2650 2.8GHz with
HT, and so far we have had no freezes; USB was disabled in the BIOS.
More reports later...

The PCI boot logs says:
ACPI: RSDP (v000 DELL                                      ) @ 0x000fdc60
ACPI: RSDT (v001 DELL   PE2650   0x00000001 MSFT 0x0100000a) @ 0x000fdc74
ACPI: FADT (v001 DELL   PE2650   0x00000001 MSFT 0x0100000a) @ 0x000fdca4
ACPI: MADT (v001 DELL   PE2650   0x00000001 MSFT 0x0100000a) @ 0x000fdd18
ACPI: SPCR (v001 DELL   PE2650   0x00000001 MSFT 0x0100000a) @ 0x000fdda0
ACPI: DSDT (v001 DELL   PE2650   0x00000001 MSFT 0x0100000a) @ 0x00000000
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Processor #0 Pentium 4(tm) XEON(tm) APIC version 20
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x06] enabled)
Processor #6 Pentium 4(tm) XEON(tm) APIC version 20
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x01] enabled)
Processor #1 Pentium 4(tm) XEON(tm) APIC version 20
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x07] enabled)
Processor #7 Pentium 4(tm) XEON(tm) APIC version 20
ACPI: LAPIC_NMI (acpi_id[0x01] polarity[0x1] trigger[0x1] lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x02] polarity[0x1] trigger[0x1] lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x03] polarity[0x1] trigger[0x1] lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x04] polarity[0x1] trigger[0x1] lint[0x1])
Using ACPI for processor (LAPIC) configuration information
Intel MultiProcessor Specification v1.4
    Virtual Wire compatibility mode.
OEM ID: DELL     Product ID: PE 0121      APIC at: 0xFEE00000
I/O APIC #8 Version 17 at 0xFEC00000.
I/O APIC #9 Version 17 at 0xFEC01000.
I/O APIC #10 Version 17 at 0xFEC02000.
Processors: 4


Comment 87 Ivo 2004-01-29 08:09:33 UTC
Just tried 2163 for the first time on my dual PIII. Crashed overnight
as usual. alt-sysrq-p shows something like
comm: ypserv CPU: 0 EIP at .text.lock.read_write

alt-sysrq-t shows as last entry 

umount        R 
Call Trace:  invalidate_inode_buffers invalidate_list nfs_fs_type
invalidate_inodes nfs_sops kill_super sys_umount ...
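
Several reports in this thread identify the stuck CPU by the
text.lock.* section named in the SysRq output. When that output has
been saved (for example from a serial console), the occurrences can be
tallied with a few lines of shell. This is only a sketch: the function
name tally_lock_sections is made up for illustration, the log is assumed
to be plain text containing tokens such as .text.lock.inode, and it
relies on the -o option of grep (available in GNU grep).

```shell
#!/bin/sh
# Count how often each "text.lock.<name>" token appears in a saved log.
# Assumption: the log is plain text; the exact SysRq line format does
# not matter, only the presence of the token.
tally_lock_sections() {
    # $1: path to a saved console or messages log
    grep -o 'text\.lock\.[A-Za-z0-9_]*' "$1" \
        | sort \
        | uniq -c \
        | sort -rn \
        | awk '{ print $1, $2 }'    # normalise to "<count> <token>"
}
```

Running `tally_lock_sections /path/to/saved.log` prints one line per
section, most frequent first.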


Comment 88 Philippe Froidevaux 2004-01-29 18:42:59 UTC
Same here, on 3 different machines (all SMP) and all Fedora kernels
up to 2163. Same with a custom 2.4.24 (gcc33).

Was ok with Redhat 7.3. Seems to be ok with Debian.

Kernel 2.6 doesn't boot.

(heeeeeeeeeeelp)

Comment 89 Norman Gaywood 2004-02-02 01:21:34 UTC
This is just a "ME TOO" note.

I've been getting this hang in text.lock.inode on our systems since I
upgraded to FC1. From what I can tell, the hang occurs when either the
amd automounter would make an umount call or when the system is
shutting down and attempts to unmount a filesystem.

My systems include a 4 processor Xeon and a 2 processor Xeon.

Also I have a 16 node beowulf cluster of single processor AMDs. These
also lock up intermittently, or when restarted, in text.lock.inode.
These were running the smp kernel. I've now started the nodes with a
non-smp kernel to see what happens.

The latest kernel I'm using is 2.4.22-1.2149.nptlsmp but the hang has
happened on all the Fedora kernels I've tried since I started using
FC1 in Dec last year.

Perhaps a silly question, but are the upstream kernel developers aware
that a umount seems to be the most reliable way of triggering this
hang?  I don't get that impression from the kernel thread mentioned above.


Comment 90 Norman Gaywood 2004-02-02 06:24:41 UTC
I now have a system that I can reliably hang. I've got:
 
Quad processor Pentium III (550MHz), 2 Gig of memory
 
kernel /vmlinuz-2.4.22-1.2166.nptlsmp ro root=LABEL=/ nmi_watchdog=1
 
initrd /initrd-2.4.22-1.2166.nptlsmp.img
 
/proc/sys/kernel/sysrq = 1
 
If I run this:
 
#!/bin/sh
while [ true ]; do
  mount /dev/sdh1 /mnt
  echo -n '['
  umount /mnt
  echo -n ']'
done
 
The system will hang in less than a minute.
 
alt-sysrq P shows the processors in text.lock.inode, text.lock.namei,
and text.lock.sched
 
nmi_watchdog does not seem to give any oops.
 
I will attempt to do a serial console tomorrow and get some more
precise messages.
 
If anyone has some suggestions, I'll be willing to try them.
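
For anyone who wants to compare kernels by how many cycles they survive,
the loop above can be bounded and counted. The following is a sketch
under stated assumptions, not the script used in this report: the
function name mount_cycle and the MOUNT/UMOUNT override variables are
hypothetical, added only so the loop can be dry-run without touching a
real device. Progress goes to stderr every 100 cycles, so that the last
number seen on the console gives a rough idea of when the hang occurred.

```shell
#!/bin/sh
# Bounded variant of the mount/umount loop.
# MOUNT and UMOUNT are hypothetical overrides (default: the real
# commands), useful for a dry run with MOUNT=true UMOUNT=true.
mount_cycle() {
    dev=$1      # block device to mount, e.g. /dev/sdh1
    mnt=$2      # mount point, e.g. /mnt
    max=$3      # number of mount/umount cycles to attempt
    i=0
    while [ "$i" -lt "$max" ]; do
        ${MOUNT:-mount} "$dev" "$mnt" || { echo "mount failed after $i cycles"; return 1; }
        ${UMOUNT:-umount} "$mnt" || { echo "umount failed after $i cycles"; return 1; }
        i=$((i + 1))
        # Report progress on stderr so it is visible before a hang.
        if [ $((i % 100)) -eq 0 ]; then
            echo "$i cycles" >&2
        fi
    done
    echo "completed $i cycles"
}
```

As root, `mount_cycle /dev/sdh1 /mnt 20000` would run up to 20000
cycles against the device used above.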


Comment 91 Meelis Saar 2004-02-02 18:09:28 UTC
I have 4 days uptime with self-compiled 2.6.1 kernel.
Some weeks ago I tried 2.6.0 but i couldn't compile it. I don't know 
why because same .config file with 2.6.1 compiled perfectly now.

All FC1 kernels crashed within a day.
Self-compiled 2.4.23 lasted 6 days.

to: Philippe - don't forget mkinitrd if your ext3 filesystem is not 
compiled into kernel :)

Meelis

Comment 92 Lars Damerow 2004-02-03 18:59:43 UTC
The script in comment #90 reliably crashes my machine. It only happens
when the machine is booted with an SMP kernel and the noapic boot flag
doesn't help.

thanks,
lars

Comment 93 Norman Gaywood 2004-02-04 00:46:27 UTC
Created attachment 97450 [details]
boot messages

Comment 94 Norman Gaywood 2004-02-04 00:47:59 UTC
Created attachment 97451 [details]
sysrq-M

Comment 95 Norman Gaywood 2004-02-04 00:49:20 UTC
Created attachment 97452 [details]
sysrq-P

Comment 96 Norman Gaywood 2004-02-04 00:50:52 UTC
Created attachment 97453 [details]
sysrq-T

Comment 97 Norman Gaywood 2004-02-04 00:53:40 UTC
The script in comment #90 is not as reliable as I stated. Once I got
my serial console going, it seemed to take a lot longer than before.
It did work after several minutes however. There was probably an amd
umount while the script was running that caused the hang (though I'm
not sure).

Anyway, I've attached the sysrq details. These ones had the
nmi_watchdog enabled which I don't think has happened for any of the
other sysrq messages posted here and on Bug #109962

kernel grub entry looks like:

title Fedora Core (2.4.22-1.2166.nptlsmp)
        root (hd0,0)
        kernel /vmlinuz-2.4.22-1.2166.nptlsmp ro root=LABEL=/ console=tty0 console=ttyS0,9600n81 panic=60 nmi_watchdog=1
        initrd /initrd-2.4.22-1.2166.nptlsmp.img
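
Before trying to reproduce a hang it is worth confirming that the
debugging options from the grub entry actually reached the kernel. A
minimal sketch, assuming the booted command line is read from
/proc/cmdline; the helper name has_kernel_arg is made up for
illustration.

```shell
#!/bin/sh
# Report whether a kernel command line contains a given argument.
has_kernel_arg() {
    # $1: full command line (for example the contents of /proc/cmdline)
    # $2: argument to look for, matched as a whole space-separated word
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}
```

For example, `has_kernel_arg "$(cat /proc/cmdline)" nmi_watchdog=1`
succeeds only if that exact option was passed at boot.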

Comment 98 Norman Gaywood 2004-02-04 03:03:41 UTC
Created attachment 97454 [details]
sysrq-P

Previous sysrq-P was a sysrq-T, doh!

Comment 99 Nuno Higgs 2004-02-11 10:32:48 UTC
I've had enough... just compiled a kernel.org 2.4.24 kernel and the
system is running smoothly for almost 3 days.

I've tried to recreate the problem with the removable disks, and the
system didn't hang... Let's see over the next days.
I am using NFS mounts and CIFS mounts... so if there is a problem, it
will show really soon.

Comment 100 Phil Randal 2004-02-11 13:26:57 UTC
kernel-2.4.22-1.2149.nptlsmp has been running happily here for the
last 26 days with USB disabled, but I can't test
kernel-smp-2.4.22-1.2166.nptl.i686.rpm until sometime next week.

Can anyone confirm (or deny) that this fixes the problems for them?



Comment 101 Scott A. Friedman 2004-02-12 20:48:38 UTC
Disabling USB does *NOT* fix the problem most reported here. System
will still hang in text.lock.inode etc. 

Have been running 2163 for 20 days now using single processor kernel



Comment 102 Jeff Sheltren 2004-02-12 22:55:22 UTC
I have been experiencing the same issue on ~30 machines with FC 1
installed (hardware is single Pentium 4 with hyperthreading, 1 GB RAM,
SATA WD HDD, ASUS p4p800 MB).  2.4.22-1.2166.nptlsmp does not seem to
help, as I have still seen them hang over the last day since it has
been installed.  I have *not* seen the crashes/hangs when booted into
the UP kernel.  By the way, the script posted in comment #90 seems to
reliably crash all the machines within a minute or two with the new
2166 kernel.

Also experiencing same problems on "true" dual-CPU systems (dual Xeon)
when using the smp kernel.

Comment 103 Nuno Higgs 2004-02-12 23:39:00 UTC
The problem is with stock kernel 2.4.22 built for fedora.

Both machines running fedora, one with stock kernel, and the other 
with the same version but compiled from source.

Results:

The compiled machine is running smoothly....

The system with the stock kernel, hangs about every two hours......

Has anyone tried with other kernel, besides the stocked mess? 

Comment 104 Tim Keitt 2004-02-13 01:38:15 UTC
I see the same problem on my dual-cpu Dell Precision 650. SMP kernel
locks hard (required me to pull the plug). Non-smp kernel runs fine.

Comment 105 Tim Keitt 2004-02-13 01:42:57 UTC
I should add this is with the smp version of 2.4.22-1.2166.nptl.

Comment 106 Jeff Sheltren 2004-02-13 17:04:11 UTC
Using the 2.4.24 kernel from kernel.org has not crashed in over a day...

Comment 107 Nuno Higgs 2004-02-13 21:58:54 UTC
uname -a
Linux XXXX.XXX.XXX 2.4.24 #43 SMP Wed Feb 11 09:26:01 WET 2004 i686 
i686 i386 GNU/Linux
:~> uptime
 22:02:26  up 2 days, 12:30,  8 users,  load average: 1.58, 2.29, 2.39

With a LOT of smb and samba mounts.....

;)



Comment 108 Paul Furness 2004-02-16 09:59:50 UTC
I'm still working on properly testing this idea now, but it looks to
me like it might be related to MRTG running as a cron job.

If I stop cron, the problem goes away. If I start cron, but remove
/etc/cron.d/mrtg, the problem goes away. I have had this system
running now for 4 days since I took away /etc/cron.d/mrtg.

I did try removing all the anacron entries, but this seemed to make no
difference.

I know this isn't an answer, but maybe it'll give someone a clue where
to look :)

Comment 109 Eric André 2004-02-16 10:16:54 UTC
quote:
> might be related to MRTG running as a cron job.

Wow, is that confirmed? At least that's the best idea I've heard so far.
I have MRTG running as a cron job, too.

Comment 110 Philippe Froidevaux 2004-02-16 10:27:46 UTC
Eric, Paul,

no, it's probably not the problem, because 1/ I don't have MRTG
running, and 2/ the crash test in comment #90 has nothing to do with MRTG.

At first I thought the problem was caused by my "slocate" cron job, but
actually any filesystem I/O could crash the system.


Comment 111 Ivo 2004-02-16 10:39:35 UTC
Status update: All of our SMP machines (>12 various mainboards and
processors) crash with all Fedora kernels. Highest uptime achieved
with a Fedora kernel was 5 days; machines with more NFS traffic hang
earlier. alt-sysrq-t (when possible) shows an NFS umount.
All these machines run autofs and are both nfs clients and servers.
From ruptime it seems that many crashes happen around 4am when daily
cron jobs are started; apart from the usual, we run autoupdate or yum
and on some machines a dsm backup (no mrtg)

Vanilla 2.4.24 seems to run stably on these same machines (e.g. 12
days uptime on my desktop).

Comment 112 Lance Feagan 2004-02-18 02:15:18 UTC
I just got my Dell Precision 610 (Dual PIII Xeon 550MHz) booted up
with the 2.4.22-1.2166 nptlsmp kernel downloaded from the Fedora
updates.  I booted the kernel by passing the options:

noapic acpi=off

In addition, I had all power management support turned off in the
BIOS.  I hope this helps out anyone who may be having issues.  As a
general note, the option acpi=off can be extremely helpful in
resolving issues with systems.  I find it fixes problems many times.

Comment 113 Nuno Higgs 2004-02-18 11:02:49 UTC
#uname -a
Linux XXX.XXX.XXX 2.4.24 #43 SMP Wed Feb 11 09:26:01 WET 2004 i686
i686 i386 GNU/Linux
#w
 11:06:05  up 7 days,  1:33,  8 users,  load average: 3.04, 2.60, 2.31

18 NFS Remote mounts... 3 CIFS/SMB mounts.... and no crashes... The
problem is with the stock kernel from fedora.

Comment 114 Norman Gaywood 2004-03-04 01:47:42 UTC
Here is something new that works for me. I have an FC1 installation
with an RHEL kernel installed. You can get these from Whitebox Linux at:

ftp://mirror.physics.ncsu.edu/pub/whitebox/3.0/en/updates/i686

I installed kernel-smp-2.4.21-9.0.1.EL and
kernel-smp-unsupported-2.4.21-9.0.1.EL

I have run the test script at
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=109497#c98

for over 20,000 mount/umounts while doing an updatedb for slocate. Any
FC1 SMP kernel I have tried will hang before 500 mount/umounts. Most
times well before that.

You need to install the kernel packages with:

rpm --oldpackage -hiv kernel...



Comment 115 Norman Gaywood 2004-03-04 01:54:49 UTC
Wrong link to test script in last message. Should be:

http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=109962#c40

Comment 116 Ivo 2004-03-05 16:13:34 UTC
Norman's script works for me as well. SMP machines running Fedora
kernels hang within 1000 iterations. Vanilla 2.4.24 has no problems.



Comment 117 Scott Denham 2004-03-12 23:08:33 UTC
I have a situation that looks to be the same or related to this bug.

Platform:  IBM Intellistation Z-PRO (MSI Board) Dual 2.8 GHz Xeon.
NVIDIA Quadro Pro 980XL, running in generic VGA mode (no NVIDIA driver)
OS:  Fedora core 1 clean install

Symptom: On console login, keyboard becomes unresponsive in a cyclic
pattern; (~10 seconds dead/~10 seconds active ...). With rapid
keyboard activity system will eventually become unresponsive; no ping,
no keyboard response.

This is a lab system that is frequently re-installed and has shown no
such symptoms under RH 8.0, RH 9, RH ES 3.0, SUSE, or kernel.org
2.4.24 kernels. Symptoms appear immediately upon installation of a
fresh Fedora image.  

Duplicated on 2.4.22-1.2115.nptlsmp and 2.4.22-1.2174.nptlsmp. 
Corresponding Fedora non-smp kernels do not exhibit the symptom.
Telnet/ssh sessions do not exhibit this problem; only local console
(tty1-tty6) sessions do.



Comment 118 Thomas J. Baker 2004-03-15 19:48:34 UTC
I have a system that exhibits the same problems. It's a Dell Precision
530. It hangs after a random amount of time with nothing in the logs.
The nmi_watchdog doesn't seem to help.

Comment 119 Tim Keitt 2004-03-15 20:06:13 UTC
Just installing the smp kernel causes the non-smp kernel to oops when
loading the firewire driver. Adding the "nofirewire" kernel option
causes the system to hang elsewhere during boot. Removing the smp
kernel restores "normal" behavior in the non-smp kernel. This is on a
Dell Precision 620 (dual Xeon). It's simpler for the time being just to
run the non-smp kernel.

<conspiracy-rant-mode>
So is this the new business model? Break Fedora on high-end hardware
and force people to the enterprise products?
</conspiracy-rant-mode>

(More likely you guys are too busy with FC2 and 2.6 to track this
down. Hopefully, things will be better with the newer kernels...)



Comment 120 Matt Phillips 2004-03-19 16:32:50 UTC
The script from comment #119 causes a hang here as well on several 
different SMP systems.  No problem when booting to the UP kernel. 
 
Using 2.4.22-1.2174.nptl kernel. 
 
Any chance of a fix for this? 

Comment 121 Andrew Thomson 2004-03-22 13:35:53 UTC
I can also confirm this bug (or something that looks like it) on the 
x86_64 release of FC1.  I'm running dual Opteron 242s on a Tyan 2885 
mb.  Standard FC1 with updates from up2date (except I hand-updated 
the XFree86 exe and the radeon drivers to 4.4.0 to fix another 
problem).  I have no NFS or samba mounts or shares, and no removable 
media except cd/dvd drives which I don't use much.  All files are on 
storage attached to a 3Ware 8500-8.  I'm running LVM (1.x), and /home 
is about 1.5TB.  I think I have an automounter running, but it does not 
do anything at this stage.  Crashes usually occur when idle (I also 
suspected cron at first, and have not totally ruled it out yet).  It 
took me ages to find anything about this one because I was looking 
for an x86_64 problem rather than an SMP one!  Looks to me like it is 
time to try a hand-rolled kernel.  Thanks to everyone here for some 
ideas on how to proceed as I had almost run out.  I can post more 
details tomorrow if anyone else is having this problem with x86_64.  
I can't ssh to the box tonight as it hung hours ago :-(



Comment 122 Karl DeBisschop 2004-03-23 21:20:45 UTC
people have asked if this is fixed in later kernels - i have run the
script from comment 114 with a recent FC2 devel 2.6.3 kernel on a dual
P3 box and had no crashes.

Once FC2-test2 is released, I will repeat with NFS (I just got the
hardware allocated for that test and I don't have the time to do a
complete install for this when the test2 isos are going to be released
shortly)

Comment 123 John Stokes 2004-03-25 20:42:45 UTC
Dual Athlon Tyan S2466 AMD 2400+MP. Same issues, NFS crossmount via 
AUTOFS which is the most likely cause of crash. Built vanilla 2.4.25, 
no crashes since.

Comment 124 Douglas Willis 2004-04-14 12:30:56 UTC
I'm experiencing the same problems with an HP/Compaq DL370 Dual
processor machine.  We usually get a lockup within 15 days of a re-boot.

It's running the redhat 9.0 Distro with the following kernel.

Linux xxx.nerc-bas.ac.uk 2.4.20-20.9smp #1 SMP Mon Aug 18 11:32:15 EDT
2003 i686 i686 i386 GNU/Linux

We also have a DL360/G2 with two processors. It is not experiencing
any problems and it is running Redhat 8.0 with the following kernel.

Linux xxx.nerc-bas.ac.uk 2.4.18-14smp #1 SMP Wed Sep 4 12:34:47 EDT
2002 i686 i686 i386 GNU/Linux

I'm planning on creating a stock SMP kernel from source and testing
that on the ML370.


Comment 125 Need Real Name 2004-04-14 16:21:14 UTC
*** Bug 118990 has been marked as a duplicate of this bug. ***

Comment 126 Ivo 2004-04-19 09:58:24 UTC
I had almost given up hope that this bug will be fixed before we all
move to fedora 2, but it seems the new 2.4.22-1.2179.nptlsmp does it.
I now have six SMP machines with uptimes over 3 days, and the script
from comment #115 runs for hours without effect.

Comment 127 Alan Cox 2004-05-03 22:28:29 UTC
2179 dropped the low-latency patch which may have been related. Is
anyone here still seeing it with 2179 or later?

Comment 128 Jason Tibbitts 2004-05-03 22:35:56 UTC
No problems here with 2188; everything that used to trigger the
problem works fine now.

Comment 129 Brian Hanna 2004-05-04 16:36:39 UTC
I have FC1 2188 installed on PE 2650 dual processor with aacraid.
SMP kernel hangs consistently during boot, UP kernel runs fine.
Tried suggested "noapic", "noapic acpi=off" solutions - no joy.
Sounds like this may have been two bugs - NFS mount problem fixed?
However boot problem still there for me.

Comment 130 Alan Cox 2004-05-04 18:21:10 UTC
I think the boot one is in fact different - want to open a new bug for
it and just reference this bug in the description ?


Comment 131 Brian Hanna 2004-05-04 18:48:13 UTC
Sure, can do. Found that disabling USB in BIOS and modules.conf allowed
boot with SMP kernel on two servers, when SMP kernel hung 100% before.
Using "noapic acpi=off" corrected "cat /proc/cpuinfo", which was
showing 4 cpus. I am up and running now on SMP. Woo-hoo!

Comment 132 Ty 2004-05-04 21:07:55 UTC
Experiencing intermittent system hang with FC1 2118.  Server is Dell
1750 with dual Xeon 3.2GHz.  I haven't had any problems with the UP
kernel.

Comment 133 Ben Fabre 2004-05-04 21:56:45 UTC
I tried all suggestions (except BIOS settings, because I'm far away
from the server), including the latest kernel version (FC1 2188) as
Alan suggested, but my Dell PE 1750 dual Xeon 2.4 still hangs.

For me, the only solution for the moment seems to be a non-SMP kernel.


Comment 134 Need Real Name 2004-05-05 10:42:32 UTC
No problems here, on a dual Xeon, since 2179 & 2188. The system used
to hang almost every quarter of an hour before the update.

Comment 135 Mark Wilkinson 2004-05-05 11:24:58 UTC
Have just tried smp kernel 2188 on a Dual 2.4G Dell PowerEdge 2650 
with embedded raid and had it hang. (same as with 2179 and below).

Have tried the following after reading Brian Hanna's comments (#129 & 
#131) :-

usb bios setting originally 'on with bios support'

plain boot - hung
Appending 'noapic acpi=off' - still hung
Disabling bios usb (off)& appending 'noapic acpi=off' - booted
Disabled bios usb (off) not appending 'noapic acpi=off' - booted
Changed usb bios setting to 'on with no bios support', nothing 
appended - booted !!!

Don't know if this is going to help anyone, or whether it will give 
any more hints about the problem.


Comment 136 Alan Cox 2004-05-05 16:32:54 UTC
That helps a great deal in some ways. USB disabled working says the
problem is either the BIOS USB emulation or our USB drivers. The fact
USB works without BIOS magic pretty much points the finger at the BIOS
firmware (the stuff doing 'make the USB keyboard appear to be a PS/2
keyboard, work in DOS, BIOS etc')

2.6 kernels are much cleverer about how they handle the PS/2 keyboard
so may be tripping a bug in the BIOS - or doing something naughty that
the BIOS trips over on. Hard to be sure which of the two. I'll ping a
Dell guy but you might want to check for bios updates.



Comment 137 Craig Demyanovich 2004-05-05 20:54:42 UTC
We've been facing this issue here on our 2550 dual processor machine.
 Finally we appear to have solved it by disabling USB in the BIOS
(A09) and booting the 2.4.22-1.2188.nptlsmp kernel.  We'll need
repeatable successful reboots and more uptime under normal load to be
sure.

Comment 138 Nuno Higgs 2004-05-06 09:30:47 UTC
Still using the same kernel as in comment #113. Uptime now at 45 days...
Does RH want to make everyone migrate to its paid Linux with this
stunt?

The problem is in the stock kernel. Using a kernel from kernel.org the
system is rock solid.

Comment 139 Mark Wilkinson 2004-05-06 10:46:51 UTC
Just a quick update on my comment (#135) don't take the 'usb with no 
bios support' as fully working - i've since re-booted with this 
option and had 2188 smp kernel hang on 1 machine.

Dell Bios version is A10, and as I have a couple more Dual Xeon DELL 
PE 2650's I'll check what happens with them.

Comment 140 Mark Wilkinson 2004-05-06 12:13:18 UTC
ok, after a morning of testing, here are some more results :-

Bios version A10
Cold boot, usb no bios support - hangs
warm Boot, no usb - boots
Cold Boot, no usb - boots
Warm Boot, usb no bios support - hangs

Bios Updated to A17
Warm Boot, no usb - boots
Warm boot, usb no bios support - hangs
Cold boot, usb no bios support - hangs

Unfortunately I can't verify how I managed to get the system to boot 
with 'usb no bios support' yesterday.

Comment 141 Ty 2004-05-06 13:48:35 UTC
I disabled usb in bios and in modules.conf in an attempt to fix the
problem in comment #132 and had no success.  Multiple startups and
shutdowns of an Oracle database installed on the system will
consistently cause the system to hang.

Comment 142 Alan Cox 2004-05-06 14:19:57 UTC
Re: comment #138 all kernels have bugs. Fedora happened to have one
that people hit in the low latency stuff. But you don't actually need
to buy RHEL to play with the RHEL kernel - you can download the source
rpm from ftp.redhat.com. 

Comment 143 Jerry DaSilva 2004-05-10 17:46:10 UTC
I installed 2188 SMP on my Quad PPro 200 last night, this morning it 
was hung again. There is still a problem with this latest kernel.

Is this ever going to get fixed? I'm guessing I'll be using Fedora 
Core 2 before this kernel is ever gonna get fixed.

-Jerry

Comment 144 André Schaapherder 2004-05-11 11:48:48 UTC
Is there a confirmed working kernel ?

I have trouble with the 2.4.22-1.2188.nptlsmp kernel and switched 
back to the 2.4.22-1.2179.nptlsmp because I am quite sure that I did 
not have this problem a couple of weeks ago.

Running FC1 on a dual Xeon Dell PowerEdge 2650.

André

Comment 145 Ty 2004-05-11 18:21:21 UTC
Regarding comment #132 and #141:
The following is dumped out to the console when the system hangs:

Uhhuh. NMI received for unknown reason 21 on CPU 0.
Dazed and confused, but trying to continue.
Do you have a strange power saving mode enabled?

Comment 146 Alan Cox 2004-05-17 17:57:27 UTC
NMI usually indicates a system problem, memory parity error or the
like. Dell may be able to tell you what NMI code 21 is. I'd normally
guess at bad memory - but it's very odd that your box is reliable with
one CPU only if so.


Comment 147 Stuart Hayes 2004-05-17 18:33:55 UTC
The "unknown reason" code is just what the system read from I/O port 
0x61 when the NMI occurred.  This is pretty much useless.  The bits 
mean things like "speaker clock" (the output of the counter used to 
drive the speaker), "refresh detect" (a bit that toggles every time 
the memory is refreshed), some NMI enable bits (0x21 would mean that 
NMI is enabled from the two sources IOCHK and SERR), "speaker 
data", "speaker enable", and one bit that indicates if an NMI has 
occurred because of a PCI PERR or SERR (bit 7--no PERR or SERR).
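
As an illustration of the description above, here is a minimal sketch that decodes a value read from port 0x61. It assumes the "21" in comment 145 is hexadecimal, as this comment reads it, and it assumes the conventional PC/AT bit layout; the bit names and the helper functions `decode_port_61` and `nmi_sources_enabled` are made up for this example, and a particular chipset's datasheet may differ.

```python
# Assumed conventional PC/AT layout of I/O port 0x61 (hypothetical names).
PORT_61_BITS = {
    0: "timer 2 gate (speaker enable)",
    1: "speaker data",
    2: "SERR NMI disabled",   # a clear bit means SERR may raise an NMI
    3: "IOCHK NMI disabled",  # a clear bit means IOCHK may raise an NMI
    4: "refresh toggle",
    5: "timer 2 output (speaker clock)",
    6: "IOCHK NMI pending",
    7: "SERR/parity NMI pending",
}


def decode_port_61(value):
    """Return the names of the bits that are set in a port 0x61 value."""
    return [name for bit, name in sorted(PORT_61_BITS.items())
            if value & (1 << bit)]


def nmi_sources_enabled(value):
    """Return the NMI sources whose disable bits (2 and 3) are clear."""
    enabled = []
    if not value & 0x04:
        enabled.append("SERR")
    if not value & 0x08:
        enabled.append("IOCHK")
    return enabled


if __name__ == "__main__":
    reason = 0x21  # the value reported in comment 145, read as hex
    print(decode_port_61(reason))
    print(nmi_sources_enabled(reason))
```

Under these assumptions 0x21 has only bits 0 and 5 set, both speaker/timer related, while the two NMI disable bits and both pending bits are clear. That matches the reading above: the value says nothing about what actually raised the NMI.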


Comment 148 Roger Strandberg 2004-05-17 21:45:50 UTC
I have been following this bug for a rather long time, since December
I think. I sometimes have problems at startup too; the Fedora 2
prerelease seems to work better.

A question: does RHEL really differ so much from Fedora?
And if it works in RHEL, why does it not work in Fedora?

There has been NFS, USB, and a good lot of other stuff.
I saw that people had problems with Oracle; I had BIG problems with
BakBone's NetVault (backup system): almost every time I ran "service
netvault stop" the kernel hung...

I apologize for the following opinion:
Was this bug planted in Fedora by RHEL, to let people with dual
systems give up and go to RHEL instead?
I can't even remember this fault in the kernel with the same number
from kernel.org....
And again, sorry for my opinion, to all of you who were offended by
it. I know you all work hard to solve this, but perhaps we
should just let it go and upgrade to a newer Fedora kernel.

And again, I'm sorry

Best regards 
Roger Strandberg
Sweden


Comment 149 Ty 2004-05-20 19:54:03 UTC
Regarding comment #132, it looks like we had a bad processor.  Dell
diagnostic utilities didn't catch it, but we tried to reinstall
Windows on the box and it choked during startup.  The CPU has been
replaced, so I'll need to start testing again from the beginning...

Comment 150 Seth Vidal 2004-06-01 03:58:17 UTC
Something worth trying for people running FC1 who can regularly
experience this bug. Install the RHEL 3 kernels either from RHEL3 or
from centos-3.1 and see if it goes away. The kernels should install
w/o any pain on an FC1 system.



Comment 151 Barry K. Nathan 2004-06-01 07:57:58 UTC
For anyone who wants to try Seth's recommendation in comment 150, make
sure to use RPM's "--oldpackage" option. You'll need it since RPM
considers 2.4.21 to be older than 2.4.22.
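
To illustrate why the option is needed, here is a rough sketch of a segment-by-segment version comparison. It is a simplified stand-in written for this example, not RPM's actual code; the function names `segments` and `vercmp` are hypothetical.

```python
import re


def segments(version):
    """Split a version string into runs of digits or runs of letters."""
    return re.findall(r"\d+|[A-Za-z]+", version)


def vercmp(a, b):
    """Return -1, 0 or 1 as version a is older than, equal to or newer than b.

    Simplified assumption: numeric segments compare as integers, alphabetic
    segments compare as strings, and a numeric segment beats an alphabetic one.
    """
    sa, sb = segments(a), segments(b)
    for x, y in zip(sa, sb):
        if x.isdigit() and y.isdigit():
            xi, yi = int(x), int(y)
            if xi != yi:
                return 1 if xi > yi else -1
        elif x.isdigit():
            return 1
        elif y.isdigit():
            return -1
        elif x != y:
            return 1 if x > y else -1
    if len(sa) == len(sb):
        return 0
    return 1 if len(sa) > len(sb) else -1


if __name__ == "__main__":
    # The RHEL 3 kernel version compared with the FC1 kernel version.
    print(vercmp("2.4.21", "2.4.22"))
```

With this comparison the RHEL 3 kernel's 2.4.21 sorts below FC1's 2.4.22 at the third segment, so installing it counts as a downgrade.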

Comment 152 Ben Fabre 2004-06-08 16:46:09 UTC
Hi,

Is this bug still present in fedora core 2 ?

Thanks
Ben

Comment 153 Alan Cox 2004-06-08 21:42:49 UTC
FC2 is a totally different kernel (2.6). I've certainly not seen any
evidence of matching problems, although since this bug is a composite
of about half a dozen things, most of which are fixed, it's hard to be
definitive. If you find any FC2 problems - please open a *new* bug for
them ;)


Comment 154 Fabrizio Steiner 2004-06-15 18:17:55 UTC
Has anybody tried the "noht" flag for the kernel? This should disable
hyperthreading.

Comment 155 Seth Vidal 2004-06-19 14:10:20 UTC
Installed RHEL3 entirely on this system. Hot springy death after about
8 hours of running on a poweredge 1750. Appears to be the same error
and it is immediately preceded by an autofs mount expiring.

Just ANOTHER datapoint of pain.



Comment 156 Barry K. Nathan 2004-06-19 21:14:33 UTC
Seth, is that RHEL 3 update 2 or just the original RHEL 3? (i.e. is
that with kernel 2.4.21-15.EL or later?)

BTW, on every machine I've encountered that had this problem with
Fedora Core 1, installing a Fedora Core 2 kernel has fixed it.

Comment 157 Seth Vidal 2004-06-19 21:22:38 UTC
RHEL3U2. I really don't want to install a 2.6 kernel if I can avoid
it. I installed rhel so I wouldn't have to play with kernels for a while.

I figured dell equipment should get along well with rhel.


Comment 158 Dave Jones 2004-06-19 23:50:06 UTC
This bug has turned into a complete mish-mash of a number of problems
& reports for what turned out to be several different bugs.
Trying to pick through it, and find out the bits that are still
causing problems, and still unfixed is a nightmare.

I'm going to close this bug as fixed (as backing out the low-latency
patch did fix this problem for a number of users).  If you are still
seeing problems with kernels past 2188, please open a new bug instead
of adding to this one, even if your bug sounds identical to someone
else's. (If it is, I can mark it as a duplicate easily -- if it isn't
it adds to the noise, and we end up with bugs like this monster).

If you are still seeing it, include as much info as possible
(Do *not* put 'See 109497', as that defeats what I'm trying to do with
this exercise).  Try out some of the suggestions mentioned above
(acpi=on, nmi_watchdog=1, noapic, other such options..)

I don't care if it means I end up with another 150 bugs. It's easier
to weed out duplicates that way than it is trying to make sense of the
situation with this bug.

With Fedora Core 1 only having a finite amount of shelflife left
before it gets handed off to the fedora-legacy folks, it'd be good to
get a better idea of what's going wrong, both for Red Hat folks still
working on this such as myself, and for the legacy folks that'll pick
up when we're done.

If you're seeing this problem in RHEL3 / FC2 or whatever else, mark it
as such in the new bugs. "Similar" problems to this bug frequently
aren't, as the comments above show.

Thanks.


