Bug 10203

Summary: Kernel panic on SMP HP Netserver
Product: [Retired] Red Hat Linux
Reporter: Sergio Tadini <sergio.tadini>
Component: kernel
Assignee: Michael K. Johnson <johnsonm>
Status: CLOSED ERRATA
Severity: high
Priority: medium
Version: 6.0
CC: dautrevaux, sergio.tadini
Hardware: i386   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2000-09-05 09:11:51 UTC

Description Sergio Tadini 2000-03-16 10:13:32 UTC
On a Hewlett Packard NetServer LC3 that has been running RH Linux 6.0 since
Sep. 1999, I tried to upgrade to SMP by installing a second CPU (Intel PIII
550), since at installation time the PC was recognised as an SMP machine
(the kernel was already SMP).
A few days later the server started to crash every 5 to 10 minutes.
I decided to upgrade the system BIOS and to reinstall RH Linux with 2 CPUs.
After 2 days of running it started to crash every few hours.
Hewlett Packard support told me that the hardware is OK (and since this
configuration is tested by Red Hat and HP, they don't know what the
problem is).

At the same time, I have installed some old ALR dual-processor Pentium 166
machines with the same problem: when running as single-CPU machines
everything is OK, but when the kernel runs in SMP mode they start to crash.

I also tried upgrading the kernel to 2.2.13, but with the same results.
I think the problem could be in some driver, but the NetServer is an "all
from HP" machine (RAM, CPUs, RAID controller, HDs, NIC), and the old ALRs
have a very different configuration.

Since I have 4 customers with SMP machines running with a single cpu, I
need a prompt reply please!

Comment 1 Bernard DAUTREVAUX 2000-03-27 13:24:59 UTC
I have about the same kind of problem: I'm trying to add the second
processor to a Linux box on a dual PIII/Xeon-550MHz Intel C440GX+ board, and I
get a bunch of problems. The machine runs perfectly for about 24 hours (and
it's incredible how fast it compiles :-)), then freezes :-(.

I'm usually not able to get any indication as to why it crashed (as it seems to
like crashing in the middle of the nightly builds :-)), but occasionally it
crashes during the day, and then I get the following behaviour:

As long as you are not accessing an NFS-mounted file system, for example when
logging in as root from the system console, everything works perfectly, but as
soon as you try to access one, you're dead :-(

As long as it is working I get occasional complaints like these:
	svc: unknown program 100227 (me 100003)
	svc: unknown version (3)
Note however that I also get these messages in single processor mode, so I'm
not sure they are related to the problem.

Once the machine is frozen, you see the following message on the system
console from time to time:
	nfs: task 37637 can't get a request slot
where the task number may change from message to message (I've seen at least
37638 and 37639)
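
To see how many distinct tasks are stuck and how often each one is reported,
the console messages can be tallied with a short script. This is only a
minimal sketch: the function name count_stuck_tasks and the sample lines are
hypothetical, and the pattern assumes the message text is exactly as quoted
above.

```python
import re
from collections import Counter

# Assumption: the message format is exactly the one quoted in this report;
# only the task ID varies from line to line.
SLOT_RE = re.compile(r"nfs: task (\d+) can't get a request slot")


def count_stuck_tasks(lines):
    """Return a Counter mapping NFS task ID -> number of times it was logged."""
    counts = Counter()
    for line in lines:
        match = SLOT_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Hypothetical sample input, modelled on the messages in this report.
sample = [
    "nfs: task 37637 can't get a request slot",
    "nfs: task 37638 can't get a request slot",
    "nfs: task 37637 can't get a request slot",
    "svc: unknown version (3)",
]
stuck = count_stuck_tasks(sample)
```

In practice the lines would come from a saved copy of the console or syslog
output rather than from a hard-coded list.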

At this point the CPU is idle (top reports 1 running process and 99.8% idle
CPU, with about 60 MB of free memory out of a total of 1 GB and no swap at all;
swap is not even configured).

Note that all these messages are related to NFS accesses to filesystems
exported from a Solaris-2.6/PC system (running on a dual PII-450 SMP platform).

I was using kernel 2.2.12-5 from RedHat-6.0, then 2.2.12-32 from RedHat-6.1 in
uniprocessor mode, then switched to 2.2.12-32smp and now kernel 2.2.14-8smp (as
provided by Ed Schlunder at
http://www.ajusd.org/~edward/silkhat-6.1/i386/kernel-smp-2.2.14-8.i686.rpm) on
my RedHat-6.1 install. I get 'svc:' messages on all configs and crashes on all
SMP kernels.

Is there any workaround other than unplugging the second PIII-550? Even if it
were aesthetic, I don't think my boss would appreciate my displaying a $1K
processor on the wall over my desk :-(

Comment 2 Alan Cox 2000-08-08 20:28:36 UTC
A significant amount of SMP work was done for 2.2.16. Has the 2.2.16 errata
kernel helped?


Comment 3 Bernard DAUTREVAUX 2000-08-09 12:42:44 UTC
I just installed it today (taking advantage of the fact that the whole team is
now on holiday) to experiment with kernel-2.2.16-3smp, and I will keep you
informed of the result. However, I am also leaving for about two weeks, so
don't expect anything new before then, unless it starts crashing faster than
usual :-)

Comment 4 Bernard DAUTREVAUX 2000-09-05 09:11:47 UTC
Thanks for the good job :-) I've now been running the 2.2.16 errata kernel in
SMP mode for about one month and it has never crashed!

So this seems to have cured my problem. Note, however, that I still get the
"kernel: svc: " messages from NFS, so those were not related to the SMP crashes
at all :-)


Comment 5 Alan Cox 2000-09-15 18:26:46 UTC
The svc message is logged when the Solaris box tries to talk NFSv3 to us. It's
probably a bit of excess verbosity on the Linux side to log this, I agree.
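
For reference, the two svc messages quoted earlier in this report can be
decoded mechanically. The sketch below is an illustration only: the explain
function is hypothetical, and the program names are an assumption taken from
the standard ONC RPC program number assignments (100003 is nfs, 100227 is
nfs_acl), not from anything stated in this report.

```python
import re

# Assumption: names follow the standard ONC RPC program number registry
# (the same values normally listed in /etc/rpc).
RPC_PROGRAMS = {100003: "nfs", 100227: "nfs_acl"}

PROGRAM_RE = re.compile(r"svc: unknown program (\d+) \(me (\d+)\)")
VERSION_RE = re.compile(r"svc: unknown version \((\d+)\)")


def program_name(number):
    """Return the registered name for an RPC program number, if known."""
    return RPC_PROGRAMS.get(number, "unknown")


def explain(line):
    """Return a one-line explanation of an svc message, or None if unrecognised."""
    match = PROGRAM_RE.search(line)
    if match:
        asked, served = int(match.group(1)), int(match.group(2))
        return (f"client asked for {program_name(asked)} ({asked}); "
                f"this service is {program_name(served)} ({served})")
    match = VERSION_RE.search(line)
    if match:
        return (f"client asked for protocol version {match.group(1)}, "
                f"which this service does not provide")
    return None


first = explain("svc: unknown program 100227 (me 100003)")
second = explain("svc: unknown version (3)")
```

Read this way, both messages describe requests the server declined to handle,
which fits the observation that they also appear in single-processor mode.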

Glad to hear it's happier. Reopen the bug if it turns out to be luck only.


Comment 6 Bernard DAUTREVAUX 2000-11-29 08:55:48 UTC
Back to my problem of SMP kernel crashing.

As said above, I updated to kernel-2.2.16-3smp in July and all worked fine
until about the end of September. I then got one or two "silent" crashes during
October: not fun, but still not too bad, except when it crashes during a
weekend rebuild!

But now I'm starting a new phase in our projects and I have HUGE make batches,
running for several days, getting the source files from a Solaris-7 box and
putting all resulting files on a local SCSI disk, and it has crashed about
twice a day consistently since then, with the kernel frozen with the dreaded
    nfs: task xxxxx can't get a request slot
(replace xxxxx by your favorite task ID)

Note that since July several users have been compiling in parallel, but their
current directory was also NFS-mounted from the Solaris box and we seldom
crashed; the difference now is that the current directory for the make runs is
local and only the source files are picked up (using VPATH) from an NFS-mounted
tree.

So it seems that the errata kernel does not fix this problem; IIRC I got this
problem when testing the build environment I'm now using, and stopped compiling
locally at about the time I installed the errata. I'm afraid I did not enforce
the "all other things equal" paradigm strictly enough :-o