170441 – Running >4 jobs on 4-way dual-core opteron crashes system

Summary: Running >4 jobs on 4-way dual-core opteron crashes system

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.0
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Jim Paradis
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2005-10-11 19:14 UTC by Peter Ruprecht
Modified:	2013-08-06 01:16 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2006-03-01 23:01:21 UTC
Target Upstream Version:
Embargoed:

Attachments
output from dmidecode (24.93 KB, text/plain) 2005-10-11 19:16 UTC, Peter Ruprecht
sysreport for this system (947.70 KB, application/x-bzip) 2005-10-13 21:22 UTC, Peter Ruprecht

Description Peter Ruprecht 2005-10-11 19:14:41 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20050921 Red Hat/1.7.12-1.4.1

Description of problem:
After updating from kernel-smp-2.6.9-11.EL to kernel-smp-2.6.9-22.EL on a 4-way dual-core Opteron system, using more than 4 processors for CPU-intensive tasks causes the system to power off.

The motherboard is a Tyan Thunder K8QSD Pro (S4882-D) with BIOS rev E 1.02, the processors are Opteron 870s, it has 32 GB of RAM, and the 6 SATA disks are on a 3ware 9500 controller. I'll attach the output of dmidecode.

The problem occurs if I run >4 concurrent CPU-intensive processes, or a parallel job that uses >4 processors.

The cpuspeed daemon is not running and /proc/cpuinfo shows the 8 processors at their maximum speed of 2 GHz.

I thought that perhaps the system was drawing too much power for the 600W power supplies to handle, but on the previous kernel I could have 8 processors working, the RAID being hammered, and the nvidia FX-5200 card running with no problem. On the new kernel, with the nvidia card removed and the disks not being used, the problem occurs.

Using the kernel parameter "numa=on" doesn't have any effect.

Version-Release number of selected component (if applicable):
kernel-smp-2.6.9-22.EL-x86_64

How reproducible:
Always

Steps to Reproduce:
1. boot to kernel-smp-2.6.9-22.EL
2. start more than 4 CPU-intensive processes

Actual Results: System powers off

Expected Results: System should continue to run

Additional info:

As an aside, even using the 2.6.9-11 kernel, something is not quite right when using >4 processors. One would assume that for non-memory- or -disk-bound jobs, 8 processes would run in approximately the same time as 4. However, 8 take something like 5 times longer. Also, when using parallel programs such as GAMESS or NAMD, the results of calculations using >4 processes could be inconsistent. (But at least the system didn't crash.)
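
A minimal sketch of how the reproduction step and the scaling observation above could be tested in a controlled way. It is not the workload from this report: the busy loop, the iteration count, and the `run_jobs` helper are illustrative assumptions. It starts N independent CPU-bound processes and reports the wall-clock time; on an otherwise idle 8-core system, 8 such jobs would be expected to take roughly as long as 4.

```python
import subprocess
import sys
import time

# Pure-CPU busy loop executed by each child process. The loop body and the
# default iteration count are arbitrary illustrative choices.
BUSY_LOOP = "x = 0\nfor i in range({n}):\n    x += i * i\n"


def run_jobs(n_jobs, iterations=5_000_000):
    """Start n_jobs CPU-bound child processes; return wall-clock seconds."""
    code = BUSY_LOOP.format(n=iterations)
    start = time.perf_counter()
    children = [
        subprocess.Popen([sys.executable, "-c", code]) for _ in range(n_jobs)
    ]
    for child in children:
        if child.wait() != 0:
            raise RuntimeError("a worker process exited with an error")
    return time.perf_counter() - start


if __name__ == "__main__":
    # Compare job counts at and above the threshold described in the report.
    for n in (1, 4, 5, 8):
        print(f"{n} jobs: {run_jobs(n):.1f} s")
```

Because the children share no memory or disk activity, a large jump in elapsed time between 4 and 8 jobs would point at scheduling or hardware rather than at the application.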

Comment 1 Peter Ruprecht 2005-10-11 19:16:49 UTC

Created attachment 119817
output from dmidecode

Is there any other information that I can include that would be helpful?

Comment 4 Brian Maly 2005-10-12 15:19:03 UTC

Does the system provide any useful info before it powers off? Is the
motherboard shutting itself down (voltage or temp out of range) or is the kernel
causing a panic?

With the kernel-smp-2.6.9-11.EL kernel, what was the reported CPU frequency? Was
it the same?

It might be useful to try booting without PowerNow enabled to try and narrow
down the possibilities.

Also, the problem listed in "additional info" could be a scheduler issue. Using
a different I/O scheduler may give better performance. Pass
elevator={"as"|"cfq"|"deadline"|"noop"} at boot to select which I/O scheduler is
used.

Comment 5 Peter Ruprecht 2005-10-13 18:00:36 UTC

Unfortunately there are no relevant entries in the system logs, and no entries
at all in the BIOS error reports.  So I'm not sure whether the kernel is
panicking or if the mobo is shutting down.

The reported CPU frequencies and bogomips are the same with both
kernel-smp-2.6.9-11.EL and kernel-smp-2.6.9-22.EL.  

The BIOS doesn't seem to have an option for turning off PowerNow or ACPI.  Is
there a way to do this via kernel options?

With kernel-smp-2.6.9-22.EL, the crashes occur in both single user mode and
runlevels 3 and 5.  If I boot the uniprocessor kernel-2.6.9-22.EL, I can run 8
concurrent processes without a crash.

Some BIOS settings that may be of interest:
Multiproc Spec:  1.4
MP Tables uses PCI:  Yes
ACPI SRAT Table:  Enabled  
RSDT FADT Rev:  1
HPET Timer:  Enabled
MTRR mapping:  Continuous
Memhole mapping:  Disabled
Enable mem clocks:  Populated
Controller conf mode:  Auto
Timing conf mode:  Auto
DRAM Bank Interleave:  Auto
Node Mem Interleave:  Disabled

Comment 6 Jim Paradis 2005-10-13 19:23:17 UTC

Can you do a sysreport and attach the results to this bug report?  It might
provide us with some more information...

Comment 7 Peter Ruprecht 2005-10-13 21:22:57 UTC

Created attachment 119951
sysreport for this system

Comment 8 Peter Ruprecht 2005-10-13 21:49:15 UTC

The behavior of this problem seems somewhat hardware-related to me, so I thought
I would try to see whether the power supply was being overloaded when a lot of
jobs were running.  

With the -11smp kernel, the current in the power cable going to the wall was
5.75A with 8 CPU-intensive background processes running, a tar job going to the
5-disk RAID, and a video-intensive molecular simulation program running.  

With the -22smp kernel, the current was about 6.20A with the system in runlevel
3 with no user jobs running.  With 5 CPU-intensive jobs, the current went up to
6.26A and the system shut off.  It's possible that the 600W power supply is
being overloaded, though I don't know how to prove this.

But why does the new kernel cause the system to draw more power?
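
For a rough sense of scale, the measured currents can be converted to apparent power at the wall. The line voltage is not stated in this report, so the 120 V used below is an assumption, as is the `apparent_power_va` helper name. Under that assumption the three readings come to roughly 690 VA, 744 VA and 751 VA. Real power is lower by the (unknown) power factor, and a supply's rating usually refers to its DC output rather than its wall draw, so these figures are only indicative.

```python
def apparent_power_va(current_a, line_voltage_v=120.0):
    """Apparent power in volt-amperes for an AC current measured at the wall.

    The 120 V default is an assumption; the report does not give the voltage.
    """
    return current_a * line_voltage_v


if __name__ == "__main__":
    # Currents (in amperes) quoted in the comment above.
    readings = [
        ("-11smp, 8 CPU jobs + RAID + video", 5.75),
        ("-22smp, idle in runlevel 3", 6.20),
        ("-22smp, 5 CPU jobs", 6.26),
    ]
    for label, amps in readings:
        print(f"{label}: {apparent_power_va(amps):.0f} VA")
```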

Comment 9 Jim Paradis 2005-10-15 00:02:55 UTC

When you run with the -11smp kernel, does it report different bogomips than the
-22smp?  I'm wondering if it's an issue with power/frequency management...

Comment 10 Peter Ruprecht 2005-10-15 15:24:00 UTC

The CPU frequency and bogomips (reported in /proc/cpuinfo) are the same on both
the -11smp and -22smp kernels.
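
A comparison like this can be scripted rather than read by eye. The sketch below assumes the x86 layout of /proc/cpuinfo (blocks separated by blank lines, each with `processor`, `cpu MHz` and `bogomips` fields) and assumes the output was saved to a file under each kernel. The `parse_cpuinfo` helper and the two file arguments are hypothetical.

```python
import sys


def parse_cpuinfo(text):
    """Return (processor, MHz, bogomips) tuples parsed from /proc/cpuinfo text."""
    cpus = []
    block = {}
    # The appended empty string flushes the final block when the text
    # does not end with a blank line.
    for line in text.splitlines() + [""]:
        if line.strip():
            key, _, value = line.partition(":")
            block[key.strip()] = value.strip()
        elif block:
            cpus.append(
                (
                    int(block["processor"]),
                    float(block["cpu MHz"]),
                    float(block["bogomips"]),
                )
            )
            block = {}
    return cpus


if __name__ == "__main__":
    # Usage: script.py cpuinfo-from-old-kernel cpuinfo-from-new-kernel
    if len(sys.argv) == 3:
        with open(sys.argv[1]) as f_old, open(sys.argv[2]) as f_new:
            old = parse_cpuinfo(f_old.read())
            new = parse_cpuinfo(f_new.read())
        print("identical" if old == new else "different")
        for row_old, row_new in zip(old, new):
            if row_old != row_new:
                print(row_old, "->", row_new)
```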

Comment 11 Brian Maly 2005-10-17 14:55:36 UTC

lmsensors info (or other environmental info like motherboard temps, voltage,
etc) would also be helpful if possible. Are there any temp or voltage
differences between these 2 kernels?

Comment 12 Peter Ruprecht 2005-10-17 20:06:29 UTC

Unfortunately, the Tyan System Monitor
(http://www.tyan.com/support/html/software_utilities.html) won't install on this
system.  The docs say it's supported only through RH 9.

Also, the lm_sensors sensors.conf file that Tyan provides for the S4882 board
says that it only works with kernel 2.6.10 or higher.  It requires the
i2c-amd756-s4882  module, which doesn't seem to be present in
kernel-smp-2.6.9-11.EL.  The docs for i2c itself say not to compile it with any
2.6 kernel.

Do you have any other ideas about how I might get the mobo info?

Comment 13 Brian Maly 2005-10-17 20:20:08 UTC

There might be other packages to read this info, but lmsensors is the standard
way. Maybe an older version of lmsensors can be used with the older 2.6.9 kernel?

It was my understanding that the AMD 756/766 driver was a module for the RHEL4
U2 kernel: /lib/modules/2.6.9-21.1.ELsmp/kernel/drivers/i2c/busses/i2c-amd756.ko

modprobe i2c-amd756 should work on any recent RHEL4 kernel. The .11 kernel may
not have had this though...

Comment 14 Peter Ruprecht 2005-11-23 18:47:52 UTC

Update: I got a bigger power supply and the problem went away.  Apparently the
more recent RHEL 4 kernels for some reason cause the system to draw a bit more
power.  

As for lmsensors, I can get partial info for the mobo using the i2c-amd756
module, but the output is essentially identical regardless of which kernel I'm
using or how heavy the system load is.  I don't think that it was a voltage or
something similar going out of range that caused the problem ... I think the
power supply was too small.

Somewhat as an aside, Tyan's provided sensors.conf file requires the following
modules:   i2c-amd756-s4882, i2c-isa, lm85, lm63, and w83627hf. 
i2c-amd756-s4882 and lm63 are not included in the current RHEL 4AS kernel. 
Would it be possible to include these in a future version?
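
One way to check which of the modules listed above ship with a given kernel is to look for their .ko files under that kernel's module tree. This is only a sketch: the `find_modules` helper is hypothetical, it assumes the conventional /lib/modules/<release> layout, and it will not see drivers that are built into the kernel image rather than shipped as modules.

```python
import os


def find_modules(names, release, root="/lib/modules"):
    """Map each module name to the path of its .ko file, or None if absent."""
    # Hyphens and underscores are interchangeable in module names.
    wanted = {name.replace("-", "_"): name for name in names}
    found = {name: None for name in names}
    for dirpath, _dirs, files in os.walk(os.path.join(root, release)):
        for filename in files:
            stem, sep, rest = filename.partition(".ko")
            # Accept "foo.ko" and compressed variants such as "foo.ko.xz".
            if not sep or (rest and not rest.startswith(".")):
                continue
            key = stem.replace("-", "_")
            if key in wanted:
                found[wanted[key]] = os.path.join(dirpath, filename)
    return found


if __name__ == "__main__":
    # Module names taken from the comment above.
    required = ["i2c-amd756-s4882", "i2c-isa", "lm85", "lm63", "w83627hf"]
    for name, path in find_modules(required, os.uname().release).items():
        print(f"{name}: {path or 'not found'}")
```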

Comment 15 Jim Paradis 2006-03-01 23:01:21 UTC

Since this issue was determined to be caused by hardware, I am closing this as
NOTABUG.
