From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20050921 Red Hat/1.7.12-1.4.1

Description of problem:
After updating from kernel-smp-2.6.9-11.EL to kernel-smp-2.6.9-22.EL on a 4-way dual-core Opteron system, using more than 4 processors for CPU-intensive tasks causes the system to power off. The motherboard is a Tyan Thunder K8QSD Pro (S4882-D) with BIOS rev E 1.02, the processors are Opteron 870s, it has 32 GB of RAM, and the 6 SATA disks are on a 3ware 9500 controller. I'll attach the output of dmidecode.

The problem occurs if I run >4 concurrent CPU-intensive processes, or a parallel job that uses >4 processors. The cpuspeed daemon is not running and /proc/cpuinfo shows the 8 processors at their maximum speed of 2 GHz. I thought that perhaps the system was drawing too much power for the 600W power supplies to handle, but on the previous kernel I could have 8 processors working, the RAID being hammered, and the nvidia FX-5200 card running with no problem. On the new kernel, with the nvidia card removed and the disks not being used, the problem occurs. Using the kernel parameter "numa=on" doesn't have any effect.

Version-Release number of selected component (if applicable):
kernel-smp-2.6.9-22.EL-x86_64

How reproducible:
Always

Steps to Reproduce:
1. boot to kernel-smp-2.6.9-22.EL
2. start more than 4 CPU-intensive processes
3.

Actual Results: System powers off

Expected Results: System should continue to run

Additional info:
As an aside, even using the 2.6.9-11 kernel, something is not quite right when using >4 processors. One would assume that for non-memory- or -disk-bound jobs, 8 processes would run in approximately the same time as 4. However, 8 take something like 5 times longer. Also, when using parallel programs such as GAMESS or NAMD, the results of calculations using >4 processes could be inconsistent. (But at least the system didn't crash.)
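Step 2 of the reproduction steps, and the 4-versus-8 timing comparison mentioned in the additional info, can be sketched with a small load generator. This is only an illustration, not the workload used in this report: the function name `run_cpu_burners` and the iteration count are hypothetical, and each worker is a plain Python busy loop that touches neither disk nor much memory.

```python
import subprocess
import sys
import time

# Source of the worker: a fixed amount of pure CPU work (a counted empty loop).
# The iteration count arrives as the first command-line argument.
BURNER = (
    "import sys\n"
    "for _ in range(int(sys.argv[1])):\n"
    "    pass\n"
)


def run_cpu_burners(count, iterations):
    """Run `count` identical CPU-bound processes concurrently.

    Returns (exit_codes, elapsed_seconds). If there are at least `count`
    idle processors, the elapsed time should stay roughly constant as
    `count` grows.
    """
    start = time.monotonic()
    procs = [
        subprocess.Popen([sys.executable, "-c", BURNER, str(iterations)])
        for _ in range(count)
    ]
    exit_codes = [proc.wait() for proc in procs]
    return exit_codes, time.monotonic() - start
```

For example, calling `run_cpu_burners(4, 200_000_000)` and then `run_cpu_burners(8, 200_000_000)` and comparing the two elapsed times gives a rough check of the scaling described above. On the affected kernel, any count above 4 would be expected to trigger the power-off.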
Created attachment 119817 [details]
output from dmidecode

Is there any other information that I can include that would be helpful?
Does the system provide any useful info before it powers off? Is the motherboard shutting itself down (voltage or temp out of range), or is the kernel panicking? With the kernel-smp-2.6.9-11.EL kernel, what was the reported CPU frequency? Was it the same? It might be useful to try booting without PowerNow enabled to narrow down the possibilities. Also, the problem listed in "additional info" could be a scheduler issue. Using a different I/O scheduler may give better performance. Pass elevator={"as"|"cfq"|"deadline"|"noop"} at boot to select which I/O scheduler is used.
Unfortunately there are no relevant entries in the system logs, and no entries at all in the BIOS error reports. So I'm not sure whether the kernel is panicking or if the mobo is shutting down.

The reported CPU frequencies and bogomips are the same with both kernel-smp-2.6.9-11.EL and kernel-smp-2.6.9-22.EL.

The BIOS doesn't seem to have an option for turning off PowerNow or ACPI. Is there a way to do this via kernel options?

With kernel-smp-2.6.9-22.EL, the crashes occur in both single user mode and runlevels 3 and 5. If I boot the uniprocessor kernel-2.6.9-22.EL, I can run 8 concurrent processes without a crash.

Some BIOS settings that may be of interest:
Multiproc Spec: 1.4
MP Tables uses PCI: Yes
ACPI SRAT Table: Enabled
RSDT FADT Rev: 1
HPET Timer: Enabled
MTRR mapping: Continuous
Memhole mapping: Disabled
Enable mem clocks: Populated
Controller conf mode: Auto
Timing conf mode: Auto
DRAM Bank Interleave: Auto
Node Mem Interleave: Disabled
Can you do a sysreport and attach the results to this bug report? It might provide us with some more information...
Created attachment 119951 [details] sysreport for this system
The behavior of this problem seems somewhat hardware-related to me, so I thought I would try to see whether the power supply was being overloaded when a lot of jobs were running. With the -11smp kernel, the current in the power cable going to the wall was 5.75A with 8 CPU-intensive background processes running, a tar job going to the 5-disk RAID, and a video-intensive molecular simulation program running. With the -22smp kernel, the current was about 6.20A with the system in runlevel 3 with no user jobs running. With 5 CPU-intensive jobs, the current went up to 6.26A and the system shut off. It's possible that the 600W power supply is being overloaded, though I don't know how to prove this. But why does the new kernel cause the system to draw more power?
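Whether the 600W supply was overloaded can be roughly estimated from the wall-current readings above. The sketch below converts those currents into apparent input power and an estimated DC load on the supply. The mains voltage (120 V), power factor (1.0), and supply efficiency (0.75) are assumptions chosen for illustration. None of them were reported in this thread, and the estimate is only as good as those guesses.

```python
def apparent_power_va(current_a, mains_v):
    """Apparent power drawn from the wall, in volt-amperes."""
    return current_a * mains_v


def estimated_dc_load_w(current_a, mains_v, efficiency, power_factor=1.0):
    """Rough DC power delivered by the supply, in watts.

    The 600W rating applies to the DC output side, so the wall-side
    figure is scaled by the power factor and the supply efficiency.
    """
    return apparent_power_va(current_a, mains_v) * power_factor * efficiency


# ASSUMPTIONS, not measurements from this report:
MAINS_V = 120.0     # nominal mains voltage
EFFICIENCY = 0.75   # placeholder power-supply efficiency

# Currents reported in the comment above.
readings_a = {
    "-11smp, 8 CPU jobs + RAID tar + video job": 5.75,
    "-22smp, runlevel 3, idle": 6.20,
    "-22smp, 5 CPU jobs (shut off)": 6.26,
}

for label, amps in readings_a.items():
    va = apparent_power_va(amps, MAINS_V)
    dc_w = estimated_dc_load_w(amps, MAINS_V, EFFICIENCY)
    print(f"{label}: {va:.0f} VA at the wall, ~{dc_w:.0f} W DC (estimated)")
```

Because the result scales directly with the assumed voltage and efficiency, this is a plausibility check only. A clamp meter reading combined with a measured mains voltage, or a wattmeter, would give a firmer number.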
When you run with the -11smp kernel, does it report different bogomips than the -22smp? I'm wondering if it's an issue with power/frequency management...
The CPU frequency and bogomips (reported in /proc/cpuinfo) are the same on both the -11smp and -22smp kernels.
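One way to make this comparison repeatable is to save /proc/cpuinfo under each kernel and compare the two files field by field. The sketch below assumes the usual x86 layout, in which each processor block starts with a "processor" line and contains "cpu MHz" and "bogomips" lines. The function names are hypothetical.

```python
FIELDS = ("cpu mhz", "bogomips")


def parse_cpuinfo(text):
    """Return one dict per processor holding its 'cpu mhz' and 'bogomips'."""
    cpus = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between processor blocks
        key = key.strip().lower()
        if key == "processor":
            cpus.append({})  # a new processor block begins here
        elif key in FIELDS and cpus:
            cpus[-1][key] = float(value.strip())
    return cpus


def differing_cpus(text_a, text_b):
    """Return indices of processors whose frequency or bogomips differ."""
    cpus_a, cpus_b = parse_cpuinfo(text_a), parse_cpuinfo(text_b)
    if len(cpus_a) != len(cpus_b):
        raise ValueError("the two files list a different number of processors")
    return [i for i, (a, b) in enumerate(zip(cpus_a, cpus_b)) if a != b]
```

With the output saved as, say, cpuinfo-11.txt and cpuinfo-22.txt (hypothetical file names), passing the contents of both files to `differing_cpus` returns an empty list when the two kernels report the same values.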
lmsensors info (or other environmental info like motherboard temps, voltage, etc) would also be helpful if possible. Are there any temp or voltage differences between these 2 kernels?
Unfortunately, the Tyan System Monitor (http://www.tyan.com/support/html/software_utilities.html) won't install on this system. The docs say it's supported only through RH 9. Also, the lm_sensors sensors.conf file that Tyan provides for the S4882 board says that it only works with kernel 2.6.10 or higher. It requires the i2c-amd756-s4882 module, which doesn't seem to be present in kernel-smp-2.6.9-11.EL. The docs for i2c itself say not to compile it with any 2.6 kernel. Do you have any other ideas about how I might get the mobo info?
There might be other packages to read this info, but lmsensors is the standard way. Maybe an older version of lmsensors can be used with the older 2.6.9 kernel?

It was my understanding that the AMD 756/766 driver was a module for the RHEL4 U2 kernel:
/lib/modules/2.6.9-21.1.ELsmp/kernel/drivers/i2c/busses/i2c-amd756.ko

modprobe i2c-amd756 should work on any recent RHEL4 kernel. The -11 kernel may not have had this though...
Update: I got a bigger power supply and the problem went away. Apparently the more recent RHEL 4 kernels for some reason cause the system to draw a bit more power. As for lmsensors, I can get partial info for the mobo using the i2c-amd756 module, but the output is essentially identical regardless of which kernel I'm using or how heavy the system load is. I don't think that it was a voltage or something similar going out of range that caused the problem ... I think the power supply was too small. Somewhat as an aside, Tyan's provided sensors.conf file requires the following modules: i2c-amd756-s4882, i2c-isa, lm85, lm63, and w83627hf. i2c-amd756-s4882 and lm63 are not included in the current RHEL 4AS kernel. Would it be possible to include these in a future version?
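To check which of the modules listed above ship with a given kernel, the module tree can be scanned for matching .ko files. The sketch below is a minimal illustration: the function name is hypothetical, and it assumes modules are stored as uncompressed `<name>.ko` files under the directory it is given (as in the /lib/modules path quoted earlier in this thread). Hyphens and underscores in names are treated as equivalent.

```python
import os

# Modules required by Tyan's sensors.conf, as listed in the comment above.
TYAN_MODULES = ["i2c-amd756-s4882", "i2c-isa", "lm85", "lm63", "w83627hf"]


def find_missing_modules(required, module_root):
    """Return the names in `required` that have no .ko file under `module_root`."""

    def normalize(name):
        # Treat "i2c_isa" and "i2c-isa" as the same module name.
        return name.replace("_", "-")

    present = set()
    for _dirpath, _dirnames, filenames in os.walk(module_root):
        for filename in filenames:
            if filename.endswith(".ko"):
                present.add(normalize(filename[: -len(".ko")]))
    return [name for name in required if normalize(name) not in present]
```

On the running system this could be called as `find_missing_modules(TYAN_MODULES, "/lib/modules/" + os.uname().release)`. Based on the comment above, i2c-amd756-s4882 and lm63 would be expected in the result for the current RHEL 4 kernel.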
Since this issue was determined to be caused by hardware, I am closing this as NOTABUG.