Bug 808450 - RHEL 5.8 kernel 2.6.18-308 causes PROC IERR errors in the ESM and hangs on Dell C2100 servers (no kernel messages produced)
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.8
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Charles Rose (Dell)
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2012-03-30 12:32 UTC by csb sysadmin
Modified: 2016-03-07 13:44 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-03-13 22:02:20 UTC
Target Upstream Version:
Embargoed:


Attachments
Screenshot of errors (265.38 KB, image/png)
2012-05-03 12:03 UTC, ansuls

Description csb sysadmin 2012-03-30 12:32:55 UTC
Description of problem:

Installed RHEL 5.8 with the 2.6.18-308 kernel on a Dell PE C2100. Within 2 days the server would hang under no load, with no output on the console and no kernel messages in the logs. The only trace was this error in the ESM:

PROC_IERR_STATUS: Processor sensor, IERR was asserted

and this in the OMSA logs:

CPU 11 has an internal error (IERR).

Swapped the CPUs between sockets to see whether the CPU 11 error would follow the processor (it did not) and replaced the motherboard, but the error kept happening. Downgraded to the RHEL 5.4 kernel 2.6.18-164.15.1.el5 and the server has been up for 6+ days now.

Version-Release number of selected component (if applicable):

kernel-2.6.18-308.el5.x86_64

How reproducible:

Always.

Steps to Reproduce:
1. I'm using a regimen of 1 hour of load followed by 1 hour of no load, driven by stress and cron, to get the system to crash more quickly. However, the system always crashes while it is under 0 load and running the 5.8 kernel:

#!/bin/bash
# /root/start_stress
# system has 144GB RAM, dual X5670 processors

cd /tmp
stress -v -c 24 -i 2 -d 2 --hdd-bytes 10G -m 12 --vm-bytes 10G > /dev/null 2>&1 &
sleep 1h
killall stress

2. cron:

0 */2 * * * /root/start_stress > /dev/null 2>&1
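
Redirection order matters in scripts and cron entries like the ones above. Bourne-style shells process redirections left to right, so `2>&1 > /dev/null` points stderr at the original stdout before stdout is sent to /dev/null, and stderr output still escapes (under cron it would typically end up in the job's mail). A minimal sketch of the difference follows; `emit` is a hypothetical helper used only for illustration and is not part of the reproduction scripts:

```shell
#!/bin/bash
# emit is a hypothetical helper: it writes one line to stdout and one to stderr.
emit() {
    echo "to stdout"
    echo "to stderr" >&2
}

# stderr is duplicated onto the current stdout first, then stdout is sent
# to /dev/null. The stderr line therefore still reaches the capture.
leaky=$( emit 2>&1 > /dev/null )

# stdout is sent to /dev/null first, then stderr is duplicated onto it.
# Nothing reaches the capture.
quiet=$( emit > /dev/null 2>&1 )

echo "leaky: [$leaky]"
echo "quiet: [$quiet]"
```

This only affects where the output of stress ends up. It should have no bearing on the hang itself.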

Actual results:

System crashes with 

PROC_IERR_STATUS: Processor sensor, IERR was asserted

and :

CPU 11 has an internal error (IERR).

errors in the BMC/ESM.
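
For anyone trying to confirm the same signature after a hang, the two messages above can be searched for in saved log text. This is a rough sketch only: `count_ierr` is a hypothetical helper, and the sample input is simply the two lines quoted in this report, not a full log excerpt.

```shell
#!/bin/bash
# count_ierr is a hypothetical helper: it reads log text on stdin and
# prints the number of lines that mention IERR.
count_ierr() {
    grep -c 'IERR'
}

# Sample input: the two messages quoted in this report.
sample='PROC_IERR_STATUS: Processor sensor, IERR was asserted
CPU 11 has an internal error (IERR).'

ierr_count=$(printf '%s\n' "$sample" | count_ierr)
echo "IERR lines: $ierr_count"
```

On a machine with ipmitool installed, the System Event Log could be piped through the same helper, for example `ipmitool sel list | count_ierr`. Whether the C2100 BMC reports the event with this exact wording in that output is an assumption.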


Expected results:

System should not crash

Additional info:

Downgrading to the RHEL 5.4 kernel 2.6.18-164.15.1.el5 fixes the problem. Haven't tried the 5.5 - 5.7 kernels.

Comment 1 csb sysadmin 2012-04-16 14:31:29 UTC
The system crashed with the same error using the "certified" 5.4 kernel. But it had an uptime of 21+ days this time rather than 2-3 with the 5.8 kernel.

Comment 2 ansuls 2012-05-03 09:49:01 UTC
I am also facing the same issue with RHEL 6.0 64-bit on an Intel S7000FC4Ur motherboard with 16GB of RAM and 2 CPUs.

The machine intermittently gives a CPU IERR error. After a reboot it works fine during working hours, then gives the CPU IERR again under 0 load.

It may work for 2 days, but then the issue comes up again. We have changed the entire hardware, but the problem still persists.

The machine was working fine with RHEL 5.5 64-bit without any issue.

Comment 3 csb sysadmin 2012-05-03 11:48:53 UTC
We just got our replacement C2100 and are testing it now. It has 4 days of uptime so far under the same regimen of tests with 5.4, with no crashes. We won't give it a clean bill of health until it's been up for at least 1 month, though.

Comment 4 ansuls 2012-05-03 12:00:31 UTC
Is your machine running fine with RHEL 5.8 after the replacement?

Comment 5 ansuls 2012-05-03 12:03:26 UTC
Created attachment 581837
Screenshot of errors

Display drivers error

Comment 6 csb sysadmin 2012-05-03 12:05:01 UTC
5.4

Comment 7 csb sysadmin 2012-05-03 12:08:52 UTC
Do you get those errors all the time, or just before it crashes?

Comment 8 ansuls 2012-05-03 12:13:49 UTC
Yes, normally those errors are reported continuously. Have you gotten the same errors?

Comment 9 csb sysadmin 2012-05-03 12:20:53 UTC
OK, then I doubt it's related to that. The C2100 uses ASPEED graphics. I wasn't able to check the console after it crashed, even through the Dell BMC iKVM, which basically showed a blank screen, and as mentioned previously there were no kernel messages in the logs after power cycling it.

Comment 10 csb sysadmin 2013-11-01 20:54:52 UTC
Just to update this, we got fed up with the C2100 and Dell replaced it with a PE 720xd several months ago. I would not recommend these cloud edge servers; you're better off buying Supermicros.

Comment 11 RHEL Program Management 2014-03-07 13:32:55 UTC
This bug/component is not included in scope for RHEL-5.11.0 which is the last RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX (at the end of RHEL5.11 development phase (Apr 22, 2014)). Please contact your account manager or support representative in case you need to escalate this bug.

Comment 12 Charles Rose (Dell) 2014-03-12 20:17:43 UTC
Closing based on comment #10.

Please feel free to reopen if necessary.

