Bug 808450 - RHEL 5.8 kernel 2.6.18-308 causes PROC IERR errors in the ESM and hangs on Dell C2100 servers (no kernel messages produced)
Status: CLOSED CANTFIX
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.8
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Charles Rose (RH)
QA Contact: Red Hat Kernel QE team
Depends On:
Blocks:
Reported: 2012-03-30 08:32 EDT by csb sysadmin
Modified: 2016-03-07 08:44 EST

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-03-13 18:02:20 EDT


Attachments:
Screenshot of errors (265.38 KB, image/png), 2012-05-03 08:03 EDT, ansuls

Description csb sysadmin 2012-03-30 08:32:55 EDT
Description of problem:

Installed RHEL 5.8 with the 2.6.18-308 kernel on a Dell PE C2100. Within 2 days the server would hang under no load, with no output to the console and no kernel messages in the logs. The only trace was this error in the ESM:

PROC_IERR_STATUS: Processor sensor, IERR was asserted

and this in the OMSA logs:

CPU 11 has an internal error (IERR).

Switched the CPUs to different sockets to see if the CPU 11 error would move (it did not) and replaced the motherboard, but the error kept happening. Downgraded to the RHEL 5.4 kernel 2.6.18-164.15.1.el5 and the server has been up for 6+ days now.

Version-Release number of selected component (if applicable):

kernel-2.6.18-308.el5.x86_64

How reproducible:

Always.

Steps to Reproduce:
1. I'm using a regimen of 1 hour of load followed by 1 hour of no load, driven by stress and cron, to get the system to crash more quickly. However, the system always crashes while it is under 0 load and running the 5.8 kernel:

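#!/bin/bash
# /root/start_stress
# Run stress for 1 hour, then stop it; cron restarts it every 2 hours.
# system has 144GB RAM, dual X5670 processors

cd /tmp || exit 1
# 24 CPU workers, 2 I/O workers, 2 disk workers (10G each), 12 VM workers (10G each).
# Redirect stdout first, then stderr, so both are discarded.
stress -v -c 24 -i 2 -d 2 --hdd-bytes 10G -m 12 --vm-bytes 10G > /dev/null 2>&1 &
sleep 1h
killall stress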

2. cron:

0 */2 * * *  /root/start_stress > /dev/null 2>&1
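
A note on redirection order, since it matters when silencing a job like the stress run above: bash applies redirections left to right, so `2>&1 > /dev/null` and `> /dev/null 2>&1` behave differently. The sketch below is only an illustration; `emit` is a hypothetical helper, not part of the original report.

```shell
#!/bin/bash
# Hypothetical helper: writes one line to stdout and one to stderr.
emit() {
    echo "to stdout"
    echo "to stderr" >&2
}

# "2>&1 > /dev/null": stderr is first duplicated onto the current stdout,
# and only then is stdout pointed at /dev/null, so the stderr line escapes.
leaky=$( emit 2>&1 > /dev/null )

# "> /dev/null 2>&1": stdout goes to /dev/null first, then stderr follows it,
# so nothing escapes.
silent=$( emit > /dev/null 2>&1 )

printf 'leaky=[%s] silent=[%s]\n' "$leaky" "$silent"
# prints: leaky=[to stderr] silent=[]
```

Under cron, any output that escapes this way is normally mailed to the job's owner, so the second form is the one to use when the goal is to discard everything.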

Actual results:

System crashes with

PROC_IERR_STATUS: Processor sensor, IERR was asserted

and:

CPU 11 has an internal error (IERR).

errors in the BMC/ESM.
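
For tracking how often these events recur across kernel versions, a minimal sketch follows. It assumes the ESM or OMSA log has already been exported to a plain text file; the export step, the file layout, and the `count_ierr` helper are all hypothetical and not taken from the report. Only the two IERR messages in the sample data are quoted from above.

```shell
#!/bin/bash
# count_ierr FILE: print how many lines in FILE mention an IERR event.
# Assumes FILE is a plain-text export of the hardware log (hypothetical format).
count_ierr() {
    grep -c 'IERR' "$1"
}

# Sample data: the two messages quoted in this report, plus one
# made-up unrelated line to show that it is not counted.
sample=$(mktemp)
cat > "$sample" <<'EOF'
PROC_IERR_STATUS: Processor sensor, IERR was asserted
Fan 3 speed within normal range
CPU 11 has an internal error (IERR).
EOF

count_ierr "$sample"   # prints: 2
rm -f "$sample"
```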


Expected results:

System should not crash

Additional info:

Downgrading to the RHEL 5.4 kernel 2.6.18-164.15.1.el5 fixes the problem. Haven't tried the 5.5 - 5.7 kernels.
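
Because the hang leaves no kernel messages, one way to narrow down when it happened, and on which kernel, is a small heartbeat log written from cron. This is a sketch under assumptions, not something used in the report: the `heartbeat_line` function, the `HEARTBEAT_LOG` variable, and the log path are hypothetical, and it relies on `/proc/loadavg`, which is Linux-specific.

```shell
#!/bin/bash
# heartbeat_line: print one line with a UTC timestamp, the running kernel
# release, and the 1-minute load average (first field of /proc/loadavg).
heartbeat_line() {
    local load
    read -r load _ < /proc/loadavg
    printf '%s kernel=%s load1=%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(uname -r)" "$load"
}

# Hypothetical log location; override with HEARTBEAT_LOG if desired.
LOG=${HEARTBEAT_LOG:-/tmp/heartbeat.log}
heartbeat_line >> "$LOG"
# Flush to disk so the last line has a chance of surviving a hard hang.
sync
```

Run once a minute from cron, the last line in the log after a power cycle gives an approximate time of the hang and the load just before it, which can then be compared with the timestamp of the IERR event in the BMC/ESM.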
Comment 1 csb sysadmin 2012-04-16 10:31:29 EDT
The system crashed with the same error using the "certified" 5.4 kernel. But it had an uptime of 21+ days this time rather than 2-3 with the 5.8 kernel.
Comment 2 ansuls 2012-05-03 05:49:01 EDT
I am also facing the same issue with the RHEL 6.0 64-bit OS on an Intel S7000FC4Ur motherboard with 16GB RAM and 2 CPUs.

Intermittently the machine reports a CPU IERR error. After a reboot it works fine during working hours, then reports CPU IERR again at 0 load.

It may work for 2 days, but then the issue comes up again. We have changed the entire hardware but the problem still persists.

The machine was working fine with RHEL 5.5 64-bit without any issue.
Comment 3 csb sysadmin 2012-05-03 07:48:53 EDT
We just got our replacement C2100 and are testing it now. 4 days uptime so far under same regimen of tests with 5.4, so far no crashes. We won't give it a clean bill of health until it's up for at least 1 month though.
Comment 4 ansuls 2012-05-03 08:00:31 EDT
Is your machine running fine with RHEL 5.8 after the replacement?
Comment 5 ansuls 2012-05-03 08:03:26 EDT
Created attachment 581837
Screenshot of errors

Display drivers error
Comment 6 csb sysadmin 2012-05-03 08:05:01 EDT
5.4
Comment 7 csb sysadmin 2012-05-03 08:08:52 EDT
Do you get those errors all the time, or just before it crashes?
Comment 8 ansuls 2012-05-03 08:13:49 EDT
Yes, normally those errors are reported continuously. Have you got the same errors?
Comment 9 csb sysadmin 2012-05-03 08:20:53 EDT
OK, then I doubt it's related to that. The C2100 uses ASPEED graphics. I wasn't able to check the console after it crashed, even through the Dell BMC iKVM, which basically showed a blank screen, and as mentioned previously there were no kernel messages in the logs after power cycling it.
Comment 10 csb sysadmin 2013-11-01 16:54:52 EDT
Just to update this: we got fed up with the C2100, and Dell replaced it with a PE 720xd several months ago. I would not recommend these cloud edge servers; you're better off buying Supermicros.
Comment 11 RHEL Product and Program Management 2014-03-07 08:32:55 EST
This bug/component is not included in scope for RHEL-5.11.0 which is the last RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX (at the end of RHEL5.11 development phase (Apr 22, 2014)). Please contact your account manager or support representative in case you need to escalate this bug.
Comment 12 Charles Rose (RH) 2014-03-12 16:17:43 EDT
Closing based on comment #10.

Please feel free to reopen if necessary.
