Description of problem:
Installed RHEL5.8 with the 2.6.18-308 kernel on a Dell PE C2100. Within 2 days the server would hang with no output to the console, no messages in the kernel, and only this error in the ESM under no load:

PROC_IERR_STATUS: Processor sensor, IERR was asserted

and this in the OMSA logs:

CPU 11 has an internal error (IERR).

Switched the CPUs to different sockets (to see if the CPU 11 error would move, but it did not), replaced the motherboard, but the error kept happening. Downgraded to the RHEL5.4 kernel 2.6.18-164.15.1.el5 and the server has been up for 6+ days now.

Version-Release number of selected component (if applicable):
kernel-2.6.18-308.el5.x86_64

How reproducible:
Always.

Steps to Reproduce:
1. I'm using a 1 hour load 1 hour no load regimen with stress and cron to get the system to crash more quickly, however the system always crashes when it's under 0 load and with the 5.8 kernel:

#!/bin/bash
# /root/start_stress
# system has 144GB RAM, dual X5670 processors
cd /tmp
stress -v -c 24 -i 2 -d 2 --hdd-bytes 10G -m 12 --vm-bytes 10G 2>&1 > /dev/null &
sleep 1h
killall stress

2. cron:

0 */2 * * * /root/start_stress 2>&1 > /dev/null

Actual results:
System crashes with

PROC_IERR_STATUS: Processor sensor, IERR was asserted

and

CPU 11 has an internal error (IERR).

errors in the BMC/ESM.

Expected results:
System should not crash

Additional info:
Downgrading to the RHEL 5.4 2.6.18-164.15.1.el5 fixes the problem. Haven't tried 5.5 - 5.7 kernels.
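Since the hang leaves nothing on the console or in the logs, one way to narrow down when it happens during the load/no-load cycle is a heartbeat log that survives the power cycle. Below is a minimal sketch, not something that was run on this system. The names write_heartbeat, last_heartbeat, HEARTBEAT_LOG and the log path are hypothetical, and it assumes a Linux /proc/loadavg.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original report: append a timestamped
# heartbeat line to a log file and flush it to disk, so the last line written
# before a silent hang is still there after a power cycle.

# Log path is an assumption; override it with the HEARTBEAT_LOG variable.
HEARTBEAT_LOG="${HEARTBEAT_LOG:-/var/log/heartbeat.log}"

write_heartbeat() {
    local now load
    # UTC timestamp of this heartbeat.
    now=$(date -u '+%Y-%m-%dT%H:%M:%SZ')
    # 1, 5 and 15 minute load averages; "n/a" if /proc/loadavg is unreadable.
    load=$(cut -d ' ' -f 1-3 /proc/loadavg 2>/dev/null || echo "n/a")
    printf '%s load=%s\n' "$now" "$load" >> "$HEARTBEAT_LOG"
    # Flush buffered writes so the line is on disk before the next interval.
    sync
}

last_heartbeat() {
    # Most recent heartbeat line, i.e. the last time the system was known alive.
    tail -n 1 "$HEARTBEAT_LOG"
}

# When run directly with --once (e.g. from a per-minute cron entry),
# write a single heartbeat and exit.
if [ "${1:-}" = "--once" ]; then
    write_heartbeat
fi
```

Run with --once from a per-minute cron entry, the last line in the log after a crash would show roughly when the system stopped and what the load average was at that point, which could confirm whether the hang really falls in the idle hour.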
The system crashed with the same error using the "certified" 5.4 kernel. But it had an uptime of 21+ days this time rather than 2-3 with the 5.8 kernel.
I am also facing the same issue with the RHEL 6.0 64-bit OS on an Intel S7000FC4Ur motherboard with 16GB RAM and 2 CPUs. The machine intermittently gives a CPU IERR error. After rebooting, it works fine during working hours and then gives the CPU IERR again under 0 load. It may work for 2 days, but then the issue comes up again. We have replaced all of the hardware, but the problem still persists. The machine was working fine with RHEL 5.5 64-bit without any issue.
We just got our replacement C2100 and are testing it now. 4 days uptime so far under the same regimen of tests with 5.4, and no crashes. We won't give it a clean bill of health until it's up for at least 1 month though.
Is your machine running fine with RHEL 5.8 after the replacement?
Created attachment 581837
Screenshot of errors

Display drivers error
5.4
Do you get those errors all the time, or just before it crashes?
Yes, normally those errors are reported continuously. Have you got the same errors?
OK, then I doubt it's related to that. The C2100 uses ASPEED graphics. I wasn't able to check the console after it crashed, even through the Dell BMC iKVM, which basically showed a blank screen, and as mentioned previously there were no kernel messages in the logs after power cycling it.
Just to update this: we got fed up with the C2100, and Dell replaced it with a PE 720xd several months ago. I would not recommend these cloud edge servers; you're better off buying Supermicros.
This bug/component is not included in scope for RHEL-5.11.0 which is the last RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX (at the end of RHEL5.11 development phase (Apr 22, 2014)). Please contact your account manager or support representative in case you need to escalate this bug.
Closing based on comment #10. Please feel free to reopen if necessary.