Description of problem:
We get frequent kernel bug errors in the logs on a newly-installed ia64 system with 4 physical CPUs. These look like:

kernel: BUG: soft lockup detected on CPU#3!

See attached logfile for the full call traces. Each lockup results in the system becoming unresponsive for a while, so this is obviously problematic for our servers.

Version-Release number of selected component (if applicable):
kernel-2.6.18-53.1.4.el5

How reproducible:
Occurs apparently randomly, every 10-15 minutes. The system is not heavily loaded (CPU or disk IO) - in fact, it was only recently installed and is not yet running any production services.

Steps to Reproduce:
1. Boot system normally.

Actual results:
System reports kernel bugs in the logs.

Expected results:
System boots and does not report kernel bugs in the logs.

Additional info:
We didn't see this problem with the original installed kernel (2.6.18-53.el5) - the problem only started to occur once we 'yum update'd to EL5.1. Our temporary workaround will be to downgrade the kernel - I'll follow up on this bug report to say whether this fixes the problem for us. This particular machine was previously running RHAS3 with no problems. (We reinstalled, not upgraded, for RHEL5.)
Created attachment 289835: Excerpt of /var/log/messages
We have since added the kernel option "nosoftlockups", which resulted in frequent spontaneous reboots of the server. We then downgraded to 2.6.18-53.el5. The server still has soft lockups when there is no load on the machine, but they are less frequent.
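For anyone wanting to compare lockup frequency between kernels, here is a minimal sketch that tallies the soft lockup messages per CPU. It assumes the message format quoted in the description ("BUG: soft lockup detected on CPU#3!"); the function name count_soft_lockups is made up for illustration and is not part of any existing tool.

```python
import re
from collections import Counter

# Assumed message format, taken from the line quoted in the report:
#   kernel: BUG: soft lockup detected on CPU#3!
LOCKUP_RE = re.compile(r"BUG: soft lockup detected on CPU#(\d+)!")


def count_soft_lockups(lines):
    """Return a Counter mapping CPU number to the count of soft lockup messages."""
    counts = Counter()
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

It accepts any iterable of log lines, so it can be passed an open file object for /var/log/messages (or the attached excerpt) once per kernel version and the totals compared.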
Any chance you could try the base kernel as well?
Sorry - I guess we forgot to update this. Turned out that some new memory had recently been installed in that server, and it wasn't correctly seated. Once it was correctly installed, the soft lockup problem disappeared. So I guess you can close this bug report, unless you think that this is still a genuine kernel bug in that it should have reacted differently to this misperforming hardware.
Thanks for the update. It sounds more like the firmware's responsibility to correctly report the physical memory map...