Red Hat Bugzilla – Bug 445422
Feature: allow panic on softlockup warnings
Last modified: 2010-10-22 20:45:55 EDT
Escalated to Bugzilla from IssueTracker
We have been hit with a huge number of softlockup problems that have proven hard to track down. We would like the option of having the node panic when it gets a softlockup, so that we can analyze the crashdump. It appears that Ingo Molnar has submitted a patch upstream with the desired behavior. We would like to have it backported to RHEL5.

http://people.redhat.com/mingo/softlockup-patches/softlockup-allow-panic-on-lockup.patch

This event sent from IssueTracker by jwest [SEG - Feature Request] issue 178763
Created attachment 305272: RHEL5 fix for this issue
Is there a bugzilla tracking the softlockup messages? My impression is this patch is just a temporary workaround to a real bug. The better patch would be to add the info we need to analyze these softlockup messages in the future. I see them all the time in my tests and it's hard to tell if they are meaningful or just false positives (where a piece of code is known to take forever and therefore needs to kick the timer).

We added some code in 5.2 to allow more info to be displayed. If that isn't helpful enough, perhaps we should add more?
(In reply to comment #8)
> Is there a bugzilla tracking the softlockup messages?

Ben?

> My impression is this patch is just a temporary workaround to a real bug.

The patch in question would allow a user to get a crashdump for later analysis and free up the system so that it could continue to do whatever it was doing. If this system was a HA system from Stratus, NEC, etc., getting the system back up and running in a normal mode of operation is critical.

> The better patch would be to add the info we need to analyze these softlockup
> messages in the future. I see them all the time in my tests and it's hard to
> tell if they are meaningful or just false positives (where a piece of code is
> known to take forever and therefore needs to kick the timer).

But you're right -- this shouldn't be considered a _solution_ to the softlockup problem and there should be a BZ associated with the softlockup warning messages that LLNL is seeing.

> We added some code in 5.2 to allow more info to be displayed, if that isn't
> helpful enough perhaps we should add more?

I'm not sure what more we should add -- I suppose that adding stack dumps of the other processors *might* be helpful. But that still has the problem that the other processors have continued on after the softlockup...

P.
> Is there a bugzilla tracking the softlockup messages?
>
> Ben?

We have several issues open regarding various softlockup problems that we are working on. We are not quite ready to move to 5.2 yet, but we are working on backporting the patches for the softlockup messages to our 5.1 kernel in the meantime. (Don't worry, we will get there soon, but it takes about 1.5 months after RH does an official release for us to begin rolling out a release.)
in kernel-2.6.18-99.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5
Tracking this bug for the Red Hat Enterprise Linux 5.3 Release Notes.
This bug is now documented in the RHEL5.3 release notes. You can view a mock build of this document at the following link:
http://documentation-stage.bne.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5.3/html-single/Release_Notes/
Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
The soft lockup detector can now be configured to trigger a kernel panic instead of a warning message. This makes it possible for users to generate and analyze a crash dump during a soft lockup for forensic purposes.

To configure the soft lockup detector to generate a panic, set the kernel parameter soft_lockup to 1. This parameter is set to 0 by default.
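For anyone who wants to check or flip this setting from a script, here is a minimal sketch in Python. It assumes the flag is exposed as a single-integer file under /proc/sys. The upstream kernel names that sysctl kernel.softlockup_panic, while the release note above calls the parameter soft_lockup, so the path is passed in as an argument rather than hard-coded. The helper names read_panic_flag and write_panic_flag are hypothetical and not part of any Red Hat tooling.

```python
from pathlib import Path


def read_panic_flag(path):
    """Return True/False for the 0/1 flag stored at `path`.

    Returns None when the file does not exist, which is what you would
    see on a kernel that does not carry the panic-on-softlockup patch.
    """
    flag_file = Path(path)
    if not flag_file.exists():
        return None
    return int(flag_file.read_text().strip()) != 0


def write_panic_flag(path, enabled):
    """Write 1 (panic on soft lockup) or 0 (warn only) to `path`.

    Writing under /proc/sys needs root; the path itself is an assumption
    supplied by the caller (see the note above the code).
    """
    Path(path).write_text("1\n" if enabled else "0\n")
```

On a kernel that follows the upstream naming, the call would look like read_panic_flag("/proc/sys/kernel/softlockup_panic"). A result of None suggests the running kernel either lacks the feature or uses a different name for it.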
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html