Bug 457256 - if one CPU is hung, any stopkernel action will hang the system
Summary: if one CPU is hung, any stopkernel action will hang the system
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.2
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Prarit Bhargava
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks: 694224
Reported: 2008-07-30 15:34 UTC by Charlotte Richardson
Modified: 2011-10-17 14:07 UTC (History)
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-10-17 14:07:30 UTC
Target Upstream Version:
Embargoed:



Description Charlotte Richardson 2008-07-30 15:34:07 UTC
Description of problem:
If one CPU is hung for some reason on an SMP system, any action that does a
stopkernel will result in a system hang. A particularly insidious case of this
is trying to start the crash utility to look at the live system in order to
determine what the nonresponding CPU is doing. Starting crash tries to insmod
the crash.ko driver, and insmod is one of the activities that uses stopkernel.


Version-Release number of selected component (if applicable):
Any


How reproducible:
Always


Steps to Reproduce:
1. The easiest way is to make a trivial driver module whose init routine goes
into an infinite loop.
2. Insmod this driver on an SMP system.
3. Attempt to start crash.
  
Actual results:
System hangs. On our architecture this results in an NMI due to the BMC watchdog
timer not getting fed.


Expected results:
The test person who found this was hoping to diagnose the cause of the stuck CPU
by looking at the live system using crash.


Additional info:
Not many actions in the kernel use stopmachine, but intel-rng.ko was one of
them. We have occluded that for our users with a dummy module (since we cannot
support the hardware random number generator in a fault-tolerant way anyhow) so
that that case no longer occurs here. We have a more-robust version of the
stopkernel idea that we use in some of our own device drivers to avoid causing
the same problem when we need to synchronize the actions of all the (live) CPUs.
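
The hang described above, and the general shape of a rendezvous that tolerates
it, can be modelled in user space. This is only a rough sketch: Python threads
stand in for CPUs, and the names rendezvous and cpu_worker, as well as the
timeout policy, are assumptions made for illustration. It is not the kernel's
stop_machine code and not the reporter's driver implementation.

```python
import threading
import time


def cpu_worker(cpu_id, hung, stop_requested, acks, lock):
    """Model one CPU: a healthy CPU acknowledges the stop request."""
    if hung:
        # Stands in for a CPU stuck in an infinite loop: it never acknowledges.
        return
    stop_requested.wait()
    with lock:
        acks.add(cpu_id)


def rendezvous(num_cpus, hung_cpus, timeout):
    """Ask every CPU to stop and wait at most `timeout` seconds for the acks.

    Returns (acked, missing). A naive rendezvous corresponds to an unbounded
    wait, which never finishes when `missing` is non-empty.
    """
    stop_requested = threading.Event()
    acks = set()
    lock = threading.Lock()
    threads = [
        threading.Thread(
            target=cpu_worker,
            args=(cpu, cpu in hung_cpus, stop_requested, acks, lock),
            daemon=True,
        )
        for cpu in range(num_cpus)
    ]
    for thread in threads:
        thread.start()

    stop_requested.set()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with lock:
            if len(acks) == num_cpus:
                break
        time.sleep(0.01)

    with lock:
        acked = set(acks)
    missing = set(range(num_cpus)) - acked
    return acked, missing


if __name__ == "__main__":
    # Hypothetical 4-CPU system in which CPU 2 is hung.
    acked, missing = rendezvous(num_cpus=4, hung_cpus={2}, timeout=1.0)
    print("acknowledged:", sorted(acked), "unresponsive:", sorted(missing))
```

In this model the bounded wait is what keeps one stuck CPU from taking the
remaining ones down with it: the caller learns which CPUs did not respond and
can decide whether to proceed with only the live ones.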

Comment 1 Prarit Bhargava 2008-08-07 12:18:49 UTC
I've been in this situation before with a newer Intel processor. 

Charlotte, you mention that you have a patch for a more robust stop_machine() -- any chance you could attach it to this BZ?  Also, what HW are you seeing this issue on?

You should also point your TAM to this BZ so that they can track it.

P.

Comment 2 Prarit Bhargava 2008-12-02 13:08:49 UTC
Charlotte, any chance we could see a patch?

P.

Comment 3 Charlotte Richardson 2008-12-02 15:02:38 UTC
Hi, Prarit - Any multiprocessor system will do; no special or unusual hardware needed other than that. I think when I reported it I was using an 8-core system; the one I have now has 4 cores. Concoct some code you can insmod that hangs up one CPU and then try to run crash on the live system, and you'll see this happen.

I will try to find time to work on a patch for you. May not be real soon as I am trying to bring up new hardware just now (and so am up to my neck in troubles already).

/Charlotte

Comment 4 Dave Anderson 2008-12-05 13:24:28 UTC
FYI: to work around this w/respect to the crash utility, you can
do this, say, in /etc/rc.local or by hand:

  $ modprobe crash

and then, after the hang, you can run crash on the live system like this:

  $ crash --memory_module crash

If you don't add the "--memory_module crash" arguments, the crash session will
still come up OK, but it will rmmod the crash kernel module upon exit.  The
extra arguments will serve to keep the crash kernel module loaded.

