Red Hat Bugzilla – Bug 457256
if one CPU is hung, any stopkernel action will hang the system
Last modified: 2011-10-17 10:07:30 EDT
Description of problem:
If one CPU is hung for some reason on an SMP system, any action that does a
stopkernel will result in a system hang. A particularly insidious case of this
is trying to start the crash utility to look at the live system in order to
determine what the nonresponding CPU is doing. Starting crash tries to insmod
the crash.ko driver, and insmod is one of the activities that uses stopkernel.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Easiest way is to make a trivial driver module whose init routine goes into an
2. Insmod this driver on an SMP system.
3. Attempt to start crash.
System hangs. On our architecture this results in an NMI due to the BMC watchdog
timer not getting fed.
The test person who found this was hoping to diagnose the cause of the stuck CPU
by looking at the live system using crash.
Not many actions in the kernel use stopmachine, but intel-rng.ko was one of
them. We have occluded that for our users with a dummy module (since we cannot
support the hardware random number generator in a fault-tolerant way anyhow) so
that this case no longer occurs here. We have a more-robust version of the
stopkernel idea that we use in some of our own device drivers to avoid causing
the same problem when we need to synchronize the actions of all the (live) CPUs.
I've been in this situation before with a newer Intel processor.
Charlotte, you mention that you have a patch for a more robust stop_machine() -- any chance you could attach it to this BZ? Also, what HW are you seeing this issue on?
You should also point your TAM to this BZ so that they can track it.
Charlotte, any chance we could see a patch?
Hi, Prarit - Any multiprocessor system will do; no special or unusual hardware needed other than that. I think when I reported it I was using an 8-core system; the one I have now has 4 cores. Concoct some code you can insmod that hangs up one CPU and then try to run crash on the live system, and you'll see this happen.
I will try to find time to work on a patch for you. May not be real soon as I am trying to bring up new hardware just now (and so am up to my neck in troubles already).
FYI: to work around this w/respect to the crash utility, you can
do this, say, in /etc/rc.local or by hand:
$ modprobe crash
and then, after the hang, you can run crash on the live system like this:
$ crash --memory_module crash
If you don't add the "--memory_module crash" arguments, the crash session will
still come up OK, but it will rmmod the crash kernel module upon exit. The
extra arguments will serve to keep the crash kernel module loaded.
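For anyone scripting this workaround, here is a rough sketch of a guard that only starts crash when the module was preloaded, since loading it after a CPU has hung is exactly what triggers the problem. The function names `module_loaded` and `crash_command`, and the optional argument giving an alternate modules list, are made up for illustration; the only things taken from this BZ are the module name and the `--memory_module crash` arguments. It assumes the usual /proc/modules layout, where the first field of each line is the module name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if module $1 appears in the modules list $2.
# $2 defaults to /proc/modules; the first field of each line is the name.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# Hypothetical helper: print the crash command line to use, but only if the
# crash module is already loaded. It never tries to load the module itself,
# because that load is what would hang on a system with a stuck CPU.
crash_command() {
    if module_loaded crash "${1:-/proc/modules}"; then
        echo "crash --memory_module crash"
    else
        echo "crash module not preloaded; loading it now may hang the system" >&2
        return 1
    fi
}
```

Calling `crash_command` prints the command line from the workaround when the module is present, and otherwise prints a warning and returns non-zero instead of attempting the load.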