Description of problem:
If one CPU is hung for some reason on an SMP system, any action that calls stop_machine() will hang the whole system. A particularly insidious case is starting the crash utility to examine the live system in order to determine what the nonresponding CPU is doing. Starting crash tries to insmod the crash.ko driver, and insmod is one of the operations that uses stop_machine().

Version-Release number of selected component (if applicable):
Any

How reproducible:
Always

Steps to Reproduce:
1. The easiest way is to write a trivial driver module whose init routine goes into an infinite loop.
2. insmod this driver on an SMP system.
3. Attempt to start crash.

Actual results:
System hangs. On our architecture this results in an NMI because the BMC watchdog timer is no longer fed.

Expected results:
The test person who found this was hoping to diagnose the cause of the stuck CPU by looking at the live system using crash.

Additional info:
Not many actions in the kernel use stop_machine(), but intel-rng.ko was one of them. We have masked that case for our users with a dummy module (since we cannot support the hardware random number generator in a fault-tolerant way anyhow), so it no longer occurs here. We have a more robust version of the stop_machine() idea that we use in some of our own device drivers to avoid causing the same problem when we need to synchronize the actions of all the (live) CPUs.
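To make the failure mode concrete, here is a minimal userspace sketch of the rendezvous problem described above. It is neither the kernel's stop_machine() code nor the reporter's more robust variant (which is not shown in this report). Threads stand in for CPUs, and every name in it (run_rendezvous, cpu_model, the timeout parameter) is hypothetical. The initiator asks every "CPU" to check in; a hung one never does, so a wait with no deadline would spin forever, while a bounded wait can give up and report failure.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <time.h>

#define MAX_CPUS 16

/* Hypothetical model of one CPU; "hung" means it never answers. */
struct cpu_model {
    int hung;
};

static atomic_int stop_requested; /* set by the initiator */
static atomic_int acked;          /* number of CPUs that checked in */
static atomic_int shutdown_all;   /* model only: lets every thread exit */

static void sleep_ms(long ms)
{
    struct timespec ts = { .tv_sec = ms / 1000,
                           .tv_nsec = (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

static long now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000L + ts.tv_nsec / 1000000L;
}

static void *cpu_thread(void *arg)
{
    struct cpu_model *cpu = arg;
    int have_acked = 0;

    while (!atomic_load(&shutdown_all)) {
        /* A healthy CPU notices the request once and checks in. */
        if (!cpu->hung && !have_acked && atomic_load(&stop_requested)) {
            atomic_fetch_add(&acked, 1);
            have_acked = 1;
        }
        sleep_ms(1);
    }
    return NULL;
}

/*
 * Start ncpus model CPUs (hung_cpu is the index of the stuck one, or -1
 * for none), request a rendezvous, and wait at most timeout_ms.
 * Returns 0 if every CPU checked in, -1 if the wait timed out.
 */
int run_rendezvous(int ncpus, int hung_cpu, long timeout_ms)
{
    pthread_t tid[MAX_CPUS];
    struct cpu_model cpu[MAX_CPUS];
    int i, result = -1;
    long deadline;

    assert(ncpus > 0 && ncpus <= MAX_CPUS);
    atomic_store(&stop_requested, 0);
    atomic_store(&acked, 0);
    atomic_store(&shutdown_all, 0);

    for (i = 0; i < ncpus; i++) {
        cpu[i].hung = (i == hung_cpu);
        if (pthread_create(&tid[i], NULL, cpu_thread, &cpu[i]) != 0)
            abort();
    }

    atomic_store(&stop_requested, 1);
    deadline = now_ms() + timeout_ms;
    while (now_ms() < deadline) {
        if (atomic_load(&acked) == ncpus) {
            result = 0;
            break;
        }
        sleep_ms(1);
    }

    /* Tear the model down; a real hung CPU could not be released this way. */
    atomic_store(&shutdown_all, 1);
    for (i = 0; i < ncpus; i++)
        pthread_join(tid[i], NULL);
    return result;
}
```

Built with something like cc -pthread, run_rendezvous(4, -1, 2000) should return 0, while run_rendezvous(4, 2, 100) should return -1 because model CPU 2 never checks in. Without the deadline, that second call would never return, which is the hang reported here.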
I've been in this situation before with a newer Intel processor.

Charlotte, you mention that you have a patch for a more robust stop_machine() -- any chance you could attach it to this BZ? Also, what HW are you seeing this issue on?

You should also point your TAM to this BZ so that they can track it.

P.
Charlotte, any chance we could see a patch?

P.
Hi, Prarit -

Any multiprocessor system will do; no special or unusual hardware is needed other than that. I think when I reported it I was using an 8-core system; the one I have now has 4 cores. Concoct some code you can insmod that hangs up one CPU and then try to run crash on the live system, and you'll see this happen.

I will try to find time to work on a patch for you. It may not be real soon, as I am trying to bring up new hardware just now (and so am up to my neck in troubles already).

/Charlotte
FYI: to work around this with respect to the crash utility, you can load the module ahead of time, say, in /etc/rc.local or by hand:

    $ modprobe crash

and then, after the hang, you can run crash on the live system like this:

    $ crash --memory_module crash

If you don't add the "--memory_module crash" arguments, the crash session will still come up OK, but it will rmmod the crash kernel module upon exit. The extra arguments keep the crash kernel module loaded.