Bug 430452 - kernel hangs when hitting sysrq-w repeatedly
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Hardware: All
OS: Linux
Priority: low
Severity: high
Target Milestone: rc
Assigned To: Ivan Vecera
QA Contact: Martin Jenner
Reported: 2008-01-28 00:38 EST by Makito SHIOKAWA
Modified: 2008-09-04 12:31 EDT (History)
Doc Type: Bug Fix
Last Closed: 2008-06-24 04:15:18 EDT

Attachments
Proposed patch (1.14 KB, patch)
2008-05-29 08:44 EDT, Ivan Vecera
sysrq-w deadlock fix patch (4.90 KB, patch)
2008-06-26 21:59 EDT, Makito SHIOKAWA

Description Makito SHIOKAWA 2008-01-28 00:38:22 EST
Description of problem:
The kernel hangs when sysrq-w is hit repeatedly. It appears to be a deadlock
between sysrq-w and haldaemon's CD-ROM drive polling.

CPU0                           CPU1
do_IRQ()                       system_call()
...                            ...
  /* IRQ disabled */
                                       /* call_lock taken */
                                       while(atomic_read(&data.started) != cpus)
                                       /* waiting for IPI response */
      /* waiting for call_lock */

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Check whether haldaemon is running
2. Enable sysrq
3. Hit alt-sysrq-w repeatedly

Actual results:
The kernel hangs.

Expected results:
The sysrq-w output is shown normally.

Additional info:
It also occurs on Red Hat Enterprise Linux 5.1 (e.g. by hitting sysrq-w
repeatedly while mount/umount is run repeatedly).
It can be avoided by changing spin_lock(&call_lock) to spin_trylock(&call_lock)
in smp_call_function().
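
The workaround can be pictured with a small user-space model. This is only a
sketch under stated assumptions: call_lock here is a C11 atomic_flag rather than
the kernel spinlock, and call_function_nowait() and count_call() are
hypothetical names, not kernel functions. Returning -EBUSY is also an
assumption; the attached patch may report the failure differently. The point is
that a caller who cannot take the lock gives up instead of spinning with IRQs
disabled, so it can leave the interrupt handler and answer the pending IPI.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the kernel's call_lock. */
static atomic_flag call_lock = ATOMIC_FLAG_INIT;

/* Try to take the lock once; true if acquired, false if already held. */
static bool call_lock_trylock(void)
{
    return !atomic_flag_test_and_set_explicit(&call_lock,
                                              memory_order_acquire);
}

static void call_lock_unlock(void)
{
    atomic_flag_clear_explicit(&call_lock, memory_order_release);
}

/*
 * Model of a cross-CPU call that refuses to spin.
 * If another caller holds the lock, return -EBUSY immediately so the
 * caller can back out of its interrupt handler. Calling func() directly
 * stands in for sending the IPI and waiting for the other CPUs.
 */
static int call_function_nowait(void (*func)(void *), void *info)
{
    if (!call_lock_trylock())
        return -EBUSY;
    func(info);
    call_lock_unlock();
    return 0;
}

/* Example callback: counts how many times it was invoked. */
static void count_call(void *info)
{
    ++*(int *)info;
}
```

With the lock free, call_function_nowait(count_call, &n) runs the callback and
returns 0. If the lock is already held, it returns -EBUSY and the callback is
skipped, which in this model corresponds to one sysrq-w request being dropped
rather than hanging the machine.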
Comment 1 Ivan Vecera 2008-05-22 09:17:50 EDT
Could you please try a test kernel package and report the results? The packages
are available at: http://people.redhat.com/ivecera/rhel-4-ivtest/

Comment 2 Makito SHIOKAWA 2008-05-23 07:01:38 EDT
I can test it, but is the corresponding patch or SRPM available? I would like
to understand how the problem is handled before doing the test.
Comment 3 Ivan Vecera 2008-05-29 08:44:07 EDT
Created attachment 307045
Proposed patch

OK, I'm attaching a patch that solves this issue. I was able to reproduce the
bug on kernel-smp-2.6.9-67.EL. The problem is that smp_call_function is called
while IRQs are disabled.
Upstream introduced similar functionality for SysRq+L (see
h=5045bcae0fb466a1dbb6af0036e56901fd7aafb7), but that code doesn't call
smp_call_function directly; it uses schedule_work instead. I used the same
approach in my patch.
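
The deferral idea can be sketched in user space as follows. This is not the
attached patch: struct deferred_work, defer_work(), run_deferred() and
count_dump() are hypothetical names used only for illustration. In the kernel,
schedule_work() queues a work item that a worker thread runs later in process
context, where IRQs are enabled; the model below only mimics that split between
"record the request now" and "do the work later".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical single-slot work item, loosely modelled on a work_struct. */
struct deferred_work {
    void (*func)(void *);   /* function to run later */
    void *data;             /* argument passed to func */
    bool pending;           /* true while queued but not yet run */
};

/*
 * "Interrupt context" side: only mark the work as pending.
 * Returns false if the work was already queued, so repeated
 * requests collapse into one.
 */
static bool defer_work(struct deferred_work *w)
{
    if (w->pending)
        return false;
    w->pending = true;
    return true;
}

/*
 * "Process context" side: run the work if it is pending.
 * Returns true if the function was actually called.
 */
static bool run_deferred(struct deferred_work *w)
{
    if (!w->pending)
        return false;
    w->pending = false;
    w->func(w->data);
    return true;
}

/* Example work function: counts how many dumps were performed. */
static void count_dump(void *data)
{
    ++*(int *)data;
}
```

In this model the keyboard handler would call defer_work() and return at once,
and a later call to run_deferred() would do the expensive part. Hitting the key
several times before the worker runs results in a single dump.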
I ran some tests myself and the problem seems to be solved, but I would like
to ask you to test it as well. The patched kernels (for i686 and x86_64) are
located at: http://people.redhat.com/ivecera/rhel-4-ivtest/
Comment 4 Makito SHIOKAWA 2008-06-03 05:30:05 EDT
I've tested kernel-smp-2.6.9-70.EL.ivtest.3.i686.rpm and also applied your
patch to my kernel manually, and confirmed that the problem no longer
reproduces. I think this issue is solved now. Thanks for your reply.
Comment 5 Ivan Vecera 2008-06-24 04:15:18 EDT
Unfortunately the proposed patch was rejected by other engineers. The reason is
that sysrq-w was designed to run in interrupt context and has always been a
"use in case of an emergency" option. It should only be used by an
administrator/service personnel with console access if the system is already
frozen in some manner. It was never meant to be beaten on continuously as you
are doing; if you do that, eventually you will catch another cpu at just the
wrong time.
This feature will be removed in RHEL-6.
Comment 6 Makito SHIOKAWA 2008-06-26 21:59:35 EDT
Created attachment 310406
sysrq-w deadlock fix patch

OK, then how about just changing spin_lock() to spin_trylock() in
smp_call_function()? This still runs in interrupt context. (FYI: the attached
patch is how I avoided this problem on x86/x86_64 by doing so.) This problem can
occur even when sysrq-w is hit just once, if the timing is exactly wrong. So I
think it should be fixed somehow until the feature is removed.
Comment 7 Ivan Vecera 2008-09-04 12:31:57 EDT
Your patch probably solves this issue, but it is not based on anything in upstream. Our engineers don't want to fix it this way. The sysrq-w functionality was mainly used as a Red Hat debugging tool in the past and will be removed.
