430452 – kernel hangs when hitting sysrq-w repeatedly

Summary: kernel hangs when hitting sysrq-w repeatedly

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.6
Hardware:	All
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Ivan Vecera
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2008-01-28 05:38 UTC by Makito SHIOKAWA
Modified:	2008-09-04 16:31 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-06-24 08:15:18 UTC
Target Upstream Version:
Embargoed:

Attachments
Proposed patch (1.14 KB, patch) 2008-05-29 12:44 UTC, Ivan Vecera
sysrq-w deadlock fix patch (4.90 KB, patch) 2008-06-27 01:59 UTC, Makito SHIOKAWA

Description Makito SHIOKAWA 2008-01-28 05:38:22 UTC

Description of problem:
The kernel hangs when sysrq-w is hit repeatedly. This appears to be a deadlock
between sysrq-w and haldaemon's CD-ROM drive polling.

CPU0                           CPU1
----------------------------------------------------------
do_IRQ()                       system_call()
...                            ...
 i8042_interrupt()
  serio_interrupt()
  /* IRQ disabled */
  ...
                                do_open()
                                 idecd_open()
                                  cdrom_open()
                                   check_disk_change()
                                    __invalidate_device()
                                     invalidate_bdev()
                                      invalidate_bh_lrus()
                                       smp_call_function()
                                       /* call_lock taken */
                                       while(atomic_read(&data.started) != cpus)
                                       /* waiting for IPI response */
   __handle_sysrq()
    sysrq_handle_showcpus()
     smp_call_function()
      spin_lock(&call_lock)
      /* waiting for call_lock */

Version-Release number of selected component (if applicable):
kernel-smp-2.6.9-67.EL

How reproducible:

Steps to Reproduce:
1. Check that haldaemon is running
2. Enable sysrq
3. Hit alt-sysrq-w repeatedly

Actual results:
The kernel hangs.

Expected results:
The sysrq-w output is shown normally.

Additional info:
It also occurs on Red Hat Enterprise Linux 5.1 (e.g., by hitting sysrq-w
repeatedly while mount/umount is being run repeatedly).
It can be avoided by changing spin_lock(&call_lock) to spin_trylock(&call_lock)
in smp_call_function().
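
The trylock idea above can be sketched in userspace. This is only a model under
stated assumptions, not the attached patch and not kernel code: call_lock is
modeled with a C11 atomic_flag, and model_smp_call_function(),
call_lock_trylock(), call_lock_unlock() and count_call() are hypothetical names
invented for illustration. The point it shows is that a caller who finds
call_lock held returns -EBUSY instead of spinning with IRQs disabled.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's call_lock (hypothetical model). */
static atomic_flag call_lock = ATOMIC_FLAG_INIT;

/* Returns true if the lock was acquired, false if it was already held. */
static bool call_lock_trylock(void)
{
    return !atomic_flag_test_and_set_explicit(&call_lock,
                                              memory_order_acquire);
}

static void call_lock_unlock(void)
{
    atomic_flag_clear_explicit(&call_lock, memory_order_release);
}

/*
 * Model of a cross-CPU call that gives up instead of spinning.
 * Returns 0 on success, -EBUSY if another caller holds call_lock.
 * In this model func simply runs locally; the real function would
 * send IPIs and wait for the other CPUs to respond.
 */
static int model_smp_call_function(void (*func)(void *), void *info)
{
    assert(func != NULL);

    if (!call_lock_trylock())
        return -EBUSY;      /* lock busy: back off, do not deadlock */

    func(info);
    call_lock_unlock();
    return 0;
}

/* Example callback: counts how many times it has been invoked. */
static void count_call(void *info)
{
    ++*(int *)info;
}
```

With this shape, a sysrq-w request that arrives while another CPU holds the
lock is dropped rather than hanging the machine; whether silently dropping the
request is acceptable is a separate design question.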

Comment 1 Ivan Vecera 2008-05-22 13:17:50 UTC

Could you please try a test kernel package and report the results? The packages
are available at: http://people.redhat.com/ivecera/rhel-4-ivtest/

Thanks

Comment 2 Makito SHIOKAWA 2008-05-23 11:01:38 UTC

I can test it, but is the corresponding patch or SRPM available? I would like to
understand how the issue is handled before running the test.

Comment 3 Ivan Vecera 2008-05-29 12:44:07 UTC

Created attachment 307045
Proposed patch

OK, I'm attaching the patch that solves this issue. I tried to reproduce the bug
on kernel-smp-2.6.9-67.EL with "success". The problem is that
smp_call_function() is called while IRQs are disabled.
Upstream introduced similar functionality for SysRq+L (see
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5045bcae0fb466a1dbb6af0036e56901fd7aafb7),
but that one doesn't call smp_call_function() directly; it uses schedule_work()
instead. I used the same approach in my patch.
I did some tests myself and the problem seems to be solved, but I would like
to ask you for some testing. The patched kernels (for i686 and x86_64) are
located at: http://people.redhat.com/ivecera/rhel-4-ivtest/
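
The deferral idea described above can be illustrated with a small userspace
model. It is a sketch under stated assumptions, not the attached patch:
struct work_item, model_schedule_work(), model_run_work() and show_cpus_work()
are hypothetical names, and the real kernel work queue API is not used here.
The interrupt-context handler only marks the work as pending; the cross-CPU
call would run later from a worker, where IRQs are enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a deferred work item. */
struct work_item {
    void (*func)(void *);   /* work to run later, outside IRQ context */
    void *data;
    bool pending;           /* true while queued and not yet run */
};

/*
 * Called from the (modeled) interrupt handler: only records that work
 * is wanted. Returns false if the item was already pending, so
 * repeated key presses do not queue it twice.
 */
static bool model_schedule_work(struct work_item *w)
{
    assert(w != NULL && w->func != NULL);

    if (w->pending)
        return false;
    w->pending = true;
    return true;
}

/*
 * Called from the (modeled) worker in process context: runs the
 * deferred function if one is pending. Returns true if work was run.
 */
static bool model_run_work(struct work_item *w)
{
    assert(w != NULL && w->func != NULL);

    if (!w->pending)
        return false;
    w->pending = false;
    w->func(w->data);
    return true;
}

/* Example deferred function: counts how many CPU dumps were produced. */
static void show_cpus_work(void *data)
{
    ++*(int *)data;
}
```

In this model, hitting the key twice before the worker runs produces a single
dump, and nothing that waits on other CPUs happens in the interrupt path.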

Comment 4 Makito SHIOKAWA 2008-06-03 09:30:05 UTC

I've tested kernel-smp-2.6.9-70.EL.ivtest.3.i686.rpm and also applied your
patch to my kernel manually, and confirmed that the problem no longer
reproduces. I think this issue is solved now. Thanks for your reply.

Comment 5 Ivan Vecera 2008-06-24 08:15:18 UTC

Unfortunately, the proposed patch was rejected by other engineers. The reason is
that sysrq-w was designed to run in interrupt context and has always been a "use
in case of an emergency" option. It should only be used by an
administrator or service personnel with console access if the system is already
frozen in some manner. It was never meant to be beaten on continuously as you
are doing; if you do that, eventually you will catch another CPU at just the
wrong time.
This feature will be removed in RHEL-6.

Comment 6 Makito SHIOKAWA 2008-06-27 01:59:35 UTC

Created attachment 310406
sysrq-w deadlock fix patch

OK, then how about just changing spin_lock() to spin_trylock() in
smp_call_function()? This still runs in interrupt context. (FYI: the attached
patch is how I avoided this problem on x86/x86_64 in that way.) This problem can
occur even when sysrq-w is hit just once, if the timing is exactly wrong. So, I
think it should be fixed somehow until it is removed...

Comment 7 Ivan Vecera 2008-09-04 16:31:57 UTC

Your patch probably solves this issue, but it diverges completely from upstream. Our engineers don't want to fix it this way. The sysrq-w functionality was mainly used as a Red Hat debugging tool in former times and will be removed.
