Description of problem:
Kernel hangs when hitting sysrq-w repeatedly. It appears to be a deadlock between sysrq-w and haldaemon's CD-ROM drive polling.

    CPU0                                CPU1
    ----------------------------------------------------------------------
    do_IRQ()                            system_call()
    ...                                 ...
    i8042_interrupt()
    serio_interrupt()
      /* IRQ disabled */
    ...
                                        do_open()
                                        idecd_open()
                                        cdrom_open()
                                        check_disk_change()
                                        __invalidate_device()
                                        invalidate_bdev()
                                        invalidate_bh_lrus()
                                        smp_call_function()
                                          /* call_lock taken */
                                          while(atomic_read(&data.started) != cpus)
                                          /* waiting for IPI response */
    __handle_sysrq()
    sysrq_handle_showcpus()
    smp_call_function()
      spin_lock(&call_lock)
      /* waiting for call_lock */

Version-Release number of selected component (if applicable):
kernel-smp-2.6.9-67.EL

How reproducible:

Steps to Reproduce:
1. Check whether haldaemon is running
2. Enable sysrq
3. Hit alt-sysrq-w repeatedly

Actual results:
Kernel hangs.

Expected results:
The sysrq-w output is shown normally.

Additional info:
It also occurs on Red Hat Enterprise Linux 5.1 (e.g. by hitting sysrq-w repeatedly while mount/umount is run repeatedly). It can be avoided by changing spin_lock(&call_lock) to spin_trylock(&call_lock) in smp_call_function().

Could you please try a test kernel package and report the results? The packages are available at: http://people.redhat.com/ivecera/rhel-4-ivtest/ Thanks
I can test it, but is the corresponding patch or SRPM available? I would like to understand how the problem is handled before running the test.
Created attachment 307045 [details]
Proposed patch

OK, I'm attaching the patch that solves this issue. I tried to reproduce the bug on kernel-smp-2.6.9-67.EL, with "success". The problem is that smp_call_function() is called while IRQs are disabled. Upstream introduced similar functionality for SysRq+L (see http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5045bcae0fb466a1dbb6af0036e56901fd7aafb7), but that code does not call smp_call_function() directly; it uses schedule_work() instead. I used the same approach in my patch. I ran some tests myself and the problem seems to be solved, but I would like to ask you to test it as well. The patched kernels (for i686 and x86_64) are located at: http://people.redhat.com/ivecera/rhel-4-ivtest/
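
The deferral idea in the comment above can be pictured with a small user-space analogy. This is only a sketch under stated assumptions, not the attached patch: struct work, defer_work(), run_pending() and show_cpus() are hypothetical names invented for illustration, standing in loosely for a kernel work item and schedule_work(). The point it shows is that the handler merely marks the work as pending, and the real function runs later from a context where waiting is safe.

```c
#include <stdbool.h>

/* Hypothetical minimal work item, loosely modelled on a kernel work item. */
struct work {
    void (*func)(void *data);   /* function to run later */
    void *data;                 /* argument passed to func */
    bool pending;               /* true while queued but not yet run */
};

/*
 * Called from the (simulated) interrupt handler.
 * It only marks the work pending and never waits on anything.
 * Returns false if the work was already queued.
 */
static bool defer_work(struct work *w)
{
    if (w->pending)
        return false;
    w->pending = true;
    return true;
}

/*
 * Called later from (simulated) process context.
 * Returns true if a pending work item was executed.
 */
static bool run_pending(struct work *w)
{
    if (!w->pending)
        return false;
    w->pending = false;
    w->func(w->data);
    return true;
}

/* Hypothetical stand-in for the sysrq-w handler body: counts invocations. */
static void show_cpus(void *data)
{
    ++*(int *)data;
}
```

In this model the keyboard interrupt path would call defer_work() and return at once, and a later run_pending() call does the actual work. Repeated key presses before the work has run collapse into a single execution.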
I've tested kernel-smp-2.6.9-70.EL.ivtest.3.i686.rpm, and I have also applied your patch to my own kernel manually. In both cases I confirmed that the problem no longer reproduces. I think this issue is solved now. Thanks for your reply.
Unfortunately, the proposed patch was rejected by other engineers. The reason is that sysrq-w was designed to run in interrupt context and has always been a "use in case of an emergency" option. It should only be used by an administrator or service personnel with console access when the system is already frozen in some manner. It was never meant to be beaten on continuously as you are doing; if you do that, eventually you will catch another CPU at just the wrong time. This feature will be removed in RHEL-6.
Created attachment 310406 [details]
sysrq-w deadlock fix patch

OK, then how about just changing spin_lock() to spin_trylock() in smp_call_function()? This still runs in interrupt context. (FYI: the attached patch is how I avoided this problem on x86/x86_64 in that way.) This problem can occur after hitting sysrq-w just once if the timing is exactly wrong. So I think it should be fixed somehow until the feature is removed...
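
As a rough illustration of the trylock idea in the comment above, here is a user-space analogy using a POSIX mutex. It is a sketch under stated assumptions, not the attached patch: call_lock here is an ordinary pthread mutex standing in for the kernel spinlock, and call_function_trylock() and bump() are hypothetical names. It shows a caller giving up when the lock is busy instead of waiting for it.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's call_lock. */
static pthread_mutex_t call_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Hypothetical analogue of smp_call_function() with spin_lock()
 * replaced by a trylock. Returns true if func ran, false if the
 * lock was busy and the call was skipped.
 */
static bool call_function_trylock(void (*func)(void *), void *info)
{
    if (pthread_mutex_trylock(&call_lock) != 0)
        return false;           /* lock busy: bail out instead of waiting */

    func(info);                 /* the cross-CPU call would happen here */
    pthread_mutex_unlock(&call_lock);
    return true;
}

/* Hypothetical callback used for illustration: increments a counter. */
static void bump(void *info)
{
    ++*(int *)info;
}
```

The trade-off in this model is that a request made while the lock is held is simply dropped, so the caller has to tolerate a skipped call.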
Your patch probably solves this issue, but it diverges completely from upstream, and our engineers do not want to fix it this way. The sysrq-w functionality was mainly used as a Red Hat debugging tool in the past and will be removed.