From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; ja-JP; rv:1.7.2) Gecko/20040820 Debian/1.7.2-4

Description of problem:
While using megaraid2.o driver, when panic occurs, diskdump stalls before printing any messages on console.

Version-Release number of selected component (if applicable):
2.4.21-23.EL

How reproducible:
Sometimes

Steps to Reproduce:
1. Set up diskdump with dump device on megaraid2.o
2. Put I/O load on the disk in which dump device is included
3. Crash kernel by SysRq+c

Actual Results:
After oops message, no diskdump related message is printed on console.

Expected Results:
The message below follows oops message:

CPU frozen: #0#1#2#3
CPU#0 is executing diskdump.
start dumping
check dump partition ...

Additional info:
This problem does not occur with aic79xx.
Created attachment 106491: I/O load with dd
The attached shell script (io.sh) and dbench are used to put I/O load on the disk.
*** This bug has been marked as a duplicate of 138814 ***
The cause of this problem is not the same as #138814. With the attached patch applied, the problem seems to disappear. The patch does not affect any code other than diskdump/netdump initialization.

The problem occurred in the following sequence. It is independent of megaraid2 and may also occur with netdump.

dumping CPU                   other CPU
------------------------------------------------------------
<start crash dump>
disk_dump
local_irq_disable             <any operation>
                              smp_call_function
                              spin_lock(call_lock)
                              [Waiting for IPI being processed]
smp_call_function
spin_lock(call_lock)

The other CPU holds call_lock while it waits for its IPI to be processed, but the dumping CPU has interrupts disabled and spins on call_lock, so neither CPU can make progress.

As IA-32 uses IPIs for highmem-related functions, the problem seems prominent on that architecture. Logically, it can occur on other platforms as well. Since netdump also uses smp_call_function to freeze CPUs, it may also suffer from this problem.
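The interleaving described above can be checked with a small userspace toy model. This is only a sketch: the struct fields and function names are invented for illustration and are not kernel symbols, and the model reduces each CPU to a handful of flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these fields are illustrative, not kernel data. */
struct cpu_state {
    bool irqs_enabled;        /* can this CPU service an incoming IPI? */
    bool holds_call_lock;     /* currently owns call_lock */
    bool spins_on_call_lock;  /* blocked in spin_lock(call_lock) */
    bool awaits_ipi_ack;      /* sent an IPI, waits for the peer to run it */
};

/* A CPU is stuck if it spins on a lock the peer holds, or if it waits
 * for an acknowledgement from a peer that cannot take interrupts. */
bool cpu_is_stuck(const struct cpu_state *self, const struct cpu_state *peer)
{
    if (self->spins_on_call_lock && peer->holds_call_lock)
        return true;
    if (self->awaits_ipi_ack && !peer->irqs_enabled)
        return true;
    return false;
}

/* Deadlock: both CPUs are stuck, so neither can unblock the other. */
bool is_deadlocked(const struct cpu_state *a, const struct cpu_state *b)
{
    return cpu_is_stuck(a, b) && cpu_is_stuck(b, a);
}

/* Build the state from the timeline: the dumping CPU has interrupts
 * off; the other CPU owns call_lock and waits for its IPI to be handled.
 * The parameter selects whether the dumper also tries to take call_lock. */
bool crash_dump_deadlocks(bool dumper_takes_call_lock)
{
    struct cpu_state dumper = {
        .irqs_enabled = false,
        .spins_on_call_lock = dumper_takes_call_lock,
    };
    struct cpu_state other = {
        .irqs_enabled = true,
        .holds_call_lock = true,
        .awaits_ipi_ack = true,
    };
    return is_deadlocked(&dumper, &other);
}
```

In this model, `crash_dump_deadlocks(true)` reports a deadlock, while `crash_dump_deadlocks(false)` does not, because a dumper that never contends for call_lock is not blocked by the other CPU.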
Created attachment 107688: Patch to fix deadlock in smp_call_function
A similar patch may be needed for other platforms.
Nice work debugging this! The description sounds sane, and the patch looks good. One thing that comes to mind is that you could do a spin_trylock on the call_lock, and only fall back to the dump version of smp_call_function if you don't get the lock.

-Jeff
Thank you for the comment. As the trylock check is racy, I didn't see any merit in using it to decide whether the dump-specific version is called. Calling the dump-specific smp_call_function unconditionally makes the code easier to read and the behaviour more predictable.

Though I didn't dare to do it, directly calling dump_smp_call_function from netdump/diskdump may be less intrusive, as we could remove the 'wait < 0' check and restore smp_call_function to the upstream version.
> Though I didn't dare to do it, directly call dump_smp_call_function
> from netdump/diskdump may be less intrusive as we can remove
> 'wait < 0' checking and bring back to the upstream version.

I think I like that idea even better... In the case of a dump operation, no special consideration should be given to the other processors except to prevent multiple attempts to take the dump_call_lock, right? Why should we even bother using smp_call_function()?
> Why should we even bother using smp_call_function()?

The reason is this: If you manage to get the call_lock, then other processors will _not_ get it, which means they will not send IPIs to other CPUs. Because they are all calling smp_call_function with interrupts enabled, we are certain that they will then receive our IPI.

> As the trylock checking is racy,

spin_trylock is not racy. What leads you to believe this?

Having said that, I am not against simply calling the dump version of smp_call_function from netdump and diskdump. Anything which simplifies this call path is a good thing. Plus, it will make it easier to maintain.
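The trylock fallback discussed here can be sketched in userspace with C11 atomics. This is only an illustration of the idea, not the kernel code: `toy_spinlock`, `toy_trylock`, `toy_unlock`, `choose_freeze_path` and the `ipi_path` values are hypothetical names, and the real spin_trylock and smp_call_function are not used.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a kernel spinlock. */
typedef struct {
    atomic_flag locked;
} toy_spinlock;

/* Try once to take the lock; never spins. Returns true on success. */
bool toy_trylock(toy_spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked,
                                              memory_order_acquire);
}

void toy_unlock(toy_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

enum ipi_path {
    PATH_NORMAL,  /* we own call_lock; the regular cross-call is safe */
    PATH_DUMP     /* call_lock is busy; use the dump-specific variant */
};

/* Pick the cross-call path for a dumping CPU. On PATH_NORMAL the caller
 * owns call_lock and is responsible for releasing it afterwards. */
enum ipi_path choose_freeze_path(toy_spinlock *call_lock)
{
    if (toy_trylock(call_lock))
        return PATH_NORMAL;
    return PATH_DUMP;
}
```

With a free lock the first call returns `PATH_NORMAL` and leaves the lock held, so a second call on the same lock returns `PATH_DUMP` until `toy_unlock` is called.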
Created attachment 108206: Patch to use dump_smp_call_function

> Having said that, I am not against simply calling the dump version of
> smp_call_function from netdump and diskdump.

Well, I updated the patch.

> spin_trylock is not racy. What leads you to believe this?

I meant we can't assume the lock status if trylock fails.
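A failed trylock only says that call_lock was held at that instant, which is why a dump path that never looks at call_lock is simpler to reason about. The sketch below models that idea in userspace with a dedicated lock. It is not the attached patch: `toy_call_lock`, `toy_dump_call_lock`, `toy_dump_cross_call` and `toy_freeze_request` are hypothetical names, and never releasing the dump lock is an assumption made for the illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for call_lock and a dump-only lock. */
atomic_flag toy_call_lock = ATOMIC_FLAG_INIT;
atomic_flag toy_dump_call_lock = ATOMIC_FLAG_INIT;

/* Example callback: counts how many freeze requests were issued. */
void toy_freeze_request(void *info)
{
    ++*(int *)info;
}

/* Dump-specific cross-call. It contends only on the dump lock, so the
 * state of toy_call_lock is irrelevant. Returns false if another CPU
 * has already started a dump. */
bool toy_dump_cross_call(void (*func)(void *), void *info)
{
    if (atomic_flag_test_and_set_explicit(&toy_dump_call_lock,
                                          memory_order_acquire))
        return false;  /* a dump is already in progress elsewhere */

    func(info);        /* stand-in for sending the freeze request */

    /* Assumption: the lock is never released, because the dumping CPU
     * does not return to normal operation in this model. */
    return true;
}
```

Even while `toy_call_lock` is held by someone else, the first `toy_dump_cross_call` goes through, and any later attempt is rejected.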
Note that diskdump in RHEL3-U4 currently supports x86, ia64 and x86_64, and is scheduled to support ppc64 in RHEL3-U5. A patch adding netdump support for x86_64, ia64 and ppc64 has also been posted internally and is likewise scheduled for RHEL3-U5. So there will be a need for 4 versions of dump_smp_call_function().
The work assignment for the issue is as below.

x86   : NEC
ia64  : Fujitsu
x86_64: NEC
ppc64 : IBM
Created attachment 108521: The patch for ppc64 produced by IBM
The patch was sent by Daniel Stekloff, IBM, on 12/14.
Created attachment 108644: ppc64 support for dump_smp_call_function(), v3
This is a new version of the ppc64 support for dump_smp_call_function(). I added a return at the end of the function.
*** Bug 139434 has been marked as a duplicate of this bug. ***
*** Bug 139421 has been marked as a duplicate of this bug. ***
Created attachment 108875: dump_smp_call_function (i386 and x86-64)
This is v3 of the dump_smp_call_function() patch for i386 and x86-64.
I posted to rhkernel-list on 1/20.
A fix for this problem has just been committed to the RHEL3 U5 patch pool this afternoon (in kernel version 2.4.21-27.11.EL).
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-294.html