Red Hat Bugzilla – Bug 138815
[RHEL3-U5][Diskdump] Stalls before printing "CPU frozen"
Last modified: 2007-11-30 17:07:05 EST
Description of problem:
When a panic occurs while the megaraid2.o driver is in use,
diskdump stalls before printing any messages on the console.
Steps to Reproduce:
1. Set up diskdump with the dump device on a disk driven by megaraid2.o
2. Put I/O load on the disk that contains the dump device
3. Crash the kernel with SysRq+c
Actual Results: After the oops message, no diskdump-related message
is printed on the console.
Expected Results: The messages below follow the oops message:
CPU frozen: #0#1#2#3
CPU#0 is executing diskdump.
check dump partition
This problem does not occur with aic79xx.
Created attachment 106491
I/O load with dd
The attached shell script (io.sh) and dbench are used
to put I/O load on the disk.
*** This bug has been marked as a duplicate of 138814 ***
The cause of this problem is not the same as that of #138814.
When I apply the attached patch, the problem seems to disappear.
The patch does not affect any code other than diskdump/netdump.
The problem occurred in the following sequence.
It is independent of megaraid2 and may also occur with netdump.
    dumping CPU              other CPU
    <start crash dump>
                             [Waiting for IPI being processed]
Because IA-32 uses IPIs for highmem-related functions, the problem
seems more prominent on that architecture.
Logically, it can occur on other platforms as well.
Since netdump also uses smp_call_function to freeze CPUs,
it may also suffer from this problem.
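The sequence above can be modelled in a few lines of userspace C. This is only a toy sketch, not the kernel code or the attached patch. It assumes the dumping CPU runs with interrupts disabled, so it never processes the other CPU's IPI. The helpers try_lock, unlock, other_cpu_starts_ipi and dumping_cpu_can_freeze are made-up names, and dump_call_lock stands for a separate lock used only by the dump path.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-ins for the kernel spinlocks; not the real implementation. */
atomic_flag call_lock = ATOMIC_FLAG_INIT;      /* shared by all smp_call_function callers */
atomic_flag dump_call_lock = ATOMIC_FLAG_INIT; /* assumed dump-only lock */

bool try_lock(atomic_flag *lock)
{
    /* test_and_set returns the previous value: false means we took the lock. */
    return !atomic_flag_test_and_set(lock);
}

void unlock(atomic_flag *lock)
{
    atomic_flag_clear(lock);
}

/* The other CPU takes call_lock, sends its IPI and then waits for the IPI
 * to be processed. It keeps holding call_lock for the whole wait. */
void other_cpu_starts_ipi(void)
{
    (void)try_lock(&call_lock);
}

/* The dumping CPU spins on `lock` before sending its freeze IPI.
 * Assumption: its interrupts are off, so it never processes the other
 * CPU's IPI and the other CPU never drops call_lock while we spin.
 * Returns true if the lock was obtained within max_spins attempts. */
bool dumping_cpu_can_freeze(atomic_flag *lock, int max_spins)
{
    for (int i = 0; i < max_spins; i++) {
        if (try_lock(lock)) {
            unlock(lock);   /* the freeze IPI would be sent here */
            return true;
        }
    }
    return false;           /* in the kernel this spin would never end */
}
```

After other_cpu_starts_ipi(), dumping_cpu_can_freeze(&call_lock, 1000) returns false, while dumping_cpu_can_freeze(&dump_call_lock, 1000) returns true because no other CPU ever takes the dump-only lock.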
Created attachment 107688
Patch to fix deadlock in smp_call_function
A similar patch may be needed for other platforms.
Nice work debugging this! The description sounds sane, and the patch
looks good. One thing that comes to mind is that you could do a
spin_trylock on the call_lock, and only revert to the dump version of
smp_call_function if you don't get the lock.
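One way to read this suggestion, as a userspace sketch rather than kernel code: try call_lock once and fall back to the dump-specific path only when that fails. The enum freeze_path, the function freeze_other_cpus and the atomic_flag used as a lock are illustrative assumptions; the real paths would be smp_call_function and the dump version from the patch.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for the kernel's call_lock spinlock. */
atomic_flag call_lock = ATOMIC_FLAG_INIT;

/* Which path the dumping CPU ended up taking (hypothetical names). */
enum freeze_path { PATH_NORMAL, PATH_DUMP };

bool call_lock_trylock(void)
{
    /* test_and_set returns the previous value: false means we took the lock. */
    return !atomic_flag_test_and_set(&call_lock);
}

void call_lock_unlock(void)
{
    atomic_flag_clear(&call_lock);
}

enum freeze_path freeze_other_cpus(void)
{
    if (call_lock_trylock()) {
        /* Lock obtained: no other CPU can start a new IPI, so the
         * regular smp_call_function path would be used here. */
        call_lock_unlock();
        return PATH_NORMAL;
    }
    /* Lock busy: some other CPU holds it, so skip call_lock and
     * use the dump-specific version instead. */
    return PATH_DUMP;
}
```

freeze_other_cpus() returns PATH_NORMAL while call_lock is free and PATH_DUMP while another caller holds it. A failed trylock only says the lock was busy at that instant, which is the concern raised in the replies below.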
Thank you for the comment.
As the trylock check is racy, I did not see any merit in using it
to decide whether the dump-specific version is called or not.
Calling the dump-specific smp_call_function unconditionally makes
the code easier to read and the behaviour more predictable.
Though I did not dare to do it, calling dump_smp_call_function directly
from netdump/diskdump may be less intrusive, as we could remove the
'wait < 0' check and bring smp_call_function back to the upstream version.
> Though I didn't dare to do it, directly call dump_smp_call_function
> from netdump/diskdump may be less intrusive as we can remove
> 'wait < 0' checking and bring back to the upstream version.
I think I like that idea even better...
In the case of a dump operation, no special consideration should be
given to the other processors except to prevent multiple attempts to
take the dump_call_lock, right? Why should we even bother using
smp_call_function()?
> Why should we even bother using smp_call_function()?
The reason is this:
If you manage to get the call_lock, then other processors will _not_
get it, which means they will not send IPIs to other CPUs. Because
they are all calling smp_call_function with interrupts enabled, we are
certain that they will then receive our IPI.
> As the trylock checking is racy,
spin_trylock is not racy. What leads you to believe this?
Having said that, I am not against simply calling the dump version of
smp_call_function from netdump and diskdump. Anything which
simplifies this call path is a good thing. Plus, it will make it
easier to maintain.
Created attachment 108206
Patch to use dump_smp_call_function
> Having said that, I am not against simply calling the dump version of
> smp_call_function from netdump and diskdump.
Well, I updated the patch.
> spin_trylock is not racy. What leads you to believe this?
I meant we can't assume the lock status if trylock fails.
Note that diskdump in RHEL3-U4 currently supports x86, ia64 and x86_64,
and is scheduled to support ppc64 in RHEL3-U5. A patch adding netdump
support for x86_64, ia64 and ppc64 has been posted internally and is also
scheduled for RHEL3-U5.
So there will be a need for 4 versions of dump_smp_call_function().
The work assignment for the issue is as follows.
x86 : NEC
ia64 : Fujitsu
ppc64 : IBM
Created attachment 108521
The patch for ppc64 produced by IBM
The patch was sent by Daniel Stekloff (IBM) on 12/14.
Created attachment 108644
ppc64 support for dump_smp_call_function(), v3
This is a new version of the ppc64 support for dump_smp_call_function();
I added a return at the end of the function.
*** Bug 139434 has been marked as a duplicate of this bug. ***
*** Bug 139421 has been marked as a duplicate of this bug. ***
Created attachment 108875
dump_smp_call_function() (i386 and x86-64)
This is v3 of the dump_smp_call_function() patch for i386 and x86-64.
I posted it to rhkernel-list on 1/20.
A fix for this problem has just been committed to the RHEL3 U5
patch pool this afternoon (in kernel version 2.4.21-27.11.EL).
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.