Bug 138815 - [RHEL3-U5][Diskdump] Stalls before printing "CPU frozen"
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: All
OS: Linux
Priority: medium
Severity: high
Assigned To: Tatsuo Uchida
QA Contact: Brian Brock
Reported: 2004-11-11 07:11 EST by Jun'ichi NOMURA
Modified: 2007-11-30 17:07 EST
Doc Type: Bug Fix
Last Closed: 2005-05-18 09:28:28 EDT

Attachments
I/O load with dd (510 bytes, text/plain)
2004-11-11 07:15 EST, Jun'ichi NOMURA
Patch to fix deadlock in smp_call_function (2.19 KB, patch)
2004-12-01 00:45 EST, Jun'ichi NOMURA
Patch to use dump_smp_call_function (5.67 KB, patch)
2004-12-09 09:16 EST, Jun'ichi NOMURA
The patch for ppc64 produced by IBM (5.82 KB, patch)
2004-12-14 10:45 EST, Yuuichi Nagahama
ppc64 support for dump_smp_call_function(), v3 (5.66 KB, patch)
2004-12-15 14:27 EST, Daniel Stekloff
dump_smp_call_function(i386 and x86-64 ) (10.53 KB, patch)
2004-12-20 00:50 EST, Kazuko Kimura

Description Jun'ichi NOMURA 2004-11-11 07:11:49 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; ja-JP; rv:1.7.2)
Gecko/20040820 Debian/1.7.2-4

Description of problem:
When a panic occurs while the megaraid2.o driver is in use,
diskdump stalls before printing any messages on the console.


Version-Release number of selected component (if applicable):
2.4.21-23.EL

How reproducible:
Sometimes

Steps to Reproduce:
1. Set up diskdump with the dump device on a disk driven by megaraid2.o
2. Put I/O load on the disk that contains the dump device
3. Crash the kernel with SysRq+c

Actual Results:  After the oops message, no diskdump-related message
is printed on the console.

Expected Results:  The following messages appear after the oops message:
 CPU frozen: #0#1#2#3
 CPU#0 is executing diskdump.
 start dumping
 check dump partition
 ...


Additional info:

This problem does not occur with aic79xx.
Comment 1 Jun'ichi NOMURA 2004-11-11 07:15:18 EST
Created attachment 106491 [details]
I/O load with dd

The attached shell script (io.sh) and dbench were used
to put I/O load on the disk.
Comment 2 Ernie Petrides 2004-11-11 19:34:07 EST

*** This bug has been marked as a duplicate of 138814 ***
Comment 3 Jun'ichi NOMURA 2004-12-01 00:41:26 EST
The cause of this problem is not the same as that of bug 138814.
With the attached patch applied, the problem seems to disappear.
The patch does not affect any code other than diskdump/netdump
initialization.

The problem occurs in the following sequence.
It is independent of megaraid2 and may also occur with netdump.

  dumping CPU                 other CPU
  ------------------------------------------------------------
  <start crash dump>
    disk_dump
      local_irq_disable
                                 <any operation>
                                   smp_call_function
                                     spin_lock(call_lock)
                                     [Waiting for IPI being processed]
      smp_call_function
        spin_lock(call_lock)

As IA-32 uses IPIs for highmem-related functions, the problem
seems to be more prominent on that architecture.
Logically, it can occur on other platforms as well.

Since netdump also uses smp_call_function to freeze CPUs,
it may also suffer from this problem.
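
The sequence above can be restated as a small model in plain C. This is only a sketch, not kernel code and not the attached patch: the types toy_cpu and toy_lock, the helper dumper_can_acquire() and the two scenario functions are hypothetical names introduced for illustration. The model assumes that a CPU holding call_lock releases it only after the dumping CPU has acknowledged its IPI, which cannot happen while the dumping CPU has interrupts disabled. It also assumes that a lock reserved for the dump path is never held by anyone else.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LOCK_FREE (-1)

/* Toy model only: one flag per CPU, one owner id per lock. */
struct toy_cpu {
    bool irqs_enabled;  /* can this CPU service an incoming IPI? */
};

struct toy_lock {
    int owner;          /* id of the CPU holding the lock, or LOCK_FREE */
};

/*
 * Can the dumping CPU ever acquire 'lock'?
 * Assumption: a holder releases the lock only after the dumping CPU has
 * acknowledged the holder's IPI, which requires interrupts to be enabled
 * on the dumping CPU.
 */
bool dumper_can_acquire(const struct toy_cpu *dumper,
                        const struct toy_lock *lock)
{
    assert(dumper != NULL && lock != NULL);
    if (lock->owner == LOCK_FREE)
        return true;
    return dumper->irqs_enabled;
}

/* Dumping CPU and the other CPU share call_lock: the reported stall. */
bool shared_call_lock_case(void)
{
    struct toy_cpu dumper = { .irqs_enabled = false }; /* local_irq_disable */
    struct toy_lock call_lock = { .owner = 1 };        /* held by CPU 1 */
    return dumper_can_acquire(&dumper, &call_lock);
}

/* Dumping CPU uses its own lock, which no other path ever takes. */
bool dedicated_dump_lock_case(void)
{
    struct toy_cpu dumper = { .irqs_enabled = false };
    struct toy_lock dump_call_lock = { .owner = LOCK_FREE };
    return dumper_can_acquire(&dumper, &dump_call_lock);
}
```

In this model shared_call_lock_case() reports that the dumping CPU can never proceed, while dedicated_dump_lock_case() reports that it can. The model says nothing about how the dump IPI itself is delivered; that part is architecture specific.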
Comment 4 Jun'ichi NOMURA 2004-12-01 00:45:47 EST
Created attachment 107688 [details]
Patch to fix deadlock in smp_call_function

A similar patch may be needed for other platforms.
Comment 5 Jeffrey Moyer 2004-12-01 10:25:46 EST
Nice work debugging this!  The description sounds sane, and the patch
looks good.  One thing that comes to mind is that you could do a
spin_trylock on the call_lock, and only fall back to the dump version of
smp_call_function if you don't get the lock.

-Jeff
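
A minimal userspace sketch of the try-then-fall-back idea, for readers following the thread. It is not the kernel implementation: toy_spinlock, toy_trylock(), toy_unlock() and choose_ipi_path() are hypothetical stand-ins built on C11 atomic_flag, whereas kernel code would call spin_trylock() on call_lock. The sketch only decides which path to take; the actual cross-CPU call is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for a kernel spinlock (hypothetical). */
typedef struct {
    atomic_flag held;
} toy_spinlock;

/* Try once, never spin. Returns true if the lock was acquired. */
static bool toy_trylock(toy_spinlock *l)
{
    /* test_and_set returns the previous value: false means it was free. */
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

static void toy_unlock(toy_spinlock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

enum ipi_path { NORMAL_PATH, DUMP_PATH };

/* Use the regular path if call_lock is free, else the dump-specific one. */
enum ipi_path choose_ipi_path(toy_spinlock *call_lock)
{
    assert(call_lock != NULL);
    if (toy_trylock(call_lock)) {
        /* ... the regular smp_call_function work would run here ... */
        toy_unlock(call_lock);
        return NORMAL_PATH;
    }
    /* Lock is busy: do not spin, take the dump-specific path instead. */
    return DUMP_PATH;
}
```

A failed trylock only says the lock was busy at that instant. It does not say who holds it or whether the holder will ever release it, so the fallback path has to work regardless of the lock's state.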
Comment 6 Jun'ichi NOMURA 2004-12-08 01:30:31 EST
Thank you for the comment.
As the trylock checking is racy, I did not see any merit in
using it to decide whether the dump-specific version is called.
Calling the dump-specific smp_call_function unconditionally makes
the code easier to read and its behaviour more predictable.

Though I didn't dare to do it, directly calling dump_smp_call_function
from netdump/diskdump may be less intrusive, as we could remove the
'wait < 0' check and go back to the upstream version.
Comment 7 Dave Anderson 2004-12-08 08:40:43 EST
> Though I didn't dare to do it, directly calling dump_smp_call_function
> from netdump/diskdump may be less intrusive, as we could remove the
> 'wait < 0' check and go back to the upstream version.

I think I like that idea even better...

In the case of a dump operation, no special consideration should be
given to the other processors except to prevent multiple attempts to
take the dump_call_lock, right?  Why should we even bother using
smp_call_function()?
Comment 8 Jeffrey Moyer 2004-12-08 09:49:43 EST
> Why should we even bother using smp_call_function()?

The reason is this:

If you manage to get the call_lock, then other processors will _not_
get it, which means they will not send IPIs to other CPUs.  Because
they are all calling smp_call_function with interrupts enabled, we are
certain that they will then receive our IPI.

> As the trylock checking is racy,

spin_trylock is not racy.  What leads you to believe this?

Having said that, I am not against simply calling the dump version of
smp_call_function from netdump and diskdump.  Anything which
simplifies this call path is a good thing.  Plus, it will make it
easier to maintain.
Comment 9 Jun'ichi NOMURA 2004-12-09 09:16:03 EST
Created attachment 108206 [details]
Patch to use dump_smp_call_function

> Having said that, I am not against simply calling the dump version of
> smp_call_function from netdump and diskdump.
Well, I updated the patch.

> spin_trylock is not racy.  What leads you to believe this?
I meant that we can't assume anything about the lock's state if trylock fails.
Comment 10 Dave Anderson 2004-12-09 09:36:43 EST
Note that diskdump in RHEL3-U4 currently supports x86, ia64 and x86_64,
and is scheduled to support ppc64 in RHEL3-U5.  A patch adding netdump
support for x86_64, ia64 and ppc64 has also been posted internally and is
likewise scheduled for RHEL3-U5.

So there will be a need for 4 versions of the dump_smp_call_function().
Comment 11 Yuuichi Nagahama 2004-12-09 10:31:34 EST
The work assignments for this issue are as follows:
x86   : NEC
ia64  : Fujitsu
x86_64: NEC
ppc64 : IBM
Comment 12 Yuuichi Nagahama 2004-12-14 10:45:24 EST
Created attachment 108521 [details]
The patch for ppc64 produced by IBM

The patch was sent by Daniel Stekloff (IBM) on 12/14.
Comment 13 Daniel Stekloff 2004-12-15 14:27:16 EST
Created attachment 108644 [details]
ppc64 support for dump_smp_call_function(), v3

This is a new version of the ppc64 support for dump_smp_call_function.
I added a return at the end of the function.
Comment 14 Nobuyuki Akiyama 2004-12-16 07:13:44 EST
*** Bug 139434 has been marked as a duplicate of this bug. ***
Comment 15 Nobuyuki Akiyama 2004-12-16 07:31:23 EST
*** Bug 139421 has been marked as a duplicate of this bug. ***
Comment 16 Kazuko Kimura 2004-12-20 00:50:11 EST
Created attachment 108875 [details]
dump_smp_call_function(i386 and x86-64 )

This patch implements dump_smp_call_function(), v3, for i386 and x86-64.
Comment 18 Tatsuo Uchida 2005-01-20 19:04:01 EST
I posted to rhkernel-list on 1/20.
Comment 19 Ernie Petrides 2005-02-03 18:24:35 EST
A fix for this problem has just been committed to the RHEL3 U5
patch pool this afternoon (in kernel version 2.4.21-27.11.EL).
Comment 20 Tim Powers 2005-05-18 09:28:28 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-294.html
