Bug 138815 - [RHEL3-U5][Diskdump] Stalls before printing "CPU frozen"
Summary: [RHEL3-U5][Diskdump] Stalls before printing "CPU frozen"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: All
OS: Linux
medium
high
Target Milestone: ---
Assignee: Tatsuo Uchida
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2004-11-11 12:11 UTC by Jun'ichi NOMURA
Modified: 2007-11-30 22:07 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-05-18 13:28:28 UTC
Target Upstream Version:
Embargoed:


Attachments
I/O load with dd (510 bytes, text/plain)
2004-11-11 12:15 UTC, Jun'ichi NOMURA
Patch to fix deadlock in smp_call_function (2.19 KB, patch)
2004-12-01 05:45 UTC, Jun'ichi NOMURA
Patch to use dump_smp_call_function (5.67 KB, patch)
2004-12-09 14:16 UTC, Jun'ichi NOMURA
The patch for ppc64 produced by IBM (5.82 KB, patch)
2004-12-14 15:45 UTC, Yuuichi Nagahama
ppc64 support for dump_smp_call_function(), v3 (5.66 KB, patch)
2004-12-15 19:27 UTC, Daniel Stekloff
dump_smp_call_function(i386 and x86-64 ) (10.53 KB, patch)
2004-12-20 05:50 UTC, Kazuko Kimura


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2005:294 0 normal SHIPPED_LIVE Moderate: Updated kernel packages available for Red Hat Enterprise Linux 3 Update 5 2005-05-18 04:00:00 UTC

Description Jun'ichi NOMURA 2004-11-11 12:11:49 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; ja-JP; rv:1.7.2)
Gecko/20040820 Debian/1.7.2-4

Description of problem:
When a panic occurs while the megaraid2.o driver is in use,
diskdump stalls before printing any messages on the console.


Version-Release number of selected component (if applicable):
2.4.21-23.EL

How reproducible:
Sometimes

Steps to Reproduce:
1. Set up diskdump with the dump device on megaraid2.o
2. Put I/O load on the disk that contains the dump device
3. Crash the kernel with SysRq+c

    

Actual Results:  After the oops message, no diskdump-related message
is printed on the console.

Expected Results:  The messages below follow the oops message:
 CPU frozen: #0#1#2#3
 CPU#0 is executing diskdump.
 start dumping
 check dump partition
 ...


Additional info:

This problem does not occur with aic79xx.

Comment 1 Jun'ichi NOMURA 2004-11-11 12:15:18 UTC
Created attachment 106491 [details]
I/O load with dd

The attached shell script (io.sh) and dbench are used
to put I/O load on the disk.

Comment 2 Ernie Petrides 2004-11-12 00:34:07 UTC

*** This bug has been marked as a duplicate of 138814 ***

Comment 3 Jun'ichi NOMURA 2004-12-01 05:41:26 UTC
The cause of this problem is not the same as that of #138814.
With the attached patch applied, the problem seems to disappear.
The patch does not affect any code other than diskdump/netdump
initialization.

The problem occurred in the following sequence.
It is independent of megaraid2 and may also occur with netdump.

  dumping CPU                 other CPU
  ------------------------------------------------------------
  <start crash dump>
    disk_dump
      local_irq_disable
                                 <any operation>
                                   smp_call_function
                                     spin_lock(call_lock)
                                     [Waiting for IPI being processed]
      smp_call_function
        spin_lock(call_lock)

Because IA-32 uses IPIs for highmem-related functions, the problem
seems more prominent on that architecture.
Logically, it can occur on other platforms as well.

Since netdump also uses smp_call_function to freeze cpus,
it may also suffer from this problem.
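
The interleaving in the table above can be replayed in a toy userspace model. This is only a sketch, not kernel code: `struct sim`, `spin_once()` and `dump_cpu_spins()` are hypothetical names invented for illustration, and the model assumes two CPUs where CPU 1 already holds call_lock and is waiting for CPU 0 (the dumping CPU) to acknowledge its IPI.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the stall; all names here are illustrative, not kernel APIs. */
struct cpu {
    bool irqs_enabled;  /* can this CPU service an incoming IPI? */
    bool ipi_pending;   /* an IPI has been sent to this CPU */
};

struct sim {
    int call_lock_owner;   /* -1 when free, otherwise the owning CPU index */
    int acks_outstanding;  /* IPI acknowledgements the owner still waits for */
    struct cpu cpu[2];
};

/* One iteration of CPU `me` spinning on call_lock. Returns true once taken. */
static bool spin_once(struct sim *s, int me)
{
    /* A pending IPI is only serviced if interrupts are enabled. */
    if (s->cpu[me].irqs_enabled && s->cpu[me].ipi_pending) {
        s->cpu[me].ipi_pending = false;
        if (--s->acks_outstanding == 0)
            s->call_lock_owner = -1;  /* the sender finishes and unlocks */
    }
    if (s->call_lock_owner == -1) {
        s->call_lock_owner = me;
        return true;
    }
    return false;
}

/*
 * CPU 1 holds call_lock and waits for CPU 0 to acknowledge its IPI.
 * Returns the number of spins CPU 0 needs to get call_lock,
 * or -1 if it is still spinning after max_spins iterations.
 */
int dump_cpu_spins(bool dump_irqs_enabled, int max_spins)
{
    struct sim s = {
        .call_lock_owner = 1,
        .acks_outstanding = 1,
        .cpu = { { dump_irqs_enabled, true }, { true, false } },
    };

    assert(max_spins > 0);
    for (int i = 1; i <= max_spins; i++)
        if (spin_once(&s, 0))
            return i;
    return -1;
}
```

In this model, `dump_cpu_spins(true, 1000)` succeeds on the first spin, while `dump_cpu_spins(false, 1000)` returns -1. The second case corresponds to disk_dump calling smp_call_function after local_irq_disable.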


Comment 4 Jun'ichi NOMURA 2004-12-01 05:45:47 UTC
Created attachment 107688 [details]
Patch to fix deadlock in smp_call_function

A similar patch may be needed for other platforms.

Comment 5 Jeff Moyer 2004-12-01 15:25:46 UTC
Nice work debugging this!  The description sounds sane, and the patch
looks good.  One thing that comes to mind is that you could do a
spin_try_lock on the call_lock, and only revert to the dump version of
smp_call_function if you don't get the lock.

-Jeff

Comment 6 Jun'ichi NOMURA 2004-12-08 06:30:31 UTC
Thank you for the comment.
As the trylock checking is racy, I saw no merit in using it
to decide whether to call the dump-specific version.
Calling the dump-specific smp_call_function unconditionally makes
the code easier to read and the behaviour more predictable.

Though I didn't dare to do it, calling dump_smp_call_function directly
from netdump/diskdump may be less intrusive, as we could remove the
'wait < 0' check and restore smp_call_function to the upstream version.
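
The two options discussed in comments 5 and 6 can be compared side by side in a small userspace sketch. It assumes C11 atomics as a stand-in for kernel spinlocks. `struct dump_locks`, `freeze_with_trylock()` and `freeze_unconditionally()` are hypothetical names, not taken from the attached patches. The sketch only models which lock each variant ends up using, not the IPI delivery itself.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for call_lock and a dump-only lock (illustrative). */
struct dump_locks {
    atomic_flag call_lock;
    atomic_flag dump_call_lock;
};

enum freeze_path { VIA_CALL_LOCK, VIA_DUMP_LOCK, ALREADY_DUMPING };

void dump_locks_init(struct dump_locks *l)
{
    assert(l != NULL);
    atomic_flag_clear(&l->call_lock);
    atomic_flag_clear(&l->dump_call_lock);
}

static bool trylock(atomic_flag *f)
{
    return !atomic_flag_test_and_set_explicit(f, memory_order_acquire);
}

static void unlock(atomic_flag *f)
{
    atomic_flag_clear_explicit(f, memory_order_release);
}

/* Comment 5 variant: use the normal path when call_lock happens to be free. */
enum freeze_path freeze_with_trylock(struct dump_locks *l)
{
    if (trylock(&l->call_lock)) {
        /* the freeze IPI would be sent here */
        unlock(&l->call_lock);
        return VIA_CALL_LOCK;
    }
    if (trylock(&l->dump_call_lock))
        return VIA_DUMP_LOCK;   /* never released: the system is dumping */
    return ALREADY_DUMPING;     /* another CPU started the dump first */
}

/* Comment 6 variant: always take the dump-only lock, never touch call_lock. */
enum freeze_path freeze_unconditionally(struct dump_locks *l)
{
    return trylock(&l->dump_call_lock) ? VIA_DUMP_LOCK : ALREADY_DUMPING;
}
```

The unconditional variant has a single code path whatever the state of call_lock, which is the readability and predictability argument made above. The trylock variant takes a different path depending on what other CPUs happen to be doing at the moment of the crash.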


Comment 7 Dave Anderson 2004-12-08 13:40:43 UTC
> Though I didn't dare to do it, calling dump_smp_call_function directly
> from netdump/diskdump may be less intrusive, as we could remove the
> 'wait < 0' check and restore smp_call_function to the upstream version.

I think I like that idea even better...

In the case of a dump operation, no special consideration should be
given to the other processors except to prevent multiple attempts to
take the dump_call_lock, right?  Why should we even bother using
smp_call_function()?


Comment 8 Jeff Moyer 2004-12-08 14:49:43 UTC
> Why should we even bother using smp_call_function()?

The reason is this:

If you manage to get the call_lock, then other processors will _not_
get it, which means they will not send IPI's to other CPUs.  Because
they are all calling smp_call_function with interrupts enabled, we are
certain that they will then receive our IPI.

> As the trylock checking is racy,

spin_trylock is not racy.  What leads you to believe this?

Having said that, I am not against simply calling the dump version of
smp_call_function from netdump and diskdump.  Anything which
simplifies this call path is a good thing.  Plus, it will make it
easier to maintain.

Comment 9 Jun'ichi NOMURA 2004-12-09 14:16:03 UTC
Created attachment 108206 [details]
Patch to use dump_smp_call_function

> Having said that, I am not against simply calling the dump version of
> smp_call_function from netdump and diskdump.
Well, I updated the patch.

> spin_trylock is not racy.  What leads you to believe this?
I meant that we can't assume anything about the lock status if trylock fails.
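
One reading of this point is that a failed trylock is only a snapshot: the holder may release the lock immediately afterwards, so the result says little about the lock's state when the caller acts on it. The single-threaded sketch below replays one such ordering using C11 atomics in place of kernel spinlocks. `busy_observation_goes_stale()` is a hypothetical name used only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static bool trylock(atomic_flag *f)
{
    return !atomic_flag_test_and_set_explicit(f, memory_order_acquire);
}

/*
 * Replays one possible ordering on a single thread:
 *   1. another CPU holds the lock,
 *   2. the dumping CPU's trylock fails,
 *   3. the holder releases the lock right afterwards.
 * Returns true if the "lock is busy" observation was stale by step 3.
 */
bool busy_observation_goes_stale(atomic_flag *lock)
{
    assert(lock != NULL);
    atomic_flag_clear(lock);

    atomic_flag_test_and_set(lock);    /* step 1: holder takes the lock */
    bool saw_busy = !trylock(lock);    /* step 2: trylock fails */
    atomic_flag_clear(lock);           /* step 3: holder releases */

    bool free_now = trylock(lock);     /* the lock can be taken after all */
    if (free_now)
        atomic_flag_clear(lock);
    return saw_busy && free_now;
}
```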

Comment 10 Dave Anderson 2004-12-09 14:36:43 UTC
Note that diskdump in RHEL3-U4 currently supports x86, ia64 and x86_64,
and is scheduled to support ppc64 in RHEL3-U5.  A patch adding netdump
support for x86_64, ia64 and ppc64 has also been posted internally and is
likewise scheduled for RHEL3-U5.

So there will be a need for 4 versions of dump_smp_call_function().


Comment 11 Yuuichi Nagahama 2004-12-09 15:31:34 UTC
The work assignments for this issue are as follows.
x86   : NEC
ia64  : Fujitsu
x86_64: NEC
ppc64 : IBM


Comment 12 Yuuichi Nagahama 2004-12-14 15:45:24 UTC
Created attachment 108521 [details]
The patch for ppc64 produced by IBM

The patch was sent by Daniel Stekloff (IBM) on 12/14.

Comment 13 Daniel Stekloff 2004-12-15 19:27:16 UTC
Created attachment 108644 [details]
ppc64 support for dump_smp_call_function(), v3

This is a new version of the ppc64 support for dump_smp_call_function; I added
a return at the end of the function.

Comment 14 Nobuyuki Akiyama 2004-12-16 12:13:44 UTC
*** Bug 139434 has been marked as a duplicate of this bug. ***

Comment 15 Nobuyuki Akiyama 2004-12-16 12:31:23 UTC
*** Bug 139421 has been marked as a duplicate of this bug. ***

Comment 16 Kazuko Kimura 2004-12-20 05:50:11 UTC
Created attachment 108875 [details]
dump_smp_call_function(i386 and x86-64 )

This patch is for i386 and x86-64 for dump_smp_call_function(), v3.

Comment 18 Tatsuo Uchida 2005-01-21 00:04:01 UTC
I posted to rhkernel-list on 1/20.


Comment 19 Ernie Petrides 2005-02-03 23:24:35 UTC
A fix for this problem has just been committed to the RHEL3 U5
patch pool this afternoon (in kernel version 2.4.21-27.11.EL).


Comment 20 Tim Powers 2005-05-18 13:28:28 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-294.html


