Bug 113341 - netdump - various race conditions that lead to hangs in panic()/die()
Summary: netdump - various race conditions that lead to hangs in panic()/die()
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Jeff Moyer
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2004-01-12 21:37 UTC by Ben Woodard
Modified: 2007-11-30 22:07 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2004-12-20 20:54:50 UTC
Target Upstream Version:
Embargoed:


Attachments
patch to fix these problems. (36.73 KB, patch)
2004-01-12 21:38 UTC, Ben Woodard
Change smp_call_function's wait arg to allow -1, to not wait for ipi's to be received. (2.08 KB, patch)
2004-03-01 16:13 UTC, Jeff Moyer


Links
Red Hat Product Errata RHBA-2004:550 (private: 0, priority: normal, status: SHIPPED_LIVE)
  Summary: Updated kernel packages available for Red Hat Enterprise Linux 3 Update 4
  Last Updated: 2004-12-20 05:00:00 UTC
Red Hat Product Errata RHSA-2004:188 (private: 0, priority: normal, status: SHIPPED_LIVE)
  Summary: Important: Updated kernel packages available for Red Hat Enterprise Linux 3 Update 2
  Last Updated: 2004-05-11 04:00:00 UTC

Description Ben Woodard 2004-01-12 21:37:34 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1)
Gecko/20030225

Description of problem:
We have been having problems with processor synchronization when
machines crash. One of our kernel developers spent considerable time
analyzing why this was happening and came up with the following patch.
It goes a long way toward fixing those problems. We do not claim that it
solves every problem in synchronizing processors as a machine crashes,
but we are fairly certain that it solves all the problems we have
observed thus far.

The three problems we have seen are:
1) One CPU happens to be in an interrupt handler with interrupts
disabled at the time of the crash.
2) The NMI watchdog interrupts the crash process.
3) We have seen a couple of hangs in printk(), and Dave believes
these were due to the panic code hitting tiny races in the die/panic
kernel path. Since we narrowed the race window in die(), we have not
seen these.

We have had a variation of this patch running in production for quite a
while and it has been reasonably well tested.

I will open a couple of bugzilla entries to justify these fixes. You can
forward comments regarding this patch through me if you would like or
you can converse with Dave Peterson <dsp> directly. Dave is
currently working on porting netdump to ia64.

From: Dave Peterson <dsp>
To: Ben Woodard <bwoodard>
Cc: Jim Garlick <garlick>
Subject: patch with netdump fixes
Date: Thu, 08 Jan 2004 21:02:50 -0800

Ben,

Attached is a patch for the rh-2_4_21-7EL kernel that
fixes some netdump-related problems.  This is a substantial
rewrite of my previous netdump fixes that will hopefully
be a bit simpler, more robust, and easier to port as netdump
gets ported to platforms other than x86.  It should also
improve the reliability of the shutdown code path even if
netdump is not enabled.  Here is a summary of my changes:

    - The current version of netdump is vulnerable to the
      following failure: CPU A crashes and calls
      smp_call_function() to shut down the other CPUs.
      However, CPU B is hung somewhere, spinning with
      interrupts disabled.  Therefore, B never responds to
      the cross-interrupt and A hangs inside
      smp_call_function().

      The patch fixes this problem by implementing the
      following behavior: CPU A attempts to zap the other
      CPUs in order to get them to shut down.  If they do
      not all respond within a certain timeout, CPU A zaps
      the unresponsive CPUs with a nonmaskable interrupt.
      If any of the remaining CPUs fail to respond to the
      NMI within a certain timeout, CPU A gives up and
      continues executing.

    - The current version of netdump does not shut down
      the NMI watchdog before starting to dump.  The
      watchdog can therefore mess up netdump while it is
      sending the dump.  To fix this, the patch adds some
      code that stops the watchdog individually on each
      CPU.

    - I have modified the shutdown code so that when a
      panic or BUG() occurs, the crashing CPU immediately
      stops other CPUs before doing anything else.  This
      improves the reliability of the shutdown code path
      because fewer things can go wrong once other CPUs
      have been shut down.  Stopping other CPUs as early
      as possible eliminates SMP issues from the rest of
      the code, allowing for greater simplicity.

      Also I fixed things so that the crashing CPU doesn't
      call printk() until it has stopped all other CPUs
      and then called bust_spinlocks().  This prevents
      another CPU from grabbing a lock and crashing after
      bust_spinlocks() has been called, causing the system
      to hang inside printk().  The patch also modifies the
      NMI handler so that when the watchdog bites, it
      avoids calling printk() until other CPUs have been
      shut down and bust_spinlocks() has been called.
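
The escalation described in the first item above (cross-interrupt first, NMI for the CPUs that do not respond, then give up) can be sketched roughly as follows. This is a userspace model written for illustration, not code from the patch: `struct sim_cpu`, `send_stop_ipi()`, `send_stop_nmi()`, and `stop_other_cpus()` are hypothetical names, and the two timeouts are collapsed into single checks because delivery in the model is instantaneous.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one CPU as seen from the crashing CPU. */
struct sim_cpu {
    bool irqs_enabled;    /* can take a maskable cross-interrupt (IPI) */
    bool nmi_deliverable; /* can take an NMI; false models a wedged CPU */
    bool stopped;         /* has acknowledged the shutdown request */
};

/* A maskable shutdown request is ignored while interrupts are disabled. */
static void send_stop_ipi(struct sim_cpu *cpu)
{
    if (cpu->irqs_enabled)
        cpu->stopped = true;
}

/* An NMI gets through even when interrupts are disabled. */
static void send_stop_nmi(struct sim_cpu *cpu)
{
    if (cpu->nmi_deliverable)
        cpu->stopped = true;
}

/* Count the CPUs, other than the caller, that are still running. */
static size_t count_running(const struct sim_cpu *cpus, size_t n, size_t self)
{
    size_t running = 0;
    for (size_t i = 0; i < n; i++)
        if (i != self && !cpus[i].stopped)
            running++;
    return running;
}

/*
 * Escalating shutdown: IPI first, NMI for stragglers, then give up.
 * Returns how many CPUs never responded. Real code would poll for
 * acknowledgements under a bounded timeout after each step; here each
 * timeout is a single check.
 */
size_t stop_other_cpus(struct sim_cpu *cpus, size_t n, size_t self)
{
    assert(self < n);

    for (size_t i = 0; i < n; i++)
        if (i != self)
            send_stop_ipi(&cpus[i]);

    if (count_running(cpus, n, self) == 0)
        return 0;

    for (size_t i = 0; i < n; i++)
        if (i != self && !cpus[i].stopped)
            send_stop_nmi(&cpus[i]);

    /* Never block: report the stragglers and let the caller carry on. */
    return count_running(cpus, n, self);
}
```

The point of the design is that the crashing CPU never waits without a bound: a CPU spinning with interrupts disabled is caught by the NMI step, and a CPU that ignores even the NMI only costs a timeout rather than the whole dump.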

Please take a look at the patch, and send it to Red Hat if
you think they may find it useful.


Thanks,

Dave


Version-Release number of selected component (if applicable):
kernel-2.4.21-7EL

How reproducible:
Sometimes

Steps to Reproduce:
These are race conditions that are sometimes hard to hit.
1. Die while another CPU is in an ISR with interrupts disabled.

2. Take so long to do a netdump that the watchdog NMI triggers.

3. Hit the race in die() with printk(), which is very hard to trigger.
    

Actual Results:  hangs

Expected Results:  netdumps

Additional info:

Comment 1 Ben Woodard 2004-01-12 21:38:20 UTC
Created attachment 96907 [details]
patch to fix these problems.

Comment 2 Jeff Moyer 2004-03-01 16:13:17 UTC
Created attachment 98162 [details]
Change smp_call_function's wait arg to allow -1, to not wait for ipi's to be received.

This patch was accepted into the Taroon U2 pool.  It changes the netdump code
so that it does not wait for IPIs to be received by other CPUs on an SMP system.
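
As a rough illustration of why the extra value matters, the sketch below models when the caller of smp_call_function() would spin forever. It is not taken from the attachment. The enum names and `caller_would_hang()` are hypothetical, and the meanings given to wait values 0 and 1 (wait for every CPU to start the function, and additionally to finish it) are an assumption based on the stock 2.4 i386 implementation.

```c
#include <stdbool.h>

/* Hypothetical names for the three values of the wait argument. */
enum call_wait {
    CALL_WAIT_NONE     = -1, /* new: return right after sending the IPIs */
    CALL_WAIT_STARTED  =  0, /* assumed: wait until every CPU has started */
    CALL_WAIT_FINISHED =  1, /* assumed: also wait until every CPU finished */
};

/*
 * Would the caller spin forever? 'targeted' is the number of other CPUs
 * that were sent an IPI; 'started' and 'finished' are how many of them
 * ever reach each stage.
 */
bool caller_would_hang(enum call_wait wait, int targeted,
                       int started, int finished)
{
    if (wait == CALL_WAIT_NONE)
        return false;               /* nothing is waited for */
    if (started < targeted)
        return true;                /* some CPU never took the IPI */
    return wait == CALL_WAIT_FINISHED && finished < targeted;
}
```

Under these assumptions, a single CPU stuck with interrupts disabled is enough to hang the caller for wait values 0 and 1, while -1 always returns.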

Comment 3 Ernie Petrides 2004-03-05 20:28:00 UTC
Jeff's fix was committed to U2 in kernel version 2.4.21-9.14.EL.


Comment 4 L3support 2004-04-07 02:41:55 UTC
Our customer has run into problems that have the same cause as
Bug#113341.

We also tried the RHEL3 Update 2 Public Beta (kernel-2.4.21-12.EL),
but the same problems occurred.
The problem in Bug#113341 is not solved by Update 2 either.
We ask for an immediate fix.


Description of problem:
 Netdump does not send the dump to the server.
 Usually the dump is sent to the netdump server successfully,
 but sometimes it fails.

  When it fails, the following also occurs.
  1) The following directory and file are created on the netdump server.
     /var/crash/192.168.xxx.xxx-2004-mm-dd-hh:mm
     /var/crash/192.168.xxx.xxx-2004-mm-dd-hh:mm/log
    But the following are NOT created.
     /var/crash/192.168.xxx.xxx-2004-mm-dd-hh:mm/vmdump
     /var/crash/192.168.xxx.xxx-2004-mm-dd-hh:mm/vmdump-incomplete
  2) Unintelligible characters are displayed on the client console.
  3) The client machine does not reboot automatically.

 Our customer says that netdump cannot be relied upon, and is very angry.


Our test and consideration:
 According to our tests, netdump does not seem to function correctly
 on a client with many CPUs.

 We also tried the RHEL3 Update 2 Public Beta (kernel-2.4.21-12.EL), but
 the same problems occurred.

 Our investigation concluded that the following problem from
 Bug#113341 is the cause.

 >3) We have seen a couple of hangs in printk(), and Dave believes
 >these were due to the panic code hitting tiny races in the die/panic
 >kernel path. Since we narrowed the race window in die(), we have not
 >seen these.

 With RHEL3 Update 1 (kernel-2.4.21-9.EL) plus Dave's patch,
 the problem did not occur.
 https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=96907&action=view

Our demand:
 Why wasn't Dave's patch adopted?
 The problem in Bug#113341 is not solved by Update 2 either.
 We ask for an immediate fix.


Version-Release number:
 kernel-2.4.21-9.EL  kernel-2.4.21-12.EL
 netdump-0.6.11-3

How reproducible:
 Sometimes

Steps to Reproduce:
1. Boot.
2. Wait 2-3 minutes.
3. Press Alt + SysRq + c.

Expected results:
 Successful netdump in every case.
 1) The following directory and files are created on the netdump server.
    /var/crash/192.168.xxx.xxx-2004-mm-dd-hh:mm
    /var/crash/192.168.xxx.xxx-2004-mm-dd-hh:mm/log
    /var/crash/192.168.xxx.xxx-2004-mm-dd-hh:mm/vmdump
 2) The client machine reboots automatically.

Additional info:

Severity: 1 (We ask for immediate correction.)


Comment 5 L3support 2004-04-07 04:22:54 UTC
To: Ben Woodard
To: Dave Peterson
To: Someone who knows about this bug

I have a problem where netdump hangs with RHEL3-U2-Beta. I found that
the problem went away when I removed the printk() call in freeze_cpu().
I also tried Dave Peterson's patch, and it resolved our problem as well.
I think the following point is related to our problem.

Dave Peterson wrote:

>   Also I fixed things so that the crashing CPU doesn't
>   call printk() until it has stopped all other CPUs
>   and then called bust_spinlocks().  This prevents
>   another CPU from grabbing a lock and crashing after
>   bust_spinlocks() has been called, causing the system
>   to hang inside printk().  The patch also modifies the
>   NMI handler so that when the watchdog bites, it
>   avoids calling printk() until other CPUs have been
>   shut down and bust_spinlocks() has been called.

I don't understand how netdump hangs inside printk().
Could anyone explain it in detail?
For example, which lock does "grabbing a lock" refer to?
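
The ordering issue in the quoted passage can be illustrated with a toy model. This is only a sketch of the reasoning, not kernel code: all names are hypothetical, and the single `lock_held` flag stands in for whatever lock printk() needs, so it does not identify which kernel lock is actually involved.

```c
#include <stdbool.h>

/* Hypothetical state: one other CPU and one lock that printk() needs. */
struct sim_state {
    bool other_cpu_running;
    bool lock_held;
};

/* Forcibly release the lock, as bust_spinlocks() is described as doing. */
static void bust_spinlocks_sim(struct sim_state *s) { s->lock_held = false; }

static void stop_other_cpu_sim(struct sim_state *s) { s->other_cpu_running = false; }

/* A CPU that is still running may take the lock and then die holding it. */
static void other_cpu_takes_lock_and_dies(struct sim_state *s)
{
    if (s->other_cpu_running) {
        s->lock_held = true;
        s->other_cpu_running = false;
    }
}

/* printk() can only make progress if nobody holds the lock. */
static bool printk_sim(const struct sim_state *s) { return !s->lock_held; }

/* Old order: locks are busted while the other CPU can still run. */
bool old_order_prints(struct sim_state s)
{
    bust_spinlocks_sim(&s);
    other_cpu_takes_lock_and_dies(&s);
    return printk_sim(&s);
}

/* Patched order: stop the other CPU first, then bust locks, then print. */
bool new_order_prints(struct sim_state s)
{
    stop_other_cpu_sim(&s);
    bust_spinlocks_sim(&s);
    other_cpu_takes_lock_and_dies(&s); /* no-op: the CPU is already stopped */
    return printk_sim(&s);
}
```

In the model, the old order leaves a window between busting the locks and printing in which a still-running CPU can take the lock again; stopping the other CPUs first closes that window.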

Comment 8 John Flanagan 2004-05-12 01:08:18 UTC
An errata has been issued which should help the problem described in this bug report. 
This report is therefore being closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, please follow the link below. You may reopen 
this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2004-188.html


Comment 12 Ernie Petrides 2004-09-20 06:59:21 UTC
An additional fix for this problem has just been committed to the
RHEL3 U4 patch pool this evening (in kernel version 2.4.21-20.8.EL).


Comment 13 John Flanagan 2004-12-20 20:54:51 UTC
An errata has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-550.html


