From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225

Description of problem:
We have been having problems with the synchronization of the processors as machines crash. One of our kernel developers spent considerable time analyzing why this was happening and came up with the following patch. It goes a long way toward fixing those problems. We are fairly certain that it solves all the problems we have observed. This is not to say that it solves every problem in synchronizing processors as a machine crashes, but it solves all the problems we have seen thus far.

The three problems we have seen are:
1) One CPU just so happens to be in an interrupt handler with interrupts disabled at the time of the crash.
2) The NMI watchdog interrupts the crash process.
3) We have seen a couple problems where we hung in printk and Dave believes that this was due to the panic code hitting tiny races in the die/panic kernel path. Since we closed the size of the race in die we hadn't seen these.

We have had a variation of this patch running in production for quite a while and it has been reasonably well tested. I will open a couple of bugzilla entries to justify these fixes. You can forward comments regarding this patch through me if you would like, or you can converse with Dave Peterson <dsp> directly. Dave is currently working on porting netdump to ia64.

From: Dave Peterson <dsp>
To: Ben Woodard <bwoodard>
Cc: Jim Garlick <garlick>
Subject: patch with netdump fixes
Date: Thu, 08 Jan 2004 21:02:50 -0800

Ben,

Attached is a patch for the rh-2_4_21-7EL kernel that fixes some netdump-related problems. This is a substantial rewrite of my previous netdump fixes that will hopefully be a bit simpler, more robust, and easier to port as netdump gets ported to platforms other than x86. It should also improve the reliability of the shutdown code path even if netdump is not enabled.

Here is a summary of my changes:

- The current version of netdump is vulnerable to the following failure: CPU A crashes and calls smp_call_function() to shut down the other CPUs. However, CPU B is hung somewhere, spinning with interrupts disabled. Therefore, B never responds to the cross-interrupt and A hangs inside smp_call_function(). The patch fixes this problem by implementing the following behavior: CPU A attempts to zap the other CPUs in order to get them to shut down. If they do not all respond within a certain timeout, CPU A zaps the unresponsive CPUs with a nonmaskable interrupt. If any of the remaining CPUs fail to respond to the NMI within a certain timeout, CPU A gives up and continues executing.

- The current version of netdump does not shut down the NMI watchdog before starting to dump. The watchdog can therefore interfere with netdump while it is sending the dump. To fix this, the patch adds some code that stops the watchdog individually on each CPU.

- I have modified the shutdown code so that when a panic or BUG() occurs, the crashing CPU immediately stops other CPUs before doing anything else. This improves the reliability of the shutdown code path because fewer things can go wrong once other CPUs have been shut down. Stopping other CPUs as early as possible eliminates SMP issues from the rest of the code, allowing for greater simplicity. Also I fixed things so that the crashing CPU doesn't call printk() until it has stopped all other CPUs and then called bust_spinlocks(). This prevents another CPU from grabbing a lock and crashing after bust_spinlocks() has been called, causing the system to hang inside printk(). The patch also modifies the NMI handler so that when the watchdog bites, it avoids calling printk() until other CPUs have been shut down and bust_spinlocks() has been called.

Please take a look at the patch, and send it to Red Hat if you think they may find it useful.

Thanks,
Dave

Version-Release number of selected component (if applicable):
kernel-2.4.21-7EL

How reproducible:
Sometimes

Steps to Reproduce:
These are race conditions that are sometimes hard to hit.
1. Die while another CPU is in an ISR with interrupts disabled.
2. Take so long to do a netdump that the watchdog NMI triggers.
3. Hit the very-hard-to-hit race in die() with printk.

Actual Results: hangs

Expected Results: netdumps

Additional info:
Created attachment 96907 [details] patch to fix these problems.
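For anyone reading along without the attachment open, the two-stage shutdown described in the patch summary (ordinary cross-CPU interrupt first, then an NMI for whatever did not respond, then give up) might be modeled roughly as below. This is only a userspace sketch and not code from the attached patch: the names `cpu_behavior`, `shutdown_result`, `cpu_acks` and `stop_other_cpus` are hypothetical, and the timeouts are assumed to have already expired for any CPU that does not acknowledge.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CPUS 8

/* Hypothetical model of how a simulated CPU reacts to being zapped. */
enum cpu_behavior {
    CPU_RESPONSIVE,    /* stops on an ordinary cross-CPU interrupt (IPI) */
    CPU_IRQS_DISABLED, /* spinning with interrupts off: only an NMI reaches it */
    CPU_WEDGED         /* does not respond even to an NMI */
};

/* Tally of what happened to the other CPUs. */
struct shutdown_result {
    int stopped_by_ipi;
    int stopped_by_nmi;
    int unresponsive;
};

/* True if a CPU with this behavior acknowledges the given kind of zap. */
static bool cpu_acks(enum cpu_behavior b, bool nmi)
{
    if (b == CPU_RESPONSIVE)
        return true;
    if (b == CPU_IRQS_DISABLED)
        return nmi;
    return false;
}

/*
 * Model of the crashing CPU (index `self`) stopping everyone else.
 * A CPU that does not acknowledge is treated as having timed out.
 */
static struct shutdown_result stop_other_cpus(const enum cpu_behavior *cpus,
                                              size_t ncpus, size_t self)
{
    bool stopped[MAX_CPUS] = { false };
    struct shutdown_result r = { 0, 0, 0 };
    size_t i;

    assert(ncpus <= MAX_CPUS && self < ncpus);

    /* Stage 1: ordinary cross-CPU interrupt to every other CPU. */
    for (i = 0; i < ncpus; i++) {
        if (i != self && cpu_acks(cpus[i], false)) {
            stopped[i] = true;
            r.stopped_by_ipi++;
        }
    }

    /* Stage 2: NMI only to the CPUs that did not respond in stage 1. */
    for (i = 0; i < ncpus; i++) {
        if (i != self && !stopped[i] && cpu_acks(cpus[i], true)) {
            stopped[i] = true;
            r.stopped_by_nmi++;
        }
    }

    /* Stage 3: give up on the rest and let the caller continue. */
    for (i = 0; i < ncpus; i++) {
        if (i != self && !stopped[i])
            r.unresponsive++;
    }

    return r;
}
```

The point of the model is that `stop_other_cpus()` always returns, whatever state the other CPUs are in, which is the property the original smp_call_function() path lacks when one CPU is spinning with interrupts disabled.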
Created attachment 98162 [details] Change smp_call_function's wait arg to allow -1, to not wait for IPIs to be received.

This patch was accepted into the taroon U2 pool. It changes the netdump code so that it does not wait for IPIs to be received by the other CPUs on an SMP system.
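To make the effect of the new wait value concrete, here is a small userspace model of what the caller blocks on for each value of the wait argument. It is a sketch under assumptions, not the patch itself: `caller_would_hang` is a hypothetical helper, and it assumes that the unpatched code spins until every target CPU has started the function and, when wait is 1, also until every target has finished it.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model: would the calling CPU spin forever?
 *   wait == -1 : do not wait for the IPI to be received at all
 *   wait ==  0 : wait until every target has started the function
 *   wait ==  1 : additionally wait until every target has finished it
 * `started` and `finished` are the number of target CPUs that will ever
 * reach each state; a CPU stuck with interrupts disabled never starts.
 */
static bool caller_would_hang(int wait, int targets, int started, int finished)
{
    assert(started <= targets && finished <= started);

    if (wait == -1)
        return false;           /* send the IPI and carry on */
    if (started < targets)
        return true;            /* some CPU never receives the IPI */
    if (wait == 1 && finished < targets)
        return true;            /* some CPU never completes the function */
    return false;
}
```

With three target CPUs of which one never takes the interrupt, `caller_would_hang(0, 3, 2, 2)` is true while `caller_would_hang(-1, 3, 2, 2)` is false, which is the case the netdump path cares about.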
Jeff's fix was committed to U2 in kernel version 2.4.21-9.14.EL.
Our customer has encountered problems that have the same cause as Bug#113341. We also tried "RHEL3 Update2 Public Beta (kernel-2.4.21-12.EL)", but the same problems occurred. The problem of Bug#113341 is not solved by Update2 either. We ask for an immediate fix.

Description of problem:
Netdump does not send the dump to the server. Usually the dump is sent to the netdump server successfully, but sometimes it fails. When it fails, the following also occurs.

1) The following directory and file are created on the netdump server:
   /var/crash/192.168.xxx.xxx-2004-mm-dd-hh:mm
   /var/crash/192.168.xxx.xxx-2004-mm-dd-hh:mm/log
   But the following are NOT created:
   /var/crash/192.168.xxx.xxx-2004-mm-dd-hh:mm/vmdump
   /var/crash/192.168.xxx.xxx-2004-mm-dd-hh:mm/vmdump-incomplete
2) Garbled characters are displayed on the client console.
3) The client machine does not reboot automatically.

Our customer says that netdump cannot be relied upon, and is very angry.

Our test and consideration:
According to our tests, netdump does not seem to function normally on a client with many CPUs. We also tried "RHEL3 Update2 Public Beta (kernel-2.4.21-12.EL)", but the same problems occurred. Our investigation concluded that the following problem from Bug#113341 is the cause:

>3) We have seen a couple problems where we hung in printk and Dave
>believes that this was due to the panic code hitting tiny races in the
>die/panic kernel path. Since we closed the size of the race in die we
>hadn't seen these.

We tried "RHEL3 Update1 (kernel-2.4.21-9.EL) + Dave's patch", and the problem did not occur.
https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=96907&action=view

Our request:
Why wasn't Dave's patch adopted? The problem of Bug#113341 is not solved by Update2 either. We ask for an immediate fix.

Version-Release number:
kernel-2.4.21-9.EL
kernel-2.4.21-12.EL
netdump-0.6.11-3

How reproducible:
Sometimes

Steps to Reproduce:
1. Boot.
2. Wait 2-3 minutes.
3. Press "Alt + SysRq + c".

Expected results:
Successful netdump in every case.
1) The following directory and files are created on the netdump server:
   /var/crash/192.168.xxx.xxx-2004-mm-dd-hh:mm
   /var/crash/192.168.xxx.xxx-2004-mm-dd-hh:mm/log
   /var/crash/192.168.xxx.xxx-2004-mm-dd-hh:mm/vmdump
2) The client machine reboots automatically.

Additional info:
Severity: 1 (We ask for an immediate fix.)
To: Ben Woodard
To: Dave Peterson
To: Someone who knows about this bug

I have a problem where netdump hangs with RHEL3-U2-Beta. I found that the problem went away when I removed the printk() call in freeze_cpu(). I also tried Dave Peterson's patch, and it resolved our problem as well. I think the following point is related to our problem.

Dave Peterson wrote:
> Also I fixed things so that the crashing CPU doesn't
> call printk() until it has stopped all other CPUs
> and then called bust_spinlocks(). This prevents
> another CPU from grabbing a lock and crashing after
> bust_spinlocks() has been called, causing the system
> to hang inside printk(). The patch also modifies the
> NMI handler so that when the watchdog bites, it
> avoids calling printk() until other CPUs have been
> shut down and bust_spinlocks() has been called.

I don't understand how netdump hangs inside printk(). Could anyone explain it in detail? For example, which lock does "grabbing a lock" refer to?
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2004-188.html
An additional fix for this problem has just been committed to the RHEL3 U4 patch pool this evening (in kernel version 2.4.21-20.8.EL).
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2004-550.html