Bug 168733 - netdump does not dump complete contents
Summary: netdump does not dump complete contents
Status: CLOSED DUPLICATE of bug 194055
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: netdump
Version: 4.0
Hardware: ia64
OS: Linux
Target Milestone: ---
Assignee: Neil Horman
QA Contact:
Depends On:
Blocks: 176344
Reported: 2005-09-19 21:20 UTC by Doug Chapman
Modified: 2007-11-30 22:07 UTC (History)

Clone Of:
Last Closed: 2006-06-21 19:17:49 UTC

Attachments
patch to avoid deadlock on arp_reply mechanism (2.29 KB, patch)
2006-06-21 13:28 UTC, Neil Horman

Description Doug Chapman 2005-09-19 21:20:56 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050719 Red Hat/1.0.6-1.4.1 Firefox/1.0.6

Description of problem:
This problem is being seen on my HP ia64 systems.  When netdump saves a crashdump it appears that the full dump is not being saved.  The file on the netdump server is still named vmcore-incomplete and loading it into crash causes lots of errors (various things not found).

I have tried using ia64 and i386 netdump servers.  I have only used ia64 as the netdump client.

I have seen this with 2 different NIC drivers: tg3 and e1000.  I am seeing a completely different problem with the e100 driver.  With e100 the dump hangs the client partway through the dump (will file this as a separate bugzilla).

It appears we are getting a timeout.  I am getting the following messages in my syslog messages file on the netdump server:

netdump[25971]: Got too many timeouts waiting for memory page for client, ignoring it

This message is generated at the same moment when the client dumps stack trace info of all processes to the console and then reboots.
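
To find this message in a server log, a small filter along the following lines may help. It is only a sketch: the pattern is taken from the single log line quoted above, other netdump-server versions may word the message differently, and the function name find_timeout_pids is made up for illustration.

```python
import re

# Pattern based on the one log line quoted in this report (assumption:
# the wording is stable across netdump-server versions).
TIMEOUT_RE = re.compile(
    r"netdump\[(?P<pid>\d+)\]: Got too many timeouts waiting for memory page"
)


def find_timeout_pids(lines):
    """Return the netdump-server PIDs that logged the timeout message."""
    pids = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            pids.append(int(match.group("pid")))
    return pids


# Sample input: the first line is the message from this report, the
# second is an unrelated, invented syslog line.
sample = [
    "netdump[25971]: Got too many timeouts waiting for memory page for client, ignoring it",
    "sshd[1234]: Accepted publickey for root",
]
print(find_timeout_pids(sample))
```

In practice the lines would come from the server's syslog messages file rather than from an in-memory list.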

Version-Release number of selected component (if applicable):
kernel-2.6.9-20.EL netdump-0.7.7-3 netdump-server-0.7.7-3

How reproducible:

Steps to Reproduce:
1. set up netdump client (ia64) and server (any)
2. crash client with: echo c > /proc/sysrq-trigger
3. check resulting vmcore file on server

Actual Results:  The vmcore was still named vmcore-incomplete and was not useful for debugging.  Also, netdump-server generates a "too many timeouts" message.

Expected Results:  Once the full dump has been transferred to the netdump server, vmcore-incomplete should be renamed to vmcore.  The crash tool should then read it without error.
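
Step 3 can be scripted roughly as follows. This is a sketch under assumptions: it presumes the server keeps one subdirectory per dump under a common crash directory, and the function name classify_dumps and the directory names in the demo are hypothetical. Only the file names vmcore and vmcore-incomplete come from this report.

```python
import os
import tempfile


def classify_dumps(crash_root):
    """Map each dump directory name to 'complete', 'incomplete' or 'missing'."""
    result = {}
    for name in sorted(os.listdir(crash_root)):
        path = os.path.join(crash_root, name)
        if not os.path.isdir(path):
            continue
        if os.path.exists(os.path.join(path, "vmcore")):
            result[name] = "complete"
        elif os.path.exists(os.path.join(path, "vmcore-incomplete")):
            # The server never renamed the file, so the transfer did not finish.
            result[name] = "incomplete"
        else:
            result[name] = "missing"
    return result


# Demo on a throwaway directory tree (directory names are invented).
with tempfile.TemporaryDirectory() as root:
    for dump_dir, core_name in [("dump-a", "vmcore"), ("dump-b", "vmcore-incomplete")]:
        os.mkdir(os.path.join(root, dump_dir))
        open(os.path.join(root, dump_dir, core_name), "w").close()
    print(classify_dumps(root))
```

A dump reported as "incomplete" corresponds to the Actual Results described above.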

Additional info:

Comment 1 Dave Anderson 2005-09-20 12:43:53 UTC
Not sure what the difference is between the tg3/e1000 and the e100 hangs?
Anyway, if the file remains as vmcore-incomplete, then there was a hang
on the client side, and this would also be reported on the server side in

Comment 2 Jeff Moyer 2005-09-20 14:34:52 UTC
Your analysis, quoted below, is not *quite* right:

>  netdump[25971]: Got too many timeouts waiting for memory page for client,
>  ignoring it
>  This message is generated at the same moment when the client dumps stack trace
>  info of all processes to the console and then reboots.

The message is generated, and *then* the client does the stack trace.  This is
an indication that the client is no longer able to send packets, but is quite
capable of receiving them.

I'll take a look when next I have time.  Thanks!

Comment 7 Bob Johnson 2006-04-11 16:03:49 UTC
This issue is on Red Hat Engineering's list of planned work items 
for the upcoming Red Hat Enterprise Linux 4.4 release.  Engineering 
resources have been assigned and barring unforeseen circumstances, Red 
Hat intends to include this item in the 4.4 release.

Comment 10 Neil Horman 2006-06-21 13:28:37 UTC
Created attachment 131274 [details]
patch to avoid deadlock on arp_reply mechanism

I'm not sure if this is the problem or not, but from tgraf's description of the
arp entry aging out, it's at least a possibility.  There was a recursive path in
netpoll in which replying to an arp request could deadlock if spinlocks
are shared between a driver's tx and rx paths.  Please test this patch out as
well and let me know if it solves the problem for you.
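
The shape of that recursion can be illustrated with a toy userspace model. This is not the netpoll code and not the attached patch: the class ToyDriver and its methods are invented for illustration, and a non-blocking acquire stands in for the spinlock so that the model reports the problem instead of hanging.

```python
import threading


class ToyDriver:
    """Toy model of a NIC driver whose tx and rx paths share one lock."""

    def __init__(self):
        self.lock = threading.Lock()  # non-reentrant, like a spinlock
        self.events = []

    def poll_rx(self, packet):
        # The rx path holds the shared lock while it handles the packet.
        with self.lock:
            self.events.append("rx " + packet)
            if packet == "arp-request":
                # Replying from inside the rx path re-enters the tx path.
                self.xmit("arp-reply")

    def xmit(self, packet):
        # A real spinlock would spin forever here; the model uses a
        # non-blocking acquire so it can record the failure instead.
        if not self.lock.acquire(blocking=False):
            self.events.append("DEADLOCK sending " + packet)
            return False
        try:
            self.events.append("tx " + packet)
            return True
        finally:
            self.lock.release()


driver = ToyDriver()
driver.poll_rx("arp-request")
print(driver.events)
```

In this model, a transmit issued outside the rx path succeeds, while the reply issued from within it cannot take the lock. That matches the symptom discussed earlier in this bug: a client that can still receive but no longer send.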

Comment 11 Doug Chapman 2006-06-21 16:22:18 UTC
Appears this patch is already in the latest RHEL4 U4 kernels.  I tested tg3 and
it appears to be working fine now.  Don't know if it was due to this patch or
something else since I have not tested netdump in a while.

I am unable to test on e1000 currently due to BZ 193688.

Comment 12 Doug Chapman 2006-06-21 16:26:33 UTC
On closer examination it isn't clear if this patch is in RHEL4 or not.  I see it
in jbaron's nahant/u4/ directory as
0981.nhorman.netpoll-recursion-deadlock.patch but I don't seem to be able to
find it in the srpm.

Can you confirm/deny if this is in the latest kernel (i.e. 2.6.9-39.1.EL)?

Comment 13 Neil Horman 2006-06-21 16:50:04 UTC
It's slated to go into the RHEL4 U5 kernel, but it may well have not yet been
checked in.  2.6.9-39.1.EL should not have this patch in place, AFAIK.

The patch attached to bz 193688 is slated, IIRC, to go into U4, as it is a
regression. You should be able to apply both patches for testing.  If you want
to test out the patch I have attached above, I have a test kernel for bz 194055
on my people page:
That has it built in.

Comment 14 Doug Chapman 2006-06-21 18:13:29 UTC
I built a new kernel with the patch for this BZ as well as the one for BZ 193688
and I am now able to netdump just fine.  I did not try just the 193688 patch.  I
was going to use the one from your people page but I didn't see an ia64 kernel.

Comment 15 Neil Horman 2006-06-21 19:17:33 UTC
I wouldn't bother trying both.  If you're having problems on more than just the
e1000 card (which from your initial comments it seems as though you are), and the
attached patch fixed it, then I'm certain that the recursive deadlock described
in bz 194055 is your problem.  I'm going to close this as a dup of that bug, and
you can expect the fix to be released in U5.

Comment 16 Neil Horman 2006-06-21 19:17:49 UTC

*** This bug has been marked as a duplicate of 194055 ***
