From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050719 Red Hat/1.0.6-1.4.1 Firefox/1.0.6

Description of problem:
This problem is being seen on my HP ia64 systems. When netdump saves a crashdump it appears that the full dump is not being saved. The file on the netdump server is still named vmcore-incomplete, and loading it into crash causes lots of errors (various things not found).

I have tried using ia64 and i386 netdump servers. I have only used ia64 as the netdump client. I have seen this with 2 different NIC drivers: tg3 and e1000. I am seeing a completely different problem with the e100 driver: with e100, the client hangs partway through the dump (I will file this as a separate bugzilla).

It appears we are getting a timeout. I am getting the following message in the syslog messages file on the netdump server:

netdump[25971]: Got too many timeouts waiting for memory page for client 10.12.11.32, ignoring it

This message is generated at the same moment when the client dumps stack trace info of all processes to the console and then reboots.

Version-Release number of selected component (if applicable):
kernel-2.6.9-20.EL
netdump-0.7.7-3
netdump-server-0.7.7-3

How reproducible:
Always

Steps to Reproduce:
1. set up netdump client (any) and server (ia64)
2. crash client with: echo c > /proc/sysrq-trigger
3. check resulting vmcore file on server

Actual Results:
The vmcore was still named vmcore-incomplete and was not useful for debugging. Also, netdump-server generates a "too many timeouts" message.

Expected Results:
Once the full dump has been transferred to the netdump server, vmcore-incomplete should be renamed to vmcore. The crash tool should read it without error.

Additional info:
Not sure what the difference is between the tg3/e1000 and the e100 hangs? Anyway, if the file remains as vmcore-incomplete, then there was a hang on the client side, and this would also be reported on the server side in /var/log/messages.
Your analysis, quoted below, is not *quite* right:

> netdump[25971]: Got too many timeouts waiting for memory page for client
> 10.12.11.32, ignoring it
> This message is generated at the same moment when the client dumps stack trace
> info of all processes to the console and then reboots.

The message is generated, and *then* the client does the stack trace. This is an indication that the client is no longer able to send packets, but is quite capable of receiving them. I'll take a look when next I have time. Thanks!
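
The two server-side symptoms discussed above (a dump file still named vmcore-incomplete, and the "too many timeouts" syslog line) can be checked with a short script. The following is only a rough sketch: the helper names `dump_is_complete` and `timed_out_clients` are hypothetical, the regular expression is modelled solely on the one message quoted in this report, and the dump directory and log text are supplied by the caller rather than assumed.

```python
import re
from pathlib import Path

# Pattern modelled on the single syslog line quoted in this report;
# other netdump-server versions may word the message differently.
TIMEOUT_RE = re.compile(
    r"netdump\[\d+\]: Got too many timeouts waiting for memory page "
    r"for client (\d+\.\d+\.\d+\.\d+)"
)


def dump_is_complete(dump_dir):
    """Return True if dump_dir holds 'vmcore' and no 'vmcore-incomplete'."""
    d = Path(dump_dir)
    return (d / "vmcore").is_file() and not (d / "vmcore-incomplete").exists()


def timed_out_clients(log_text):
    """Return the sorted, de-duplicated client IPs named in timeout messages."""
    return sorted(set(TIMEOUT_RE.findall(log_text)))
```

For example, passing the contents of the server's messages file to `timed_out_clients` would list the clients the server gave up on, and `dump_is_complete` could then be run against each client's dump directory.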
This issue is on Red Hat Engineering's list of planned work items for the upcoming Red Hat Enterprise Linux 4.4 release. Engineering resources have been assigned and barring unforeseen circumstances, Red Hat intends to include this item in the 4.4 release.
Created attachment 131274
patch to avoid deadlock on arp_reply mechanism

I'm not sure if this is the problem or not, but from tgraf's description of the arp entry aging out, it's at least a possibility. There was a recursive path in netpoll: when replying to an arp request from within the receive path, a deadlock can occur if spinlocks are shared between a driver's tx and rx paths. Please test this patch out as well and let me know if it solves the problem for you.
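
To illustrate the kind of recursion described in the comment above, here is a minimal user-space sketch. It is not the netpoll code and not the attached patch: `FakeDriver`, `reply_inline`, and `reply_deferred` are hypothetical names, a `threading.Lock` stands in for a non-recursive spinlock, and a non-blocking acquire is used so the sketch reports the would-be deadlock instead of hanging. Deferring the reply is shown only as one generic way to break such a recursion.

```python
import threading


class FakeDriver:
    """Hypothetical NIC driver whose tx and rx paths share one lock."""

    def __init__(self):
        self.lock = threading.Lock()  # non-recursive, like a spinlock
        self.sent = []

    def rx_poll(self, on_arp_request):
        # The receive path holds the shared lock while handling a frame.
        with self.lock:
            on_arp_request()

    def xmit(self, frame):
        # A real spinlock would spin forever if the lock were already held
        # by this same context; here we just report failure.
        if not self.lock.acquire(blocking=False):
            return False
        try:
            self.sent.append(frame)
            return True
        finally:
            self.lock.release()


def reply_inline(drv):
    """Send the ARP reply from inside the rx path (re-enters the driver)."""
    result = []
    drv.rx_poll(lambda: result.append(drv.xmit("arp-reply")))
    return result[0]


def reply_deferred(drv):
    """Queue the reply during rx and transmit it after the lock is dropped."""
    pending = []
    drv.rx_poll(lambda: pending.append("arp-reply"))
    return all(drv.xmit(frame) for frame in pending)
```

In this sketch `reply_inline` fails because `xmit` is called while `rx_poll` still holds the lock, which is the situation in which a real spinlock would deadlock; `reply_deferred` succeeds because the transmit happens after the receive path has released it.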
Appears this patch is already in the latest RHEL4 U4 kernels. I tested tg3 and it appears to be working fine now. Don't know if it was due to this patch or something else since I have not tested netdump in a while. I am unable to test on e1000 currently due to BZ 193688.
On closer examination it isn't clear if this patch is in RHEL4 or not. I see it in jbaron's nahant/u4/ directory as 0981.nhorman.netpoll-recursion-deadlock.patch but I don't seem to be able to find it in the srpm. Can you confirm/deny if this is in the latest kernel (i.e. 2.6.9-39.1.EL)?
It's slated to go into the RHEL4 U5 kernel, but it may well not have been checked in yet. 2.6.9-39.1.EL should not have this patch in place, AFAIK. The patch attached to bz 193688 is slated, IIRC, to go into U4, as it is a regression. You should be able to apply both patches for testing. If you want to test out the patch I have attached above, I have a test kernel for bz 194055 on my people page: http://people.redhat.com/nhorman That has it built in.
I built a new kernel with the patch for this BZ as well as the one for BZ 193688 and I am now able to netdump just fine. I did not try just the 193688 patch. I was going to use the one from your people page but I didn't see an ia64 kernel.
I wouldn't bother trying both. If you're having problems on more than just the e1000 card (which from your initial comments it seems as though you are), and the attached patch fixed it, then I'm certain that the recursive deadlock described in bz 194055 is your problem. I'm going to close this as a dup of that bug, and you can expect the fix to be released in U5.
*** This bug has been marked as a duplicate of 194055 ***