Red Hat Bugzilla – Bug 145476
netdump client/server problems
Last modified: 2007-11-30 17:07:06 EST
Description of problem:
netdump isn't reliable
The client is using the MAC address of the router, and the IP of the
netdump server. They both agree on the ssh keys involved... and
netdump on the client can talk to the server. However, it never gives
us a full dump.
Attached is the log file that is created on the server side and the
relevant syslog lines from the serverside, and a tcpdump.bin file of
the UDP packets going back and forth.
Version-Release number of selected component (if applicable):
RHEL3 x86, newest kernel and
netdump-server-0.6.11-3.i386 and netdump-0.6.11-3.i386
Created attachment 109945 [details]
log from the /var/crash/IP-date directory
Created attachment 109946 [details]
tcpdump from the server-side, taken during the "Got too many timeouts" messages
Created attachment 109947 [details]
syslog from server side... notice all the different "timeouts" messages
Right -- the handshake is not working in your configuration.
Can you post a copy of your /etc/sysconfig/netdump file?
I will note that the log produced isn't worthless... it looks like a
decent kernel OOPS. However, netdump isn't producing all that is
promised... we are supposed to get more, right?
Sure.... here is the /etc/sysconfig/netdump file:
I will note that I found after I posted this that the client and
server are on the SAME subnet... didn't realize that before. However,
when changing the MAC as specified on the client to that of the
netdump server, the behavior of everything... syslog, tcpdump, log
file created on server... is the exact same.
Sorry, the NETDUMPADDR and SYSLOGADDR values above are in fact both
So, if you remove the NETDUMPMACADDR setting in the netdump config
file, does netdump start working correctly?
Hmm... no, it doesn't change the behavior. We still get the same
netdump: Got too many timeouts in handshaking, ignoring client
netdump: Got too many timeouts waiting for SHOW_STATUS for
client 0x40661d25, rebooting it
Is specifying NETDUMPMACADDR not really needed... even on machines
that are dumping to a server on a non-local subnet?
NETDUMPMACADDR would only be needed if the server is not on
the local subnet.
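As a rough illustration of that rule (this is not netdump code), the same-subnet test can be sketched in Python. The function name `needs_netdumpmacaddr`, the example addresses, and the /24 prefix length are assumptions made up for this sketch.

```python
import ipaddress


def needs_netdumpmacaddr(client_ip: str, server_ip: str, prefix_len: int) -> bool:
    """Return True when the server is outside the client's local subnet.

    Hypothetical helper: in that case the client cannot reach the server
    directly and would need the router's MAC address in NETDUMPMACADDR.
    """
    # Derive the client's subnet from its address and prefix length.
    local_net = ipaddress.ip_interface(f"{client_ip}/{prefix_len}").network
    # A server inside that subnet is reachable without a gateway MAC.
    return ipaddress.ip_address(server_ip) not in local_net


if __name__ == "__main__":
    # Same /24: no NETDUMPMACADDR needed.
    print(needs_netdumpmacaddr("192.168.1.10", "192.168.1.41", 24))
    # Different subnet: the router's MAC would be needed.
    print(needs_netdumpmacaddr("192.168.1.10", "10.0.0.5", 24))
```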
I'm presuming that after you removed the NETDUMPMACADDR that you did
a "service netdump stop" followed by a "service netdump start", but
I just want to make sure.
Anyway, presuming that was done, Jeff, do you see any reason why a
server on the same subnet could not communicate with the client on
the initial handshake? (I added Jeff Moyer to the cc: list, because
he is the "netdump/network guy", who handles this type of situation
far better than I...)
Yes... I stopped and started the netdump client after making the
changes, and before performing the last test.
So what is next here? Why is this failing on a local switched subnet?
Jeff, do you have any ideas here?
The tcpdump output does not include packet data, which would be useful in this
case. I'm guessing, based on the syslog output, that the client is continuously
trying to handshake with the server, but never receiving the responses. You can
verify this by telling me whether the client actually reboots (notice the server
repeatedly trying to reboot it). I'm guessing it doesn't.
Please include the exact kernel version used. "latest" is hardly useful when I
have to look at a billion bug reports. Help a brother out.
Also, what network card are you using on the dumping system?
The client never reboots... but then again I don't have the IDLETIMEOUT set
either. The kernel version used is: 2.4.21-27.0.1.ELsmp on both the client and
On the client we are using this card:
eth0: Tigon3 [partno(NA) rev 1002 PHY(5703)] (PCIX:100MHz:64-bit)
10/100/1000BaseT Ethernet 00:0e:7f:20:66:e3
lspci shows it as this:
02:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5703X Gigabit
Ethernet (rev 02)
Subsystem: Compaq Computer Corporation NC7781 Gigabit Server Adapter
Flags: bus master, 66Mhz, medium devsel, latency 64, IRQ 15
Memory at f7de0000 (64-bit, non-prefetchable) [size=64K]
Expansion ROM at <unassigned> [disabled] [size=64K]
Capabilities:  PCI-X non-bridge device.
Capabilities:  Power Management version 2
Capabilities:  Vital Product Data
Capabilities:  Message Signalled Interrupts: 64bit+ Queue=0/3 Enable
Ok, I don't see any regressions in that bit of code.
In the past, we have seen problems like this when the network is heavily loaded,
especially with UDP and/or broadcast traffic. Can you test the netdump setup
with a cross-over cable to a server? This would help narrow down the problem.
Bingo... with a crossover cable it works perfectly and completely (first time
I've ever seen it do so). This can't be "the solution" though. How can we make
netdump work on a real network?
Created attachment 109994 [details]
try to auto-negotiate 100mb fd so we are not inundated with packets
Ok, here is a really disgusting patch that may just work. I think what is
happening is as follows:
When netdump starts, we disable interrupts and switch to polling mode for the
network card. On faster networks, we may not be able to process packets as
quickly as they come in. I don't believe this to be a limitation of the CPU
speed, but rather a limitation in the way the code is structured. (this is
just a theory).
What this patch does is try to renegotiate the speed of the network card.
It tries to set it to 100 Mbit full duplex.
As such, you need to MAKE SURE that the switch port that the netdump client is
plugged into has auto negotiation enabled.
Please give this a try.
Created attachment 109995 [details]
same patch, but this one compiles
Sorry, fired too soon. This patch should do the trick.
Can we get this in the form of a binary test kernel? Or can we just
get a netdump.o to replace?
Sure. I've got a build running now. I'll post it in the morning.
Just to add this in case it's helpful: I've seen cases where netdump will
time out because someone pulled the cable between a dumping host and the switch.
The timeout was triggered because STP was enabled on the switch and the switch
port went into blocking mode while the spanning tree renegotiated. While in
blocking mode, the frames to the dumping client were dropped. If the
renegotiate patch is tested, you should probably disable STP on the port
connected to the host (if it's not done already). In fact, it may not hurt to
disable STP on that port anyway, just to see if it clears up the problem.
We have portfast enabled on all switch ports, so spanning tree is not an issue.
Created attachment 110016 [details]
Kernel built with proposed netdump patch
Here is a debug kernel for you to test.
The new kernel doesn't even try to do the handshake... at least it doesn't say
it does. I get the kernel OOPS in the "log" file, but no vmcore.
The last thing on the console of the client is:
CPU#0 is executing netdump.
CPU#1 is frozen.
P CISS Driver (v 2.4.52.RH2)
Yeah, that was a longshot. I'll see if I can reproduce the problem in-house.
Are you able to reproduce? This is critical for Linux in our
I haven't had time to look into this further. I'll update this bug
when I have any further status.
Here is a new datapoint. From the Cisco switch that the client and server are
plugged into, we captured all traffic flowing between the two ports.
The tarball I'll attach contains the log file, the vmcore-incomplete file (which
sometimes we get, and sometimes we don't), and the switch sniffer file
nam_rtp-lnx-16.enc (ethereal can decode).
Created attachment 110462 [details]
2nd netdump tarfile
Any thoughts as to why there are two of every frame in the trace? That seems
rather strange to me.
Hmmm... it could be because it was a port-to-port trace... so if one side sent
it, the other side would see it too... hence the duplication. In fact, this
would give us a way to see if a packet got dropped when going through the
Ok, that makes sense. I'll see if I can find a lost duplicate packet.
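A minimal sketch of that search, assuming the trace has already been decoded into a list of raw frame payloads (the function name and the input format are made up here, and this does not read the .enc file itself). In a port-to-port capture every frame that crossed the switch should show up twice, so a frame seen an odd number of times is a candidate for a drop. Byte-identical retransmissions would confuse this count, so treat the result as a hint only.

```python
from collections import Counter


def find_unpaired_frames(frames):
    """Return frames that appear an odd number of times in the capture.

    `frames` is a hypothetical list of bytes objects, one per captured
    frame, taken from both mirrored switch ports.
    """
    counts = Counter(frames)
    # A frame seen on only one of the two ports has an odd count.
    return [frame for frame, seen in counts.items() if seen % 2 == 1]


if __name__ == "__main__":
    capture = [b"request-1", b"request-1", b"reply-1"]
    print(find_unpaired_frames(capture))
```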
Please note that I am in the same situation. For a while I was
getting the "Got too many timeouts" messages. Now, on the server I'm
testing on, all I see is the handshake attempt. I tried a crossover
cable but it did not work. I'm not sure if I have the netdump conf
file coded correctly, but I can ping over the crossover cable between the
server and client. This all worked before our upgrade from RHEL 3 U3
to U4. Now it's broken.
Please verify my configuration. I would like to see if the crossover
cable fixes this.
DEV=eth1 ( on client)
NETDUMPADDR=192.168.1.41 ( on server )
NETDUMPMACADDR=00:02:A5:48:39:91 ( on server )
# If you want the console log (not crash dumps) sent via the
# syslog service, set SYSLOGADDR to the IP of the syslog server.
# The other two values normally remain unchanged.
We received a crash on one of our systems today but no netdump. With
the frequency and instability of the crashes we receive on Red Hat
kernels, it is critical to have this working properly. Please change
the severity and priority to HIGH for this call!
Feb 10 08:50:00 maxdev4
Feb 10 08:50:03 maxdev4 CPU#0 is frozen.
Feb 10 08:50:03 maxdev4 CPU#1 is frozen.
Feb 10 08:50:03 maxdev4 CPU#2 is frozen.
Feb 10 08:50:03 maxdev4 CPU#3 is executing netdump.
Feb 10 08:50:03 maxdev4 CPU#4 is frozen.
Feb 10 08:50:03 maxdev4 CPU#5 is frozen.
Feb 10 08:50:03 maxdev4 CPU#6 is frozen.
Feb 10 08:50:03 maxdev4 CPU#7 is frozen.
Feb 10 08:50:03 maxdev4 < netdump activated - performing handshake
with the client. >
Feb 10 08:50:13 pacexp1 automount: attempting to mount
Feb 10 08:50:13 pacexp1 sshd(pam_unix): session opened for
user sanjit by (uid=1003)
Feb 10 08:53:59 pacexp1 sshd(pam_unix): session opened for
user root by (uid=0)
Feb 10 08:57:19 pacexp1 sshd(pam_unix): session closed for
Feb 10 09:00:35 adminp2 dhcpd: DHCPREQUEST for 172.24.111.58
(172.24.111.251) from 00:0e:7f:ff:ea:56 via eth1
Feb 10 09:00:35 adminp2 dhcpd: DHCPACK on 172.24.111.58 to
00:0e:7f:ff:ea:56 via eth1
Feb 10 09:01:55 pacexp1 sshd(pam_unix): session closed for
Feb 10 09:03:02 pacexp1 automount: expired /home/sanjit
Feb 10 09:07:41 adminp2 dhcpd: DHCPREQUEST for 172.24.111.50
(172.24.111.251) from 00:0b:cd:cc:ba:32 via eth1
Feb 10 09:07:41 adminp2 dhcpd: DHCPACK on 172.24.111.50 to
00:0b:cd:cc:ba:32 via eth1
Feb 10 09:11:35 msp1 sshd(pam_unix): session opened for user
tmiller by (uid=1058)
Feb 10 09:12:21 adminp2 sshd(pam_unix): session opened for
user root by (uid=0)
Feb 10 09:12:48 adminp2 sshd(pam_unix): session closed for
Feb 10 09:14:08 adminp2 sshd(pam_unix): session opened for
user netdump by (uid=34)
Feb 10 09:14:08 adminp2 sshd(pam_unix): session closed for
Feb 10 09:14:09 maxdev4 [...network console startup...]
Feb 10 09:14:21 adminp2 sshd(pam_unix): session opened for
user root by (uid=0)
Feb 10 09:17:43 fhhp1 sshd(pam_unix): session closed for user
Feb 10 09:18:02 msp3 sshd(pam_unix): session opened for user
tmiller by (uid=1058)
Those duplicates are probably an artifact of the spanned port on the catalyst
used to tap the traffic for the sniffer.
Here is one more thing to try. Manually set the card to 100 Mbit on the dumping
system. I believe the supported way of doing this is via ethtool. Then, cause
a crashdump and let me know if it finishes.
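For reference, a sketch of what that ethtool invocation might look like, wrapped in Python so the arguments are explicit. `ethtool -s` with `speed`, `duplex` and `autoneg` is standard ethtool syntax, but the device name, the choice of `autoneg on` versus `off` (which has to match how the switch port is configured), and whether the tg3 driver on this kernel honors the request are all assumptions. The helper only builds the command and does not run it.

```python
def ethtool_set_speed_cmd(dev="eth0", speed_mbit=100, duplex="full", autoneg="on"):
    """Build (but do not run) an ethtool command that changes link settings.

    Hypothetical helper for illustration; defaults are guesses for the
    dumping system described in this bug.
    """
    if duplex not in ("half", "full"):
        raise ValueError("duplex must be 'half' or 'full'")
    if autoneg not in ("on", "off"):
        raise ValueError("autoneg must be 'on' or 'off'")
    return ["ethtool", "-s", dev,
            "speed", str(speed_mbit),
            "duplex", duplex,
            "autoneg", autoneg]


if __name__ == "__main__":
    # Print the command line so it can be reviewed before running it as root.
    print(" ".join(ethtool_set_speed_cmd()))
```

The resulting list can be passed to `subprocess.run(..., check=True)` as root on the dumping system before triggering the crash.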
I have found the problem. The netdump-server is running on a machine with an
Ethernet card that has a few IP addresses, all but one of which are assigned via
interface aliases. All these IP addresses are on the same logical subnet
(VLAN). The netdump clients all point to an IP that is
associated with an interface alias. I found through network sniffing on the
server side that the netdump client sends its request to IP A, and while the
server *does* receive the request and answer back, it answers from the main
non-aliased IP B.
Talking to A and hearing back from B apparently doesn't fly with the netdumping
client. When I change the client to point at IP B... it works just fine.
From my understanding of it, the netdump-server would in fact answer back to the
client from an aliased IP A if it was bound specifically to that IP. Is there
a way to get netdump-server to bind to a specific IP address? This is an
option for most other network services... but I don't see any mention of that in
the netdump-server man page or init script.
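To show the general idea (this is not how netdump-server is written, and it has no such option as far as this report can tell), here is a small Python sketch of a UDP server bound to one specific address. A socket bound to a single IP sends its replies from that IP, so the client sees the answer coming from the address it originally contacted. The loopback address and the function names are placeholders for this sketch.

```python
import socket


def open_bound_udp_socket(bind_ip: str, port: int = 0) -> socket.socket:
    """Open a UDP socket bound to one specific local address.

    Port 0 lets the OS pick a free port; a real service would use its
    own well-known port instead.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_ip, port))
    return sock


def reply_source_matches(bind_ip: str = "127.0.0.1") -> bool:
    """Send one request and check that the reply comes from bind_ip."""
    server = open_bound_udp_socket(bind_ip)
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.settimeout(2.0)
    client.settimeout(2.0)
    try:
        server_addr = server.getsockname()
        client.sendto(b"request", server_addr)
        _, client_addr = server.recvfrom(1024)
        # The reply leaves from the address the server socket is bound to.
        server.sendto(b"reply", client_addr)
        _, reply_addr = client.recvfrom(1024)
        # This is the check the netdumping client effectively fails when
        # it talks to IP A and hears back from IP B.
        return reply_addr == server_addr
    finally:
        server.close()
        client.close()


if __name__ == "__main__":
    print(reply_source_matches())
```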
Wow, that's the problem? Good work! Consider this an RFE (request for
enhancement). I'll work to get the configuration option you describe into the
Jeff, any idea if this problem would also apply to machines with bonded
Netdump doesn't support the bonding driver. It is a separate issue.
I think the next step here is to change the netdump server to send batches of
requests when it misses some pre-defined number of responses. If we flood the
network with requests, we are more likely to get them through a saturated network.
I've started in on this, but to no avail. It's clear that the messages coming
from the panicked system make it to the netdump-server. The responses from the
server never get back to the panicked system, though. Whether this is due to
the packets not reaching the server, or the netdump server not being able to
process all incoming packets remains up in the air (though I suspect the
latter). I'll plug in a hub, next, and get some tcpdump output.
Note that I removed the udelay(100) in the netdump handshake polling code, to
try to process more packets, and this did not help the situation.
Created attachment 113379 [details]
Reset dev->quota to dev->weight in tg3_poll_controller
Please try this patch. It fixes a bug whereby if the controller exhausts its
current NAPI budget, it will never be reset. This results in the controller
never delivering any more packets to the netdump subsystem.
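The effect can be pictured with a toy model in Python. This is a simplified illustration of the quota/weight bookkeeping only, not the tg3 driver or kernel code; the class, the function names and the numbers are invented for the sketch.

```python
class ToyNapiDevice:
    """Toy stand-in for a NAPI device with a per-device packet quota."""

    def __init__(self, weight: int):
        self.weight = weight      # packets allowed per refill
        self.quota = weight       # packets still allowed right now
        self.rx_ring = []         # packets waiting in the receive ring

    def poll(self):
        """Deliver at most `quota` packets and charge them to the quota."""
        n = min(self.quota, len(self.rx_ring))
        delivered = self.rx_ring[:n]
        del self.rx_ring[:n]
        self.quota -= n
        return delivered


def poll_controller(dev: ToyNapiDevice, reset_quota: bool):
    """Poll the device as netdump would, optionally refilling the quota first."""
    if reset_quota:
        dev.quota = dev.weight    # the idea behind the proposed fix
    return dev.poll()


def simulate(reset_quota: bool, rounds: int = 5, per_round: int = 40, weight: int = 64) -> int:
    """Count how many packets reach the consumer over several poll rounds."""
    dev = ToyNapiDevice(weight)
    total = 0
    for _ in range(rounds):
        dev.rx_ring.extend(range(per_round))   # new packets arrive
        total += len(poll_controller(dev, reset_quota))
    return total


if __name__ == "__main__":
    print("without reset:", simulate(False))   # stalls once the quota hits 0
    print("with reset:   ", simulate(True))
```

In the run without the reset, delivery stops for good once the initial quota is used up, which matches the symptom described above.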
Jeff, who are you waiting on feedback from? As far as I am concerned my problem
has been 95 percent solved by finding that the netdump server can't be on an
aliased network interface. Is someone else on this bug going to test your patch?
Also, on a side note, one of our problems was that on HP servers the hpasm
software + ASR hardware was causing problems. The software checks in with
the hardware... when it stops doing so, the ASR reboots the machine. This isn't
a bad idea in general (if the software is locked up then perhaps you want the
machine to reboot), but it doesn't mix well with netdump. Disabling the ASR
timeout in the BIOS does the trick... increasing the timeout value to the max of 30
minutes might also work for fast netdumps with small amounts of memory.
It appears that this one bugzilla is tracking multiple bugs. In your case, you
say that you are 95% solved. What is the remaining 5%? I think the main point
of confusion here is the Summary line of this bug. =)
So, do me a favor and update the summary to more accurately describe your
situation. I'll then clone the bug for the problem I described in comment #48.
I think the regression you are experiencing is that netdump will no longer work
when the interface is not the first (i.e. eth0 works, everything else doesn't).
Please take a look at bug #150374 and see if that describes your situation. If
so, we'll move you onto the CC list there.
Thanks for bearing with me, here.
The remaining 5% of the problem is that sometimes netdump just freezes up.
Changing summary now so that you can duplicate this bug to address the client
Created attachment 113503 [details]
Patch to fix up all NAPI enabled netdump drivers.
This patch is more comprehensive, fixing the bug in all of the netdump NAPI
The above patch (id=113503) was submitted for internal review.
A fix for this problem has just been committed to the RHEL3 U6
patch pool this evening (in kernel version 2.4.21-32.2.EL).
Very nice! Thanks Red Hat!
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.
Reading the man page to netdump-server, I see no indication that netdump-server
can bind to a specific IP or interface with netdump-server-0.7.7-2, which came
out in the RHEL3 U6 update.
Sorry, the bug fixed here was not the one you were seeing. I got confused. So,
I'm creating a new bug, and I'll add you to the CC list. So, I'm closing this
The new bug ID is 171405. Any interested parties should add themselves to the
CC list for that bug.