Description of problem: netdump isn't reliable The client is using the MAC address of the router, and the IP the netdump server. They both agree on the ssh keys involved... and netdump on the client can talk to the server. However, it never gives us a full dump. Attached is the log file that is created on the server side and the relevant syslog lines from the serverside, and a tcpdump.bin file of the UDP packets going back and forth. Version-Release number of selected component (if applicable): RHEL3 x86, newest kernel and netdump-server-0.6.11-3.i386 and netdump-0.6.11-3.i386
Created attachment 109945 [details] log from the /var/crash/IP-date directory
Created attachment 109946 [details] tcpdump from the server-side, taken during the "Got too many timeouts" messages
Created attachment 109947 [details] syslog from server side... notice all the different "timouts" messages
Right -- the handshake is not working in your configuration. Can you post a copy of your /etc/sysconfig file?
I will note that the log produced isn't worthless... it looks like a decent kernel OOPS. However, netdump is producing all that is promised... we are supposed to get more, right?
Sure.... here is the /etc/sysconfig/netdump file: # LOCALPORT=6666 DEV=eth0 NETDUMPADDR=64.102.29.132 # NETDUMPPORT=6666 NETDUMPMACADDR=00:00:0C:07:AC:01 # IDLETIMEOUT= # NETDUMPKEYEXCHANGE=none # SYSLOGADDR=64.102.29.131 # SYSLOGPORT=514 SYSLOGMACADDR=00:00:0C:07:AC:01
I will note that I found after I posted this that the client and server are on the SAME subnet... didn't realize that before. However, when changing the MAC as specified on the client to that of the netdump server, the behavior of everything... syslog, tcpdump, log file created on server... is the exact same.
Sorry, the NETDUMPADDR and SYSLOGADDR values above are in fact both 64.102.29.132
So, if you remove the NETDUMPMACADDR setting in the netdump config file, does netdump start working correctly?
Hmm... no, it doesn't change the behavior. We still get the same behavior: netdump[13741]: Got too many timeouts in handshaking, ignoring client 0x40661d25 netdump[13741]: Got too many timeouts waiting for SHOW_STATUS for client 0x40661d25, rebooting it Is specifying the NETDUMPMACADDR not really needed... even on machines that dumping to a non-local-subnet'ed server?
NETDUMPMACADDR would only be needed if the server is not on the local subnet. I'm presuming that after you removed the NETDUMPMACADDR that you did a "service netdump stop" followed by a "service netdump start", but I just want to make sure. Anyway, presuming that was done, Jeff, do you see any reason why a server on the same subnet could not communicate with the client on the initial handshake? (I added Jeff Moyer to the cc: list, because he is the "netdump/network guy", who handles this type of situation far better than I...)
Yes... I stopped and started the netdump client after making the changes, and before performing the last test.
So what is next here? Why is this failing on a local switched subnet?
Don't know. Jeff, do you have any ideas here?
The tcpdump output does not include packet data, which would be useful in this case. I'm guessing, based on the syslog output, that the client is continuously trying to handshake with the server, but never receiving the responses. You can verify this by telling me whether the client actually reboots (notice the server repeatedly trying to reboot it). I'm guessing it doesn't. Please include the exact kernel version used. "latest" is hardly useful when I have to look at a billion bug reports. Help a brother out. Also, what network card are you using on the dumping system?
The client never reboots... but then again I don't have the IDLETIMEOUT set either. The kernel version used is: 2.4.21-27.0.1.ELsmp on both the client and server. On the client we are using this card: eth0: Tigon3 [partno(NA) rev 1002 PHY(5703)] (PCIX:100MHz:64-bit) 10/100/1000BaseT Ethernet 00:0e:7f:20:66:e3 lspci show it as this: 02:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5703X Gigabit Ethernet (rev 02) Subsystem: Compaq Computer Corporation NC7781 Gigabit Server Adapter (PCI-X, 10,100,1000-T) Flags: bus master, 66Mhz, medium devsel, latency 64, IRQ 15 Memory at f7de0000 (64-bit, non-prefetchable) [size=64K] Expansion ROM at <unassigned> [disabled] [size=64K] Capabilities: [40] PCI-X non-bridge device. Capabilities: [48] Power Management version 2 Capabilities: [50] Vital Product Data Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable
Ok, I don't see any regressions in that bit of code. In the past, we have seen problems like this when the network is heavily loaded, especially with UDP and/or broadcast traffic. Can you test the netdump setup with a cross-over cable to a server? This would help narrow down the problem. Thanks.
Bingo... with a crossover cable it works perfectly and completely (first time I've ever seen it do so). This can't be "the solution" though. How can we make netdump work on a real network?
Created attachment 109994 [details] try to auto-negotiate 100mb fd so we are not inundated with packets Ok, here is a really disgusting patch that may just work. I think what is happening is as follows: When netdump starts, we disable interrupts and switch to polling mode for the network card. On faster networks, we may not be able to process packets as quickly as they come in. I don't believe this to be a limitation of the CPU speed, but rather a limitation in the way the code is structured. (this is just a theory). What this patch does is tries to re-negotiate the speed of the network card. It tries to set it at 100MB Full Duplex. As such, you need to MAKE SURE that the switch port that the netdump client is plugged into has auto negotiation enabled. Please give this a try.
Created attachment 109995 [details] same patch, but this one compiles Sorry, fired too soon. This patch should do the trick.
Can we get this in the form of a binary test kernel? Or can we just get a netdump.o to replace?
Sure. I've got a build running now. I'll post it in the morning.
Just to add this in if its helpful at all, I've seen cases where netdump will time out because someone pulled the cable btw. a dumping host and the swtich. The timeout was triggered because STP was enabled on the switch and the switch port when into blocking mode while the spanning tree re-negotiated. While in blocking mode, the frames to the dumping client were dropped. If the renegotiate patch is tested, you should probably disable stp on the port connected to the host (if its not done already). In fact, it may not hurt to disable STP on that port anyway, just to see if it clears up the problem.
We have portfast enabled on all switch ports, so spanning tree is not an issue.
Created attachment 110016 [details] Kernel built with proposed netdump patch Here is a debug kernel for you to test.
The new kernel doesn't even try to do the handshake... at least it doesn't say it does. I get the kernel OOPS in the "log" file, but no vmcore. The last thing on the console of the client is: CPU#0 is executing netdump. CPU#1 is frozen. P CISS Driver (v 2.4.52.RH2)
Yeah, that was a longshot. I'll see if I can reproduce the problem in-house.
Are you able too reproduce? This is critical for Linux in our environment.
I haven't had time to look into this further. I'll update this bug when I have any further status.
Here is a new datapoint. From the Cisco switch that the client and server are plugged into, we captured all traffic flowing between the two ports. The tarball I'll attach contains the log file, the vmcore-incomplete file (which sometimes we get, and sometimes we don't), and the switch sniffer file nam_rtp-lnx-16.enc (ethereal can decode).
Created attachment 110462 [details] 2nd netdump tarfile
Any thoughts as to why there are two of every frame in the trace? That seems rather strange to me.
Hmmm... it could be because it was a port-to-port trace... so if one side sent it, the other side would see it too... hence the duplication. In fact, this would give us a way to see if a packet got dropped when going through the switch, no?
Ok, that makes sense. I'll see if I can find a lost duplicate packet.
please note that I am in the same situation. For a while I was getting the "Got to many time outs" messages. Now on the server Im testing on all I see the handshake attempt. I tried a cross over cable but it did not work. Im not sure if I have the netdump conf file coded correctly. But I can ping on the cross over between the server and client. This all worked before our upgrade from RHEL 3 U 3 to U 4. Now its broken. Please verify my configuration,, I would like to see if the cross over cable fixs this. # # LOCALPORT=6666 DEV=eth1 ( on client) NETDUMPADDR=192.168.1.41 ( on server ) # NETDUMPPORT=6666 NETDUMPMACADDR=00:02:A5:48:39:91 ( on server ) # IDLETIMEOUT= # # If you want the console log (not crash dumps) sent via the # syslog service, set SYSLOGADDR to the IP of the syslog server. # The other two values normally remain unchanged. SYSLOGADDR=192.168.1.41 SYSLOGPORT=514 SYSLOGMACADDR=00:02:A5:48:39:91
We recieved a crash on one of our systems today but no netdump. With the frequency and instability of the crashes we recieve on redhat kernels it is critical to have this working properly. Please change the severity and priority to HIGH for this call ! Feb 10 08:50:00 maxdev4 Feb 10 08:50:03 maxdev4 CPU#0 is frozen. Feb 10 08:50:03 maxdev4 CPU#1 is frozen. Feb 10 08:50:03 maxdev4 CPU#2 is frozen. Feb 10 08:50:03 maxdev4 CPU#3 is executing netdump. Feb 10 08:50:03 maxdev4 CPU#4 is frozen. Feb 10 08:50:03 maxdev4 CPU#5 is frozen. Feb 10 08:50:03 maxdev4 CPU#6 is frozen. Feb 10 08:50:03 maxdev4 CPU#7 is frozen. Feb 10 08:50:03 maxdev4 < netdump activated - performing handshake with the client. > Feb 10 08:50:13 pacexp1 automount[4124]: attempting to mount entry /home/sanjit Feb 10 08:50:13 pacexp1 sshd(pam_unix)[29936]: session opened for user sanjit by (uid=1003) Feb 10 08:53:59 pacexp1 sshd(pam_unix)[30086]: session opened for user root by (uid=0) Feb 10 08:57:19 pacexp1 sshd(pam_unix)[29936]: session closed for user sanjit Feb 10 09:00:35 adminp2 dhcpd: DHCPREQUEST for 172.24.111.58 (172.24.111.251) from 00:0e:7f:ff:ea:56 via eth1 Feb 10 09:00:35 adminp2 dhcpd: DHCPACK on 172.24.111.58 to 00:0e:7f:ff:ea:56 via eth1 Feb 10 09:01:55 pacexp1 sshd(pam_unix)[30086]: session closed for user root Feb 10 09:03:02 pacexp1 automount[30312]: expired /home/sanjit Feb 10 09:07:41 adminp2 dhcpd: DHCPREQUEST for 172.24.111.50 (172.24.111.251) from 00:0b:cd:cc:ba:32 via eth1 Feb 10 09:07:41 adminp2 dhcpd: DHCPACK on 172.24.111.50 to 00:0b:cd:cc:ba:32 via eth1 Feb 10 09:11:35 msp1 sshd(pam_unix)[17753]: session opened for user tmiller by (uid=1058) Feb 10 09:12:21 adminp2 sshd(pam_unix)[15195]: session opened for user root by (uid=0) Feb 10 09:12:48 adminp2 sshd(pam_unix)[15195]: session closed for user root Feb 10 09:14:08 adminp2 sshd(pam_unix)[15263]: session opened for user netdump by (uid=34) Feb 10 09:14:08 adminp2 sshd(pam_unix)[15263]: session closed for user netdump Feb 10 09:14:09 maxdev4 [...network console startup...] Feb 10 09:14:21 adminp2 sshd(pam_unix)[15265]: session opened for user root by (uid=0) Feb 10 09:17:43 fhhp1 sshd(pam_unix)[24821]: session closed for user pcxoper Feb 10 09:18:02 msp3 sshd(pam_unix)[30278]: session opened for user tmiller by (uid=1058)
Those duplicates are probably an artifact of the spanned port on the catalyst used to tap the traffic for the sniffer.
Here is one more thing to try. Manually set the card to 100 Mbit on the dumping system. I believe the supported way of doing this is via ethtool. Then, cause a crashdump and let me know if it finishes. Thanks.
I have found the problem. The netdump-server is running on a machine with an eth card that has a few IP address, all but one of which are assigned via interface aliases. All these IP addresses are on the same logical subnet (vlan). The netdump clients all point to an IP that is on one of the IPs that is associated with and interface alias. I found through network sniffing on the server side that the netdump client sends its request to IP A, and while the server *does* receive the request and answer back, it answers from the main non-aliased IP B. Talking to A and hearing back from B apparently doesn't fly with the netdumping client. When I change the client to point at IP B... it works just fine. From my understanding of it, the netdump-server would in fact answer back to th e client from an aliased IP A if it was bound specifically to that IP. Is there a way to get netdump-server, to bind to a specific IP address? This is an option for most other network services... but I don't see any mention of that in the netdump-server man page or init script.
Wow, that's the problem? Good work! Consider this an RFE (request for enhancement). I'll work to get the configuration option you describe into the next release. Thanks!
Jeff, any idea if this problem would also apply to machines with bonded interfaces?
Netdump doesn't support the bonding driver. It is a separate issue.
I think the next step here is to change the netdump server to send batches of requests when it misses some pre-defined number of responses. If we flood the network with requests, we are more likely to get them through a saturated network.
I've started in on this, but to no avail. It's clear that the messages coming from the panicked system make it to the netdump-server. The responses from the server never get back to the panicked system, though. Whether this is due to the packets not reaching the server, or the netdump server not being able to process all incoming packets remains up in the air (though I suspect the latter). I'll plug in a hub, next, and get some tcpdump output. Note that I removed the udelay(100) in the netdump handshake polling code, to try to process more packets, and this did not help the situation.
Created attachment 113379 [details] Reset dev->quota to dev->weight in tg3_poll_controller Please try this patch. It fixes a bug whereby if the controller exhausts its current NAPI budget, it will never be reset. This results in the controller never delivering any more packets to the netdump subsystem.
Jeff, who are you waiting on feedback from? As far as I am concerned my problem has been %95 percent solved by finding that the netdump server can't be on an aliased network interface. Is someone else on this bug going to test your patch? Also on a side note, one of our problems was also that on HP servers their hpasm software + ASR hardware was causing problems. The software would check in with the hardware... when it stops doing so the ASR reboots the machine. This isn't a bad idea in general (if the software is locked up then perhaps you want the machine to reboot), but it doesn't mix well with netdump. Disabling the ASR timeout in bios does the trick... increasing the timeout value to the max of 30 minutes might also work for fast netdumps with small ammounts of memory.
Joshua, It appears that this one bugzilla is tracking multiple bugs. In your case, you say that you are 95% solved. What is the remaining 5%? I think the main point of confusion here is the Summary line of this bug. =) So, do me a favor and update the summary to more accurately describe your situation. I'll then clone the bug for the problem I described in comment #48. ginnie nuckles, I think the regression you are experiencing is that netdump will no longer work when the interface is not the first (i.e. eth0 works, everything else doesn't). Please take a look at bug #150374 and see if that describes your situation. If so, we'll move you onto the CC list there. Thanks for bearing with me, here.
The remaining 5% of the problem is that sometimes netdump just freezes up. Changing summary now so that you can duplicate this bug to address the client lockups
Created attachment 113503 [details] Patch to fix up all NAPI enabled netdump drivers. This patch is more comprehensive, fixing the bug in all of the netdump NAPI enabled drivers.
The above patch (id=113503) was submitted for internal review.
A fix for this problem has just been committed to the RHEL3 U6 patch pool this evening (in kernel version 2.4.21-32.2.EL).
Very nice! Thanks Red Hat!
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-663.html
Reading the man page to netdump-server, I see no indication that netdump-server can bind to a specific IP or interface with netdump-server-0.7.7-2, which came out in the RHEL3 U6 update.
Joshua, Sorry, the bug fixed here was not the one you were seeing. I got confused. So, I'm creating a new bug, and I'll add you to the CC list. So, I'm closing this bug again.
The new bug ID is 171405. Any interested parties should add themselves to the CC list for that bug.