Description of problem:

In addition to the netdump server's IP address, the netdump client also needs the MAC address of the next hop on the local subnet. If the server is on the same subnet, this is the same as the MAC address of the server. However, if this is not the case, the next hop MAC address is the MAC address of the gateway.

The /etc/init.d/netconsole script attempts to be clever when processing the "start" command and uses arping/arp to try to automatically configure the needed MAC address. Unfortunately, this doesn't work in the case where a router separates client and server. I suppose this is all a result of netdump/netconsole avoiding use of the regular network stack and constructing UDP packets from scratch. One solution would be to copy the information from the regular stack sometime before it starts to process a crash.

The simplest workaround is to specify a MAC address in the "/etc/sysconfig/netdump" file, but this is also broken in this case! The fix for this second problem is a trivial edit to /etc/init.d/netdump:

    ***************
    *** 107,112 ****
    --- 107,113 ----
      {
      # netdump/netconsole server
      NETDUMPOPTS=
    + IPADDR=$NETDUMPADDR
      eval $(print_address_info $NETDUMPADDR)
      [ "$HOSTNAME" = "?" -a -z "$MAC" ] && \
      echo "$prog: can't resolve $NETDUMPADDR MAC address" 1>&2 && usage

This preserves NETDUMPADDR in the case where arping/arp fail and makes the problem solvable, by manually supplying the proper MAC address in "/etc/sysconfig/netdump", for NETDUMPMACADDR.

I have a more elaborate fix that automatically works out the MAC address and does not require this to be configured. However, my shell script hacking proficiency is down near zero these days, so it could do with a review/rework by someone who does this more often than I do. I will attach this...

Version-Release number of selected component (if applicable):
netdump-0.6.8-2

How reproducible:
Every time.

Steps to Reproduce:
1. Set up the netdump server and client on two distinct subnets, with layer 3 routing required for connectivity
2. Configure netdump client
3. Attempt to start netdump client

Actual results:
service netdump start fails

Expected results:
service netdump start succeeds

Additional info:
Created attachment 91655 [details] /etc/init.d/netdump modifications
Created attachment 102386 [details]
netdump-mac-subnet.patch

Unified diff version of the above patch, against the RHEL3 netdump.
Jeff, can you take a look at this?
> One solution would be to copy the information from the regular stack
> sometime before it starts to process a crash.

That is precisely what we are doing by getting this information either from arp or from the value supplied in /etc/sysconfig/netdump.

> The simplest workaround is to specify a MAC address in
> the "/etc/sysconfig/netdump" file, but this is also broken in this
> case!

That is not a workaround. It is a "functions as designed" thing. Further, it is not broken in the configurations that I've tested. I use this feature of netdump daily.

> This preserves NETDUMPADDR in the case where arping/arp fail and
> makes the problem solvable, by manually supplying the proper MAC
> address in "/etc/sysconfig/netdump", for NETDUMPMACADDR.

If arp fails, we exit from the script in print_address_info. I'm not sure what problem you are addressing with this patch. Perhaps you could supply more detail about the configuration that isn't working for you?

The idea of using traceroute for finding the gateway automatically sounds worth exploring. This may be worth incorporating in the next version of netdump.

Thanks.
The problem is that arp[ing] only works when the address is either in the arp cache or reachable through an Ethernet-layer (layer 2) broadcast. There are some slight exceptions, such as when proxy arp is in use. However, the bottom line is that, in general, either someone manually adds an address to the arp cache for a host that is on another subnet/broadcast domain, or arp does not work.

This is normally fine, since any address that falls outside the local subnet, as defined by the subnet mask, is automatically contacted through the gateway. However, netdump is not normal in that it does not use the standard IP stack, so it does not pick up this automatic indirection -- it needs to have the right MAC address configured, either manually or, better, automatically. As things stand in the version we use, both of these options are broken. I would not be surprised if at least the manual configuration option has been fixed, but this is still a pain.

If you are not seeing this problem, the most likely reason is that either you are in an environment where netdump traffic does not need to cross any subnet boundary -- or someone has configured your routers and/or proxies to transparently forward arp broadcasts and responses. This is not the way things operate in general in the wide world and netdump is broken in environments that do not do this -- and this is not just a theoretical possibility.

Here is a quick high-level overview of ARP I found just now that explains this: <http://www.mynetwatchman.com/pckidiot/chap05.htm>.
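The same-subnet test described above can be sketched in shell. This is only an illustration of the rule the regular IP stack applies, not code from the netdump script; the function names and all addresses are made up for the example.

```shell
#!/bin/sh
# Sketch only: decide which IP address must be resolved to a MAC address.
# ip_to_int and next_hop_ip are hypothetical helpers, not netdump functions.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    oldIFS=$IFS
    IFS=.
    set -- $1
    IFS=$oldIFS
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Arguments: local address, netmask, server address, gateway address.
# Prints the server address when it is on the local subnet,
# otherwise the gateway address.
next_hop_ip() {
    local_ip=$(ip_to_int "$1")
    mask=$(ip_to_int "$2")
    server=$(ip_to_int "$3")
    if [ $(( local_ip & mask )) -eq $(( server & mask )) ]; then
        echo "$3"
    else
        echo "$4"
    fi
}

# Server on the same subnet: ARP for the server itself.
next_hop_ip 192.168.1.10 255.255.255.0 192.168.1.20 192.168.1.1
# Server behind a router: ARP for the gateway instead.
next_hop_ip 192.168.1.10 255.255.255.0 10.0.5.20 192.168.1.1
```

The first call prints the server address and the second prints the gateway address, which is the address whose MAC netdump would need in the routed case.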
Here's what I would do to try to reproduce this:

1. /sbin/ifconfig (find "addr" and "Mask")
2. /sbin/route -n (find a "Gateway" entry, where "Flags" includes 'G')
3. ping at least one address that is on the same subnet as "addr" (in other words, an address that differs from "addr" only in bits that are zero in "Mask") and at least one address on a different subnet
4. /sbin/arp -n -a
   Note that the address on the same subnet has an arp entry, while the address on a different subnet does not; also notice that the "Gateway" used to reach the second address has an entry.

If you're saying that manual configuration now works and is required in this type of network environment, I'd still call this a bug and encourage you to fix it at the next opportunity.
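For the route step above, a small awk filter can pick out the gateway entry. This is a sketch run against a made-up routing table in the usual "route -n" column layout; the addresses and the default_gateway helper are assumptions for illustration only.

```shell
#!/bin/sh
# Sketch only: extract the default gateway from "route -n" style output.
# The sample table below is invented for the example.
sample_routes='Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.1.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
0.0.0.0         192.168.1.1     0.0.0.0         UG    0      0        0 eth0'

# Print the gateway of the default route: destination 0.0.0.0
# with the G flag set in the Flags column.
default_gateway() {
    awk '$1 == "0.0.0.0" && $4 ~ /G/ { print $2; exit }'
}

gateway=$(printf '%s\n' "$sample_routes" | default_gateway)
echo "default gateway: $gateway"
```

On a live system the same filter could be fed from "/sbin/route -n" instead of the sample text.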
    ping_output="$(ping -c 1 -I $DEV $host 2> /dev/null | \
    grep '^PING ' | awk '{print $3}' | sed 's#^(##' | sed 's#)$##')"
    [ $? -ne 0 ] && echo "$prog: cannot ping $host" 1>&2 && usage

I believe the return value from the ping_output line will be the result of the last command evaluated in the pipeline. So, this does not check that ping failed. This needs fixing.

    + trc_output="$(traceroute -i $DEV -n -m 1 $host_ip 2> /dev/null | \
    + grep '^ 1 ' | awk '{print $2}')"
    + [ $? -ne 0 ] && echo "$prog: cannot traceroute $host_ip" 1>&2 && usage

Same here.

      for line in $arp_output; do
      IFS=$oldIFS
      set - foo $line
      shift
    - if [ "$2" = "($host)" ] || expr "$1" : "$host" &>/dev/null; then
    - echo HOSTNAME=$1 IPADDR=$2 AT=$3 MAC=$4 \
    + if [ "$2" = "($mac_ip)" ] || expr "$1" : "$mac_ip" &>/dev/null; then
    + echo HOSTNAME=$1 IPADDR=$host_ip MAC_IPADDR=$2 AT=$3 MAC=$4 \
      TYPE=$5 ON=$6 IFACE=$7

This bit won't apply anymore.

My main concern with this patch is that we don't introduce regressions, not even in the error cases. Please create a new patch against the latest source, netdump-0.7.5, and if you could generate diffs against the source tree, that would be ideal (i.e. not against /etc/init.d/netdump and some file netdump in whatever directory).

Please also note that it has been mentioned that some switches hide the first hop. I want to ensure that the hard-coded case will still work in this case. To that end, if anyone watching this bugzilla has a network with foundry switches deployed, please let me know if you can volunteer to test.

Thanks,
Jeff
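The point about the pipeline's return value can be checked with a minimal sketch. It uses "false" and "cat" as stand-ins for ping and the grep/awk/sed filters; the variable names are made up for the example.

```shell
#!/bin/sh
# Sketch only: "$?" after a pipeline reports the last command, not the first.

# The failing first stage is masked by the succeeding last stage.
out=$(false | cat)
pipeline_status=$?

# Running the command of interest on its own keeps its exit status visible.
out=$(false)
direct_status=$?

echo "pipeline status: $pipeline_status"
echo "direct status: $direct_status"
```

One possible way around this, offered here as an assumption rather than as what the eventual patch does, is to capture the ping output first, test "$?" immediately, and only then run the filters on the captured text.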
Created attachment 111129 [details]
netdump-subnets.patch

Updated patch. Untested (as I don't have the hardware to do so here).
Created attachment 111136 [details] Slight corrections to prior version of the netdump-subnets patch
This breaks my setup. I believe our routers use proxy arp. Previously, I could specify the IP address of my netdump server (which is on the other side of the router), and this would work fine. With your patch applied, this no longer works. I'll look into this further. -Jeff
Created attachment 111444 [details]
Find the next hop MAC address automatically

Alan, please take a look at this version of the patch. Honestly, I don't see how the last version of the patch would have worked for anything other than client and server on the same subnet.

Thanks.
Jeff
Ugh. The usage function doesn't return:

    [ $? -ne 0 ] && echo "$prog: cannot ping $host" 1>&2 && usage

So you really want to exit the script at this point. Not only that, the script was called with the proper arguments, but the configuration was incorrect. Thus, telling the user that they need to call the script with start|stop|status is not helpful in this case.
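A sketch of reporting a configuration problem separately from a usage error might look like the following. config_error and check_host are hypothetical helpers invented for this example, not functions from the netdump script; "prog" mirrors the variable name seen in the quoted code.

```shell
#!/bin/sh
# Sketch only: report a configuration problem without printing usage text.
prog=netdump

# Print a message to stderr and return failure, leaving it to the caller
# to decide whether the script should stop.
config_error() {
    echo "$prog: $*" 1>&2
    return 1
}

# Hypothetical check standing in for the ping test: fail when no
# server address was supplied.
check_host() {
    [ -n "$1" ] || { config_error "no netdump server configured"; return 1; }
    return 0
}

check_host "" || echo "configuration problem detected"
```

Returning a status instead of calling usage keeps the start|stop|status help text for genuinely malformed invocations.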
Created attachment 111448 [details]
Gets rid of bogus Usage calls.

Okay, this patch gets rid of the calls to usage. After looking at the code again, it's apparent that you don't want to exit in these cases. This version passes a number of regression tests in my environment. Any testing by others would be greatly appreciated.
Sorry -- I should have taken more time with this; I just did a quick review and smoke test of the last patch. Thanks for following up! The most recent patch looks good to me and worked in our environment -- this time I configured and also actually triggered a dump :).
OK, thanks for the testing, Alan. I'll work to get this into our next update.
A fix for this has been committed to netdump, and is on track for RHEL 3 U5 and RHEL 4 U1. Packages versioned 0.7.7-2 and later have this fix.
I'm having trouble getting this patch to apply to my netdump install. Is there any way I could get the packaged version?

    patching file netdump
    Hunk #1 FAILED at 73.
    1 out of 1 hunk FAILED -- saving rejects to file netdump.rej
You can get the latest version from anonymous cvs:

    export CVSROOT=:pserver:anonymous.com:/usr/local/CVS
    cvs -z3 login
    (hit enter)
    cvs -z3 co netdump
Indeed this works.
I'm not able to reproduce this with any of our testlab networks, probably due to having routers passing arp requests or something. But it appears at least one person had success with the patched packages. Will try with a less smart networking setup once I get into the lab.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2005-451.html