Bug 90803 - (IT_44337) /etc/init.d/netdump start script requires client to be on same subnet as server
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: netdump
Hardware: All Linux
Priority: medium, Severity: medium
Assigned To: Jeffrey Moyer
Depends On:
Blocks: 132991
Reported: 2003-05-13 20:09 EDT by Allen Nuttle
Modified: 2007-11-30 17:06 EST
Fixed In Version: RHBA-2005-113
Doc Type: Bug Fix
Last Closed: 2005-05-19 20:12:26 EDT

Attachments
/etc/init.d/netdump modifications (3.54 KB, patch)
2003-05-13 20:17 EDT, Allen Nuttle
netdump-mac-subnet.patch (2.66 KB, patch)
2004-08-03 08:32 EDT, Bastien Nocera
netdump-subnets.patch (2.34 KB, patch)
2005-02-16 09:29 EST, Bastien Nocera
Slight corrections to prior version of the netdump-subnets patch (2.43 KB, patch)
2005-02-16 15:41 EST, Allen Nuttle
Find the next hop MAC address automatically (2.26 KB, patch)
2005-02-25 16:39 EST, Jeffrey Moyer
Gets rid of bogus Usage calls. (2.46 KB, patch)
2005-02-25 17:27 EST, Jeffrey Moyer

Description Allen Nuttle 2003-05-13 20:09:29 EDT
Description of problem:

In addition to the netdump server's IP address, the netdump client
also needs the MAC address of the next hop, on the local subnet.
If the server is on the same subnet, this is the same as the MAC
address of the server.  However, if this is not the case, the next
hop MAC address is the MAC address of the gateway.
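
The rule above can be sketched in shell. This is only an illustration, not code from the netdump script: the function names `ip_to_int` and `next_hop_ip` are hypothetical, and the addresses in the usage note are placeholders. It assumes IPv4 dotted-quad addresses without leading zeros.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to a 32-bit integer.
# The function body runs in a subshell so the IFS change stays local.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Print the IP address whose MAC the client needs:
# the server itself if it is on the client's subnet, otherwise the gateway.
# Arguments: client_ip netmask server_ip gateway_ip
next_hop_ip() (
    client=$(ip_to_int "$1")
    mask=$(ip_to_int "$2")
    server=$(ip_to_int "$3")
    if [ $(( $client & $mask )) -eq $(( $server & $mask )) ]; then
        echo "$3"
    else
        echo "$4"
    fi
)
```

For example, `next_hop_ip 192.168.1.20 255.255.255.0 10.0.0.5 192.168.1.1` prints the gateway address, because the server is outside the client's /24.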

The /etc/init.d/netconsole script attempts to be clever when
processing the "start" command and uses arping/arp to try to
automatically configure the needed MAC address.  Unfortunately, this
doesn't work in the case where a router separates client and server.

I suppose this is all a result of netdump/netconsole avoiding use of
the regular network stack and constructing UDP packets from scratch.
One solution would be to copy the information from the regular stack
sometime before it starts to process a crash.

The simplest workaround is to specify a MAC address in
the "/etc/sysconfig/netdump" file, but this is also broken in this
case!  The fix for this second problem is a trivial edit
to /etc/init.d/netdump:

*** 107,112 ****
--- 107,113 ----
      # netdump/netconsole server
      eval $(print_address_info $NETDUMPADDR)
      [ "$HOSTNAME" = "?" -a -z "$MAC" ] && \
  	echo "$prog: can't resolve $NETDUMPADDR MAC address" 1>&2 && usage

This preserves NETDUMPADDR in the case where arping/arp fail and
makes the problem solvable, by manually supplying the proper MAC
address in "/etc/sysconfig/netdump", for NETDUMPMACADDR.
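
For illustration, a manual configuration of this kind might look like the fragment below. The variable names are the ones mentioned in this report; the address values are placeholders, and the MAC shown would be that of the gateway on the client's subnet, not of the server.

```shell
# /etc/sysconfig/netdump (fragment; values are placeholders)
# IP address of the netdump server, possibly on another subnet
NETDUMPADDR=192.0.2.10
# MAC address of the next hop on the local subnet (the gateway here)
NETDUMPMACADDR=00:11:22:33:44:55
```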

I have a more elaborate fix that automatically works out the MAC
address and does not require this to be configured.  However, my
shell script hacking proficiency is down near zero these days so
it could do with a review/rework by someone who does this more often
than I.  I will attach this...

Version-Release number of selected component (if applicable):


How reproducible:

Every time.

Steps to Reproduce:

1. Set-up netdump server and client on two distinct subnets, layer 3
    routing required for connectivity
2. Configure netdump client
3. Attempt to start netdump client

Actual results:

service netdump start fails

Expected results:

service netdump start succeeds

Additional info:
Comment 1 Allen Nuttle 2003-05-13 20:17:03 EDT
Created attachment 91655
/etc/init.d/netdump modifications
Comment 2 Bastien Nocera 2004-08-03 08:32:17 EDT
Created attachment 102386

unified diff version of the above patch, against the RHEL3 netdump.
Comment 3 Dave Anderson 2004-08-17 09:15:18 EDT
Jeff, can you take a look at this?
Comment 4 Jeffrey Moyer 2004-08-17 09:35:40 EDT
> One solution would be to copy the information from the regular stack
> sometime before it starts to process a crash.

That is precisely what we are doing by getting this information either
from arp or from the value supplied in /etc/sysconfig/netdump.

> The simplest workaround is to specify a MAC address in
> the "/etc/sysconfig/netdump" file, but this is also broken in this
> case!

That is not a workaround.  It is a "functions as designed" thing. 
Further, it is not broken in the configurations that I've tested.  I
use this feature of netdump daily.

> This preserves NETDUMPADDR in the case where arping/arp fail and
> makes the problem solvable, by manually supplying the proper MAC
> address in "/etc/sysconfig/netdump", for NETDUMPMACADDR.

If arp fails, we exit from the script in print_address_info.  I'm not
sure what problem you are addressing with this patch.

Perhaps you could supply more detail about the configuration that
isn't working for you?

The idea of using traceroute for finding the gateway automatically
sounds worth exploring.  This may be worth incorporating in the next
version of netdump.

Comment 5 Allen Nuttle 2004-08-17 12:02:36 EDT
The problem is that arp[ing] only works when the address is either
in the arp cache or reachable through an Ethernet-layer (layer 2)
broadcast.  There are some slight exceptions, such as when proxy
arp is in use.  However, the bottom line is that, in general, either
someone manually adds an address to the arp cache for a host that is
on another subnet/broadcast domain, or arp does not work.  This is
normally fine, since any address that is not specified by the subnet
mask is automatically contacted through the gateway.  However,
netdump is not normal in that it does not use the standard IP stack,
so it does not pick up this automatic indirection -- it needs to
have the right MAC address configured, either manually or, better, 
automatically.  As things stand in the version we use, both of these
options are broken.  I would not be surprised if at least the manual
configuration option has been fixed, but this is still a pain.

If you are not seeing this problem, the most likely reason is that
either you are in an environment where netdump traffic does not need
to cross any subnet boundary -- or someone has configured your
routers and/or proxies to transparently forward arp broadcasts and
responses.  This is not the way things operate in general in the
wide world and netdump is broken in environments that do not do this
-- and this is not just a theoretical possibility.

Here is a quick high-level overview of ARP I found just now that
explains this: <http://www.mynetwatchman.com/pckidiot/chap05.htm>.

Here's what I would do to try to reproduce this:

/sbin/ifconfig  (find "addr" and "Mask")
/sbin/route -n  (find a "Gateway" entry, where "Flags" includes 'G')

ping at least one address that is on the same subnet as "addr" (in
other words, an address that differs from "addr" only in bits that
are zero in "Mask") and at least one address on a different subnet

/sbin/arp -n -a

note that the address on the same subnet has an arp entry, while the
address on a different subnet does not; also notice that the
"Gateway" used to reach the second address has an entry

If you're saying that manual configuration now works and is required
in this type of network environment, I'd still call this a bug and
encourage you to fix it at the next opportunity.
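
As a rough sketch of the route-table step above, the gateway addresses can be pulled out of `/sbin/route -n` output with awk. The function name `flagged_gateways` is hypothetical, and the sketch assumes the usual net-tools column order (Destination, Gateway, Genmask, Flags, ...).

```shell
#!/bin/sh
# Read "route -n" style output on stdin and print the Gateway column
# of every route whose Flags column contains 'G'.
flagged_gateways() {
    awk '$4 ~ /G/ { print $2 }'
}

# Typical use on a live system (not run here):
#   /sbin/route -n | flagged_gateways
```

Each address printed this way should then show up in the `/sbin/arp -n -a` listing once traffic has been sent to a host on another subnet.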
Comment 10 Jeffrey Moyer 2005-02-15 13:58:03 EST
    ping_output="$(ping -c 1 -I $DEV $host 2> /dev/null | \
      grep '^PING ' | awk '{print $3}' | sed 's#^(##' | sed 's#)$##')"
    [ $? -ne 0 ] && echo "$prog: cannot ping $host" 1>&2 && usage

I believe the return value from the ping_output line will be the
result of the last command evaluated in the pipeline.  So, this does
not check that ping failed.  This needs fixing.
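
A small demonstration of the pipeline issue, using a stand-in for ping so it runs anywhere. `failing_probe` is a hypothetical helper, not part of netdump; the second half shows one possible way to test the probe's own exit status before parsing its output.

```shell
#!/bin/sh
# Stand-in for a ping that prints its banner but fails (hypothetical helper).
failing_probe() {
    echo "PING example"
    return 2
}

# Pattern from the patch: $? is the status of awk, the last pipeline stage,
# so the failure of the first command goes unnoticed.
piped_output="$(failing_probe 2> /dev/null | grep '^PING ' | awk '{print $2}')"
piped_status=$?

# Alternative: capture the raw output first and keep the probe's own status,
# then parse the captured text in a separate step.
if raw_output="$(failing_probe 2> /dev/null)"; then
    probe_status=0
else
    probe_status=$?
fi
parsed_output="$(printf '%s\n' "$raw_output" | grep '^PING ' | awk '{print $2}')"
```

Here `piped_status` ends up as 0 even though the probe failed, while `probe_status` carries the probe's non-zero exit code.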
+    trc_output="$(traceroute -i $DEV -n -m 1 $host_ip 2> /dev/null | \
+      grep '^ 1  ' | awk '{print $2}')"
+    [ $? -ne 0 ] && echo "$prog: cannot traceroute $host_ip" 1>&2 &&

Same here.
     for line in $arp_output; do
         set - foo $line
-        if [ "$2" = "($host)" ] || expr "$1" : "$host" &>/dev/null; then
-            echo HOSTNAME=$1 IPADDR=$2 AT=$3 MAC=$4 \
+	if [ "$2" = "($mac_ip)" ] || expr "$1" : "$mac_ip" &>/dev/null; then
+	    echo HOSTNAME=$1 IPADDR=$host_ip MAC_IPADDR=$2 AT=$3 MAC=$4 \
                  TYPE=$5 ON=$6 IFACE=$7

This bit won't apply anymore.

My main concern with this patch is that we don't introduce
regressions, not even in the error cases.  Please create a new patch
against the latest source, netdump-0.7.5, and if you could generate
diffs against the source tree, that would be ideal (i.e. not against
/etc/init.d/netdump and some file netdump in whatever directory).

Please also note that it has been mentioned that some switches hide
the first hop.  I want to ensure that the hard-coded case will still
work in this case.  To that end, if anyone watching this bugzilla has
a network with foundry switches deployed, please let me know if you
can volunteer to test.


Comment 11 Bastien Nocera 2005-02-16 09:29:19 EST
Created attachment 111129

Updated patch. Untested (as I don't have the hardware to do so here).
Comment 12 Allen Nuttle 2005-02-16 15:41:45 EST
Created attachment 111136
Slight corrections to prior version of the netdump-subnets patch
Comment 13 Jeffrey Moyer 2005-02-25 15:24:21 EST
This breaks my setup.  I believe our routers use proxy arp. 
Previously, I could specify the IP address of my netdump server (which
is on the other side of the router), and this would work fine.  With
your patch applied, this no longer works.

I'll look into this further.

Comment 14 Jeffrey Moyer 2005-02-25 16:39:27 EST
Created attachment 111444
Find the next hop MAC address automatically

Allen, please take a look at this version of the patch.  Honestly, I don't see
how the last version of the patch would have worked for anything other than
client and server on the same subnet.


Comment 15 Jeffrey Moyer 2005-02-25 16:45:42 EST
Ugh.  The usage function doesn't return:

    [ $? -ne 0 ] && echo "$prog: cannot ping $host" 1>&2 && usage

So you really want to exit the script at this point.  Not only that,
the script was called with the proper arguments, but the configuration
was incorrect.  Thus, telling the user that they need to call the
script with start|stop|status is not helpful in this case.
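
One way to keep configuration errors separate from usage errors is sketched below. This is not the code from the attached patch: `config_error` and `resolve_host` are hypothetical names, and the emptiness check merely stands in for the real ping/arp lookup. The helper reports the problem on stderr and returns a failure status, leaving it to the caller to decide whether to exit.

```shell
#!/bin/sh
prog=netdump

# Hypothetical helper: report a configuration problem on stderr
# without printing the start|stop|status usage text and without exiting.
config_error() {
    echo "$prog: $*" 1>&2
    return 1
}

# Hypothetical caller: fails with a configuration message when no
# server address is given, otherwise echoes the address back.
resolve_host() {
    if [ -z "$1" ]; then
        config_error "no server address configured"
        return 1
    fi
    echo "$1"
}
```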
Comment 16 Jeffrey Moyer 2005-02-25 17:27:59 EST
Created attachment 111448
Gets rid of bogus Usage calls.

Okay, this patch gets rid of the calls to usage.  After looking at the code
again, it's apparent that you don't want to exit in these cases.  This version
passes a number of regression tests in my environment.  Any testing by others
would be greatly appreciated.
Comment 17 Allen Nuttle 2005-02-27 20:20:26 EST
Sorry -- I should have taken more time with this, I just did a quick
review and smoke test of the last patch.  Thanks for following up!

The most recent patch looks good to me and worked in our environment
-- this time I configured and also actually triggered a dump :).
Comment 18 Jeffrey Moyer 2005-02-28 08:12:51 EST
OK, thanks for the testing, Allen.  I'll work to get this into our next
Comment 19 Jeffrey Moyer 2005-03-01 17:45:28 EST
A fix for this has been committed to netdump, and is on track for RHEL 3 U5 and
RHEL 4 U1.  Packages versioned 0.7.7-2 and later have this fix.
Comment 20 Jerry Uanino 2005-03-02 13:59:35 EST
I'm having trouble getting this patch to apply to my netdump install.  Is
there any way I could get the packaged version?

patching file netdump
Hunk #1 FAILED at 73.
1 out of 1 hunk FAILED -- saving rejects to file netdump.rej
Comment 21 Jeffrey Moyer 2005-03-02 14:03:08 EST
You can get the latest version from anonymous cvs:

     export CVSROOT=:pserver:anonymous@rhlinux.redhat.com:/usr/local/CVS
     cvs -z3 login
     (hit enter)
     cvs -z3 co netdump
Comment 22 Jerry Uanino 2005-03-02 14:45:05 EST
Indeed this works.
Comment 23 Jay Turner 2005-04-27 06:39:57 EDT
I'm not able to reproduce this with any of our testlab networks, probably due to
having routers passing arp requests or something.  But it appears at least one
person had success with the patched packages.  Will try with a less smart
networking setup once I get into the lab.
Comment 24 Tim Powers 2005-05-19 20:12:26 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

