Bug 459879 - kdump via bond device doesn't work for non-basic config.
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kexec-tools
Version: 5.2
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Neil Horman
QA Contact: Red Hat Kernel QE team
Reported: 2008-08-23 11:02 EDT by Etsuji Nakai
Modified: 2012-10-02 15:51 EDT (History)
Doc Type: Bug Fix
Clones: 600607
Last Closed: 2009-01-20 15:59:52 EST

Attachments
suggested patch for /sbin/mkdumprd (1.01 KB, patch)
2008-08-23 11:02 EDT, Etsuji Nakai
kdump.conf (4.24 KB, text/plain)
2008-08-25 19:12 EDT, Etsuji Nakai
initrd for the crash kernel (4.80 MB, application/octet-stream)
2008-08-25 19:15 EDT, Etsuji Nakai
Correct one: initrd for the crash kernel (4.80 MB, application/octet-stream)
2008-08-25 19:22 EDT, Etsuji Nakai
patch to update MASTER pointers for slaves on bonded interfaces (687 bytes, patch)
2008-09-04 13:42 EDT, Neil Horman
fixing typo and the second problem for /sbin/mkdumprd (783 bytes, patch)
2008-09-05 05:13 EDT, Etsuji Nakai

Description Etsuji Nakai 2008-08-23 11:02:46 EDT
Created attachment 314866
suggested patch for /sbin/mkdumprd

Description of problem:
kdump via bond0 consisting of eth0 and eth1 may work, but it doesn't work for other (non-basic) bond configs such as "bond1 consisting of eth1 and eth2".

Version-Release number of selected component (if applicable):
kexec-tools-1.102pre-21.el5
kexec-tools-1.102pre-21.el5_2.2

How reproducible:
Follow the steps below.

Steps to Reproduce:
1. Create bond1 with eth1 and eth2 in addition to (non-bond) eth0.

# cat /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
BOOTPROTO=dhcp
ONBOOT=yes
DHCP_HOSTNAME=rhel52

# cat /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
ONBOOT=yes
SLAVE=yes
MASTER=bond1

# cat /etc/sysconfig/network-scripts/ifcfg-eth2
DEVICE=eth2
ONBOOT=yes
SLAVE=yes
MASTER=bond1

# cat /etc/sysconfig/network-scripts/ifcfg-bond1
DEVICE=bond1
IPADDR=192.168.1.198
NETMASK=255.255.255.0
NETWORK=192.168.1.0
BROADCAST=192.168.1.255
GATEWAY=192.168.1.254
ONBOOT=yes
BOOTPROTO=static
BONDING_OPTS="mode=1 primary=eth1 miimon=100 updelay=5000"

Add "alias bond1 bonding" to /etc/modprobe.conf

2. Configure net-kdump via bond1

# cat /etc/kdump.conf
net admin@192.168.1.190
path /home/admin/crash

3. Trigger a crash to start kdump

# echo c > /proc/sysrq-trigger

Actual results:
eth1 and eth2 fail to be enslaved, so the vmcore cannot be sent to the remote server.

Expected results:
vmcore is sent out to the remote server via the bond device.


Additional info:
Attached is a suggested patch for the /sbin/mkdumprd shipped in kexec-tools-1.102pre-21.el5_2.2.
Comment 1 Neil Horman 2008-08-25 09:35:40 EDT
What exactly do you think is going wrong?  bond1 should be passed into find_activate_slaves as arg 1, and as a result we should find all physical interfaces that are attached to that bond.  Your patch seems like a hack to make your environment work properly (especially since you hardcode bond0 into your definition of OLD_MASTER).  Could you please send me your kdump.conf and the initramfs image you produce from it?  Thanks!
Comment 2 Etsuji Nakai 2008-08-25 19:12:26 EDT
Created attachment 314951
kdump.conf
Comment 3 Etsuji Nakai 2008-08-25 19:15:03 EDT
Created attachment 314953
initrd for the crash kernel
Comment 4 Etsuji Nakai 2008-08-25 19:22:01 EDT
Created attachment 314954
Correct one: initrd for the crash kernel

Attachment 314953 is incorrect. Please ignore it. This is the correct one.
Comment 5 Etsuji Nakai 2008-08-25 19:23:21 EDT
Please re-check how the "init" script in the initrd for the crash kernel (initrd-<Version>kdump.img) handles bonding devices. This is what I found:

1. bondX (the bond master in the normal kernel, bond1 in my case) is forcibly renamed to bond0, which is hardcoded in the "map_interface" script.

* Corresponding part of "init" (in initrd-<Version>kdump.img)
------------
for i in `ls /etc/ifcfg-*`
do
   NETDEV=`echo $i | cut -d"-" -f2`
   map_interface $NETDEV
done
rename_interfaces
IFACE=`cat /etc/iface_to_activate`
ifup $IFACE
------------

* "scriptfns/map_interface" called from "init".
------------
. /etc/ifcfg-$NETDEV
for j in `ifconfig -a | awk '/.*Link encap.*/ {print $1}'`
do
    case "$BUS_ID" in
    Bonding)
        REAL_DEV=bond0
        RENAMED="yes"
        ;;
...
#build the interface rename map
echo $NETDEV $REAL_DEV tmp$TMPCNT >> /etc/iface_map
TMPCNT=`echo $TMPCNT 1 + p | dc`
echo $TMPCNT > /tmp/tmpcnt
echo mapping $NETDEV to $REAL_DEV
-------------

2. As a result, bond0 (instead of the original bond master, bond1 in my case) is always activated as the master, and "bond0" is passed into find_activate_slaves as arg 1.

3. However, since ifcfg-eth1 and ifcfg-eth2 still contain "MASTER=bond1", eth1 and eth2 fail to be enslaved to the active bond0.
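
The mismatch described in points 2 and 3 can be reproduced outside the initrd. The following is a minimal sketch, assuming a scratch directory in place of /etc and a hypothetical helper, slaves_of, that only mimics a lookup by MASTER= value; it is not the actual find_activate_slaves code.

```shell
#!/bin/sh
# Scratch directory standing in for /etc inside the kdump initrd (assumption).
CFG_DIR=$(mktemp -d)

# Slave configs as copied from the normal system: both still name bond1.
printf 'DEVICE=eth1\nSLAVE=yes\nMASTER=bond1\n' > "$CFG_DIR/ifcfg-eth1"
printf 'DEVICE=eth2\nSLAVE=yes\nMASTER=bond1\n' > "$CFG_DIR/ifcfg-eth2"

# Hypothetical helper: print the devices whose ifcfg file names the given master.
slaves_of() {
    for f in "$CFG_DIR"/ifcfg-*; do
        (
            # Subshell keeps the sourced variables from leaking between files.
            MASTER=
            DEVICE=
            . "$f"
            if [ "$MASTER" = "$1" ]; then
                echo "$DEVICE"
            fi
        )
    done
}

# The crash kernel activates bond0, but the slaves still point at bond1.
BOND0_SLAVES=$(echo $(slaves_of bond0))
BOND1_SLAVES=$(echo $(slaves_of bond1))
echo "slaves of bond0: ${BOND0_SLAVES:-none}"
echo "slaves of bond1: ${BOND1_SLAVES:-none}"

rm -rf "$CFG_DIR"
```

With the lookup keyed on bond0, no slave matches, which is consistent with eth1 and eth2 never being enslaved.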

Another problem is that find_activate_slaves tries to read ifcfg-eth0 first and fails, because ifcfg-eth0 doesn't exist in the crash kernel environment. find_activate_slaves then aborts at that point. This is why I added 
+    if [ -f /etc/ifcfg-\$j ];
in the patch.
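
The effect of that guard can be sketched as follows. This is a minimal illustration rather than the mkdumprd code: the scratch directory, the hard-coded interface list, and the loop body are assumptions. Sourcing a missing file with "." can terminate a non-interactive POSIX shell, so the loop tests for the file before reading it.

```shell
#!/bin/sh
CFG_DIR=$(mktemp -d)   # stand-in for /etc in the initrd (assumption)

# Only eth1 and eth2 have config files; eth0 does not, as in the report.
printf 'DEVICE=eth1\nMASTER=bond0\n' > "$CFG_DIR/ifcfg-eth1"
printf 'DEVICE=eth2\nMASTER=bond0\n' > "$CFG_DIR/ifcfg-eth2"

FOUND=""
# Interface names as they might come from 'ifconfig -a' (hard-coded here).
for j in eth0 eth1 eth2; do
    # Guard: skip interfaces without a config file instead of sourcing blindly.
    if [ -f "$CFG_DIR/ifcfg-$j" ]; then
        MASTER=
        . "$CFG_DIR/ifcfg-$j"
        if [ "$MASTER" = "bond0" ]; then
            FOUND="$FOUND $j"
        fi
    fi
done
FOUND=${FOUND# }       # drop the leading separator
echo "slaves to activate: $FOUND"

rm -rf "$CFG_DIR"
```

The loop passes over eth0 and still reaches eth1 and eth2 instead of stopping at the first missing file.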

See the attachments (314951, 314954) for the kdump.conf and the crash-kernel initrd, which was built with the original (unpatched) mkdumprd.
Comment 6 Neil Horman 2008-09-04 13:40:44 EDT
The concern you raise in point 2 is supposed to be taken care of by the rename_interfaces function.  Looking at it, though, it seems that function doesn't handle slaves on bonded interfaces.  I'll write a patch...
Comment 7 Neil Horman 2008-09-04 13:42:01 EDT
Created attachment 315787
patch to update MASTER pointers for slaves on bonded interfaces

I've not tested it yet, but I think this will update what you're missing.  Please let me know if it solves your problems.  Thanks!
Comment 8 Etsuji Nakai 2008-09-05 05:13:47 EDT
Created attachment 315851
fixing typo and the second problem for /sbin/mkdumprd
Comment 9 Etsuji Nakai 2008-09-05 05:18:55 EDT
Thanks Neil. 

I found two typos in your patch:

+    IS_BOND=\`echo /etc/ifcfg-\$NEW | grep bond\`
+    if [ -n "$IS_BOND" ] #### should be =====> if [ -n "\$IS_BOND" ]
+    then
+        for i in \`ls /etc/ifcfg-*\`
+        do
+            sed -e"s/.*MASTER=\$CURRENT.*/MASTER=\$NEW/" \$i > /tmp/ifcfg-tmp
+            mv /sbin/ifcfg-tmp \$i  #### should be =====> mv /tmp/ifcfg-tmp \$i
+        done
+    fi
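
With both corrections applied, the rewrite step amounts to the following. This is a minimal sketch that runs directly rather than from inside mkdumprd, so the "\$" escapes are dropped; the scratch directory, the temporary file name, and the fixed values of CURRENT and NEW are assumptions for illustration.

```shell
#!/bin/sh
CFG_DIR=$(mktemp -d)   # stand-in for /etc in the initrd (assumption)
CURRENT=bond1          # bond name in the normal kernel
NEW=bond0              # bond name after renaming in the crash kernel

printf 'DEVICE=eth1\nONBOOT=yes\nSLAVE=yes\nMASTER=bond1\n' > "$CFG_DIR/ifcfg-eth1"
printf 'DEVICE=bond0\nIPADDR=192.168.1.198\n' > "$CFG_DIR/ifcfg-$NEW"

# Same name-based check as the patch: only act when the new name is a bond.
IS_BOND=$(echo "ifcfg-$NEW" | grep bond)
if [ -n "$IS_BOND" ]; then
    for i in "$CFG_DIR"/ifcfg-*; do
        # Point every slave at the renamed master; other lines are untouched.
        sed -e "s/.*MASTER=$CURRENT.*/MASTER=$NEW/" "$i" > "$CFG_DIR/tmp-ifcfg"
        mv "$CFG_DIR/tmp-ifcfg" "$i"
    done
fi

NEW_MASTER=$(grep '^MASTER=' "$CFG_DIR/ifcfg-eth1")
echo "$NEW_MASTER"

rm -rf "$CFG_DIR"
```

The temporary file is written and moved from the same path, which is the point of the second correction.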

The second problem still remains:

find_activate_slaves takes the device names ethXX from the 'ifconfig' output and tries to read the corresponding /etc/ifcfg-ethXX, but that file doesn't always exist. When find_activate_slaves tries to read a non-existent /etc/ifcfg-ethXX, it fails and find_activate_slaves aborts at that point. (This may be specific to the busybox shell.)

See the attachment (id=315851) for the patch fixing the typos and this problem. It worked in my environment.
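
The missing backslash matters because the patch lines carry "\$" and escaped backquotes, which suggests mkdumprd writes them into the generated init script rather than running them itself. A minimal sketch of that effect, assuming a here-document as the generation mechanism (the real script may emit its lines differently) and a hypothetical output file:

```shell
#!/bin/sh
OUT=$(mktemp)          # stand-in for the generated init script (assumption)
IS_BOND=""             # empty while the generator itself runs

# Unquoted here-document: $VAR expands now, \$VAR is kept for the generated script.
cat > "$OUT" <<EOF
IS_BOND=yes
if [ -n "$IS_BOND" ]; then echo "unescaped: taken"; else echo "unescaped: skipped"; fi
if [ -n "\$IS_BOND" ]; then echo "escaped: taken"; else echo "escaped: skipped"; fi
EOF

# Run the generated script the way the crash kernel would run init.
RESULT=$(sh "$OUT")
echo "$RESULT"

rm -f "$OUT"
```

The unescaped test is frozen to an empty string at generation time, so its branch is never taken in the generated script.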
Comment 10 Neil Horman 2008-09-05 07:00:45 EDT
Like I said, I hadn't tested it.  What is the result of your testing after fixing the typos?

As for your patch, as I've noted, I still don't like it for the reasons I gave previously. However, given that you seem adamant about it, I'll consider it if you can test it and show that it works in other cases as well.  Specifically, if it works in the trivial case (where there is only one bonded interface, bond0, in the entire system running under a normal kernel) and in the case where there is only one bonded interface without the normal name (say bondtest), then I'll look into taking it.
Comment 11 Etsuji Nakai 2008-09-05 07:31:36 EDT
No, no. I'm not sticking to the original patch (id=314866); let's throw it away.

Please look at the contents of my last patch (id=315851).

https://bugzilla.redhat.com/attachment.cgi?id=315851

It's just a slight modification of your patch (id=315787) in the following ways.

- Fixing the typos.
- Adding a fix for the other problem below (separate from the one fixed by your patch).
------------------
find_activate_slaves takes the device names ethXX from the 'ifconfig' output and
tries to read the corresponding /etc/ifcfg-ethXX, but that file doesn't always
exist. When find_activate_slaves tries to read a non-existent /etc/ifcfg-ethXX,
it fails and find_activate_slaves aborts at that point. (This may be specific to
the busybox shell.)
------------------

My testing results are as follows.

Your patch (id=315787) didn't work because of the typos and the other problem.
The modified version of your patch (id=315851) worked well.

Does it make sense? Thanks.
Comment 12 Neil Horman 2008-09-05 10:44:36 EDT
Ahh, sorry, I missed that attachment.  Yes, what you have there makes sense to me. I'll check it in as soon as I can.

Thanks!
Comment 17 errata-xmlrpc 2009-01-20 15:59:52 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0105.html
