Bug 459879

Summary: kdump via bond device doesn't work for non-basic config.
Product: Red Hat Enterprise Linux 5
Reporter: Etsuji Nakai <enakai0>
Component: kexec-tools
Assignee: Neil Horman <nhorman>
Status: CLOSED ERRATA
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high
Priority: medium
Version: 5.2
CC: mgahagan, qcai, syeghiay
Target Milestone: rc
Hardware: All
OS: Linux
Doc Type: Bug Fix
Clone Of:
: 600607 (view as bug list) Environment:
Last Closed: 2009-01-20 20:59:52 UTC
Attachments:
Description | Flags
suggested patch for /sbin/mkdumprd | none
kdump.conf | none
inird for the crash kernel | none
Correct one: initrd for the crash kernel | none
patch to update MASTER pointers for slaves on bonded interfaces | none
fixing typo and the second problem for /sbin/mkdumprd | none

Description Etsuji Nakai 2008-08-23 15:02:46 UTC
Created attachment 314866 [details]
suggested patch for /sbin/mkdumprd

Description of problem:
kdump via bond0 consisting of eth0 and eth1 may work well, but it doesn't work with other (non-basic) bond configs such as "bond1 consisting of eth1 and eth2".

Version-Release number of selected component (if applicable):
kexec-tools-1.102pre-21.el5
kexec-tools-1.102pre-21.el5_2.2

How reproducible:
Follow the steps below.

Steps to Reproduce:
1. Create bond1 with eth1 and eth2 in addition to (non-bond) eth0.

# cat /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
BOOTPROTO=dhcp
ONBOOT=yes
DHCP_HOSTNAME=rhel52

# cat /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
ONBOOT=yes
SLAVE=yes
MASTER=bond1

# cat /etc/sysconfig/network-scripts/ifcfg-eth2
DEVICE=eth2
ONBOOT=yes
SLAVE=yes
MASTER=bond1

# cat /etc/sysconfig/network-scripts/ifcfg-bond1
DEVICE=bond1
IPADDR=192.168.1.198
NETMASK=255.255.255.0
NETWORK=192.168.1.0
BROADCAST=192.168.1.255
GATEWAY=192.168.1.254
ONBOOT=yes
BOOTPROTO=static
BONDING_OPTS="mode=1 primary=eth1 miimon=100 updelay=5000"

Add "alias bond1 bonding" to /etc/modprobe.conf

2. Configure net-kdump via bond1

# cat /etc/kdump.conf
net admin.1.190
path /home/admin/crash

3. Start kdump, then trigger a crash

# echo c > /proc/sysrq-trigger

Actual results:
eth1 and eth2 fail to be enslaved, and vmcore cannot be sent out to the remote server.

Expected results:
vmcore is sent out to the remote server via the bond device.


Additional info:
Attached is a suggested patch for /sbin/mkdumprd included in kexec-tools-1.102pre-21.el5_2.2

Comment 1 Neil Horman 2008-08-25 13:35:40 UTC
What exactly do you think is going wrong? bond1 should be passed into find_activate_slaves as arg 1, and as a result we should find all physical interfaces that are attached to that bond. Your patch seems like a hack to make your environment work properly (especially since you hardcode bond0 into your definition of OLD_MASTER). Would you please send me your kdump.conf and the initramfs image you produce from it? Thanks!

Comment 2 Etsuji Nakai 2008-08-25 23:12:26 UTC
Created attachment 314951 [details]
kdump.conf

Comment 3 Etsuji Nakai 2008-08-25 23:15:03 UTC
Created attachment 314953 [details]
inird for the crash kernel

Comment 4 Etsuji Nakai 2008-08-25 23:22:01 UTC
Created attachment 314954 [details]
Correct one: initrd for the crash kernel

Attached 314953 is incorrect. Please ignore it. This is the correct one.

Comment 5 Etsuji Nakai 2008-08-25 23:23:21 UTC
Please re-check how the "init" script in the initrd for the crash kernel (initrd-<Version>kdump.img) handles bonding devices. What I found was that:

1. bondX (a bond master in the normal kernel. bond1 in my case) is forcefully converted to bond0 (that is hardcoded) with the "map_interface" script.

* Corresponding part of "init" (in initrd-<Version>kdump.img)
------------
for i in `ls /etc/ifcfg-*`
do
   NETDEV=`echo $i | cut -d"-" -f2`
   map_interface $NETDEV
done
rename_interfaces
IFACE=`cat /etc/iface_to_activate`
ifup $IFACE
------------

* "scriptfns/map_interface" called from "init".
------------
. /etc/ifcfg-$NETDEV
for j in `ifconfig -a | awk '/.*Link encap.*/ {print $1}'`
do
    case "$BUS_ID" in
    Bonding)
        REAL_DEV=bond0
        RENAMED="yes"
        ;;
...
#build the interface rename map
echo $NETDEV $REAL_DEV tmp$TMPCNT >> /etc/iface_map
TMPCNT=`echo $TMPCNT 1 + p | dc`
echo $TMPCNT > /tmp/tmpcnt
echo mapping $NETDEV to $REAL_DEV
-------------

2. As a result, bond0 (instead of the original bond master, bond1 in my case) is always activated as a master, and "bond0" is passed into find_activate_slaves as arg 1.

3. However, since ifcfg-eth1 and ifcfg-eth2 still contain "MASTER=bond1", eth1 and eth2 fail to be enslaved to the active bond0.

Note that another problem is that find_activate_slaves tries to read ifcfg-eth0 first and fails (as ifcfg-eth0 doesn't exist in the crash kernel environment). find_activate_slaves is then aborted there. This is why I added
+    if [ -f /etc/ifcfg-\$j ];
in the patch.

See the attachments (314951, 314954) for kdump.conf and the initrd (for the crash kernel), which was built with the original (non-patched) mkdumprd.
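
The mismatch described in points 1 to 3 can be illustrated outside the initrd. The following is only a sketch under stated assumptions: find_orphaned_slaves is a hypothetical helper (it is not part of mkdumprd or the generated init script), and the demo files are throwaway copies that mimic the slave configs from the report. It lists every ifcfg file whose MASTER= value no longer matches the bond name that was actually activated.

```shell
#!/bin/sh
# Hypothetical helper (not part of mkdumprd): list ifcfg files in directory $1
# whose MASTER= value differs from the active bond name given as $2.
find_orphaned_slaves() {
    cfg_dir=$1
    active_master=$2
    for f in "$cfg_dir"/ifcfg-*; do
        [ -f "$f" ] || continue
        # Extract the value of the MASTER= line, if the file has one.
        master=$(sed -n 's/^MASTER=//p' "$f")
        if [ -n "$master" ] && [ "$master" != "$active_master" ]; then
            echo "${f##*/} MASTER=$master"
        fi
    done
}

# Demo with throwaway files that mimic the reporter's slave configs.
demo_dir=$(mktemp -d)
printf 'DEVICE=eth1\nSLAVE=yes\nMASTER=bond1\n' > "$demo_dir/ifcfg-eth1"
printf 'DEVICE=eth2\nSLAVE=yes\nMASTER=bond1\n' > "$demo_dir/ifcfg-eth2"
# bond1 was mapped to bond0, so both slaves still point at the old name.
find_orphaned_slaves "$demo_dir" bond0
rm -rf "$demo_dir"
```

With bond0 as the active master, both demo files are reported, which corresponds to eth1 and eth2 failing to be enslaved.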

Comment 6 Neil Horman 2008-09-04 17:40:44 UTC
The concern you raise in note 2 is supposed to be taken care of in the rename_interfaces function. Although, looking at it, it seems like that function misses handling slaves on bonded interfaces. I'll write a patch...

Comment 7 Neil Horman 2008-09-04 17:42:01 UTC
Created attachment 315787 [details]
patch to update MASTER pointers for slaves on bonded interfaces

I've not tested it yet, but I think this will update what you're missing.  Please let me know if it solves your problems.  Thanks!

Comment 8 Etsuji Nakai 2008-09-05 09:13:47 UTC
Created attachment 315851 [details]
fixing typo and the second problem for /sbin/mkdumprd

Comment 9 Etsuji Nakai 2008-09-05 09:18:55 UTC
Thanks Neil. 

I found two typos in your patch:

+    IS_BOND=\`echo /etc/ifcfg-\$NEW | grep bond\`
+    if [ -n "$IS_BOND" ] #### should be =====> if [ -n "\$IS_BOND" ]
+    then
+        for i in \`ls /etc/ifcfg-*\`
+        do
+            sed -e"s/.*MASTER=\$CURRENT.*/MASTER=\$NEW/" \$i > /tmp/ifcfg-tmp
+            mv /sbin/ifcfg-tmp \$i  #### should be =====> mv /tmp/ifcfg-tmp \$i
+        done
+    fi
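
For readers who want to try the idea outside mkdumprd, the loop above with both corrections applied might look roughly like the sketch below. It is an assumption-laden illustration, not the attached patch: update_master_pointers is a hypothetical function name, it runs as a plain script (so no backslash escaping of "$" is needed), it relies on mktemp being available, and it anchors the sed pattern more strictly than the original ".*MASTER=...*" expression.

```shell
#!/bin/sh
# Hypothetical stand-in for the loop in the patch: in every ifcfg file under
# directory $1, repoint slaves of bond $2 (old name) at bond $3 (new name).
# Assumes unquoted values of the form MASTER=bondN, as in the report.
update_master_pointers() {
    cfg_dir=$1
    current=$2
    new=$3
    tmp_file=$(mktemp) || return 1
    for f in "$cfg_dir"/ifcfg-*; do
        [ -f "$f" ] || continue
        # Anchoring the match leaves e.g. MASTER=bond10 alone when renaming bond1.
        sed -e "s/^MASTER=$current\$/MASTER=$new/" "$f" > "$tmp_file"
        # Read back from the same temporary path that sed wrote to;
        # copying the contents keeps the original file's permissions.
        cat "$tmp_file" > "$f"
    done
    rm -f "$tmp_file"
}

# Demo on a throwaway copy of a slave config.
demo_dir=$(mktemp -d)
printf 'DEVICE=eth1\nSLAVE=yes\nMASTER=bond1\n' > "$demo_dir/ifcfg-eth1"
update_master_pointers "$demo_dir" bond1 bond0
cat "$demo_dir/ifcfg-eth1"
rm -rf "$demo_dir"
```

The key point is that the temporary file is written and read at the same path, which is what the second typo broke.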

And there still remains the second problem:

find_activate_slaves finds device names ethXX from 'ifconfig' output and tries to read the corresponding /etc/ifcfg-ethXX, but the corresponding file doesn't always exist. Hence, when find_activate_slaves tries to read a non-existent /etc/ifcfg-ethXX, it fails and find_activate_slaves is aborted there. (This may be a particular behaviour of the busybox shell.)

See the attached (id=315851) for the patch fixing the typos and this problem. It worked in my environment.
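
The abort on a missing file is consistent with POSIX shell rules, where "." is a special built-in and a non-interactive shell aborts if the file cannot be read; whether that is exactly what happens in the busybox shell here is not confirmed in this report. A minimal sketch of the guard, using a hypothetical helper named master_of and throwaway files rather than the real find_activate_slaves:

```shell
#!/bin/sh
# Hypothetical helper: print the MASTER value of interface $2 using the ifcfg
# files under directory $1. Returns non-zero when no config file exists.
master_of() {
    cfg_file=$1/ifcfg-$2
    # Without this guard, "." on a missing file can terminate a POSIX shell.
    [ -f "$cfg_file" ] || return 1
    # Source in a subshell so variables from one file do not leak into the next.
    ( MASTER=; . "$cfg_file"; echo "$MASTER" )
}

# Demo: eth0 has no config file (as in the crash kernel environment), eth1 does.
demo_dir=$(mktemp -d)
printf 'DEVICE=eth1\nSLAVE=yes\nMASTER=bond0\n' > "$demo_dir/ifcfg-eth1"
for dev in eth0 eth1; do
    if m=$(master_of "$demo_dir" "$dev"); then
        echo "$dev: master is ${m:-none}"
    else
        echo "$dev: no config file, skipped"
    fi
done
rm -rf "$demo_dir"
```

The loop keeps going after eth0 instead of stopping at the first interface without a config file.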

Comment 10 Neil Horman 2008-09-05 11:00:45 UTC
Like I said, I hadn't tested it.  What is the result of your testing after fixing the typos?

As for your patch, as I've noted, I still don't like it for the reasons I gave previously. However, given that you seem adamant about it, I'll consider it if you can test it and show that it works in other cases as well. Specifically, if it works in the trivial case (where there is only one bonded interface, bond0, in the entire system running under a normal kernel), and in the case where there is only one bonded interface without the normal name (say bondtest), then I'll look into taking it.

Comment 11 Etsuji Nakai 2008-09-05 11:31:36 UTC
No, no. I'm not sticking to the original patch (id=314866), let's throw it away.

Please look at the contents of my last patch (id=315851).

https://bugzilla.redhat.com/attachment.cgi?id=315851

It's just a slight modification of your patch (id=315787) in the following ways.

- Fixing typos.
- Adding a fix for the other problem below (separate from the one fixed by your patch).
------------------
find_activate_slaves finds device names ethXX from 'ifconfig' output and tries
to read the corresponding /etc/ifcfg-ethXX, but the corresponding file doesn't
always exist. Hence, when find_activate_slaves tries to read a non-existent
/etc/ifcfg-ethXX, it fails and find_activate_slaves is aborted there. (This may
be a particular behaviour of the busybox shell.)
------------------

And my testing result is as follows.

Your patch (id=315787) didn't work because of the typos and the other problem.
The modified version of your patch (id=315851) worked well.

Does it make sense? Thanks.

Comment 12 Neil Horman 2008-09-05 14:44:36 UTC
Ahh, sorry, I missed that attachment.  Yes, what you have there makes sense to me, I'll check that in as soon as I can.

Thanks!

Comment 17 errata-xmlrpc 2009-01-20 20:59:52 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0105.html