Bug 91399

Summary: bonding+dhcp does not work
Product: [Retired] Red Hat Linux
Reporter: Nicolas Mailhot <nicolas.mailhot>
Component: initscripts
Assignee: Bill Nottingham <notting>
Status: CLOSED ERRATA
QA Contact: Brock Organ <borgan>
Severity: high
Priority: medium
Version: 7.3
CC: rvokal
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2003-12-19 19:13:50 UTC
Attachments:
patch that brings up bonding devices last (flags: none)
First patch test results (flags: none)
First patch test results (flags: none)
patch for bonding devices (flags: none)
ethtool source rpm (flags: none)
ifup from patched initscripts 6.67 (flags: none)
same for network (flags: none)

Description Nicolas Mailhot 2003-05-22 11:16:18 UTC
Description of problem:

I've just installed a test RH 7.3 server in a big site (if the test goes well it
will probably be upgraded to RH AS & duplicated)

The site we're on is the infrastructure node for France. Once validated, server
setups are duplicated in daughter sites all around the world. Server setup is
currently standardised on Windows 2000 + DHCP & dynamic DNS (i.e. each center has
a single static IP for the DNS server, with the same numbering rules for all
sub-networks all around the world; all other servers and stations have dynamic
IPs assigned via DHCP and are rebooted every night).

All servers have full high-availability features, including duplicated gigabit
cards.

So, to fit in well, the Linux server would have to use DHCP on a bonding interface.

The problem is that DHCP works on the separate cards and bonding works with static
network info, but bonding + DHCP fails.

We've ended up using a bonding + static info setup, but this is clearly less than
optimal for the client and will weigh against us during the final pilot phase
evaluation (i.e. we may be asked to use Windows as the OS platform instead, which
we can do but would like to avoid).

I don't know if this problem is fixed in AS - if it is it's probably ok for us
to tell the client we'll switch to dhcp+bonding setup at the end of the pilot
phase via the system upgrade.

Version-Release number of selected component (if applicable):
initscripts-6.67-1

How reproducible:

always

Steps to Reproduce:
1. get a system with two eth cards
2. setup bonding with static ip info, eg
(in /etc/sysconfig/network-scripts/ifcfg-bond0)

DEVICE=bond0
ONBOOT=yes
USERCTL=no
 
BOOTPROTO='none'
IPADDR='192.168.1.16'
NETMASK='255.255.255.0'
NETWORK='192.168.1.0'
BROADCAST='192.168.1.255'

3. switch to dhcp :

DEVICE=bond0
ONBOOT=yes
USERCTL=no
BOOTPROTO=dhcp
    
Actual results:

May 22 12:25:25 kitu ifup: Determining IP information for bond0...
May 22 12:25:25 kitu kernel: bonding.c:v2.4.20-20030320 (March 20, 2003)
May 22 12:25:25 kitu kernel: bond0 registered with MII link monitoring set to 100 ms, in fault-tolerance (active-backup) mode.
May 22 12:25:25 kitu dhcpcd[30339]: dhcpStart: retrying MAC address request (returned 00:00:00:00:00:00)
May 22 12:25:25 kitu kernel: bond0 registered without ARP monitoring
May 22 12:25:25 kitu dhcpcd[30339]: dhcpStart: retrying MAC address request (returned 00:00:00:00:00:00)

Additional info:

It seems the scripts try DHCP before bond0 has got a MAC from the first eth
slave. Setting HWADDR in ifcfg-bond0 does not seem to work either.

I've reproduced it here with two e100 cards + RH 7.3 dhcp server (exact same
error messages)
The field setup is tg3 + e1000 & windows 2000 (or 2003) server.

The client does not stint on hardware & uses windows best practices
(dhcp,teaming,night reboot) so I'd be really surprised if we were the first (or
last) to hit this.

(btw dhcp is also supposed to give the current ntp server ipinfo to the client)
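
The symptom in the log above is dhcpcd being started while bond0 still reports an all-zero MAC. A minimal sketch of a guard against that, assuming a hypothetical helper name (`has_usable_mac` is not part of initscripts):

```sh
#!/bin/sh
# Hypothetical helper, not part of initscripts: succeeds only if the
# given MAC address is non-empty and not all zeros.
has_usable_mac() {
    case "$1" in
        ""|00:00:00:00:00:00) return 1 ;;
        *) return 0 ;;
    esac
}

# Example: the value dhcpcd reported for bond0 in the log above.
if has_usable_mac "00:00:00:00:00:00"; then
    echo "bond0 has a MAC, DHCP can start"
else
    echo "bond0 has no MAC yet, a slave must be attached first"
fi
```

A bond normally takes its MAC from the first enslaved device, so a check like this would only pass once a slave has been attached.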

Comment 1 Bill Nottingham 2003-05-22 16:59:27 UTC
Yes, this is broken.

Basically, the way the devices are brought up is:

- bring up master
- bring up slaves

This is all done serially.

Obviously, with DHCP, this Will Not Work.


Comment 2 Bill Nottingham 2003-05-22 18:07:36 UTC
Created attachment 91902 [details]
patch that brings up bonding devices last

This patch moves the bringing up of bonding devices to the end. Does this work
for you?

Comment 3 Nicolas Mailhot 2003-05-23 08:08:06 UTC
Ok, since the patch didn't apply on RH 7.3 network, I took the one from Rawhide,
patched it (with fuzz), and tried it on the test bonded setup. Unfortunately it
didn't work and also killed bonding with static IP info.

Comment 4 Nicolas Mailhot 2003-05-23 08:10:21 UTC
Created attachment 91912 [details]
First patch test results

Comment 5 Nicolas Mailhot 2003-05-23 08:11:42 UTC
Syslog lang was set to en_US, iptables stopped, bonding and e100 modules rmmoded
before each test.

Comment 6 Nicolas Mailhot 2003-05-23 08:14:50 UTC
Created attachment 91913 [details]
First patch test results

Sorry for the bogus first version.

Comment 7 Bill Nottingham 2003-05-23 13:56:05 UTC
Hm, ok, I was testing this on RHL 9+ (and the patch was against the 7.3
initscripts branch, so it should have applied, odd.) I'll dig up a 7.3 or AS2.1
machine to try and test.

Comment 8 Nicolas Mailhot 2003-05-23 14:23:22 UTC
[root@kitu root]# rpm -q initscripts
initscripts-6.67-1
[root@kitu root]# rpm -V initscripts
..5....T c /etc/inittab
.......T c /etc/rc.d/init.d/network

I can install the RH9 or even Rawhide initscripts rpm on the test server if you
feel that's safe. Setting up a test machine is fairly easy: one only needs a DHCP
environment and two NICs.

Comment 9 Bill Nottingham 2003-05-24 02:02:06 UTC
Hm, yes, I can reproduce the failures on RH7.3; it's not obvious why it's
failing though, especially when using the errata kernel.

Comment 10 Nicolas Mailhot 2003-07-15 10:26:42 UTC
Ok so I've got a test box and checked this a bit more
- bonding+dhcp also fails on RH ES
- on RH ES static bonding works, but one must load the bonding module manually
since the scripts won't (i.e. I could not figure out how to get an RH ES box to
start bonding automatically without doing scripts surgery)
- RH 9 is massively broken. I couldn't get any bonding setup to work.

Comment 11 Nicolas Mailhot 2003-07-15 10:28:21 UTC
Maybe this should be escalated a bit - while RH 7.3 may be EOL soon both ES and
9 are supposed to be there longer.

The ES part is the worst - it doesn't even have the level of support of 7.3

Comment 12 Bill Nottingham 2003-07-15 14:52:48 UTC
The bits in ES are older than in RH 7.3, so it's not really that surprising. I
haven't had a chance to play with this more recently, though.

Comment 13 Roel Teuwen 2003-08-14 14:24:51 UTC
One can work around this problem by changing the network init script in
/etc/init.d as below. This was done on RHAS 2.1, where I am seeing the same
problem. It basically reverses the order in which the interfaces are initialised;
it is a crude workaround that should apply to most other versions.

--- network.orig        Thu Aug 14 09:18:47 2003
+++ network     Thu Aug 14 09:05:35 2003
@@ -44,7 +44,7 @@

 # find all the interfaces besides loopback.
 # ignore aliases, alternative configurations, and editor backup files
-interfaces=`ls ifcfg* | LANG=C egrep -v '(ifcfg-lo|:|rpmsave|rpmorig|rpmnew)' | \
+interfaces=`ls ifcfg* | sort -r | LANG=C egrep -v '(ifcfg-lo|:|rpmsave|rpmorig|rpmnew)' | \
            LANG=C egrep -v '(~|\.bak)$' | \
            LANG=C egrep -v 'ifcfg-cipcb[0-9]+$' | \
             LANG=C egrep 'ifcfg-[a-z0-9]+$' | \
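
To see what the added `sort -r` does to the bring-up order, here is a small sketch. The file list is a made-up stand-in for `ls ifcfg*`, `grep -E` is used in place of `egrep`, and `LC_ALL=C` is added to `sort` only to make the result reproducible:

```sh
#!/bin/sh
# Stand-in for `ls ifcfg*` on a box with one bond and two slaves
# (hypothetical file list, for illustration only).
list_ifcfg() {
    printf '%s\n' ifcfg-bond0 ifcfg-eth0 ifcfg-eth1 ifcfg-lo
}

# Same filters as the patched line, applied to the reversed listing.
ordered_interfaces() {
    list_ifcfg | LC_ALL=C sort -r |
        LANG=C grep -E -v '(ifcfg-lo|:|rpmsave|rpmorig|rpmnew)' |
        LANG=C grep -E -v '(~|\.bak)$' |
        LANG=C grep -E -v 'ifcfg-cipcb[0-9]+$' |
        LANG=C grep -E 'ifcfg-[a-z0-9]+$' |
        sed 's/^ifcfg-//g'
}

ordered_interfaces
```

With this list the order becomes eth1, eth0, bond0. It only works because "bond" happens to sort before "eth"; other device names could end up on the wrong side of the bond.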

Comment 14 Bill Nottingham 2003-08-14 16:16:33 UTC
That's what the previous patches posted already do.. they move bonding devices
to the end.

Comment 15 Roel Teuwen 2003-08-19 09:27:19 UTC
Not exactly. I'm sorry, I should have made it clear that this patch
is for Red Hat Advanced Server 2.1 specifically (initscripts-6.47-1, which may
not be the latest, sorry).

I am not sure what version of Red Hat your patch is for, but at
least on Red Hat Advanced Server 2.1, the bonding interface bond0 will be added
to 'interfaces' by the interfaces= line you can see in my previous comment, and
will be brought up by the "for i in $interfaces; do" loop, which (again, on AS2.1
at least) is executed before the bonding interface addition you made after the
cipe devices.

I prefer your approach of course, but what is needed on AS2.1 is removing the
bond* devices from 'interfaces'. May I suggest the following patch for AS2.1?
I don't much like the way the devices are enumerated, but I have kept the style:

--- network.orig        Thu Aug 14 09:18:47 2003
+++ network     Tue Aug 19 04:23:38 2003
@@ -47,7 +47,8 @@
 interfaces=`ls ifcfg* | LANG=C egrep -v '(ifcfg-lo|:|rpmsave|rpmorig|rpmnew)' | \
            LANG=C egrep -v '(~|\.bak)$' | \
            LANG=C egrep -v 'ifcfg-cipcb[0-9]+$' | \
-            LANG=C egrep 'ifcfg-[a-z0-9]+$' | \
+           LANG=C egrep -v 'ifcfg-bond[0-9]+$' | \
+            LANG=C egrep 'ifcfg-[a-z0-9]+$' | \
             sed 's/^ifcfg-//g'`

 # See how we were called.
@@ -107,6 +108,7 @@
        # add cipe here.
        cipeinterfaces=`ls ifcfg* | LANG=C egrep -v '(ifcfg-lo|:|rpmsave|rpmorig|rpmnew)' | \
            LANG=C egrep -v '(~|\.bak)$' | \
+           LANG=C egrep -v 'ifcfg-bond[0-9]+$' | \
            LANG=C egrep 'ifcfg-cipcb[0-9]+$' | \
             sed 's/^ifcfg-//g'`
        for i in $cipeinterfaces ; do
@@ -116,6 +118,34 @@
                {
                        confirm $i
                        case $? in
+                               0)
+                                   :
+                               ;;
+                               2)
+                                   CONFIRM=
+                               ;;
+                               *)
+                                   continue
+                               ;;
+                       esac
+               }
+               action $"Bringing up interface $i: " ./ifup $i boot
+           fi
+        done
+
+       # Bring up bonding interfaces
+       bondinterfaces=`ls ifcfg* | LANG=C egrep -v '(ifcfg-lo|:|rpmsave|rpmorig|rpmnew)' | \
+            LANG=C egrep -v '(~|\.bak)$' | \
+           LANG=C egrep -v 'ifcfg-cipcb[0-9]+$' | \
+            LANG=C egrep 'ifcfg-bond[0-9]+$' | \
+            sed 's/^ifcfg-//g'`
+       for i in $bondinterfaces ; do
+          if ! LANG=C egrep -L "^ONBOOT=['\"]?[Nn][Oo]['\"]?" ifcfg-$i >/dev/null 2>&1 ; then
+               # If we're in confirmation mode, get user confirmation
+               [ -n "$CONFIRM" ]  &&
+               {
+                       confirm $i
+                       case $? in
                                0)
                                    :
                                ;;
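
The list splitting that this patch performs can be tried in isolation. This is only a sketch of the filtering idea: the file list is made up, the cipe filter is left out for brevity, and `grep -E` replaces `egrep`:

```sh
#!/bin/sh
# Hypothetical stand-in for `ls ifcfg*`.
list_ifcfg() {
    printf '%s\n' ifcfg-bond0 ifcfg-eth0 ifcfg-eth0:1 ifcfg-eth1 \
        ifcfg-eth1.bak ifcfg-lo
}

# Filters shared by both lists: drop loopback, aliases, rpm leftovers
# and editor backups, then keep only plain ifcfg-<name> files.
candidates() {
    list_ifcfg |
        LANG=C grep -E -v '(ifcfg-lo|:|rpmsave|rpmorig|rpmnew)' |
        LANG=C grep -E -v '(~|\.bak)$' |
        LANG=C grep -E 'ifcfg-[a-z0-9]+$'
}

# Everything except bonding devices (first loop in the patch).
plain_interfaces() {
    candidates | LANG=C grep -E -v 'ifcfg-bond[0-9]+$' | sed 's/^ifcfg-//'
}

# Bonding devices only (the loop the patch adds at the end).
bond_interfaces() {
    candidates | LANG=C grep -E 'ifcfg-bond[0-9]+$' | sed 's/^ifcfg-//'
}

echo "first pass:  $(plain_interfaces | tr '\n' ' ')"
echo "second pass: $(bond_interfaces | tr '\n' ' ')"
```

Matching on the name `bond[0-9]+` assumes bonding devices are always named that way.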

Comment 16 Bill Nottingham 2003-09-04 03:26:53 UTC
The patch to move bonding devices to the end is going to be in 7.31-1; I'll do
some more testing with that.

Comment 17 Nicolas Mailhot 2003-09-04 14:45:35 UTC
I may also do some testing on 7.3 / ES 2.1 if needed.

Comment 18 Bill Nottingham 2003-09-05 19:45:31 UTC
Never mind; moving bonding to last is the *wrong* answer. (The bonding device
needs to be brought up first.) DHCP will have to be hacked in some other way.

Comment 19 Bill Nottingham 2003-09-05 20:39:22 UTC
Created attachment 94260 [details]
patch for bonding devices

The attached is a backport to the 7.2 initscripts branch of the fixes that make
it work for me in testing on the Taroon beta. I haven't installed a
7.2/7.3/AS2.1 box to try this there; I've been told that the bonding driver in
those releases is older and somewhat more buggy.

What's done is that we initialize the bonding device and bring its link up, and
then we bring up the slaves, and then finish bringing up the bonding device.

Since the patch is a backport to the 7.2 initscripts *branch* of CVS, it might
need to be massaged slightly depending on what package you're on.
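
The three-step ordering described above can be sketched as a dry run. The commands printed here (ifconfig, ifenslave, dhcpcd) are assumptions chosen for illustration and `bond_plan` is a hypothetical name; the actual changes are in the attached patch:

```sh
#!/bin/sh
# Dry run only: prints the commands instead of running them.
# The command choices are illustrative assumptions, not the patch itself.
bond_plan() {
    bond=$1
    shift
    echo "ifconfig $bond up"            # 1. initialize the bond, link up
    for slave in "$@"; do
        echo "ifenslave $bond $slave"   # 2. bring up and attach each slave
    done
    echo "dhcpcd $bond"                 # 3. finish the bond: request a lease
}

bond_plan bond0 eth0 eth1
```

The point of the ordering is that the DHCP client only starts once the bond exists and has slaves attached, so it no longer sees an all-zero MAC.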

Comment 20 Bill Nottingham 2003-09-05 20:41:20 UTC
Created attachment 94261 [details]
ethtool source rpm

You'll need this version of ethtool for the 'ethtool -i' check for a bonding
device to work.
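
For illustration, a check of this kind could look roughly like the sketch below. It assumes the `driver:` line of `ethtool -i` reads `bonding` for a bond; `is_bonding_driver` is a hypothetical name and the sample input (including the version numbers) is made up so that the sketch runs without ethtool:

```sh
#!/bin/sh
# Hypothetical helper: reads `ethtool -i DEVICE` output on stdin and
# succeeds if the reported driver is "bonding".
is_bonding_driver() {
    grep -q '^driver: bonding$'
}

# On a real system one would pipe ethtool into it, for example:
#   ethtool -i bond0 2>/dev/null | is_bonding_driver && echo "bond0 is a bond"
# Made-up sample input so the sketch runs anywhere:
printf 'driver: bonding\nversion: 2.4.1\n' | is_bonding_driver &&
    echo "bonding device"
printf 'driver: e100\nversion: 2.2.21-k1\n' | is_bonding_driver ||
    echo "not a bonding device"
```

If ethtool cannot query the device at all, the check fails and the device is treated as a non-bond.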

Comment 21 Nicolas Mailhot 2003-09-05 20:53:36 UTC
Ok, I'll try it on monday.
Checking on 7.3 should be easy. On 2.1 ES this might require hunting down
another network card.

Thanks.

Comment 22 Nicolas Mailhot 2003-09-08 14:19:28 UTC
The patch seems to apply to the 7.3 initscripts versions with fuzz.
I'll try it as soon as I can get the 7.3 test server offline

Comment 23 Nicolas Mailhot 2003-09-11 12:26:51 UTC
Well naïve patching of the initscript version shipped with 7.3 fails (this may
be due to problems in the 7.3 bonding module, I don't know).

Anyway this will be a problem for our future deployments mainly, which means RH
AS/ES. Will AS 3 work with Oracle & java out of the box ? If so I'll set up a
taroon test server and try on it.

Working on a soon-to-be-unsupported platform does not make much sense.

Comment 24 Bill Nottingham 2003-09-11 20:02:43 UTC
Can you attach your patched 7.3 /etc/init.d/network and ifup? That will clarify
any patch errors.

What sort of problems do you have with the patched version?

Comment 25 Nicolas Mailhot 2003-09-12 10:03:52 UTC
Created attachment 94436 [details]
ifup from patched initscripts 6.67

Comment 26 Nicolas Mailhot 2003-09-12 10:05:30 UTC
Created attachment 94437 [details]
same for network

Comment 27 Nicolas Mailhot 2003-09-12 10:09:15 UTC
Sep 11 12:58:02 localhost network: Setting network parameters:  succeeded
Sep 11 12:58:02 localhost ifup: Cannot get driver information: Operation not supported
Sep 11 12:58:02 localhost network: Bringing up loopback interface:  succeeded
Sep 11 12:58:02 localhost kernel: bonding.c:v2.4.20-20030320 (March 20, 2003)
Sep 11 12:58:02 localhost kernel: bond0 registered with MII link monitoring set to 100 ms, in fault-tolerance (active-backup) mode.
Sep 11 12:58:02 localhost kernel: bond0 registered without ARP monitoring
Sep 11 12:58:02 localhost ifup: Cannot get driver information: No such device
Sep 11 12:58:02 localhost ifup: Determining IP information for bond0...
Sep 11 12:58:02 localhost dhcpcd[26902]: dhcpStart: retrying MAC address request (returned 00:00:00:00:00:00)
Sep 11 12:58:02 localhost dhcpcd[26902]: dhcpStart: retrying MAC address request (returned 00:00:00:00:00:00)
Sep 11 12:59:02 localhost dhcpcd[26902]: timed out waiting for a valid DHCP server response
Sep 11 12:59:02 localhost kernel: bond0: released all slaves
Sep 11 12:59:02 localhost ifup:  failed.
Sep 11 12:59:02 localhost network: Bringing up interface bond0:  failed
Sep 11 12:59:23 localhost /etc/hotplug/net.agent: NET unregister event not supported


Comment 28 Nicolas Mailhot 2003-09-12 10:12:46 UTC
Of course, to get the MAC to use, the scripts should check the first slave device,
which is not done, since there is no e100 initialisation message at all. Should I
try to work around this by putting stuff in modules.conf? The same setup works
with static addressing and bonding with the original 7.3 scripts:


Sep 12 10:39:38 localhost kernel: bonding.c:v2.4.20-20030320 (March 20, 2003)
Sep 12 10:39:38 localhost kernel: bond0 registered with MII link monitoring set to 100 ms, in fault-tolerance (active-backup) mode.
Sep 12 10:39:38 localhost kernel: bond0 registered without ARP monitoring
Sep 12 10:39:38 localhost kernel: Intel(R) PRO/100 Network Driver - version 2.2.21-k1
Sep 12 10:39:38 localhost kernel: Copyright (c) 2003 Intel Corporation
Sep 12 10:39:38 localhost kernel:
Sep 12 10:39:38 localhost kernel: e100: eth0: Intel(R) PRO/100 Network Connection
Sep 12 10:39:38 localhost kernel:   Hardware receive checksums enabled
Sep 12 10:39:38 localhost kernel:   cpu cycle saver enabled
Sep 12 10:39:38 localhost kernel:
Sep 12 10:39:38 localhost kernel: e100: eth1: Intel(R) PRO/100 Network Connection
Sep 12 10:39:38 localhost kernel:   Hardware receive checksums enabled
Sep 12 10:39:38 localhost kernel:   cpu cycle saver enabled
Sep 12 10:39:38 localhost kernel:
Sep 12 10:39:38 localhost kernel: bond0: enslaving eth0 as a backup interface with a down link.
Sep 12 10:39:38 localhost kernel: bond0: enslaving eth1 as a backup interface with a down link.
Sep 12 10:39:38 localhost kernel: e100: eth0 NIC Link is Up 100 Mbps Full duplex
Sep 12 10:39:38 localhost kernel: bond0: link status up for interface eth0, enabling it in 1000 ms.
Sep 12 10:39:38 localhost kernel: bond0: making interface eth0 the new active one 100 ms earlier.
Sep 12 10:39:38 localhost kernel: e100: eth1 NIC Link is Up 100 Mbps Full duplex
Sep 12 10:39:38 localhost kernel: bond0: link status up for interface eth1, enabling it in 1000 ms.
Sep 12 10:39:38 localhost kernel: bond0: link status definitely up for interface eth1.


Comment 29 Nicolas Mailhot 2003-09-12 10:15:34 UTC
[ The QA people here growl when I take out their test server to test system
stuff, but they'll let me do it as long as I can tell I'm working with you ]

Comment 30 Bill Nottingham 2003-09-12 14:56:39 UTC
Do you have 'TYPE=Bonding' in your bonding configuration? If not, does that help?

Comment 31 Nicolas Mailhot 2003-09-12 15:04:09 UTC
My ifcfg-bond0 is :

DEVICE=bond0
ONBOOT=yes
USERCTL=no
BOOTPROTO=dhcp

Should I add something ?
The static ifcfg-bond0 (that works on 7.3) is :
DEVICE=bond0
ONBOOT=yes
USERCTL=no
 
BOOTPROTO='none'
IPADDR='192.168.1.16'
NETMASK='255.255.255.0'
NETWORK='192.168.1.0'
BROADCAST='192.168.1.255'



Comment 32 Bill Nottingham 2003-09-12 15:06:50 UTC
Add 'TYPE=Bonding' and then try again.  I suppose in the backport I can do
checks for device name (i.e., bondXXX), but at least in RHEL3 and the current
Severn beta, that's not valid as you can rename ethernet devices whatever you want.

Comment 33 Nicolas Mailhot 2003-09-12 15:14:59 UTC
Ok, I will try to wrest the control of the server from QA's greedy hands.

No need to backport any checking - I used this because that's how it worked
before; adding TYPE information is quite OK.

Should I also put some TYPE in the ethernet config files ?

Comment 34 Bill Nottingham 2003-09-12 15:58:27 UTC
No, as long as they have MASTER & SLAVE set, things should work.
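
For reference, a slave file with MASTER and SLAVE set might look like the sketch below. The device names come from this report; every value other than MASTER and SLAVE is an assumption:

```sh
# /etc/sysconfig/network-scripts/ifcfg-eth0 (sketch; values other than
# MASTER and SLAVE are assumptions)
DEVICE=eth0
ONBOOT=yes
USERCTL=no
BOOTPROTO=none
MASTER=bond0
SLAVE=yes
```

The second card would get the same file with DEVICE=eth1.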

Comment 35 Nicolas Mailhot 2003-09-12 16:16:37 UTC
Ok, will test it on monday if I can

Comment 36 Nicolas Mailhot 2003-09-15 13:11:16 UTC
Well, with TYPE in, it must do something, because that's a sure way to oops the
server (using the latest RH 7.3 kernel).

Curiously, it oopses both with and without DHCP, while the shipped initscripts do
work with static addressing at least.

Are you interested in the oops ? (And if so how can I grab it - on homemade
kernels I'd just use a fb console with lots of lines but with a distribution
kernel I'm a bit at a loss)

Comment 37 Bill Nottingham 2003-09-15 16:12:26 UTC
Sure, I'm interested. You could try grabbing it with a serial console.
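
A serial console is usually enabled with kernel boot parameters along these lines. This is a sketch: the port (ttyS0) and speed are assumptions and depend on the hardware.

```
# Append to the kernel line in /boot/grub/grub.conf (or to append= in
# /etc/lilo.conf) on the test server. Port and speed are assumptions.
console=ttyS0,115200n8 console=tty0
```

A second machine connected with a null-modem cable and running a terminal program such as minicom at the same speed can then log the oops text.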

Comment 38 John Flanagan 2003-12-19 19:13:50 UTC
An errata has been issued which should help the problem described in this bug report. 
This report is therefore being closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, please follow the link below. You may reopen 
this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2003-341.html