Bug 91399
Summary: | bonding+dhcp does not work
---|---
Product: | [Retired] Red Hat Linux
Component: | initscripts
Version: | 7.3
Status: | CLOSED ERRATA
Severity: | high
Priority: | medium
Hardware: | All
OS: | Linux
Reporter: | Nicolas Mailhot <nicolas.mailhot>
Assignee: | Bill Nottingham <notting>
QA Contact: | Brock Organ <borgan>
CC: | rvokal
Doc Type: | Bug Fix
Last Closed: | 2003-12-19 19:13:50 UTC

Description
Nicolas Mailhot
2003-05-22 11:16:18 UTC
Yes, this is broken. Basically, the way the devices are brought up is:
- bring up master
- bring up slaves
This is all done serially. Obviously, with DHCP, this Will Not Work.

Created attachment 91902 [details]
patch that brings up bonding devices last
This patch moves the bringing up of bonding devices to the end. Does this work
for you?

Ok, since the patch didn't apply on RH 7.3 network, I took the one from Rawhide, patched it (with fuzz), and tried it on the test bonded setup. Unfortunately it didn't work and also killed bonding with static IP info.

Created attachment 91912 [details]
First patch test results
Syslog lang was set to en_US, iptables stopped, bonding and e100 modules rmmoded before each test.

Created attachment 91913 [details]
First patch test results
Sorry for the bogus first version.

Hm, ok, I was testing this on RHL 9+ (and the patch was against the 7.3 initscripts branch, so it should have applied, odd.) I'll dig up a 7.3 or AS2.1 machine to try and test.

[root@kitu root]# rpm -q initscripts
initscripts-6.67-1
[root@kitu root]# rpm -V initscripts
..5....T c /etc/inittab
.......T c /etc/rc.d/init.d/network
I can install RH9 or even rawhide initscripts rpm on the test server if you feel that's safe. Setting out a test machine is fairly easy - one only needs a dhcp environment and two nics.

Hm, yes, I can reproduce the failures on RH7.3; it's not obvious why it's failing though, especially when using the errata kernel.

Ok so I've got a test box and checked this a bit more
- bonding+dhcp also fails on RH ES
- on RH ES static bonding works but one must load the bonding module manually since the scripts won't (ie I could not figure how to get a RH ES box auto start without doing scripts surgery)
- RH 9 is massively broken. I couldn't get any bonding setup to work.
Maybe this should be escalated a bit - while RH 7.3 may be EOL soon both ES and 9 are supposed to be there longer. The ES part is the worst - it doesn't even have the level of support of 7.3

The bits in ES are older than in RH 7.3, so it's not really that surprising. I haven't had a chance to play with this more recently, though.

One can work around this problem by changing the network init script in /etc/init.d as below. This was done on RHAS 2.1 where I am seeing the same problem. It basically reversed the initialisation of the interfaces and is a crude workaround that should apply to most other versions.
--- network.orig Thu Aug 14 09:18:47 2003
+++ network Thu Aug 14 09:05:35 2003
@@ -44,7 +44,7 @@
# find all the interfaces besides loopback.
# ignore aliases, alternative configurations, and editor backup files
-interfaces=`ls ifcfg* | LANG=C egrep -v '(ifcfg-lo|:|rpmsave|rpmorig|rpmnew)' | \
+interfaces=`ls ifcfg* | sort -r | LANG=C egrep -v '(ifcfg-lo|:|rpmsave|rpmorig|rpmnew)' | \
LANG=C egrep -v '(~|\.bak)$' | \
LANG=C egrep -v 'ifcfg-cipcb[0-9]+$' | \
LANG=C egrep 'ifcfg-[a-z0-9]+$' | \

That's what the previous patches posted already do.. they move bonding devices to the end.

Not exactly. I'm sorry, I should have been more clear to mention that this patch
is for RedHat Advanced Server 2.1 specifically. (initscripts-6.47-1, may not be
latest, sorry)
I am not sure what version of RedHat your patch is specifically for, but at
least on RedHat Advanced Server 2.1, the bonding interface bond0 will be added
to 'interfaces' in the interfaces= line you can see in my previous comment, and
will be brought up by the " for i in $interfaces; do" line that (again, on AS2.1
at least) is executed before the bonding interface addition you made behind the
cipe devices.
I prefer your approach of course, but what is needed on AS2.1 is removing the
bond* devices from the interfaces. May I suggest the following patch for AS2.1,
not that I like the way the devices are enumerated, but I have kept the style :
--- network.orig Thu Aug 14 09:18:47 2003
+++ network Tue Aug 19 04:23:38 2003
@@ -47,7 +47,8 @@
interfaces=`ls ifcfg* | LANG=C egrep -v '(ifcfg-lo|:|rpmsave|rpmorig|rpmnew)' | \
LANG=C egrep -v '(~|\.bak)$' | \
LANG=C egrep -v 'ifcfg-cipcb[0-9]+$' | \
- LANG=C egrep 'ifcfg-[a-z0-9]+$' | \
+ LANG=C egrep -v 'ifcfg-bond[0-9]+$' | \
+ LANG=C egrep 'ifcfg-[a-z0-9]+$' | \
sed 's/^ifcfg-//g'`
# See how we were called.
@@ -107,6 +108,7 @@
# add cipe here.
cipeinterfaces=`ls ifcfg* | LANG=C egrep -v '(ifcfg-lo|:|rpmsave|rpmorig|rpmnew)' | \
LANG=C egrep -v '(~|\.bak)$' | \
+ LANG=C egrep -v 'ifcfg-bond[0-9]+$' | \
LANG=C egrep 'ifcfg-cipcb[0-9]+$' | \
sed 's/^ifcfg-//g'`
for i in $cipeinterfaces ; do
@@ -116,6 +118,34 @@
{
confirm $i
case $? in
+ 0)
+ :
+ ;;
+ 2)
+ CONFIRM=
+ ;;
+ *)
+ continue
+ ;;
+ esac
+ }
+ action $"Bringing up interface $i: " ./ifup $i boot
+ fi
+ done
+
+ # Bring up bonding interfaces
+ bondinterfaces=`ls ifcfg* | LANG=C egrep -v '(ifcfg-lo|:|rpmsave|rpmorig|rpmnew)' | \
+ LANG=C egrep -v '(~|\.bak)$' | \
+ LANG=C egrep -v 'ifcfg-cipcb[0-9]+$' | \
+ LANG=C egrep 'ifcfg-bond[0-9]+$' | \
+ sed 's/^ifcfg-//g'`
+ for i in $bondinterfaces ; do
+ if ! LANG=C egrep -L "^ONBOOT=['\"]?[Nn][Oo]['\"]?" ifcfg-$i >/dev/null 2>&1 ; then
+ # If we're in confirmation mode, get user confirmation
+ [ -n "$CONFIRM" ] &&
+ {
+ confirm $i
+ case $? in
0)
:
;;
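
The enumeration that the AS2.1 patch above changes can be tried out in isolation. The following is a minimal sketch, not the shipped script: `list_ifaces` is a hypothetical helper name, it reads ifcfg file names from stdin instead of running `ls ifcfg*`, and it uses `grep -E` in place of `egrep`. The filter patterns themselves are the ones quoted in the patch.

```shell
#!/bin/sh
# Hypothetical helper (not part of initscripts): classify ifcfg-* file
# names read from stdin, using the same filters as the patch above.
#   list_ifaces bond   -> only bondN devices
#   list_ifaces plain  -> everything except lo, aliases, backup files,
#                         cipcbN and bondN devices
list_ifaces() {
    kind=$1
    LANG=C grep -E -v '(ifcfg-lo|:|rpmsave|rpmorig|rpmnew)' |
    LANG=C grep -E -v '(~|\.bak)$' |
    LANG=C grep -E -v 'ifcfg-cipcb[0-9]+$' |
    if [ "$kind" = bond ]; then
        # keep only the bonding masters
        LANG=C grep -E 'ifcfg-bond[0-9]+$'
    else
        # drop the bonding masters, keep ordinary devices
        LANG=C grep -E -v 'ifcfg-bond[0-9]+$' |
        LANG=C grep -E 'ifcfg-[a-z0-9]+$'
    fi |
    sed 's/^ifcfg-//'
}
```

Feeding it a listing such as `ls /etc/sysconfig/network-scripts | list_ifaces bond` should show which devices would be deferred to the separate bonding loop.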

The patch to move bonding devices to the end is going to be in 7.31-1; I'll do some more testing with that. I may also do some testing on 7.3 / ES 2.1 if needed.

Never mind; moving bonding to last is the *wrong* answer. (The bonding device needs brought up first). dhcp will have to be hacked in some other way.

Created attachment 94260 [details]
patch for bonding devices
The attached is a backport to the 7.2 initscripts branch of the fixes that make
it work for me in testing on the Taroon beta. I haven't installed a
7.2/7.3/AS2.1 box to try this there; I've been told that the bonding driver in
those releases is older and somewhat more buggy.
What's done is that we initialize the bonding device and bring its link up, and
then we bring up the slaves, and then finish bringing up the bonding device.
Since the patch is a backport to the 7.2 initscripts *branch* of CVS, it might
need to be massaged slightly depending on what package you're on.
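
The ordering described in this comment can be sketched as a dry run. This is only an illustration and not the attached patch: `bond_bringup_plan` is a hypothetical name, and the choice of `ifconfig`, `ifenslave` and `dhcpcd` as the underlying commands is an assumption (the patch itself may do these steps differently). The function only prints the commands and does not touch the network.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the bring-up order described above.
# Usage: bond_bringup_plan MASTER SLAVE...
# Nothing is executed; the commands are echoed for inspection only.
bond_bringup_plan() {
    master=$1
    shift
    # 1. initialize the bonding device and bring its link up
    echo "ifconfig $master up"
    # 2. bring up (enslave) each slave
    for slave in "$@"; do
        echo "ifenslave $master $slave"
    done
    # 3. finish bringing up the bonding device; with BOOTPROTO=dhcp this
    #    is where the DHCP client would run, after slaves are attached
    echo "dhcpcd $master"
}
```

Calling `bond_bringup_plan bond0 eth0 eth1` prints one command per line, with the DHCP step last.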
Created attachment 94261 [details]
ethtool source rpm
You'll need this version of ethtool for the 'ethtool -i' check for a bonding
device to work.
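
For readers unfamiliar with the check mentioned here: `ethtool -i DEVICE` prints a `driver:` line, and a bonding master is expected to report `driver: bonding`. A minimal sketch of such a test follows; `is_bonding_driver` is a hypothetical helper name, and how the patched `ifup` actually performs the check is not shown in this report.

```shell
#!/bin/sh
# Hypothetical helper: succeed if `ethtool -i`-style output on stdin
# names the bonding driver.
# Intended use (not run here):
#   ethtool -i bond0 2>/dev/null | is_bonding_driver && echo "bond0 is a bond"
is_bonding_driver() {
    LANG=C grep -E -q '^driver:[[:space:]]*bonding$'
}
```

With an ethtool that lacks support for the bonding driver, the command fails with "Cannot get driver information", so the pipeline above would simply report that the device is not a bond.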

Ok, I'll try it on monday. Checking on 7.3 should be easy. On 2.1 ES this might require hunting down another network card. Thanks.

The patch seems to apply to the 7.3 initscripts versions with fuzz. I'll try it as soon as I can get the 7.3 test server offline

Well naïve patching of the initscript version shipped with 7.3 fails (this may be due to problems in the 7.3 bonding module, I don't know). Anyway this will be a problem for our future deployments mainly, which means RH AS/ES. Will AS 3 work with Oracle & java out of the box ? If so I'll set up a taroon test server and try on it. Working on a soon-to-be-unsupported platform does not make much sense.

Can you attach your patched 7.3 /etc/init.d/network and ifup? That will clarify any patch errors. What sort of problems do you have with the patched version?

Created attachment 94436 [details]
ifup from patched initscripts 6.67
Created attachment 94437 [details]
same for network

Sep 11 12:58:02 localhost network: Setting network parameters: succeeded
Sep 11 12:58:02 localhost ifup: Cannot get driver information: Operation not supported
Sep 11 12:58:02 localhost network: Bringing up loopback interface: succeeded
Sep 11 12:58:02 localhost kernel: bonding.c:v2.4.20-20030320 (March 20, 2003)
Sep 11 12:58:02 localhost kernel: bond0 registered with MII link monitoring set to 100 ms, in fault-tolerance (active-backup) mode.
Sep 11 12:58:02 localhost kernel: bond0 registered without ARP monitoring
Sep 11 12:58:02 localhost ifup: Cannot get driver information: No such device
Sep 11 12:58:02 localhost ifup: Determining IP information for bond0...
Sep 11 12:58:02 localhost dhcpcd[26902]: dhcpStart: retrying MAC address request (returned 00:00:00:00:00:00)
Sep 11 12:58:02 localhost dhcpcd[26902]: dhcpStart: retrying MAC address request (returned 00:00:00:00:00:00)
Sep 11 12:59:02 localhost dhcpcd[26902]: timed out waiting for a valid DHCP server response
Sep 11 12:59:02 localhost kernel: bond0: released all slaves
Sep 11 12:59:02 localhost ifup: failed.
Sep 11 12:59:02 localhost network: Bringing up interface bond0: failed
Sep 11 12:59:23 localhost /etc/hotplug/net.agent: NET unregister event not supported
Of course to get the mac to use the scripts should check the first slave device, which is not done since there is no e100 initialisation message at all. Should I try to work around this by putting stuff in modules.conf ?

The same setup works with static addressing and bonding with the original 7.3 scripts :
Sep 12 10:39:38 localhost kernel: bonding.c:v2.4.20-20030320 (March 20, 2003)
Sep 12 10:39:38 localhost kernel: bond0 registered with MII link monitoring set to 100 ms, in fault-tolerance (active-backup) mode.
Sep 12 10:39:38 localhost kernel: bond0 registered without ARP monitoring
Sep 12 10:39:38 localhost kernel: Intel(R) PRO/100 Network Driver - version 2.2.21-k1
Sep 12 10:39:38 localhost kernel: Copyright (c) 2003 Intel Corporation
Sep 12 10:39:38 localhost kernel:
Sep 12 10:39:38 localhost kernel: e100: eth0: Intel(R) PRO/100 Network Connection
Sep 12 10:39:38 localhost kernel: Hardware receive checksums enabled
Sep 12 10:39:38 localhost kernel: cpu cycle saver enabled
Sep 12 10:39:38 localhost kernel:
Sep 12 10:39:38 localhost kernel: e100: eth1: Intel(R) PRO/100 Network Connection
Sep 12 10:39:38 localhost kernel: Hardware receive checksums enabled
Sep 12 10:39:38 localhost kernel: cpu cycle saver enabled
Sep 12 10:39:38 localhost kernel:
Sep 12 10:39:38 localhost kernel: bond0: enslaving eth0 as a backup interface with a down link.
Sep 12 10:39:38 localhost kernel: bond0: enslaving eth1 as a backup interface with a down link.
Sep 12 10:39:38 localhost kernel: e100: eth0 NIC Link is Up 100 Mbps Full duplex
Sep 12 10:39:38 localhost kernel: bond0: link status up for interface eth0, enabling it in 1000 ms.
Sep 12 10:39:38 localhost kernel: bond0: making interface eth0 the new active one 100 ms earlier.
Sep 12 10:39:38 localhost kernel: e100: eth1 NIC Link is Up 100 Mbps Full duplex
Sep 12 10:39:38 localhost kernel: bond0: link status up for interface eth1, enabling it in 1000 ms.
Sep 12 10:39:38 localhost kernel: bond0: link status definitely up for interface eth1.
[ The QA people here growl when I take out their test server to test system stuff, but they'll let me do it as long as I can tell I'm working with you ]

Do you have 'TYPE=Bonding' in your bonding configuration? If not, does that help?

My ifcfg-bond0 is :
DEVICE=bond0
ONBOOT=yes
USERCTL=no
BOOTPROTO=dhcp
Should I add something ?
The static ifcfg-bond0 (that works on 7.3) is :
DEVICE=bond0
ONBOOT=yes
USERCTL=no
BOOTPROTO='none'
IPADDR='192.168.1.16'
NETMASK='255.255.255.0'
NETWORK='192.168.1.0'
BROADCAST='192.168.1.255'

Add 'TYPE=Bonding' and then try again. I suppose in the backport I can do checks for device name (i.e., bondXXX), but at least in RHEL3 and the current Severn beta, that's not valid as you can rename ethernet devices whatever you want.

Ok, I will try to wrest the control of the server from QA's greedy hands. No need to backport any checking - I used this because that's how it worked before, adding TYPE information is quite ok. Should I also put some TYPE in the ethernet config files ?

No, as long as they have MASTER & SLAVE set, things should work.

Ok, will test it on monday if I can

Well with Type in it must do something because that's a sure way to oops the server (using latest RH 7.3 kernel). Curiously it oopses both with and without dhcp, while the shipped initscripts do work with static addressing at least. Are you interested in the oops ? (And if so how can I grab it - on homemade kernels I'd just use a fb console with lots of lines but with a distribution kernel I'm a bit at a loss)

Sure, I'm interested. You could try grabbing it with a serial console.

An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHBA-2003-341.html
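
For reference, the configuration discussed in this report (TYPE=Bonding on the master, MASTER and SLAVE on each Ethernet device) might look like the fragment below. This is a sketch under assumptions: the slave file is not quoted anywhere in the report, so its exact contents (including `SLAVE=yes` and `BOOTPROTO=none`) are illustrative, and the reporter saw an oops on the 7.3 kernel with TYPE set, so it should be treated as untested there.

```shell
# /etc/sysconfig/network-scripts/ifcfg-bond0 (bonding master, DHCP)
DEVICE=bond0
TYPE=Bonding
ONBOOT=yes
USERCTL=no
BOOTPROTO=dhcp

# /etc/sysconfig/network-scripts/ifcfg-eth0 (one slave; assumed contents,
# repeat with DEVICE=eth1 for the second NIC)
DEVICE=eth0
ONBOOT=yes
USERCTL=no
BOOTPROTO=none
MASTER=bond0
SLAVE=yes
```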