Description of problem:

I've just installed a test RH 7.3 server at a large site (if the test goes well it will probably be upgraded to RH AS and duplicated). The site we're on is the infrastructure node for France. Once validated, server setups are duplicated in daughter sites all around the world.

Server setup is currently standardised on Windows 2000 + DHCP and dynamic DNS (i.e. each center has a single static IP for its DNS server, with the same numbering rules for all sub-networks all around the world; all other servers and stations have dynamic IPs assigned via DHCP and are rebooted every night). All servers have full high-availability features, including duplicated gigabit cards. So to fit in well, the Linux server would have to use DHCP on a bonding interface.

The problem is that DHCP works on the separate cards, and bonding works with static network info, but bonding + DHCP fails.

We've ended up using a bonding + static info setup, but this is clearly less than optimal for the client and will weigh against us during the final pilot phase evaluation (i.e. we may be asked to use Windows as the OS platform instead, which we can do but would like to avoid).

I don't know if this problem is fixed in AS. If it is, it's probably OK for us to tell the client we'll switch to a DHCP + bonding setup at the end of the pilot phase via the system upgrade.

Version-Release number of selected component (if applicable):
initscripts-6.67-1

How reproducible:
always

Steps to Reproduce:
1. Get a system with two Ethernet cards.
2. Set up bonding with static IP info, e.g. in /etc/sysconfig/network-scripts/ifcfg-bond0:

    DEVICE=bond0
    ONBOOT=yes
    USERCTL=no
    BOOTPROTO='none'
    IPADDR='192.168.1.16'
    NETMASK='255.255.255.0'
    NETWORK='192.168.1.0'
    BROADCAST='192.168.1.255'

3. Switch to DHCP:

    DEVICE=bond0
    ONBOOT=yes
    USERCTL=no
    BOOTPROTO=dhcp

Actual results:

May 22 12:25:25 kitu ifup: Determining IP information for bond0.
May 22 12:25:25 kitu kernel: bonding.c:v2.4.20-20030320 (March 20, 2003)
May 22 12:25:25 kitu kernel: bond0 registered with MII link monitoring set to 100 ms, in fault-tolerance (active-backup) mode.
May 22 12:25:25 kitu dhcpcd[30339]: dhcpStart: retrying MAC address request (returned 00:00:00:00:00:00)
May 22 12:25:25 kitu kernel: bond0 registered without ARP monitoring
May 22 12:25:25 kitu dhcpcd[30339]: dhcpStart: retrying MAC address request (returned 00:00:00:00:00:00)

Additional info:

It seems the scripts try DHCP before bond0 has obtained a MAC from the first eth slave. Setting HWADDR in ifcfg-bond0 does not seem to work either.

I've reproduced it here with two e100 cards + an RH 7.3 DHCP server (exact same error messages). The field setup is tg3 + e1000 and a Windows 2000 (or 2003) server.

The client does not stint on hardware and uses Windows best practices (DHCP, teaming, nightly reboot), so I'd be really surprised if we were the first (or last) to hit this.

(btw DHCP is also supposed to give the current NTP server IP info to the client)
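The dhcpcd message above reports an all-zero MAC on bond0. As a rough way to check for that state by hand, here is a minimal sketch; the helper names `mac_is_unset` and `hwaddr_from_ifconfig` are made up for illustration, and the parser assumes the classic net-tools `ifconfig` output with a `HWaddr` field.

```sh
#!/bin/sh
# Succeed (exit 0) when the given MAC address is all zeros, i.e. the
# device has not inherited a hardware address from a slave yet.
mac_is_unset() {
    case "$1" in
        00:00:00:00:00:00) return 0 ;;
        *) return 1 ;;
    esac
}

# Pull the hardware address out of `ifconfig <dev>` style output on stdin.
# Assumption: the output contains "HWaddr xx:xx:xx:xx:xx:xx".
hwaddr_from_ifconfig() {
    sed -n 's/.*HWaddr \([0-9A-Fa-f:]\{17\}\).*/\1/p' | head -n 1
}
```

On a live box this could be used as `mac_is_unset "$(ifconfig bond0 | hwaddr_from_ifconfig)" && echo "bond0 has no MAC yet"`.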
Yes, this is broken. Basically, the way the devices are brought up is:
 - bring up master
 - bring up slaves

This is all done serially. Obviously, with DHCP, this Will Not Work: the master asks for a lease before any slave has given it a MAC address.
Created attachment 91902
patch that brings up bonding devices last

This patch moves the bringing up of bonding devices to the end. Does this work for you?
OK, since the patch didn't apply to the RH 7.3 network script, I took the script from Rawhide, patched it (with fuzz), and tried it on the test bonded setup. Unfortunately it didn't work, and it also broke bonding with static IP info.
Created attachment 91912
First patch test results
Syslog lang was set to en_US, iptables stopped, bonding and e100 modules rmmoded before each test.
Created attachment 91913
First patch test results

Sorry for the bogus first version.
Hm, ok, I was testing this on RHL 9+ (and the patch was against the 7.3 initscripts branch, so it should have applied, odd.) I'll dig up a 7.3 or AS2.1 machine to try and test.
[root@kitu root]# rpm -q initscripts
initscripts-6.67-1
[root@kitu root]# rpm -V initscripts
..5....T c /etc/inittab
.......T c /etc/rc.d/init.d/network

I can install the RH9 or even the Rawhide initscripts rpm on the test server if you feel that's safe. Setting up a test machine is fairly easy: one only needs a DHCP environment and two NICs.
Hm, yes, I can reproduce the failures on RH7.3; it's not obvious why it's failing though, especially when using the errata kernel.
OK, so I've got a test box and checked this a bit more:
 - bonding + DHCP also fails on RH ES
 - on RH ES static bonding works, but one must load the bonding module manually since the scripts won't (i.e. I could not figure out how to get an RH ES box to auto-start it without doing script surgery)
 - RH 9 is massively broken. I couldn't get any bonding setup to work.
Maybe this should be escalated a bit - while RH 7.3 may be EOL soon both ES and 9 are supposed to be there longer. The ES part is the worst - it doesn't even have the level of support of 7.3
The bits in ES are older than in RH 7.3, so it's not really that surprising. I haven't had a chance to play with this more recently, though.
One can work around this problem by changing the network init script in /etc/init.d as below. This was done on RHAS 2.1, where I am seeing the same problem. It basically reverses the order in which the interfaces are initialised, and is a crude workaround that should apply to most other versions.

--- network.orig Thu Aug 14 09:18:47 2003
+++ network Thu Aug 14 09:05:35 2003
@@ -44,7 +44,7 @@
 # find all the interfaces besides loopback.
 # ignore aliases, alternative configurations, and editor backup files
-interfaces=`ls ifcfg* | LANG=C egrep -v '(ifcfg-lo|:|rpmsave|rpmorig|rpmnew)' | \
+interfaces=`ls ifcfg* | sort -r | LANG=C egrep -v '(ifcfg-lo|:|rpmsave|rpmorig|rpmnew)' | \
             LANG=C egrep -v '(~|\.bak)$' | \
             LANG=C egrep -v 'ifcfg-cipcb[0-9]+$' | \
             LANG=C egrep 'ifcfg-[a-z0-9]+$' | \
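To see why the `sort -r` workaround moves bond0 to the end, here is a small sketch of the ordering on a typical set of file names. The function name `reverse_interface_order` is only for illustration, and `LC_ALL=C` is used instead of the script's `LANG=C` so that the byte-wise ordering cannot be overridden by the environment.

```sh
#!/bin/sh
# Order interface config names the way the `sort -r` workaround does.
# Reads file names (one per line) on stdin and prints interface names.
reverse_interface_order() {
    LC_ALL=C sort -r | sed 's/^ifcfg-//'
}

# Prints eth1, eth0, bond0 (one per line): the bond comes last.
printf '%s\n' ifcfg-bond0 ifcfg-eth0 ifcfg-eth1 | reverse_interface_order
```

This only works because "bond" happens to sort before "eth". A slave device whose name sorts before "bond" would still be brought up after its master, which is one reason the workaround is crude.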
That's what the previously posted patches already do: they move bonding devices to the end.
Not exactly. I'm sorry, I should have been clearer and mentioned that this patch is specifically for Red Hat Advanced Server 2.1 (initscripts-6.47-1, which may not be the latest, sorry).

I am not sure which version of Red Hat your patch is for, but at least on Red Hat Advanced Server 2.1 the bonding interface bond0 is added to 'interfaces' by the interfaces= line you can see in my previous comment, and is brought up by the "for i in $interfaces; do" loop, which (again, on AS2.1 at least) is executed before the bonding interface addition you made after the cipe devices.

I prefer your approach, of course, but what is needed on AS2.1 is removing the bond* devices from the interfaces list. May I suggest the following patch for AS2.1? Not that I like the way the devices are enumerated, but I have kept the style:

--- network.orig Thu Aug 14 09:18:47 2003
+++ network Tue Aug 19 04:23:38 2003
@@ -47,7 +47,8 @@
 interfaces=`ls ifcfg* | LANG=C egrep -v '(ifcfg-lo|:|rpmsave|rpmorig|rpmnew)' | \
             LANG=C egrep -v '(~|\.bak)$' | \
             LANG=C egrep -v 'ifcfg-cipcb[0-9]+$' | \
-            LANG=C egrep 'ifcfg-[a-z0-9]+$' | \
+            LANG=C egrep -v 'ifcfg-bond[0-9]+$' | \
+            LANG=C egrep 'ifcfg-[a-z0-9]+$' | \
             sed 's/^ifcfg-//g'`
 
 # See how we were called.
@@ -107,6 +108,7 @@
         # add cipe here.
         cipeinterfaces=`ls ifcfg* | LANG=C egrep -v '(ifcfg-lo|:|rpmsave|rpmorig|rpmnew)' | \
                     LANG=C egrep -v '(~|\.bak)$' | \
+                    LANG=C egrep -v 'ifcfg-bond[0-9]+$' | \
                     LANG=C egrep 'ifcfg-cipcb[0-9]+$' | \
                     sed 's/^ifcfg-//g'`
         for i in $cipeinterfaces ; do
@@ -116,6 +118,34 @@
                 {
                     confirm $i
                     case $? in
+                        0)
+                            :
+                        ;;
+                        2)
+                            CONFIRM=
+                        ;;
+                        *)
+                            continue
+                        ;;
+                    esac
+                }
+                action $"Bringing up interface $i: " ./ifup $i boot
+            fi
+        done
+
+        # Bring up bonding interfaces
+        bondinterfaces=`ls ifcfg* | LANG=C egrep -v '(ifcfg-lo|:|rpmsave|rpmorig|rpmnew)' | \
+                    LANG=C egrep -v '(~|\.bak)$' | \
+                    LANG=C egrep -v 'ifcfg-cipcb[0-9]+$' | \
+                    LANG=C egrep 'ifcfg-bond[0-9]+$' | \
+                    sed 's/^ifcfg-//g'`
+        for i in $bondinterfaces ; do
+            if ! LANG=C egrep -L "^ONBOOT=['\"]?[Nn][Oo]['\"]?" ifcfg-$i >/dev/null 2>&1 ; then
+                # If we're in confirmation mode, get user confirmation
+                [ -n "$CONFIRM" ] &&
+                {
+                    confirm $i
+                    case $? in
                         0)
                             :
                         ;;
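The filtering idea in the patch above (keep bond* out of the ordinary interface list and collect it in a list of its own) can be tried outside the init script. The following is only a sketch: the function names are made up, file names are read from stdin instead of `ls ifcfg*`, and `grep -E` stands in for `egrep`.

```sh
#!/bin/sh
# Drop loopback, aliases, rpm leftovers and editor backups, as the
# network script does. Reads ifcfg file names on stdin, one per line.
common_filter() {
    LC_ALL=C grep -E -v '(ifcfg-lo|:|rpmsave|rpmorig|rpmnew)' |
    LC_ALL=C grep -E -v '(~|\.bak)$'
}

# Ordinary interfaces: everything except cipe and bonding devices.
plain_interfaces() {
    common_filter |
    LC_ALL=C grep -E -v 'ifcfg-cipcb[0-9]+$' |
    LC_ALL=C grep -E -v 'ifcfg-bond[0-9]+$' |
    LC_ALL=C grep -E 'ifcfg-[a-z0-9]+$' |
    sed 's/^ifcfg-//'
}

# Bonding devices only, to be brought up in a separate loop.
bond_interfaces() {
    common_filter |
    LC_ALL=C grep -E 'ifcfg-bond[0-9]+$' |
    sed 's/^ifcfg-//'
}
```

For example, `ls /etc/sysconfig/network-scripts | bond_interfaces` would list only the bonding masters. Note that this selects by device name, which assumes bonding devices are always called bondN.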
The patch to move bonding devices to the end is going to be in 7.31-1; I'll do some more testing with that.
I may also do some testing on 7.3 / ES 2.1 if needed.
Never mind; moving bonding to last is the *wrong* answer. (The bonding device needs to be brought up first.) DHCP will have to be hacked in some other way.
Created attachment 94260
patch for bonding devices

The attached is a backport to the 7.2 initscripts branch of the fixes that make it work for me in testing on the Taroon beta. I haven't installed a 7.2/7.3/AS2.1 box to try this there; I've been told that the bonding driver in those releases is older and somewhat more buggy.

What's done is that we initialize the bonding device and bring its link up, then we bring up the slaves, and then we finish bringing up the bonding device.

Since the patch is a backport to the 7.2 initscripts *branch* of CVS, it might need to be massaged slightly depending on what package you're on.
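The three-step order described above can be sketched as a dry run that only prints the commands one might run by hand. This is not the attached patch: `plan_bond_dhcp` is a made-up helper, and the use of `ifconfig`, `ifenslave` and `dhcpcd` as the hand-run equivalents is an assumption.

```sh
#!/bin/sh
# Print (do not run) the commands for bringing up a DHCP bond in the
# order: master link up, then slaves, then finish the master.
# Usage: plan_bond_dhcp MASTER SLAVE...
plan_bond_dhcp() {
    master=$1
    shift
    # 1. initialise the bonding device and bring its link up (no address yet)
    echo "ifconfig $master up"
    # 2. attach each slave; the master takes its MAC from the first one
    for slave in "$@"; do
        echo "ifenslave $master $slave"
    done
    # 3. finish the master: only now ask for a lease
    echo "dhcpcd $master"
}

plan_bond_dhcp bond0 eth0 eth1
```

The point of the ordering is that the DHCP client runs only after bond0 has a real MAC address.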
Created attachment 94261
ethtool source rpm

You'll need this version of ethtool for the 'ethtool -i' check for a bonding device to work.
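For reference, a check along these lines can be written against the `driver:` line that `ethtool -i` prints. This is only a sketch of the idea, not the code from the patch; both function names are made up.

```sh
#!/bin/sh
# Succeed when `ethtool -i` style output on stdin names the bonding driver.
driver_is_bonding() {
    LC_ALL=C grep -q '^driver:[[:space:]]*bonding[[:space:]]*$'
}

# Hypothetical wrapper: needs ethtool installed, and usually root.
# If ethtool fails (e.g. "Operation not supported"), the device is
# simply reported as not being a bonding device.
is_bonding_device() {
    ethtool -i "$1" 2>/dev/null | driver_is_bonding
}
```

Usage would be something like `is_bonding_device bond0 && echo "bond0 is a bonding master"`.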
OK, I'll try it on Monday. Checking on 7.3 should be easy. On 2.1 ES this might require hunting down another network card. Thanks.
The patch seems to apply to the 7.3 initscripts version with fuzz. I'll try it as soon as I can get the 7.3 test server offline.
Well, naïve patching of the initscripts version shipped with 7.3 fails (this may be due to problems in the 7.3 bonding module, I don't know). Anyway, this will mainly be a problem for our future deployments, which means RH AS/ES. Will AS 3 work with Oracle and Java out of the box? If so, I'll set up a Taroon test server and try on it. Working on a soon-to-be-unsupported platform does not make much sense.
Can you attach your patched 7.3 /etc/init.d/network and ifup? That will clarify any patch errors. What sort of problems do you have with the patched version?
Created attachment 94436
ifup from patched initscripts 6.67
Created attachment 94437
same for network
Sep 11 12:58:02 localhost network: Setting network parameters: succeeded
Sep 11 12:58:02 localhost ifup: Cannot get driver information: Operation not supported
Sep 11 12:58:02 localhost network: Bringing up loopback interface: succeeded
Sep 11 12:58:02 localhost kernel: bonding.c:v2.4.20-20030320 (March 20, 2003)
Sep 11 12:58:02 localhost kernel: bond0 registered with MII link monitoring set to 100 ms, in fault-tolerance (active-backup) mode.
Sep 11 12:58:02 localhost kernel: bond0 registered without ARP monitoring
Sep 11 12:58:02 localhost ifup: Cannot get driver information: No such device
Sep 11 12:58:02 localhost ifup: Determining IP information for bond0...
Sep 11 12:58:02 localhost dhcpcd[26902]: dhcpStart: retrying MAC address request (returned 00:00:00:00:00:00)
Sep 11 12:58:02 localhost dhcpcd[26902]: dhcpStart: retrying MAC address request (returned 00:00:00:00:00:00)
Sep 11 12:59:02 localhost dhcpcd[26902]: timed out waiting for a valid DHCP server response
Sep 11 12:59:02 localhost kernel: bond0: released all slaves
Sep 11 12:59:02 localhost ifup: failed.
Sep 11 12:59:02 localhost network: Bringing up interface bond0: failed
Sep 11 12:59:23 localhost /etc/hotplug/net.agent: NET unregister event not supported
Of course, to get the MAC to use, the scripts should check the first slave device, which is not done, since there is no e100 initialisation message at all. Should I try to work around this by putting stuff in modules.conf?

The same setup works with static addressing and bonding with the original 7.3 scripts:

Sep 12 10:39:38 localhost kernel: bonding.c:v2.4.20-20030320 (March 20, 2003)
Sep 12 10:39:38 localhost kernel: bond0 registered with MII link monitoring set to 100 ms, in fault-tolerance (active-backup) mode.
Sep 12 10:39:38 localhost kernel: bond0 registered without ARP monitoring
Sep 12 10:39:38 localhost kernel: Intel(R) PRO/100 Network Driver - version 2.2.21-k1
Sep 12 10:39:38 localhost kernel: Copyright (c) 2003 Intel Corporation
Sep 12 10:39:38 localhost kernel:
Sep 12 10:39:38 localhost kernel: e100: eth0: Intel(R) PRO/100 Network Connection
Sep 12 10:39:38 localhost kernel: Hardware receive checksums enabled
Sep 12 10:39:38 localhost kernel: cpu cycle saver enabled
Sep 12 10:39:38 localhost kernel:
Sep 12 10:39:38 localhost kernel: e100: eth1: Intel(R) PRO/100 Network Connection
Sep 12 10:39:38 localhost kernel: Hardware receive checksums enabled
Sep 12 10:39:38 localhost kernel: cpu cycle saver enabled
Sep 12 10:39:38 localhost kernel:
Sep 12 10:39:38 localhost kernel: bond0: enslaving eth0 as a backup interface with a down link.
Sep 12 10:39:38 localhost kernel: bond0: enslaving eth1 as a backup interface with a down link.
Sep 12 10:39:38 localhost kernel: e100: eth0 NIC Link is Up 100 Mbps Full duplex
Sep 12 10:39:38 localhost kernel: bond0: link status up for interface eth0, enabling it in 1000 ms.
Sep 12 10:39:38 localhost kernel: bond0: making interface eth0 the new active one 100 ms earlier.
Sep 12 10:39:38 localhost kernel: e100: eth1 NIC Link is Up 100 Mbps Full duplex
Sep 12 10:39:38 localhost kernel: bond0: link status up for interface eth1, enabling it in 1000 ms.
Sep 12 10:39:38 localhost kernel: bond0: link status definitely up for interface eth1.
[ The QA people here growl when I take their test server away to test system stuff, but they'll let me do it as long as I can tell them I'm working with you ]
Do you have 'TYPE=Bonding' in your bonding configuration? If not, does that help?
My ifcfg-bond0 is:

    DEVICE=bond0
    ONBOOT=yes
    USERCTL=no
    BOOTPROTO=dhcp

Should I add something?

The static ifcfg-bond0 (that works on 7.3) is:

    DEVICE=bond0
    ONBOOT=yes
    USERCTL=no
    BOOTPROTO='none'
    IPADDR='192.168.1.16'
    NETMASK='255.255.255.0'
    NETWORK='192.168.1.0'
    BROADCAST='192.168.1.255'
Add 'TYPE=Bonding' and then try again. I suppose in the backport I can do checks for device name (i.e., bondXXX), but at least in RHEL3 and the current Severn beta, that's not valid as you can rename ethernet devices whatever you want.
OK, I will try to wrest control of the server from QA's greedy hands. No need to backport any name checking; I used this config because that's how it worked before, and adding TYPE information is quite OK. Should I also put some TYPE in the Ethernet config files?
No, as long as they have MASTER & SLAVE set, things should work.
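Putting the advice from the last few comments together, the two kinds of config file would look roughly like the following. This is a sketch only: the TYPE, MASTER and SLAVE keys come from this thread, while the slave's ONBOOT, USERCTL and BOOTPROTO values are assumptions carried over from the bond0 examples.

```sh
# /etc/sysconfig/network-scripts/ifcfg-bond0 (master, DHCP)
DEVICE=bond0
TYPE=Bonding
ONBOOT=yes
USERCTL=no
BOOTPROTO=dhcp

# /etc/sysconfig/network-scripts/ifcfg-eth0 (slave; repeat for eth1)
# ONBOOT, USERCTL and BOOTPROTO here are assumed values.
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
USERCTL=no
BOOTPROTO=none
```

The slave files carry no TYPE line; MASTER and SLAVE are what tie them to bond0.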
OK, will test it on Monday if I can.
Well, with TYPE set it must do something, because that's a sure way to oops the server (using the latest RH 7.3 kernel). Curiously, it oopses both with and without DHCP, while the shipped initscripts do work with static addressing at least.

Are you interested in the oops? (And if so, how can I grab it? On homemade kernels I'd just use a fb console with lots of lines, but with a distribution kernel I'm a bit at a loss.)
Sure, I'm interested. You could try grabbing it with a serial console.
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2003-341.html