Bug 432622

Summary: Bonding interface always starts with one slave down
Product: Red Hat Enterprise Linux 4
Reporter: Juanjo Villaplana <villapla>
Component: initscripts
Assignee: initscripts Maintenance Team <initscripts-maint-list>
Status: CLOSED NEXTRELEASE
QA Contact: Brock Organ <borgan>
Severity: medium
Priority: low
Version: 4.6
CC: fleitner, jn, kajtzu, michael, notting, tao
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2009-03-18 21:15:40 UTC
Attachments:
Description                                   Flags
/etc/sysconfig/network                        none
/etc/sysconfig/network-scripts/ifcfg-eth0     none
/etc/sysconfig/network-scripts/ifcfg-eth1     none
/etc/sysconfig/network-scripts/ifcfg-bond0    none
/etc/modprobe.conf                            none
'service network start' kernel messages       none
/etc/rc.d/init.d/network patch                none
'service network start' kernel messages       none
'ifconfig eth0 up' kernel messages            none
Instrumented net.agent output                 none
/etc/sysconfig/network-scripts/ifcfg-vlan8    none

Description Juanjo Villaplana 2008-02-13 13:27:41 UTC
Description of problem:

Since we upgraded to initscripts-7.93.31.EL-2, our bonding interface always
starts with eth0 slave down.

Version-Release number of selected component (if applicable):

initscripts-7.93.31.EL-2

How reproducible:

Always

Steps to Reproduce:
1. service network start
2. cat /proc/net/bonding/bond0
3. ifconfig
  
Actual results:

# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v2.6.3-rh (June 8, 2005)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth1
MII Status: up   
MII Polling Interval (ms): 100
Up Delay (ms): 0 
Down Delay (ms): 0

Slave Interface: eth0
MII Status: down 
Link Failure Count: 1
Permanent HW addr: 00:19:bb:c7:a8:72

Slave Interface: eth1
MII Status: up   
Link Failure Count: 0
Permanent HW addr: 00:19:bb:c7:a8:70

# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:19:BB:C7:A8:72  
          BROADCAST SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:127 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:13608 (13.2 KiB)  TX bytes:412 (412.0 b)
          Interrupt:169 Memory:f8000000-f8012100

Expected results:

# cat /proc/net/bonding/bond0 
Ethernet Channel Bonding Driver: v2.6.3-rh (June 8, 2005)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:19:bb:c7:a8:72

Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:19:bb:c7:a8:70
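
The difference between the two outputs above is the per-slave "MII Status" line. As a rough sketch (not part of the original report), the check can be automated with a small shell function. The function name down_slaves is hypothetical; it only assumes the /proc/net/bonding text format shown above.

```shell
#!/bin/sh
# List bonding slaves whose MII status is "down".
# Reads bonding status text (the /proc/net/bonding/<bond> format) from the
# file given as $1, or from standard input if no file is given.
# NOTE: "down_slaves" is a hypothetical helper name, not an initscripts tool.
down_slaves() {
    awk '
        # Remember which slave section we are in.
        /^Slave Interface:/ { slave = $3 }
        # The bond-level "MII Status" line comes before any slave section,
        # so it is skipped while slave is still empty.
        /^MII Status:/ && slave != "" && $3 == "down" { print slave }
    ' "$@"
}

# Example (assumes the bond is named bond0, as in this report):
#   down_slaves /proc/net/bonding/bond0
```

With the "Actual results" output this should report eth0; with the "Expected results" output it should print nothing.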

Additional info:

Comment 1 Juanjo Villaplana 2008-02-13 13:27:41 UTC
Created attachment 294772 [details]
/etc/sysconfig/network

Comment 2 Juanjo Villaplana 2008-02-13 13:28:43 UTC
Created attachment 294773 [details]
/etc/sysconfig/network-scripts/ifcfg-eth0

Comment 3 Juanjo Villaplana 2008-02-13 13:29:31 UTC
Created attachment 294774 [details]
/etc/sysconfig/network-scripts/ifcfg-eth1

Comment 4 Juanjo Villaplana 2008-02-13 13:30:13 UTC
Created attachment 294775 [details]
/etc/sysconfig/network-scripts/ifcfg-bond0

Comment 5 Juanjo Villaplana 2008-02-13 13:30:50 UTC
Created attachment 294776 [details]
/etc/modprobe.conf

Comment 6 Juanjo Villaplana 2008-02-13 13:32:43 UTC
Created attachment 294777 [details]
'service network start' kernel messages

Comment 7 Juanjo Villaplana 2008-02-13 13:47:37 UTC
Created attachment 294783 [details]
/etc/rc.d/init.d/network patch

After some testing we have found that the problem is related to this change
(from the initscripts changelog):

* Sat Jun 23 2007 Bill Nottingham <notting> - 7.93.30.EL-1
- init.d/network, network-functions: don't fiddle with hotplug settings
(#185569, #209307)

The attached patch restores the hotplug code in /etc/rc.d/init.d/network and
fixes this problem.

I'm not authorized to access bugs #185569 and #209307, so I guess this patch
may break something else, but this hotplug code was already present on
initscripts-7.93.29.EL-1 and it worked fine for us.

Comment 8 Bill Nottingham 2008-02-13 15:38:16 UTC
What happens if you change the slaves to be 'ONBOOT=no'?

Comment 9 Juanjo Villaplana 2008-02-13 17:38:25 UTC
Created attachment 294816 [details]
'service network start' kernel messages 

Setting ONBOOT=no doesn't help:

# service network start
Setting network parameters:				   [  OK  ]
Bringing up loopback interface: 			   [  OK  ]
Setting 802.1Q VLAN parameters: 			   [  OK  ]
Bringing up interface bond0:				   [  OK  ]
Bringing up interface vlan8:				   [  OK  ]

# cat /proc/net/bonding/bond0 
Ethernet Channel Bonding Driver: v2.6.3-rh (June 8, 2005)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: down
Link Failure Count: 1
Permanent HW addr: 00:19:bb:c7:a8:72

Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:19:bb:c7:a8:70

# ifconfig etho
etho: error fetching interface information: Device not found
[root@clu108 bz432622]# ifconfig eth0
eth0	  Link encap:Ethernet  HWaddr 00:19:BB:C7:A8:72  
	  BROADCAST SLAVE MULTICAST  MTU:1500  Metric:1
	  RX packets:94 errors:0 dropped:0 overruns:0 frame:0
	  TX packets:5 errors:0 dropped:0 overruns:0 carrier:0
	  collisions:0 txqueuelen:1000 
	  RX bytes:9595 (9.3 KiB)  TX bytes:412 (412.0 b)
	  Interrupt:169 Memory:f8000000-f8012100

Note that this setup worked fine until the upgrade of initscripts.

Comment 10 Bill Nottingham 2008-02-13 18:05:38 UTC
I'm assuming eth0 does actually have a valid link, of course.

If you instrument /etc/hotplug/net.agent, is it actually being invoked?

Comment 11 Juanjo Villaplana 2008-02-13 18:14:19 UTC
Created attachment 294817 [details]
'ifconfig eth0 up' kernel messages

> I'm assuming eth0 does actually have a valid link, of course.

Yes:

# ifconfig eth0 up
# cat /proc/net/bonding/bond0 
Ethernet Channel Bonding Driver: v2.6.3-rh (June 8, 2005)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:19:bb:c7:a8:72

Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:19:bb:c7:a8:70

Comment 12 Juanjo Villaplana 2008-02-13 18:28:37 UTC
Created attachment 294818 [details]
Instrumented net.agent output

> If you instrument /etc/hotplug/net.agent, is it actually being invoked?

I added this at the top of /etc/hotplug/net.agent:

set -x
exec > /tmp/net.agent.$$ 2>&1

and attached the output generated by "service network start".

Is this what you needed?

Comment 13 Bill Nottingham 2008-02-13 18:35:05 UTC
Can you attach your vlan config?

Comment 14 Juanjo Villaplana 2008-02-13 18:58:25 UTC
Created attachment 294820 [details]
/etc/sysconfig/network-scripts/ifcfg-vlan8

Comment 15 Bill Nottingham 2008-02-13 19:21:48 UTC
OK, the simple reason this is happening is that bringing up vlan8 causes a
hotplug event that is not delivered until well after /etc/init.d/network
finishes. Why it arrives so late doesn't make much sense at first glance.

I suspect adding a 'sleep 5' before the 'touch /var/lock/subsys/network' in
/etc/init.d/network will 'fix' it, but that's obviously not the right answer.

This hotplug event looks for the MAC address of vlan8 in your config to bring up
the device; it matches eth0. So, 'ifup eth0' is run, which attempts to enslave
the (already enslaved) device. The first step of enslavement is setting the link
down; it then attempts to enslave it, which fails because the device is already
enslaved. But the link remains down.

There are a couple of options to 'fix' this; one would be to detach the device
before enslaving it. That could cause a lot of bouncing of link status, though.
We could check whether or not it's enslaved, but that is not practical with the
bonding support in RHEL 4. Root causing why the hotplug event is so late might
help, but there may still be a race there.

The simplest fix I can think of for your case is to add 'HOTPLUG=no' to
ifcfg-eth0 and ifcfg-eth1; that should solve the problem.
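
A minimal sketch of applying that workaround from a script, assuming the slave configs live in /etc/sysconfig/network-scripts as in the attachments. The helper name set_hotplug_no is hypothetical and not part of initscripts; it just makes sure a file ends up with exactly one HOTPLUG=no line.

```shell
#!/bin/sh
# Ensure an ifcfg file contains HOTPLUG=no. Safe to run more than once.
# $1 is the path to the ifcfg file.
# NOTE: "set_hotplug_no" is a hypothetical helper, not an initscripts tool.
set_hotplug_no() {
    cfg=$1
    if grep -q '^HOTPLUG=' "$cfg"; then
        # A HOTPLUG= line already exists: rewrite it via a temporary file.
        sed 's/^HOTPLUG=.*/HOTPLUG=no/' "$cfg" > "$cfg.tmp" && mv "$cfg.tmp" "$cfg"
    else
        # If the file does not end with a newline, add one first so the
        # new setting is not glued onto the last existing line.
        if [ -n "$(tail -c 1 "$cfg")" ]; then
            echo >> "$cfg"
        fi
        echo 'HOTPLUG=no' >> "$cfg"
    fi
}

# Example (run as root; paths as used in this report):
#   for dev in eth0 eth1; do
#       set_hotplug_no "/etc/sysconfig/network-scripts/ifcfg-$dev"
#   done
```

Editing the two files by hand works just as well; the function only saves a step when several machines need the same change.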

Comment 16 Juanjo Villaplana 2008-02-13 19:37:12 UTC
I reverted to ONBOOT=yes and added HOTPLUG=no to ifcfg-eth[01], and this solved
the problem.

Does setting HOTPLUG=no have any side effect we should care about?

Comment 17 Bill Nottingham 2008-02-13 19:45:31 UTC
Reverting the ONBOOT=yes shouldn't make a difference.

HOTPLUG=no means that hotplug events (caused by adding/removing the device, or
module) will be ignored. It would mean that you'd have to manually bring the
interface up if you unloaded and reloaded the bnx2 module, for example.

Comment 18 Juanjo Villaplana 2008-02-13 20:34:40 UTC
OK.
I will leave /etc/rc.d/init.d/network untouched and add HOTPLUG=no to
ifcfg-eth[01] in order to fix this issue.

Your (extremely fast) help is much appreciated. Regards, Juanjo.

Comment 19 Juanjo Villaplana 2008-08-07 08:08:24 UTC
This issue persists on RHEL 4.7 (initscripts-7.93.33-1.el4).

Comment 20 Bill Nottingham 2008-12-08 21:59:21 UTC
*** Bug 159500 has been marked as a duplicate of this bug. ***

Comment 21 Bill Nottingham 2009-03-18 21:15:40 UTC
Given the existing workaround (HOTPLUG=no in configuration), and the current
update status of RHEL 4, I'm closing this. It should work without configuration
changes in RHEL 5.

With the goal of minimizing risk of change for deployed systems, and in response
to customer and partner requirements, Red Hat takes a conservative approach when
evaluating changes for inclusion in maintenance updates for currently deployed
products. The primary objectives of update releases are to enable new hardware
platform support and to resolve critical defects.

Comment 22 Paul Batkowski 2009-11-05 20:05:13 UTC
*** Bug 498480 has been marked as a duplicate of this bug. ***