Bug 747404

Summary: bond0 breaks after reboot.
Product: Red Hat Enterprise Linux 6
Component: kernel
Version: 6.0
Hardware: x86_64
OS: Linux
Status: CLOSED NOTABUG
Severity: urgent
Priority: unspecified
Reporter: Ross Marshall <ross.marshall>
Assignee: Andy Gospodarek <agospoda>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: arozansk, peterm
Target Milestone: rc
Target Release: ---
Fixed In Version: 2.6.34.7-56.fc13.x86_64 and 2.6.18-128.2.1.4.37.el5xen
Doc Type: Bug Fix
Last Closed: 2011-10-24 20:43:52 UTC

Attachments:
- config files
- output of ifconfig -a before and after, /proc/net/bonding/bondx before and after, and dmesg

Description Ross Marshall 2011-10-19 17:49:36 UTC
Created attachment 529066 [details]
config files

Description of problem:
After bond0 is configured, the bond becomes flaky when the server reboots: some computers can ping it, others cannot, and ssh does not work. A manual network restart fixes the problem, but it breaks again on the next reboot.

Version-Release number of selected component (if applicable):


How reproducible:
Reboot server.

Steps to Reproduce:
1.
2.
3.
  
Actual results:
Cannot connect to the server via ssh; some computers can ping it and others cannot. We tried to find a pattern, but could not.

Expected results:


Additional info:

Comment 2 Ross Marshall 2011-10-19 19:17:49 UTC
One thing that I noticed: after a reboot, the secondary NIC does not get the same MAC address as the primary or bond interfaces, but after the manual restart they are all the same.

Comment 3 Andy Gospodarek 2011-10-21 15:17:40 UTC
Other than the fact that you are not using the standard 'BONDING_OPTS' flags in your ifcfg-bondX files, I don't see anything that looks incorrect.  I don't imagine that would really make a difference either, but it might if dracut put bonding in the initrd and the bonding options were somehow not loaded.

To make this work, all you need to do is add this line to the end of the ifcfg-bond[01] files:

BONDING_OPTS="mode=802.3ad miimon=100"

Test that for me, and if that doesn't work please provide the following information:

1.  Output from 'dmesg' from your system after it boots and you have run 'service network restart' and everything is operational. 

2.  Output from 'ifconfig -a' before and after restarting the network.

3.  The files /proc/net/bonding/bond0 and bond1 before and after your network restart.

Comment 4 Ross Marshall 2011-10-21 15:53:31 UTC
The configs that I uploaded are from a Fedora 13 box on which the bonding works properly. When I realized that the bonding broke on the other box, I did change the configs to have the BONDING_OPTS line, but after a reboot it didn't go into 802.3ad mode; it went into mode 1 instead. I will try to get the additional info, but I am unable to reboot it right now. ifconfig does show that the MAC address of the secondary interface is its hardware MAC, not the bond or eth0 MAC, until the network restart; afterwards it is the same as the bond. /proc/net/bonding/bond0 is the same before and after the restart.

Comment 5 Andy Gospodarek 2011-10-21 16:27:51 UTC
When you reboot it goes to mode 1 instead of mode 4?!?!?

That seems extremely odd.  I will try and reproduce it.

Comment 6 Ross Marshall 2011-10-21 16:51:53 UTC
Yes, this is correct. If I rmmod bonding and then modprobe bonding mode=4 miimon=100, then do a network restart, it is fine. I did read that putting these parameters in the module conf file was deprecated, but this is what happened, which is why I changed it back. The servers are IBM x3650 M3 machines with Broadcom NetXtreme II NICs using the bnx2 module.

Comment 7 Andy Gospodarek 2011-10-21 18:12:35 UTC
If you have to remove the bonding module to get things working, then it definitely seems like the bonding driver is being loaded somewhere else with different parameters.

The only lines in *any* of the files in /etc/modprobe.d/ that contain the word bonding should be these:

alias bond0 bonding
alias bond1 bonding

Remove everything else you see and make sure the proper BONDING_OPTS (as mentioned in comment #3) are specified in the ifcfg-bond0 and ifcfg-bond1 files.

This works well for me and has for others.  Falling back to the RHEL 4 way of specifying the bonding options is not preferred.

I would also encourage you to check other spots in /etc/ and make sure the bonding module is not being loaded with some odd parameters.  Were you using mode 1 in the past?
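
The check above can be sketched as a small script. The paths and file names are illustrative, and the script builds a throwaway directory rather than touching the real /etc/modprobe.d:

```shell
# Build a throwaway stand-in for /etc/modprobe.d containing only the two
# alias lines a clean bonding setup should have, then verify that no stray
# "options bonding ..." line is forcing module parameters behind your back.
tmpdir=$(mktemp -d)
cat > "$tmpdir/bonding.conf" <<'EOF'
alias bond0 bonding
alias bond1 bonding
EOF
if grep -rq '^options bonding' "$tmpdir"; then
    echo "stray bonding options found"   # these would override BONDING_OPTS
else
    echo "modprobe.d is clean"
fi
rm -rf "$tmpdir"
```

Against a real system you would run the same grep over /etc/modprobe.d/ itself.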

Comment 8 Ross Marshall 2011-10-21 19:18:28 UTC
No, we have never configured it to use mode 1. The files attached are the ones I originally used; after the first reboot broke the bond, I changed them so that BONDING_OPTS was in the ifcfg-bondX file, but no matter what I tried it would always go into mode 1. I noticed this after rebooting broke the bond, and even after doing a manual network restart. I had to do the rmmod bonding and then modprobe with the options in order to get it to work in mode 4, so I put the files back the way they were.

Comment 9 Andy Gospodarek 2011-10-21 19:39:23 UTC
That is extremely odd.  Did you ever try "mode=4" instead of "mode=802.3ad"?  I'm not sure why that would make a difference, but it might be worth a try.  Having these contents in my ifcfg-bond0 file works just fine:

# cat ifcfg-bond0
DEVICE="bond0"                                                                 
ONBOOT="yes"                                                                   
BOOTPROTO="static"                                                             
BONDING_OPTS="mode=802.3ad miimon=1000"                                        
NM_CONTROLLED="no"

# cat /proc/net/bonding/bond0                  
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)                   
                                                                               
Bonding Mode: IEEE 802.3ad Dynamic link aggregation                            
Transmit Hash Policy: layer2 (0)                                               
MII Status: down                                                               
MII Polling Interval (ms): 1000                                                
Up Delay (ms): 0                                                               
Down Delay (ms): 0                                                             
                                                                               
802.3ad info                                                                   
LACP rate: slow                                                                
Aggregator selection policy (ad_select): stable                                
bond bond0 has no active aggregator                                            
                                                                               
Slave Interface: eth5                                                          
MII Status: down                                                               
Speed: 100 Mbps                                                                
Duplex: full                                                                   
Link Failure Count: 0                                                          
Permanent HW addr: 00:13:20:f8:43:e6                                           
Aggregator ID: 1                                                               
Slave queue ID: 0
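
A quick way to confirm which mode is actually active after boot is to pull the "Bonding Mode" line out of /proc. A heredoc with a sample of the output above stands in for the real file here:

```shell
# Extract the active bonding mode from /proc/net/bonding/bond0-style output.
# (On a live system: awk -F': ' '/^Bonding Mode/ {print $2}' /proc/net/bonding/bond0)
sample=$(cat <<'EOF'
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
MII Status: down
EOF
)
printf '%s\n' "$sample" | awk -F': ' '/^Bonding Mode/ {print $2}'
```

If the bond has silently fallen back to mode 1, this prints the active-backup mode string instead of the 802.3ad one.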

Comment 10 Ross Marshall 2011-10-21 19:52:24 UTC
Yes, that is what I started to use, but I got tired of typing it in, so I started using mode=4. I also changed the LACP rate to fast, but that didn't help either. Out of 7 servers, all identical hardware and firmware, the 2 running EL6 (we tested RH, CentOS, and Oracle UEL) all show the same symptoms; the one running Fedora 13, which I got the config files from, and the others running Oracle VM Server 2.2 all work fine.

Comment 12 Ross Marshall 2011-10-24 15:39:38 UTC
I am just rebooting the server and will get the additional info for you. One thing we did in testing: while the bond was broken, we ran a packet capture on the server and noticed that at the packet level the ping requests and replies were being seen, but ping itself didn't see the replies and reported 100% packet loss.

Comment 13 Ross Marshall 2011-10-24 15:48:44 UTC
Created attachment 529914 [details]
output of ifconfig -a before and after, /proc/net/bonding/bondx before and after and dmesg

So these are the outputs from a fresh reboot with the bond broken and after a network restart.

Comment 14 Ross Marshall 2011-10-24 20:04:59 UTC
After disabling NetworkManager (chkconfig --levels 2345 NetworkManager off), the server and bonding have survived 2 reboots.

Comment 15 Andy Gospodarek 2011-10-24 20:43:52 UTC
Ah yes, NetworkManager can do all sorts of fun things to your system when you do not have:

NM_CONTROLLED="no"

in your ifcfg-ethX files and expect the changes there to stick.  It sounds like this is working now, so I will close this as NOTABUG.
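
For reference, a slave interface file that keeps NetworkManager out of the picture might look like this (the device and master names are illustrative):

```shell
# /etc/sysconfig/network-scripts/ifcfg-eth0 (illustrative)
DEVICE="eth0"
ONBOOT="yes"
BOOTPROTO="none"
MASTER="bond0"
SLAVE="yes"
NM_CONTROLLED="no"
```

With NM_CONTROLLED="no" on both the bond and its slaves, only the network service manages the interfaces, so the BONDING_OPTS settings survive a reboot.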

Comment 16 Ross Marshall 2011-10-24 20:48:37 UTC
Yes, thanks for your help.