Created attachment 529066 [details]
config files

Description of problem:
After configuring bond0, if the server reboots, the bond becomes flaky, some computers can ping it, others can't, and ssh does not work. Doing a manual network restart fixes the problem, but breaks again upon reboot.

Version-Release number of selected component (if applicable):

How reproducible:
Reboot server.

Steps to Reproduce:
1.
2.
3.

Actual results:
Can't connect to server via ssh, and some computers can ping it, others can't. Tried to find a pattern, but can't.

Expected results:

Additional info:
One thing I noticed: after a reboot, the secondary NIC does not get the same MAC address as the primary or bond interfaces, but after the manual restart they are all the same.
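The MAC mismatch described above can be checked without reading ifconfig output by eye. The following is only a sketch: it assumes the bonding driver's sysfs files (`/sys/class/net/<bond>/bonding/slaves` and `/sys/class/net/<iface>/address`) are available, and the `check_bond_macs` function name and the `SYSFS_NET` override variable are hypothetical names introduced here for illustration.

```shell
#!/bin/sh
# Compare the current MAC address of every slave with the MAC of the bond.
# SYSFS_NET is a hypothetical override so the function can be pointed at a
# copy of the sysfs tree; by default it reads /sys/class/net.
check_bond_macs() {
    bond=$1
    root=${SYSFS_NET:-/sys/class/net}
    bond_mac=$(cat "$root/$bond/address") || return 2
    rc=0
    # bonding/slaves holds a space-separated list of enslaved interfaces.
    for slave in $(cat "$root/$bond/bonding/slaves"); do
        slave_mac=$(cat "$root/$slave/address")
        if [ "$slave_mac" = "$bond_mac" ]; then
            echo "$slave: $slave_mac matches $bond"
        else
            echo "$slave: $slave_mac differs from $bond ($bond_mac)"
            rc=1
        fi
    done
    return $rc
}
```

Running `check_bond_macs bond0` once after a reboot and once after `service network restart` would show whether the secondary interface picks up the bond's address only after the restart. An interface that is not enslaved at all will not appear in the list, so a missing name is itself worth noting.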
Other than the fact that you are not using the standard 'BONDING_OPTS' flags in your ifcfg-bondX files I don't see anything that looks incorrect. I don't imagine that would really make a difference either, but it might if dracut put bonding in the initrd and the bonding options were somehow not loaded.

To make this work all you need to do is add this line to the end of ifcfg-bond[01] files:

BONDING_OPTS="mode=802.3ad miimon=100"

Test that for me and if that doesn't work please provide the following information:

1. Output from 'dmesg' from your system after it boots and you have run 'service network restart' and everything is operational.
2. Output from 'ifconfig -a' before and after restarting the network.
3. The files /proc/net/bonding/bond0 and bond1 before and after your network restart.
The configs that I uploaded are from a Fedora 13 box on which the bonding works properly. When I realized that the bonding broke on the other box, I did change the configs to have the BONDING_OPTS line, but after reboot it didn't go into 802.3ad mode, but rather mode 1. I will try to get the additional info, but I am unable to reboot it right now. ifconfig does show that the MAC address of the secondary interface is its hardware MAC, not the bond or eth0 MAC, until the network restart; after the restart it is the same as the bond's. /proc/net/bonding/bond0 is the same before and after the restart.
When you reboot it goes to mode 1 instead of mode 4?!?!? That seems extremely odd. I will try and reproduce it.
Yes, this is correct. If I rmmod bonding and then modprobe bonding mode=4 miimon=100, then do a network restart, it is fine. I did read that putting these parameters in the module conf file is deprecated, but this is what happened, and why I changed it back. The servers are IBM x3650 M3 machines with Broadcom NetXtreme II NICs using the bnx2 module.
If you have to remove the bonding module to get things working, then it definitely seems like the bonding driver is being loaded somewhere else with different parameters. The only lines in *any* of the files in /etc/modprobe.d/ that contain the word bonding should be these:

alias bond0 bonding
alias bond1 bonding

Remove everything else you see and make sure the proper BONDING_OPTS (as mentioned in comment #3) are specified in the ifcfg-bond0 and ifcfg-bond1 files. This works well for me and has for others. Falling back to the RHEL4-way of specifying the bonding options is not preferred.

I would also encourage you to check other spots in /etc/ and make sure the bonding module is not being loaded with some odd parameters. Were you using mode 1 in the past?
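One way to run the check suggested above is to search the modprobe configuration for any line that mentions bonding and filter out the plain alias lines. This is a minimal sketch rather than an official tool; the function name `find_stray_bonding_lines` is made up here, and it relies only on standard `grep` options.

```shell
#!/bin/sh
# Print every line under a modprobe config directory that mentions "bonding",
# except plain "alias bondN bonding" lines. Output is file:line:content.
# The directory defaults to /etc/modprobe.d. The exit status is non-zero when
# nothing is left after filtering, i.e. when the directory looks clean.
find_stray_bonding_lines() {
    dir=${1:-/etc/modprobe.d}
    grep -rn 'bonding' "$dir" 2>/dev/null |
        grep -Ev ':[[:space:]]*alias[[:space:]]+bond[0-9]+[[:space:]]+bonding[[:space:]]*$'
}
```

Any `options bonding ...` line it prints is a candidate for the module being loaded with parameters other than those in BONDING_OPTS. The same function can be pointed at other directories under /etc/ by passing the path as the first argument.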
No, we have never configured it to use mode 1. The files attached are the ones that I originally used, but after the first reboot and the bond breaking, I changed them so that the BONDING_OPTS was in the ifcfg-bondx file. No matter what I tried, it would always go into mode 1. I noticed this after rebooting and the bond breaking, and even after doing a manual network restart. I would have to do the rmmod bonding and then modprobe with the options in order to get it to work in mode 4. So I put the files back to the way they were.
That is extremely odd. Did you ever try "mode=4" instead of "mode=802.3ad?" I'm not sure why that would make a difference but it might be worth a try. Having this in the contents of my ifcfg-bond0 file works just fine:

# cat ifcfg-bond0
DEVICE="bond0"
ONBOOT="yes"
BOOTPROTO="static"
BONDING_OPTS="mode=802.3ad miimon=1000"
NM_CONTROLLED="no"

# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: down
MII Polling Interval (ms): 1000
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: slow
Aggregator selection policy (ad_select): stable
bond bond0 has no active aggregator

Slave Interface: eth5
MII Status: down
Speed: 100 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:13:20:f8:43:e6
Aggregator ID: 1
Slave queue ID: 0
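Since the question in this thread is whether the bond really comes up in 802.3ad mode after a reboot, the "Bonding Mode:" line of the status file can be checked from a script. The sketch below assumes the status file has the layout shown in the output above; `bonding_mode` and `is_8023ad` are hypothetical helper names, not part of any package.

```shell
#!/bin/sh
# Print the value of the "Bonding Mode:" line from a bonding status file.
# The file defaults to /proc/net/bonding/bond0.
bonding_mode() {
    file=${1:-/proc/net/bonding/bond0}
    sed -n 's/^Bonding Mode:[[:space:]]*//p' "$file"
}

# Succeed only if the status file reports an 802.3ad (mode 4) bond.
is_8023ad() {
    case $(bonding_mode "$1") in
        *802.3ad*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `is_8023ad /proc/net/bonding/bond0 || echo "bond0 is not in 802.3ad mode"` could be run right after boot, before any manual network restart, to record which mode the driver actually selected.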
Yes, that is what I started with, but I got tired of typing it in, so I started using mode=4. I also changed the LACP rate to fast, but that didn't help either. Out of 7 servers, all with identical hardware and firmware, the 2 that are running EL6 (RH, CentOS and Oracle UEL; I tested them all) show the same symptoms. The one running Fedora 13, which I got the config files from, and the others running Oracle VM Server 2.2 all work fine.
I am just rebooting the server and will get the additional info for you. One thing that we did in testing: while the bond was broken, we ran a packet capture on the server and noticed that, at the packet level, the ping requests and replies were both being seen, but ping itself didn't see the reply, so it reported 100% packet loss.
Created attachment 529914 [details]
output of ifconfig -a before and after, /proc/net/bonding/bondx before and after and dmesg

So these are the outputs from a fresh reboot with the bond broken and after a network restart.
After disabling NetworkManager (chkconfig --levels 2345 NetworkManager off), the server and bonding have survived 2 reboots.
Ah yes, NetworkManager can do all sorts of fun things to your system when you do not have: NM_CONTROLLED="no" in your ifcfg-ethX files and expect changes there to stick. Sounds like this is working now, so I will close this as NOTABUG.
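For anyone who wants to keep NetworkManager running, a quick scan can show which interface files are missing the setting mentioned above. This is a rough sketch under the assumption that the files live in the usual /etc/sysconfig/network-scripts directory; `list_nm_controlled` is a hypothetical name chosen for this example.

```shell
#!/bin/sh
# List ifcfg-* files that do not explicitly set NM_CONTROLLED to "no".
# The directory defaults to /etc/sysconfig/network-scripts.
# Note: ifcfg-lo is reported too if it lacks the setting.
list_nm_controlled() {
    dir=${1:-/etc/sysconfig/network-scripts}
    for f in "$dir"/ifcfg-*; do
        [ -f "$f" ] || continue
        # Accept no, "no" or 'no', in any letter case.
        grep -Eqi '^[[:space:]]*NM_CONTROLLED=["'\'']?no["'\'']?[[:space:]]*$' "$f" || echo "$f"
    done
}
```

Every file it prints is one that NetworkManager may still try to manage, which for a bond means checking the ifcfg-bondX files as well as the ifcfg-ethX files of the slaves.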
Yes, thanks for your help.