Bug 1455865 - Deployment on OSP11 with linux bonding fails
Summary: Deployment on OSP11 with linux bonding fails
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Bob Fournier
QA Contact: Gurenko Alex
Depends On:
Reported: 2017-05-26 10:40 UTC by Eduard Barrera
Modified: 2018-03-07 23:28 UTC
CC: 9 users

Clone Of:
Last Closed: 2017-10-03 20:28:59 UTC

Attachments

Description Eduard Barrera 2017-05-26 10:40:55 UTC
Description of problem:

In OSP9/10 we use Heat deployment templates to configure Linux bonding on controllers and compute nodes (we have LACP on the switches, so we do not use OVS bonding). This works fine.

In OSP11 I noticed that the deployment hangs. I tracked the issue down to Linux bonding breaking mid-deployment on the controllers. It works for a minute or so, then suddenly breaks as the deployment progresses. The controller console gives the message:

bond1: option slaves: invalid value (-ens2f0)

We are using the same hardware as OSP9/10 and the same Heat template bonding configuration.

Version-Release number of selected component (if applicable):
OSP 11

How reproducible:

Steps to Reproduce:
1. Deploy an environment using LACP with Linux bonding

Actual results:

bond1: option slaves: invalid value (-ens2f0)

Expected results:

Deployment finishes

Additional info:

Comment 2 Bob Fournier 2017-05-26 16:13:21 UTC
I wonder if this has something to do with the device name changing.  In the controller console output in this case, the error message is for ens2f0 (Naming Scheme 2 in https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Networking_Guide/ch-Consistent_Network_Device_Naming.html).

The workaround in this case is to use enoXX, e.g. eno49, for the interfaces (Naming Scheme 1 in the doc above).

Is interface ens2f0 no longer being reported on the system?

Comment 3 Eduard Barrera 2017-06-06 12:41:52 UTC
Hi Bob,

The interfaces were present and didn't change their names. The strange thing is the dash inserted at the beginning of the interface name:

bond1: option slaves: invalid value (-ens2f0)

Interface -ens2f0 certainly does not exist. The YAML template uses nic1 and nic2, so there is no typo there:

- type: linux_bond
  name: bond1
  defroute: false
  bonding_options: "mode=4 lacp_rate=1 updelay=1000 miimon=50"
  members:
    - type: interface
      name: nic1
      primary: true
    - type: interface
      name: nic2
Comment 4 Bob Fournier 2017-06-06 14:17:26 UTC
Thanks Eduard.  Yes, it is strange; it must have detected an interface named ens2f0 at some point.

When they are able to duplicate it, can we get the logs (sosreport, etc.)?

Comment 7 Bob Fournier 2017-06-12 12:59:42 UTC
Thanks Eduard. Yes, it is very strange that this only happens with single-controller deployments. We would like to get any logs they have access to.

Comment 8 Eduard Barrera 2017-06-13 06:42:59 UTC
Bob, the environment is not available anymore. Did you have the chance to reproduce it?

Comment 10 Bob Fournier 2017-06-14 11:39:14 UTC

Can we get logs from the controller that is exhibiting the problem, plus the ifcfg-x files, /etc/os-net-config/config.json, and any custom NIC mapping files?  In addition, can we get the complete NIC config files and the command used for deployment? Thanks
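A sketch of how that data could be gathered on the affected controller (command fragment; the archive path is arbitrary and the file locations assume a standard TripleO overcloud node):

```shell
# Sketch: collect the debugging data requested above on the controller.
sosreport --batch
# Bundle the NIC configuration files into one archive for the case.
tar czf /tmp/bond-debug.tgz \
    /etc/sysconfig/network-scripts/ifcfg-* \
    /etc/os-net-config/config.json
```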

Comment 12 mirko.schmidt 2017-07-27 07:59:39 UTC
Hi, I don't know whether the problem has been resolved in the meantime.

But I had a similar issue at a customer with 1 of 120 compute nodes on OSP10. The deployment constantly failed because one of the interfaces had been marked down in the Linux bond. I tested installing a regular RHEL 7.3 on that machine and configuring a LACP bond via NetworkManager, and that worked flawlessly, so the configuration and cabling were OK.

What helped get the bond online was adding "rd.net.timeout.carrier=30" to the grub command line, to give the interface a bit more time than the default 5 seconds to determine that the link is up.

Best regards.
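The workaround above is a kernel command-line change; a minimal sketch of applying it on RHEL 7 (config fragment, assuming a BIOS install; UEFI systems write grub.cfg under /boot/efi/EFI/redhat/ instead):

```shell
# Sketch: give links up to 30s to come up instead of the 5s default.
# Append the option to the kernel command line in the grub defaults...
sed -i 's/^\(GRUB_CMDLINE_LINUX=.*\)"$/\1 rd.net.timeout.carrier=30"/' /etc/default/grub
# ...then regenerate the grub config and reboot for it to take effect.
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot
```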

Comment 16 Bob Fournier 2017-08-24 21:08:24 UTC
Ganesh - has customer tried setting "rd.net.timeout.carrier=30"?

Comment 17 Bob Fournier 2017-09-20 19:59:54 UTC
Eduard - any more info on this, or whether the suggested workaround may help?

Also, regarding the output in Comment 3, the preceding '-' seems to be a common kernel message with bonding; see for example http://lists.us.dell.com/pipermail/linux-poweredge/2015-November/050269.html (unrelated, but the same message).
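For what it's worth (my reading of the kernel's bonding documentation, not confirmed from this system's logs): the bonding driver's sysfs interface takes a leading '+' to enslave an interface and a leading '-' to release it, so a message naming '-ens2f0' usually reflects an attempted slave removal rather than a literal interface name:

```shell
# Sketch of the bonding sysfs API (Documentation/networking/bonding.txt):
echo +ens2f0 > /sys/class/net/bond1/bonding/slaves   # enslave ens2f0
echo -ens2f0 > /sys/class/net/bond1/bonding/slaves   # release ens2f0
# "bond1: option slaves: invalid value (-ens2f0)" would then indicate the
# release was rejected, e.g. if ens2f0 was not enslaved at that moment.
```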

Comment 18 Bob Fournier 2017-10-03 20:28:59 UTC
Closing this out.  We don't have a way to test this, as the customer environment is no longer available, but there is a workaround: increase the carrier check time.  This appears to be a link-detection issue with the bonds.

Please reopen if we can get the logs; we can then look into whatever may be causing long link-detection times on this port.

Comment 19 Chris Fields 2018-03-07 23:28:46 UTC
I hit this error with OSP 11, but only when trying to add routing rules to /etc/sysconfig/network-scripts/route-<adapter> via OS::TripleO::NodeExtraConfigPost as part of the overcloud deployment.  Each overcloud deploy produced the kernel errors below, and the deployment failed on a software config step that tried to take pacemaker out of maintenance mode but failed with a cib_replace error.

journalctl -p err -k
-- Logs begin at Thu 2018-02-01 09:10:33 CST, end at Wed 2018-03-07 17:17:50 CST. --
Mar 06 13:45:28 controller-2 kernel: bond1: option slaves: invalid value (-eth2)
Mar 06 13:45:28 controller-2 kernel: bond1: option slaves: invalid value (-eth1)

Removing the attempted modification of the route- files from the post config fixed it.
