Bug 1455865

Summary: Deployment on OSP11 with linux bonding fails
Product: Red Hat OpenStack
Reporter: Eduard Barrera <ebarrera>
Component: openstack-tripleo-heat-templates
Assignee: Bob Fournier <bfournie>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Gurenko Alex <agurenko>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 11.0 (Ocata)
CC: aschultz, bfournie, cfields, ebarrera, gkadam, mburns, mirko.schmidt, mlammon, rhel-osp-director-maint
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-10-03 20:28:59 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Eduard Barrera 2017-05-26 10:40:55 UTC
Description of problem:

In OSP9/10 we use Heat deployment templates to configure Linux bridge bonding on controllers and compute nodes (we have LACP on switches so do not use OVS bonding). This works fine.

On OSP11 I noticed that the deployment hangs. I tracked the issue down to Linux bonding breaking mid-deployment on the controllers. It works for a minute or so, then suddenly breaks as the deployment progresses. The controller console gives the message:

bond1: option slaves: invalid value (-ens2f0)


We are using the same hardware as OSP9/10 and the same Heat template bonding configuration.

Version-Release number of selected component (if applicable):
OSP 11

How reproducible:
always

Steps to Reproduce:
1. Deploy an environment using LACP with Linux bonding

Actual results:

bond1: option slaves: invalid value (-ens2f0)


Expected results:

Deployment finishes

Additional info:

Comment 2 Bob Fournier 2017-05-26 16:13:21 UTC
I wonder if this has something to do with the device name changing. In the controller console output in the case, the error message is for ens2f0 (Naming Scheme 2 in https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Networking_Guide/ch-Consistent_Network_Device_Naming.html).

The workaround in the case is using enoXX, e.g. eno49 for the interfaces (Naming Scheme 1 in doc above).

Is interface ens2f0 no longer being reported on the system?

Comment 3 Eduard Barrera 2017-06-06 12:41:52 UTC
Hi Bob,

The interfaces were present and their names didn't change. The strange thing is the dash inserted at the beginning of the interface name:

bond1: option slaves: invalid value (-ens2f0)

For sure, interface -ens2f0 does not exist. The YAML template uses nic1 and nic2, so there is no typo there:

                  type: linux_bond
                  name: bond1
                  defroute: false
                  bonding_options: "mode=4 lacp_rate=1 updelay=1000 miimon=50"
                  members:
                    -
                      type: interface
                      name: nic1
                      primary: true
                    -
                      type: interface
                      name: nic2

Comment 4 Bob Fournier 2017-06-06 14:17:26 UTC
Thanks Eduard. Yes, strange; it must have detected an interface named ens2f0 at some point.

When they are able to duplicate it, can we get the logs (sosreport etc.)?

Comment 7 Bob Fournier 2017-06-12 12:59:42 UTC
Thanks Eduard. Yes, very strange that it only happens with single controller deployments. I would like to get any logs they have access to.

Comment 8 Eduard Barrera 2017-06-13 06:42:59 UTC
Bob, the environment is not available any more. Did you have a chance to reproduce it?

Comment 10 Bob Fournier 2017-06-14 11:39:14 UTC
Eduard,

Can we get logs on the controller that is exhibiting the problem plus ifcfg-x files, /etc/os-net-config/config.json, and any custom nic mapping files?  In addition can we get the complete nic config files and command used for deployment? Thanks
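
One way to gather the requested configuration files in a single step is sketched below in Python. This is only an illustration: the function name, the archive path, and the choice of glob patterns are assumptions, and custom nic mapping files are not covered because their location is not stated in this report. A full sosreport is still the more complete option.

```python
import glob
import os
import tarfile
import tempfile


def collect_network_config(output_path, root="/"):
    """Archive ifcfg-*, route-* and the os-net-config file found under root.

    Returns the list of archived paths, relative to root.
    """
    patterns = [
        "etc/sysconfig/network-scripts/ifcfg-*",
        "etc/sysconfig/network-scripts/route-*",
        "etc/os-net-config/config.json",
    ]
    collected = []
    with tarfile.open(output_path, "w:gz") as archive:
        for pattern in patterns:
            for path in sorted(glob.glob(os.path.join(root, pattern))):
                arcname = os.path.relpath(path, root)
                archive.add(path, arcname=arcname)
                collected.append(arcname)
    return collected


if __name__ == "__main__":
    # Hypothetical output location; adjust as needed.
    target = os.path.join(tempfile.gettempdir(), "bond-debug.tar.gz")
    for name in collect_network_config(target):
        print(name)
```

The root parameter exists only so the function can be pointed at a copy of the filesystem (or a test directory) instead of the live system.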

Comment 12 mirko.schmidt 2017-07-27 07:59:39 UTC
Hi, I don't know if the problem could be resolved in the meantime. 

But I had a similar issue at a customer with 1 of 120 compute nodes on OSP10. The deployment constantly failed because one of the interfaces was marked down in the Linux bond. As a test, I installed a regular RHEL 7.3 on that machine and configured an LACP bond via NetworkManager, and that worked flawlessly. So the configuration and cabling were OK.

What helped to get the bond online was to add "rd.net.timeout.carrier=30" to the grub command line to give the interface a bit more time than the default 5 seconds to determine that the link is up.
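
As a rough illustration of that workaround, the Python sketch below appends a kernel argument to the GRUB_CMDLINE_LINUX line of a grub defaults file (typically /etc/default/grub on RHEL 7) without duplicating it. The function name is hypothetical and the code only transforms text; it assumes a single double-quoted GRUB_CMDLINE_LINUX line with no embedded quotes.

```python
import re


def add_kernel_arg(grub_defaults, arg):
    """Return grub_defaults with arg appended to GRUB_CMDLINE_LINUX.

    The text is returned unchanged if arg is already on the command line.
    """
    pattern = re.compile(r'^(GRUB_CMDLINE_LINUX=")(.*)(")[ \t]*$', re.MULTILINE)

    def _append(match):
        current = match.group(2).split()
        if arg in current:
            return match.group(0)
        return match.group(1) + " ".join(current + [arg]) + match.group(3)

    return pattern.sub(_append, grub_defaults)


if __name__ == "__main__":
    # Example input; the existing arguments here are illustrative only.
    example = 'GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet"\n'
    print(add_kernel_arg(example, "rd.net.timeout.carrier=30"), end="")
```

After writing the result back, the grub configuration would still need to be regenerated (for example with grub2-mkconfig) and the node rebooted for the argument to take effect.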

Best regards.

Comment 16 Bob Fournier 2017-08-24 21:08:24 UTC
Ganesh - has customer tried setting "rd.net.timeout.carrier=30"?

Comment 17 Bob Fournier 2017-09-20 19:59:54 UTC
Eduard - any more info on this, or whether the suggested workaround may help?

Also, regarding the output in Comment 3, the preceding '-' seems to be common in kernel messages about bonding; see for example http://lists.us.dell.com/pipermail/linux-poweredge/2015-November/050269.html (unrelated, but same message).
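
For context on the leading '-': the bonding driver's sysfs file /sys/class/net/<bond>/bonding/slaves accepts "+ifname" to enslave an interface and "-ifname" to release one. The value in parentheses is therefore a rejected removal request for ens2f0, not an interface literally named "-ens2f0". The Python sketch below pulls such events out of kernel log text; the function name and regular expression are illustrative and not part of any tool mentioned in this report.

```python
import re

# Matches kernel lines such as:
#   bond1: option slaves: invalid value (-ens2f0)
SLAVES_ERROR = re.compile(
    r"(?P<bond>\S+): option slaves: invalid value "
    r"\((?P<op>[+-])(?P<iface>[^)\s]+)\)"
)


def parse_slave_errors(log_text):
    """Return (bond, action, interface) tuples for rejected slave changes."""
    events = []
    for match in SLAVES_ERROR.finditer(log_text):
        action = "add" if match.group("op") == "+" else "remove"
        events.append((match.group("bond"), action, match.group("iface")))
    return events


if __name__ == "__main__":
    sample = "bond1: option slaves: invalid value (-ens2f0)\n"
    for event in parse_slave_errors(sample):
        print(event)
```

Seeing "remove" here suggests that something tried to detach the interface from the bond while it was not (or no longer) a member, which points at whatever reconfigured the network at that moment rather than at a naming problem.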

Comment 18 Bob Fournier 2017-10-03 20:28:59 UTC
Closing this out. We don't have a way to test this as the customer environment is no longer available, but there is a workaround to increase the carrier check time. This appears to be a link detection issue with the bonds.

Please reopen if we can get the logs and look into issues which may be causing long link detection times on this port.

Comment 19 Chris Fields 2018-03-07 23:28:46 UTC
I hit this error with OSP 11, but only when trying to add routing rules to /etc/sysconfig/network-scripts/route-<adapter> with OS::TripleO::NodeExtraConfigPost as part of overcloud deployment. Each overcloud deploy resulted in the kernel errors below, and the deployment failed on a software config that tried to take pacemaker out of maintenance mode but failed with a cib_replace error.

journalctl -p err -k
-- Logs begin at Thu 2018-02-01 09:10:33 CST, end at Wed 2018-03-07 17:17:50 CST. --
Mar 06 13:45:28 controller-2 kernel: bond1: option slaves: invalid value (-eth2)
Mar 06 13:45:28 controller-2 kernel: bond1: option slaves: invalid value (-eth1)

Removing the attempted modification of route- files from post config fixed it.