Bug 1455865
Summary: | Deployment on OSP11 with linux bonding fails | | |
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Eduard Barrera <ebarrera> |
Component: | openstack-tripleo-heat-templates | Assignee: | Bob Fournier <bfournie> |
Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Gurenko Alex <agurenko> |
Severity: | unspecified | Priority: | unspecified |
Version: | 11.0 (Ocata) | CC: | aschultz, bfournie, cfields, ebarrera, gkadam, mburns, mirko.schmidt, mlammon, rhel-osp-director-maint |
Hardware: | Unspecified | OS: | Unspecified |
Last Closed: | 2017-10-03 20:28:59 UTC | Type: | Bug |
Description
Eduard Barrera
2017-05-26 10:40:55 UTC
I wonder if this has something to do with the device name changing. In the controller console output in the case, the error message is for ens2f0 (Naming Scheme 2 in https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Networking_Guide/ch-Consistent_Network_Device_Naming.html). The workaround in the case is using enoXX, e.g. eno49, for the interfaces (Naming Scheme 1 in the doc above). Is interface ens2f0 no longer being reported on the system?

Hi Bob,

The interfaces were present and didn't change name. The strange thing is the dash inserted at the beginning of the interface name:

bond1: option slaves: invalid value (-ens2f0)

For sure, interface -ens2f0 does not exist. The yaml template is using nic1 and nic2, so no typo there:

type: linux_bond
name: bond1
defroute: false
bonding_options: "mode=4 lacp_rate=1 updelay=1000 miimon=50"
members:
  - type: interface
    name: nic1
    primary: true
  - type: interface
    name: nic2

Thanks Eduard. Yes, strange; it must have detected an interface named ens2f0 at some point. When they are able to duplicate it, can we get the logs (sosreport etc.)?

Thanks Eduard. Yes, very strange that it only happens with single controller deployments. Would like to get any logs they have access to.

Bob, the environment is not available any more. Did you have the chance to reproduce it?

Eduard,

Can we get logs from the controller that is exhibiting the problem, plus the ifcfg-x files, /etc/os-net-config/config.json, and any custom nic mapping files? In addition, can we get the complete nic config files and the command used for deployment? Thanks

Hi,

I don't know if the problem has been resolved in the meantime, but I had a similar issue at a customer with 1 of 120 compute nodes on OSP10. The deployment constantly failed because one of the interfaces was marked down in the Linux bond. I tested installing a regular RHEL 7.3 on that machine and configuring a LACP bond via NetworkManager, and that worked flawlessly.
So the configuration and cabling were OK. What helped to get the bond online was adding "rd.net.timeout.carrier=30" to the grub command line, which gives the interface more time than the default 5 seconds to determine that the link is up.

Best regards.

Ganesh - has the customer tried setting "rd.net.timeout.carrier=30"?

Eduard - any more info on this, or whether the suggested workaround may help? Also, regarding the output in Comment 3, adding that preceding '-' seems to be a common kernel message with bonding; see for example http://lists.us.dell.com/pipermail/linux-poweredge/2015-November/050269.html (unrelated, but same message).

Closing this out. We don't have a way to test this as the customer environment is no longer available, but there is a workaround to increase the carrier check time. This appears to be a link detection issue with the bonds. Please reopen if we can get the logs and look into issues which may be causing long link detection times on this port.

I hit this error w/OSP 11, but only when trying to add routing rules to /etc/sysconfig/network-scripts/route-<adapter> with OS::TripleO::NodeExtraConfigPost as part of overcloud deployment. Each overcloud deploy resulted in the kernel errors below, and the deployment failed on a software config that tried to take pacemaker out of maintenance mode but failed with a cib_replace error.

journalctl -p err -k
-- Logs begin at Thu 2018-02-01 09:10:33 CST, end at Wed 2018-03-07 17:17:50 CST. --
Mar 06 13:45:28 controller-2 kernel: bond1: option slaves: invalid value (-eth2)
Mar 06 13:45:28 controller-2 kernel: bond1: option slaves: invalid value (-eth1)

Removing the attempted modification of route- files from post config fixed it.
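
For anyone triaging similar logs, here is a minimal sketch of pulling the bond and interface names out of the "option slaves: invalid value" kernel messages quoted in this bug. In the bonding sysfs interface, members are added or removed by writing +ifname or -ifname to the bond's slaves file, so the leading '-' most likely marks a failed removal request rather than being part of the interface name. The helper name parse_bond_errors and the regular expression are illustrative assumptions, not part of os-net-config or any TripleO tooling, and the sample lines are the ones from the journalctl output above.

```python
import re

# Matches kernel bonding messages such as:
#   "bond1: option slaves: invalid value (-eth2)"
# The optional +/- prefix is captured separately from the interface name.
PATTERN = re.compile(
    r"(?P<bond>\S+): option slaves: invalid value "
    r"\((?P<op>[+-]?)(?P<iface>[^)]+)\)"
)

# Assumed reading of the prefix, based on the bonding sysfs convention.
OPERATIONS = {"+": "add", "-": "remove"}


def parse_bond_errors(lines):
    """Return (bond, operation, interface) for each matching log line.

    Hypothetical helper for illustration only.
    """
    results = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            operation = OPERATIONS.get(match.group("op"), "unknown")
            results.append((match.group("bond"), operation, match.group("iface")))
    return results


# Log lines copied from the journalctl output in this bug.
SAMPLE = [
    "Mar 06 13:45:28 controller-2 kernel: bond1: option slaves: invalid value (-eth2)",
    "Mar 06 13:45:28 controller-2 kernel: bond1: option slaves: invalid value (-eth1)",
]

if __name__ == "__main__":
    for bond, operation, iface in parse_bond_errors(SAMPLE):
        print(f"{bond}: failed to {operation} {iface}")
```

The output of `journalctl -p err -k` can be passed in line by line in place of SAMPLE. Seeing which members and which operation failed may help narrow down whether the interface was never enslaved (for example, because carrier detection timed out) before something tried to remove it.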