Bug 1269036 - RFE: Support the workflow that automatically detects nics and lets user customize bonds
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Bob Fournier
QA Contact: Shai Revivo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-10-06 05:39 UTC by bigswitch
Modified: 2023-09-14 03:06 UTC

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-28 14:00:24 UTC
Target Upstream Version:
Embargoed:



Comment 4 bigswitch 2015-10-08 08:33:24 UTC
The following is a detailed log analysis explaining why this feature is necessary.

Because RHOSP7 does not provide a workflow that lets the user configure the uplinks, the neutron-bsn-lldp service has to be smart enough to figure out which links are the uplinks on which to send LLDP. If all three of the following conditions hold, we consider a link an uplink:
1) the link is a physical link and is up (managed by network-online.service)
2) the link is attached to OVS (managed by os-collect-config.service)
3) the link does not have an IP address (managed by os-collect-config.service)
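The three conditions above amount to a simple predicate. The sketch below illustrates it as a pure function; the field names are assumptions for illustration, not the actual bsnlldp data model:

```python
# Sketch of the uplink-detection rule described above.
# The per-link dict fields (is_physical, is_up, attached_to_ovs,
# ip_addresses) are hypothetical names, not the real bsnlldp code.

def find_uplinks(links):
    """Return the names of links that satisfy all three uplink conditions:
    1) physical and up, 2) attached to OVS, 3) no IP address assigned."""
    return [
        name
        for name, info in links.items()
        if info["is_physical"] and info["is_up"]   # condition 1
        and info["attached_to_ovs"]                # condition 2
        and not info["ip_addresses"]               # condition 3
    ]
```

A link such as bond1 with a DHCP address fails condition 3 and is correctly excluded, while a physical NIC not yet attached to OVS fails condition 2.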

As a result, neutron-bsn-lldp.service should be started AFTER network-online.service and os-collect-config.service have started. Otherwise, neutron-bsn-lldp cannot determine which links are the uplinks.
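In systemd terms, the ordering requirement above would look roughly like the unit fragment below. This is only a sketch of the required dependencies; the ExecStart path and other values are assumed, not taken from the shipped unit file:

```ini
# neutron-bsn-lldp.service (ordering sketch only; values are assumptions)
[Unit]
Description=Big Switch LLDP service
# Start only after the network is up and os-collect-config has run,
# so that uplink detection (conditions 1-3 above) can succeed.
Wants=network-online.target
After=network-online.target os-collect-config.service

[Service]
ExecStart=/usr/bin/neutron-bsn-lldp

[Install]
WantedBy=multi-user.target
```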

However, os-collect-config.service not only performs steps 2) and 3), but also starts OpenStack services that require IP connectivity. The problem is that unless LLDP is sent out properly, the fabric cannot provide IP connectivity.

If we put "Wants=network-online.target" and "After=syslog.target network.target network-online.target" into neutron-bsn-lldp.service, the log below shows that the services start in the following order:
bring up links -> start LLDP service -> attach uplinks to OVS.

Oct  7 18:01:26 localhost NetworkManager[604]: <info>  (p1p1): link connected
Oct  7 18:01:26 localhost NetworkManager[604]: <info>  (p1p2): link connected
Oct  7 18:01:33 localhost systemd: Started bsn lldp.
Oct  7 18:02:34 localhost ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl -t 10 -- --fake-iface add-bond br-ex bond1 p1p1 p1p2 bond_mode=balance-tcp lacp=active other-config:lacp-fallback-ab=true other-config:lacp-time=fast
Oct  7 18:02:36 localhost kernel: device bond1 entered promiscuous mode
Oct  7 18:02:36 localhost systemd: Started DHCP interface bond1.
Oct  7 18:02:36 localhost NetworkManager[604]: <info>  (bond1): link connected

This order is wrong and should not work. However, the reason it works in most cases is https://github.com/stackforge/networking-bigswitch/blob/master/bsnstacklib/bsnlldp/bsnlldp.py#L331-L334, where the neutron-bsn-lldp service keeps polling for uplinks until it finds at least one.
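The polling behaviour at the linked lines can be sketched as a retry loop. This is a minimal illustration, not the actual bsnlldp implementation; the function and parameter names are assumptions:

```python
import time

def wait_for_uplinks(find_uplinks, poll_interval=1.0, max_attempts=None):
    """Keep polling until at least one uplink is detected.

    Sketch of the retry behaviour referenced at bsnlldp.py#L331-L334:
    because the service may start before os-collect-config has attached
    the NICs to OVS, it retries detection rather than failing once.
    """
    attempts = 0
    while True:
        uplinks = find_uplinks()
        if uplinks:
            return uplinks
        attempts += 1
        if max_attempts is not None and attempts >= max_attempts:
            return []  # gave up without finding any uplink
        time.sleep(poll_interval)
```

This retry masks the wrong startup ordering most of the time, but as the next log excerpt shows, it cannot recover an uplink that fails to attach to OVS.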

However, if an uplink temporarily fails to be attached to OVS, LLDP will not be sent via that uplink. The following is an example:

Oct  7 18:02:47 localhost os-collect-config: [2015/10/07 06:02:47 PM] [INFO] running ifup on interface: p1p1
Oct  7 18:02:48 localhost os-collect-config: [2015/10/07 06:02:48 PM] [INFO] running ifup on interface: p1p2
Oct  7 18:02:48 localhost os-collect-config: [2015/10/07 06:02:48 PM] [INFO] Running ovs-appctl bond/set-active-slave ('bond1', 'p1p1')
Oct  7 18:02:48 localhost os-collect-config: Traceback (most recent call last):
Oct  7 18:02:48 localhost os-collect-config: File "/usr/bin/os-net-config", line 10, in <module>
Oct  7 18:02:48 localhost os-collect-config: sys.exit(main())
Oct  7 18:02:48 localhost os-collect-config: File "/usr/lib/python2.7/site-packages/os_net_config/cli.py", line 187, in main
Oct  7 18:02:48 localhost os-collect-config: activate=not opts.no_activate)
Oct  7 18:02:48 localhost os-collect-config: File "/usr/lib/python2.7/site-packages/os_net_config/impl_ifcfg.py", line 312, in apply
Oct  7 18:02:48 localhost os-collect-config: self.bond_primary_ifaces[bond])
Oct  7 18:02:48 localhost os-collect-config: File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 146, in ovs_appctl
Oct  7 18:02:48 localhost os-collect-config: self.execute(msg, '/bin/ovs-appctl', action, *parameters)
Oct  7 18:02:48 localhost os-collect-config: File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 108, in execute
Oct  7 18:02:48 localhost os-collect-config: processutils.execute(cmd, *args, **kwargs)
Oct  7 18:02:48 localhost os-collect-config: File "/usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py", line 266, in execute
Oct  7 18:02:48 localhost os-collect-config: cmd=sanitized_cmd)
Oct  7 18:02:48 localhost os-collect-config: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
Oct  7 18:02:48 localhost os-collect-config: Command: /bin/ovs-appctl bond/set-active-slave bond1 p1p1
Oct  7 18:02:48 localhost os-collect-config: Exit code: 2
Oct  7 18:02:48 localhost os-collect-config: Stdout: u''
Oct  7 18:02:48 localhost os-collect-config: Stderr: u'cannot make disabled slave active\novs-appctl: ovs-vswitchd: server returned an error\n'
Oct  7 18:02:48 localhost os-collect-config: + RETVAL=1
Oct  7 18:02:48 localhost os-collect-config: + [[ 1 == 2 ]]
Oct  7 18:02:48 localhost os-collect-config: + [[ 1 != 0 ]]
Oct  7 18:02:48 localhost os-collect-config: + echo 'ERROR: os-net-config configuration failed.'
Oct  7 18:02:48 localhost os-collect-config: ERROR: os-net-config configuration failed.
Oct  7 18:02:48 localhost os-collect-config: + exit 1
Oct  7 18:02:48 localhost os-collect-config: [2015-10-07 18:02:48,413] (os-refresh-config) [ERROR] during configure phase. [Command '['dib-run-parts', '/usr/libexec/os-refresh-config/configure.d']' returned non-zero exit status 1]
Oct  7 18:02:48 localhost os-collect-config: [2015-10-07 18:02:48,413] (os-refresh-config) [ERROR] Aborting...
Oct  7 18:02:48 localhost os-collect-config: 2015-10-07 18:02:48.416 7470 ERROR os-collect-config [-] Command failed, will not cache new data. Command 'os-refresh-config' returned non-zero exit status 1
Oct  7 18:02:48 localhost os-collect-config: 2015-10-07 18:02:48.416 7470 WARNING os-collect-config [-] Sleeping 30.00 seconds before re-exec.

Comment 6 Mike Burns 2016-04-07 20:54:03 UTC
This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.

Comment 9 Red Hat Bugzilla Rules Engine 2017-02-06 15:26:47 UTC
This bugzilla has been removed from the release and needs to be reviewed and Triaged for another Target Release.

Comment 10 Dan Sneddon 2017-02-06 19:37:34 UTC
If this bug is still applicable, we can create a new bug so that we can work on this further in the next release. Otherwise, I'll assume that workarounds have been found.

Comment 12 Bob Fournier 2018-06-28 14:00:24 UTC
Closing this out due to lack of manpower, lower priority, and the fact that workarounds exist.

Comment 13 Red Hat Bugzilla 2023-09-14 03:06:20 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

