Bug 1657746 - [RHOSP10][os-collect-config]Interfaces are not properly removed from bonds after network configuration failed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: os-net-config
Version: 10.0 (Newton)
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 10.0 (Newton)
Assignee: Bob Fournier
QA Contact: Alexander Chuzhoy
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-12-10 11:08 UTC by Alex Stupnikov
Modified: 2019-04-30 16:59 UTC (History)

Fixed In Version: os-net-config-5.2.3-4.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-30 16:58:51 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
OpenStack gerrit 467311 None MERGED Continue bringing up interfaces even if one fails 2020-11-22 17:17:17 UTC
Red Hat Product Errata RHBA-2019:0921 None None None 2019-04-30 16:59:10 UTC

Description Alex Stupnikov 2018-12-10 11:08:33 UTC
Description of problem:

os-collect-config repeatedly tries to provision the networking configuration on slave nodes. If the networking templates contain a bond configuration and os-collect-config fails to provision that bond on the first try, subsequent runs generate the following errors against the bond's slaves:


Dec  7 04:19:53 localhost os-collect-config: ++ cat /sys/class/net/ens3f1/addr_assign_type
Dec  7 04:19:53 localhost os-collect-config: + local mac_addr_type=3
Dec  7 04:19:53 localhost os-collect-config: + '[' 3 '!=' 0 ']'
Dec  7 04:19:53 localhost os-collect-config: + echo 'Device has generated MAC, skipping.'
Dec  7 04:19:53 localhost os-collect-config: Device has generated MAC, skipping.

Dec  7 04:19:59 localhost os-collect-config: ++ cat /sys/class/net/bond0/addr_assign_type
Dec  7 04:19:59 localhost os-collect-config: + local mac_addr_type=1
Dec  7 04:19:59 localhost os-collect-config: + '[' 1 '!=' 0 ']'
Dec  7 04:19:59 localhost os-collect-config: + echo 'Device has generated MAC, skipping.'
Dec  7 04:19:59 localhost os-collect-config: Device has generated MAC, skipping.


Version-Release number of selected component (if applicable):

RHOSP 10:
- os-apply-config-5.1.1-1.el7ost.noarch
- os-collect-config-5.2.1-1.el7ost.noarch
- os-net-config-5.2.2-1.el7ost.noarch
- os-prober-1.58-9.el7.x86_64
- os-refresh-config-5.1.0-1.el7ost.noarch


How reproducible:

This issue came up after os-collect-config failed to provision the networking configuration because the link was down on one of the NICs. As a result, it failed to map a nic* that was assigned to bond0. Subsequent runs, however, failed for a different reason, with the errors above.


Steps to Reproduce:
1. Describe a network setup with a bond interface containing two NICs
2. Leave one of those NICs unlinked
3. Run the deploy command and link the NIC after the first os-collect-config failure

Actual results:

Deployment will fail.


Expected results:

Network configuration is properly provisioned.

Comment 2 Bob Fournier 2018-12-12 20:14:03 UTC
Hi Alex,

Can we get the nic config template files that were used - i.e. controller.yaml, compute.yaml?  It doesn't look like they are in the sosreport.
It sounds like a valid issue, we will try to duplicate.  Thanks.

Comment 3 Alex Stupnikov 2018-12-13 08:07:59 UTC
Hi Bob.

Asked customer to provide templates. BR, Alex.

Comment 5 Bob Fournier 2018-12-21 15:25:03 UTC
Thanks Alex.  I think I understand what's going on, and there appears to be a patch that needs to be backported to OSP-10 to fix this.

To summarize the issue:
- eno50 is configured properly as a slave in bond by os-net-config
Dec  7 04:12:45 localhost os-collect-config: [2018/12/07 04:12:45 AM] [INFO] running ifdown on interface: eno50
Dec  7 04:12:45 localhost os-collect-config: [2018/12/07 04:12:45 AM] [INFO] Writing config /etc/sysconfig/network-scripts/ifcfg-eno50
Dec  7 04:12:53 localhost os-collect-config: [2018/12/07 04:12:53 AM] [INFO] running ifup on interface: eno50
Dec  7 04:12:54 localhost kernel: bond0: Enslaving eno50 as a backup interface with an up link

- a separate interface defined on the compute node, nic5, does not get mapped and appears not to be connected. When ifup is run on this interface, it causes an os-net-config crash because the exception is not properly handled:
Dec  7 04:12:54 localhost os-collect-config: [2018/12/07 04:12:54 AM] [INFO] running ifup on interface: nic5
Dec  7 04:12:54 localhost /etc/sysconfig/network-scripts/ifup-eth: Device nic5 does not seem to be present, delaying initialization.
Dec  7 04:12:54 localhost os-collect-config: Traceback (most recent call last):
Dec  7 04:12:54 localhost os-collect-config: File "/usr/bin/os-net-config", line 10, in <module>
Dec  7 04:12:54 localhost os-collect-config: sys.exit(main())
Dec  7 04:12:54 localhost os-collect-config: File "/usr/lib/python2.7/site-packages/os_net_config/cli.py", line 187, in main
Dec  7 04:12:54 localhost os-collect-config: activate=not opts.no_activate)
Dec  7 04:12:54 localhost os-collect-config: File "/usr/lib/python2.7/site-packages/os_net_config/impl_ifcfg.py", line 972, in apply
Dec  7 04:12:54 localhost os-collect-config: self.ifup(interface)
Dec  7 04:12:54 localhost os-collect-config: File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 244, in ifup
Dec  7 04:12:54 localhost os-collect-config: self.execute(msg, '/sbin/ifup', interface)
Dec  7 04:12:54 localhost os-collect-config: File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 224, in execute
Dec  7 04:12:54 localhost os-collect-config: processutils.execute(cmd, *args, **kwargs)
Dec  7 04:12:54 localhost os-collect-config: File "/usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py", line 394, in execute
Dec  7 04:12:54 localhost os-collect-config: cmd=sanitized_cmd)
Dec  7 04:12:54 localhost os-collect-config: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
Dec  7 04:12:54 localhost os-collect-config: Command: /sbin/ifup nic5
Dec  7 04:12:54 localhost os-collect-config: Exit code: 1
Dec  7 04:12:54 localhost os-collect-config: Stdout: u'ERROR     : [/etc/sysconfig/network-scripts/ifup-eth] Device nic5 does not seem to be present, delaying initialization.\n'

- after the os-net-config crash configure_safe_defaults is run
Dec  7 04:12:54 localhost os-collect-config: + configure_safe_defaults

- since there has been no change to eno50, it is still configured as a slave (addr_assign_type=3), so it's skipped in configure_safe_defaults
Dec  7 04:13:24 localhost os-collect-config: ++ cat /sys/class/net/eno50/addr_assign_type
Dec  7 04:13:24 localhost os-collect-config: + local mac_addr_type=3
Dec  7 04:13:24 localhost os-collect-config: + '[' 3 '!=' 0 ']'
Dec  7 04:13:24 localhost os-collect-config: + echo 'Device has generated MAC, skipping.'
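
For readers following the trace above, here is a minimal Python sketch of the same test. It is not the configure_safe_defaults script itself (the trace shows that to be shell); the names `should_skip`, `sysfs_root` and `make_fake_sysfs` are hypothetical and exist only so the example runs without real NICs. The value meanings in the comments come from the kernel's sysfs documentation for addr_assign_type and should be treated as background, not as something stated in this bug.

```python
import os
import tempfile

# Meaning of addr_assign_type values per the kernel sysfs-class-net
# documentation (background information, not taken from this bug).
ADDR_ASSIGN_TYPES = {
    0: "permanent address",
    1: "randomly generated",
    2: "stolen from another device",
    3: "set using dev_set_mac_address",
}


def should_skip(device, sysfs_root="/sys/class/net"):
    """Mirror the traced check: skip any device whose type is not 0."""
    path = os.path.join(sysfs_root, device, "addr_assign_type")
    with open(path) as handle:
        mac_addr_type = int(handle.read().strip())
    return mac_addr_type != 0


def make_fake_sysfs(devices):
    """Hypothetical helper: build a throwaway sysfs-like tree for testing."""
    root = tempfile.mkdtemp()
    for name, value in devices.items():
        os.makedirs(os.path.join(root, name))
        with open(os.path.join(root, name, "addr_assign_type"), "w") as handle:
            handle.write("%d\n" % value)
    return root


if __name__ == "__main__":
    # Values for eno50 and bond0 match the log excerpts in this bug;
    # "eno49" is an invented device with a permanent address.
    root = make_fake_sysfs({"eno50": 3, "bond0": 1, "eno49": 0})
    for dev in ("eno50", "bond0", "eno49"):
        print(dev, "skip" if should_skip(dev, root) else "configure")
```

With these inputs the enslaved NIC (type 3) and the bond (type 1) are both skipped, which is why the safe-defaults pass never touches an interface left enslaved by the earlier, interrupted run.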

The real problem is that os-net-config crashes because it does not handle the ifup exception.  This was fixed in https://review.openstack.org/#/c/467311/, which solves a similar problem.  The fix is not present in OSP-10 and should be backported.  Once the backport is done we can provide a hotfix for the customer to test.
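
To illustrate the idea behind "Continue bringing up interfaces even if one fails", here is a rough Python sketch. It is not the code from the gerrit review and does not use os-net-config's own classes; `ifup_all` and its `run` parameter are hypothetical, and plain `subprocess` stands in for the oslo_concurrency call seen in the traceback. Only the `/sbin/ifup` path is taken from the log.

```python
import subprocess


def ifup_all(interfaces, run=subprocess.check_call):
    """Try to bring up every interface; collect failures instead of aborting.

    `run` is injectable (an assumption made for this sketch) so the loop
    can be exercised without touching real network devices.
    """
    failed = []
    for name in interfaces:
        try:
            run(["/sbin/ifup", name])
        except (subprocess.CalledProcessError, OSError) as exc:
            # Record the failure and keep going, so one missing device
            # (nic5 in this bug) does not stop the remaining interfaces.
            failed.append((name, exc))
    return failed


if __name__ == "__main__":
    def fake_run(cmd):
        # Simulate the failure from the log: nic5 is not present.
        if cmd[-1] == "nic5":
            raise subprocess.CalledProcessError(1, cmd)

    for name, exc in ifup_all(["eno50", "nic5", "bond0"], run=fake_run):
        print("ifup failed for %s: %s" % (name, exc))
```

The caller can still report or raise an error after the loop finishes; the point is only that the failure is surfaced once every interface has been attempted, rather than as an unhandled exception partway through.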

Comment 17 errata-xmlrpc 2019-04-30 16:58:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0921

