Bug 1915299

Summary: os-net-config fails to re-provision networking config on compute node with DPDK interfaces mapped to numbered interfaces
Product: Red Hat OpenStack Reporter: Alex Stupnikov <astupnik>
Component: os-net-config Assignee: Dan Sneddon <dsneddon>
Status: CLOSED ERRATA QA Contact: Paras Babbar <pbabbar>
Severity: high Docs Contact:
Priority: high    
Version: 13.0 (Queens) CC: bfournie, dsneddon, fiezzi, hbrock, jslagle, mburns, pbabbar, pweeks, sbaker
Target Milestone: z6 Keywords: Triaged
Target Release: 16.1 (Train on RHEL 8.2)   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: os-net-config-11.3.2-1.20210406083710.f49ab16.el8 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-05-26 13:50:32 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Alex Stupnikov 2021-01-12 12:12:26 UTC
Description of problem:

The customer reported that the deployment command fails for an existing overcloud when it is invoked with templates that contain definition [1]. From the provided sosreport (attached to the case), the issue appears to be caused by a failed os-net-config run.

Exception [2] is logged. From the os-net-config code, the failure appears to occur because /var/lib/os-net-config/dpdk_mapping.yaml doesn't contain a record for nic6. From the provided sosreport I can see that the var/lib/os-net-config/dpdk_mapping.yaml file is valid, but it contains information for real interfaces instead of numbered interfaces (for example, it contains a record for p2p2 instead of nic6).
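
The lookup failure described above can be sketched as follows. This is only an illustration, not the os-net-config implementation: the record layout (a list of dictionaries with a "name" key), the placeholder PCI addresses, and the names find_stored_record and InterfaceNotFound are all assumptions made for this example.

```python
# Hypothetical stand-in for the parsed contents of dpdk_mapping.yaml.
# The record layout and the PCI addresses are placeholders (assumptions).
dpdk_mapping = [
    {"name": "p2p1", "pci_address": "0000:00:00.0", "driver": "vfio-pci"},
    {"name": "p2p2", "pci_address": "0000:00:00.1", "driver": "vfio-pci"},
]


class InterfaceNotFound(Exception):
    """Hypothetical exception standing in for OvsDpdkBindException."""


def find_stored_record(ifname, mapping):
    """Return the stored record whose name matches ifname, or raise."""
    for record in mapping:
        if record.get("name") == ifname:
            return record
    raise InterfaceNotFound("Interface %s cannot be found" % ifname)


# The real interface name is found in the stored mapping.
print(find_stored_record("p2p2", dpdk_mapping)["name"])

# The numbered alias is not stored, so the lookup fails.
try:
    find_stored_record("nic6", dpdk_mapping)
except InterfaceNotFound as exc:
    print(exc)
```

Under these assumptions, a lookup by real name succeeds while a lookup by numbered alias raises the same message that appears in traceback [2].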

Version-Release number of selected component (if applicable):

RHOSP 13, os-net-config-8.4.4-6.el7ost.noarch


How reproducible:

Run the deployment command for an existing overcloud whose DPDK interfaces were provisioned using numbered NICs, with "NetworkDeploymentActions: ['CREATE','UPDATE']"


[1]
NodeDPDKNetworkDeploymentActions: ['CREATE','UPDATE']

[2]
Jan 11 12:03:14 dpdk-compute0 os-collect-config: Traceback (most recent call last):
Jan 11 12:03:14 dpdk-compute0 os-collect-config: File "/usr/bin/os-net-config", line 10, in <module>
Jan 11 12:03:14 dpdk-compute0 os-collect-config: sys.exit(main())
Jan 11 12:03:14 dpdk-compute0 os-collect-config: File "/usr/lib/python2.7/site-packages/os_net_config/cli.py", line 303, in main
Jan 11 12:03:14 dpdk-compute0 os-collect-config: provider.add_object(obj)
Jan 11 12:03:14 dpdk-compute0 os-collect-config: File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 70, in add_object
Jan 11 12:03:14 dpdk-compute0 os-collect-config: self.add_object(member)
Jan 11 12:03:14 dpdk-compute0 os-collect-config: File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 104, in add_object
Jan 11 12:03:14 dpdk-compute0 os-collect-config: self.add_ovs_dpdk_bond(obj)
Jan 11 12:03:14 dpdk-compute0 os-collect-config: File "/usr/lib/python2.7/site-packages/os_net_config/impl_ifcfg.py", line 920, in add_ovs_dpdk_bond
Jan 11 12:03:14 dpdk-compute0 os-collect-config: utils.bind_dpdk_interfaces(ifname, dpdk_port.driver, self.noop)
Jan 11 12:03:14 dpdk-compute0 os-collect-config: File "/usr/lib/python2.7/site-packages/os_net_config/utils.py", line 298, in bind_dpdk_interfaces
Jan 11 12:03:14 dpdk-compute0 os-collect-config: raise OvsDpdkBindException(msg)
Jan 11 12:03:14 dpdk-compute0 os-collect-config: os_net_config.utils.OvsDpdkBindException: Interface nic6 cannot be found

Comment 1 Dan Sneddon 2021-01-12 21:00:56 UTC
Looking at the attached support case, I see that the NICs are not being detected correctly. The NICs p2p1 and p2p2 are being detected twice, so the numbered NIC ordering is skipping nic6 and nic8. Those numbers fall on the duplicate entries for p2p1 and p2p2, and these NICs have already been assigned:


Jan 11 12:03:14 cpt0-dpdk-dell-tovb os-collect-config: [2021/01/11 11:58:07 AM] [INFO] Active nics are ['em1', 'em2', 'p1p1', 'p1p2', 'p2p1', 'p2p1', 'p2p2', 'p2p2', 'p3p1', 'p3p1', 'p3p2', 'p3p2']
Jan 11 12:03:14 cpt0-dpdk-dell-tovb os-collect-config: [2021/01/11 11:58:07 AM] [INFO] nic2 mapped to: em2
Jan 11 12:03:14 cpt0-dpdk-dell-tovb os-collect-config: [2021/01/11 11:58:07 AM] [INFO] nic3 mapped to: p1p1
Jan 11 12:03:14 cpt0-dpdk-dell-tovb os-collect-config: [2021/01/11 11:58:07 AM] [INFO] nic4 mapped to: p1p2
Jan 11 12:03:14 cpt0-dpdk-dell-tovb os-collect-config: [2021/01/11 11:58:07 AM] [INFO] nic7 mapped to: p2p2
Jan 11 12:03:14 cpt0-dpdk-dell-tovb os-collect-config: [2021/01/11 11:58:07 AM] [INFO] nic5 mapped to: p2p1
Jan 11 12:03:14 cpt0-dpdk-dell-tovb os-collect-config: [2021/01/11 11:58:07 AM] [INFO] nic1 mapped to: em1
Jan 11 12:03:14 cpt0-dpdk-dell-tovb os-collect-config: [2021/01/11 11:58:07 AM] [INFO] nic11 mapped to: p3p2
Jan 11 12:03:14 cpt0-dpdk-dell-tovb os-collect-config: [2021/01/11 11:58:07 AM] [INFO] nic9 mapped to: p3p1
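
The gaps in the numbering above are consistent with aliases being assigned by list position while repeated names are skipped. The following is a minimal sketch of that behavior, inferred only from the log output; map_numbered_nics is a hypothetical helper and not the actual os-net-config code.

```python
def map_numbered_nics(active_nics):
    """Assign nicN aliases by list position, skipping repeated names."""
    mapping = {}
    seen = set()
    for position, name in enumerate(active_nics, start=1):
        if name in seen:
            # A duplicate still consumes its position, so the alias for
            # this position is never created.
            continue
        seen.add(name)
        mapping["nic%d" % position] = name
    return mapping


# Active NIC list copied from the log above.
active = ['em1', 'em2', 'p1p1', 'p1p2', 'p2p1', 'p2p1',
          'p2p2', 'p2p2', 'p3p1', 'p3p1', 'p3p2', 'p3p2']

mapping = map_numbered_nics(active)
for alias in sorted(mapping, key=lambda a: int(a[3:])):
    print("%s mapped to: %s" % (alias, mapping[alias]))
```

With the duplicated list, this sketch yields aliases for nic1 through nic5, nic7, nic9, and nic11 only, which matches the logged mappings and leaves nic6 undefined.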

In order to troubleshoot this, I need to see the NIC config templates that are being used in the stack update, as well as more information about what changes were made manually. What was the goal of the manual changes? What changes were made to the NIC config templates (or network environment files) before running a stack update with NetworkDeploymentActions set to ["CREATE","UPDATE"]?

Comment 5 Dan Sneddon 2021-01-19 21:32:27 UTC
I think I have discovered where the bug lies here. When os-net-config runs for the first time, the DPDK NICs have no entry in /sys/class/net. Since the NICs are not present there, we look at the DPDK mapping and add the NICs to the list of active NICs.

When you made the LACP change and updated the stack, the DPDK NICs would have been active and would have had an entry in /sys/class/net. The NICs were added to the list of active NICs, but the DPDK mapping added those NICs to the list of active NICs a second time.

To fix this we probably have to make sure we only add each DPDK NIC to the list once.
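
One way to express that idea is to merge the two sources so that each name appears once. This is a minimal sketch and not the upstream patch: merge_active_nics is a hypothetical helper, the split between kernel-visible and DPDK-mapped NICs is illustrative, and the plain alphabetical sort stands in for whatever ordering os-net-config actually applies.

```python
def merge_active_nics(sysfs_nics, dpdk_nics):
    """Combine kernel-visible and DPDK-mapped NIC names without repeats."""
    # A set union drops names reported by both sources; sorting keeps the
    # order stable no matter which source reported a given name.
    return sorted(set(sysfs_nics) | set(dpdk_nics))


# Illustrative assumption: the four NICs below are the DPDK-mapped ones.
dpdk = ['p2p1', 'p2p2', 'p3p1', 'p3p2']

# First run: DPDK NICs are known only from the stored mapping.
first_run = merge_active_nics(['em1', 'em2', 'p1p1', 'p1p2'], dpdk)

# Stack update: the same NICs are reported by both sources.
update_run = merge_active_nics(
    ['em1', 'em2', 'p1p1', 'p1p2', 'p2p1', 'p2p2', 'p3p1', 'p3p2'], dpdk)

print(first_run)
print(update_run)
```

Because both runs produce the same list, positional aliases such as nic6 would resolve to the same NIC on the initial deployment and on later updates.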

I can file an upstream bug and patch, but I don't know whether the change would be made in OSP 13 or how long it would take. It is probably best to use the following workaround instead.

My recommendation is to use real NIC names in the computeDPDK.yaml template. If the nodes do not all have the same NIC name configuration, then a mapping will have to be provided. See the file firstboot/os-net-config-mappings.yaml in the openstack-tripleo-heat-templates directory and the associated documentation for more information.

Comment 6 Alex Stupnikov 2021-01-20 08:54:08 UTC
Thank you so much, Dan! We will try to explain the available options to the customer.

Comment 7 pweeks 2021-01-20 20:47:55 UTC
Dan, can you fill in the fixed in version field and link to the patch?
I'll set tags appropriate for 16.1.5.

Comment 23 errata-xmlrpc 2021-05-26 13:50:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.6 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2097