Description of problem:
A customer reported that the deployment command fails for an existing overcloud when it is invoked with templates that contain definition [1]. According to the sosreport (attached to the case), the failure is caused by os-net-config, which logs exception [2]. From the os-net-config code, the failure appears to occur because /var/lib/os-net-config/dpdk_mapping.yaml does not contain a record for nic6. The sosreport shows that /var/lib/os-net-config/dpdk_mapping.yaml is valid, but it contains records for real interface names instead of numbered interfaces (for example, a record for p2p2 instead of nic6).

Version-Release number of selected component (if applicable):
RHOSP 13, os-net-config-8.4.4-6.el7ost.noarch

How reproducible:
Run the deployment command for an existing overcloud whose DPDK interfaces were provisioned using numbered NICs, with "NetworkDeploymentActions: ['CREATE','UPDATE']".

[1] NodeDPDKNetworkDeploymentActions: ['CREATE','UPDATE']

[2]
Jan 11 12:03:14 dpdk-compute0 os-collect-config: Traceback (most recent call last):
Jan 11 12:03:14 dpdk-compute0 os-collect-config:   File "/usr/bin/os-net-config", line 10, in <module>
Jan 11 12:03:14 dpdk-compute0 os-collect-config:     sys.exit(main())
Jan 11 12:03:14 dpdk-compute0 os-collect-config:   File "/usr/lib/python2.7/site-packages/os_net_config/cli.py", line 303, in main
Jan 11 12:03:14 dpdk-compute0 os-collect-config:     provider.add_object(obj)
Jan 11 12:03:14 dpdk-compute0 os-collect-config:   File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 70, in add_object
Jan 11 12:03:14 dpdk-compute0 os-collect-config:     self.add_object(member)
Jan 11 12:03:14 dpdk-compute0 os-collect-config:   File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 104, in add_object
Jan 11 12:03:14 dpdk-compute0 os-collect-config:     self.add_ovs_dpdk_bond(obj)
Jan 11 12:03:14 dpdk-compute0 os-collect-config:   File "/usr/lib/python2.7/site-packages/os_net_config/impl_ifcfg.py", line 920, in add_ovs_dpdk_bond
Jan 11 12:03:14 dpdk-compute0 os-collect-config:     utils.bind_dpdk_interfaces(ifname, dpdk_port.driver, self.noop)
Jan 11 12:03:14 dpdk-compute0 os-collect-config:   File "/usr/lib/python2.7/site-packages/os_net_config/utils.py", line 298, in bind_dpdk_interfaces
Jan 11 12:03:14 dpdk-compute0 os-collect-config:     raise OvsDpdkBindException(msg)
Jan 11 12:03:14 dpdk-compute0 os-collect-config: os_net_config.utils.OvsDpdkBindException: Interface nic6 cannot be found
Looking at the attached support case, I see that the NICs are not being detected correctly. The NICs p2p1 and p2p2 are detected twice, so the numbered NIC ordering skips nic6 and nic8. Those positions are taken by the duplicate entries for p2p1 and p2p2, but these NICs have already been mapped (as nic5 and nic7):

Jan 11 12:03:14 cpt0-dpdk-dell-tovb os-collect-config: [2021/01/11 11:58:07 AM] [INFO] Active nics are ['em1', 'em2', 'p1p1', 'p1p2', 'p2p1', 'p2p1', 'p2p2', 'p2p2', 'p3p1', 'p3p1', 'p3p2', 'p3p2']
Jan 11 12:03:14 cpt0-dpdk-dell-tovb os-collect-config: [2021/01/11 11:58:07 AM] [INFO] nic2 mapped to: em2
Jan 11 12:03:14 cpt0-dpdk-dell-tovb os-collect-config: [2021/01/11 11:58:07 AM] [INFO] nic3 mapped to: p1p1
Jan 11 12:03:14 cpt0-dpdk-dell-tovb os-collect-config: [2021/01/11 11:58:07 AM] [INFO] nic4 mapped to: p1p2
Jan 11 12:03:14 cpt0-dpdk-dell-tovb os-collect-config: [2021/01/11 11:58:07 AM] [INFO] nic7 mapped to: p2p2
Jan 11 12:03:14 cpt0-dpdk-dell-tovb os-collect-config: [2021/01/11 11:58:07 AM] [INFO] nic5 mapped to: p2p1
Jan 11 12:03:14 cpt0-dpdk-dell-tovb os-collect-config: [2021/01/11 11:58:07 AM] [INFO] nic1 mapped to: em1
Jan 11 12:03:14 cpt0-dpdk-dell-tovb os-collect-config: [2021/01/11 11:58:07 AM] [INFO] nic11 mapped to: p3p2
Jan 11 12:03:14 cpt0-dpdk-dell-tovb os-collect-config: [2021/01/11 11:58:07 AM] [INFO] nic9 mapped to: p3p1

In order to troubleshoot this, I need to see the NIC config templates that are being used in the stack update, as well as more information about what changes were made manually. What was the goal of the manual changes? What changes were made to the NIC config templates (or network environment files) before running a stack update with NetworkDeploymentActions set to ["CREATE","UPDATE"]?
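The logged mapping can be reproduced with a short Python sketch. It is a simplified model, not the actual os-net-config code: it assumes each numbered alias is taken from the first position of a NIC name in the active list, and number_nics is a hypothetical helper introduced only for illustration.

```python
# Active NIC list copied from the log above (note the duplicated names).
ACTIVE_NICS = ['em1', 'em2', 'p1p1', 'p1p2', 'p2p1', 'p2p1',
               'p2p2', 'p2p2', 'p3p1', 'p3p1', 'p3p2', 'p3p2']


def number_nics(active_nics):
    """Hypothetical helper: build {alias: name} from the active list.

    Assumption: the alias number is the 1-based index of the FIRST
    occurrence of the name, so duplicates leave gaps in the numbering.
    """
    mapping = {}
    for name in active_nics:
        alias = "nic%d" % (active_nics.index(name) + 1)
        mapping.setdefault(alias, name)
    return mapping


mapping = number_nics(ACTIVE_NICS)
# Aliases whose list position is occupied by a duplicate are never created.
missing = [
    "nic%d" % i
    for i in range(1, len(ACTIVE_NICS) + 1)
    if "nic%d" % i not in mapping
]
```

Under this assumption the model yields the same eight aliases as the log, and nic6 is among the aliases that are never created, which is consistent with the "Interface nic6 cannot be found" exception.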
I think I have discovered where the bug lies here. When os-net-config runs for the first time, the DPDK NICs have no entry in /sys/class/net. Since the NICs are not present there, we look at the DPDK mapping and add the NICs to the list of active NICs. When you made the LACP change and updated the stack, the DPDK NICs would have been active and would have had an entry in /sys/class/net. The NICs were added to the list of active NICs from there, but the DPDK mapping then added the same NICs to the list a second time.

To fix this, we probably have to make sure we only add each DPDK NIC to the list once. I can file an upstream bug and patch, but I don't know whether the change would be made in OSP 13, or how long it would take. It is probably best to use the following workaround instead.

My recommendation is to use real NIC names in the computeDPDK.yaml template. If the nodes do not all have the same NIC name configuration, then a mapping will have to be provided. See the file firstboot/os-net-config-mappings.yaml in the openstack-tripleo-heat-templates directory and the associated documentation for more information.
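The idea of adding each DPDK NIC only once can be sketched in Python as follows. This is an illustration under stated assumptions, not the upstream patch: merge_active_nics and alias_for are hypothetical helpers, and the two input lists are example values chosen to match the interface names in the log.

```python
def merge_active_nics(sysfs_nics, dpdk_nics):
    """Hypothetical helper: combine both sources without duplicates.

    NICs found in sysfs keep their order; a NIC from the DPDK mapping
    is appended only if it is not already in the list.
    """
    merged = list(sysfs_nics)
    for nic in dpdk_nics:
        if nic not in merged:
            merged.append(nic)
    return merged


def alias_for(active_nics, name):
    """Hypothetical helper: numbered alias (1-based) for a NIC name."""
    return "nic%d" % (active_nics.index(name) + 1)


# Example values (assumed): on a stack update the DPDK NICs show up in
# both sources, which is the case that produced duplicates.
sysfs_nics = ['em1', 'em2', 'p1p1', 'p1p2', 'p2p1', 'p2p2', 'p3p1', 'p3p2']
dpdk_nics = ['p2p1', 'p2p2', 'p3p1', 'p3p2']

active = merge_active_nics(sysfs_nics, dpdk_nics)
nic6_name = alias_for(active, 'p2p2')
```

With the duplicates removed, the numbering has no gaps and p2p2 is again the sixth NIC, matching the p2p2 record that the reporter found in dpdk_mapping.yaml in place of nic6.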
Thank you so much, Dan! We will try to explain the available options to the customer.
Dan, can you fill in the fixed in version field and link to the patch? I'll set tags appropriate for 16.1.5.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.6 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:2097