Description of problem:
os-net-config only maps active interfaces from mapping.yaml, but interfaces without carrier (NO-CARRIER) need to be mapped as well.

Suppose a customer has 3 mappings:
nic1 -> eno50
nic2 -> eno51
nic3 -> eno52

nic2 and nic3 are bonded together. The customer has 100 servers. On one server, eno51 goes down, but the bond remains up. Now the customer runs a stack update, and os-net-config is triggered again. The mapping of nic2 to eno51 fails because the interface has no carrier. This in turn makes os-net-config generate a bond configuration of bond1 -> nic2+eno52 (it may even simply crash, leaving nic2+nic3). In either case, this will create network issues.

Version-Release number of selected component (if applicable):
Tested in OSP 7; testing tomorrow whether this also applies to OSP 10, and will provide more details.

How reproducible:
Encountered in a customer production environment.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:
os-net-config should always map interfaces from mapping.yaml, even if the carrier is down!

Additional info:
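The failure mode described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not os-net-config's actual code: `resolve_mapping`, `bond_members` and the `require_carrier` flag are hypothetical names, and the sketch assumes that an alias which cannot be resolved is passed through unchanged.

```python
def resolve_mapping(mapping, active_nics, require_carrier=True):
    """Return {alias: real_name} for the aliases that can be resolved.

    mapping         -- alias -> interface name, as in mapping.yaml
    active_nics     -- names of interfaces that currently have carrier
    require_carrier -- True models the reported behaviour (hypothetical
                       flag, not an os-net-config option)
    """
    resolved = {}
    for alias, real_name in mapping.items():
        if require_carrier and real_name not in active_nics:
            continue  # mapping is skipped for a no-carrier interface
        resolved[alias] = real_name
    return resolved


def bond_members(aliases, resolved):
    # Assumption: aliases that could not be resolved are kept as-is.
    return [resolved.get(alias, alias) for alias in aliases]


mapping = {"nic1": "eno50", "nic2": "eno51", "nic3": "eno52"}
active = {"eno50", "eno52"}  # eno51 has lost carrier

broken = bond_members(["nic2", "nic3"], resolve_mapping(mapping, active))
expected = bond_members(
    ["nic2", "nic3"], resolve_mapping(mapping, active, require_carrier=False)
)
print(broken)    # ['nic2', 'eno52']
print(expected)  # ['eno51', 'eno52']
```

With the carrier requirement in place the bond ends up as nic2+eno52; without it, the mapping file is honoured and the bond stays eno51+eno52.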
Troubleshooting os-net-config issue

Related case: 01779978 - a scale down of the overcloud by one node triggered a rerun of os-net-config. In combination with network issues (interfaces with no carrier) on 4 out of over 50 compute nodes, this led to the 4 compute nodes going out of service. The compute nodes were running fine, yet degraded, which was not known to the customer at the time. However, the run of os-net-config and the broken remapping of interfaces due to no-carrier caused production outages. This environment was RHOSP 7, but this document tries to demonstrate that the issue still exists on OSP 10 and suggests remedies to the problem.

Problem:
a) os-net-config is run on every stack update, even if https://access.redhat.com/solutions/2213711 is not configured
b) os-net-config interface mapping depends on the interfaces having carrier, which can lead to catastrophic failures on updates
c) We update all nodes when we add a new node and also when we remove existing nodes. There are already RFEs for this, but this issue shows yet again how fragile this approach is. There is no need to run a full stack update when we only remove a node; on the contrary, this will create problems more often than not.

Suggestions:
a) Do not run os-net-config on every stack update (every run of os-collect-config) unless `NetworkDeploymentActions: ['CREATE','UPDATE']` is set. Network reconfigurations can fail catastrophically when we push them at a time when administrators do not expect them to take place
b) os-net-config's interface mapping should not depend on carrier. At least not when the mapping file is used!
c) On a scale down, do not run stack updates on all nodes. It's not needed.

Verification of hypothesis:
a) On a compute in OSP 10, modify the network information.
~~~
[root@overcloud-compute-0 ~]# ip a c dev vlan 903 172.18.0.214/24
Error: ??? prefix is expected rather than "903".
[root@overcloud-compute-0 ~]# ip a c dev vlan903 172.18.0.214/24
[root@overcloud-compute-0 ~]# ip a ls dev vlan903
12: vlan903: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether ae:c5:d0:91:bd:c9 brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.14/24 brd 172.18.0.255 scope global vlan903
       valid_lft forever preferred_lft forever
    inet 172.18.0.214/24 scope global secondary vlan903
       valid_lft forever preferred_lft forever
    inet6 fe80::acc5:d0ff:fe91:bdc9/64 scope link
       valid_lft forever preferred_lft forever
~~~
Monitor the logs and check network reconfiguration:
~~~
[root@overcloud-compute-0 ~]# journalctl -u os-collect-config -f
[root@overcloud-compute-0 ~]# ip -o monitor
~~~
Kick off a new stack update
~~~
[stack@undercloud-1 ~]$ templates/deploy.sh control_scale=3, compute_scale=1, ceph_scale=0
1 nodes with profile compute won't be used for deployment now
Configuration has 1 warnings, fix them before proceeding.
Removing the current plan files
Uploading new plan files
~~~
Indeed, os-net-config reruns:
~~~
Mar 30 20:52:24 overcloud-compute-0.localdomain os-collect-config[3360]: dib-run-parts Thu Mar 30 20:52:24 UTC 2017 20-os-apply-config completed
Mar 30 20:52:24 overcloud-compute-0.localdomain os-collect-config[3360]: dib-run-parts Thu Mar 30 20:52:24 UTC 2017 Running /usr/libexec/os-refresh-config/configure.d/20-os-net-config
Mar 30 20:52:24 overcloud-compute-0.localdomain os-collect-config[3360]: ++ os-apply-config --key os_net_config --type raw --key-default ''
Mar 30 20:52:24 overcloud-compute-0.localdomain os-collect-config[3360]: + NET_CONFIG='{"network_config": [{"dns_servers": ["192.0.2.1"], "addresses": [{"ip_netmask": "192.0.2.11/24"}], "routes": [{"ip_netmask": "169.254.169.254/32", "next_hop": "192.0.2.1"}, {"default": true, "next_hop": "192.0.2.1"}], "use_dhcp": false, "type": "interface", "name": "nic1"}, {"use_dhcp": false, "type": "ovs_bridge", "name": "br-ex", "members": [{"type": "interface", "name": "nic2", "primary": 
true}, {"type": "vlan", "addresses": [{"ip_netmask": "172.16.2.8/24"}], "vlan_id": 901}, {"type": "vlan", "addresses": [{"ip_netmask": "172.18.0.22/24"}], "vlan_id": 903}, {"type": "vlan", "addresses": [{"ip_netmask": "172.16.0.5/24"}], "vlan_id": 902}]}, {"type": "interface", "defroute": false, "name": "nic3", "use_dhcp": false}, {"type": "interface", "defroute": false, "name": "nic4", "use_dhcp": false}]}' Mar 30 20:52:24 overcloud-compute-0.localdomain os-collect-config[3360]: + '[' -n '{"network_config": [{"dns_servers": ["192.0.2.1"], "addresses": [{"ip_netmask": "192.0.2.11/24"}], "routes": [{"ip_netmask": "169.254.169.254/32", "next_hop": "192.0.2.1"}, {"default": true, "next_hop": "192.0.2.1"}], "use_dhcp": false, "type": "interface", "name": "nic1"}, {"use_dhcp": false, "type": "ovs_bridge", "name": "br-ex", "members": [{"type": "interface", "name": "nic2", "primary": true}, {"type": "vlan", "addresses": [{"ip_netmask": "172.16.2.8/24"}], "vlan_id": 901}, {"type": "vlan", "addresses": [{"ip_netmask": "172.18.0.22/24"}], "vlan_id": 903}, {"type": "vlan", "addresses": [{"ip_netmask": "172.16.0.5/24"}], "vlan_id": 902}]}, {"type": "interface", "defroute": false, "name": "nic3", "use_dhcp": false}, {"type": "interface", "defroute": false, "name": "nic4", "use_dhcp": false}]}' ']' Mar 30 20:52:24 overcloud-compute-0.localdomain os-collect-config[3360]: + trap configure_safe_defaults EXIT Mar 30 20:52:24 overcloud-compute-0.localdomain os-collect-config[3360]: + os-net-config -c /etc/os-net-config/config.json -v --detailed-exit-codes Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] Using config file at: /etc/os-net-config/config.json Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] Using mapping file at: /etc/os-net-config/mapping.yaml Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] Ifcfg net 
config provider created. Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] nic5 mapped to: eth4 Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] nic4 mapped to: eth3 Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] nic3 mapped to: eth2 Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] nic2 mapped to: eth1 Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] nic1 mapped to: eth0 Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] adding interface: eth0 Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] adding custom route for interface: eth0 Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] adding bridge: br-ex Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] adding interface: eth1 Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] adding vlan: vlan901 Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] adding vlan: vlan903 Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] adding vlan: vlan902 Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] adding interface: eth2 Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] adding interface: eth3 Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] applying network configs... 
Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] No changes required for interface: eth3 Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] No changes required for interface: eth2 Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] No changes required for interface: eth1 Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] No changes required for interface: eth0 Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] No changes required for vlan interface: vlan903 Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] No changes required for vlan interface: vlan902 Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] No changes required for vlan interface: vlan901 Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] No changes required for bridge: br-ex ~~~ And in this example keeps the manually configured IP but also adds a new one: ~~~ [root@overcloud-compute-0 ~]# ip link ls dev vlan903 13: vlan903: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT qlen 1000 link/ether 9a:84:24:e3:b7:f8 brd ff:ff:ff:ff:ff:ff [root@overcloud-compute-0 ~]# ip a ls dev vlan903 13: vlan903: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000 link/ether 9a:84:24:e3:b7:f8 brd ff:ff:ff:ff:ff:ff inet 172.18.0.22/24 brd 172.18.0.255 scope global vlan903 valid_lft forever preferred_lft forever inet 172.18.0.222/24 scope global secondary vlan903 valid_lft forever preferred_lft forever inet6 fe80::9884:24ff:fee3:b7f8/64 scope link valid_lft forever preferred_lft forever ~~~ b) In a KVM lab with virtualized compute 
node:
~~~
[root@rhospbl-4 ~]# virsh domif-setlink overcloud-node4 vnet20 down
Device updated successfully
~~~
Verify in the compute:
~~~
4: eth2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT qlen 1000
    link/ether 52:54:00:24:e2:7c brd ff:ff:ff:ff:ff:ff
~~~
Mapping file on compute
~~~
[root@overcloud-compute-0 ~]# cat /etc/os-net-config/mapping.yaml
interface_mapping:
  nic1: eth0
  nic2: eth1
  nic3: eth2
  nic4: eth3
  nic5: eth4
~~~
Running os-net-config in verbose, noop, with the mapping file and with carrier:
~~~
[root@overcloud-compute-0 ~]# os-net-config --noop -v -c /etc/os-net-config/config.json
[2017/03/30 09:29:10 PM] [INFO] Using config file at: /etc/os-net-config/config.json
[2017/03/30 09:29:10 PM] [INFO] Using mapping file at: /etc/os-net-config/mapping.yaml
[2017/03/30 09:29:10 PM] [INFO] Ifcfg net config provider created.
[2017/03/30 09:29:10 PM] [INFO] nic2 mapped to: eth1
[2017/03/30 09:29:10 PM] [INFO] nic3 mapped to: eth2
[2017/03/30 09:29:10 PM] [INFO] nic1 mapped to: eth0
[2017/03/30 09:29:10 PM] [INFO] nic4 mapped to: eth3
[2017/03/30 09:29:10 PM] [INFO] nic5 mapped to: eth4
[2017/03/30 09:29:10 PM] [INFO] adding interface: eth0
[2017/03/30 09:29:10 PM] [INFO] adding custom route for interface: eth0
[2017/03/30 09:29:10 PM] [INFO] adding bridge: br-ex
[2017/03/30 09:29:10 PM] [INFO] adding interface: eth1
[2017/03/30 09:29:10 PM] [INFO] adding vlan: vlan901
[2017/03/30 09:29:10 PM] [INFO] adding vlan: vlan903
[2017/03/30 09:29:10 PM] [INFO] adding vlan: vlan902
[2017/03/30 09:29:10 PM] [INFO] adding interface: eth2
[2017/03/30 09:29:10 PM] [INFO] adding interface: eth3
[2017/03/30 09:29:10 PM] [INFO] applying network configs...
[2017/03/30 09:29:10 PM] [INFO] No changes required for interface: eth3
[2017/03/30 09:29:10 PM] [INFO] No changes required for interface: eth2
[2017/03/30 09:29:10 PM] [INFO] No changes required for interface: eth1
[2017/03/30 09:29:10 PM] [INFO] No changes required for interface: eth0
[2017/03/30 09:29:10 PM] [INFO] No changes required for vlan interface: vlan903
[2017/03/30 09:29:10 PM] [INFO] No changes required for vlan interface: vlan902
[2017/03/30 09:29:10 PM] [INFO] No changes required for vlan interface: vlan901
[2017/03/30 09:29:10 PM] [INFO] No changes required for bridge: br-ex
~~~
Running os-net-config in verbose, noop, with the mapping file and no-carrier:
~~~
[root@overcloud-compute-0 ~]# os-net-config --noop -v -c /etc/os-net-config/config.json
[2017/03/30 09:27:52 PM] [INFO] Using config file at: /etc/os-net-config/config.json
[2017/03/30 09:27:52 PM] [INFO] Using mapping file at: /etc/os-net-config/mapping.yaml
[2017/03/30 09:27:52 PM] [INFO] Ifcfg net config provider created.
[2017/03/30 09:27:52 PM] [INFO] nic2 mapped to: eth1
[2017/03/30 09:27:52 PM] [WARNING] interface eth2 is not an active nic (eth0, eth1, eth3, eth4)
[2017/03/30 09:27:52 PM] [INFO] nic1 mapped to: eth0
[2017/03/30 09:27:52 PM] [INFO] nic4 mapped to: eth3
[2017/03/30 09:27:52 PM] [INFO] nic5 mapped to: eth4
[2017/03/30 09:27:52 PM] [INFO] adding interface: eth0
[2017/03/30 09:27:52 PM] [INFO] adding custom route for interface: eth0
[2017/03/30 09:27:52 PM] [INFO] adding bridge: br-ex
[2017/03/30 09:27:52 PM] [INFO] adding interface: eth1
[2017/03/30 09:27:52 PM] [INFO] adding vlan: vlan901
[2017/03/30 09:27:52 PM] [INFO] adding vlan: vlan903
[2017/03/30 09:27:52 PM] [INFO] adding vlan: vlan902
[2017/03/30 09:27:52 PM] [INFO] adding interface: nic3
[2017/03/30 09:27:52 PM] [INFO] adding interface: eth3
[2017/03/30 09:27:52 PM] [INFO] applying network configs...
[2017/03/30 09:27:52 PM] [INFO] No changes required for interface: eth3
[2017/03/30 09:27:52 PM] [INFO] No changes required for interface: eth1
[2017/03/30 09:27:52 PM] [INFO] No changes required for interface: eth0
[2017/03/30 09:27:52 PM] [INFO] No changes required for vlan interface: vlan903
[2017/03/30 09:27:52 PM] [INFO] No changes required for vlan interface: vlan902
[2017/03/30 09:27:52 PM] [INFO] No changes required for vlan interface: vlan901
[2017/03/30 09:27:52 PM] [INFO] No changes required for bridge: br-ex
[2017/03/30 09:27:52 PM] [INFO] NOOP: running ifdown on interface: nic3
[2017/03/30 09:27:52 PM] [INFO] NOOP: Writing config /etc/sysconfig/network-scripts/ifcfg-nic3
[2017/03/30 09:27:52 PM] [INFO] NOOP: Writing config /etc/sysconfig/network-scripts/route6-nic3
[2017/03/30 09:27:52 PM] [INFO] NOOP: Writing config /etc/sysconfig/network-scripts/route-nic3
[2017/03/30 09:27:52 PM] [INFO] NOOP: running ifup on interface: nic3
File: /etc/sysconfig/network-scripts/ifcfg-nic3
# This file is autogenerated by os-net-config
DEVICE=nic3
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
BOOTPROTO=none
DEFROUTE=no
----
File: /etc/sysconfig/network-scripts/route6-nic3
----
File: /etc/sysconfig/network-scripts/route-nic3
----
~~~
Note how a single interface in no-carrier state triggers a network update on any overcloud operation, such as a noop update, scale out or scale down.
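A `--noop -v` run like the one above can be screened automatically before a stack update. The following is a minimal sketch, assuming the log wording quoted in this report (other os-net-config versions may phrase these messages differently); `summarize_noop_run` is a hypothetical helper, not part of os-net-config.

```python
import re

# Patterns taken from the log lines quoted in this report.
INACTIVE_RE = re.compile(r"\[WARNING\] interface (\S+) is not an active nic")
NOOP_MARKER = "[INFO] NOOP: "


def summarize_noop_run(output):
    """Return (inactive_interfaces, pending_actions) from --noop -v output."""
    inactive = []
    pending = []
    for line in output.splitlines():
        match = INACTIVE_RE.search(line)
        if match:
            inactive.append(match.group(1))
        elif NOOP_MARKER in line:
            # Everything after the marker describes an action a real run would take.
            pending.append(line.split(NOOP_MARKER, 1)[1].strip())
    return inactive, pending


sample = """\
[2017/03/30 09:27:52 PM] [INFO] nic2 mapped to: eth1
[2017/03/30 09:27:52 PM] [WARNING] interface eth2 is not an active nic (eth0, eth1, eth3, eth4)
[2017/03/30 09:27:52 PM] [INFO] No changes required for bridge: br-ex
[2017/03/30 09:27:52 PM] [INFO] NOOP: running ifdown on interface: nic3
[2017/03/30 09:27:52 PM] [INFO] NOOP: running ifup on interface: nic3
"""

inactive, pending = summarize_noop_run(sample)
print(inactive)  # ['eth2']
print(pending)   # ['running ifdown on interface: nic3', 'running ifup on interface: nic3']
```

A non-empty result on any node would be a reason to investigate that node before starting the overcloud operation.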
To make this clear: imagine an environment with 100 nodes running totally fine, where one node has an issue with one interface in a bond. An administrator starts a scale down. A node completely unrelated to the scale down may lose its network because the nic mapping fails, since one interface in its bond had no carrier before the scale down. Even if only ifdown/ifup is run, this is not acceptable - network updates should only be made when the administrator asks for them. In the worst case, an interface flap can cause other issues (imagine a faulty driver or SFP) where the flap could lead to a network outage. Although this may be a very rare event, we have observed exactly this in a production environment.
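One way to find such degraded nodes before an overcloud operation is to check the carrier state of every mapped interface. The sketch below reads the Linux sysfs attribute `/sys/class/net/<interface>/carrier`; `interfaces_without_carrier` and its `sysfs_root` parameter are hypothetical names introduced for illustration, and the demo runs against a fake directory tree rather than a real host.

```python
import os
import tempfile


def interfaces_without_carrier(names, sysfs_root="/sys/class/net"):
    """Return the interfaces from `names` that do not report carrier."""
    down = []
    for name in names:
        path = os.path.join(sysfs_root, name, "carrier")
        try:
            with open(path) as handle:
                has_carrier = handle.read().strip() == "1"
        except OSError:
            # Interface is missing, or administratively down (the read fails).
            has_carrier = False
        if not has_carrier:
            down.append(name)
    return down


# Demo against a fake sysfs tree so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as root:
    for name, state in (("eth1", "1"), ("eth2", "0")):
        os.makedirs(os.path.join(root, name))
        with open(os.path.join(root, name, "carrier"), "w") as handle:
            handle.write(state + "\n")
    demo_result = interfaces_without_carrier(["eth1", "eth2", "eth9"], sysfs_root=root)

print(demo_result)  # ['eth2', 'eth9']
```

On a real node the function would be called with the interface names from /etc/os-net-config/mapping.yaml and the default `sysfs_root`.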
Bob, mind taking a look?
Proposed upstream patch is here - https://review.openstack.org/#/c/453284/
Hi,

Until this is fixed, is it safe to disable os-net-config as follows? I tested this in a lab and it works. I understand the consequences, but due to the recent incident, we'd like to disable os-net-config for the duration of the scale down:
~~~
/bin/cp /usr/bin/os-net-config /usr/bin/os-net-config.orig
echo -e '#!/bin/bash\nlogger os_net_config_skipped\nexit 0' > /usr/bin/os-net-config
~~~
And afterwards, roll back:
~~~
/bin/cp -f /usr/bin/os-net-config.orig /usr/bin/os-net-config
~~~
Regards,
Andreas
Andreas, we have not tested this, but it seems like a reasonable approach, and if it works in the lab it should be OK.
Hi, Thanks :-) - Andreas
Fix has been merged upstream - https://review.openstack.org/#/c/453284/
*** Bug 1448233 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2654
Sergii - the bug addressed by this fix occurs when the interfaces are listed in a mapping file [1]. Is that the scenario you are using? If the interfaces are not in a mapping file, you can create one and add them. Otherwise, it is not possible for os-net-config to detect that interfaces which are down should be included in a bond. If they are in a mapping file and you are still having problems, please open a new bug with a full sosreport and related package versions. [1] https://review.opendev.org/#/c/453284/
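For reference, a mapping file in the format shown earlier in this report can be generated from an ordered list of interface names. This is a minimal sketch; `render_mapping` is a hypothetical helper, and the order of the list (which interface becomes nic1, nic2, ...) is an assumption the operator has to decide per hardware type.

```python
def render_mapping(interfaces):
    """Render mapping.yaml content: nic1..nicN in the given order."""
    lines = ["interface_mapping:"]
    for index, name in enumerate(interfaces, start=1):
        lines.append(f"  nic{index}: {name}")
    return "\n".join(lines) + "\n"


text = render_mapping(["eth0", "eth1", "eth2"])
print(text, end="")
```

The output would then be placed in /etc/os-net-config/mapping.yaml, the path that os-net-config reports in the logs above.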
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days