Bug 1437320 - os-net-config mapping.yaml will only map active interfaces, but no-carrier interfaces need to be mapped as well
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: os-net-config
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: z4
Target Release: 10.0 (Newton)
Assignee: Bob Fournier
QA Contact: mlammon
URL:
Whiteboard:
Duplicates: 1448233 (view as bug list)
Depends On:
Blocks: 1465595
 
Reported: 2017-03-30 04:50 UTC by Andreas Karis
Modified: 2024-01-06 04:25 UTC (History)

Fixed In Version: os-net-config-5.2.0-3.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1465595 (view as bug list)
Environment:
Last Closed: 2017-09-06 17:09:30 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 453284 0 'None' MERGED os_net_config should map nics that are down if nic is in mapping file 2021-02-20 22:10:00 UTC
OpenStack gerrit 459875 0 'None' MERGED os_net_config should map nics that are down if nic is in mapping file 2021-02-20 22:10:01 UTC
Red Hat Issue Tracker OSP-28171 0 None None None 2023-09-07 18:53:52 UTC
Red Hat Knowledge Base (Solution) 3087741 0 None None None 2020-03-25 11:10:47 UTC
Red Hat Product Errata RHBA-2017:2654 0 normal SHIPPED_LIVE Red Hat OpenStack Platform 10 director Bug Fix Advisory 2017-09-06 20:55:36 UTC

Internal Links: 1227955

Description Andreas Karis 2017-03-30 04:50:40 UTC
Description of problem:
os-net-config mapping.yaml will only map active interfaces, but no-carrier interfaces need to be mapped as well

Let's suppose that a customer has 3 mappings:
nic1 -> eno50
nic2 -> eno51
nic3 -> eno52

nic2 and nic3 are bonded together. The customer has 100 servers. On one server, eno51 goes down, but the bond remains up. Now, the customer runs a stack update, and os-net-config is triggered again. The mapping for nic2 to eno51 will fail, because the interface is no-carrier. This in turn makes os-net-config generate a bond configuration of bond1 -> nic2+eno52 (it may actually even simply crash, leaving nic2+nic3). In any case, this will create network issues.
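The failure mode described above can be modelled in a few lines of Python. This is only a sketch of the reported behaviour, not the os-net-config source; the function names `map_nics_requiring_carrier` and `resolve_members` are made up for illustration.

```python
def map_nics_requiring_carrier(interface_mapping, active_nics):
    """Simplified model of the reported behaviour: an alias is mapped
    only when its target interface currently has carrier."""
    mapped = {}
    for alias, real_name in interface_mapping.items():
        if real_name in active_nics:
            mapped[alias] = real_name
        # Otherwise the alias is left unmapped.
    return mapped


def resolve_members(members, mapped):
    """Replace aliases with real names; unmapped aliases pass through as-is."""
    return [mapped.get(name, name) for name in members]


interface_mapping = {"nic1": "eno50", "nic2": "eno51", "nic3": "eno52"}
active_nics = {"eno50", "eno52"}  # eno51 has lost carrier

mapped = map_nics_requiring_carrier(interface_mapping, active_nics)
bond_members = resolve_members(["nic2", "nic3"], mapped)
print(bond_members)  # ['nic2', 'eno52']
```

The unresolved alias `nic2` ends up in the bond definition as if it were a real device name, which is the broken configuration described above.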

Version-Release number of selected component (if applicable):
tested in OSP 7, testing tomorrow if this also applies to OSP 10 and providing more details

How reproducible:
encountered in a customer production environment

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:
os-net-config should always map interfaces from a mapping.yaml, even if the carrier is down!

Additional info:

Comment 1 Andreas Karis 2017-03-30 21:30:51 UTC
Troubleshooting os-net-config issue

Related case: 01779978 - scale down of overcloud by one node triggered a rerun of os-net-config. In combination with network issues (interfaces with no carrier) on 4 out of over 50 compute nodes, this led to those 4 compute nodes going out of service. The compute nodes had been running fine, although degraded, which the customer did not know at the time. However, the run of os-net-config and the broken remapping of interfaces due to no-carrier caused production outages. This environment was RHOSP 7, but this document tries to demonstrate that the issue still exists on OSP 10 and suggests remedies to the problem.

Problem: 
a) os-net-config is run on every stack update, even if https://access.redhat.com/solutions/2213711 is not configured
b) os-net-config interface mapping depends on the interfaces having carrier, which can lead to catastrophic failures on updates
c) We update all nodes when we add a new node and also when we remove existing nodes. There are already RFEs for this, but this issue here shows yet again how fragile this approach is. There is no need to run a full stack update when we only remove a node; on the contrary, this will create problems more often than not.

Suggestions:
a) do not run os-net-config on every stack update (every run of os-collect-config) unless `NetworkDeploymentActions: ['CREATE','UPDATE']` is set. Network reconfigurations can fail catastrophically when they are pushed at a time when administrators do not expect them.
b) os-net-config's interface mapping should not depend on carrier state. At least not when the mapping file is used!
c) On a scale down, do not run stack updates on all nodes. It's not needed.
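Suggestion a) amounts to a simple gate on the stack action. A minimal Python sketch follows, assuming the default action list is CREATE only; the function name `should_run_network_config` is hypothetical and this is not how the deployment tooling is actually implemented.

```python
def should_run_network_config(stack_action, network_deployment_actions=("CREATE",)):
    """Return True only if the current stack action was explicitly allowed.

    Assumption: the default is CREATE only, so an UPDATE triggers a network
    reconfiguration only when the operator has opted in.
    """
    allowed = {action.upper() for action in network_deployment_actions}
    return stack_action.upper() in allowed


print(should_run_network_config("UPDATE"))                        # False
print(should_run_network_config("UPDATE", ["CREATE", "UPDATE"]))  # True
```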

Verification of hypothesis:
a) On a compute in OSP 10, modify the network information. 
~~~
[root@overcloud-compute-0 ~]# ip a c dev vlan 903 172.18.0.214/24
Error: ??? prefix is expected rather than "903".
[root@overcloud-compute-0 ~]# ip a c dev vlan903 172.18.0.214/24
[root@overcloud-compute-0 ~]# ip a ls dev vlan903
12: vlan903: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether ae:c5:d0:91:bd:c9 brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.14/24 brd 172.18.0.255 scope global vlan903
       valid_lft forever preferred_lft forever
    inet 172.18.0.214/24 scope global secondary vlan903
       valid_lft forever preferred_lft forever
    inet6 fe80::acc5:d0ff:fe91:bdc9/64 scope link 
       valid_lft forever preferred_lft forever
~~~

Monitor the logs and check network reconfiguration:
~~~
[root@overcloud-compute-0 ~]# journalctl -u os-collect-config -f
[root@overcloud-compute-0 ~]# ip -o monitor
~~~

Kick off a new stack update
~~~
[stack@undercloud-1 ~]$ templates/deploy.sh 
control_scale=3, compute_scale=1, ceph_scale=0
1 nodes with profile compute won't be used for deployment now
Configuration has 1 warnings, fix them before proceeding. 
Removing the current plan files
Uploading new plan files
~~~

Indeed, os-net-config reruns:
~~~
Mar 30 20:52:24 overcloud-compute-0.localdomain os-collect-config[3360]: dib-run-parts Thu Mar 30 20:52:24 UTC 2017 20-os-apply-config completed
Mar 30 20:52:24 overcloud-compute-0.localdomain os-collect-config[3360]: dib-run-parts Thu Mar 30 20:52:24 UTC 2017 Running /usr/libexec/os-refresh-config/configure.d/20-os-net-config
Mar 30 20:52:24 overcloud-compute-0.localdomain os-collect-config[3360]: ++ os-apply-config --key os_net_config --type raw --key-default ''
Mar 30 20:52:24 overcloud-compute-0.localdomain os-collect-config[3360]: + NET_CONFIG='{"network_config": [{"dns_servers": ["192.0.2.1"], "addresses": [{"ip_netmask": "192.0.2.11/24"}], "routes": [{"ip_netmask": "169.254.169.254/32", "next_hop": "192.0.2.1"}, {"default": true, "next_hop": "192.0.2.1"}], "use_dhcp": false, "type": "interface", "name": "nic1"}, {"use_dhcp": false, "type": "ovs_bridge", "name": "br-ex", "members": [{"type": "interface", "name": "nic2", "primary": true}, {"type": "vlan", "addresses": [{"ip_netmask": "172.16.2.8/24"}], "vlan_id": 901}, {"type": "vlan", "addresses": [{"ip_netmask": "172.18.0.22/24"}], "vlan_id": 903}, {"type": "vlan", "addresses": [{"ip_netmask": "172.16.0.5/24"}], "vlan_id": 902}]}, {"type": "interface", "defroute": false, "name": "nic3", "use_dhcp": false}, {"type": "interface", "defroute": false, "name": "nic4", "use_dhcp": false}]}'
Mar 30 20:52:24 overcloud-compute-0.localdomain os-collect-config[3360]: + '[' -n '{"network_config": [{"dns_servers": ["192.0.2.1"], "addresses": [{"ip_netmask": "192.0.2.11/24"}], "routes": [{"ip_netmask": "169.254.169.254/32", "next_hop": "192.0.2.1"}, {"default": true, "next_hop": "192.0.2.1"}], "use_dhcp": false, "type": "interface", "name": "nic1"}, {"use_dhcp": false, "type": "ovs_bridge", "name": "br-ex", "members": [{"type": "interface", "name": "nic2", "primary": true}, {"type": "vlan", "addresses": [{"ip_netmask": "172.16.2.8/24"}], "vlan_id": 901}, {"type": "vlan", "addresses": [{"ip_netmask": "172.18.0.22/24"}], "vlan_id": 903}, {"type": "vlan", "addresses": [{"ip_netmask": "172.16.0.5/24"}], "vlan_id": 902}]}, {"type": "interface", "defroute": false, "name": "nic3", "use_dhcp": false}, {"type": "interface", "defroute": false, "name": "nic4", "use_dhcp": false}]}' ']'
Mar 30 20:52:24 overcloud-compute-0.localdomain os-collect-config[3360]: + trap configure_safe_defaults EXIT
Mar 30 20:52:24 overcloud-compute-0.localdomain os-collect-config[3360]: + os-net-config -c /etc/os-net-config/config.json -v --detailed-exit-codes
Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] Using config file at: /etc/os-net-config/config.json
Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] Using mapping file at: /etc/os-net-config/mapping.yaml
Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] Ifcfg net config provider created.
Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] nic5 mapped to: eth4
Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] nic4 mapped to: eth3
Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] nic3 mapped to: eth2
Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] nic2 mapped to: eth1
Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] nic1 mapped to: eth0
Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] adding interface: eth0
Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] adding custom route for interface: eth0
Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] adding bridge: br-ex
Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] adding interface: eth1
Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] adding vlan: vlan901
Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] adding vlan: vlan903
Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] adding vlan: vlan902
Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] adding interface: eth2
Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] adding interface: eth3
Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] applying network configs...
Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] No changes required for interface: eth3
Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] No changes required for interface: eth2
Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] No changes required for interface: eth1
Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] No changes required for interface: eth0
Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] No changes required for vlan interface: vlan903
Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] No changes required for vlan interface: vlan902
Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] No changes required for vlan interface: vlan901
Mar 30 20:52:25 overcloud-compute-0.localdomain os-collect-config[3360]: [2017/03/30 08:52:25 PM] [INFO] No changes required for bridge: br-ex
~~~

And in this example, os-net-config keeps the manually configured IP but also adds a new one:
~~~
[root@overcloud-compute-0 ~]# ip link ls dev vlan903
13: vlan903: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT qlen 1000
    link/ether 9a:84:24:e3:b7:f8 brd ff:ff:ff:ff:ff:ff
[root@overcloud-compute-0 ~]# ip a ls dev vlan903
13: vlan903: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 9a:84:24:e3:b7:f8 brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.22/24 brd 172.18.0.255 scope global vlan903
       valid_lft forever preferred_lft forever
    inet 172.18.0.222/24 scope global secondary vlan903
       valid_lft forever preferred_lft forever
    inet6 fe80::9884:24ff:fee3:b7f8/64 scope link 
       valid_lft forever preferred_lft forever
~~~

b) In a KVM lab with virtualized compute node:
~~~
[root@rhospbl-4 ~]# virsh domif-setlink overcloud-node4 vnet20 down
Device updated successfully
~~~

Verify in the compute:
~~~
4: eth2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT qlen 1000
    link/ether 52:54:00:24:e2:7c brd ff:ff:ff:ff:ff:ff
~~~
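The same carrier state can be read from sysfs. The sketch below assumes the standard Linux layout `/sys/class/net/<iface>/carrier`; the helper name `has_carrier` and the `sysfs_root` parameter (added so the function can be pointed at a test directory) are illustrative.

```python
import os


def has_carrier(iface, sysfs_root="/sys/class/net"):
    """Return True if the interface reports carrier, False otherwise."""
    path = os.path.join(sysfs_root, iface, "carrier")
    try:
        with open(path) as f:
            return f.read().strip() == "1"
    except OSError:
        # The interface does not exist, or it is administratively down
        # (reading "carrier" then fails with EINVAL).
        return False


if __name__ == "__main__":
    for nic in ("eth0", "eth1", "eth2", "eth3", "eth4"):
        print(nic, has_carrier(nic))
```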

Mapping file on compute
~~~
[root@overcloud-compute-0 ~]# cat /etc/os-net-config/mapping.yaml 
interface_mapping:
  nic1: eth0
  nic2: eth1 
  nic3: eth2
  nic4: eth3
  nic5: eth4
~~~

Running os-net-config in verbose, noop, with the mapping file and with carrier:
~~~
[root@overcloud-compute-0 ~]# os-net-config --noop -v -c /etc/os-net-config/config.json 
[2017/03/30 09:29:10 PM] [INFO] Using config file at: /etc/os-net-config/config.json
[2017/03/30 09:29:10 PM] [INFO] Using mapping file at: /etc/os-net-config/mapping.yaml
[2017/03/30 09:29:10 PM] [INFO] Ifcfg net config provider created.
[2017/03/30 09:29:10 PM] [INFO] nic2 mapped to: eth1
[2017/03/30 09:29:10 PM] [INFO] nic3 mapped to: eth2
[2017/03/30 09:29:10 PM] [INFO] nic1 mapped to: eth0
[2017/03/30 09:29:10 PM] [INFO] nic4 mapped to: eth3
[2017/03/30 09:29:10 PM] [INFO] nic5 mapped to: eth4
[2017/03/30 09:29:10 PM] [INFO] adding interface: eth0
[2017/03/30 09:29:10 PM] [INFO] adding custom route for interface: eth0
[2017/03/30 09:29:10 PM] [INFO] adding bridge: br-ex
[2017/03/30 09:29:10 PM] [INFO] adding interface: eth1
[2017/03/30 09:29:10 PM] [INFO] adding vlan: vlan901
[2017/03/30 09:29:10 PM] [INFO] adding vlan: vlan903
[2017/03/30 09:29:10 PM] [INFO] adding vlan: vlan902
[2017/03/30 09:29:10 PM] [INFO] adding interface: eth2
[2017/03/30 09:29:10 PM] [INFO] adding interface: eth3
[2017/03/30 09:29:10 PM] [INFO] applying network configs...
[2017/03/30 09:29:10 PM] [INFO] No changes required for interface: eth3
[2017/03/30 09:29:10 PM] [INFO] No changes required for interface: eth2
[2017/03/30 09:29:10 PM] [INFO] No changes required for interface: eth1
[2017/03/30 09:29:10 PM] [INFO] No changes required for interface: eth0
[2017/03/30 09:29:10 PM] [INFO] No changes required for vlan interface: vlan903
[2017/03/30 09:29:10 PM] [INFO] No changes required for vlan interface: vlan902
[2017/03/30 09:29:10 PM] [INFO] No changes required for vlan interface: vlan901
[2017/03/30 09:29:10 PM] [INFO] No changes required for bridge: br-ex
~~~

Running os-net-config in verbose, noop, with the mapping file and no-carrier:
~~~
[root@overcloud-compute-0 ~]# os-net-config --noop -v -c /etc/os-net-config/config.json 
[2017/03/30 09:27:52 PM] [INFO] Using config file at: /etc/os-net-config/config.json
[2017/03/30 09:27:52 PM] [INFO] Using mapping file at: /etc/os-net-config/mapping.yaml
[2017/03/30 09:27:52 PM] [INFO] Ifcfg net config provider created.
[2017/03/30 09:27:52 PM] [INFO] nic2 mapped to: eth1
[2017/03/30 09:27:52 PM] [WARNING] interface eth2 is not an active nic (eth0, eth1, eth3, eth4)
[2017/03/30 09:27:52 PM] [INFO] nic1 mapped to: eth0
[2017/03/30 09:27:52 PM] [INFO] nic4 mapped to: eth3
[2017/03/30 09:27:52 PM] [INFO] nic5 mapped to: eth4
[2017/03/30 09:27:52 PM] [INFO] adding interface: eth0
[2017/03/30 09:27:52 PM] [INFO] adding custom route for interface: eth0
[2017/03/30 09:27:52 PM] [INFO] adding bridge: br-ex
[2017/03/30 09:27:52 PM] [INFO] adding interface: eth1
[2017/03/30 09:27:52 PM] [INFO] adding vlan: vlan901
[2017/03/30 09:27:52 PM] [INFO] adding vlan: vlan903
[2017/03/30 09:27:52 PM] [INFO] adding vlan: vlan902
[2017/03/30 09:27:52 PM] [INFO] adding interface: nic3
[2017/03/30 09:27:52 PM] [INFO] adding interface: eth3
[2017/03/30 09:27:52 PM] [INFO] applying network configs...
[2017/03/30 09:27:52 PM] [INFO] No changes required for interface: eth3
[2017/03/30 09:27:52 PM] [INFO] No changes required for interface: eth1
[2017/03/30 09:27:52 PM] [INFO] No changes required for interface: eth0
[2017/03/30 09:27:52 PM] [INFO] No changes required for vlan interface: vlan903
[2017/03/30 09:27:52 PM] [INFO] No changes required for vlan interface: vlan902
[2017/03/30 09:27:52 PM] [INFO] No changes required for vlan interface: vlan901
[2017/03/30 09:27:52 PM] [INFO] No changes required for bridge: br-ex
[2017/03/30 09:27:52 PM] [INFO] NOOP: running ifdown on interface: nic3
[2017/03/30 09:27:52 PM] [INFO] NOOP: Writing config /etc/sysconfig/network-scripts/ifcfg-nic3
[2017/03/30 09:27:52 PM] [INFO] NOOP: Writing config /etc/sysconfig/network-scripts/route6-nic3
[2017/03/30 09:27:52 PM] [INFO] NOOP: Writing config /etc/sysconfig/network-scripts/route-nic3
[2017/03/30 09:27:52 PM] [INFO] NOOP: running ifup on interface: nic3
File: /etc/sysconfig/network-scripts/ifcfg-nic3

# This file is autogenerated by os-net-config
DEVICE=nic3
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
BOOTPROTO=none
DEFROUTE=no

----
File: /etc/sysconfig/network-scripts/route6-nic3


----
File: /etc/sysconfig/network-scripts/route-nic3


----
~~~

Note how having an interface in no-carrier will trigger a network update on any overcloud operation, such as a noop update, scale out or scale down.
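Before a stack update, the `--noop -v` output shown above could be scanned for the warning that marks an unmapped interface. A small Python sketch, assuming the message wording matches the log in this comment (it may differ in other versions); `inactive_mapped_interfaces` is a hypothetical helper name.

```python
import re

# Matches the warning format seen in the --noop run above.
WARNING_RE = re.compile(r"\[WARNING\] interface (\S+) is not an active nic")


def inactive_mapped_interfaces(noop_output):
    """Return interface names that os-net-config refused to map."""
    return WARNING_RE.findall(noop_output)


sample = """\
[2017/03/30 09:27:52 PM] [INFO] nic2 mapped to: eth1
[2017/03/30 09:27:52 PM] [WARNING] interface eth2 is not an active nic (eth0, eth1, eth3, eth4)
[2017/03/30 09:27:52 PM] [INFO] nic1 mapped to: eth0
"""
print(inactive_mapped_interfaces(sample))  # ['eth2']
```

A non-empty result on any node would be a reason to fix the link, or to hold off on the stack update, before os-net-config rewrites that node's configuration.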

Comment 3 Andreas Karis 2017-03-30 21:35:42 UTC
To make this clear:
- imagine an environment with 100 nodes, running totally fine, one node having an interface in a bond with an issue.

Administrator starts a scale down.

A completely unrelated node may lose its network because the nic mapping fails, since prior to the scale down one interface in a bond had no carrier.

Even if only ifdown/ifup is run, this is not acceptable - network updates should only be made when the administrator asks for them.

In the worst case, an interface flap can cause other issues (imagine a faulty driver or SFP) where the flap could lead to a network outage. Although this may be a very rare event, we have observed exactly this in a production environment.

Comment 4 Dmitry Tantsur 2017-04-03 15:10:06 UTC
Bob, mind taking a look?

Comment 8 Bob Fournier 2017-04-04 18:15:16 UTC
Proposed upstream patch is here - https://review.openstack.org/#/c/453284/

Comment 9 Andreas Karis 2017-04-06 14:30:29 UTC
Hi,

Until this is fixed, is this a safe operation to disable os-net-config? I tested this in a lab and it works - I understand the consequences, but due to the recent incident, we'd like to disable os-net-config for the duration of the scale down:

~~~
/bin/cp /usr/bin/os-net-config /usr/bin/os-net-config.orig
echo -e '#!/bin/bash\nlogger os_net_config_skipped\nexit 0' > /usr/bin/os-net-config
~~~

And afterwards, rollback:
~~~
/bin/cp -f /usr/bin/os-net-config.orig /usr/bin/os-net-config
~~~

Regards,

Andreas

Comment 10 Bob Fournier 2017-04-06 17:07:15 UTC
Andreas,

We have not tested this, but it seems like a reasonable approach and if it works in the lab it should be ok.

Comment 11 Andreas Karis 2017-04-06 17:16:24 UTC
Hi,

Thanks :-)

- Andreas

Comment 12 Bob Fournier 2017-04-25 13:32:07 UTC
Fix has been merged upstream - https://review.openstack.org/#/c/453284/

Comment 16 Bob Fournier 2017-08-17 17:34:08 UTC
*** Bug 1448233 has been marked as a duplicate of this bug. ***

Comment 18 errata-xmlrpc 2017-09-06 17:09:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2654

Comment 20 Bob Fournier 2019-08-06 14:01:09 UTC
Sergii - the bug that is addressed by this fix is when the interfaces are listed in a mapping file [1], is that the scenario you are using? If the interfaces are not in a mapping file you can create a mapping file and add them. It's not possible for os-net-config to detect that interfaces which are down should be included in a bond otherwise.

If they are in a mapping file and you are still having problems please open a new bug with a full sosreport and related package versions.

[1] https://review.opendev.org/#/c/453284/
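The rule described in this comment can be sketched in Python as follows. This is a simplified illustration, not the merged patch: `build_nic_mapping` is a made-up name, and the real ordering of active nics when no mapping file exists is more involved than a plain list.

```python
def build_nic_mapping(all_nics, active_nics, mapping_file=None):
    """Sketch of the rule: honour the mapping file regardless of carrier;
    without one, only active nics can be numbered."""
    if mapping_file:
        # Map every alias whose target device exists, even if it is down.
        return {alias: nic for alias, nic in mapping_file.items() if nic in all_nics}
    # No mapping file: aliases are assigned to active nics in order, so a
    # down interface shifts every alias that follows it.
    return {f"nic{i}": nic for i, nic in enumerate(active_nics, start=1)}


all_nics = ["eth0", "eth1", "eth2", "eth3", "eth4"]
active_nics = ["eth0", "eth1", "eth3", "eth4"]  # eth2 is no-carrier
mapping_file = {"nic1": "eth0", "nic2": "eth1", "nic3": "eth2",
                "nic4": "eth3", "nic5": "eth4"}

print(build_nic_mapping(all_nics, active_nics, mapping_file)["nic3"])  # eth2
print(build_nic_mapping(all_nics, active_nics)["nic3"])                # eth3
```

Without the mapping file, nic3 silently becomes eth3, which is why a down interface can only be placed correctly when it is listed explicitly.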

Comment 21 Red Hat Bugzilla 2024-01-06 04:25:11 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

