Description of problem:

This problem caused a network outage on 3 compute nodes during an undercloud update (which triggers an os-update-config on the overcloud). The interface configuration for em1 is generated by cloud-init with the following content:

# Created by cloud-init on instance boot automatically, do not edit.
#
BOOTPROTO=dhcp
DEVICE=em1
HWADDR=XX:XX:XX:XX:XX:c0
ONBOOT=yes
TYPE=Ethernet
USERCTL=no

em1 should be part of an ovs-bond. When os-update-config was triggered, the following command failed and all the interface configurations were replaced with the safe defaults, causing a connectivity outage:

Oct 9 17:18:33 XXXX os-collect-config: [2019/10/09 05:18:33 PM] [INFO] Running ovs-appctl bond/set-active-slave ('bond1', 'em1')
...
...
Oct 9 17:18:33 XXXX os-collect-config: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
Oct 9 17:18:33 XXXX os-collect-config: Command: /bin/ovs-appctl bond/set-active-slave bond1 em1
Oct 9 17:18:33 XXXX os-collect-config: Exit code: 2

Here are the cloud-init logs:

2019-10-10 15:35:09,019 - cloud_config.py[DEBUG]: Merging by applying [('dict', ['replace']), ('list', []), ('str', [])]
2019-10-10 15:35:09,060 - cloud_config.py[DEBUG]: Merging by applying [('dict', ['replace']), ('list', []), ('str', [])]
2019-10-10 15:35:09,064 - handlers.py[DEBUG]: finish: init-network/consume-user-data: SUCCESS: reading and applying user-data
2019-10-10 15:35:09,064 - handlers.py[DEBUG]: start: init-network/consume-vendor-data: reading and applying vendor-data
2019-10-10 15:35:09,064 - handlers.py[DEBUG]: finish: init-network/consume-vendor-data: SUCCESS: reading and applying vendor-data
2019-10-11 08:45:55,088 - stages.py[DEBUG]: applying net config names for {'version': 1, 'config': [{'subnets': [{'type': 'dhcp'}], 'type': 'physical', 'name': 'em1', 'mac_address': '14:18:77:60:39:c0'}]} <=========
2019-10-11 08:46:56,109 - stages.py[DEBUG]: applying net config names for {'version': 1, 'config': [{'subnets': [{'type': 'dhcp'}], 'type': 'physical', 'name': 'br-ex', 'mac_address': '14:18:77:60:39:c0'}]}
2019-10-11 08:46:56,112 - stages.py[WARNING]: Failed to rename devices: duplicate mac found! both 'em1' and 'br-ex' have mac '14:XX:XX:60:XX:c0' <====
2019-10-11 08:46:56,146 - handlers.py[DEBUG]: start: init-network/consume-user-data: reading and applying user-data
2019-10-11 08:46:56,149 - cloud_config.py[DEBUG]: Merging by applying [('dict', ['replace']), ('list', []), ('str', [])]
2019-10-11 08:46:56,191 - cloud_config.py[DEBUG]: Merging by applying [('dict', ['replace']), ('list', []), ('str', [])]
2019-10-11 08:46:56,195 - handlers.py[DEBUG]: finish: init-network/consume-user-data: SUCCESS: reading and applying user-data
2019-10-11 08:46:56,195 - handlers.py[DEBUG]: start: init-network/consume-vendor-data: reading and applying vendor-data
2019-10-11 08:46:56,195 - handlers.py[DEBUG]: finish: init-network/consume-vendor-data: SUCCESS: reading and applying vendor-data

I can't see why cloud-init behaves this way.

Version-Release number of selected component (if applicable):
OSP10

How reproducible:
On every boot

Steps to Reproduce:
1. Reboot the node

Actual results:
cloud-init applies the safe defaults and is out of sync with os-net-config.

Expected results:
cloud-init must not overwrite the network interface configuration.

Additional info:
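The "duplicate mac found" warning in the log above can be checked independently of cloud-init. The following is a minimal Python sketch, not cloud-init's own logic: the function name duplicate_macs and the input mapping are hypothetical, and the mapping is assumed to have been collected beforehand (for example from /sys/class/net/<iface>/address).

```python
from collections import defaultdict


def duplicate_macs(iface_to_mac):
    """Group interface names by MAC address and return only shared MACs.

    iface_to_mac is a hypothetical mapping such as {"em1": "14:18:..."}.
    MAC addresses are compared case-insensitively.
    """
    by_mac = defaultdict(list)
    for iface, mac in iface_to_mac.items():
        by_mac[mac.lower()].append(iface)
    # Keep only MACs that are carried by more than one interface.
    return {mac: sorted(names) for mac, names in by_mac.items() if len(names) > 1}
```

With the values from the log, em1 and br-ex would be reported under the same MAC, which matches the situation where a bridge inherits the address of its member interface.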
Root cause for us was that the config-drive used by cloud-init (/dev/disk/by-label/config-2) was invisible to the OS. The physical partition /dev/sda1 could not be accessed, for unknown reasons, on those 3 nodes. The partition was not listed in /proc/partitions at boot, but after running partprobe /dev/sda it appeared and could be accessed again. As a result, cloud-init was missing its stored configuration.

We haven't been able to determine why. We resolved it by upgrading the kernel and rebuilding the initrd and grub (just in case). After that, cloud-init no longer overwrites /etc/sysconfig/network-scripts/ifcfg-em1, so our network configuration is now consistent across reboots.

old kernel: kernel-3.10.0-862.14.4.el7.x86_64
new kernel: kernel-3.10.0-1062.1.2.el7.x86_64
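To check other nodes for the same symptom, the two conditions described above (partition missing from /proc/partitions, config-2 label not present) can be tested with a short script. This is a minimal Python sketch under stated assumptions: the function names are hypothetical, and the defaults sda1 and /dev/disk/by-label/config-2 are taken from this report and may differ on other systems.

```python
import os


def listed_partitions(proc_partitions_text):
    """Return the device names found in /proc/partitions-style text."""
    names = set()
    for line in proc_partitions_text.splitlines():
        fields = line.split()
        # Data rows have four columns: major, minor, #blocks, name.
        # The header row is skipped because its first field is not numeric.
        if len(fields) == 4 and fields[0].isdigit():
            names.add(fields[3])
    return names


def config_drive_status(proc_partitions_text, partition="sda1",
                        label_path="/dev/disk/by-label/config-2"):
    """Report whether the partition is listed and the label path exists."""
    return {
        "partition_listed": partition in listed_partitions(proc_partitions_text),
        "label_present": os.path.exists(label_path),
    }
```

On a node, the text would come from reading /proc/partitions. If both values are False right after boot and become True after partprobe /dev/sda, the node is likely showing the behaviour described in this comment.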
This may be related to https://bugzilla.redhat.com/show_bug.cgi?id=1760806. That bug includes a suggested config to prevent cloud-init from overwriting the network config.
Marking as duplicate so we can track the workaround in one place. *** This bug has been marked as a duplicate of bug 1760806 ***