Bug 1761363 - cloud-init is overwriting network configuration with safe default each boot
Keywords:
Status: CLOSED DUPLICATE of bug 1760806
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: os-cloud-config
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Jay Dobies
QA Contact: nlevinki
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-10-14 09:14 UTC by Eduard Barrera
Modified: 2019-10-28 13:27 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-28 13:27:12 UTC
Target Upstream Version:
Embargoed:



Description Eduard Barrera 2019-10-14 09:14:30 UTC
Description of problem:

This problem caused a network outage on 3 compute nodes during an undercloud update (which triggers an os-update-config run on the overcloud). After the outage, the configuration file for em1 contained:

# Created by cloud-init on instance boot automatically, do not edit.
#
BOOTPROTO=dhcp
DEVICE=em1
HWADDR=XX:XX:XX:XX:XX:c0
ONBOOT=yes
TYPE=Ethernet
USERCTL=no

em1 should be part of an OVS bond.

When os-update-config was triggered, the following command failed and the configuration of all interfaces was overwritten with the safe default, causing a connectivity outage:
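
To spot this condition on other nodes, a check along the following lines may help. It is a minimal Python sketch, not part of cloud-init or os-net-config: the function names are made up for illustration, and it assumes that an overwritten file can be recognised by the cloud-init header comment combined with BOOTPROTO=dhcp, as in the file shown above.

```python
from typing import Dict

# Header comment that cloud-init writes at the top of the ifcfg file shown above.
CLOUD_INIT_MARKER = "# Created by cloud-init"


def parse_ifcfg(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines of an ifcfg file, skipping comments and blank lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"')
    return settings


def looks_overwritten(text: str) -> bool:
    """Return True if the file has the cloud-init header and plain DHCP settings.

    Assumption: this combination marks the safe default described in this bug.
    """
    settings = parse_ifcfg(text)
    return CLOUD_INIT_MARKER in text and settings.get("BOOTPROTO") == "dhcp"
```

To use it, read /etc/sysconfig/network-scripts/ifcfg-em1 into a string and pass the string to looks_overwritten().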

Oct  9 17:18:33 XXXX os-collect-config: [2019/10/09 05:18:33 PM] [INFO] Running ovs-appctl bond/set-active-slave ('bond1', 'em1')
...
...
Oct  9 17:18:33 XXXX os-collect-config: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
Oct  9 17:18:33 XXXX os-collect-config: Command: /bin/ovs-appctl bond/set-active-slave bond1 em1
Oct  9 17:18:33 XXXX os-collect-config: Exit code: 2

The cloud-init logs show the following:

2019-10-10 15:35:09,019 - cloud_config.py[DEBUG]: Merging by applying [('dict', ['replace']), ('list', []), ('str', [])]
2019-10-10 15:35:09,060 - cloud_config.py[DEBUG]: Merging by applying [('dict', ['replace']), ('list', []), ('str', [])]
2019-10-10 15:35:09,064 - handlers.py[DEBUG]: finish: init-network/consume-user-data: SUCCESS: reading and applying user-data
2019-10-10 15:35:09,064 - handlers.py[DEBUG]: start: init-network/consume-vendor-data: reading and applying vendor-data
2019-10-10 15:35:09,064 - handlers.py[DEBUG]: finish: init-network/consume-vendor-data: SUCCESS: reading and applying vendor-data


2019-10-11 08:45:55,088 - stages.py[DEBUG]: applying net config names for {'version': 1, 'config': [{'subnets': [{'type': 'dhcp'}], 'type': 'physical', 'name': 'em1', 'mac_address': '14:18:77:60:39:c0'}]}
<=========

2019-10-11 08:46:56,109 - stages.py[DEBUG]: applying net config names for {'version': 1, 'config': [{'subnets': [{'type': 'dhcp'}], 'type': 'physical', 'name': 'br-ex', 'mac_address': '14:18:77:60:39:c0'}]}
2019-10-11 08:46:56,112 - stages.py[WARNING]: Failed to rename devices: duplicate mac found! both 'em1' and 'br-ex' have mac '14:XX:XX:60:XX:c0'
<====

2019-10-11 08:46:56,146 - handlers.py[DEBUG]: start: init-network/consume-user-data: reading and applying user-data
2019-10-11 08:46:56,149 - cloud_config.py[DEBUG]: Merging by applying [('dict', ['replace']), ('list', []), ('str', [])]
2019-10-11 08:46:56,191 - cloud_config.py[DEBUG]: Merging by applying [('dict', ['replace']), ('list', []), ('str', [])]
2019-10-11 08:46:56,195 - handlers.py[DEBUG]: finish: init-network/consume-user-data: SUCCESS: reading and applying user-data
2019-10-11 08:46:56,195 - handlers.py[DEBUG]: start: init-network/consume-vendor-data: reading and applying vendor-data
2019-10-11 08:46:56,195 - handlers.py[DEBUG]: finish: init-network/consume-vendor-data: SUCCESS: reading and applying vendor-data
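
The "duplicate mac found" warning above means that two interfaces (em1 and br-ex) report the same hardware address. The following Python sketch lists such collisions on a node. It is not how cloud-init performs the check: the function names are illustrative, and it assumes the MAC of each interface can be read from /sys/class/net/<name>/address.

```python
import os
from collections import defaultdict
from typing import Dict, List


def read_sysfs_macs(root: str = "/sys/class/net") -> Dict[str, str]:
    """Map interface name to MAC address, read from sysfs."""
    macs = {}
    for name in os.listdir(root):
        try:
            with open(os.path.join(root, name, "address")) as handle:
                macs[name] = handle.read().strip()
        except OSError:
            # Entries without a readable address file are skipped.
            continue
    return macs


def find_duplicate_macs(macs_by_name: Dict[str, str]) -> Dict[str, List[str]]:
    """Return only the MACs that are used by more than one interface."""
    names_by_mac = defaultdict(list)
    for name, mac in macs_by_name.items():
        names_by_mac[mac.lower()].append(name)
    return {
        mac: sorted(names)
        for mac, names in names_by_mac.items()
        if len(names) > 1
    }
```

Calling find_duplicate_macs(read_sysfs_macs()) on an affected node would be expected to report em1 and br-ex under the same address.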


I cannot see why cloud-init behaves this way.

Version-Release number of selected component (if applicable):
OSP10

How reproducible:
each boot

Steps to Reproduce:
1. reboot

Actual results:
cloud-init applies the safe defaults and is out of sync with os-net-config.

Expected results:
cloud-init must not overwrite the network interface configuration.

Additional info:

Comment 2 Jorge Martinez Garcia 2019-10-14 09:56:27 UTC
The root cause for us was that the config drive used by cloud-init (/dev/disk/by-label/config-2) was invisible to the OS. The physical partition /dev/sda1 could not be accessed on those 3 nodes, for an unknown reason. The partition was not listed in /proc/partitions at boot, but after running partprobe /dev/sda it appeared and could be accessed again. As a result, cloud-init could not find its stored configuration.

We have not been able to determine why. We resolved it by upgrading the kernel and rebuilding the initrd and grub (just in case). Since then, cloud-init no longer overwrites /etc/sysconfig/network-scripts/ifcfg-em1, so our network configuration is consistent across reboots.
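
The two symptoms described above (missing by-label link, partition absent from the kernel's partition list) can be checked with a short script. This is a minimal Python sketch with illustrative function names. It assumes the usual four-column layout of /proc/partitions (major, minor, #blocks, name) and the config-2 label mentioned above.

```python
import os
from typing import List


def listed_partitions(proc_partitions_text: str) -> List[str]:
    """Return the device names found in the text of /proc/partitions."""
    names = []
    for line in proc_partitions_text.splitlines():
        fields = line.split()
        # Data rows have four columns and start with the numeric major number;
        # the header row and blank lines are skipped.
        if len(fields) == 4 and fields[0].isdigit():
            names.append(fields[3])
    return names


def config_drive_visible(label_path: str = "/dev/disk/by-label/config-2") -> bool:
    """Return True if the config drive label link exists."""
    return os.path.exists(label_path)
```

On an affected node, "sda1" would be missing from listed_partitions(open("/proc/partitions").read()) and config_drive_visible() would return False until partprobe /dev/sda is run.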

old kernel: kernel-3.10.0-862.14.4.el7.x86_64
new kernel: kernel-3.10.0-1062.1.2.el7.x86_64

Comment 3 Bob Fournier 2019-10-18 13:20:12 UTC
This may be related to https://bugzilla.redhat.com/show_bug.cgi?id=1760806; that bug has a suggested config to prevent cloud-init from overwriting the network config.
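
The config suggested in that bug is not reproduced here. For orientation, cloud-init itself documents a way to switch off its network configuration: a drop-in file under /etc/cloud/cloud.cfg.d containing "network: {config: disabled}". The Python sketch below writes such a file. Whether this matches the workaround tracked in bug 1760806 is an assumption, and the function name and file name are illustrative.

```python
import os

# File name and body follow the cloud-init documentation for disabling
# network configuration; they are not taken from this bug report.
DROP_IN_NAME = "99-disable-network-config.cfg"
DROP_IN_BODY = "network: {config: disabled}\n"


def write_disable_network_drop_in(cfg_dir: str = "/etc/cloud/cloud.cfg.d") -> str:
    """Write the drop-in file into cfg_dir and return its path."""
    path = os.path.join(cfg_dir, DROP_IN_NAME)
    with open(path, "w") as handle:
        handle.write(DROP_IN_BODY)
    return path
```

The function needs write access to the target directory, so on a real node it would have to run as root. The setting takes effect the next time cloud-init runs.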

Comment 4 Bob Fournier 2019-10-28 13:27:12 UTC
Marking as duplicate so we can track the workaround in one place.

*** This bug has been marked as a duplicate of bug 1760806 ***

