Description of problem:
Disabling cloud-init network configuration post deployment - any risks?

A customer ran into the following issue. On one of their controllers, after a reboot, they saw:

[root@overcloud-controller-0 network-scripts]# cat ifcfg-em1
# Created by cloud-init on instance boot automatically, do not edit.
#
BOOTPROTO=dhcp
DEVICE=em1
HWADDR=e4:43:4b:5a:e4:10
MTU=1500
ONBOOT=yes
TYPE=Ethernet
USERCTL=no
[root@overcloud-controller-0 network-scripts]# ls -l ifcfg-em1
-rw-r--r--. 1 root root 167 Sep 17 12:45 ifcfg-em1
[root@overcloud-controller-0 network-scripts]#

While on controller 2, everything is o.k.:

[root@overcloud-controller-2 network-scripts]# cat ifcfg-em1
# This file is autogenerated by os-net-config
DEVICE=em1
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
BOOTPROTO=none
MTU=9000
[root@overcloud-controller-2 network-scripts]# ls -l ifcfg-em1
-rw-r--r--. 1 root root 131 Sep 11 10:49 ifcfg-em1

This does look similar to https://bugzilla.redhat.com/show_bug.cgi?id=1593010
However, the issue happened only once and we could not get to the bottom of it.

In order to restore customer confidence, I suggested the following workaround to be applied on all overcloud nodes:

echo "network: {config: disabled}" > /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg

[root@computeovsdpdk-1 ~]# ls /etc/cloud/cloud.cfg.d/
05_logging.cfg 10_etc_hosts.cfg 99-disable-network-config.cfg README
[root@computeovsdpdk-1 ~]#

From my tests, I can see that the file persists and disables cloud-init's network configuration, no matter what happens, thus preventing whatever unidentified issue happened on the customer's controller.

I'd like to know if there's any objection to this from engineering's side or if I can provide a support exception for this.
RCA:
We can be relatively sure that something deleted or emptied the file /var/lib/cloud/data/instance-id
This caused a rerun of cloud-init, which reconfigured ifcfg-em1.

A workaround is:

echo "network: {config: disabled}" > /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg

I could not find the reason why /var/lib/cloud/data/instance-id was empty and thus would like to tell the customer that they can push the workaround, just in case.
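For anyone rolling the workaround out to many nodes, here is a minimal sketch of an idempotent version, together with a check for the empty instance-id condition named in the RCA. The function names and the optional path arguments are hypothetical and exist only to make the sketch testable; the default paths are the ones quoted in this report.

```shell
#!/bin/sh
# Sketch only: write the workaround file unless it already has the expected content.
# $1 is a hypothetical override of the config directory (default taken from this report).
disable_cloud_init_network() {
    cfg_dir="${1:-/etc/cloud/cloud.cfg.d}"
    cfg_file="$cfg_dir/99-disable-network-config.cfg"
    wanted='network: {config: disabled}'
    # Nothing to do if the file already contains exactly the workaround line.
    if [ -f "$cfg_file" ] && [ "$(cat "$cfg_file")" = "$wanted" ]; then
        return 0
    fi
    printf '%s\n' "$wanted" > "$cfg_file"
}

# Returns 0 when the instance-id file is missing or empty, i.e. the state
# the RCA identifies as triggering a cloud-init rerun.
# $1 is a hypothetical override of the file path.
instance_id_missing_or_empty() {
    id_file="${1:-/var/lib/cloud/data/instance-id}"
    [ ! -s "$id_file" ]
}
```

Calling `disable_cloud_init_network` with no arguments as root should have the same effect as the echo command in the RCA; `instance_id_missing_or_empty && echo "instance-id lost"` could be used to flag affected nodes.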
And here's a test of my workaround with that scenario (/var/lib/cloud/data/instance-id empty):

[root@computeovsdpdk-1 ~]# cat /etc/sysconfig/network-scripts/ifcfg-em1
# This file is autogenerated by os-net-config
DEVICE=em1
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
BOOTPROTO=static
IPADDR=192.168.24.26
NETMASK=255.255.255.0
DNS1=10.11.5.4
DNS2=10.11.5.3
[root@computeovsdpdk-1 ~]# echo "network: {config: disabled}" > /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg
(reverse-i-search)`> /va': ^C/var/lib/cloud/data/instance-id
[root@computeovsdpdk-1 ~]# > /var/lib/cloud/data/instance-id
[root@computeovsdpdk-1 ~]# reboot
Connection to 192.168.24.26 closed by remote host.
Connection to 192.168.24.26 closed.

After the system comes up again, we can see that the file was not changed by cloud-init:

[root@computeovsdpdk-1 ~]# cat /etc/sysconfig/network-scripts/ifcfg-em1
# This file is autogenerated by os-net-config
DEVICE=em1
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
BOOTPROTO=static
IPADDR=192.168.24.26
NETMASK=255.255.255.0
DNS1=10.11.5.4
DNS2=10.11.5.3
[root@computeovsdpdk-1 ~]# cat /var/lib/cloud/data/previous-instance-id
33badfca-5060-4609-91af-65b3d0066f20
[root@computeovsdpdk-1 ~]# egrep 'Read 0 bytes from /var/lib/cloud/data/instance-id|ifcfg-em1' /var/log/cloud-init.log
2019-10-09 09:09:35,298 - util.py[DEBUG]: Writing to /etc/sysconfig/network-scripts/ifcfg-em1 - wb: [644] 167 bytes
2019-10-09 09:09:35,299 - util.py[DEBUG]: Restoring selinux mode for /etc/sysconfig/network-scripts/ifcfg-em1 (recursive=False)
2019-10-09 09:09:35,299 - util.py[DEBUG]: Restoring selinux mode for /etc/sysconfig/network-scripts/ifcfg-em1 (recursive=False)
2019-10-10 13:25:38,568 - util.py[DEBUG]: Read 0 bytes from /var/lib/cloud/data/instance-id
2019-10-10 13:25:38,616 - util.py[DEBUG]: Read 0 bytes from /var/lib/cloud/data/instance-id <-------------- I reproduced this issue before
2019-10-10 13:25:38,730 - util.py[DEBUG]: Writing to /etc/sysconfig/network-scripts/ifcfg-em1 - wb: [644] 167 bytes <-------------- this is when I do not disable network config
2019-10-10 13:25:38,731 - util.py[DEBUG]: Restoring selinux mode for /etc/sysconfig/network-scripts/ifcfg-em1 (recursive=False)
2019-10-10 13:25:38,731 - util.py[DEBUG]: Restoring selinux mode for /etc/sysconfig/network-scripts/ifcfg-em1 (recursive=False)
2019-10-10 14:20:18,047 - util.py[DEBUG]: Read 0 bytes from /var/lib/cloud/data/instance-id
2019-10-10 14:20:18,117 - util.py[DEBUG]: Read 0 bytes from /var/lib/cloud/data/instance-id <-------------- this is with the workaround
[root@computeovsdpdk-1 ~]#
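The log check in this test can be automated. The following is a rough sketch, not a supported tool: it reports whether an ifcfg write follows the most recent "Read 0 bytes" message for instance-id in cloud-init.log. The function name is hypothetical, and the heuristic assumes the two message formats stay exactly as they appear in the log excerpt in this comment.

```shell
#!/bin/sh
# Sketch only: exit 0 if cloud-init wrote an ifcfg file after the most recent
# empty instance-id read, exit 1 otherwise.
# $1 is a hypothetical override of the log path.
cloud_init_rewrote_ifcfg() {
    log_file="${1:-/var/log/cloud-init.log}"
    awk '
        # Each empty read starts a new observation window.
        /Read 0 bytes from \/var\/lib\/cloud\/data\/instance-id/ { empty_seen = 1; rewrote = 0 }
        # An ifcfg write inside the window means the network config was regenerated.
        empty_seen && /Writing to \/etc\/sysconfig\/network-scripts\/ifcfg-/ { rewrote = 1 }
        END { exit (rewrote ? 0 : 1) }
    ' "$log_file"
}
```

Against the log shown in this comment, the function would be expected to return non-zero, because the last empty read (the boot with the workaround in place) is not followed by a write to ifcfg-em1.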
This isn't really a HardProv issue per se: os-net-config writes the network-scripts file originally, but it looks like it's being overwritten by cloud-init. HardProv DFG (Dan, Harald, and myself) have no objections to the method Andreas describes in the description for disabling network config. It seems reasonable.
From the case it looks like this is a workaround for the RHEL bugs:
RHEL7: https://bugzilla.redhat.com/show_bug.cgi?id=1750859
RHEL8: https://bugzilla.redhat.com/show_bug.cgi?id=1750862

I'll leave this bug open for tracking purposes and in case anyone hits a similar issue, but there are no OpenStack fixes planned.
*** Bug 1761363 has been marked as a duplicate of this bug. ***
I'm not sure why we are suddenly seeing this. Was there a change in cloud-init in a recent RHEL update that triggers it?

Also, is there any reason we shouldn't add a step in tripleo to implement the workaround, i.e. disabling network config with cloud-init once we have successfully configured networking with os-net-config? That is, add something that runs somewhere around here: common/deploy-steps.j2#L654? (If NetworkConfig_result.rc == 0 and AllNodesValidationConfig completed successfully, we disable cloud-init's network-config.)

echo "network: {config: disabled}" > /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg
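To make the proposed condition concrete, here is a minimal shell sketch of the gating logic. It is not the tripleo implementation: the function name and its arguments are hypothetical, with the two return codes standing in for NetworkConfig_result.rc and the result of AllNodesValidationConfig.

```shell
#!/bin/sh
# Sketch only: disable cloud-init network config when both checks succeeded.
# $1 = return code of the network configuration step (hypothetical input)
# $2 = return code of the all-nodes validation step (hypothetical input)
# $3 = optional override of the cloud-init config directory
disable_cloud_init_network_if_ok() {
    network_rc="$1"
    validation_rc="$2"
    cfg_dir="${3:-/etc/cloud/cloud.cfg.d}"
    if [ "$network_rc" -eq 0 ] && [ "$validation_rc" -eq 0 ]; then
        echo "network: {config: disabled}" > "$cfg_dir/99-disable-network-config.cfg"
    else
        # Leave cloud-init untouched if networking is not known to be good.
        echo "network setup not confirmed; cloud-init network config left enabled" >&2
        return 1
    fi
}
```

Gating on both results avoids pinning a node to a network configuration that has not been validated.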
Seeing this on other deployments. It seems this behavior happens when the config drive is not detected by cloud-init, as discussed here: https://bugzilla.redhat.com/show_bug.cgi?id=1761363#c2

I have not been able to determine why this happens. It might be kernel related, but I was unable to reproduce it with the kernel version being used in this specific environment (3.10.0-957.1.3.el7.x86_64).

I can reproduce the cloud-init behavior by manually removing the config drive from the overcloud node. After removing the config drive (I just removed the partition in a lab node) and rebooting, cloud-init resets the primary interface to dhcp (the default config). This seems like the wrong behavior: it should still detect that this is not the initial run from the contents of /var/lib/cloud/instance/, correct?

So it seems there are multiple issues, and overall we should protect the overcloud nodes from this behavior by disabling cloud-init's control of the network config after first boot.

echo "network: {config: disabled}" > /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg
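To check whether a node has already been hit, one option is to look for ifcfg files carrying the cloud-init header instead of the os-net-config one. A minimal sketch follows; the function name is hypothetical, and the match string is taken from the file header shown in the description of this bug.

```shell
#!/bin/sh
# Sketch only: print every ifcfg-* file whose header says it was created by cloud-init.
# $1 is a hypothetical override of the network-scripts directory.
list_cloud_init_ifcfg() {
    scripts_dir="${1:-/etc/sysconfig/network-scripts}"
    for f in "$scripts_dir"/ifcfg-*; do
        # Skip the unexpanded glob when no ifcfg files exist.
        [ -f "$f" ] || continue
        if grep -q '^# Created by cloud-init' "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

An empty result suggests that cloud-init has not rewritten any interface file on that node; any path printed would need to be regenerated by os-net-config.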
Agree with disabling it in director. Note that we've also seen issues with cloud-init in Backup and Restore - https://bugzilla.redhat.com/show_bug.cgi?id=1791949.
As we have this tracked with cloud-init (https://bugzilla.redhat.com/show_bug.cgi?id=1773637) and director (https://bugzilla.redhat.com/show_bug.cgi?id=1773642) I am closing this one. *** This bug has been marked as a duplicate of bug 1773642 ***