Bug 1760806 - Disabling cloud-init network configuration post deployment - any risks?
Keywords:
Status: CLOSED DUPLICATE of bug 1773642
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: os-net-config
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Bob Fournier
QA Contact: Arik Chernetsky
URL:
Whiteboard:
Duplicates: 1761363
Depends On:
Blocks:
Reported: 2019-10-11 11:20 UTC by Andreas Karis
Modified: 2024-10-01 16:21 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-29 20:31:30 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-3596 0 None None None 2022-08-23 10:37:48 UTC

Description Andreas Karis 2019-10-11 11:20:14 UTC
Description of problem: Disabling cloud-init network configuration post deployment - any risks?

A customer ran into the following issue. On one of their controllers, after a reboot, they saw:

[root@overcloud-controller-0 network-scripts]# cat ifcfg-em1
# Created by cloud-init on instance boot automatically, do not edit.
#
BOOTPROTO=dhcp
DEVICE=em1
HWADDR=e4:43:4b:5a:e4:10
MTU=1500
ONBOOT=yes
TYPE=Ethernet
USERCTL=no
[root@overcloud-controller-0 network-scripts]#  ls -l ifcfg-em1
-rw-r--r--. 1 root root 167 Sep 17 12:45 ifcfg-em1
[root@overcloud-controller-0 network-scripts]#

On controller 2, by contrast, everything is OK:

[root@overcloud-controller-2 network-scripts]# cat ifcfg-em1
# This file is autogenerated by os-net-config
DEVICE=em1
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
BOOTPROTO=none
MTU=9000
[root@overcloud-controller-2 network-scripts]# ls -l ifcfg-em1
-rw-r--r--. 1 root root 131 Sep 11 10:49 ifcfg-em1
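
The two listings differ in their header comment, which makes an overwritten file easy to spot. A minimal sketch of such a scan, assuming the header string shown in the cloud-init-generated ifcfg-em1 above; the function name and the optional directory argument are illustrative and not part of any Red Hat tooling:

```shell
#!/bin/sh
# Sketch: print every ifcfg file whose header shows cloud-init as the writer.
# The header text is taken from the ifcfg-em1 output in this report.
# list_cloud_init_ifcfg and its directory argument are hypothetical names.
list_cloud_init_ifcfg() {
    dir="${1:-/etc/sysconfig/network-scripts}"
    for f in "$dir"/ifcfg-*; do
        # Skip the unexpanded glob when no ifcfg files exist.
        [ -f "$f" ] || continue
        if grep -q '^# Created by cloud-init' "$f"; then
            echo "$f"
        fi
    done
}
```

Calling `list_cloud_init_ifcfg` with no argument scans /etc/sysconfig/network-scripts; on a node still managed by os-net-config it should print nothing.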

This does look similar to https://bugzilla.redhat.com/show_bug.cgi?id=1593010

However, this issue happened only once and we could not get to the bottom of it. To restore customer confidence, I suggested applying the following workaround on all overcloud nodes:

echo "network: {config: disabled}" > /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg

[root@computeovsdpdk-1 ~]# ls /etc/cloud/cloud.cfg.d/
05_logging.cfg  10_etc_hosts.cfg  99-disable-network-config.cfg  README
[root@computeovsdpdk-1 ~]# 

From my tests, I can see that the file persists and disables cloud-init's network configuration, no matter what happens. This should prevent whatever unidentified issue happened on the customer's controller from recurring.
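
When pushing this to many nodes, it may help to make the write idempotent so that repeated runs report whether anything changed. A minimal sketch, assuming the same file name and content as the workaround above; the function name and the overridable directory argument are illustrative only:

```shell
#!/bin/sh
# Sketch: write the cloud-init drop-in only if it is missing or differs.
# File name and content come from the workaround in this report.
# disable_cloud_init_network and its directory argument are hypothetical.
disable_cloud_init_network() {
    cfg_dir="${1:-/etc/cloud/cloud.cfg.d}"
    cfg_file="$cfg_dir/99-disable-network-config.cfg"
    want='network: {config: disabled}'
    # Leave the file alone when it already holds exactly the wanted line.
    if [ -f "$cfg_file" ] && [ "$(cat "$cfg_file")" = "$want" ]; then
        echo "unchanged: $cfg_file"
        return 0
    fi
    echo "$want" > "$cfg_file" && echo "written: $cfg_file"
}
```

Run as root, `disable_cloud_init_network` targets /etc/cloud/cloud.cfg.d by default; passing another directory is only meant for trying the function out safely.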

I'd like to know whether engineering has any objection to this, or whether I can provide a support exception for it.

Comment 1 Andreas Karis 2019-10-11 11:20:52 UTC
RCA:

We can be relatively sure that something deleted or emptied the file /var/lib/cloud/data/instance-id.

This caused cloud-init to run again, which reconfigured ifcfg-em1.

A workaround is:

echo "network: {config: disabled}" > /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg

I could not find the reason why /var/lib/cloud/data/instance-id was empty and thus would like to tell the customer that they can push the workaround, just in case.
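
The trigger condition described in this RCA can be checked before a reboot. A minimal sketch, assuming only the instance-id path named above; the function name and its three output words are illustrative choices:

```shell
#!/bin/sh
# Sketch: report whether the cloud-init instance-id file is in the state
# that the RCA above associates with a cloud-init re-run.
# instance_id_state and its output words are hypothetical.
instance_id_state() {
    id_file="${1:-/var/lib/cloud/data/instance-id}"
    if [ ! -e "$id_file" ]; then
        echo missing      # file was deleted
    elif [ ! -s "$id_file" ]; then
        echo empty        # file exists but has zero bytes
    else
        echo ok           # file exists and has content
    fi
}
```

A result of `missing` or `empty` on an already deployed node would suggest that the next boot may rewrite the interface configuration unless the workaround is in place.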

Comment 2 Andreas Karis 2019-10-11 11:23:29 UTC
And here's a test of my workaround with that scenario (/var/lib/cloud/data/instance-id empty):

[root@computeovsdpdk-1 ~]# cat /etc/sysconfig/network-scripts/ifcfg-em1
# This file is autogenerated by os-net-config
DEVICE=em1
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
BOOTPROTO=static
IPADDR=192.168.24.26
NETMASK=255.255.255.0
DNS1=10.11.5.4
DNS2=10.11.5.3
[root@computeovsdpdk-1 ~]# echo "network: {config: disabled}" > /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg
[root@computeovsdpdk-1 ~]# > /var/lib/cloud/data/instance-id 
[root@computeovsdpdk-1 ~]# reboot
Connection to 192.168.24.26 closed by remote host.
Connection to 192.168.24.26 closed.

After the system comes up again, we can see that the file was not changed by cloud-init:

[root@computeovsdpdk-1 ~]# cat /etc/sysconfig/network-scripts/ifcfg-em1
# This file is autogenerated by os-net-config
DEVICE=em1
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
BOOTPROTO=static
IPADDR=192.168.24.26
NETMASK=255.255.255.0
DNS1=10.11.5.4
DNS2=10.11.5.3
[root@computeovsdpdk-1 ~]# cat /var/lib/cloud/data/previous-instance-id
33badfca-5060-4609-91af-65b3d0066f20
[root@computeovsdpdk-1 ~]# egrep 'Read 0 bytes from /var/lib/cloud/data/instance-id|ifcfg-em1' /var/log/cloud-init.log 
2019-10-09 09:09:35,298 - util.py[DEBUG]: Writing to /etc/sysconfig/network-scripts/ifcfg-em1 - wb: [644] 167 bytes
2019-10-09 09:09:35,299 - util.py[DEBUG]: Restoring selinux mode for /etc/sysconfig/network-scripts/ifcfg-em1 (recursive=False)
2019-10-09 09:09:35,299 - util.py[DEBUG]: Restoring selinux mode for /etc/sysconfig/network-scripts/ifcfg-em1 (recursive=False)
2019-10-10 13:25:38,568 - util.py[DEBUG]: Read 0 bytes from /var/lib/cloud/data/instance-id
2019-10-10 13:25:38,616 - util.py[DEBUG]: Read 0 bytes from /var/lib/cloud/data/instance-id                           <-------------- I reproduced this issue before
2019-10-10 13:25:38,730 - util.py[DEBUG]: Writing to /etc/sysconfig/network-scripts/ifcfg-em1 - wb: [644] 167 bytes   <-------------- this is when I do not disable network config
2019-10-10 13:25:38,731 - util.py[DEBUG]: Restoring selinux mode for /etc/sysconfig/network-scripts/ifcfg-em1 (recursive=False)
2019-10-10 13:25:38,731 - util.py[DEBUG]: Restoring selinux mode for /etc/sysconfig/network-scripts/ifcfg-em1 (recursive=False)
2019-10-10 14:20:18,047 - util.py[DEBUG]: Read 0 bytes from /var/lib/cloud/data/instance-id
2019-10-10 14:20:18,117 - util.py[DEBUG]: Read 0 bytes from /var/lib/cloud/data/instance-id                           <-------------- this is with the workaround
[root@computeovsdpdk-1 ~]#

Comment 3 Bob Fournier 2019-10-14 18:21:09 UTC
This isn't really a HardProv issue per se. os-net-config writes the network-scripts file originally, but it looks like it's being overwritten by cloud-init.
 
HardProv DFG (Dan, Harald, and myself) have no objections to the method described by Andreas in the description to disable network config.  It seems reasonable.

Comment 5 Bob Fournier 2019-10-18 13:26:43 UTC
From the case it looks like this is a workaround for the RHEL bugs:
RHEL7: https://bugzilla.redhat.com/show_bug.cgi?id=1750859
RHEL8: https://bugzilla.redhat.com/show_bug.cgi?id=1750862

I'll leave this bug open for tracking purposes and in case anyone hits a similar issue, but there are no OpenStack fixes planned.

Comment 6 Bob Fournier 2019-10-28 13:27:12 UTC
*** Bug 1761363 has been marked as a duplicate of this bug. ***

Comment 7 Harald Jensås 2019-10-28 18:01:09 UTC
I'm not sure why we are suddenly seeing this. Was there a change in cloud-init in a recent RHEL update that triggers it?

Also, is there any reason we shouldn't add a step in tripleo to implement the workaround, i.e. disabling network config with cloud-init once we have successfully configured networking with os-net-config?

That is, add something that runs somewhere around common/deploy-steps.j2#L654? (If NetworkConfig_result.rc == 0 and AllNodesValidationConfig completed successfully, we disable cloud-init's network config.)

  echo "network: {config: disabled}" > /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg
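
The gating logic proposed here can be sketched in plain shell. This is not tripleo code: the function name and its arguments are assumptions, with the two return codes standing in for NetworkConfig_result.rc and the AllNodesValidationConfig outcome mentioned above.

```shell
#!/bin/sh
# Sketch: disable cloud-init network config only after both the network
# configuration step and the all-nodes validation succeeded.
# disable_after_network_config and its arguments are hypothetical; the
# real deploy step in tripleo may look quite different.
disable_after_network_config() {
    net_rc="$1"          # stand-in for NetworkConfig_result.rc
    validation_rc="$2"   # stand-in for the AllNodesValidationConfig result
    cfg_dir="${3:-/etc/cloud/cloud.cfg.d}"
    if [ "$net_rc" -ne 0 ] || [ "$validation_rc" -ne 0 ]; then
        echo "network setup not confirmed; cloud-init network config left enabled" >&2
        return 1
    fi
    echo "network: {config: disabled}" > "$cfg_dir/99-disable-network-config.cfg"
}
```

Leaving the file unwritten on failure keeps cloud-init's default behavior available while networking is still being debugged, which seems to be the intent of gating on both results.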

Comment 8 Matt Flusche 2019-11-14 16:17:57 UTC
Seeing this on other deployments.  It seems this behavior happens when the config drive is not detected by cloud-init as discussed here:  https://bugzilla.redhat.com/show_bug.cgi?id=1761363#c2

I'm not able to reproduce why this happens. It might be kernel related, but I was unable to reproduce it with the kernel version being used in this specific environment (3.10.0-957.1.3.el7.x86_64). I can reproduce the cloud-init behavior by manually removing the config drive from the overcloud node.

After removing the config drive (I just removed the partition on a lab node) and rebooting, cloud-init resets the primary interface to DHCP (the default config). This seems like the wrong behavior; it should still detect from the contents of /var/lib/cloud/instance/ that this is not the initial run, correct?

So it seems there are multiple issues and overall we should protect the overcloud nodes from this crazy behavior by disabling cloud-init's control of the network config after first boot.

  echo "network: {config: disabled}" > /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg

Comment 14 Bob Fournier 2020-01-27 16:28:32 UTC
Agree with disabling it in director.  Note that we've also seen issues with cloud-init in Backup and Restore - https://bugzilla.redhat.com/show_bug.cgi?id=1791949.

Comment 15 Bob Fournier 2020-07-29 20:31:30 UTC
As we have this tracked with cloud-init (https://bugzilla.redhat.com/show_bug.cgi?id=1773637) and director (https://bugzilla.redhat.com/show_bug.cgi?id=1773642), I am closing this one.

*** This bug has been marked as a duplicate of bug 1773642 ***

Comment 16 Red Hat Bugzilla 2023-09-18 00:17:45 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

