Bug 1593010

Summary: [cloud-init][RHVM]cloud-init network configuration does not persist reboot [RHEL 7.8]
Product: Red Hat Enterprise Linux 7 Reporter: amashah
Component: cloud-init Assignee: Eduardo Otubo <eterrell>
Status: CLOSED ERRATA QA Contact: Yuhui Jiang <yujiang>
Severity: high Docs Contact:
Priority: high    
Version: 7.0 CC: amashah, bpelled, danken, dfu, dholler, dhuertas, dsh, eraviv, eterrell, fgarciad, fsun, greartes, huzhao, ikke, jcoscia, jgreguske, jhunsaker, jonathan.moore, ldu, linl, lrotenbe, lsurette, mkalinin, mkarg, mrezanin, mtessun, pengpengs, Rhev-m-bugs, ribarry, rmccabe, samuel.jon.gunnarsson, sirao, srevivo, svigan, vkuznets, vmware-gos-qa, vshypygu, xiachen, yacao, yanjin, yujiang
Target Milestone: pre-dev-freeze   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: cloud-init-18.5-5.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1706482 1750710 (view as bug list) Environment:
Last Closed: 2020-03-31 20:07:08 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Network RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1736852, 1706482, 1710945, 1750710    
Attachments:
Description Flags
cloud-init.log-05/22 none
initial-boot-logs none

Description amashah 2018-06-19 20:38:56 UTC
Description of problem:
When using cloud-init to configure networking, the static IP settings are reset to DHCP on VM reboot. The same happens when the VM is started with Run Once.

Version-Release number of selected component (if applicable):

rhvm-4.2.3.8-0.1.el7.noarch                                 Tue May 29 21:31:01 2018

cloud-init-0.7.9-24.el7.x86_64                              Tue May 29 21:02:17 2018

How reproducible:


Steps to Reproduce:
1. Configure static networking from Initial Run VM screen
2. Check networking, works OK
3. Power down VM and power back up, networking is now set as DHCP

Actual results:
Static IP settings are lost

Expected results:
Static IP settings persist across VM reboots.

Additional info:

Comment 2 Javier Coscia 2018-06-19 22:49:14 UTC
Amar, could you please get a sosreport from the VM with the issue if it's a RHEL VM or the cloud-init logs and version inside the VM if it's not a RHEL VM ?

Not sure, but maybe this BZ should be against RHEL product - cloud-init component instead of RHV ?

Comment 3 Dan Kenigsberg 2018-06-24 07:50:52 UTC
(In reply to Javier Coscia from comment #2)
> Amar, could you please get a sosreport from the VM with the issue if it's a
> RHEL VM or the cloud-init logs and version inside the VM if it's not a RHEL
> VM ?
> 
> Not sure, but maybe this BZ should be against RHEL product - cloud-init
> component instead of RHV ?

I believe so. Calling it Ryan.
But customer in-guest logs are paramount for further debugging.

Comment 4 Ryan McCabe 2018-06-27 04:14:34 UTC
It could be a configuration issue. The network config module will run if the instance id changes, and not being able to retrieve any instance id will also do it. Without knowing the configuration and having the logs, it's hard to say what's going on.

FYI, I have moved on from cloud-init. It's now owned by Amnon's team in virt.

Comment 9 Ryan McCabe 2018-06-29 16:20:43 UTC
Just from a quick look, the instance id is changing. Is something like a config drive being removed after the first run, or similar?

Comment 12 Dan Kenigsberg 2018-07-04 07:52:16 UTC
(In reply to Ryan McCabe from comment #9)
> Just from a quick look, the instance id is changing. Is something like a
> config drive being removed after the first run, or similar?

Yes, RHV attaches a virtual CD on initial boot. It no longer exists in further boots. Why?

Comment 23 Javier Coscia 2018-07-11 19:55:45 UTC
(In reply to Eduardo Otubo from comment #22)

> 
> Ok, I didn't know there was a switch on/off on RHV GUI for cloud-init. Did
> this switch always work, meaning, is this behavior a regression? Also, on
> the cloud-init switch, there is a hint saying "Set up early initialization of
> Linux virtual machine using cloud-init.", this leads me to understand that
> cloud-init will be used for initial setup, I wouldn't expect it to be on
> after the reboot.
> 

My understanding is that this switch wasn't working as expected. Well, actually it works: when turned on, it sends the information to the VM. But when it's off and the cloud-init service is enabled on boot in the VM, the service still tries to fetch cloud-init information that wasn't requested through RHV. It looks like this is something the cloud-init service does whenever it is enabled on boot.

> > 
> > I think that with the cloud-init service up and running on boot in the VM,
> > and the fact that we are sending an empty cloud-init config/parameters,
> > should be enough to not do changes on the VM.
> 
> Not sure I followed you here. Can you elaborate? 
> 

If you turn on the cloud-init switch on RHV and fill in the information needed, like DNS, IP information, etc, then you're asking RHV to send this information to the VM and you will be asking the cloud-init service to actually do something with this data. On the other hand, if you leave the cloud-init switch off, you should be telling the cloud-init service on the VM, "don't do anything / I don't want cloud-init to fetch anything".

> > 
> > Not sure what instance-id is here. Could you please help me understand
> > this so I can see if it can be passed through kernel line before boot ?
> 
> Whenever you create a new instance (a new VM) based on an image, or even
> installing a new VM with the distro image, cloud-init will generate a new
> instance-id for that specific vm. When you remove the Data Source and reboot
> (in this case the distro image on the virtual CD-ROM), cloud-init thinks
> it's a new instance, regenerates a new instance-id and runs again (that's
> when the network config is overwritten).

This is the weird part, I'm not creating/installing a new VM, nor a VM based on an image. I assume that cloud-init will not generate a new instance-id in this case.

I created the VM, installed it manually, no automation or anything different than a normal install. I then installed the cloud-init package inside the VM, shut it down, and then turned the cloud-init switch on with the static IP information, for example.

> 
> What I mean is: cloud-init here is working as designed.
> 
> You could pass this instance-id via configuration or even on the kernel line
> so it wouldn't reset and regenerate the instance-id on the first reboot.


According to the above information, my understanding is that cloud-init is trying to fetch new data because it finds the instance-id changed and since it cannot find the data to use (because we didn't specify any), it overwrites network config. Is this correct ?

Comment 24 Marko Karg 2018-07-26 13:39:48 UTC
I can second what Javier said in comment #23 and I can easily reproduce. If I create a cloned VM of a template, on the first run with cloud-init enabled, the VM gets the correct IP / DNS etc. as specified for cloud init. 
On a subsequent power off / run cycle, cloud-init again kicks in during boot, but fails to find the necessary information and falls back to a standard ifcfg-eth0 file, for example one that tries to get its configuration via DHCP.

Please let me know if you need anything from the RHHI setup I use here and I will gladly provide it.

Comment 25 Marko Karg 2018-07-26 13:43:26 UTC
oh and btw - how do I actually disable cloud-init? I tried systemctl disable / mask, touching /etc/cloud/cloud-init.disabled and none of them worked.

Comment 28 Samuel Jon Gunnarsson 2018-08-23 06:29:58 UTC
What is the status of this bug? I have encountered this bug as well and was wondering if there are any workarounds.

Comment 29 Eduardo Otubo 2018-08-23 15:19:19 UTC
No updates so far, but it's on the third place on my priority list. I'm expecting to take a look at this next week. Sorry for the delay.

Comment 30 Samuel Jon Gunnarsson 2018-08-27 18:33:13 UTC
Ok thanks for the update. 

As a workaround, I masked the following services:

cloud-config.service                          
cloud-final.service                           
cloud-init-local.service                      
cloud-init.service                            

Is that overkill?

I found out that it was not sufficient to mask cloud-init service alone. 
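
A minimal sketch of this workaround, assuming the four unit names listed above. The function name `mask_cloud_init_units` and the `DRY_RUN` variable are hypothetical and exist only for illustration; by default the commands are printed rather than executed, so the sketch can be tried without touching the system.

```shell
#!/bin/sh
# Hypothetical helper: mask every cloud-init unit named in this comment.
# DRY_RUN=1 (the default) only prints the commands.
# Set DRY_RUN=0 and run as root to actually mask the units.
CLOUD_INIT_UNITS="cloud-init-local.service cloud-init.service cloud-config.service cloud-final.service"

mask_cloud_init_units() {
    for unit in $CLOUD_INIT_UNITS; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "systemctl mask $unit"
        else
            systemctl mask "$unit"
        fi
    done
}
```

Calling `DRY_RUN=0 mask_cloud_init_units` as root would apply the masks; `systemctl unmask` on the same units reverses them when a fresh configuration is wanted.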

This also happens with VMs imported from other KVM platforms that have the cloud-init packages enabled on boot. I have just tested an import from an Apache CloudStack based platform, where I had to mask the cloud-init services.

Regards, 
Sammi

Comment 32 Jonathan Moore 2018-10-30 19:53:51 UTC
Having a similar issue with IBM PowerVC 1.4.1.0 and
cloud-init-0.7.9-24.el7_5.1.ppc64le

Configuration is also reset to DHCP after reboots.  Adding 

network:
  config: disabled

to /etc/cloud/cloud.cfg has been a workaround for us.

Comment 33 Eduardo Otubo 2019-03-14 14:52:45 UTC
As far as I understand how cloud-init works, it relies on the Config Drive (provided by the CDROM image) for the instance-id which is set for the vm. If the Config Drive is removed (which in this case it is) the instance-id is gone, cloud-init thinks it's a new instance and runs again, removing all network configuration set before reboot.

This behaviour works as designed, there's no bug in cloud-init as I see it.

From here we have two solutions:
1) Do not remove Config Drive. The CDROM will remain attached to the vm, cloud-init will read the instance-id every time the vm is rebooted and no network configuration will be lost.
2) Pass the instance-id information in the kernel command line in the form: cc:instance-id:<name> (I'm not sure about the kernel variable name)

Comment 37 Dan Kenigsberg 2019-03-22 14:30:02 UTC
IMHO the cleaner workaround is to mask cloud-init services. It is easier to unmask if one wants new config.

Since which version does cloud-init reset everything on a missing config source? I find it very surprising that we have not heard about this years ago.

I would like to see a cloud-init feature where I can ask "keep on next boot, unless you find a fresh data source". Would that be possible, Eduardo?

Comment 38 eraviv 2019-03-24 09:19:21 UTC
From cloud-init's side:
=======================
According to cloud-init 0.7.9 documentation cloud-init is configured to run by default on each boot [1] and to render the user-selected network configuration on first boot [2]. Also, in absence of a data source to configure the network, it will fall back to configuring DHCP on eth0 [2].

If a VM is run once, and then in the next regular run the cloud-init flag is not selected in the VM configuration in engine, there is no data-source and cloud-init falls back to dhcp as documented.

It is possible to modify cloud-init's behaviour with a  'marker' file, documented as follows:

* disabling cloud-init altogether [1] with: touch /etc/cloud/cloud-init.disabled
* preventing cloud-init from configuring the network [2] with: 
  echo 'network: {config: disabled}' >> /etc/cloud/cloud.cfg

This can be accomplished by adding the above commands to the custom_script that cloud-init runs at the last stage of its operation [3].

There is possibly a third 'hack' that would not require any marker file: assign your static IP to a NIC not named 'eth0'. I have not tested it myself but it looks like a corollary of [2].


[1] https://cloudinit.readthedocs.io/en/0.7.9/topics/boot.html#generator
[2] https://cloudinit.readthedocs.io/en/0.7.9/topics/boot.html#local
[3] https://cloudinit.readthedocs.io/en/0.7.9/topics/boot.html#final

From engine's side:
=======================
When a VM is started in 'Run once' mode, the initialization parameters supplied for that run are always passed by engine to cloud-init in the guest for application.


But if a VM is started in 'Run' mode, the initialization parameters are passed to cloud-init on the guest only if this is the first run (be it 'Run' or 'Run once'). On every consecutive run in 'Run' mode no parameters are passed to the guest, and therefore (as I quoted from the cloud-init documentation earlier in this thread) cloud-init falls back to DHCP configuration on the guest.

This is not an overlooked occurrence on engine's behalf but rather the designated behaviour.
When this behaviour was introduced into engine the reasoning was that after the initial configuration of the VM, there is no reason to resend the configuration on every 'Run' but only on 'Run once'. That's because 'Run once' may be used for out-of the ordinary instantiations of the VM.

Due to the behaviour of the current cloud-init package, this causes an unexpected side effect that can be dealt with by disabling cloud-init in one of the methods I described earlier in this thread.

Comment 39 Javier Coscia 2019-03-25 16:20:18 UTC
fwiw, adding the following as `custom script` in `Use Cloud-Init` RHV UI works, the VM will get the correct static IP information and will build the ifcfg-ethX correctly, then will disable network configuration from cloud-init

~~~
#cloud-config

runcmd:
 - [ sh, -c, 'echo "network: {config: disabled}" >> /etc/cloud/cloud.cfg' ]
~~~
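
The runcmd line above appends unconditionally, so the same line would be added again if the script ever ran more than once. A minimal sketch of an idempotent variant follows; the function name `disable_cloud_init_network` is hypothetical, and the target path is a parameter (defaulting to the /etc/cloud/cloud.cfg used above) so it can be tried on a scratch file first.

```shell
#!/bin/sh
# Hypothetical helper: append "network: {config: disabled}" to a cloud-init
# config file only if that exact line is not already present.
disable_cloud_init_network() {
    cfg="${1:-/etc/cloud/cloud.cfg}"
    line='network: {config: disabled}'
    # -F: fixed string, -x: match the whole line, -q: no output
    if ! grep -Fxq "$line" "$cfg" 2>/dev/null; then
        printf '%s\n' "$line" >> "$cfg"
    fi
}
```

Running `disable_cloud_init_network /tmp/cloud.cfg.test` twice on a copy of the config should leave exactly one such line in the file.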

Comment 40 Eduardo Otubo 2019-03-27 10:32:33 UTC
(In reply to Dan Kenigsberg from comment #37)
> IMHO the cleaner workaround is to mask cloud-init services. It is easier to
> unmask if one wants new config.
> 
> Since which version does cloud-init reset everything on a missing config
> source? I find it very surprising that we have not heard about this years
> ago.
> 
> I would like to see a cloud-init feature where I can ask "keep on next boot,
> unless you find a fresh data source". Would that be possible, Eduardo?

Yes, this is technically possible. But the question here is: do we want it? There are cleaner ways to solve this issue, as pointed out in other comments. And besides, this change would have to be sent upstream, and I think this feature would hardly make it to a public release.

Comment 41 Dan Kenigsberg 2019-03-27 11:04:38 UTC
What is the cleaner way to express "keep on next boot, unless you find a fresh data source"?
I'm aware only of "keep on next boot, unless someone miraculously removes a line from /etc/cloud/cloud.cfg".
(and of course, we are Red Hat, we are Upstream First).

Comment 42 Eduardo Otubo 2019-03-27 13:01:27 UTC
(In reply to Dan Kenigsberg from comment #41)
> What is the cleaner way to express "keep on next boot, unless you find a
> fresh data source"?

The cleaner way is to configure the guest to have a variable with the instance-id on the kernel line.

> I'm aware only of "keep on next boot, unless someone miraculously removes a
> line from /etc/cloud/cloud.cfg".

Actually what we have today is: Keep on next boot, unless the data source is removed

Comment 43 Dan Kenigsberg 2019-03-27 13:21:31 UTC
Let us assume for a minute that we cannot change the age-old ovirt semantics of passing the data source to the VM only when we want to change something in cloud-init.

What should I do in the guest so that:
- if no data source is found, nothing changes
- if data source is found, it is acted upon

From what I gather from your explanation, once we run
  grubby --update-kernel=`grubby --default-kernel` --args=cc:instance-id
we would not get cloud-init acting on new data sources. Please correct me if I am wrong.

Comment 44 Eduardo Otubo 2019-03-27 14:12:02 UTC
I've spoken to Ryan Harper (maintainer) on IRC and he advised to use the option `manual_cache_clean' on the configuration file:

# manual cache clean.
#  By default, the link from /var/lib/cloud/instance to
#  the specific instance in /var/lib/cloud/instances/ is removed on every
#  boot.  The cloud-init code then searches for a DataSource on every boot
#  if your DataSource will not be present on every boot, then you can set
#  this option to 'True', and maintain (remove) that link before the image
#  will be booted as a new instance.
# default is False
manual_cache_clean: False

(source: https://git.launchpad.net/cloud-init/tree/doc/examples/cloud-config.txt#n466)

Perhaps this is the cleanest way to avoid changing anything outside cloud-init while fixing the issue at the same time.
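
A minimal sketch of how the `manual_cache_clean` option quoted above could be switched on in a guest. The function name `set_manual_cache_clean` is hypothetical, and the config path is a parameter (defaulting to /etc/cloud/cloud.cfg, which is an assumption about where the option is set) so the helper can be tried on a copy of the file.

```shell
#!/bin/sh
# Hypothetical helper: set manual_cache_clean to True or False in a
# cloud-init config file, replacing any existing top-level entry.
set_manual_cache_clean() {
    value="$1"                          # expected: True or False
    cfg="${2:-/etc/cloud/cloud.cfg}"
    tmp="$(mktemp)" || return 1
    # Keep every line except an existing top-level manual_cache_clean entry.
    if [ -f "$cfg" ]; then
        grep -v '^manual_cache_clean:' "$cfg" > "$tmp" || true
    fi
    printf 'manual_cache_clean: %s\n' "$value" >> "$tmp"
    # Write back through cat so the original file keeps its owner and mode.
    cat "$tmp" > "$cfg"
    rm -f "$tmp"
}
```

For example, `set_manual_cache_clean True /tmp/cloud.cfg.test` rewrites the scratch file with a single `manual_cache_clean: True` entry.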

Please give it a try and let me know if I can help with anything else.

Comment 45 Dan Kenigsberg 2019-03-27 14:52:36 UTC
I hope Liran can try it out, but would you please confirm what happens if
manual_cache_clean is set to True, and a new DataSource is added to the VM?
Would cloud-init act upon it?

Comment 46 Eduardo Otubo 2019-03-28 14:36:51 UTC
(In reply to Dan Kenigsberg from comment #45)
> I hope Liran can try it out, but would you please confirm what happens if
> manual_cache_clean is set to True, and a new DataSource is added to the VM?
> Would cloud-init act upon it?

In this case cloud-init won't create a new instance-id whenever a new data source is added. You might want to use `cloud-init clean' before booting with a different data source that you want to use.

Comment 47 Dan Kenigsberg 2019-03-28 14:54:27 UTC
(In reply to Eduardo Otubo from comment #46)
> (In reply to Dan Kenigsberg from comment #45)
> > I hope Liran can try it out, but would you please confirm what happens if
> > manual_cache_clean is set to True, and a new DataSource is added to the VM?
> > Would cloud-init act upon it?
> 
> In this case cloud-init won't create a new instance-id whenever a new data
> source is added. In this case you might want to use `cloud-init clean'
> before a boot on a different data source that you want to use.

Thanks. This means that this configurable does not give me what I am looking for.

My request is for a cloud-init mode where
- if a data source is found, it is acted upon
- if no data source is found, nothing changes

Liran tells me that this is actually the behavior for any nic other than eth0. I'd like to extend it to eth0, too.

Comment 48 Eduardo Otubo 2019-03-29 08:32:28 UTC
(In reply to Dan Kenigsberg from comment #47)
> (In reply to Eduardo Otubo from comment #46)
> > (In reply to Dan Kenigsberg from comment #45)
> > > I hope Liran can try it out, but would you please confirm what happens if
> > > manual_cache_clean is set to True, and a new DataSource is added to the VM?
> > > Would cloud-init act upon it?
> > 
> > In this case cloud-init won't create a new instance-id whenever a new data
> > source is added. In this case you might want to use `cloud-init clean'
> > before a boot on a different data source that you want to use.
> 
> Thanks. This means that this configurable does not give me what I am looking
> for.
> 
> My request is for a cloud-init mode where
> - if a data source is found, it is acted upon
> - if no data source is found, nothing changes
> 
> Liran tells me that this is actually the behavior for any nic other than
> eth0. I'd like to extend it to eth0, too.

I understand your point now. What exactly are the priority and the scope of this BZ? I can try to push a behaviour like that to the configuration file on cloud-init upstream, but this might take a little while to be merged. Or we could handle this outside cloud-init: you just have to use `manual_cache_clean' when you want things to remain the same and `cloud-init clean' when you want a new data source to be acted upon.

Comment 53 Eduardo Otubo 2019-04-05 08:52:33 UTC
According to this thread [0], Ryan pointed out that the most recent version of cloud-init should fix the issue:

"Current cloud-init uses a systemd-generator detect if it's booting on a system
or platform which is providing a datasource.  In the scenario from the bug, it
initially provides a cdrom with the NoCloud datasource and then the cdrom is
removed.   After rebooting the VM, the cdrom is no longer present, cloud-init
generator will not find the cdrom with NoCloud seed and it will disable itself
(no need to mask services).  If the cdrom is provided again with a new
instance-id cloud-init will run as if it's a new instance."
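
The idea in the quote can be illustrated with a rough sketch. This is not the generator's actual logic; it only shows the kind of check involved, assuming the seed is recognised by its filesystem label (`cidata` for NoCloud, `config-2` for ConfigDrive). The function name `has_seed_label` is hypothetical.

```shell
#!/bin/sh
# Hypothetical illustration, not the real generator code: read filesystem
# labels (one per line) on stdin and succeed if any of them is a seed label.
# "cidata" is the NoCloud label and "config-2" the ConfigDrive label.
has_seed_label() {
    while IFS= read -r label; do
        case "$label" in
            cidata|CIDATA|config-2|CONFIG-2) return 0 ;;
        esac
    done
    return 1
}
```

On a guest, the labels could be fed in with something like `blkid -o value -s LABEL | has_seed_label && echo "seed present"`, which would be expected to print nothing once the cdrom has been detached.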

I prepared an unofficial 18.5 rebased package[1] for another BZ, please give it a try and check if this issue is gone.

[0] https://lists.launchpad.net/cloud-init/msg00197.html
[1] http://people.redhat.com/~eterrell/cloud-init/18.5/rhel7/bz1632967/

Comment 54 ldu 2019-04-09 05:58:09 UTC
Hi Eduardo,
I retested with your test build cloud-init-18.5-0.el7.otubo201903151232.x86_64 on the ESXi platform.
After reboot the static IP is still replaced with a DHCP IP. It seems the fix doesn't work.

BR,
Lili Du

Comment 63 Dan Kenigsberg 2019-05-21 18:57:59 UTC
Liran, the el8 string tells me that you are testing an el8 build; it is unlikely to be installable in any el7. It would be lovely if you can try it on your favorite el8 guest; we can discuss backporting later.

Comment 64 ldu 2019-05-22 03:02:38 UTC
@Liran, thanks very much for pointing out the correct link!

@Eduardo, I retested with cloud-init-18.2-1.el8.otubo201905200930.noarch.rpm from the link http://people.redhat.com/~eterrell/cloud-init/1593010/.
After customizing the guest to use a static IP with cloud-init and then booting the cloned guest, the IP does not change to static; it is still DHCP. The cloud-init IP customization failed.
I attach the cloud-init log, please help review.

Comment 65 ldu 2019-05-22 03:06:53 UTC
Created attachment 1571778 [details]
cloud-init.log-05/22

Comment 66 Liran Rotenberg 2019-05-27 06:28:52 UTC
Created attachment 1573764 [details]
initial-boot-logs

I'm having the same problem. The network configuration of cloud-init wasn't applied on the initial boot (first boot before rebooting).
I checked a few types of configurations, including IPv6/IPv4 Static and DHCP. Other cloud-init configurations are set correctly.

Comment 67 Liran Rotenberg 2019-05-28 13:27:29 UTC
Just to make it clear, with the given RPM (cloud-init-18.2-1.el8.otubo201905200930.noarch.rpm) the basic flow of cloud-init network configuration is broken (a regression), which makes it impossible to test the current bug.

Eduardo, can you please check it and provide another RPM to test?

Comment 69 Chris Williams 2019-06-21 20:13:42 UTC
*** Bug 1680465 has been marked as a duplicate of this bug. ***

Comment 71 Marina Kalinin 2019-06-24 19:28:44 UTC
*** Bug 1534694 has been marked as a duplicate of this bug. ***

Comment 73 Marina Kalinin 2019-06-24 19:32:02 UTC
See comment #53 for a summary.

Comment 85 Eduardo Otubo 2019-08-16 11:54:44 UTC
After lots of debugging and discussions with QA and maintainers on IRC, we found that the root cause is a bug in the VMware datasource itself. VMware resets the datasource on every boot so it can add more customizations every time, but this behavior is just wrong from cloud-init's perspective. There's already an open bug on Launchpad (https://bugs.launchpad.net/cloud-init/+bug/1835205, also linked above) and people from VMware are already working on it. Once it's complete I'll backport it or, depending on the time frame, issue a new rebase.

Setting back to NEW since there's no work to be done at the moment.

Comment 98 Miroslav Rezanina 2019-09-10 09:00:51 UTC
Fix included in cloud-init-18.5-5.el7

Patch 90300 (Fix for network configuration not persisting after reboot) committed with hash 9969cf3eaa23398816d140b319b3277465aa4bb8

Comment 106 errata-xmlrpc 2020-03-31 20:07:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1155

Comment 107 xiachen 2020-08-31 02:19:34 UTC
From bug 1862100, we found that the fix for this bug introduced one change: 'ds-identify' was added, and it affects custom DataSource configuration.
As bug 1593010 is mentioned in the cloud-init changelog, the key information should be recorded here.

*Important*: 'ds-identify' is required for a custom DataSource to work properly in cloud-init 18.5-5+.
When creating a custom datasource, follow this document to configure your datasource. Using ds-identify ensures that cloud-init recognizes your datasource upon upgrade and reboot.
https://cloudinit.readthedocs.io/en/latest/topics/datasources.html#creation