Bug 1440831

Summary: boot into 7.3.4 from 7.3.3 ISO install causes cloud-init to run
Product: Red Hat Enterprise Linux 7
Reporter: Micah Abbott <miabbott>
Component: cloud-init
Assignee: Lars Kellogg-Stedman <lars>
Status: CLOSED ERRATA
QA Contact: Vratislav Hutsky <vhutsky>
Severity: high
Docs Contact:
Priority: high
Version: 7.3
CC: dustymabe, huzhao, jlebon, lars, lfriedma, lmiksik, mbracho, miabbott, mmagr, salmy, sgirijan, walters, yacao
Target Milestone: rc
Keywords: Triaged
Target Release: ---
Flags: mbracho: needinfo+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: cloud-init-0.7.9-5.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1455335 (view as bug list)
Environment:
Last Closed: 2017-08-01 23:23:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1455335
Attachments:
Description Flags
journal from boot none

Description Micah Abbott 2017-04-10 14:36:37 UTC
Starting with a 7.3.3 RHELAH system that was installed via ISO (no cloud-init data provided), I upgraded to the 7.3.4 RHELAH release.

Upon booting the system, it was observed on the console that the cloud-init service had started and was trying to retrieve the necessary metadata for the service to complete.

The cloud-init service took approximately 2 minutes before giving up, during which time the system could not be accessed via SSH or console.

After the system had completed booting, SSH to the host with a non-root user that had previously been able to access the host failed.


Hat tip to Martin Jenner for originally pointing out this issue.

Comment 2 Micah Abbott 2017-04-10 14:45:35 UTC
Created attachment 1270509 [details]
journal from boot

Comment 3 Micah Abbott 2017-04-10 14:46:17 UTC
(In reply to Micah Abbott from comment #0)

> The cloud-init service took approximately 2 minutes before giving up, during
> which time the system could not be accessed via SSH or console.

Scratch that... I actually measured it at about 4.5 minutes.

Comment 4 Jonathan Lebon 2017-04-10 15:04:36 UTC
So, I *think* the issue here is that cloud-init now ships a generator:

http://cloudinit.readthedocs.io/en/latest/topics/boot.html#generator

We do disable cloud-init in the interactive-defaults kickstart, but that seems to get overridden by the generator. Adding 'cloud-init=disabled' to the kernel cmdline does work around it fortunately.

As Lars mentioned, the easiest thing to do here might be to back out the generator for now, until we can figure out how to do this properly *without* affecting already installed clients.
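
For anyone checking whether the workaround is active on a given boot, here is a minimal sketch. The function name is hypothetical, and it only matches the exact 'cloud-init=disabled' token quoted above; it does not try to reproduce the rest of the generator's logic.

```shell
# Hypothetical helper: succeed if a kernel command line carries the
# cloud-init=disabled token. Pass the command line as $1; with no
# argument the running kernel's /proc/cmdline is read instead.
cloud_init_disabled_on_cmdline() {
    cmdline=${1-$(cat /proc/cmdline)}
    # Pad with spaces so only a whole, space-separated token matches.
    case " $cmdline " in
        *" cloud-init=disabled "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Called without arguments on an affected host, it returns success only when the argument was actually passed to the running kernel.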

Comment 5 Jonathan Lebon 2017-04-10 15:10:20 UTC
Smoking gun:

Apr 10 11:05:57 localhost.localdomain systemd[572]: Spawned /usr/lib/systemd/system-generators/cloud-init-generator as 573.
...
Apr 10 11:05:57 localhost.localdomain systemd[572]: /usr/lib/systemd/system-generators/cloud-init-generator succeeded.
...
Apr 10 11:05:57 localhost.localdomain systemd[1]: Installed new job cloud-init.target/start as 215
...
Apr 10 11:05:57 localhost.localdomain systemd[1]: Installed new job cloud-init.service/start as 223
Apr 10 11:05:57 localhost.localdomain systemd[1]: Installed new job cloud-init-local.service/start as 221
...

Comment 6 Lars Kellogg-Stedman 2017-04-10 15:34:52 UTC
For my reference, the code that disables cloud-init is:

https://github.com/projectatomic/rpm-ostree-toolbox/blob/master/src/py/lorax-http-repo.tmpl

Comment 7 Colin Walters 2017-04-11 13:54:55 UTC
See also https://bugzilla.redhat.com/show_bug.cgi?id=850058

Comment 8 Colin Walters 2017-04-11 13:56:05 UTC
Specifically https://bugzilla.redhat.com/show_bug.cgi?id=850058#c5

I don't see a major risk in backporting this to RHEL - the only thing to note is that it requires a coordinated change with spin-kickstarts to enable it for the cloud images.

Comment 9 Jonathan Lebon 2017-04-11 14:55:58 UTC
I renamed the bug title since it doesn't actually have to be a bare metal install.

> I don't see a major risk in backporting this to RHEL - the only thing to note is that it requires a coordinated change with spin-kickstarts to enable it for the cloud images.

The issue is that the generator will actively re-enable services, even if previously disabled: https://git.launchpad.net/cloud-init/tree/systemd/cloud-init-generator?id=61eb03f#n132. This made me wonder why the generator was added in the first place, which led to:

https://git.launchpad.net/cloud-init/commit?id=df2d69

So, it was added so that cloud-init could be disabled from the kernel cmdline. The assumption that it should be enabled in all other cases probably makes sense in most places where cloud-init is found, though not for the AH install path of course. This makes me wonder whether we should instead completely neuter the generator in Atomic Host composes, since here we do want to preserve the semantics of enabled/disabled services. E.g. in postprocess, do

ln -s /dev/null /etc/systemd/system-generators/cloud-init-generator
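
A sketch of the same masking step wrapped so it can be tried against a scratch root first. It relies on systemd preferring a generator in /etc/systemd/system-generators over one of the same name under /usr/lib, and on treating a symlink to /dev/null as masked. The function name and the optional root argument are assumptions for illustration, not part of any compose tooling.

```shell
# Hypothetical helper: mask the cloud-init generator under a given root.
# The root defaults to "/"; pass a scratch directory to try it safely.
mask_cloud_init_generator() {
    root=${1:-/}
    # Strip one trailing slash so that "/" does not produce "//etc/...".
    dir="${root%/}/etc/systemd/system-generators"
    mkdir -p "$dir" &&
        ln -sf /dev/null "$dir/cloud-init-generator"
}
```

Running it against a temporary directory and inspecting the result with readlink should show the link pointing at /dev/null.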

Comment 10 Colin Walters 2017-04-11 15:10:35 UTC
I don't understand why that generator was added upstream.  That's not really what generators are for (IMO).

The default systemd mechanism for controlling service enablement is *presets*.

Anyways, I'm fine nuking the generator just for RHELAH, and am also fine nuking it globally.
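
To illustrate what a preset-based approach could look like, here is a rough sketch that writes a preset file into a directory of your choosing. The file name 80-cloud-init.preset, the function name, and the list of four service units are assumptions; a real image would place the file under a system-preset directory and rely on `systemctl preset` (for example via the packaging scriptlets) to apply it.

```shell
# Hypothetical helper: write a preset file for the cloud-init units.
#   $1  policy, either "enable" or "disable"
#   $2  directory to write the preset file into
write_cloud_init_preset() {
    policy=$1
    dir=$2
    case "$policy" in
        enable|disable) ;;
        *) echo "policy must be enable or disable" >&2; return 2 ;;
    esac
    mkdir -p "$dir" || return 1
    # One "<policy> <unit>" line per unit, which is the preset file syntax.
    for unit in cloud-init-local cloud-init cloud-config cloud-final; do
        printf '%s %s.service\n' "$policy" "$unit"
    done > "$dir/80-cloud-init.preset"
}
```

A metal-oriented compose would pass "disable" and a cloud image "enable", which keeps the delta between the two in one small file.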

Comment 11 Lars Kellogg-Stedman 2017-04-11 17:09:43 UTC
I've produced a scratch build of cloud-init that removes the generator from the package. I generally agree with Colin that this isn't the best way to manage services.  You can find the build at:

  https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12991127

As soon as this bz has the appropriate acks I'll turn that into a real build.

Comment 12 Dusty Mabe 2017-04-11 17:18:17 UTC
@larsks should we ask upstream about the sanity of using generators? and see if it should be changed?

Comment 13 Lars Kellogg-Stedman 2017-04-11 17:40:12 UTC
Dusty: Possibly. I think that the generator could be replaced by a ConditionKernelCommandLine and a ConditionPathExists in the unit files.  I'll suggest that, but I won't hold my breath.
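
Roughly what that could look like, written as a drop-in so it can be tried without touching the packaged unit files. This is only a sketch: the drop-in file name, the function name, and the marker path /etc/cloud/cloud-init.disabled are assumptions, and the "!" prefix negates each condition.

```shell
# Hypothetical helper: write a drop-in carrying the two conditions.
#   $1  unit name, e.g. cloud-init.service
#   $2  base directory that holds the "<unit>.d" drop-in directories
write_condition_dropin() {
    unit=$1
    base=$2
    dir="$base/$unit.d"
    mkdir -p "$dir" || return 1
    cat > "$dir/10-conditions.conf" <<'EOF'
[Unit]
# Skip the unit when the kernel command line contains cloud-init=disabled.
ConditionKernelCommandLine=!cloud-init=disabled
# Skip the unit when this marker file exists (the path is an assumption).
ConditionPathExists=!/etc/cloud/cloud-init.disabled
EOF
}
```

With conditions like these the units stay subject to normal enable/disable handling, and the kernel cmdline switch only adds a way to skip them at boot.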

Comment 14 Lars Kellogg-Stedman 2017-04-11 17:58:35 UTC
Incidentally, while the -4 package noted in #11 will fix the issue, I think we may have also mis-characterized the underlying problem.  In #9, jlebon says:

> The issue is that the generator will actively re-enable services, even if previously disabled

This isn't actually true. If, *after* upgrading to 0.7.9-3, you disable the cloud-init services:

    systemctl disable cloud-init cloud-init-local cloud-config cloud-final

They will remain disabled when you reboot the system.  That is, cloud-init upstream did not fundamentally break service management, although as Colin said their decision to use a generator is questionable.

I believe the root cause is that with 0.7.9, the cloud-init services moved from multi-user.target to cloud-init.target, which means that with the new compose there would be new files in /etc/systemd/system/cloud-init.target.wants.

The generator operates by enabling/disabling cloud-init.target, rather than individual services.

Because of the "new" files in the package, I believe the three-way merge of /etc that happens as part of the atomic upgrade process will populate those new files into /etc as part of the upgrade.
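
One way to check this theory on an upgraded host is to list which .wants directories under /etc actually hold cloud-init symlinks. The sketch below is a hypothetical helper that looks only at the two targets discussed here and matches links by the cloud-* name prefix; pass a root directory to inspect an offline tree.

```shell
# Hypothetical helper: print the cloud-* symlinks found in the .wants
# directories of the two targets discussed in this bug.
#   $1  root of the tree to inspect (defaults to "/")
list_cloud_wants() {
    root=${1:-/}
    base="${root%/}/etc/systemd/system"
    for target in multi-user.target cloud-init.target; do
        dir="$base/$target.wants"
        [ -d "$dir" ] || continue
        for link in "$dir"/cloud-*; do
            # -L is true for a symlink even if its destination is missing;
            # it is false for the literal pattern left by a failed glob.
            [ -L "$link" ] || continue
            printf '%s/%s\n' "$target.wants" "${link##*/}"
        done
    done
}
```

If entries show up under cloud-init.target.wants on a system where the services had been disabled before the upgrade, that would be consistent with the merge of /etc described above.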

Comment 15 Lars Kellogg-Stedman 2017-04-11 17:59:53 UTC
(Also, note that the above is *only* going to happen during an ostree compose, because the services are only enabled automatically by the spec file during an *install*.  A package *upgrade* will not trigger this behavior.)

Comment 16 Laurie Friedman 2017-04-12 16:18:39 UTC
Blocker for 7.3.5.

Comment 18 Colin Walters 2017-04-12 18:53:45 UTC
(In reply to Lars Kellogg-Stedman from comment #14)

> 
> Because of the "new" files in the package, I believe the three-way merge of
> /etc that happens as part of the atomic upgrade process will populate those
> new files into /etc as part of the upgrade.

Yes, that's expected.  But the same thing would happen with rpm's default config management, no?

Anyways, in the big picture, I think it's simplest if cloud-init worked like every other service - use the systemd macros which honor presets.  That's what the patch in https://bugzilla.redhat.com/show_bug.cgi?id=850058#c5 does.

Currently, that'd mean it was disabled by default in /usr, though we could change that to allow it to default-enable.  Before in Fedora when "Cloud" was a separate edition with its own presets, that would have made sense.

But since Atomic Host spans metal and cloud, we *have* to carry a delta for one or the other.  Right now we're hacking around cloud-init enabling itself and forcibly disabling it on metal.

My preference again is inverting this - in the cloud image kickstart, we `systemctl enable cloud-init`.

Comment 19 Lars Kellogg-Stedman 2017-04-27 15:18:51 UTC
> use the systemd macros which honor presets

I agree that presets would be nifty.  I wasn't sure we were making much use of systemd presets in rhel7, but if that's not true, awesome. We'll need to get appropriate presets into our images before we can modify the package and I'm not sure what that process would look like...I guess a bug against rhel-guest-image and atomic?

Comment 21 Lars Kellogg-Stedman 2017-05-02 18:07:22 UTC
There's a scratch build here that mostly reverts the systemd units to what we had in previous releases:

  https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=13120763

Still no presets, but they are once again installed to multi-user.target and the generator is gone.  Can you build a new tree with this package and test the upgrade?

Thanks!

Comment 22 Micah Abbott 2017-05-04 18:23:57 UTC
I made a local scratch compose with the build that Lars linked to and I can confirm it worked in the case of:

- installing via ISO onto bare-metal, then upgrade to scratch compose
- installing via cloud-image via libvirt, then upgrade to scratch compose
- installing via cloud-image via OpenStack, then upgrade to scratch compose

I didn't do any kind of significant testing of the various features supported by cloud-init, but the most basic cases seemed to work fine.

Comment 26 Micah Abbott 2017-07-27 15:32:55 UTC
I successfully repeated my simple tests from comment#22 using the latest RHELAH 7.4 compose, which includes cloud-init-0.7.9-9.el7.x86_64

Based on those results, I'm moving this to VERIFIED

Comment 27 errata-xmlrpc 2017-08-01 23:23:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:2275