1977949 – [RFE] RHCOS: help determining whether a user-provided image was already booted (Ignition provisioning already performed)

Keywords:
Status:	CLOSED NEXTRELEASE
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Documentation
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	low
Target Milestone:	---
Target Release:	4.10.0
Assignee:	Bob Furu
QA Contact:	Michael Nguyen
Docs Contact:	Latha S
URL:
Whiteboard:
Depends On:	2016004
Blocks:

Reported:	2021-06-30 19:20 UTC by Christopher Wawak
Modified:	2024-10-01 18:51 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	2016004 (view as bug list)
Environment:
Last Closed:	2021-10-28 23:14:39 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	coreos fedora-coreos-config pull 1086	0	None	Merged	15fcos: remember when Ignition ran and print on console	2021-10-20 14:23:55 UTC

Description Christopher Wawak 2021-06-30 19:20:02 UTC

Version:
OCP 4.6

Platform:
vmware

Please specify:
UPI

What happened?
Someone in the customer's environment booted a template used for installing OpenShift, and it was difficult to establish during troubleshooting that the template had already been booted.

Support engineers missed for days, while the case ran, that the template had been booted, ultimately testing the customer's patience. Eventually one person noticed that the "firstboot" kernel command-line option was gone, signifying the template had been booted.

What did you expect to happen?
A clear, easy-to-find, easy-to-read log message sent to the serial console, stating that the system had already been booted and would not consume Ignition configs.

The firstboot kernel cmdline option is too easy to miss. Multiple experienced engineers missed this for multiple days.

How to reproduce it (as minimally and precisely as possible)?
Import an OpenShift OVF template
Boot template, removing its untouched status
Attempt to install a cluster using that image

Anything else we need to know?
This resulted in a significant customer satisfaction escalation. We should absolutely make it easier to see when a template has been booted.

Comment 2 Scott Dodson 2021-07-06 13:34:53 UTC

This is not something that the Installer can address, moving to RHCOS.

Comment 4 Jonathan Lebon 2021-07-06 16:42:10 UTC

This is like another side of https://github.com/coreos/ignition/issues/1214: that issue is concerned with catching users consciously trying to reuse an already booted image by re-injecting `ignition.firstboot`. We can detect this and error out clearly.

But here, we're talking about a user unknowingly using an already booted image. From the point of view of RHCOS itself, in the limit there isn't really any difference between that and just the machine being rebooted. Of course, we can add information to the console to make it clearer. Opened a PR to do this:

https://github.com/coreos/fedora-coreos-config/pull/1086

But we can't really be more strict than that at the RHCOS level.

However, at the installer level, I think it easily could (and should) detect this case directly and clearly error out. The simplest way is to simply check that `ignition.firstboot` is on the kernel cmdline. As such, I'm tentatively moving this back to the installer team.
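A minimal sketch of the cmdline check suggested above, in Python. It assumes the argument appears on the kernel command line either bare (`ignition.firstboot`) or with a value (`ignition.firstboot=...`). The function name `has_firstboot_flag` and the sample command lines are hypothetical, and nothing here reflects how the installer actually works; on a running Linux system the string to test would come from `/proc/cmdline`.

```python
def has_firstboot_flag(cmdline: str) -> bool:
    """Return True if `ignition.firstboot` is among the kernel arguments.

    Assumption: the argument is either bare or carries a value
    (`ignition.firstboot=...`); both forms are accepted.
    """
    for arg in cmdline.split():
        if arg == "ignition.firstboot" or arg.startswith("ignition.firstboot="):
            return True
    return False


# Sample command lines (illustrative only, not captured from a real system).
pristine = "BOOT_IMAGE=/vmlinuz ignition.firstboot ignition.platform.id=vmware"
already_booted = "BOOT_IMAGE=/vmlinuz ignition.platform.id=vmware"

for label, cmdline in (("pristine", pristine), ("already booted", already_booted)):
    state = "present" if has_firstboot_flag(cmdline) else "absent"
    print(f"{label}: ignition.firstboot is {state}")
```

Matching whole arguments rather than substrings avoids false positives from unrelated arguments that merely contain the same text.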

Comment 5 Christopher Wawak 2021-07-06 16:50:46 UTC

> systemd[1]: Reached target Subsequent (Not Ignition) boot complete.
> [...]
> systemd[1]: Started CoreOS: Mount (subsequent) /sysroot.

Luca, just to close the loop, thank you for identifying those logs. I was told to look for the "firstboot" cmdline option, but these logs identify the situation much more clearly.

If these log messages reliably show up in the console every time after the initial boot of a clean image, then Support can write solutions and build tools to identify this. 
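As an illustration of the kind of tool Support could build, here is a small Python sketch that scans captured console output for the two messages quoted above. The marker strings are copied from this comment; whether they appear on every RHCOS version is an assumption, and the function name `find_subsequent_boot_lines` is hypothetical.

```python
from typing import List

# Marker strings taken from the console messages quoted in this comment.
SUBSEQUENT_BOOT_MARKERS = (
    "Subsequent (Not Ignition) boot complete",
    "CoreOS: Mount (subsequent) /sysroot",
)


def find_subsequent_boot_lines(console_log: str) -> List[str]:
    """Return the console lines that indicate a non-first (non-Ignition) boot."""
    return [
        line.strip()
        for line in console_log.splitlines()
        if any(marker in line for marker in SUBSEQUENT_BOOT_MARKERS)
    ]


# Sample console text assembled from the two quoted lines plus filler.
sample_log = """\
systemd[1]: Starting Some Other Unit...
systemd[1]: Reached target Subsequent (Not Ignition) boot complete.
systemd[1]: Started CoreOS: Mount (subsequent) /sysroot.
"""

hits = find_subsequent_boot_lines(sample_log)
if hits:
    print("Image appears to have been booted before:")
    for hit in hits:
        print("  " + hit)
```

An empty result does not prove the image is pristine; it only means neither marker was found in the captured text.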

Finally, I think erroring out when someone unwittingly boots an already booted image is a fine idea if that installation will never complete. That would have sidestepped this issue altogether.

Comment 6 Jonathan Lebon 2021-07-06 17:05:43 UTC

(In reply to Jonathan Lebon from comment #4)
> However, at the installer level, I think it easily could (and should) detect
> this case directly and clearly error out. The simplest way is to simply
> check that `ignition.firstboot` is on the kernel cmdline. As such, I'm
> tentatively moving this back to the installer team.

Hmm, although... this assumes that the baked first boot was done with the installer-provided Ignition (containing the proposed code that would check and e.g. enter emergency.target), which is not necessarily true. And in the second boot, Ignition doesn't run, and so the installer might not be able to access the machine at all.

Does the installer today already have an "I'm booting and setting up" kind of signal which it could cue off of to know whether the machine is actually doing work (e.g. downloading containers) rather than just sitting there idle?

Anyway... I guess feel free to bounce back to the RHCOS component. :)
At least https://github.com/coreos/fedora-coreos-config/pull/1086 (and eventually https://github.com/coreos/ignition/issues/1214) should help.

Comment 7 Russell Teague 2021-08-02 18:06:53 UTC

The installer doesn't have any view of the instance other than what is provided by the cloud platform. We may need to add docs to help users understand the status of the booted host.

Comment 9 Jonathan Lebon 2021-08-04 15:56:17 UTC

I think https://github.com/coreos/fedora-coreos-config/pull/1086 may be enough here. Once this gets into RHCOS, users will see something like this on the console:

    Ignition: ran on 2021/07/16 15:55:05 UTC (at least 2 boots ago)
    Ignition: no config provided by user

I guess we could add docs which tell the user to sanity-check that the Ignition run date on the console is what they expected.
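The sanity check could also be scripted. The Python sketch below extracts the timestamp from the `Ignition: ran on ...` line and compares it with a cutoff chosen by the user. It assumes the console line keeps exactly the format shown in the sample above; the function names and the cutoff date are hypothetical.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Matches the timestamp in e.g. "Ignition: ran on 2021/07/16 15:55:05 UTC (...)".
# Assumption: the format is exactly as in the sample console output.
IGNITION_RAN_RE = re.compile(
    r"Ignition: ran on (\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) UTC"
)


def parse_ignition_run_time(console_text: str) -> Optional[datetime]:
    """Return the Ignition run time found in the console text, or None."""
    match = IGNITION_RAN_RE.search(console_text)
    if match is None:
        return None
    parsed = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def ignition_ran_before(console_text: str, cutoff: datetime) -> Optional[bool]:
    """True if Ignition ran before `cutoff`; None if no timestamp was found."""
    ran_at = parse_ignition_run_time(console_text)
    if ran_at is None:
        return None
    return ran_at < cutoff


console = (
    "Ignition: ran on 2021/07/16 15:55:05 UTC (at least 2 boots ago)\n"
    "Ignition: no config provided by user\n"
)
# Hypothetical cutoff: the moment the current installation attempt started.
install_started = datetime(2021, 8, 1, tzinfo=timezone.utc)
if ignition_ran_before(console, install_started):
    print("Ignition ran before this installation started; the image was already booted.")
```

Returning `None` when no timestamp is found keeps "no information" distinct from "Ignition ran after the cutoff".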

Comment 10 Russell Teague 2021-08-24 17:45:27 UTC

Based on feedback in comment 9, moving this to the docs team.

Comment 12 Servesha 2021-09-14 08:15:56 UTC

Hi Christopher,

Could you please point me to the section/topic with the doc link? It will help me find the right place to add the instructions.

Comment 21 Jonathan Lebon 2021-10-20 15:16:10 UTC

Chatted with Bob OOB. Let's keep this RHBZ to track the documentation change needed.
I've opened a new RHCOS bug which tracks the RHCOS change actually landing in 4.10: https://bugzilla.redhat.com/show_bug.cgi?id=2016004.

Comment 22 Bob Furu 2021-10-20 16:00:22 UTC

Verified with Jonathan that this work is ready for 4.10. Created https://github.com/openshift/openshift-docs/pull/37793 to add the instructions, starting with the OCP 4.10 docs. It is essentially the same as https://github.com/openshift/openshift-docs/pull/37535 (now closed), though I updated the console output date to be closer to the 4.10 timeframe. This is ready to merge/CP to 4.10 upon QE approval and does not need to be backported to older versions. Moving to ON_QA.

Comment 23 Bob Furu 2021-10-28 19:42:55 UTC

Reviews completed. PR merged and cherry-picked to 4.10. Waiting to verify docs are live.

Comment 24 Bob Furu 2021-10-28 23:14:39 UTC

An enterprise-4.10 branch build is not yet available. However, these docs are merged in the main and 4.10 branches (e.g., https://github.com/openshift/openshift-docs/blob/enterprise-4.10/modules/installation-user-infra-machines-iso.adoc) and will be available once that branch is cut (e.g., https://docs.openshift.com/container-platform/4.10/installing/installing_bare_metal/installing-bare-metal.html#installation-user-infra-machines-iso_installing-bare-metal). Closing as NEXT RELEASE.
