Version: OCP 4.6
Platform: vmware
Please specify: UPI

What happened?
Someone in the customer's environment booted a template used for installing OpenShift, and it was difficult to determine during troubleshooting that the system had already been booted. Support engineers missed this for days while the case ran, ultimately testing the customer's patience. Eventually one person noticed that the "firstboot" kernel command line option was gone, signifying that the template had been booted.

What did you expect to happen?
A clear, easy-to-find and easy-to-read log message sent to the serial console stating that the system had already booted and would not be consuming Ignition configs. The firstboot kernel cmdline option is too easy to miss. Multiple experienced engineers missed this for multiple days.

How to reproduce it (as minimally and precisely as possible)?
1. Import an OpenShift OVF template
2. Boot the template, removing its untouched status
3. Attempt to install a cluster using that image

Anything else we need to know?
This resulted in a significant customer satisfaction escalation. We should absolutely make it easier to see when a template has been booted.
This is not something that the Installer can address, moving to RHCOS.
This is like another side of https://github.com/coreos/ignition/issues/1214: that issue is concerned with catching users consciously trying to reuse an already booted image by re-injecting `ignition.firstboot`. We can detect this and error out clearly.

But here, we're talking about a user unknowingly using an already booted image. From the point of view of RHCOS itself, in the limit there isn't really any difference between that and just the machine being rebooted. Of course, we can add information to the console to make it clearer. Opened a PR to do this: https://github.com/coreos/fedora-coreos-config/pull/1086

But we can't really be more strict than that at the RHCOS level. However, at the installer level, I think it easily could (and should) detect this case directly and clearly error out. The simplest way is to simply check that `ignition.firstboot` is on the kernel cmdline. As such, I'm tentatively moving this back to the installer team.
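For illustration, the kernel cmdline check described above could look roughly like the following. This is only a sketch: the function names are hypothetical, and accepting both a bare `ignition.firstboot` and an `ignition.firstboot=<value>` form is an assumption rather than something confirmed in this bug. Where and when such a check would actually run is left open.

```python
from pathlib import Path


def has_firstboot_flag(cmdline: str) -> bool:
    """Return True if `ignition.firstboot` is present as a kernel argument.

    Assumption: the argument may appear bare or with an `=value` suffix.
    """
    for arg in cmdline.split():
        key = arg.split("=", 1)[0]
        if key == "ignition.firstboot":
            return True
    return False


def running_system_has_firstboot(path: str = "/proc/cmdline") -> bool:
    """Check the kernel cmdline of the running Linux system."""
    return has_firstboot_flag(Path(path).read_text())
```

For example, `has_firstboot_flag("BOOT_IMAGE=/vmlinuz ignition.platform.id=vmware")` returns `False`, which is the state a previously booted template would be in.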
> systemd[1]: Reached target Subsequent (Not Ignition) boot complete.
> [...]
> systemd[1]: Started CoreOS: Mount (subsequent) /sysroot.

Luca, just to close the loop, thank you for identifying those logs. What I was told was to look for the "firstboot" cmdline option, whereas these logs identify the situation much more clearly. If these log messages reliably show up in the console on every boot after the initial boot of a clean image, then Support can write solutions and build tools to identify this.

Finally, I think erroring out in the case of unwittingly booting an already booted image is a fine idea if that install will never complete. That would have sidestepped this issue altogether.
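As a rough idea of what such a Support tool might do, the sketch below scans captured console output for the two messages quoted above. The marker strings are taken from those quoted lines. The function name is hypothetical, and whether these messages appear on every subsequent boot is exactly the open question raised in this comment.

```python
from typing import List

# Substrings taken from the console lines quoted in this comment.
SUBSEQUENT_BOOT_MARKERS = (
    "Subsequent (Not Ignition) boot complete",
    "Mount (subsequent) /sysroot",
)


def find_subsequent_boot_lines(console_log: str) -> List[str]:
    """Return the console lines that indicate a non-first boot."""
    return [
        line.strip()
        for line in console_log.splitlines()
        if any(marker in line for marker in SUBSEQUENT_BOOT_MARKERS)
    ]
```

A non-empty result would suggest the image had been booted before and that Ignition did not run on this boot.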
(In reply to Jonathan Lebon from comment #4)
> However, at the installer level, I think it easily could (and should) detect
> this case directly and clearly error out. The simplest way is to simply
> check that `ignition.firstboot` is on the kernel cmdline. As such, I'm
> tentatively moving this back to the installer team.

Hmm, although... this assumes that the baked first boot was done with the installer-provided Ignition config (containing the proposed code that would check and e.g. enter emergency.target), which is not necessarily true. And on the second boot, Ignition doesn't run, so the installer might not be able to access the machine at all.

Does the installer today already have an "I'm booting and setting up" kind of signal which it could cue off of to know whether the machine isn't just sitting there idle but is actually e.g. downloading containers, etc.?

Anyway... I guess feel free to bounce back to the RHCOS component. :) At least https://github.com/coreos/fedora-coreos-config/pull/1086 (and eventually https://github.com/coreos/ignition/issues/1214) should help.
The installer doesn't have any view of the instance other than what is provided by the cloud platform. We may need to add docs to help users understand the status of the booted host.
I think https://github.com/coreos/fedora-coreos-config/pull/1086 may be enough here. Once this gets into RHCOS, users will see something like this on the console:

    Ignition: ran on 2021/07/16 15:55:05 UTC (at least 2 boots ago)
    Ignition: no config provided by user

I guess we could add docs which tell the user to sanity-check that the Ignition run date on the console is what they expected.
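If someone wanted to automate that sanity check, a sketch along these lines could pull the run date out of captured console text. It assumes the line format is exactly as in the sample above. The names `IgnitionRun` and `parse_ignition_run` are hypothetical, and any other wording inside the parentheses is simply returned as-is rather than interpreted.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple, Optional


class IgnitionRun(NamedTuple):
    ran_at: datetime          # when Ignition ran, in UTC
    note: str                 # raw text inside the parentheses
    boots_ago: Optional[int]  # N from "at least N boots ago", if present


# Assumed format, based only on the sample console output in this comment.
_RAN_ON = re.compile(
    r"Ignition: ran on (\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) UTC \(([^)]*)\)"
)
_BOOTS_AGO = re.compile(r"at least (\d+) boots? ago")


def parse_ignition_run(console_text: str) -> Optional[IgnitionRun]:
    """Extract the first 'Ignition: ran on ...' line, or None if absent."""
    match = _RAN_ON.search(console_text)
    if match is None:
        return None
    ran_at = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S")
    ran_at = ran_at.replace(tzinfo=timezone.utc)
    note = match.group(2)
    boots = _BOOTS_AGO.search(note)
    return IgnitionRun(ran_at, note, int(boots.group(1)) if boots else None)
```

A caller could then compare `ran_at` against the time the machine was provisioned and flag anything where `boots_ago` is set.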
Based on feedback in comment 9, moving this to the docs team.
Hi Christopher, Could you please point me to the section/topic with the doc link? It will help me find the right place to add the instructions.
Chatted with Bob OOB. Let's keep this RHBZ to track the documentation change needed. I've opened a new RHCOS bug which tracks the RHCOS change actually landing in 4.10: https://bugzilla.redhat.com/show_bug.cgi?id=2016004.
Verified with Jonathan that this work is ready for 4.10. Created https://github.com/openshift/openshift-docs/pull/37793 to add the instructions, starting with the OCP 4.10 docs. It is essentially the same as https://github.com/openshift/openshift-docs/pull/37535 (now closed), though I updated the console output date to be closer to the 4.10 timeframe. This is ready to merge/CP to 4.10 upon QE approval and does not need to be backported to older versions. Moving to ON_QA.
Reviews completed. PR merged and cherry-picked to 4.10. Waiting to verify that the docs are live.
An enterprise-4.10 branch build is not yet available. However, these docs are merged in the main and 4.10 branches (e.g., https://github.com/openshift/openshift-docs/blob/enterprise-4.10/modules/installation-user-infra-machines-iso.adoc) and will be available once that branch is cut (e.g., https://docs.openshift.com/container-platform/4.10/installing/installing_bare_metal/installing-bare-metal.html#installation-user-infra-machines-iso_installing-bare-metal). Closing as NEXT RELEASE.