Description of problem:
When deploying an OCP 4.6 cluster, the bootstrap VM and master VMs are created and started as expected. However, when I SSH to the bootstrap VM right after it starts, I'm greeted with this:
This is the bootstrap node; it will be destroyed when the master is fully up.
The primary services are release-image.service followed by bootkube.service. To watch their status, run e.g.
journalctl -b -f -u release-image.service -u bootkube.service
Failed Units: 1
Upon examining status of the failed service:
$ systemctl status ignition-firstboot-complete.service
● ignition-firstboot-complete.service - Mark boot complete
Loaded: loaded (/usr/lib/systemd/system/ignition-firstboot-complete.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Fri 2020-08-07 08:52:43 UTC; 47s ago
Process: 1442 ExecStart=/bin/sh -c mount -o remount,rw /boot && if [[ $(uname -m) = s390x ]]; then zipl; fi && rm /boot/ignition.firstboot (code=exited, status=1/FAILURE)
Main PID: 1442 (code=exited, status=1/FAILURE)
Aug 07 08:52:43 localhost systemd: Starting Mark boot complete...
Aug 07 08:52:43 localhost sh: rm: cannot remove '/boot/ignition.firstboot': No such file or directory
Aug 07 08:52:43 localhost systemd: ignition-firstboot-complete.service: Main process exited, code=exited, status=1/FAILURE
Aug 07 08:52:43 localhost systemd: ignition-firstboot-complete.service: Failed with result 'exit-code'.
Aug 07 08:52:43 localhost systemd: Failed to start Mark boot complete.
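The exit status in the log above can be reproduced in isolation. The following is a minimal sketch, not the unit itself: it runs only the final `rm` step of the ExecStart line against a scratch directory (a stand-in for /boot), and the helper name `mark_boot_complete` is made up for illustration.

```shell
# Scratch directory standing in for /boot; the real /boot is never touched.
scratch="$(mktemp -d)"

# Mirrors only the final step of the unit's ExecStart line: plain rm, no -f.
mark_boot_complete() {
    rm "$1/ignition.firstboot"
}

# First run: the stamp file exists, so rm succeeds.
touch "$scratch/ignition.firstboot"
if mark_boot_complete "$scratch"; then first=0; else first=$?; fi

# Second run: the stamp file is already gone, so rm exits non-zero.
if mark_boot_complete "$scratch" 2>/dev/null; then second=0; else second=$?; fi

echo "first=$first second=$second"
rm -rf "$scratch"
```

Because the commands are chained with `&&`, a non-zero status from `rm` becomes the status of the whole `sh -c` invocation, which matches the status=1/FAILURE that systemd reports.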
I tried to restart the service, but to no avail. When you examine the journalctl logs of release-image and bootkube, you see nothing:
$ journalctl -b -f -u release-image.service -u bootkube.service
-- Logs begin at Fri 2020-08-07 08:49:48 UTC. --
It stays like this until the installation fails.
Version-Release number of the following components:
How reproducible:
Reproduced twice out of two attempts
Steps to Reproduce:
1. Run openshift-install create cluster
2. Wait for bootstrap VM to be started
3. SSH to bootstrap VM
Actual results:
Bootstrap process does not start
I think I also found a workaround. Here are the steps I conducted to revive the installation process:
1. touch /boot/ignition.firstboot
2. systemctl restart ignition-firstboot-complete.service
3. systemctl restart release-image.service bootkube.service
After that, the installation continued normally.
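The three steps above can be wrapped in a small script. This is only a sketch of the same commands; the `DRY_RUN` switch, the `run` helper, and the optional boot-directory argument are illustrative additions that are not part of the report. It defaults to printing the commands rather than executing them.

```shell
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0
# and run as root on the bootstrap node to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# revive_bootstrap [BOOT_DIR]
revive_bootstrap() {
    boot_dir="${1:-/boot}"
    # 1. Recreate the stamp file that the failed unit tries to remove.
    run touch "$boot_dir/ignition.firstboot"
    # 2. Re-run the unit so that its rm step can succeed this time.
    run systemctl restart ignition-firstboot-complete.service
    # 3. Restart the services that drive the bootstrap process.
    run systemctl restart release-image.service bootkube.service
}
```

Calling `revive_bootstrap` with no arguments prints the three commands exactly as listed in the steps.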
Changing the severity to urgent, as the workaround I described does not seem to work completely. While it restarts the bootstrap process, bootstrapping eventually gets stuck at this point:
Aug 07 09:44:42 <hostname> bootkube.sh: [#162] failed to create some manifests:
Aug 07 09:44:42 <hostname> bootkube.sh: "99_openshift-cluster-api_worker-machineset-0.yaml": unable to get REST mapping for "99_openshift-cluster-api_worker-machineset-0.yaml": no matches for kind "MachineSet" in version "machine.openshift.io/v1beta1"
Aug 07 09:44:42 <hostname> bootkube.sh: Created "99_openshift-cluster-api_worker-machineset-0.yaml" machinesets.v1beta1.machine.openshift.io/primary-lxgq4-worker-0 -n openshift-machine-api
Aug 07 09:44:42 <hostname> bootkube.sh: Updated status for "99_openshift-cluster-api_worker-machineset-0.yaml" machinesets.v1beta1.machine.openshift.io/primary-lxgq4-worker-0 -n openshift-machine-api
I tried it twice and got stuck here both times. The full journalctl output for bootkube.service and release-image.service is attached.
I couldn't reproduce this issue on the master release; the installation passes correctly.
The release version that worked for me is 4.6.0-0.nightly-2020-08-07-202945.
Can you please repeat your test with the latest RHCOS template and release?
Hi Evgeny, I can reproduce the issue with 4.6.0-0.nightly-2020-08-09-151434. It's also been reproduced by Roberto (https://coreos.slack.com/archives/C68TNFWA2/p1596802763312800) and Brendan (https://coreos.slack.com/archives/CNSJG0ZED/p1597053057237100)
I confirm that I had the same problem with the 2020-08-09 nightly.
We are also hitting this issue in 4.6 CI.
Looks like mine was actually:
rather than: 4.6.0-0.nightly-2020-08-09-151434
built from commit d36a3719da1ee43da5691d90ac51afc190d9b708
release image quay.io/openshift-release-dev/ocp-release-nightly@sha256:a64554cb6ff8a61d7509c9994716b1908f40426aee33681af98f170a20190688
According to the comment in the systemd file, it should only be running if that file exists:
It seems as though it ran again after the first boot, so the file was already gone?
[core@ocp4-lhcnb-worker-0-xx7cb ~]$ cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-779c2970065f2dd6eb8ec1f73e5e8863f4d8fa91be60790dee6af70450d85c2a/vmlinuz-4.18.0-211.el8.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 rd.luks.options=discard ostree=/ostree/boot.0/rhcos/779c2970065f2dd6eb8ec1f73e5e8863f4d8fa91be60790dee6af70450d85c2a/0 ignition.platform.id=openstack
[core@ocp4-lhcnb-worker-0-xx7cb ~]$ sudo journalctl -u ignition-firstboot-complete
-- Logs begin at Mon 2020-08-10 02:02:21 UTC, end at Mon 2020-08-10 12:57:32 UTC. --
Aug 10 02:03:07 localhost systemd: Starting Mark boot complete...
Aug 10 02:03:08 localhost systemd: Started Mark boot complete.
-- Reboot --
Aug 10 09:25:13 ocp4-lhcnb-worker-0-xx7cb systemd: Starting Mark boot complete...
Aug 10 09:25:13 ocp4-lhcnb-worker-0-xx7cb sh: rm: cannot remove '/boot/ignition.firstboot': No such file or directory
Aug 10 09:25:13 ocp4-lhcnb-worker-0-xx7cb systemd: ignition-firstboot-complete.service: Main process exited, code=exited, status=1/FAILURE
Aug 10 09:25:13 ocp4-lhcnb-worker-0-xx7cb systemd: ignition-firstboot-complete.service: Failed with result 'exit-code'.
Aug 10 09:25:13 ocp4-lhcnb-worker-0-xx7cb systemd: Failed to start Mark boot complete.
Aug 10 09:25:13 ocp4-lhcnb-worker-0-xx7cb systemd: ignition-firstboot-complete.service: Consumed 28ms CPU time
I just rebooted again and it doesn't try to start again. So just something with that first boot.
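For anyone else checking a node, here is a rough diagnostic sketch that reports the two first-boot signals side by side. It assumes the unit is gated on an `ignition.firstboot` kernel argument, which the `/proc/cmdline` check above hints at but which is not confirmed in this report. The function name `firstboot_state` and its arguments are made up for illustration.

```shell
# firstboot_state [CMDLINE_FILE] [STAMP_FILE]
# Prints whether the (assumed) ignition.firstboot kernel argument and the
# /boot/ignition.firstboot stamp file are present.
firstboot_state() {
    cmdline_file="${1:-/proc/cmdline}"
    stamp_file="${2:-/boot/ignition.firstboot}"
    karg=absent
    stamp=absent
    # Match the argument as a whole word, with or without an "=value" suffix,
    # so that e.g. ignition.platform.id=... does not count.
    if grep -Eq '(^| )ignition\.firstboot(=| |$)' "$cmdline_file"; then
        karg=present
    fi
    if [ -e "$stamp_file" ]; then
        stamp=present
    fi
    echo "karg=$karg stamp=$stamp"
}
```

Run as `firstboot_state` on the node. With the command line shown above, the kernel-argument half would report `absent`.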
I can confirm that when using `OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE` with a custom (latest RHCOS) template on our CI, the installation on 4.6 passes.
I can confirm that this does not reproduce with 4.6.0-0.nightly-2020-08-12-062953
Seeing this with 4.5.8 without `OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE` specified, on a vSphere IPI install attempt. See https://access.redhat.com/support/cases/#/case/02715714/discussion?attachmentId=a092K000025L2bZQAS
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.