Bug 1867052 - ignition-firstboot-complete.service fails on bootstrap machine
Summary: ignition-firstboot-complete.service fails on bootstrap machine
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Evgeny Slutsky
QA Contact: Lucie Leistnerova
URL:
Whiteboard: non-multi-arch
Depends On: 1867853
Blocks:
 
Reported: 2020-08-07 09:11 UTC by Jan Zmeskal
Modified: 2020-10-27 16:26 UTC
4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:25:54 UTC
Target Upstream Version:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:26:16 UTC

Description Jan Zmeskal 2020-08-07 09:11:58 UTC
Description of problem:
When deploying an OCP 4.6 cluster, the bootstrap VM and master VMs are created and started as expected. However, when I SSH to the bootstrap VM right after it starts, I'm greeted with this:

---
This is the bootstrap node; it will be destroyed when the master is fully up.

The primary services are release-image.service followed by bootkube.service. To watch their status, run e.g.

  journalctl -b -f -u release-image.service -u bootkube.service
[systemd]
Failed Units: 1
  ignition-firstboot-complete.service

Upon examining status of the failed service:

$ systemctl status ignition-firstboot-complete.service
● ignition-firstboot-complete.service - Mark boot complete
   Loaded: loaded (/usr/lib/systemd/system/ignition-firstboot-complete.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Fri 2020-08-07 08:52:43 UTC; 47s ago
     Docs: https://github.com/coreos/ignition
  Process: 1442 ExecStart=/bin/sh -c mount -o remount,rw /boot &&       if [[ $(uname -m) = s390x ]]; then zipl; fi &&          rm /boot/ignition.firstboot (code=exited, status=1/FAILURE)
 Main PID: 1442 (code=exited, status=1/FAILURE)

Aug 07 08:52:43 localhost systemd[1]: Starting Mark boot complete...
Aug 07 08:52:43 localhost sh[1442]: rm: cannot remove '/boot/ignition.firstboot': No such file or directory
Aug 07 08:52:43 localhost systemd[1]: ignition-firstboot-complete.service: Main process exited, code=exited, status=1/FAILURE
Aug 07 08:52:43 localhost systemd[1]: ignition-firstboot-complete.service: Failed with result 'exit-code'.
Aug 07 08:52:43 localhost systemd[1]: Failed to start Mark boot complete.
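The status=1 comes from the final `rm` in the unit's ExecStart chain: without `-f`, `rm` exits non-zero when its target is already gone. A minimal demonstration of that exit-code difference (using a temporary path rather than the real /boot marker):

```shell
# Show why the unit's ExecStart fails when the marker file is already gone:
# plain `rm` on a missing file exits 1, while `rm -f` exits 0.
missing=/tmp/ignition.firstboot.demo

rm -f "$missing"            # make sure the file does not exist
rm "$missing" 2>/dev/null   # mimics the unit's `rm /boot/ignition.firstboot`
plain_rc=$?

rm -f "$missing"            # -f treats a missing operand as success
force_rc=$?

echo "rm=$plain_rc rm-f=$force_rc"
# prints: rm=1 rm-f=0
```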

I tried to restart the service, but to no avail. When you examine the journalctl logs of release-image and bootkube, you see nothing:

$   journalctl -b -f -u release-image.service -u bootkube.service
-- Logs begin at Fri 2020-08-07 08:49:48 UTC. --

It stays like this until the installation fails.


Version-Release number of the following components:
4.6.0-0.nightly-2020-08-07-034746

How reproducible:
Reproduced twice out of two attempts

Steps to Reproduce:
1. Run openshift-install create cluster
2. Wait for bootstrap VM to be started
3. SSH to bootstrap VM

Actual results:
Bootstrap process does not start

Additional info:
I think I also found a workaround. Here are the steps I took to revive the installation process:
1. touch /boot/ignition.firstboot
2. systemctl restart ignition-firstboot-complete.service
3. systemctl restart release-image.service bootkube.service

After that, the installation continued normally.
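The workaround steps above can be wrapped in a small helper. This is only a sketch: the marker path is passed as an argument so the function can be exercised outside a real bootstrap node, and the `systemctl` restarts are left as comments because they require root on the target VM.

```shell
# Recreate the firstboot marker that ignition-firstboot-complete.service
# tries to remove, so a restart of the unit can succeed.
restore_firstboot_marker() {
    marker="$1"
    touch "$marker" || return 1
    # On the bootstrap VM the real marker is /boot/ignition.firstboot,
    # and the follow-up (as root) is:
    #   systemctl restart ignition-firstboot-complete.service
    #   systemctl restart release-image.service bootkube.service
}

restore_firstboot_marker /tmp/ignition.firstboot.test
```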

Comment 2 Jan Zmeskal 2020-08-07 09:56:49 UTC
Changing the severity to urgent, as the workaround I indicated does not seem to work completely. While it restarts the bootstrap process, the bootstrapping eventually gets stuck at this point:

Aug 07 09:44:42 <hostname> bootkube.sh[2388]: [#162] failed to create some manifests:
Aug 07 09:44:42 <hostname> bootkube.sh[2388]: "99_openshift-cluster-api_worker-machineset-0.yaml": unable to get REST mapping for "99_openshift-cluster-api_worker-machineset-0.yaml": no matches for kind "MachineSet" in version "machine.openshift.io/v1beta1"
Aug 07 09:44:42 <hostname> bootkube.sh[2388]: Created "99_openshift-cluster-api_worker-machineset-0.yaml" machinesets.v1beta1.machine.openshift.io/primary-lxgq4-worker-0 -n openshift-machine-api
Aug 07 09:44:42 <hostname> bootkube.sh[2388]: Updated status for "99_openshift-cluster-api_worker-machineset-0.yaml" machinesets.v1beta1.machine.openshift.io/primary-lxgq4-worker-0 -n openshift-machine-api

I tried it twice and got stuck here both times. Full journalctl output for bootkube.service and release-image.service is attached.

Comment 4 Evgeny Slutsky 2020-08-09 11:50:45 UTC
I couldn't reproduce this issue on the master release; the installation passes correctly.

Comment 5 Evgeny Slutsky 2020-08-09 11:53:58 UTC
the release version that worked for me is 4.6.0-0.nightly-2020-08-07-202945

Comment 6 Evgeny Slutsky 2020-08-10 10:40:26 UTC
@jan,
can you please repeat your test with the latest RHCOS template and release?

Comment 7 Jan Zmeskal 2020-08-10 11:27:39 UTC
Hi Evgeny, I can reproduce the issue with 4.6.0-0.nightly-2020-08-09-151434. It's also been reproduced by Roberto (https://coreos.slack.com/archives/C68TNFWA2/p1596802763312800) and Brendan (https://coreos.slack.com/archives/CNSJG0ZED/p1597053057237100)

Comment 8 Roberto Ciatti 2020-08-10 11:39:43 UTC
I confirm that I had the same problem with the 2020-08-09 nightly.

Comment 9 Evgeny Slutsky 2020-08-10 12:26:50 UTC
We are also hitting this issue in 4.6 CI.

Comment 10 Brendan Shephard 2020-08-10 13:08:26 UTC
Looks like mine was actually: 
4.6.0-0.nightly-2020-08-07-202945

rather than: 4.6.0-0.nightly-2020-08-09-151434

For reference:
./openshift-install 4.6.0-0.nightly-2020-08-07-202945
built from commit d36a3719da1ee43da5691d90ac51afc190d9b708
release image quay.io/openshift-release-dev/ocp-release-nightly@sha256:a64554cb6ff8a61d7509c9994716b1908f40426aee33681af98f170a20190688

According to the comment in the systemd file, it should only be running if that file exists:
https://github.com/coreos/ignition-dracut/blob/master/systemd/ignition-firstboot-complete.service#L17-L19

It seems as though it ran again after the first boot, so the file was already gone?
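The guard referenced in that link is systemd's standard mechanism for gating a unit on a file's existence. A sketch of what such a condition looks like (illustrative only, not the exact shipped unit file):

```ini
[Unit]
Description=Mark boot complete
Documentation=https://github.com/coreos/ignition
# Only run on the boot where Ignition left its firstboot marker behind.
# If the marker was already consumed on an earlier boot, this condition
# skips the unit; without it, the plain `rm` in ExecStart fails exactly
# as in the logs above.
ConditionPathExists=/boot/ignition.firstboot
```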


[core@ocp4-lhcnb-worker-0-xx7cb ~]$ cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-779c2970065f2dd6eb8ec1f73e5e8863f4d8fa91be60790dee6af70450d85c2a/vmlinuz-4.18.0-211.el8.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 rd.luks.options=discard ostree=/ostree/boot.0/rhcos/779c2970065f2dd6eb8ec1f73e5e8863f4d8fa91be60790dee6af70450d85c2a/0 ignition.platform.id=openstack
[core@ocp4-lhcnb-worker-0-xx7cb ~]$ sudo journalctl -u ignition-firstboot-complete
-- Logs begin at Mon 2020-08-10 02:02:21 UTC, end at Mon 2020-08-10 12:57:32 UTC. --
Aug 10 02:03:07 localhost systemd[1]: Starting Mark boot complete...
Aug 10 02:03:08 localhost systemd[1]: Started Mark boot complete.
-- Reboot --
Aug 10 09:25:13 ocp4-lhcnb-worker-0-xx7cb systemd[1]: Starting Mark boot complete...
Aug 10 09:25:13 ocp4-lhcnb-worker-0-xx7cb sh[1391]: rm: cannot remove '/boot/ignition.firstboot': No such file or directory
Aug 10 09:25:13 ocp4-lhcnb-worker-0-xx7cb systemd[1]: ignition-firstboot-complete.service: Main process exited, code=exited, status=1/FAILURE
Aug 10 09:25:13 ocp4-lhcnb-worker-0-xx7cb systemd[1]: ignition-firstboot-complete.service: Failed with result 'exit-code'.
Aug 10 09:25:13 ocp4-lhcnb-worker-0-xx7cb systemd[1]: Failed to start Mark boot complete.
Aug 10 09:25:13 ocp4-lhcnb-worker-0-xx7cb systemd[1]: ignition-firstboot-complete.service: Consumed 28ms CPU time


I just rebooted again and it doesn't try to start again. So it's just something with that first boot.

Comment 11 Evgeny Slutsky 2020-08-11 11:17:06 UTC
I can confirm that when using `OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE` with a custom (latest RHCOS) template on our CI, the 4.6 installation passes.

Comment 14 Jan Zmeskal 2020-08-12 08:47:43 UTC
I can confirm that this does not reproduce with 4.6.0-0.nightly-2020-08-12-062953.

Comment 15 Louis Santillan 2020-09-09 04:56:01 UTC
Seeing this with 4.5.8 without `OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE` specified on a vSphere IPI install attempt. See https://access.redhat.com/support/cases/#/case/02715714/discussion?attachmentId=a092K000025L2bZQAS

Comment 17 errata-xmlrpc 2020-10-27 16:25:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

