Bug 1867052

Summary: ignition-firstboot-complete.service fails on bootstrap machine
Product: OpenShift Container Platform
Component: Installer
Installer sub component: OpenShift on RHV
Reporter: Jan Zmeskal <jzmeskal>
Assignee: Evgeny Slutsky <eslutsky>
QA Contact: Lucie Leistnerova <lleistne>
CC: bshephar, danili, eslutsky, lsantill
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Version: 4.6
Target Release: 4.6.0
Keywords: TestBlockerForLayeredProduct
Hardware: Unspecified
OS: Unspecified
Whiteboard: non-multi-arch
Type: Bug
Bug Depends On: 1867853
Last Closed: 2020-10-27 16:25:54 UTC

Description Jan Zmeskal 2020-08-07 09:11:58 UTC
Description of problem:
When deploying an OCP 4.6 cluster, the bootstrap VM and master VMs are created and started as expected. However, when I SSH to the bootstrap VM right after it starts, I'm greeted with this:

---
This is the bootstrap node; it will be destroyed when the master is fully up.

The primary services are release-image.service followed by bootkube.service. To watch their status, run e.g.

  journalctl -b -f -u release-image.service -u bootkube.service
[systemd]
Failed Units: 1
  ignition-firstboot-complete.service

Upon examining the status of the failed service:

$ systemctl status ignition-firstboot-complete.service
● ignition-firstboot-complete.service - Mark boot complete
   Loaded: loaded (/usr/lib/systemd/system/ignition-firstboot-complete.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Fri 2020-08-07 08:52:43 UTC; 47s ago
     Docs: https://github.com/coreos/ignition
  Process: 1442 ExecStart=/bin/sh -c mount -o remount,rw /boot &&       if [[ $(uname -m) = s390x ]]; then zipl; fi &&          rm /boot/ignition.firstboot (code=exited, status=1/FAILURE)
 Main PID: 1442 (code=exited, status=1/FAILURE)

Aug 07 08:52:43 localhost systemd[1]: Starting Mark boot complete...
Aug 07 08:52:43 localhost sh[1442]: rm: cannot remove '/boot/ignition.firstboot': No such file or directory
Aug 07 08:52:43 localhost systemd[1]: ignition-firstboot-complete.service: Main process exited, code=exited, status=1/FAILURE
Aug 07 08:52:43 localhost systemd[1]: ignition-firstboot-complete.service: Failed with result 'exit-code'.
Aug 07 08:52:43 localhost systemd[1]: Failed to start Mark boot complete.
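
For reference, the unit's ExecStart above amounts to the following shell logic; shown here with comments, plus an idempotent `rm -f` variant that would tolerate a missing flag file (an illustration for comparison, not the shipped fix):

  # What the service runs (simplified from the ExecStart above;
  # the zipl step applies only on s390x):
  mount -o remount,rw /boot && \
    if [[ $(uname -m) = s390x ]]; then zipl; fi && \
    rm /boot/ignition.firstboot    # fails with status 1 when the file is absent

  # A tolerant variant for comparison: rm -f exits 0 on a missing file,
  # so a second (erroneous) invocation of the unit would not fail.
  mount -o remount,rw /boot && \
    if [[ $(uname -m) = s390x ]]; then zipl; fi && \
    rm -f /boot/ignition.firstboot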

I tried to restart the service, but to no avail. When you examine the journalctl logs of release-image and bootkube, you see nothing:

$   journalctl -b -f -u release-image.service -u bootkube.service
-- Logs begin at Fri 2020-08-07 08:49:48 UTC. --

It stays like this until the installation fails.


Version-Release number of the following components:
4.6.0-0.nightly-2020-08-07-034746

How reproducible:
Reproduced twice out of two attempts

Steps to Reproduce:
1. Run openshift-install create cluster
2. Wait for the bootstrap VM to start
3. SSH to the bootstrap VM

Actual results:
Bootstrap process does not start

Additional info:
I think I also found a workaround. Here are the steps I took to revive the installation process:
1. touch /boot/ignition.firstboot
2. systemctl restart ignition-firstboot-complete.service
3. systemctl restart release-image.service bootkube.service

After that, the installation continued normally.
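
For convenience, the same workaround as one shell session (run as root on the bootstrap VM; these are exactly the commands from the steps above):

  # Recreate the flag file that the failed unit expects to remove
  touch /boot/ignition.firstboot
  # Re-run the unit so it can complete successfully this time
  systemctl restart ignition-firstboot-complete.service
  # Kick off the bootstrap services again
  systemctl restart release-image.service bootkube.service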

Comment 2 Jan Zmeskal 2020-08-07 09:56:49 UTC
Changing the severity to urgent as the workaround I indicated does not seem to work completely. While it restarts the bootstrap process, the bootstrapping eventually gets stuck at this point:

Aug 07 09:44:42 <hostname> bootkube.sh[2388]: [#162] failed to create some manifests:
Aug 07 09:44:42 <hostname> bootkube.sh[2388]: "99_openshift-cluster-api_worker-machineset-0.yaml": unable to get REST mapping for "99_openshift-cluster-api_worker-machineset-0.yaml": no matches for kind "MachineSet" in version "machine.openshift.io/v1beta1"
Aug 07 09:44:42 <hostname> bootkube.sh[2388]: Created "99_openshift-cluster-api_worker-machineset-0.yaml" machinesets.v1beta1.machine.openshift.io/primary-lxgq4-worker-0 -n openshift-machine-api
Aug 07 09:44:42 <hostname> bootkube.sh[2388]: Updated status for "99_openshift-cluster-api_worker-machineset-0.yaml" machinesets.v1beta1.machine.openshift.io/primary-lxgq4-worker-0 -n openshift-machine-api

I tried it twice and got stuck here both times. Full journalctl output for bootkube.service and release-image.service is attached.

Comment 4 Evgeny Slutsky 2020-08-09 11:50:45 UTC
I couldn't reproduce this issue on the master release; the installation passes correctly.

Comment 5 Evgeny Slutsky 2020-08-09 11:53:58 UTC
The release version that worked for me is 4.6.0-0.nightly-2020-08-07-202945.

Comment 6 Evgeny Slutsky 2020-08-10 10:40:26 UTC
@jan,
can you please repeat your test with the latest RHCOS template and release?

Comment 7 Jan Zmeskal 2020-08-10 11:27:39 UTC
Hi Evgeny, I can reproduce the issue with 4.6.0-0.nightly-2020-08-09-151434. It's also been reproduced by Roberto (https://coreos.slack.com/archives/C68TNFWA2/p1596802763312800) and Brendan (https://coreos.slack.com/archives/CNSJG0ZED/p1597053057237100).

Comment 8 Roberto Ciatti 2020-08-10 11:39:43 UTC
I confirm that I had the same problem with the 2020-08-09 nightly.

Comment 9 Evgeny Slutsky 2020-08-10 12:26:50 UTC
We are also hitting this issue in 4.6 CI.

Comment 10 Brendan Shephard 2020-08-10 13:08:26 UTC
Looks like mine was actually: 
4.6.0-0.nightly-2020-08-07-202945

rather than: 4.6.0-0.nightly-2020-08-09-151434

For reference:
./openshift-install 4.6.0-0.nightly-2020-08-07-202945
built from commit d36a3719da1ee43da5691d90ac51afc190d9b708
release image quay.io/openshift-release-dev/ocp-release-nightly@sha256:a64554cb6ff8a61d7509c9994716b1908f40426aee33681af98f170a20190688

According to the comment in the systemd file, it should only be running if that file exists:
https://github.com/coreos/ignition-dracut/blob/master/systemd/ignition-firstboot-complete.service#L17-L19

It seems as though it ran again after the first boot, so the file was already gone?
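
A minimal shell sketch of the guard described in that comment (an assumption based on the linked lines, not the verbatim unit): the removal should be a no-op unless the firstboot flag file is still present:

  # Hypothetical guard mirroring the cited comment: only mark the boot
  # complete while /boot/ignition.firstboot still exists.
  if [ -e /boot/ignition.firstboot ]; then
      mount -o remount,rw /boot && rm /boot/ignition.firstboot
  fi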


[core@ocp4-lhcnb-worker-0-xx7cb ~]$ cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-779c2970065f2dd6eb8ec1f73e5e8863f4d8fa91be60790dee6af70450d85c2a/vmlinuz-4.18.0-211.el8.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 rd.luks.options=discard ostree=/ostree/boot.0/rhcos/779c2970065f2dd6eb8ec1f73e5e8863f4d8fa91be60790dee6af70450d85c2a/0 ignition.platform.id=openstack
[core@ocp4-lhcnb-worker-0-xx7cb ~]$ sudo journalctl -u ignition-firstboot-complete
-- Logs begin at Mon 2020-08-10 02:02:21 UTC, end at Mon 2020-08-10 12:57:32 UTC. --
Aug 10 02:03:07 localhost systemd[1]: Starting Mark boot complete...
Aug 10 02:03:08 localhost systemd[1]: Started Mark boot complete.
-- Reboot --
Aug 10 09:25:13 ocp4-lhcnb-worker-0-xx7cb systemd[1]: Starting Mark boot complete...
Aug 10 09:25:13 ocp4-lhcnb-worker-0-xx7cb sh[1391]: rm: cannot remove '/boot/ignition.firstboot': No such file or directory
Aug 10 09:25:13 ocp4-lhcnb-worker-0-xx7cb systemd[1]: ignition-firstboot-complete.service: Main process exited, code=exited, status=1/FAILURE
Aug 10 09:25:13 ocp4-lhcnb-worker-0-xx7cb systemd[1]: ignition-firstboot-complete.service: Failed with result 'exit-code'.
Aug 10 09:25:13 ocp4-lhcnb-worker-0-xx7cb systemd[1]: Failed to start Mark boot complete.
Aug 10 09:25:13 ocp4-lhcnb-worker-0-xx7cb systemd[1]: ignition-firstboot-complete.service: Consumed 28ms CPU time


I just rebooted again and it doesn't try to start again. So it's just something with that first boot.

Comment 11 Evgeny Slutsky 2020-08-11 11:17:06 UTC
I can confirm that when using `OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE` with a custom (latest RHCOS) template on our CI, the installation on 4.6 passes.
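
For anyone reproducing that CI run, the override is passed to the installer as an environment variable; a sketch, where the image location is a placeholder rather than the actual template used in CI:

  # Point the installer at a newer RHCOS image/template (placeholder value)
  export OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE=<location-of-latest-rhcos-image>
  openshift-install create cluster --dir ./install-dir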

Comment 14 Jan Zmeskal 2020-08-12 08:47:43 UTC
I can confirm that this does not reproduce with 4.6.0-0.nightly-2020-08-12-062953.

Comment 15 Louis Santillan 2020-09-09 04:56:01 UTC
Seeing this with 4.5.8 without `OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE` specified on a vSphere IPI install attempt. See https://access.redhat.com/support/cases/#/case/02715714/discussion?attachmentId=a092K000025L2bZQAS

Comment 17 errata-xmlrpc 2020-10-27 16:25:54 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196