Bug 1867052
Summary: | ignition-firstboot-complete.service fails on bootstrap machine | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Jan Zmeskal <jzmeskal> |
Component: | Installer | Assignee: | Evgeny Slutsky <eslutsky> |
Installer sub component: | OpenShift on RHV | QA Contact: | Lucie Leistnerova <lleistne> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | urgent | ||
Priority: | urgent | CC: | bshephar, danili, eslutsky, lsantill |
Version: | 4.6 | Keywords: | TestBlockerForLayeredProduct |
Target Milestone: | --- | ||
Target Release: | 4.6.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | non-multi-arch | ||
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-10-27 16:25:54 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1867853 | ||
Bug Blocks: |
Description
Jan Zmeskal
2020-08-07 09:11:58 UTC
Changing the severity to urgent as the workaround I indicated does not seem to work completely. While it restarts the bootstrap process, bootstrapping eventually gets stuck at this point:

```
Aug 07 09:44:42 <hostname> bootkube.sh[2388]: [#162] failed to create some manifests:
Aug 07 09:44:42 <hostname> bootkube.sh[2388]: "99_openshift-cluster-api_worker-machineset-0.yaml": unable to get REST mapping for "99_openshift-cluster-api_worker-machineset-0.yaml": no matches for kind "MachineSet" in version "machine.openshift.io/v1beta1"
Aug 07 09:44:42 <hostname> bootkube.sh[2388]: Created "99_openshift-cluster-api_worker-machineset-0.yaml" machinesets.v1beta1.machine.openshift.io/primary-lxgq4-worker-0 -n openshift-machine-api
Aug 07 09:44:42 <hostname> bootkube.sh[2388]: Updated status for "99_openshift-cluster-api_worker-machineset-0.yaml" machinesets.v1beta1.machine.openshift.io/primary-lxgq4-worker-0 -n openshift-machine-api
```

I tried it twice and got stuck here both times. Full journalctl output for bootkube.service and release-image.service is attached.

I couldn't reproduce this issue on the master release; the installation passes correctly. The release version that worked for me is 4.6.0-0.nightly-2020-08-07-202945. @jan, can you please repeat your test with the latest RHCOS template and release?

Hi Evgeny, I can reproduce the issue with 4.6.0-0.nightly-2020-08-09-151434.
It's also been reproduced by Roberto (https://coreos.slack.com/archives/C68TNFWA2/p1596802763312800) and Brendan (https://coreos.slack.com/archives/CNSJG0ZED/p1597053057237100).

I confirm that I had the same problem with the 2020-08-09 nightly.

We are also hitting this issue in 4.6 CI.

Looks like mine was actually 4.6.0-0.nightly-2020-08-07-202945 rather than 4.6.0-0.nightly-2020-08-09-151434. For reference:

```
./openshift-install 4.6.0-0.nightly-2020-08-07-202945
built from commit d36a3719da1ee43da5691d90ac51afc190d9b708
release image quay.io/openshift-release-dev/ocp-release-nightly@sha256:a64554cb6ff8a61d7509c9994716b1908f40426aee33681af98f170a20190688
```

According to the comment in the systemd unit file, the service should only run if that flag file exists: https://github.com/coreos/ignition-dracut/blob/master/systemd/ignition-firstboot-complete.service#L17-L19 It seems as though it ran again after the first boot, by which point the file was already gone?

```
[core@ocp4-lhcnb-worker-0-xx7cb ~]$ cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-779c2970065f2dd6eb8ec1f73e5e8863f4d8fa91be60790dee6af70450d85c2a/vmlinuz-4.18.0-211.el8.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 rd.luks.options=discard ostree=/ostree/boot.0/rhcos/779c2970065f2dd6eb8ec1f73e5e8863f4d8fa91be60790dee6af70450d85c2a/0 ignition.platform.id=openstack
[core@ocp4-lhcnb-worker-0-xx7cb ~]$ sudo journalctl -u ignition-firstboot-complete
-- Logs begin at Mon 2020-08-10 02:02:21 UTC, end at Mon 2020-08-10 12:57:32 UTC. --
Aug 10 02:03:07 localhost systemd[1]: Starting Mark boot complete...
Aug 10 02:03:08 localhost systemd[1]: Started Mark boot complete.
-- Reboot --
Aug 10 09:25:13 ocp4-lhcnb-worker-0-xx7cb systemd[1]: Starting Mark boot complete...
Aug 10 09:25:13 ocp4-lhcnb-worker-0-xx7cb sh[1391]: rm: cannot remove '/boot/ignition.firstboot': No such file or directory
Aug 10 09:25:13 ocp4-lhcnb-worker-0-xx7cb systemd[1]: ignition-firstboot-complete.service: Main process exited, code=exited, status=1/FAILURE
Aug 10 09:25:13 ocp4-lhcnb-worker-0-xx7cb systemd[1]: ignition-firstboot-complete.service: Failed with result 'exit-code'.
Aug 10 09:25:13 ocp4-lhcnb-worker-0-xx7cb systemd[1]: Failed to start Mark boot complete.
Aug 10 09:25:13 ocp4-lhcnb-worker-0-xx7cb systemd[1]: ignition-firstboot-complete.service: Consumed 28ms CPU time
```

I just rebooted again and it doesn't try to start again, so it is just something with that first boot.

I can confirm that when using `OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE` with a custom (latest RHCOS) template on our CI, the installation on 4.6 passes.

I can confirm that this does not reproduce with 4.6.0-0.nightly-2020-08-12-062953.

Seeing this with 4.5.8 without `OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE` specified on a vSphere IPI install attempt. See https://access.redhat.com/support/cases/#/case/02715714/discussion?attachmentId=a092K000025L2bZQAS

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
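The `status=1/FAILURE` in the journal comes down to the unit's `rm` exiting non-zero when the flag file is already gone. A minimal sketch of that behavior, using a temporary stand-in path rather than the real `/boot/ignition.firstboot` (this is illustrative, not the actual ignition-dracut script):

```shell
#!/bin/sh
# Stand-in for the firstboot flag file the service removes.
flag="$(mktemp -d)/ignition.firstboot"
touch "$flag"

# First boot: the file exists, so removal succeeds.
rm "$flag" && echo "first run: removed"

# A second run: the file is gone, so a plain 'rm' exits non-zero --
# the same exit-code failure the journal shows above.
rm "$flag" 2>/dev/null || echo "second run: rm failed with status $?"

# 'rm -f' ignores a missing operand and exits 0, so a unit using it
# would stay green even if it ran again.
rm -f "$flag" && echo "rm -f: still exit 0"
```

Note this only illustrates the symptom; the report's root question is why the service started a second time at all when its condition says it should only run while the flag file exists.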