Bug 1777036 - Recent builds booting to Emergency mode during install on AWS: e.g. 4.3.0-0.nightly-2019-11-21-122827 and 4.3.0-0.nightly-2019-11-22-050018
Summary: Recent builds booting to Emergency mode during install on AWS: e.g. 4.3.0-0....
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.3.0
Assignee: Colin Walters
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On: 1775728
Blocks:
 
Reported: 2019-11-26 19:31 UTC by Abhinav Dahiya
Modified: 2020-01-23 11:14 UTC
CC: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1775728
Environment:
Last Closed: 2020-01-23 11:14:35 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
  Github openshift installer pull 2724 (closed): Bug 1777036: rhcos: Bump to 43.81.201911221453.0 (last updated 2020-02-24 00:50:23 UTC)
  Red Hat Product Errata RHBA-2020:0062 (last updated 2020-01-23 11:14:46 UTC)

Description Abhinav Dahiya 2019-11-26 19:31:00 UTC
+++ This bug was initially created as a clone of Bug #1775728 +++

Description of problem: Running an OOTB openshift-install on AWS with recent builds such as 4.3.0-0.nightly-2019-11-21-122827 and 4.3.0-0.nightly-2019-11-22-050018.

Install fails and the bootstrap node refuses ssh connections. The screenshot of the instance console from AWS shows it is in emergency mode.
(Screenshot attached; system log will be attached.)
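
For reference, the same console screen can be pulled as text without the web UI. A minimal sketch using the AWS CLI, assuming the CLI is configured for the cluster's account and region and the bootstrap instance carries "bootstrap" in its Name tag (the tag filter and <instance-id> are placeholders, not taken from this report):

$ aws ec2 describe-instances \
    --filters "Name=tag:Name,Values=*bootstrap*" \
    --query "Reservations[].Instances[].InstanceId" --output text
$ aws ec2 get-console-output --instance-id <instance-id> --output text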


Version-Release number of selected component (if applicable): 4.3.0-0.nightly-2019-11-22-050018


How reproducible: Always. Not sure how these builds passed CI, but it happens every time.


Steps to Reproduce:
1. On AWS, run openshift-install create cluster and walk through a normal install (a minimal invocation is sketched below).
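
A minimal sketch of the reproduction, assuming an install-config.yaml has already been filled in for the AWS account (the directory name and flags are just one reasonable choice, not taken from this report):

$ openshift-install create install-config --dir=./aws-cluster
$ openshift-install create cluster --dir=./aws-cluster --log-level=debug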

--- Additional comment from Mike Fiedler on 2019-11-22 16:39:52 UTC ---



--- Additional comment from Mike Fiedler on 2019-11-22 16:41:08 UTC ---

[   58.569493] systemd-udevd[1074]: Process '/usr/bin/systemctl --no-block start coreos-luks-open@789cdd0a-07a5-485c-8373-4a2316680b6a' failed with exit code 4.

[   58.569783] multipathd[1091]: Nov 22 16:27:50 | /etc/multipath.conf does not exist, blacklisting all devices.

[   58.569809] multipathd[1091]: Nov 22 16:27:50 | You can run "/sbin/mpathconf --enable" to create

[   58.569823] multipathd[1091]: Nov 22 16:27:50 | /etc/multipath.conf. See man mpathconf(8) for more details
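
For anyone who lands in the emergency shell on the bootstrap node, a rough triage sequence, assuming the usual systemd/dracut tools are present in the initramfs (these commands are generic, not specific to this bug):

# systemctl list-units --failed
# systemctl status 'coreos-luks-open@*'
# journalctl -b --no-pager | tail -n 200
# lsblk -o NAME,LABEL,FSTYPE,MOUNTPOINT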

--- Additional comment from Mike Fiedler on 2019-11-22 16:41:42 UTC ---

4.3.0-0.nightly-2019-11-19-122017  is known to be OK

--- Additional comment from Colin Walters on 2019-11-22 20:40:32 UTC ---

Hmm.  I just sent up https://github.com/openshift/installer/pull/2714 which should help with this.

Comment 2 Xingxing Xia 2019-11-29 07:02:50 UTC
Agree, e.g. this installation https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Launch%20Environment%20Flexy/72806/console where it failed with:
...
level=debug msg="Still waiting for the Kubernetes API: Get https://...:6443/version?timeout=32s: dial tcp 52.78.20.8:6443: connect: connection refused"
level=debug msg="Still waiting for the Kubernetes API: Get https://...:6443/version?timeout=32s: dial tcp 52.78.20.8:6443: connect: connection refused"
level=error msg="Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get https://...:6443/apis/config.openshift.io/v1/clusteroperators: dial tcp ...:6443: connect: connection refused"
...
level=info msg="Pulling debug logs from the bootstrap machine"
level=error msg="Attempted to gather debug logs after installation failure:...
level=fatal msg="Bootstrap failed to complete: waiting for Kubernetes API: context deadline exceeded"
...

Checked the bootstrap VM's "System Log" in the AWS web console; it is the same as in the Description above:
...
[   64.912104] systemd[1]: Started Emergency Shell.
[   64.919157] systemd[1]: Reached target Emergency Mode.
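
The "connection refused" on port 6443 just reflects the bootstrap control plane never coming up. A quick manual probe of the endpoint the installer is polling (same /version path as in the log above; the hostname placeholders are assumptions, not taken from this report) might look like:

$ curl -k https://api.<cluster-name>.<base-domain>:6443/version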

Comment 3 chris.liles 2019-12-02 16:12:18 UTC
The cherry-pick PR to 4.3 is still waiting to be merged. - https://github.com/openshift/installer/pull/2724
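
The fix is the installer bumping its pinned RHCOS bootimage. A rough way to check which build a given installer branch pins, assuming the data/data/rhcos.json layout used at the time with a top-level "buildid" field (an assumption about that file's schema, not something stated in this report):

$ curl -s https://raw.githubusercontent.com/openshift/installer/release-4.3/data/data/rhcos.json | jq -r '.buildid'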

Comment 5 Michael Nguyen 2019-12-04 15:40:55 UTC
I installed 4.3.0-0.nightly-2019-12-04-054458 with no issues.  The version of RHCOS in the bump is present in the build.  If anyone else is still seeing this issue, please respond.  Otherwise I will close the BZ as verified.

$ oc get nodes
NAME                           STATUS   ROLES    AGE   VERSION
ip-10-0-138-104.ec2.internal   Ready    master   22m   v1.16.2
ip-10-0-141-21.ec2.internal    Ready    worker   15m   v1.16.2
ip-10-0-152-248.ec2.internal   Ready    worker   15m   v1.16.2
ip-10-0-158-38.ec2.internal    Ready    master   22m   v1.16.2
ip-10-0-163-90.ec2.internal    Ready    master   22m   v1.16.2
ip-10-0-173-72.ec2.internal    Ready    worker   14m   v1.16.2
$ oc debug node/ip-10-0-138-104.ec2.internal
Starting pod/ip-10-0-138-104ec2internal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:aa2271b7cdf177aa0368fdf854027a5c54d03b90f089701190b2533147d4469d
              CustomOrigin: Managed by machine-config-operator
                   Version: 43.81.201912040340.0 (2019-12-04T03:45:20Z)

  ostree://e884477421640d1285c07a6dd9aaf01c9e125038ebbe6290a5e341eb3695a4d1
                   Version: 43.81.201911221453.0 (2019-11-22T14:58:44Z)
sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2019-12-04-054458   True        False         5m18s   Cluster version is 4.3.0-0.nightly-2019-12-04-054458

Comment 6 Mike Fiedler 2019-12-04 18:31:24 UTC
This is working for me now.  I think we can mark VERIFIED.

Comment 8 errata-xmlrpc 2020-01-23 11:14:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

