Created attachment 1638801 [details]
Console of hung instance
Description of problem: Running an out-of-the-box openshift-install on AWS with recent builds such as 4.3.0-0.nightly-2019-11-21-122827 and 4.3.0-0.nightly-2019-11-22-050018.
The install fails and the bootstrap node refuses ssh connections. The screenshot of the instance console from AWS shows that it is in emergency mode.
(Screenshot attached; the system log will be attached separately.)
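If it helps with triage, the same console text can also be pulled with the AWS CLI rather than the web console screenshot; the instance ID below is just a placeholder for the bootstrap node:

$ # placeholder instance ID; substitute the bootstrap node's actual ID
$ aws ec2 get-console-output --instance-id i-0123456789abcdef0 --output text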
Version-Release number of selected component (if applicable): 4.3.0-0.nightly-2019-11-22-050018
How reproducible: Always. Not sure how these builds passed CI, but it happens on every attempt.
Steps to Reproduce:
1. On AWS, run openshift-install create cluster and walk through a normal install (sketch below).
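For reference, this is just the stock installer flow; the directory name is arbitrary and nothing unusual is set in the install-config:

$ openshift-install create install-config --dir ./demo    # answer the prompts: ssh key, platform aws, region, base domain, cluster name, pull secret
$ openshift-install create cluster --dir ./demo --log-level debug
(the install then hangs waiting on the bootstrap node, and ssh to the bootstrap node is refused)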
Created attachment 1638804 [details]
[ 58.569493] systemd-udevd: Process '/usr/bin/systemctl --no-block start coreos-luks-open@789cdd0a-07a5-485c-8373-4a2316680b6a' failed with exit code 4.
[ 58.569783] multipathd: Nov 22 16:27:50 | /etc/multipath.conf does not exist, blacklisting all devices.
[ 58.569809] multipathd: Nov 22 16:27:50 | You can run "/sbin/mpathconf --enable" to create
[ 58.569823] multipathd: Nov 22 16:27:50 | /etc/multipath.conf. See man mpathconf(8) for more details
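For anyone else who lands in the emergency shell, a rough sketch of what can be run there to inspect the failure; the unit instance name is copied from the udev message above and will differ per node:

# systemctl status 'coreos-luks-open@789cdd0a-07a5-485c-8373-4a2316680b6a'
# journalctl -b -u systemd-udevd --no-pager | tail -n 50
# journalctl -b -p err --no-pager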
4.3.0-0.nightly-2019-11-19-122017 is known to be OK
Hmm. I just put up https://github.com/openshift/installer/pull/2714, which should help with this.
I went through a normal install on AWS with the 4.4 nightly below and hit no emergency shell. I checked the RHCOS version to make sure it had the bump from the PR. If anyone is still seeing issues, please reply; otherwise I will close this as verified.
$ oc get node
NAME                           STATUS   ROLES    AGE   VERSION
ip-10-0-128-119.ec2.internal   Ready    master   30m   v1.16.2
ip-10-0-132-51.ec2.internal    Ready    worker   19m   v1.16.2
ip-10-0-145-29.ec2.internal    Ready    worker   19m   v1.16.2
ip-10-0-145-74.ec2.internal    Ready    master   30m   v1.16.2
ip-10-0-164-54.ec2.internal    Ready    master   30m   v1.16.2
ip-10-0-170-213.ec2.internal   Ready    worker   18m   v1.16.2
$ oc debug node/ip-10-0-128-119.ec2.internal
Starting pod/ip-10-0-128-119ec2internal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# rpm-ostree status
CustomOrigin: Managed by machine-config-operator
Version: 43.81.201912040328.0 (2019-12-04T03:33:31Z)
Version: 43.81.201911221453.0 (2019-11-22T14:58:44Z)
Removing debug pod ...
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2019-12-04-104500   True        False         5m39s   Cluster version is 4.4.0-0.nightly-2019-12-04-104500
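For cross-checking, the machine-os build pinned in a release payload can also be read straight from the payload; the registry/pullspec below is illustrative (it follows the form used elsewhere in this bug) and the grep is approximate:

$ oc adm release info registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2019-12-04-104500 | grep machine-os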
*** Bug 1780120 has been marked as a duplicate of this bug. ***
I have not seen this on recent builds. Marking verified on registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-12-05-213858
Seeing this on bare metal with the latest metal image, rhcos-43.81.201912030353.0-metal.x86_64.raw.gz, available on the public mirror here:
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.