Bug 1775728

Summary: Recent builds booting to Emergency mode during install on AWS: e.g. 4.3.0-0.nightly-2019-11-21-122827 and 4.3.0-0.nightly-2019-11-22-050018
Product: OpenShift Container Platform
Reporter: Mike Fiedler <mifiedle>
Component: RHCOS
Assignee: Colin Walters <walters>
Status: CLOSED ERRATA
QA Contact: Michael Nguyen <mnguyen>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 4.3.0
CC: adahiya, bbreard, behoward, dcain, dustymabe, eslutsky, imcleod, jligon, nstielau
Target Milestone: ---
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1777036 (view as bug list)
Environment:
Last Closed: 2020-05-04 11:16:20 UTC
Type: Bug
Bug Depends On:    
Bug Blocks: 1777036    
Attachments:
  Console of hung instance (flags: none)
  System log (flags: none)

Description Mike Fiedler 2019-11-22 16:38:55 UTC
Created attachment 1638801 [details]
Console of hung instance

Description of problem: Running an out-of-the-box openshift-install on AWS with recent builds such as 4.3.0-0.nightly-2019-11-21-122827 and 4.3.0-0.nightly-2019-11-22-050018.

The install fails and the bootstrap node refuses ssh connections. The screenshot of the instance console from AWS shows it is in emergency mode.
(Screenshot attached; system log will be attached.)
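For anyone reproducing this, the console text can also be pulled without the EC2 web console. A minimal sketch, assuming the AWS CLI is configured for the cluster's account and region, that <instance-id> is the bootstrap instance, and that the Name-tag filter matches how the installer tags the bootstrap node:

$ aws ec2 describe-instances --filters "Name=tag:Name,Values=*bootstrap*" \
    --query 'Reservations[].Instances[].InstanceId' --output text        # find the bootstrap instance ID
$ aws ec2 get-console-output --instance-id <instance-id> --output text   # dump the serial console log (may need base64 decoding on older CLI versions)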


Version-Release number of selected component (if applicable): 4.3.0-0.nightly-2019-11-22-050018


How reproducible: Always. Not sure how these builds passed CI, but the failure happens on every install attempt.


Steps to Reproduce:
1. On AWS, run openshift-install create cluster and walk through a normal install (see the sketch below).
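A minimal sketch of the reproduction, assuming <assets-dir> already contains an install-config.yaml for an AWS IPI install and that the nightly release image is selected via the OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE environment variable (how CI selects the image may differ):

$ export OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-11-22-050018
$ openshift-install create cluster --dir <assets-dir> --log-level debug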

Comment 1 Mike Fiedler 2019-11-22 16:39:52 UTC
Created attachment 1638804 [details]
System log

Comment 2 Mike Fiedler 2019-11-22 16:41:08 UTC
[   58.569493] systemd-udevd[1074]: Process '/usr/bin/systemctl --no-block start coreos-luks-open@789cdd0a-07a5-485c-8373-4a2316680b6a' failed with exit code 4.

[   58.569783] multipathd[1091]: Nov 22 16:27:50 | /etc/multipath.conf does not exist, blacklisting all devices.

[   58.569809] multipathd[1091]: Nov 22 16:27:50 | You can run "/sbin/mpathconf --enable" to create

[   58.569823] multipathd[1091]: Nov 22 16:27:50 | /etc/multipath.conf. See man mpathconf(8) for more details
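For anyone who lands in the same state, a minimal sketch of triage commands to run from the emergency shell on the instance console; the unit name is taken from the log above and exact output will vary:

# collect boot-time messages related to the failing unit
journalctl -b --no-pager | grep -i -e luks -e ignition
# state of the LUKS-open instance unit named in the udev error
systemctl status 'coreos-luks-open@*' --no-pager -l
# check whether the expected root/boot devices were found at all
lsblk -o NAME,FSTYPE,LABEL,MOUNTPOINT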

Comment 3 Mike Fiedler 2019-11-22 16:41:42 UTC
4.3.0-0.nightly-2019-11-19-122017  is known to be OK

Comment 4 Colin Walters 2019-11-22 20:40:32 UTC
Hmm.  I just sent up https://github.com/openshift/installer/pull/2714 which should help with this.
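For context, a quick way to see which RHCOS bootimage a given installer commit pins; a sketch only, and the data/data/rhcos.json path and layout are assumptions based on this era of the installer repo:

$ git clone https://github.com/openshift/installer && cd installer
$ git log --oneline -3 -- data/data/rhcos.json     # recent bootimage bumps
$ jq '.amis["us-east-1"]' data/data/rhcos.json     # AMI currently pinned for us-east-1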

Comment 6 Michael Nguyen 2019-12-04 16:37:58 UTC
I went through a normal install on AWS with the 4.4 nightly below and did not hit the emergency shell. I checked the RHCOS version to make sure it included the bump from the PR. If anyone is still seeing issues, please reply; otherwise I will close this as verified.

$ oc get node
NAME                           STATUS   ROLES    AGE   VERSION
ip-10-0-128-119.ec2.internal   Ready    master   30m   v1.16.2
ip-10-0-132-51.ec2.internal    Ready    worker   19m   v1.16.2
ip-10-0-145-29.ec2.internal    Ready    worker   19m   v1.16.2
ip-10-0-145-74.ec2.internal    Ready    master   30m   v1.16.2
ip-10-0-164-54.ec2.internal    Ready    master   30m   v1.16.2
ip-10-0-170-213.ec2.internal   Ready    worker   18m   v1.16.2
$ oc debug node/ip-10-0-128-119.ec2.internal
Starting pod/ip-10-0-128-119ec2internal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0a8502a25bd2c039a8a3dcfb0396688f0801b96a77daec774172f4622fa792b7
              CustomOrigin: Managed by machine-config-operator
                   Version: 43.81.201912040328.0 (2019-12-04T03:33:31Z)

  ostree://e884477421640d1285c07a6dd9aaf01c9e125038ebbe6290a5e341eb3695a4d1
                   Version: 43.81.201911221453.0 (2019-11-22T14:58:44Z)
sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2019-12-04-104500   True        False         5m39s   Cluster version is 4.4.0-0.nightly-2019-12-04-104500

Comment 7 Evgeny Slutsky 2019-12-05 16:13:39 UTC
*** Bug 1780120 has been marked as a duplicate of this bug. ***

Comment 8 Mike Fiedler 2019-12-06 01:08:50 UTC
I have not seen this on recent builds. Marking verified on registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-12-05-213858

Comment 9 Dave Cain 2020-01-07 02:12:45 UTC
Seeing this on bare metal with the latest metal image, rhcos-43.81.201912030353.0-metal.x86_64.raw.gz, available on the public mirror here:
https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/4.3/latest/

Comment 12 errata-xmlrpc 2020-05-04 11:16:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581