Created attachment 1638801 [details]
Console of hung instance
Description of problem: Running an out-of-the-box openshift-install on AWS with recent builds such as 4.3.0-0.nightly-2019-11-21-122827 and 4.3.0-0.nightly-2019-11-22-050018.
The install fails and the bootstrap node refuses ssh connections. The screenshot of the instance console from AWS shows that it is in emergency mode.
(Screenshot attached; the system log will be attached separately.)
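If it helps with triage, the same console text can also be pulled with the AWS CLI rather than the web console screenshot; the instance ID below is just a placeholder for the bootstrap node:

$ # placeholder instance ID; substitute the bootstrap node's actual ID
$ aws ec2 get-console-output --instance-id i-0123456789abcdef0 --output text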
Version-Release number of selected component (if applicable): 4.3.0-0.nightly-2019-11-22-050018
How reproducible: Always. Not sure how these builds passed CI, but it happens on every attempt.
Steps to Reproduce:
1. On AWS, run openshift-install create cluster and walk through a normal install (sketch below).
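For reference, this is just the stock installer flow; the directory name is arbitrary and nothing unusual is set in the install-config:

$ openshift-install create install-config --dir ./demo    # answer the prompts: ssh key, platform aws, region, base domain, cluster name, pull secret
$ openshift-install create cluster --dir ./demo --log-level debug
(the install then hangs waiting on the bootstrap node, and ssh to the bootstrap node is refused)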
Created attachment 1638804 [details]
[ 58.569493] systemd-udevd: Process '/usr/bin/systemctl --no-block start coreos-luks-open@789cdd0a-07a5-485c-8373-4a2316680b6a' failed with exit code 4.
[ 58.569783] multipathd: Nov 22 16:27:50 | /etc/multipath.conf does not exist, blacklisting all devices.
[ 58.569809] multipathd: Nov 22 16:27:50 | You can run "/sbin/mpathconf --enable" to create
[ 58.569823] multipathd: Nov 22 16:27:50 | /etc/multipath.conf. See man mpathconf(8) for more details
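For anyone else who lands in the emergency shell, a rough sketch of what can be run there to inspect the failure; the unit instance name is copied from the udev message above and will differ per node:

# systemctl status 'coreos-luks-open@789cdd0a-07a5-485c-8373-4a2316680b6a'
# journalctl -b -u systemd-udevd --no-pager | tail -n 50
# journalctl -b -p err --no-pager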
4.3.0-0.nightly-2019-11-19-122017 is known to be OK
Hmm. I just put up https://github.com/openshift/installer/pull/2714, which should help with this.
I went through a normal install on AWS with the 4.4 nightly below and hit no emergency shell. I checked the RHCOS version to make sure it had the bump from the PR. If anyone is still seeing issues, please reply; otherwise I will close this as verified.
$ oc get node
NAME                           STATUS   ROLES    AGE   VERSION
ip-10-0-128-119.ec2.internal   Ready    master   30m   v1.16.2
ip-10-0-132-51.ec2.internal    Ready    worker   19m   v1.16.2
ip-10-0-145-29.ec2.internal    Ready    worker   19m   v1.16.2
ip-10-0-145-74.ec2.internal    Ready    master   30m   v1.16.2
ip-10-0-164-54.ec2.internal    Ready    master   30m   v1.16.2
ip-10-0-170-213.ec2.internal   Ready    worker   18m   v1.16.2
$ oc debug node/ip-10-0-128-119.ec2.internal
Starting pod/ip-10-0-128-119ec2internal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# rpm-ostree status
CustomOrigin: Managed by machine-config-operator
Version: 43.81.201912040328.0 (2019-12-04T03:33:31Z)
Version: 43.81.201911221453.0 (2019-11-22T14:58:44Z)
Removing debug pod ...
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2019-12-04-104500   True        False         5m39s   Cluster version is 4.4.0-0.nightly-2019-12-04-104500
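For cross-checking, the machine-os build pinned in a release payload can also be read straight from the payload; the registry/pullspec below is illustrative (it follows the form used elsewhere in this bug) and the grep is approximate:

$ oc adm release info registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2019-12-04-104500 | grep machine-os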
*** Bug 1780120 has been marked as a duplicate of this bug. ***
I have not seen this on recent builds. Marking verified on registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-12-05-213858
Seeing this on bare metal with the latest metal image, rhcos-43.81.201912030353.0-metal.x86_64.raw.gz, available on the public mirror here:
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.