+++ This bug was initially created as a clone of Bug #1775728 +++

Description of problem:

Running an OOTB openshift-install on AWS with recent builds such as 4.3.0-0.nightly-2019-11-21-122827 and 4.3.0-0.nightly-2019-11-22-050018. The install fails and the bootstrap node refuses ssh connections. The screenshot of the instance console from AWS shows it is in emergency mode. (Screenshot attached; system log will be attached.)

Version-Release number of selected component (if applicable): 4.3.0-0.nightly-2019-11-22-050018

How reproducible: Always - not sure how builds passed CI, but it always happens.

Steps to Reproduce:
1. On AWS, run `openshift-install create cluster` and walk through a normal install.

--- Additional comment from Mike Fiedler on 2019-11-22 16:39:52 UTC ---

--- Additional comment from Mike Fiedler on 2019-11-22 16:41:08 UTC ---

[   58.569493] systemd-udevd[1074]: Process '/usr/bin/systemctl --no-block start coreos-luks-open@789cdd0a-07a5-485c-8373-4a2316680b6a' failed with exit code 4.
[   58.569783] multipathd[1091]: Nov 22 16:27:50 | /etc/multipath.conf does not exist, blacklisting all devices.
[   58.569809] multipathd[1091]: Nov 22 16:27:50 | You can run "/sbin/mpathconf --enable" to create
[   58.569823] multipathd[1091]: Nov 22 16:27:50 | /etc/multipath.conf. See man mpathconf(8) for more details

--- Additional comment from Mike Fiedler on 2019-11-22 16:41:42 UTC ---

4.3.0-0.nightly-2019-11-19-122017 is known to be OK.

--- Additional comment from Colin Walters on 2019-11-22 20:40:32 UTC ---

Hmm. I just sent up https://github.com/openshift/installer/pull/2714 which should help with this.
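Since the failure above is only visible from the instance console (the node refuses ssh), one way to triage without the web console is to pull the console output and scan it for the emergency-mode markers quoted in the attached log. A minimal sketch, assuming the AWS CLI is available; the helper name `check_emergency` and the pattern list are illustrative, not part of any tooling mentioned in this bug:

```shell
#!/bin/sh
# Scan a saved instance console log for signs that the bootstrap node
# dropped into emergency mode. Patterns are taken from the messages
# quoted in this bug; the helper itself is a hypothetical convenience.
check_emergency() {
    grep -Eq 'Started Emergency Shell|Reached target Emergency Mode|coreos-luks-open@.* failed' "$1"
}

# Example usage (instance ID is a placeholder):
#   aws ec2 get-console-output --instance-id i-0abc123 --output text > console.log
#   check_emergency console.log && echo "bootstrap is in emergency mode"
```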
Agree, e.g. this installation https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Launch%20Environment%20Flexy/72806/console failed with:

...
level=debug msg="Still waiting for the Kubernetes API: Get https://...:6443/version?timeout=32s: dial tcp 52.78.20.8:6443: connect: connection refused"
level=debug msg="Still waiting for the Kubernetes API: Get https://...:6443/version?timeout=32s: dial tcp 52.78.20.8:6443: connect: connection refused"
level=error msg="Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get https://...:6443/apis/config.openshift.io/v1/clusteroperators: dial tcp ...:6443: connect: connection refused"
...
level=info msg="Pulling debug logs from the bootstrap machine"
level=error msg="Attempted to gather debug logs after installation failure:...
level=fatal msg="Bootstrap failed to complete: waiting for Kubernetes API: context deadline exceeded"
...

Checking the bootstrap VM's "System Log" in the AWS web console shows the same messages as in the Description above:

...
[   64.912104] systemd[1]: Started Emergency Shell.
[   64.919157] systemd[1]: Reached target Emergency Mode.
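Logs like the one above bury the root cause under many identical "Still waiting" retry lines. When triaging several of these runs, it can help to pull just the terminal level=fatal message out of the saved installer log. A minimal sketch; the `fatal_reason` helper is hypothetical, not part of the installer:

```shell
#!/bin/sh
# Print the message from the last level=fatal line of a saved installer
# log (e.g. .openshift_install.log). Helper name is illustrative.
fatal_reason() {
    sed -n 's/.*level=fatal msg="\([^"]*\)".*/\1/p' "$1" | tail -n 1
}

# Example usage:
#   fatal_reason .openshift_install.log
```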
The cherry-pick PR to 4.3 is still waiting to be merged: https://github.com/openshift/installer/pull/2724
I installed 4.3.0-0.nightly-2019-12-04-054458 with no issues. The version of RHCOS from the bump is present in the build. If anyone else is still seeing this issue, please respond; otherwise I will close the BZ as verified.

$ oc get nodes
NAME                           STATUS   ROLES    AGE   VERSION
ip-10-0-138-104.ec2.internal   Ready    master   22m   v1.16.2
ip-10-0-141-21.ec2.internal    Ready    worker   15m   v1.16.2
ip-10-0-152-248.ec2.internal   Ready    worker   15m   v1.16.2
ip-10-0-158-38.ec2.internal    Ready    master   22m   v1.16.2
ip-10-0-163-90.ec2.internal    Ready    master   22m   v1.16.2
ip-10-0-173-72.ec2.internal    Ready    worker   14m   v1.16.2

$ oc debug node/ip-10-0-138-104.ec2.internal
Starting pod/ip-10-0-138-104ec2internal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:aa2271b7cdf177aa0368fdf854027a5c54d03b90f089701190b2533147d4469d
              CustomOrigin: Managed by machine-config-operator
                   Version: 43.81.201912040340.0 (2019-12-04T03:45:20Z)

  ostree://e884477421640d1285c07a6dd9aaf01c9e125038ebbe6290a5e341eb3695a4d1
                   Version: 43.81.201911221453.0 (2019-11-22T14:58:44Z)
sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2019-12-04-054458   True        False         5m18s

Cluster version is 4.3.0-0.nightly-2019-12-04-054458
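For repeating the node check above in a script, the `oc get nodes` output can be scanned so that a single non-Ready node fails the verification. A minimal sketch, assuming the output has been saved to a file; the helper name `all_nodes_ready` is illustrative:

```shell
#!/bin/sh
# Exit 0 only if every node in saved `oc get nodes` output reports Ready.
# The helper name and file-based approach are assumptions for this sketch.
all_nodes_ready() {
    # Skip the header row; flag any STATUS column that is not exactly "Ready".
    awk 'NR > 1 && $2 != "Ready" { bad = 1 } END { exit bad }' "$1"
}

# Example usage:
#   oc get nodes > nodes.txt && all_nodes_ready nodes.txt && echo "all Ready"
```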
This is working for me now. I think we can mark VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062