1. basic functionality

Description of problem:
===============================
OCP 4.5.0-0.nightly-2020-03-15-220309 install fails at 98% in a VMware vSphere UPI + RHCOS environment. The install was tried multiple times, but every attempt failed with a similar "timed out" error. Worker machines failed to come up (either one or all of them). As seen in vCenter, all 3 compute (worker) VMs are up, but they failed to become OCP nodes.

time="2020-03-16T07:28:26Z" level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, cluster-autoscaler, console, csi-snapshot-controller, ingress, kube-storage-version-migrator, machine-api, monitoring"
time="2020-03-16T07:32:11Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-03-15-220309: 98% complete"

[root@magna012 mar16-dc7]#
E       subprocess.TimeoutExpired: Command '['/root/neha_vmware/ocs-ci/bin/openshift-install', 'wait-for', 'install-complete', '--dir', '/root/neha_vmware/mar16-dc7/', '--log-level', 'INFO']' timed out after 1800 seconds
/opt/rh/rh-python36/root/usr/lib64/python3.6/subprocess.py:871: TimeoutExpired

During handling of the above exception, another exception occurred:

Version-Release number of the following components:
===================================
# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          3h14m   Unable to apply 4.5.0-0.nightly-2020-03-15-220309: some cluster operators have not yet rolled out

How reproducible:
====================
Always

Steps to Reproduce:
===========================
1. Use the openshift-installer from a 4.5 nightly build to bring up a cluster. run-ci from ocs-ci was used:
   run-ci -m deployment --cluster-name nberry-dc7-mar16 --cluster-path /root/neha_vmware/mar16-dc7/ --ocsci-conf conf/deployment/vsphere/upi_1az_rhcos_vmfs_3m_3w.yaml --ocsci-conf conf/ocsci/skip_ocs_deploy.yaml --ocsci-conf dc7_c1.yaml --deploy
2. Confirm whether the install succeeds.

Actual results:
====================
An OCP cluster with 3 master and 3 worker nodes does not come up.

Expected results:
===========================
OCP with the default 3 master and 3 worker nodes should come up.

Additional info:
==================
1. Automated CI runs are also failing; it is not clear whether they fail for the same reason - [1]
   [1] - https://openshift-release.svc.ci.openshift.org/releasestream/4.5.0-0.nightly/release/4.5.0-0.nightly-2020-03-15-220309
2. The OCP cluster was brought up using ocs-ci.
3. Only the 3 master nodes came up as OCP nodes:

# oc get nodes -o wide
NAME              STATUS   ROLES    AGE     VERSION   INTERNAL-IP    EXTERNAL-IP    OS-IMAGE                                                        KERNEL-VERSION                CONTAINER-RUNTIME
control-plane-0   Ready    master   3h43m   v1.17.1   10.46.27.133   10.46.27.133   Red Hat Enterprise Linux CoreOS 45.81.202003131930-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el8
control-plane-1   Ready    master   3h44m   v1.17.1   10.46.27.135   10.46.27.135   Red Hat Enterprise Linux CoreOS 45.81.202003131930-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el8
control-plane-2   Ready    master   3h43m   v1.17.1   10.46.27.136   10.46.27.136   Red Hat Enterprise Linux CoreOS 45.81.202003131930-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el8
(venv) [root@magna012 mar16-dc7]#
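For reference, one way to recheck where the install is stuck after the 1800-second timeout is to re-run the installer wait against the same install directory and then look at the cluster operators and nodes. This is a generic sketch using standard openshift-install/oc commands; the paths are the ones from this report, and none of the output below was captured for this bug:

# re-check install progress from the existing install dir
./openshift-install wait-for install-complete --dir /root/neha_vmware/mar16-dc7/ --log-level debug

# inspect which operators are still rolling out and which nodes registered
export KUBECONFIG=/root/neha_vmware/mar16-dc7/auth/kubeconfig   # kubeconfig written by the installer
oc get clusterversion
oc get clusteroperators
oc get nodes -o wide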
> Worker machines failed to come up (either 1 or all). As seen in vcenter, all 3 compute nodes for Worker are UP, but they failed to become OCP nodes
> time="2020-03-16T07:28:26Z" level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, cluster-autoscaler, console, csi-snapshot-controller, ingress, kube-storage-version-migrator, machine-api, monitoring"
> time="2020-03-16T07:32:11Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-03-15-220309: 98% complete"
> [root@magna012 mar16-dc7]#

The workers are failing to join the cluster, so the node team should probably take a look at this.
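One common reason for UPI workers showing up in vCenter but never becoming OCP nodes is node CSRs stuck in Pending, so that is worth ruling out before handing this over. This is a hedged suggestion only, not a confirmed cause for this bug; <csr-name> is a placeholder:

# list certificate signing requests; worker bootstrap/serving CSRs show up here
oc get csr

# if worker CSRs are Pending, approve them so the nodes can register
oc adm certificate approve <csr-name>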
Mar 16 07:09:48.887942 control-plane-0 crio[1276]: time="2020-03-16 07:09:48.887885286Z" level=error msg="Container creation error: time=\"2020-03-16T07:09:48Z\" level=error msg=\"container_linux.go:349: starting container process caused \\\"exec: \\\\\\\"/manager\\\\\\\": stat /manager: no such file or directory\\\"\"\ncontainer_linux.go:349: starting container process caused \"exec: \\\"/manager\\\": stat /manager: no such file or directory\"\n"

Mar 16 07:09:49.054036 control-plane-0 crio[1276]: time="2020-03-16 07:09:49.053982991Z" level=error msg="Container creation error: time=\"2020-03-16T07:09:49Z\" level=error msg=\"container_linux.go:349: starting container process caused \\\"exec: \\\\\\\"/machine-controller-manager\\\\\\\": stat /machine-controller-manager: no such file or directory\\\"\"\ncontainer_linux.go:349: starting container process caused \"exec: \\\"/machine-controller-manager\\\": stat /machine-controller-manager: no such file or directory\"\n"
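The errors above indicate containers in the machine-api pods failing to start because the expected binary is not present in the image. A quick way to see which pods are affected and which image they were pulled from (a sketch using standard oc commands, not output collected for this bug):

# pods and images in the machine-api namespace
oc -n openshift-machine-api get pods
oc -n openshift-machine-api get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'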
Marking as a duplicate. The installer is using the wrong image:

    - image: docker.io/openshift/origin-machine-api-operator:v4.0.0
      imageID: ""
      lastState: {}
      name: controller-manager
      ready: false
      restartCount: 0
      started: false
      state:
        waiting:
          message: |
            container create failed: time="2020-03-16T08:06:47Z" level=error msg="container_linux.go:349: starting container process caused \"exec: \\\"/manager\\\": stat /manager: no such file or directory\""
            container_linux.go:349: starting container process caused "exec: \"/manager\": stat /manager: no such file or directory"

*** This bug has been marked as a duplicate of bug 1813026 ***
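For completeness, the image mismatch can be confirmed by comparing the machine-api-operator image pinned in the release payload with what the cluster is actually running. This is a sketch using standard oc commands; nothing here was run against this cluster:

# image the release payload expects for the machine-api operator
oc adm release info "$(oc get clusterversion version -o jsonpath='{.status.desired.image}')" --image-for=machine-api-operator

# images the machine-api deployments are actually running
oc -n openshift-machine-api get deployments -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[*].image}{"\n"}{end}'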