Bug 1813863 - OCP 4.5 fails to get deployed on Vsphere UPI
Summary: OCP 4.5 fails to get deployed on Vsphere UPI
Keywords:
Status: CLOSED DUPLICATE of bug 1813026
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-03-16 10:36 UTC by Neha Berry
Modified: 2020-03-16 20:05 UTC (History)
5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-16 20:05:16 UTC
Target Upstream Version:
Embargoed:



Description Neha Berry 2020-03-16 10:36:14 UTC
1. basic functionality


Description of problem:
===============================

The OCP 4.5.0-0.nightly-2020-03-15-220309 install fails at 98% in a VMware vSphere UPI + RHCOS environment. The install was attempted multiple times, and every attempt failed with a similar "timed out" error.

Worker machines failed to come up (sometimes one, sometimes all). In vCenter all 3 worker VMs are powered on, but they never registered as OCP nodes.
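
Not captured in this run, but a routine first check when UPI workers stay out of the cluster is whether their kubelet CSRs are stuck pending, for example:

# oc get csr
# oc get nodes -o wide

If worker CSRs show as Pending they can be approved with "oc adm certificate approve <csr-name>"; that is only the usual triage step, not a confirmed cause for this bug.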




time="2020-03-16T07:28:26Z" level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, cluster-autoscaler, console, csi-snapshot-controller, ingress, kube-storage-version-migrator, machine-api, monitoring"
time="2020-03-16T07:32:11Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-03-15-220309: 98% complete"
[root@magna012 mar16-dc7]# 


E           subprocess.TimeoutExpired: Command '['/root/neha_vmware/ocs-ci/bin/openshift-install', 'wait-for', 'install-complete', '--dir', '/root/neha_vmware/mar16-dc7/', '--log-level', 'INFO']' timed out after 1800 seconds

/opt/rh/rh-python36/root/usr/lib64/python3.6/subprocess.py:871: TimeoutExpired

During handling of the above exception, another exception occurred:

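The 1800 second limit above is imposed by the ocs-ci test harness (the subprocess timeout), not by openshift-install itself. The same wait step can be retried by hand against the existing install directory, for example:

# /root/neha_vmware/ocs-ci/bin/openshift-install wait-for install-complete --dir /root/neha_vmware/mar16-dc7/ --log-level debug
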

Version-Release number of the following components:
===================================

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          3h14m   Unable to apply 4.5.0-0.nightly-2020-03-15-220309: some cluster operators have not yet rolled out




How reproducible:
====================
Always

Steps to Reproduce:
===========================

1. Use the openshift-install binary from a 4.5 nightly build to bring up a cluster. In this case, run-ci from ocs-ci was used:

run-ci  -m deployment  --cluster-name nberry-dc7-mar16 --cluster-path /root/neha_vmware/mar16-dc7/ --ocsci-conf conf/deployment/vsphere/upi_1az_rhcos_vmfs_3m_3w.yaml --ocsci-conf conf/ocsci/skip_ocs_deploy.yaml --ocsci-conf dc7_c1.yaml --deploy



2. Confirm whether the install succeeds (see the note below on collecting diagnostics from a failed attempt).

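If a reproduction fails again, cluster-side diagnostics can be collected from the partially installed cluster with a standard must-gather, using the kubeconfig written into the install directory (the destination path below is only an example):

# export KUBECONFIG=/root/neha_vmware/mar16-dc7/auth/kubeconfig
# oc adm must-gather --dest-dir=/tmp/must-gather-nberry-dc7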

Actual results:
====================
An OCP cluster with 3 master and 3 worker nodes does not come up.


Expected results:
===========================
An OCP cluster with the default 3 master and 3 worker nodes should come up.

Additional info:
==================

1. Automated CI runs against this nightly are also failing; it is unclear whether they fail for the same reason [1].


[1] - https://openshift-release.svc.ci.openshift.org/releasestream/4.5.0-0.nightly/release/4.5.0-0.nightly-2020-03-15-220309


2. The OCP cluster was brought up using ocs-ci.

3. Only the 3 master nodes came up as OCP nodes (worker-side checks are sketched after the listing below):

# oc get nodes -o wide
NAME              STATUS   ROLES    AGE     VERSION   INTERNAL-IP    EXTERNAL-IP    OS-IMAGE                                                        KERNEL-VERSION                CONTAINER-RUNTIME
control-plane-0   Ready    master   3h43m   v1.17.1   10.46.27.133   10.46.27.133   Red Hat Enterprise Linux CoreOS 45.81.202003131930-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el8
control-plane-1   Ready    master   3h44m   v1.17.1   10.46.27.135   10.46.27.135   Red Hat Enterprise Linux CoreOS 45.81.202003131930-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el8
control-plane-2   Ready    master   3h43m   v1.17.1   10.46.27.136   10.46.27.136   Red Hat Enterprise Linux CoreOS 45.81.202003131930-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el8
(venv) [root@magna012 mar16-dc7]#
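
Since the workers never registered, the kubelet journal on the worker VMs themselves is the next place to look; on RHCOS they are reachable over SSH as the core user with the key from the install config (the worker IP below is a placeholder):

# ssh core@<worker-ip> journalctl -b -u kubelet --no-pager | tail -n 100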

Comment 2 Abhinav Dahiya 2020-03-16 17:17:07 UTC
> Worker machines failed to come up (either 1 or all). As seen in vcenter, all 3 compute nodes for Worker are UP, but they failed to become OCP nodes




time="2020-03-16T07:28:26Z" level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, cluster-autoscaler, console, csi-snapshot-controller, ingress, kube-storage-version-migrator, machine-api, monitoring"
time="2020-03-16T07:32:11Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-03-15-220309: 98% complete"
[root@magna012 mar16-dc7]# 

The workers are failing to join the cluster, so the Node team should probably take a look.

Comment 3 Ryan Phillips 2020-03-16 18:49:48 UTC
Mar 16 07:09:48.887942 control-plane-0 crio[1276]: time="2020-03-16 07:09:48.887885286Z" level=error msg="Container creation error: time=\"2020-03-16T07:09:48Z\" level=error msg=\"container_linux.go:349: starting container process caused \\\"exec: \\\\\\\"/manager\\\\\\\": stat /manager: no such file or directory\\\"\"\ncontainer_linux.go:349: starting container process caused \"exec: \\\"/manager\\\": stat /manager: no such file or directory\"\n"
Mar 16 07:09:49.054036 control-plane-0 crio[1276]: time="2020-03-16 07:09:49.053982991Z" level=error msg="Container creation error: time=\"2020-03-16T07:09:49Z\" level=error msg=\"container_linux.go:349: starting container process caused \\\"exec: \\\\\\\"/machine-controller-manager\\\\\\\": stat /machine-controller-manager: no such file or directory\\\"\"\ncontainer_linux.go:349: starting container process caused \"exec: \\\"/machine-controller-manager\\\": stat /machine-controller-manager: no such file or directory\"\n"

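The "stat /manager: no such file or directory" errors above suggest the machine-api container image does not contain the expected binaries. Which image the machine-api pods are actually pulling can be listed with, for example:

# oc -n openshift-machine-api get pods
# oc -n openshift-machine-api get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'
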
Comment 4 Ryan Phillips 2020-03-16 20:05:16 UTC
Marking as a duplicate. The installer is using the wrong image:

    - image: docker.io/openshift/origin-machine-api-operator:v4.0.0
      imageID: ""
      lastState: {}
      name: controller-manager
      ready: false
      restartCount: 0
      started: false
      state:
        waiting:
          message: |
            container create failed: time="2020-03-16T08:06:47Z" level=error msg="container_linux.go:349: starting container process caused \"exec: \\\"/manager\\\": stat /manager: no such file or directory\""
            container_linux.go:349: starting container process caused "exec: \"/manager\": stat /manager: no such file or directory"

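For comparison, the machine-api-operator image that a given release payload actually expects can be printed with oc adm release info (the pullspec below is a placeholder for the nightly under test):

# oc adm release info <release-pullspec> --image-for=machine-api-operator
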
*** This bug has been marked as a duplicate of bug 1813026 ***

