Bug 1804482

Summary: Unable to bring up cluster on vsphere env
Product: OpenShift Container Platform
Reporter: Anurag saxena <anusaxen>
Component: Installer
Assignee: Joseph Callen <jcallen>
Installer sub component: openshift-installer
QA Contact: Johnny Liu <jialiu>
Status: CLOSED DUPLICATE
Docs Contact:
Severity: high
Priority: high
CC: jiajliu, jialiu, mifiedle, rbrattai, weliang, wsun, zzhao
Version: 4.4
Keywords: Reopened, TestBlocker
Target Milestone: ---
Flags: anusaxen: needinfo-, anusaxen: needinfo-
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-02-24 13:49:36 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1798945
Bug Blocks:

Description Anurag saxena 2020-02-18 23:33:26 UTC
Description of problem: Not able to bring up SDN or OVN clusters on vSphere. I suspect an installer issue: the installer now waits only 20 minutes for the Kubernetes API to become ready, which might not be enough time compared with the 30 minutes allowed in earlier releases. Is this intentional in 4.4 or an error?

level=info msg="Waiting up to 20m0s for the Kubernetes API at https://api.qe-anusaxen-vs1.qe.devcluster.openshift.com:6443..."
level=error msg="Attempted to gather ClusterOperator status after wait failure: listing ClusterOperator objects: Get https://api.qe-anusaxen-vs1.qe.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators: dial tcp 139.178.76.10:6443: i/o timeout"
level=info msg="Use the following commands to gather logs from the cluster"
level=info msg="openshift-install gather bootstrap --help"
level=fatal msg="waiting for Kubernetes API: context deadline exceeded"
+ exit 3
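
For triage, output like the above can be scanned for its error and fatal entries. The following is a minimal sketch, assuming every installer line has the `level=... msg="..."` shape shown in this report. The function names are hypothetical and not part of openshift-install.

```python
import re

# Matches lines shaped like: level=info msg="some text"
# (format assumed from the installer output quoted in this bug)
LINE_RE = re.compile(r'level=(?P<level>\w+)\s+msg="(?P<msg>.*)"\s*$')


def parse_installer_line(line):
    """Return (level, msg) for a `level=... msg="..."` line, else None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return match.group("level"), match.group("msg")


def collect_problems(lines):
    """Return the (level, msg) pairs whose level is error or fatal."""
    problems = []
    for line in lines:
        parsed = parse_installer_line(line)
        if parsed is not None and parsed[0] in ("error", "fatal"):
            problems.append(parsed)
    return problems
```

Lines that do not match the pattern, such as the shell trace `+ exit 3`, are skipped rather than treated as errors.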

Version-Release number of the following components: 4.4.0-0.nightly-2020-02-18-200822

How reproducible: Always

Steps to Reproduce:
1. Bring up an OCP cluster on vSphere

Actual results:


Expected results: The cluster should come up fine on a vSphere environment

Additional info:

Comment 1 Abhinav Dahiya 2020-02-18 23:38:55 UTC
> I suspect an installer issue: the installer now waits only 20 minutes for the Kubernetes API to become ready, which might not be enough time compared with the 30 minutes allowed in earlier releases. Is this intentional in 4.4 or an error?

This is intentional in 4.4


> level=info msg="Use the following commands to gather logs from the cluster"
> level=info msg="openshift-install gather bootstrap --help"
> level=fatal msg="waiting for Kubernetes API: context deadline exceeded"

Debugging requires that you attach the log bundle as requested by the installer.
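
The wait discussed in this comment amounts to polling until a fixed deadline, and a shorter deadline only surfaces the failure sooner if the API never comes up. The following is a minimal sketch of that pattern, not the installer's actual code. `wait_for`, `probe`, and the 5-second interval are hypothetical names and values chosen for illustration.

```python
import time


def wait_for(probe, timeout_s, interval_s=5.0,
             clock=time.monotonic, sleep=time.sleep):
    """Call probe() until it returns True or timeout_s elapses.

    Returns True on success and False when the deadline would pass
    before the next attempt. clock and sleep are injectable so the
    loop can be exercised without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        # Give up if another interval would overrun the deadline.
        if clock() + interval_s > deadline:
            return False
        sleep(interval_s)
```

With `timeout_s=20 * 60` this mirrors the 20-minute limit from the log. A probe that never succeeds fails the same way with a 30-minute limit, only later.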

Comment 2 Anurag saxena 2020-02-19 14:00:38 UTC
Thanks for confirming. I will try to gather more logs on this.

Comment 3 Anurag saxena 2020-02-19 16:48:49 UTC
@Abhinav, I can share the bootstrap node IP with you to look at. Please ping me when you are in. Thanks

Comment 5 Anurag saxena 2020-02-19 18:47:58 UTC
Thanks @Joseph for referencing the PR. I will discuss this with the installer QE team to find out more.

Comment 6 liujia 2020-02-20 01:59:34 UTC
We hit this 2 days ago in QE's CI tests. The failure is caused by another known issue, https://bugzilla.redhat.com/show_bug.cgi?id=1804032. It is not an installer issue.

Comment 7 Anurag saxena 2020-02-20 13:54:26 UTC
Depends on Bug 1798945, as discussed with Jainlin/Jia from the installer team. It is an etcd operator issue.

Comment 8 Anurag saxena 2020-02-20 13:57:57 UTC
Joseph, should the target version be 4.4?

Comment 9 Joseph Callen 2020-02-20 14:49:00 UTC
Was the close a mistake?  What is the current status?

I have installed OCP 4.4 on vSphere with UPI without problems (after the etcd operator issue was resolved).
CI [0] has flakes (not installer related) and is passing at 50%.

[0] - https://prow.svc.ci.openshift.org/?job=*vsphere*4.4

Comment 10 Anurag saxena 2020-02-20 15:10:55 UTC
(Yep, the close was a mistake.)
Joseph, the root cause seems to be the broken RHCOS boot image (rhcos-44.81.202002071430-0), which is being tracked in Bug 1804032.
And if we use an older RHCOS boot image (such as rhcos-44.81.202001241431.0), then we hit another known issue: https://bugzilla.redhat.com/show_bug.cgi?id=1798945#c8

So this is apparently an etcd + RHCOS component issue, not an installer issue.

Comment 11 Joseph Callen 2020-02-21 21:20:27 UTC
Why do we have this BZ when the issue is with components other than the installer?
Is there an issue with vSphere UPI that I can help with? If not, this really should be closed.

Comment 12 Anurag saxena 2020-02-21 21:30:48 UTC
Joseph, yes, we can change this to "CLOSED DUPLICATE of 1798945"; that's what I am hitting.

Comment 13 Joseph Callen 2020-02-24 13:49:36 UTC

*** This bug has been marked as a duplicate of bug 1798945 ***