1807106 – Wasn't able to finish openshift-install wait-for install-complete in 30 mins timeout - Cluster operator authentication is still updating

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	4.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	low
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Urvashi Mohnani
QA Contact:	Sunil Choudhary
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-02-25 15:39 UTC by Petr Balogh
Modified:	2020-05-13 22:41 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-05-13 22:41:34 UTC
Target Upstream Version:
Embargoed:

Description Petr Balogh 2020-02-25 15:39:44 UTC

Description of problem:
During a UPI deployment on VMware, the following command did not finish:
/home/jenkins/bin/openshift-install wait-for install-complete --dir /home/jenkins/current-cluster-dir/openshift-cluster-dir --log-level INFO

Our jenkins job:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/4860/console

We run this command with a 30-minute timeout. It was interrupted by our run_cmd function because it did not finish within that time. I think 30 minutes is a reasonable timeout, as it is also used in the IPI installer, IIRC.

Should we wait longer, or is 30 minutes enough? If you can see more details in the logs, please help us understand what is wrong so that we can stabilize our CI deployment on VMware. Thanks
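
For illustration, a minimal sketch of a timeout wrapper of the kind described above. This is not the actual run_cmd from the CI job: the signature and the returned tuple are assumptions, and only Python's standard subprocess module is used.

```python
import subprocess


def run_cmd(cmd, timeout_seconds):
    """Run cmd (a list of arguments) and stop it after timeout_seconds.

    Hypothetical helper, not the real CI function.
    Returns (finished, returncode, output); returncode is None on timeout.
    """
    try:
        completed = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into the captured output
            timeout=timeout_seconds,   # the child is killed when this expires
            text=True,
        )
    except subprocess.TimeoutExpired as exc:
        # Partial output may be None or bytes, depending on the platform.
        output = exc.output or ""
        if isinstance(output, bytes):
            output = output.decode(errors="replace")
        return False, None, output
    return True, completed.returncode, completed.stdout


# Example call (assumes openshift-install is on PATH and the directory exists):
# run_cmd(
#     ["openshift-install", "wait-for", "install-complete",
#      "--dir", "/home/jenkins/current-cluster-dir/openshift-cluster-dir",
#      "--log-level", "INFO"],
#     timeout_seconds=30 * 60,
# )
```

Returning the partial output on timeout keeps whatever the installer logged before it was interrupted, which is usually the most useful part for triage.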

Installer logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-vu1cs33-t1/jnk-vu1cs33-t1_20200222T074822/logs/openshift_install_create_cluster_1582362132.log

Must gather logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-vu1cs33-t1/jnk-vu1cs33-t1_20200222T074822/logs/failed_testcase_ocs_logs_1582358295/deployment_ocs_logs/ocp_must_gather/quay-io-openshift-origin-must-gather-sha256-ee4eae4c297a6f0c80de95d12266c61f7348349a3e72d909a294644e8371e3aa/

Version-Release number of selected component (if applicable):
4.3.1 GA version

How reproducible:


Steps to Reproduce:
1. Install UPI on VMware
2. Then wait for install-complete with the command mentioned above


Actual results:
The cluster was not ready within 30 minutes.

Expected results:
The installation completes successfully.

Comment 1 Abhinav Dahiya 2020-02-25 17:52:50 UTC

Moving to Auth as the operator is still progressing.

And as for the timeout, the current expectation is that things should finish under 30 minutes.
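
A rough sketch of how to list the operators that are still unsettled when the timeout hits. It assumes the input is the JSON printed by `oc get clusteroperators -o json` and checks the standard Available, Progressing, and Degraded conditions. The function name is hypothetical and not part of any tool.

```python
import json


def unsettled_operators(clusteroperators_json):
    """Map operator name -> list of reasons it is not settled.

    Expects the JSON text of a ClusterOperator list. An operator counts as
    settled when Available is "True" and neither Progressing nor Degraded is.
    """
    data = json.loads(clusteroperators_json)
    unsettled = {}
    for item in data.get("items", []):
        name = item["metadata"]["name"]
        conditions = {
            cond["type"]: cond["status"]
            for cond in item.get("status", {}).get("conditions", [])
        }
        problems = []
        if conditions.get("Available") != "True":
            problems.append("not Available")
        if conditions.get("Progressing") == "True":
            problems.append("Progressing")
        if conditions.get("Degraded") == "True":
            problems.append("Degraded")
        if problems:
            unsettled[name] = problems
    return unsettled
```

Running this against the cluster state captured at the moment of the timeout shows which operators, beyond authentication, had not finished rolling out.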

Comment 4 Urvashi Mohnani 2020-03-05 20:30:23 UTC

Hi Petr, did this work with 4.2? I know some changes went into cri-o 1.16 (OCP 4.3) that led to permission denied issues.

Comment 5 Petr Balogh 2020-03-06 09:36:26 UTC

Hello Urvashi, 

I do not remember this issue in 4.2, and I have seen it only once so far. If we reproduce it, I will update here for sure, but I hope the logs I attached from the first occurrence are enough.

Petr

Comment 6 Urvashi Mohnani 2020-05-13 22:41:34 UTC

I looked through the attached logs and there is not much info in them. Since this happened only once, and on only 1 of the 3 nodes, it is very likely that this was a flake. Petr, please re-open if you run into this again.
