Bug 2102662

Summary: Elevated Azure Install Fail Rate in CI: API i/o timeout
Product: OpenShift Container Platform
Reporter: Devan Goodwin <dgoodwin>
Component: Installer
Assignee: OCP Installer <ocp-installer>
Installer sub component: openshift-installer
QA Contact: MayXu <maxu>
Status: CLOSED DEFERRED
Docs Contact:
Severity: high
Priority: unspecified
CC: maxu, padillon, rdossant, sdodson
Version: 4.11
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-03-09 01:22:58 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Devan Goodwin 2022-06-30 12:18:36 UTC
We have noticed that the Azure install success rate is now 85% and dropping:

https://sippy.dptools.openshift.org/sippy-ng/tests/4.11?filters=%257B%2522items%2522%253A%255B%257B%2522columnField%2522%253A%2522name%2522%252C%2522operatorValue%2522%253A%2522equals%2522%252C%2522value%2522%253A%2522cluster%2520install.install%2520should%2520succeed%253A%2520overall%2522%257D%252C%257B%2522columnField%2522%253A%2522variants%2522%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522azure%2522%257D%255D%252C%2522linkOperator%2522%253A%2522and%2522%257D
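
The `filters` parameter in the link above is percent-encoded twice, which makes it hard to read. A minimal Python sketch for decoding it, using only the standard library (the variable names are illustrative and not part of Sippy):

```python
import json
from urllib.parse import urlsplit, parse_qs, unquote

# URL copied verbatim from this bug description.
SIPPY_URL = "https://sippy.dptools.openshift.org/sippy-ng/tests/4.11?filters=%257B%2522items%2522%253A%255B%257B%2522columnField%2522%253A%2522name%2522%252C%2522operatorValue%2522%253A%2522equals%2522%252C%2522value%2522%253A%2522cluster%2520install.install%2520should%2520succeed%253A%2520overall%2522%257D%252C%257B%2522columnField%2522%253A%2522variants%2522%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522azure%2522%257D%255D%252C%2522linkOperator%2522%253A%2522and%2522%257D"

# parse_qs undoes the first layer of percent-encoding; unquote undoes the second.
query = parse_qs(urlsplit(SIPPY_URL).query)
filters = json.loads(unquote(query["filters"][0]))

# Print each filter clause and how the clauses are combined.
for item in filters["items"]:
    print(item["columnField"], item["operatorValue"], repr(item["value"]))
print("combined with:", filters["linkOperator"])
```

Decoded, the filter selects the test named "cluster install.install should succeed: overall" on variants containing "azure".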

We've identified two problems in https://issues.redhat.com/browse/TRT-367; this bug tracks one of them.


https://search.ci.openshift.org/?search=level%3Derror+msg%3DAttempted+to+gather+ClusterOperator+status+after+installation+failure%3A+listing+ClusterOperator+objects.*i%2Fo+timeout&maxAge=168h&context=1&type=bug%2Bissue%2Bjunit&name=azure&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

This search currently matches 4.61% of all Azure 4.11 jobs.
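
To see exactly what the search link above matches, the query string can be decoded and the `search` value applied to log lines locally. This is a rough sketch and assumes the `search` field is interpreted as a regular expression, as the `.*` in it suggests. The sample log lines are hypothetical, written for illustration, and not copied from a real job.

```python
import re
from urllib.parse import urlsplit, parse_qs

# URL copied verbatim from this bug description.
SEARCH_URL = "https://search.ci.openshift.org/?search=level%3Derror+msg%3DAttempted+to+gather+ClusterOperator+status+after+installation+failure%3A+listing+ClusterOperator+objects.*i%2Fo+timeout&maxAge=168h&context=1&type=bug%2Bissue%2Bjunit&name=azure&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job"

# parse_qs decodes "+" to a space and percent-escapes to their characters.
# Parameters with empty values (excludeName here) are dropped by default.
params = {k: v[0] for k, v in parse_qs(urlsplit(SEARCH_URL).query).items()}
pattern = re.compile(params["search"])

# Hypothetical log lines (assumption): only the first has the failure signature.
sample_lines = [
    'level=error msg=Attempted to gather ClusterOperator status after '
    'installation failure: listing ClusterOperator objects: Get '
    '"https://api.example.invalid:6443/": dial tcp 203.0.113.10:6443: i/o timeout',
    'level=info msg=Waiting up to 40m0s for the cluster to initialize...',
]

matches = [line for line in sample_lines if pattern.search(line)]
print(params["search"])
print(len(matches), "of", len(sample_lines), "sample lines match")
```

The remaining parameters limit the search to jobs whose name contains "azure" over the past 168 hours, grouped by job.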



We see hits in past releases, so I don't think this is a product regression. However, it may block us from green-lighting 4.11, because the install success rate for all core platforms has historically been 95% or higher at time of release.

Unclear when this problem started.

Here's a specific sample prow job to rally around:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-azure-ovn/1542324425607614464

Comment 1 Patrick Dillon 2022-07-05 17:49:05 UTC
This may or may not be an installer issue. We will look through the gather bootstrap output to determine if this belongs to a different team.

Comment 4 Rafael Fonseca 2022-07-09 13:22:21 UTC
Attached are logs from a run where a similar problem was seen: the API stops responding after bootstrap finishes. To capture them, I made a small change in the installer code [1] and created a cluster from that PR using ci-chat-bot with OPENSHIFT_INSTALL_PRESERVE_BOOTSTRAP=1, so we could gather the logs even though the API was inaccessible and bootstrap had finished.

[1] https://github.com/openshift/installer/pull/6103

Comment 5 Shiftzilla 2023-03-09 01:22:58 UTC
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-9354