Bug 2102662 - Elevated Azure Install Fail Rate in CI: API i/o timeout
Summary: Elevated Azure Install Fail Rate in CI: API i/o timeout
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.11
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ---
: ---
Assignee: OCP Installer
QA Contact: MayXu
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2022-06-30 12:18 UTC by Devan Goodwin
Modified: 2023-03-09 01:22 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-03-09 01:22:58 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description Devan Goodwin 2022-06-30 12:18:36 UTC
We have noticed that Azure install rate is now 85% and dropping:

https://sippy.dptools.openshift.org/sippy-ng/tests/4.11?filters=%257B%2522items%2522%253A%255B%257B%2522columnField%2522%253A%2522name%2522%252C%2522operatorValue%2522%253A%2522equals%2522%252C%2522value%2522%253A%2522cluster%2520install.install%2520should%2520succeed%253A%2520overall%2522%257D%252C%257B%2522columnField%2522%253A%2522variants%2522%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522azure%2522%257D%255D%252C%2522linkOperator%2522%253A%2522and%2522%257D

We've identified two problems in https://issues.redhat.com/browse/TRT-367, this bug represents one of them.


https://search.ci.openshift.org/?search=level%3Derror+msg%3DAttempted+to+gather+ClusterOperator+status+after+installation+failure%3A+listing+ClusterOperator+objects.*i%2Fo+timeout&maxAge=168h&context=1&type=bug%2Bissue%2Bjunit&name=azure&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Currently showing 4.61% of all azure 4.11 jobs. 



We see hits in past releases and as such I don't think this is a product regression, however it may be blocking us from green-lighting 4.11 as the install % for all core platforms is historically 95%+ at time of release.

Unclear when this problem started.

Here's a specific sample prow job to rally around:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-azure-ovn/1542324425607614464

Comment 1 Patrick Dillon 2022-07-05 17:49:05 UTC
This may or may not be an installer issue. We will look through the gather bootstrap to determine if this belongs to a different team.

Comment 4 Rafael Fonseca 2022-07-09 13:22:21 UTC
Attached logs for a run where a similar problem was seen: the API stops responding after the bootstrap is finished. To that end, I had to make a small change in the installer code [1] and I created a cluster from that PR using ci-chat-bot and OPENSHIFT_INSTALL_PRESERVE_BOOTSTRAP=1 so we could gather the logs even though the API is inaccessible and the bootstrap was finished. 

[1] https://github.com/openshift/installer/pull/6103

Comment 5 Shiftzilla 2023-03-09 01:22:58 UTC
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-9354


Note You need to log in before you can comment on or make changes to this bug.