Bug 2024900

Summary: Operator upgrade kube-apiserver
Product: OpenShift Container Platform Reporter: Devan Goodwin <dgoodwin>
Component: Installer Assignee: Arda Guclu <aguclu>
Installer sub component: OpenShift on Bare Metal IPI QA Contact: Eldar Weiss <eweiss>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: aguclu, aos-bugs, eweiss, mfojtik, sanchezl, sippy, sttts, xxia
Version: 4.10 Keywords: Triaged
Target Milestone: ---   
Target Release: 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-03-10 16:29:41 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Devan Goodwin 2021-11-19 13:12:48 UTC
The test "Operator upgrade kube-apiserver" has begun failing frequently in CI, see:
https://sippy.ci.openshift.org/sippy-ng/tests/4.10/analysis?test=Operator%20upgrade%20kube-apiserver

Aggregated jobs on CI payloads appear to have caught a regression in this test: it historically passed 100% of the time and now fails 20-30% of the time.

A good sample prow job would be:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1461619400548290560

Test failure looks as follows:

Failed to upgrade kube-apiserver, operator was degraded (ValidatingAdmissionWebhookConfiguration_WebhookServiceConnectionError): ValidatingAdmissionWebhookConfigurationDegraded: vprovisioning.kb.io: dial tcp 172.30.203.253:443: connect: connection refused

It is often accompanied by:

operator conditions kube-apiserver 0s
Operator degraded (ValidatingAdmissionWebhookConfiguration_WebhookServiceConnectionError): ValidatingAdmissionWebhookConfigurationDegraded: vprovisioning.kb.io: dial tcp 172.30.203.253:443: connect: connection refused
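When triaging many runs, it can help to split these messages into their parts (reason, condition, webhook name, service address, underlying error). The following is a minimal sketch in Python, assuming the message always follows the shape quoted above; the function name `parse_degraded_message` and the regular expression are illustrative and are not taken from the operator or from the test suite.

```python
import re

# Pattern derived only from the two messages quoted in this report; other
# degraded messages may be shaped differently (assumption).
DEGRADED_RE = re.compile(
    r"\((?P<reason>\w+)\): "
    r"(?P<condition>\w+): "
    r"(?P<webhook>[\w.\-]+): "
    r"dial tcp (?P<address>[\d.]+:\d+): "
    r"(?P<error>.+)$"
)


def parse_degraded_message(message):
    """Return a dict of the message parts, or None if the shape differs."""
    match = DEGRADED_RE.search(message)
    return match.groupdict() if match else None


sample = (
    "Operator degraded "
    "(ValidatingAdmissionWebhookConfiguration_WebhookServiceConnectionError): "
    "ValidatingAdmissionWebhookConfigurationDegraded: vprovisioning.kb.io: "
    "dial tcp 172.30.203.253:443: connect: connection refused"
)
parts = parse_degraded_message(sample)
```

Grouping failures by the `webhook` and `error` fields would show whether every failing run points at the same webhook service.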

The problem appears to have begun last night, somewhere around this CI release:

https://amd64.ocp.releases.ci.openshift.org/releasestream/4.10.0-0.ci/release/4.10.0-0.ci-2021-11-19-045525

This payload did contain a kube apiserver operator change:

cluster-kube-apiserver-operator

    set kube-apiserver degraded=true if a webhook service is missing or down #1245
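The code for that change is not shown in this report. As a rough illustration of the kind of check its title describes, here is a minimal sketch in Python, assuming the check amounts to a TCP dial against the webhook's service address (the error text above suggests this, but the report does not confirm it). The function name `webhook_service_error`, its parameters, and the timeout are hypothetical. The actual operator is written in Go and may work differently.

```python
import socket


def webhook_service_error(name, host, port, timeout=2.0):
    """Try a TCP connection to a webhook service.

    Returns None when the connection succeeds, otherwise a message loosely
    modelled on the one in this report. Hypothetical helper, not operator code.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return None
    except OSError as err:
        return f"{name}: dial tcp {host}:{port}: {err}"


def is_degraded(webhooks):
    """Return (degraded, messages) for an iterable of (name, host, port)."""
    messages = [
        msg
        for name, host, port in webhooks
        if (msg := webhook_service_error(name, host, port)) is not None
    ]
    return bool(messages), messages
```

A check of this shape reports a failure whenever the service backing a webhook is briefly unreachable, which may be why it surfaces during an upgrade while pods are being restarted.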

Comment 3 Devan Goodwin 2021-11-19 13:43:16 UTC
The problem has likely been around for a while, but new checks went in last night which caught it. https://github.com/openshift/cluster-kube-apiserver-operator/pull/1256 has been opened to revert the new checks while a proper fix is pursued.

Reverting so we can get payloads flowing again. The checks look great; we just need to solve this before they can go in.

Comment 10 errata-xmlrpc 2022-03-10 16:29:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056