1967398 – authentication operator still uses previous deleted pod ip rather than the new created pod ip to do health check


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	apiserver-auth
Sub Component:
Version:	4.8
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Standa Laznicka
QA Contact:	pmali
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-06-03 06:02 UTC by liyao
Modified:	2021-07-27 23:11 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: A stale condition might have caused the authentication operator to appear degraded after upgrade even though there were no problems. Consequence: False-positive cluster degradation. Fix: Remove old and unused conditions from the operator's status. Result: The authentication operator should correctly report as "Degraded" only when there is an actual problem.
Clone Of:
Environment:
Last Closed:	2021-07-27 23:11:18 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-authentication-operator pull 449	0	None	open	Bug 1967398: operator: add OAuthServiceEndpointsCheckEndpointAccessibleControllerDegraded to stale conditions	2021-06-03 10:56:13 UTC
Red Hat Product Errata	RHSA-2021:2438	0	None	None	None	2021-07-27 23:11:29 UTC
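
The Doc Text and the linked pull request describe the fix as treating OAuthServiceEndpointsCheckEndpointAccessibleControllerDegraded as a stale condition and removing it from the operator's status. The Python sketch below only illustrates that idea and is not the operator's actual Go code. The function names, the dictionary layout of a condition, the "ExampleControllerDegraded" entry, and the rule "any condition whose type ends in Degraded with status True makes the operator degraded" are simplifying assumptions made for this example.

```python
# Condition types no longer owned by any controller (name taken from the PR title).
STALE_CONDITION_TYPES = {
    "OAuthServiceEndpointsCheckEndpointAccessibleControllerDegraded",
}


def drop_stale_conditions(conditions, stale_types=STALE_CONDITION_TYPES):
    """Return the conditions whose type is not listed as stale."""
    return [c for c in conditions if c["type"] not in stale_types]


def is_degraded(conditions):
    """Assumed aggregation rule: any '*Degraded' condition set to True degrades the operator."""
    return any(
        c["type"].endswith("Degraded") and c["status"] == "True"
        for c in conditions
    )


conditions = [
    {
        # Left over from before the upgrade; nothing updates it any more.
        "type": "OAuthServiceEndpointsCheckEndpointAccessibleControllerDegraded",
        "status": "True",
        "message": 'Get "https://10.129.0.17:6443/healthz": context canceled',
    },
    {
        # Hypothetical healthy condition, included only for contrast.
        "type": "ExampleControllerDegraded",
        "status": "False",
        "message": "",
    },
]

before = is_degraded(conditions)
after = is_degraded(drop_stale_conditions(conditions))
print(before, after)
```

With the stale entry present the aggregate is degraded. Once it is filtered out, only conditions that a live controller still maintains contribute to the result.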

Description liyao 2021-06-03 06:02:08 UTC

Description of problem:
The upgrade from 4.7.0-0.nightly-2021-05-17-040457 to 4.8.0-0.nightly-2021-05-19-092807 fails with the authentication operator degraded.


Version-Release number of selected component (if applicable):

4.7.0-0.nightly-2021-05-17-040457 to
4.8.0-0.nightly-2021-05-19-092807


How reproducible:
Not sure

Steps to Reproduce:
1.
2.
3.

Actual results:
Upgrade from 4.7.0-0.nightly-2021-05-17-040457 to 4.8.0-0.nightly-2021-05-19-092807 hangs with authentication degraded:
oc describe co authentication shows:
  Conditions:
    Last Transition Time:  2021-05-19T14:41:33Z
    Message:               OAuthServiceEndpointsCheckEndpointAccessibleControllerDegraded: Get "https://10.129.0.17:6443/healthz": context canceled
    Reason:                OAuthServiceEndpointsCheckEndpointAccessibleController_SyncError
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2021-05-19T14:42:48Z
    Message:               All is well
    Reason:                AsExpected
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2021-05-19T14:44:48Z
    Message:               All is well
    Reason:                AsExpected
    Status:                True
    Type:                  Available

According to the must-gather log, 10.129.0.17:6443 is the IP of a pod that belongs to openshift-authentication, but that pod was deleted at "May 19 14:40:35.293321", and new pods were created around "2021-05-19T14:40+" with new IPs 10.130.0.38, 10.128.0.53, and 10.129.0.56.

According to the upgrade CI log, the health check happens at '[2021-05-19T17:12:11.721Z]', more than 2 hours later, but it still uses the previous pod IP (10.129.0.17) rather than the new pod IPs (10.130.0.38, 10.128.0.53, 10.129.0.56). That is why the health check fails.
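
The comparison described above can be sketched in a few lines of Python: extract the IP from the Degraded message and check it against the IPs of the current pods. The function name and the regular expression are assumptions made for this illustration. The message and the IPs are the ones quoted in this report.

```python
import re

# Matches the IPv4 host of a URL such as https://10.129.0.17:6443/healthz
_URL_IP = re.compile(r"https?://(\d{1,3}(?:\.\d{1,3}){3})(?::\d+)?/")


def stale_endpoint_ips(message, current_pod_ips):
    """Return IPs mentioned in a condition message that no current pod owns."""
    current = set(current_pod_ips)
    return [ip for ip in _URL_IP.findall(message) if ip not in current]


message = (
    "OAuthServiceEndpointsCheckEndpointAccessibleControllerDegraded: "
    'Get "https://10.129.0.17:6443/healthz": context canceled'
)
current_pod_ips = ["10.130.0.38", "10.128.0.53", "10.129.0.56"]

stale = stale_endpoint_ips(message, current_pod_ips)
print(stale)
```

A non-empty result means the condition message refers to an endpoint that no longer exists, which matches the situation seen in this upgrade.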

must gather log link: http://file.rdu.redhat.com/~xxia/bug_1967398_must-gather.local.5095653185111688673.tar.gz

Expected results:
Upgrade from 4.7.0-0.nightly-2021-05-17-040457 to 4.8.0-0.nightly-2021-05-19-092807 succeeds.

Additional info:

matrix: 27_UPI on GCP with RHCOS && XPN

Comment 2 liyao 2021-06-08 05:56:29 UTC

Test upgrade from 4.7.0-0.nightly-2021-06-07-095830 to 4.8.0-0.nightly-2021-06-07-180258
$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-06-07-180258 --force=true --allow-explicit-upgrade=true

During the upgrade, I force-updated the OAuth configuration 5 times to redeploy new pods with new IPs. The original issue, where the upgrade hung on the old pod's IP, is gone.
$ oc edit oauth cluster

Check the cluster version after the upgrade finished
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-06-07-180258   True        False         123m    Cluster version is 4.8.0-0.nightly-2021-06-07-180258
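
For scripted verification, the table above can be checked mechanically. The sketch below is a minimal illustration that assumes the columns are separated by two or more spaces, as in the output shown. The function names are hypothetical and the parsing is not an official interface of oc.

```python
import re


def parse_table(text):
    """Split whitespace-aligned table text into one dict per data row."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    header = re.split(r"\s{2,}", lines[0])
    return [dict(zip(header, re.split(r"\s{2,}", line))) for line in lines[1:]]


def upgrade_completed(text, target_version):
    """True if a row reports the target version as available and not progressing."""
    return any(
        row.get("VERSION") == target_version
        and row.get("AVAILABLE") == "True"
        and row.get("PROGRESSING") == "False"
        for row in parse_table(text)
    )


output = """\
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-06-07-180258   True        False         123m    Cluster version is 4.8.0-0.nightly-2021-06-07-180258
"""

done = upgrade_completed(output, "4.8.0-0.nightly-2021-06-07-180258")
print(done)
```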

Comment 5 errata-xmlrpc 2021-07-27 23:11:18 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
