Description of problem:

Upgrading 4.5.9 to 4.6.0-0.nightly-2020-09-13-023938 in QE CI failed with the authentication-operator failing to upgrade. Will attach the must-gather location in a private message.

Cluster profile: 12_Disconnected UPI on vSphere with RHCOS (FIPS off)

Status (captured 2020-09-14T07:28:06.960Z):
  Conditions:
    Last Transition Time:  2020-09-14T06:09:55Z
    Message:               OAuthServiceCheckEndpointAccessibleControllerDegraded: Get "https://172.30.29.85:443/healthz": dial tcp 172.30.29.85:443: connect: connection refused
    Reason:                OAuthServiceCheckEndpointAccessibleController_SyncError
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-09-14T06:12:09Z
    Reason:                AsExpected
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2020-09-14T06:07:55Z
    Message:               OAuthServiceCheckEndpointAccessibleControllerAvailable: Get "https://172.30.29.85:443/healthz": dial tcp 172.30.29.85:443: connect: connection refused
    Reason:                OAuthServiceCheckEndpointAccessibleController_EndpointUnavailable
    Status:                False
    Type:                  Available
    Last Transition Time:  2020-09-14T02:58:40Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable

Version-Release number of selected component (if applicable):
4.5.9 to 4.6.0-0.nightly-2020-09-13-023938
must-gather collection failed, cluster no longer available
Given that:

- the /healthz endpoint of both openshift-authentication/oauth-openshift pods was reachable by the authentication operator directly via the pods' IPs
- the /healthz endpoint of both pods was not reachable by the authentication operator via the openshift-authentication/oauth-openshift route or service
- the route was admitted, but neither it nor the service could be reached from the authentication operator

I assume that:

1. the authentication operator was not able to reach the oauth-openshift pods' IPs (10.130.0.50 and 10.129.0.48) via the oauth-openshift service's cluster IP (172.30.6.25)
2. either the route relies on the service's cluster IP and was broken by the cluster IP not delivering traffic to the pod IPs, or the pod IPs were not accessible from the router pods

Note that the logs of the router pods are of insufficient verbosity to indicate connectivity issues with either the cluster IP or the pod IPs.

Given the apparent issue with the oauth-openshift service, I'm reassigning to the SDN team for further investigation. A sketch of the kind of connectivity check behind this reasoning follows at the end of this comment.

-------------

Insights gleaned from the must-gather linked to by #c5:

- Pod openshift-authentication/oauth-openshift-9b9f596dd-hqzg6 is reported as ready as of 2020-10-16T18:55:34Z at IP 10.130.0.50 on node 10.0.221.148
  - No errors in the logs
- Pod openshift-authentication/oauth-openshift-9b9f596dd-nhxtn is reported as ready as of 2020-10-16T18:55:08Z at IP 10.129.0.48 on node 10.0.172.47
  - No errors in the logs
- Endpoints openshift-authentication/oauth-openshift contains the IPs of both pods as above as of 2020-10-16T18:55:34Z
- Pod openshift-authentication-operator/authentication-operator-674754f47c-fzjvs is on a different node (10.0.137.175) than the oauth-openshift pods
- ClusterOperator authentication does not indicate OAuthServiceEndpointsCheckEndpointAccessibleControllerDegraded in its conditions
  - The absence of this condition indicates that the authentication operator successfully retrieved https://<pod ip>:6443/healthz for both oauth-openshift pods
  - There is no indication in the authentication operator logs that this degraded condition was present between 2020-10-16T18:55:34Z and 2020-10-16T20:36:46Z, i.e. the pods' /healthz endpoint was always accessible once they became ready
- The ClusterOperator authentication indicates OAuthRouteCheckEndpointAccessibleControllerDegraded and OAuthServiceCheckEndpointAccessibleControllerDegraded from at least 2020-10-16T19:08:45Z
- Route openshift-authentication/oauth-openshift indicates in its status that it was admitted as of 2020-10-16T17:57:44Z
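To make the reasoning above concrete, here is a minimal sketch (in Go, not the authentication operator's actual controller code) of probing /healthz directly via the two pod IPs and via the service's cluster IP from inside the cluster. The IPs are the ones observed in this report; the ports (6443 on the pods, 443 on the service) and the use of InsecureSkipVerify are assumptions made for a pure reachability check.

```go
// connectivity_check.go: illustrative sketch only. In the failure mode
// described in this bug, the two pod IP probes succeed while the service
// cluster IP probe fails with "connection refused".
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

func probe(name, url string) {
	client := &http.Client{
		Timeout: 5 * time.Second,
		// Skip certificate verification: we only care about reachability,
		// not about validating the serving cert.
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	resp, err := client.Get(url)
	if err != nil {
		fmt.Printf("%s: FAILED: %v\n", name, err)
		return
	}
	defer resp.Body.Close()
	fmt.Printf("%s: HTTP %d\n", name, resp.StatusCode)
}

func main() {
	// Pod IPs and service cluster IP as observed in this report.
	probe("pod 10.130.0.50", "https://10.130.0.50:6443/healthz")
	probe("pod 10.129.0.48", "https://10.129.0.48:6443/healthz")
	probe("service cluster IP", "https://172.30.6.25:443/healthz")
}
```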
I've been looking at the service, endpoints, and iptables rules, and they all look fine. Chatting with Juan, it seems he recently made changes to the multitenant plugin and missed bumping the OVS flows ruleVersion. He'll push a PR shortly.
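For context, a hedged sketch of the version-gating pattern being described: the names (ruleVersion, syncFlows, node) are illustrative and not the actual openshift-sdn code, but they show why forgetting to bump the flow rule version means an upgraded node never reprograms its OVS flows.

```go
// Sketch of the version-gating pattern described above (illustrative only).
// The node records the version of the OVS flow rules it last programmed and,
// after an upgrade, compares it against the expected ruleVersion. If the flow
// layout changed but ruleVersion was not bumped, the node keeps its stale
// flows and cross-node service traffic breaks until the node is rebooted.
package main

import "fmt"

// ruleVersion must be incremented whenever the OVS flow layout changes,
// so that already-running nodes know to reprogram their flows on upgrade.
const ruleVersion = 8

type node struct {
	programmedVersion int // version of the flows currently installed on this node
}

// syncFlows reprograms the OVS flows only if the installed version is stale.
func (n *node) syncFlows() {
	if n.programmedVersion == ruleVersion {
		fmt.Println("flows up to date, nothing to do")
		return
	}
	fmt.Printf("flow version %d != expected %d, reprogramming flows\n",
		n.programmedVersion, ruleVersion)
	// ... flush and reinstall OVS flows here ...
	n.programmedVersion = ruleVersion
}

func main() {
	// A node upgraded from 4.5 still has the previous flow version installed;
	// it only detects the mismatch if ruleVersion was bumped with the change.
	n := &node{programmedVersion: 7}
	n.syncFlows()
}
```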
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it's always been like this, we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z or 4.y.z to 4.y.z+1
I think I know the root cause of it, and if I'm right it's a trivial fix. It takes longer to test the fix than to write the fix itself.

> Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?

Customers using openshift-sdn in ovs-multitenant mode.

> What is the impact? Is it serious enough to warrant blocking edges?

Services won't be able to reach pods on a different node until the node is rebooted.

> How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

Admin needs to manually reboot.

> Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?

Yes. From any 4.5 to 4.6.0.
Given the limited scope of the multitenant SDN configuration and the well-understood workaround, we will not block upgrades to 4.6.0 based on this bug. However, we should do everything possible to ensure that this fix makes the first 4.6.z release, which means it needs to merge by end of week.
Zhanqi, could you test an upgrade from 4.5.9 with multitenant to 4.6 with https://github.com/openshift/sdn/pull/204? I believe the problem is now fixed. Ping me if you need help getting a release built with this PR.
If QE agrees, we'd like to get this into the master and release-4.6 branches tomorrow and include it in a 4.6.1 build that would also target tomorrow. When testing, please let us know whether that timeline works for QE.
(In reply to Scott Dodson from comment #14)
> If QE agrees, we'd like to get this into the master and release-4.6
> branches tomorrow and include it in a 4.6.1 build that would also target
> tomorrow. When testing, please let us know whether that timeline works
> for QE.

It's OK since it's a small change, but it seems the 4.6.1 build has already come out.
Moving it to verified for a minute so that the bot cherry-picks it automatically; then I'll move it back to post.
Actually that won't work until it's merged. So moving back to post
Verified this bug on 4.7.0-0.nightly-2020-10-23-024149, upgrading from 4.5.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633