Description of problem:

Upgrading 4.5.9 to 4.6.0-0.nightly-2020-09-13-023938 in QE CI failed with the authentication-operator failing to upgrade. Will attach the must-gather location in a private message.

Cluster profile: 12_Disconnected UPI on vSphere with RHCOS (FIPS off)

Status (captured 2020-09-14T07:28:06.960Z):
  Conditions:
    Last Transition Time:  2020-09-14T06:09:55Z
    Message:               OAuthServiceCheckEndpointAccessibleControllerDegraded: Get "https://172.30.29.85:443/healthz": dial tcp 172.30.29.85:443: connect: connection refused
    Reason:                OAuthServiceCheckEndpointAccessibleController_SyncError
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-09-14T06:12:09Z
    Reason:                AsExpected
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2020-09-14T06:07:55Z
    Message:               OAuthServiceCheckEndpointAccessibleControllerAvailable: Get "https://172.30.29.85:443/healthz": dial tcp 172.30.29.85:443: connect: connection refused
    Reason:                OAuthServiceCheckEndpointAccessibleController_EndpointUnavailable
    Status:                False
    Type:                  Available
    Last Transition Time:  2020-09-14T02:58:40Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable

Version-Release number of selected component (if applicable):
4.5.9 to 4.6.0-0.nightly-2020-09-13-023938
must-gather collection failed, cluster no longer available
Given that:

- the /healthz endpoint of both openshift-authentication/oauth-openshift pods was reachable by the authentication operator directly via the pods' IPs
- the /healthz endpoint of both pods was not reachable by the authentication operator via the openshift-authentication/oauth-openshift route or service
- the route was admitted, but neither it nor the service could be reached from the authentication operator

I assume that:

1. the authentication operator was not able to reach the oauth-openshift pods' IPs (10.130.0.50 and 10.129.0.48) via the oauth-openshift service's cluster IP (172.30.6.25)
2. either the route relies on the service's cluster IP and was broken by the cluster IP not delivering traffic to the pod IPs, or the pod IPs were not accessible from the router pods

Note that the logs of the router pods are of insufficient verbosity to indicate connectivity issues with either the cluster IP or the pod IPs.

Given the apparent issue with the oauth-openshift service, I'm reassigning to the SDN team for further investigation. A sketch of the kind of connectivity check behind this reasoning follows at the end of this comment.

-------------

Insights gleaned from the must-gather linked to by #c5:

- Pod openshift-authentication/oauth-openshift-9b9f596dd-hqzg6 is reported as ready as of 2020-10-16T18:55:34Z at IP 10.130.0.50 on node 10.0.221.148
  - No errors in the logs
- Pod openshift-authentication/oauth-openshift-9b9f596dd-nhxtn is reported as ready as of 2020-10-16T18:55:08Z at IP 10.129.0.48 on node 10.0.172.47
  - No errors in the logs
- Endpoints openshift-authentication/oauth-openshift contains the IPs of both pods as above as of 2020-10-16T18:55:34Z
- Pod openshift-authentication-operator/authentication-operator-674754f47c-fzjvs is on a different node (10.0.137.175) than the oauth-openshift pods
- ClusterOperator authentication does not indicate OAuthServiceEndpointsCheckEndpointAccessibleControllerDegraded in its conditions
  - The absence of this condition indicates that the authentication operator successfully retrieved https://<pod ip>:6443/healthz for both oauth-openshift pods
  - There is no indication in the authentication operator logs that this degraded condition was present between 2020-10-16T18:55:34Z and 2020-10-16T20:36:46Z, i.e. the pods' /healthz endpoint was always accessible once they became ready
- The ClusterOperator authentication indicates OAuthRouteCheckEndpointAccessibleControllerDegraded and OAuthServiceCheckEndpointAccessibleControllerDegraded from at least 2020-10-16T19:08:45Z
- Route openshift-authentication/oauth-openshift indicates in its status that it was admitted as of 2020-10-16T17:57:44Z
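To make the reasoning above concrete, here is a minimal sketch (in Go, not the authentication operator's actual controller code) of probing /healthz directly via the two pod IPs and via the service's cluster IP from inside the cluster. The IPs are the ones observed in this report; the ports (6443 on the pods, 443 on the service) and the use of InsecureSkipVerify are assumptions made for a pure reachability check.

```go
// connectivity_check.go: illustrative sketch only. In the failure mode
// described in this bug, the two pod IP probes succeed while the service
// cluster IP probe fails with "connection refused".
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

func probe(name, url string) {
	client := &http.Client{
		Timeout: 5 * time.Second,
		// Skip certificate verification: we only care about reachability,
		// not about validating the serving cert.
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	resp, err := client.Get(url)
	if err != nil {
		fmt.Printf("%s: FAILED: %v\n", name, err)
		return
	}
	defer resp.Body.Close()
	fmt.Printf("%s: HTTP %d\n", name, resp.StatusCode)
}

func main() {
	// Pod IPs and service cluster IP as observed in this report.
	probe("pod 10.130.0.50", "https://10.130.0.50:6443/healthz")
	probe("pod 10.129.0.48", "https://10.129.0.48:6443/healthz")
	probe("service cluster IP", "https://172.30.6.25:443/healthz")
}
```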
I've been looking at the service, endpoints, and iptables rules, and they all look fine. Chatting with Juan, it seems he recently made changes to the multitenant plugin and missed bumping the OVS flows ruleVersion. He'll push a PR shortly.
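For context, a hedged sketch of the version-gating pattern being described: the names (ruleVersion, syncFlows, node) are illustrative and not the actual openshift-sdn code, but they show why forgetting to bump the flow rule version means an upgraded node never reprograms its OVS flows.

```go
// Sketch of the version-gating pattern described above (illustrative only).
// The node records the version of the OVS flow rules it last programmed and,
// after an upgrade, compares it against the expected ruleVersion. If the flow
// layout changed but ruleVersion was not bumped, the node keeps its stale
// flows and cross-node service traffic breaks until the node is rebooted.
package main

import "fmt"

// ruleVersion must be incremented whenever the OVS flow layout changes,
// so that already-running nodes know to reprogram their flows on upgrade.
const ruleVersion = 8

type node struct {
	programmedVersion int // version of the flows currently installed on this node
}

// syncFlows reprograms the OVS flows only if the installed version is stale.
func (n *node) syncFlows() {
	if n.programmedVersion == ruleVersion {
		fmt.Println("flows up to date, nothing to do")
		return
	}
	fmt.Printf("flow version %d != expected %d, reprogramming flows\n",
		n.programmedVersion, ruleVersion)
	// ... flush and reinstall OVS flows here ...
	n.programmedVersion = ruleVersion
}

func main() {
	// A node upgraded from 4.5 still has the previous flow version installed;
	// it only detects the mismatch if ruleVersion was bumped with the change.
	n := &node{programmedVersion: 7}
	n.syncFlows()
}
```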
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it's always been like this, we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z or 4.y.z to 4.y.z+1
I think I know the root cause of it, and if I'm right it's a trivial fix. It takes longer to test the fix than to write the fix itself.

> Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?

Customers using openshift-sdn in ovs-multitenant mode.

> What is the impact? Is it serious enough to warrant blocking edges?

Services won't be able to reach pods on a different node until the node is rebooted.

> How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

Admin needs to manually reboot.

> Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?

Yes. From any 4.5 to 4.6.0.
Given the limited scope of the multitenant SDN configuration and the well-understood workaround, we will not block upgrades to 4.6.0 based on this bug. However, we should do everything possible to ensure that this fix makes the first 4.6.z release, which means it needs to merge by end of week.
Zhanqi, could you test an upgrade from 4.5.9 with multitenant to 4.6 with https://github.com/openshift/sdn/pull/204? I believe the problem is now fixed. Ping me if you need help getting a release built with this PR.
If QE agrees, we'd like to get this into the master and release-4.6 branches tomorrow and include it in a 4.6.1 build that would also target tomorrow. When testing, please let us know whether that timeline works for QE.
(In reply to Scott Dodson from comment #14)
> If QE agrees, we'd like to get this into the master and release-4.6
> branches tomorrow and include it in a 4.6.1 build that would also target
> tomorrow. When testing, please let us know whether that timeline works
> for QE.

It's OK since it's a small change, but it seems the 4.6.1 build has already come out.
Moving it to verified for a minute so that the bot cherry-picks it automatically; then I'll move it back to post.
Actually that won't work until it's merged. So moving back to post
Verified this bug on 4.7.0-0.nightly-2020-10-23-024149, upgrading from 4.5.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633