Bug 1837575

Summary: Upgrade from 4.3.18 -> 4.4.x results in degraded Authentication Operator (IngressStateEndpoints_UnhealthyAddresses)
Product: OpenShift Container Platform Reporter: oliver.bawler
Component: Networking Assignee: Ricardo Carrillo Cruz <ricarril>
Networking sub component: openshift-sdn QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: unspecified CC: aconstan, alchan, aos-bugs, mfojtik, slaznick
Version: 4.4 Keywords: UpcomingSprint
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1841507 (view as bug list) Environment:
Last Closed: 2020-07-13 17:40:02 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1841507    

Description oliver.bawler 2020-05-19 16:42:44 UTC
Description of problem:

Upgrading an OpenShift cluster from 4.3.18 to 4.4.x results in a degraded authentication operator, although OAuth still appears to work correctly. The error reported by the authentication operator is:

    IngressStateEndpointsDegraded: Unhealthy addresses found: 172.30.2.146:Get https://172.30.2.146:6443/healthz: dial tcp 172.30.2.146:6443: connect: connection timed out,172.30.4.152:Get https://172.30.4.152:6443/healthz: dial tcp 172.30.4.152:6443: connect: connection timed out

I can curl these endpoints from the oauth pods and receive an 'OK' back, but a curl from the authentication-operator pod times out, which I think may be causing the issue. The same curl behaviour is present in a 4.3 cluster with a healthy authentication operator, so I can only assume this /healthz check is not performed there.
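
To decide which endpoints to curl, the addresses can be pulled out of the degraded message. The following is a minimal sketch, not part of the original report: the function name unhealthy_addresses is hypothetical, and it assumes the message keeps the "<ip>:Get https://..." layout shown above.

```shell
# Read text containing an IngressStateEndpointsDegraded message on stdin and
# print each unhealthy IP address once, one per line.
# Assumption: every unhealthy entry has the form "<ipv4>:Get <url>...".
unhealthy_addresses() {
  grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}:Get ' | cut -d: -f1 | sort -u
}
```

It could be fed from the operator status, for example with `oc get clusteroperator authentication -o json | unhealthy_addresses`.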

I can fix this issue by joining the openshift-authentication project to the openshift-authentication-operator project using this command:

oc adm pod-network join-projects --to=openshift-authentication-operator openshift-authentication

But I don't think it should be necessary to do this.

Version-Release number of selected component (if applicable):
4.4.3/4.4.4

How reproducible:
Always

Steps to Reproduce:
1. Upgrade cluster from 4.3.18 > 4.4.3 or 4.4.4
2. Check Authentication operator

Actual results:
Authentication operator is "Degraded" although it appears functional

Expected results:
Authentication operator is "Available: True"

Comment 1 Standa Laznicka 2020-05-20 06:54:51 UTC
Looks like an sdn issue. If it turns out to really be one, please check whether it's possible to make the sdn operator go degraded based on the root cause.

Comment 2 Ben Bennett 2020-05-20 13:30:46 UTC
Setting the target release to the development branch so we can identify the issue and fix it.  We can work out where we backport to after the fix has been identified.

Comment 3 oliver.bawler 2020-05-20 16:16:26 UTC
When the Authentication operator is degraded, it seems to block other operators from upgrading. I've joined the authentication projects together so that the health check passes, which has allowed me to complete the 4.4.4 upgrade (from 4.3.18). I cannot make the SDN/Network operator go degraded, nor can I find any clues in the sdn logs.

The cluster state is now like this with the openshift-authentication and openshift-authentication-operator projects isolated:

$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.4     True        False         True       57d
cloud-credential                           4.4.4     True        False         False      57d
cluster-autoscaler                         4.4.4     True        False         False      57d
console                                    4.4.4     True        False         False      50m
csi-snapshot-controller                    4.4.4     True        False         False      6d3h
dns                                        4.4.4     True        False         False      6d3h
etcd                                       4.4.4     True        False         False      116m
image-registry                             4.4.4     True        False         False      4h54m
ingress                                    4.4.4     True        False         False      133m
insights                                   4.4.4     True        False         False      57d
kube-apiserver                             4.4.4     True        False         False      57d
kube-controller-manager                    4.4.4     True        False         False      14d
kube-scheduler                             4.4.4     True        False         False      14d
kube-storage-version-migrator              4.4.4     True        False         False      7d9h
machine-api                                4.4.4     True        False         False      57d
machine-config                             4.4.4     True        False         False      43m
marketplace                                4.4.4     True        False         False      107m
monitoring                                 4.4.4     True        False         False      24h
network                                    4.4.4     True        False         False      57d
node-tuning                                4.4.4     True        False         False      24h
openshift-apiserver                        4.4.4     True        False         False      117m
openshift-controller-manager               4.4.4     True        False         False      24h
openshift-samples                          4.4.4     True        False         False      8m32s
operator-lifecycle-manager                 4.4.4     True        False         False      57d
operator-lifecycle-manager-catalog         4.4.4     True        False         False      57d
operator-lifecycle-manager-packageserver   4.4.4     True        False         False      50m
service-ca                                 4.4.4     True        False         False      57d
service-catalog-apiserver                  4.4.4     True        False         False      57d
service-catalog-controller-manager         4.4.4     True        False         False      57d
storage                                    4.4.4     True        False         False      24h

And after I join the openshift-authentication and openshift-authentication-operator projects it very quickly becomes available:

$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.4     True        False         False      57d
cloud-credential                           4.4.4     True        False         False      57d
cluster-autoscaler                         4.4.4     True        False         False      57d
console                                    4.4.4     True        False         False      54m
csi-snapshot-controller                    4.4.4     True        False         False      6d4h
dns                                        4.4.4     True        False         False      6d4h
etcd                                       4.4.4     True        False         False      121m
image-registry                             4.4.4     True        False         False      4h58m
ingress                                    4.4.4     True        False         False      138m
insights                                   4.4.4     True        False         False      57d
kube-apiserver                             4.4.4     True        False         False      57d
kube-controller-manager                    4.4.4     True        False         False      14d
kube-scheduler                             4.4.4     True        False         False      14d
kube-storage-version-migrator              4.4.4     True        False         False      7d9h
machine-api                                4.4.4     True        False         False      57d
machine-config                             4.4.4     True        False         False      47m
marketplace                                4.4.4     True        False         False      111m
monitoring                                 4.4.4     True        False         False      24h
network                                    4.4.4     True        False         False      57d
node-tuning                                4.4.4     True        False         False      25h
openshift-apiserver                        4.4.4     True        False         False      121m
openshift-controller-manager               4.4.4     True        False         False      24h
openshift-samples                          4.4.4     True        False         False      3m52s
operator-lifecycle-manager                 4.4.4     True        False         False      57d
operator-lifecycle-manager-catalog         4.4.4     True        False         False      57d
operator-lifecycle-manager-packageserver   4.4.4     True        False         False      55m
service-ca                                 4.4.4     True        False         False      57d
service-catalog-apiserver                  4.4.4     True        False         False      57d
service-catalog-controller-manager         4.4.4     True        False         False      57d
storage                                    4.4.4     True        False         False      24h
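
In listings this long, the one degraded row is easy to miss. A minimal sketch of a filter for it follows; the function name degraded_operators is hypothetical, and it assumes the column order shown in the listings above (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, SINCE).

```shell
# Read `oc get co` output on stdin and print the name of every cluster
# operator whose DEGRADED column (the fifth field) is "True".
# The header row is skipped.
degraded_operators() {
  awk 'NR > 1 && $5 == "True" { print $1 }'
}
```

Usage would be `oc get co | degraded_operators`, which prints nothing when no operator is degraded.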

Comment 7 zhaozhanqi 2020-05-29 10:46:05 UTC
Verified this bug on 4.5.0-0.nightly-2020-05-29-001153.

The authentication operator works well in openshift-ovs-multitenant mode:

# oc get clusternetwork
NAME      CLUSTER NETWORK   SERVICE NETWORK   PLUGIN NAME
default   10.128.0.0/14     172.30.0.0/16     redhat/openshift-ovs-multitenant

# oc get netnamespaces | grep auth
openshift-authentication                           1          
openshift-authentication-operator                  1
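
The output above shows both namespaces with the same NETID. A minimal sketch of an automated check for that condition follows; the function name same_netid is hypothetical, and it assumes the NETID is the second column of the `oc get netnamespaces` output, as in the listing above.

```shell
# Read `oc get netnamespaces` output on stdin and succeed (exit 0) only if
# the two named namespaces are both present and share the same NETID.
# Arguments: $1 and $2 are the namespace names to compare.
same_netid() {
  awk -v a="$1" -v b="$2" '
    $1 == a { ida = $2 }
    $1 == b { idb = $2 }
    END { exit !(ida != "" && ida == idb) }
  '
}
```

For example, `oc get netnamespaces | same_netid openshift-authentication openshift-authentication-operator && echo shared`.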

Comment 8 Maru Newby 2020-06-30 20:50:34 UTC
*** Bug 1851782 has been marked as a duplicate of this bug. ***

Comment 9 errata-xmlrpc 2020-07-13 17:40:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409