Bug 1837575 - Upgrade from 4.3.18 -> 4.4.x results in degraded Authentication Operator (IngressStateEndpoints_UnhealthyAddresses)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Ricardo Carrillo Cruz
QA Contact: zhaozhanqi
URL:
Whiteboard:
Duplicates: 1851782
Depends On:
Blocks: 1841507
Reported: 2020-05-19 16:42 UTC by oliver.bawler
Modified: 2020-07-13 17:40 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1841507
Environment:
Last Closed: 2020-07-13 17:40:02 UTC
Target Upstream Version:


Links
System | ID | Priority | Status | Summary | Last Updated
Github | openshift cluster-network-operator pull 650 | None | closed | Bug 1837575: Allow connection between authentication net namespaces | 2020-12-24 04:59:50 UTC
Red Hat Product Errata | RHBA-2020:2409 | None | None | None | 2020-07-13 17:40:22 UTC

Description oliver.bawler 2020-05-19 16:42:44 UTC
Description of problem:

Upgrading an OpenShift cluster from 4.3.18 to 4.4.x results in a degraded authentication operator, although OAuth still appears to work correctly. The error reported by the authentication operator is:

    IngressStateEndpointsDegraded: Unhealthy addresses found: 172.30.2.146:Get https://172.30.2.146:6443/healthz: dial tcp 172.30.2.146:6443: connect: connection timed out,172.30.4.152:Get https://172.30.4.152:6443/healthz: dial tcp 172.30.4.152:6443: connect: connection timed out

I can curl these endpoints from the oauth pods and receive an 'OK' back, but a curl from the authentication-operator pod times out, which I think may be what is causing the issue. The exact same behaviour is present in a 4.3 cluster with a healthy authentication operator, so I can only assume this /healthz check is not performed there.
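
The manual check described above can be sketched as a small shell function, meant to be run from inside each pod (for example after `oc rsh`). The function name `probe_healthz`, the 5-second timeout, and the use of `-k` to skip certificate verification are illustrative assumptions; port 6443 and the `/healthz` path come from the error message.

```shell
# Hypothetical helper: probe https://<addr>:6443/healthz for each address
# given as an argument and print one result line per address.
probe_healthz() {
    for addr in "$@"; do
        # -s silent, -k skip TLS verification, -f treat HTTP errors as failure;
        # --max-time bounds the whole request so an unreachable pod fails fast.
        if curl -skf --max-time 5 "https://${addr}:6443/healthz" >/dev/null; then
            echo "${addr}: ok"
        else
            echo "${addr}: unhealthy"
        fi
    done
}
```

Called as `probe_healthz 172.30.2.146 172.30.4.152`, it would be expected to report both addresses as ok from the oauth pods and as unhealthy from the authentication-operator pod while the projects are isolated.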

I can fix this issue by joining the openshift-authentication project to the openshift-authentication-operator project using this command:

oc adm pod-network join-projects --to=openshift-authentication-operator openshift-authentication

But I don't think it should be necessary to do this.

Version-Release number of selected component (if applicable):
4.4.3/4.4.4

How reproducible:
Always

Steps to Reproduce:
1. Upgrade the cluster from 4.3.18 -> 4.4.3 or 4.4.4
2. Check Authentication operator

Actual results:
Authentication operator is "Degraded" although it appears functional

Expected results:
Authentication operator is "Available: True"

Comment 1 Standa Laznicka 2020-05-20 06:54:51 UTC
Looks like an SDN issue. If it turns out to really be one, please check whether it's possible to make the SDN operator go degraded based on the root cause.

Comment 2 Ben Bennett 2020-05-20 13:30:46 UTC
Setting the target release to the development branch so we can identify the issue and fix it.  We can work out where we backport to after the fix has been identified.

Comment 3 oliver.bawler 2020-05-20 16:16:26 UTC
When the Authentication operator is degraded, it seems to block other operators from upgrading. I've joined the authentication projects together so that the health check passes, and this has allowed me to complete the 4.4.4 upgrade (from 4.3.18). I cannot make the SDN/Network operator degrade, or find any clues in the sdn logs.

The cluster state is now like this with the openshift-authentication and openshift-authentication-operator projects isolated:

$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.4     True        False         True       57d
cloud-credential                           4.4.4     True        False         False      57d
cluster-autoscaler                         4.4.4     True        False         False      57d
console                                    4.4.4     True        False         False      50m
csi-snapshot-controller                    4.4.4     True        False         False      6d3h
dns                                        4.4.4     True        False         False      6d3h
etcd                                       4.4.4     True        False         False      116m
image-registry                             4.4.4     True        False         False      4h54m
ingress                                    4.4.4     True        False         False      133m
insights                                   4.4.4     True        False         False      57d
kube-apiserver                             4.4.4     True        False         False      57d
kube-controller-manager                    4.4.4     True        False         False      14d
kube-scheduler                             4.4.4     True        False         False      14d
kube-storage-version-migrator              4.4.4     True        False         False      7d9h
machine-api                                4.4.4     True        False         False      57d
machine-config                             4.4.4     True        False         False      43m
marketplace                                4.4.4     True        False         False      107m
monitoring                                 4.4.4     True        False         False      24h
network                                    4.4.4     True        False         False      57d
node-tuning                                4.4.4     True        False         False      24h
openshift-apiserver                        4.4.4     True        False         False      117m
openshift-controller-manager               4.4.4     True        False         False      24h
openshift-samples                          4.4.4     True        False         False      8m32s
operator-lifecycle-manager                 4.4.4     True        False         False      57d
operator-lifecycle-manager-catalog         4.4.4     True        False         False      57d
operator-lifecycle-manager-packageserver   4.4.4     True        False         False      50m
service-ca                                 4.4.4     True        False         False      57d
service-catalog-apiserver                  4.4.4     True        False         False      57d
service-catalog-controller-manager         4.4.4     True        False         False      57d
storage                                    4.4.4     True        False         False      24h

And after I join the openshift-authentication and openshift-authentication-operator projects, the authentication operator very quickly stops reporting Degraded:

$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.4     True        False         False      57d
cloud-credential                           4.4.4     True        False         False      57d
cluster-autoscaler                         4.4.4     True        False         False      57d
console                                    4.4.4     True        False         False      54m
csi-snapshot-controller                    4.4.4     True        False         False      6d4h
dns                                        4.4.4     True        False         False      6d4h
etcd                                       4.4.4     True        False         False      121m
image-registry                             4.4.4     True        False         False      4h58m
ingress                                    4.4.4     True        False         False      138m
insights                                   4.4.4     True        False         False      57d
kube-apiserver                             4.4.4     True        False         False      57d
kube-controller-manager                    4.4.4     True        False         False      14d
kube-scheduler                             4.4.4     True        False         False      14d
kube-storage-version-migrator              4.4.4     True        False         False      7d9h
machine-api                                4.4.4     True        False         False      57d
machine-config                             4.4.4     True        False         False      47m
marketplace                                4.4.4     True        False         False      111m
monitoring                                 4.4.4     True        False         False      24h
network                                    4.4.4     True        False         False      57d
node-tuning                                4.4.4     True        False         False      25h
openshift-apiserver                        4.4.4     True        False         False      121m
openshift-controller-manager               4.4.4     True        False         False      24h
openshift-samples                          4.4.4     True        False         False      3m52s
operator-lifecycle-manager                 4.4.4     True        False         False      57d
operator-lifecycle-manager-catalog         4.4.4     True        False         False      57d
operator-lifecycle-manager-packageserver   4.4.4     True        False         False      55m
service-ca                                 4.4.4     True        False         False      57d
service-catalog-apiserver                  4.4.4     True        False         False      57d
service-catalog-controller-manager         4.4.4     True        False         False      57d
storage                                    4.4.4     True        False         False      24h
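
In listings like the two above, the only difference is the DEGRADED column of the authentication row. A minimal sketch for picking out degraded operators from such output follows; the function name `degraded_operators` is hypothetical, and the sketch assumes the six-column layout shown here with a non-empty VERSION column.

```shell
# Hypothetical helper: read `oc get co` output on stdin and print the name of
# every operator whose DEGRADED column (5th field) is "True".
degraded_operators() {
    # Skip the header row, if present, by ignoring the line that starts with NAME.
    awk '$1 != "NAME" && $5 == "True" { print $1 }'
}
```

For example, `oc get co | degraded_operators` prints `authentication` for the first listing and nothing for the second.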

Comment 7 zhaozhanqi 2020-05-29 10:46:05 UTC
Verified this bug on 4.5.0-0.nightly-2020-05-29-001153.

The authentication operator works well in openshift-ovs-multitenant mode:

# oc get clusternetwork
NAME      CLUSTER NETWORK   SERVICE NETWORK   PLUGIN NAME
default   10.128.0.0/14     172.30.0.0/16     redhat/openshift-ovs-multitenant

# oc get netnamespaces | grep auth
openshift-authentication                           1
openshift-authentication-operator                  1
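
In the listing above both namespaces carry NETID 1; in multitenant mode, namespaces with the same NETID can reach each other. A minimal sketch for checking this from the `oc get netnamespaces` output follows. The function name `same_netid` is hypothetical, and it assumes the NETID is the second column, as shown here.

```shell
# Hypothetical helper: read `oc get netnamespaces` output on stdin and succeed
# only if both named namespaces are present and share the same NETID.
same_netid() {
    awk -v a="$1" -v b="$2" '
        $1 == a { ida = $2 }    # NETID of the first namespace
        $1 == b { idb = $2 }    # NETID of the second namespace
        END { exit !(ida != "" && ida == idb) }
    '
}
```

Usage: `oc get netnamespaces | same_netid openshift-authentication openshift-authentication-operator && echo "same NETID"`.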

Comment 8 Maru Newby 2020-06-30 20:50:34 UTC
*** Bug 1851782 has been marked as a duplicate of this bug. ***

Comment 9 errata-xmlrpc 2020-07-13 17:40:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

