Description of problem:

Upgrading an OpenShift cluster from 4.3.18 to 4.4.x results in a degraded authentication operator, although OAuth still appears to work correctly. The error reported by the authentication operator is:

IngressStateEndpointsDegraded: Unhealthy addresses found: 172.30.2.146:Get https://172.30.2.146:6443/healthz: dial tcp 172.30.2.146:6443: connect: connection timed out,172.30.4.152:Get https://172.30.4.152:6443/healthz: dial tcp 172.30.4.152:6443: connect: connection timed out

I can curl these endpoints from the oauth pods and receive an 'OK' back, but a curl from the authentication-operator pod times out (I think this is what may be causing the issue). The exact same behaviour is present in a 4.3 cluster with a healthy authentication operator, but I can only assume this /healthz check is not happening there.

I can fix this issue by joining the openshift-authentication project to the openshift-authentication-operator project using this command:

oc adm pod-network join-projects --to=openshift-authentication-operator openshift-authentication

But I don't think it should be necessary to do this.

Version-Release number of selected component (if applicable): 4.4.3/4.4.4

How reproducible: Always

Steps to Reproduce:
1. Upgrade a cluster from 4.3.18 to 4.4.3 or 4.4.4
2. Check the authentication operator

Actual results: Authentication operator is "Degraded", although it appears functional

Expected results: Authentication operator is "Available: True"
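For reference, the Degraded condition above can be read programmatically from the ClusterOperator status. This is a minimal sketch with a hypothetical helper (`degraded_reasons` is not part of any OpenShift tooling); it assumes input shaped like the JSON returned by `oc get co authentication -o json`:

```python
import json

def degraded_reasons(co_json: str):
    """Return (is_degraded, message) from a ClusterOperator JSON blob.

    Hypothetical helper; looks up the condition of type "Degraded"
    under .status.conditions, as reported by `oc get co -o json`.
    """
    co = json.loads(co_json)
    for cond in co.get("status", {}).get("conditions", []):
        if cond.get("type") == "Degraded":
            return cond.get("status") == "True", cond.get("message", "")
    return False, ""

# Sample input modelled on the error message in this report.
sample = json.dumps({
    "status": {"conditions": [
        {"type": "Available", "status": "True"},
        {"type": "Degraded", "status": "True",
         "message": "IngressStateEndpointsDegraded: Unhealthy addresses found"},
    ]}
})
degraded, msg = degraded_reasons(sample)
# degraded -> True; msg begins with "IngressStateEndpointsDegraded"
```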
Looks like an SDN issue. If it turns out to really be one, please look into whether it is possible to make the SDN operator go Degraded based on the root cause.
Setting the target release to the development branch so we can identify the issue and fix it. We can work out where we backport to after the fix has been identified.
When the authentication operator is degraded it seems to block other operators from upgrading. I've joined the authentication projects together so the health check passes, and this has now allowed me to complete the 4.4.4 upgrade (from 4.3.18). I cannot seem to make the SDN/network operator degrade, or find any clues in the sdn logs.

The cluster state is now like this, with the openshift-authentication and openshift-authentication-operator projects isolated:

$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.4     True        False         True       57d
cloud-credential                           4.4.4     True        False         False      57d
cluster-autoscaler                         4.4.4     True        False         False      57d
console                                    4.4.4     True        False         False      50m
csi-snapshot-controller                    4.4.4     True        False         False      6d3h
dns                                        4.4.4     True        False         False      6d3h
etcd                                       4.4.4     True        False         False      116m
image-registry                             4.4.4     True        False         False      4h54m
ingress                                    4.4.4     True        False         False      133m
insights                                   4.4.4     True        False         False      57d
kube-apiserver                             4.4.4     True        False         False      57d
kube-controller-manager                    4.4.4     True        False         False      14d
kube-scheduler                             4.4.4     True        False         False      14d
kube-storage-version-migrator              4.4.4     True        False         False      7d9h
machine-api                                4.4.4     True        False         False      57d
machine-config                             4.4.4     True        False         False      43m
marketplace                                4.4.4     True        False         False      107m
monitoring                                 4.4.4     True        False         False      24h
network                                    4.4.4     True        False         False      57d
node-tuning                                4.4.4     True        False         False      24h
openshift-apiserver                        4.4.4     True        False         False      117m
openshift-controller-manager               4.4.4     True        False         False      24h
openshift-samples                          4.4.4     True        False         False      8m32s
operator-lifecycle-manager                 4.4.4     True        False         False      57d
operator-lifecycle-manager-catalog         4.4.4     True        False         False      57d
operator-lifecycle-manager-packageserver   4.4.4     True        False         False      50m
service-ca                                 4.4.4     True        False         False      57d
service-catalog-apiserver                  4.4.4     True        False         False      57d
service-catalog-controller-manager         4.4.4     True        False         False      57d
storage                                    4.4.4     True        False         False      24h

And after I join the openshift-authentication and openshift-authentication-operator projects it very quickly becomes available:

$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.4     True        False         False      57d
cloud-credential                           4.4.4     True        False         False      57d
cluster-autoscaler                         4.4.4     True        False         False      57d
console                                    4.4.4     True        False         False      54m
csi-snapshot-controller                    4.4.4     True        False         False      6d4h
dns                                        4.4.4     True        False         False      6d4h
etcd                                       4.4.4     True        False         False      121m
image-registry                             4.4.4     True        False         False      4h58m
ingress                                    4.4.4     True        False         False      138m
insights                                   4.4.4     True        False         False      57d
kube-apiserver                             4.4.4     True        False         False      57d
kube-controller-manager                    4.4.4     True        False         False      14d
kube-scheduler                             4.4.4     True        False         False      14d
kube-storage-version-migrator              4.4.4     True        False         False      7d9h
machine-api                                4.4.4     True        False         False      57d
machine-config                             4.4.4     True        False         False      47m
marketplace                                4.4.4     True        False         False      111m
monitoring                                 4.4.4     True        False         False      24h
network                                    4.4.4     True        False         False      57d
node-tuning                                4.4.4     True        False         False      25h
openshift-apiserver                        4.4.4     True        False         False      121m
openshift-controller-manager               4.4.4     True        False         False      24h
openshift-samples                          4.4.4     True        False         False      3m52s
operator-lifecycle-manager                 4.4.4     True        False         False      57d
operator-lifecycle-manager-catalog         4.4.4     True        False         False      57d
operator-lifecycle-manager-packageserver   4.4.4     True        False         False      55m
service-ca                                 4.4.4     True        False         False      57d
service-catalog-apiserver                  4.4.4     True        False         False      57d
service-catalog-controller-manager         4.4.4     True        False         False      57d
storage                                    4.4.4     True        False         False      24h
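For anyone triaging a similar upgrade, a small sketch (a hypothetical helper, not part of any OpenShift tooling) that picks the degraded rows out of plain `oc get co` output:

```python
def degraded_operators(oc_get_co_output: str):
    """Return names of operators whose DEGRADED column reads True.

    Assumes the standard `oc get co` column order:
    NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE.
    """
    names = []
    for line in oc_get_co_output.strip().splitlines()[1:]:  # skip header row
        fields = line.split()
        if len(fields) >= 5 and fields[4] == "True":
            names.append(fields[0])
    return names

# Sample trimmed from the output in this report.
sample = """NAME            VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication  4.4.4     True        False         True       57d
network         4.4.4     True        False         False      57d"""
# degraded_operators(sample) -> ["authentication"]
```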
Verified this bug on 4.5.0-0.nightly-2020-05-29-001153. The authentication operator works well in openshift-ovs-multitenant mode:

$ oc get clusternetwork
NAME      CLUSTER NETWORK   SERVICE NETWORK   PLUGIN NAME
default   10.128.0.0/14     172.30.0.0/16     redhat/openshift-ovs-multitenant

$ oc get netnamespaces | grep auth
openshift-authentication            1
openshift-authentication-operator   1
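In ovs-multitenant mode, projects that have been joined share a VNID. A sketch of that check with a hypothetical helper (`share_vnid` is illustrative only; the input is modelled on `oc get netnamespaces -o json`, where each NetNamespace carries a top-level `netid`):

```python
import json

def share_vnid(netns_json: str, a: str, b: str) -> bool:
    """True if projects a and b carry the same netid (i.e. were joined).

    Hypothetical helper; input shaped like `oc get netnamespaces -o json`.
    """
    items = json.loads(netns_json)["items"]
    ids = {i["metadata"]["name"]: i.get("netid") for i in items}
    return a in ids and b in ids and ids[a] == ids[b]

# Sample modelled on the netnamespace output in this verification.
sample = json.dumps({"items": [
    {"metadata": {"name": "openshift-authentication"}, "netid": 1},
    {"metadata": {"name": "openshift-authentication-operator"}, "netid": 1},
]})
# share_vnid(sample, "openshift-authentication",
#            "openshift-authentication-operator") -> True
```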
*** Bug 1851782 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409