Description of problem:

In the post-upgrade environment, the co/authentication Available condition oscillates between True and False.

Version-Release number of selected component (if applicable):

4.5.6 nightly upgraded to 4.6.0-0.nightly-2020-08-18-165040 nightly

How reproducible:

This is the first time we hit this.

Steps to Reproduce:

1. Launch a 4.5.6 nightly env. The env matrix is IPI_Azure with FIPS and etcd encryption on.
2. Upgrade it to 4.6.0-0.nightly-2020-08-18-165040 nightly. The upgrade succeeds, and all pods, nodes, and operators are well, except co/authentication:

```
$ while true; do echo "`date '+%Y-%m-%dT%H:%M:%S%:z'` $(oc get co authentication --no-headers)"; sleep 2; done
...
2020-08-20T13:48:32+05:30 authentication   4.6.0-0.nightly-2020-08-18-165040   True    False   False   2m20s
2020-08-20T13:48:35+05:30 authentication   4.6.0-0.nightly-2020-08-18-165040   True    False   False   2m23s
2020-08-20T13:48:38+05:30 authentication   4.6.0-0.nightly-2020-08-18-165040   True    False   False   2m26s
2020-08-20T13:48:41+05:30 authentication   4.6.0-0.nightly-2020-08-18-165040   False   False   False   0s
2020-08-20T13:48:44+05:30 authentication   4.6.0-0.nightly-2020-08-18-165040   False   False   False   3s
2020-08-20T13:48:47+05:30 authentication   4.6.0-0.nightly-2020-08-18-165040   False   False   False   6s
```

In the CAO pod logs, we found intermittent connection timeouts to master-0 (10.0.0.6) and master-2 (10.0.0.8), but no such timeouts to master-1 (10.0.0.7). The CAO pod runs on master-1.

```
$ oc logs authentication-operator-749c8b4b56-427q4 -n openshift-authentication-operator
...
I0820 08:39:42.266185 1 status_controller.go:172] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2020-08-20T04:36:41Z","message":"WellKnownReadyControllerDegraded: failed to GET well-known https://10.0.0.8:6443/.well-known/oauth-authorization-server: dial tcp 10.0.0.8:6443: i/o timeout","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2020-08-19T11:39:04Z","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2020-08-20T08:39:42Z","message":"WellKnownAvailable: The well-known endpoint is not yet available: failed to GET well-known https://10.0.0.8:6443/.well-known/oauth-authorization-server: dial tcp 10.0.0.8:6443: i/o timeout","reason":"WellKnown_NotReady","status":"False","type":"Available"},{"lastTransitionTime":"2020-08-19T02:22:28Z","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
I0820 08:39:42.297570 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"authentication-operator", UID:"fe6b486e-e00f-4d97-8f93-9165f5079010", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/authentication changed: Degraded message changed from "" to "WellKnownReadyControllerDegraded: failed to GET well-known https://10.0.0.8:6443/.well-known/oauth-authorization-server: dial tcp 10.0.0.8:6443: i/o timeout"
2020-08-20T08:39:44.758737686Z E0820 08:39:44.758657 1 base_controller.go:220] "Deployment" controller failed to sync "key", err: Operation cannot be fulfilled on authentications.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again
2020-08-20T08:39:44.759550279Z I0820 08:39:44.759495 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"authentication-operator", UID:"fe6b486e-e00f-4d97-8f93-9165f5079010", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'AuthenticationUpdateStatusFailed' Failed to update authentication operator status: Operation cannot be fulfilled on authentications.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again
...
```

Whenever the AuthenticationUpdateStatusFailed message above is seen, co/authentication is observed with its AGE starting again from 0s.

Actual results:

1. co/authentication oscillates between True and False.

Expected results:

1. All COs should be True False False.

Additional info:
It seems that the KAS is having some kind of trouble with connectivity; the conncheck container shows

```
2020-08-20T11:27:02.626596944Z W0820 11:27:02.626549 1 connection_checker.go:123] Unable to update status of kube-apiserver-zhsunupgrade819-ctcx8-master-0-to-load-balancer-api-external: rpc error: code = ResourceExhausted desc = trying to send message larger than max (2986286 vs. 2097152)
```

quite regularly, as well as

```
2020-08-20T11:27:03.279927499Z I0820 11:27:03.279898 1 connection_checker.go:216] Failure | TCPConnectError | 9.996883073s | Failed to establish a TCP connection to api-int.zhsunupgrade819.qe.azure.devcluster.openshift.com:6443: dial tcp 10.0.0.4:6443: i/o timeout
```

which appears to match the following error from the KAS container:

```
2020-08-20T11:27:02.230108683Z E0820 11:27:02.230029 17 status.go:71] apiserver received an error that is not an metav1.Status: &status.statusError{state:impl.MessageState{NoUnkeyedLiterals:pragma.NoUnkeyedLiterals{}, DoNotCompare:pragma.DoNotCompare{}, DoNotCopy:pragma.DoNotCopy{}, atomicMessageInfo:(*impl.MessageInfo)(nil)}, sizeCache:0, unknownFields:[]uint8(nil), Code:8, Message:"trying to send message larger than max (2980240 vs. 2097152)", Details:[]*anypb.Any(nil)}
2020-08-20T11:27:02.230308195Z I0820 11:27:02.230264 17 trace.go:205] Trace[1183161086]: "Update" url:/apis/controlplane.operator.openshift.io/v1alpha1/namespaces/openshift-kube-apiserver/podnetworkconnectivitychecks/kube-apiserver-zhsunupgrade819-ctcx8-master-0-to-load-balancer-api-internal/status,user-agent:cluster-kube-apiserver-operator/v0.0.0 (linux/amd64) kubernetes/$Format,client:::1 (20-Aug-2020 11:27:00.240) (total time: 1990ms):
2020-08-20T11:27:02.230308195Z Trace[1183161086]: ---"Conversion done" 165ms (11:27:00.412)
2020-08-20T11:27:02.230308195Z Trace[1183161086]: [1.990242374s] [1.990242374s] END
```

It's possible that the CAO is hitting the KAS while it is incapable of responding, but I am not sure what to make of the above.
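For reference, the PodNetworkConnectivityCheck objects the conncheck container is failing to update can be inspected directly. A sketch (the API group, namespace, and object name are taken from the trace URL above; the outages field is assumed from the v1alpha1 status schema):

```
# List all connectivity checks tracked in the kube-apiserver namespace.
oc get podnetworkconnectivitychecks.controlplane.operator.openshift.io \
  -n openshift-kube-apiserver

# Dump the recorded outages for the check that keeps failing above.
oc get podnetworkconnectivitychecks.controlplane.operator.openshift.io \
  kube-apiserver-zhsunupgrade819-ctcx8-master-0-to-load-balancer-api-internal \
  -n openshift-kube-apiserver -o jsonpath='{.status.outages}'
```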
IPI-installed a 4.5.7 cluster on Azure with FIPS on and etcd encryption enabled, then upgraded to the 4.6 nightly build successfully.

```
$ oc get infrastructures.config.openshift.io cluster -o yaml
...
  platformSpec:
    type: Azure
...
```

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-08-31-194600   True        False         10s     Cluster version is 4.6.0-0.nightly-2020-08-31-194600
```

```
$ oc get clusterversion -o json|jq ".items[0].status.history"
[
  {
    "completionTime": "2020-09-01T05:13:18Z",
    "image": "registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-08-31-194600",
    "startedTime": "2020-09-01T04:11:57Z",
    "state": "Completed",
    "verified": false,
    "version": "4.6.0-0.nightly-2020-08-31-194600"
  },
  {
    "completionTime": "2020-09-01T03:46:27Z",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:776b7e8158edf64c82f18f5ec4d6ef378ac3de81ba0dc2700b885ceb62e71279",
    "startedTime": "2020-09-01T03:15:16Z",
    "state": "Completed",
    "verified": false,
    "version": "4.5.7"
  }
]
```

```
$ oc get nodes
NAME                                           STATUS   ROLES    AGE    VERSION
kewang0151-h4szj-master-0                      Ready    master   122m   v1.19.0-rc.2+f71a7ab-dirty
kewang0151-h4szj-master-1                      Ready    master   122m   v1.19.0-rc.2+f71a7ab-dirty
kewang0151-h4szj-master-2                      Ready    master   122m   v1.19.0-rc.2+f71a7ab-dirty
kewang0151-h4szj-worker-northcentralus-9fkpz   Ready    worker   106m   v1.19.0-rc.2+f71a7ab-dirty
kewang0151-h4szj-worker-northcentralus-t6kpr   Ready    worker   106m   v1.19.0-rc.2+f71a7ab-dirty
kewang0151-h4szj-worker-northcentralus-thnhm   Ready    worker   106m   v1.19.0-rc.2+f71a7ab-dirty
```

```
$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-08-31-194600   True        False         False      3m5s
cloud-credential                           4.6.0-0.nightly-2020-08-31-194600   True        False         False      126m
cluster-autoscaler                         4.6.0-0.nightly-2020-08-31-194600   True        False         False      114m
config-operator                            4.6.0-0.nightly-2020-08-31-194600   True        False         False      115m
console                                    4.6.0-0.nightly-2020-08-31-194600   True        False         False      13m
csi-snapshot-controller                    4.6.0-0.nightly-2020-08-31-194600   True        False         False      17m
dns                                        4.6.0-0.nightly-2020-08-31-194600   True        False         False      30m
etcd                                       4.6.0-0.nightly-2020-08-31-194600   True        False         False      119m
image-registry                             4.6.0-0.nightly-2020-08-31-194600   True        False         False      104m
ingress                                    4.6.0-0.nightly-2020-08-31-194600   True        False         False      47m
insights                                   4.6.0-0.nightly-2020-08-31-194600   True        False         False      115m
kube-apiserver                             4.6.0-0.nightly-2020-08-31-194600   True        False         False      119m
kube-controller-manager                    4.6.0-0.nightly-2020-08-31-194600   True        False         False      119m
kube-scheduler                             4.6.0-0.nightly-2020-08-31-194600   True        False         False      118m
kube-storage-version-migrator              4.6.0-0.nightly-2020-08-31-194600   True        False         False      17m
machine-api                                4.6.0-0.nightly-2020-08-31-194600   True        False         False      112m
machine-approver                           4.6.0-0.nightly-2020-08-31-194600   True        False         False      119m
machine-config                             4.6.0-0.nightly-2020-08-31-194600   True        False         False      10m
marketplace                                4.6.0-0.nightly-2020-08-31-194600   True        False         False      17m
monitoring                                 4.6.0-0.nightly-2020-08-31-194600   True        False         False      16m
network                                    4.6.0-0.nightly-2020-08-31-194600   True        False         False      121m
node-tuning                                4.6.0-0.nightly-2020-08-31-194600   True        False         False      46m
openshift-apiserver                        4.6.0-0.nightly-2020-08-31-194600   True        False         False      21m
openshift-controller-manager               4.6.0-0.nightly-2020-08-31-194600   True        False         False      45m
openshift-samples                          4.6.0-0.nightly-2020-08-31-194600   True        False         False      46m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-08-31-194600   True        False         False      120m
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-08-31-194600   True        False         False      120m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-08-31-194600   True        False         False      13m
service-ca                                 4.6.0-0.nightly-2020-08-31-194600   True        False         False      121m
storage                                    4.6.0-0.nightly-2020-08-31-194600   True        False         False      47m
```

All looks fine, so moving the bug to Verified.
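For future re-verification, a one-liner along these lines (a sketch assuming jq, which is already used above) prints any cluster operator that deviates from the expected Available=True / Progressing=False / Degraded=False state; empty output means all COs are healthy:

```
# Print the name of any ClusterOperator that is not
# Available=True, Progressing=False, Degraded=False.
oc get co -o json | jq -r '
  .items[]
  | select(
      ([ .status.conditions[]
         | select((.type == "Available"   and .status != "True") or
                  (.type == "Progressing" and .status != "False") or
                  (.type == "Degraded"    and .status != "False")) ]
       | length) > 0)
  | .metadata.name'
```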
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196