Bug 1870565
| Summary: | [4.6] co authentication status oscillates with error "WellKnownReadyControllerDegraded failed to GET well-known https://a.b.c.d:6443/.well-known/oauth-authorization-server: dial tcp a.b.c.d:6443: i/o timeout" | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | pmali |
| Component: | kube-apiserver | Assignee: | Stefan Schimanski <sttts> |
| Status: | CLOSED ERRATA | QA Contact: | Ke Wang <kewang> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.6 | CC: | aos-bugs, mfojtik, sttts, wsun, xxia |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-27 16:29:49 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
It seems that the KAS is having some kind of connectivity trouble; the conncheck container shows
```
2020-08-20T11:27:02.626596944Z W0820 11:27:02.626549 1 connection_checker.go:123] Unable to update status of kube-apiserver-zhsunupgrade819-ctcx8-master-0-to-load-balancer-api-external: rpc error: code = ResourceExhausted desc = trying to send message larger than max (2986286 vs. 2097152)
```
quite regularly, as well as
```
2020-08-20T11:27:03.279927499Z I0820 11:27:03.279898 1 connection_checker.go:216] Failure | TCPConnectError | 9.996883073s | Failed to establish a TCP connection to api-int.zhsunupgrade819.qe.azure.devcluster.openshift.com:6443: dial tcp 10.0.0.4:6443: i/o timeout
```
which appears to match the following error from the KAS container:
```
2020-08-20T11:27:02.230108683Z E0820 11:27:02.230029 17 status.go:71] apiserver received an error that is not an metav1.Status: &status.statusError{state:impl.MessageState{NoUnkeyedLiterals:pragma.NoUnkeyedLiterals{}, DoNotCompare:pragma.DoNotCompare{}, DoNotCopy:pragma.DoNotCopy{}, atomicMessageInfo:(*impl.MessageInfo)(nil)}, sizeCache:0, unknownFields:[]uint8(nil), Code:8, Message:"trying to send message larger than max (2980240 vs. 2097152)", Details:[]*anypb.Any(nil)}
2020-08-20T11:27:02.230308195Z I0820 11:27:02.230264 17 trace.go:205] Trace[1183161086]: "Update" url:/apis/controlplane.operator.openshift.io/v1alpha1/namespaces/openshift-kube-apiserver/podnetworkconnectivitychecks/kube-apiserver-zhsunupgrade819-ctcx8-master-0-to-load-balancer-api-internal/status,user-agent:cluster-kube-apiserver-operator/v0.0.0 (linux/amd64) kubernetes/$Format,client:::1 (20-Aug-2020 11:27:00.240) (total time: 1990ms):
2020-08-20T11:27:02.230308195Z Trace[1183161086]: ---"Conversion done" 165ms (11:27:00.412)
2020-08-20T11:27:02.230308195Z Trace[1183161086]: [1.990242374s] [1.990242374s] END
```
It's possible that the CAO is hitting the KAS while it is unable to respond, but I am not sure what to make of the above. Note that 2097152 bytes is exactly 2 MiB, so the roughly 2.9 MB PodNetworkConnectivityCheck status update in these logs is being rejected by a gRPC recipient for exceeding a 2 MiB message-size limit.
IPI-installed a 4.5.7 cluster on Azure with FIPS on and etcd encryption enabled, then upgraded to a 4.6 nightly build successfully.
```
$ oc get infrastructures.config.openshift.io cluster -o yaml
...
  platformSpec:
    type: Azure
...
```
```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-08-31-194600   True        False         10s     Cluster version is 4.6.0-0.nightly-2020-08-31-194600
```
```
$ oc get clusterversion -o json | jq ".items[0].status.history"
[
  {
    "completionTime": "2020-09-01T05:13:18Z",
    "image": "registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-08-31-194600",
    "startedTime": "2020-09-01T04:11:57Z",
    "state": "Completed",
    "verified": false,
    "version": "4.6.0-0.nightly-2020-08-31-194600"
  },
  {
    "completionTime": "2020-09-01T03:46:27Z",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:776b7e8158edf64c82f18f5ec4d6ef378ac3de81ba0dc2700b885ceb62e71279",
    "startedTime": "2020-09-01T03:15:16Z",
    "state": "Completed",
    "verified": false,
    "version": "4.5.7"
  }
]
```
```
$ oc get nodes
NAME                                           STATUS   ROLES    AGE    VERSION
kewang0151-h4szj-master-0                      Ready    master   122m   v1.19.0-rc.2+f71a7ab-dirty
kewang0151-h4szj-master-1                      Ready    master   122m   v1.19.0-rc.2+f71a7ab-dirty
kewang0151-h4szj-master-2                      Ready    master   122m   v1.19.0-rc.2+f71a7ab-dirty
kewang0151-h4szj-worker-northcentralus-9fkpz   Ready    worker   106m   v1.19.0-rc.2+f71a7ab-dirty
kewang0151-h4szj-worker-northcentralus-t6kpr   Ready    worker   106m   v1.19.0-rc.2+f71a7ab-dirty
kewang0151-h4szj-worker-northcentralus-thnhm   Ready    worker   106m   v1.19.0-rc.2+f71a7ab-dirty
```
```
$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-08-31-194600   True        False         False      3m5s
cloud-credential                           4.6.0-0.nightly-2020-08-31-194600   True        False         False      126m
cluster-autoscaler                         4.6.0-0.nightly-2020-08-31-194600   True        False         False      114m
config-operator                            4.6.0-0.nightly-2020-08-31-194600   True        False         False      115m
console                                    4.6.0-0.nightly-2020-08-31-194600   True        False         False      13m
csi-snapshot-controller                    4.6.0-0.nightly-2020-08-31-194600   True        False         False      17m
dns                                        4.6.0-0.nightly-2020-08-31-194600   True        False         False      30m
etcd                                       4.6.0-0.nightly-2020-08-31-194600   True        False         False      119m
image-registry                             4.6.0-0.nightly-2020-08-31-194600   True        False         False      104m
ingress                                    4.6.0-0.nightly-2020-08-31-194600   True        False         False      47m
insights                                   4.6.0-0.nightly-2020-08-31-194600   True        False         False      115m
kube-apiserver                             4.6.0-0.nightly-2020-08-31-194600   True        False         False      119m
kube-controller-manager                    4.6.0-0.nightly-2020-08-31-194600   True        False         False      119m
kube-scheduler                             4.6.0-0.nightly-2020-08-31-194600   True        False         False      118m
kube-storage-version-migrator              4.6.0-0.nightly-2020-08-31-194600   True        False         False      17m
machine-api                                4.6.0-0.nightly-2020-08-31-194600   True        False         False      112m
machine-approver                           4.6.0-0.nightly-2020-08-31-194600   True        False         False      119m
machine-config                             4.6.0-0.nightly-2020-08-31-194600   True        False         False      10m
marketplace                                4.6.0-0.nightly-2020-08-31-194600   True        False         False      17m
monitoring                                 4.6.0-0.nightly-2020-08-31-194600   True        False         False      16m
network                                    4.6.0-0.nightly-2020-08-31-194600   True        False         False      121m
node-tuning                                4.6.0-0.nightly-2020-08-31-194600   True        False         False      46m
openshift-apiserver                        4.6.0-0.nightly-2020-08-31-194600   True        False         False      21m
openshift-controller-manager               4.6.0-0.nightly-2020-08-31-194600   True        False         False      45m
openshift-samples                          4.6.0-0.nightly-2020-08-31-194600   True        False         False      46m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-08-31-194600   True        False         False      120m
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-08-31-194600   True        False         False      120m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-08-31-194600   True        False         False      13m
service-ca                                 4.6.0-0.nightly-2020-08-31-194600   True        False         False      121m
storage                                    4.6.0-0.nightly-2020-08-31-194600   True        False         False      47m
```
Everything looks fine; moving the bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196
Description of problem:
In the post-upgrade environment, the co/authentication Available condition oscillates between True and False.

Version-Release number of selected component (if applicable):
4.5.6 nightly upgraded to 4.6.0-0.nightly-2020-08-18-165040 nightly

How reproducible:
This is the first time we hit this.

Steps to Reproduce:
1. Launch a 4.5.6 nightly env (env matrix: IPI Azure, FIPS on, etcd encryption on), then upgrade it to 4.6.0-0.nightly-2020-08-18-165040 nightly successfully. All pods, nodes, and operators are well, except co/authentication:
```
$ while true; do echo "`date '+%Y-%m-%dT%H:%M:%S%:z'` $(oc get co authentication --no-headers)"; sleep 2; done
...
2020-08-20T13:48:32+05:30 authentication 4.6.0-0.nightly-2020-08-18-165040 True False False 2m20s
2020-08-20T13:48:35+05:30 authentication 4.6.0-0.nightly-2020-08-18-165040 True False False 2m23s
2020-08-20T13:48:38+05:30 authentication 4.6.0-0.nightly-2020-08-18-165040 True False False 2m26s
2020-08-20T13:48:41+05:30 authentication 4.6.0-0.nightly-2020-08-18-165040 False False False 0s
2020-08-20T13:48:44+05:30 authentication 4.6.0-0.nightly-2020-08-18-165040 False False False 3s
2020-08-20T13:48:47+05:30 authentication 4.6.0-0.nightly-2020-08-18-165040 False False False 6s
```
From the CAO pod logs, found intermittent timeouts for connections to master-0 (10.0.0.6) and master-2 (10.0.0.8); did not find such timeouts to master-1 (10.0.0.7). The CAO pod runs on master-1.
```
$ oc logs authentication-operator-749c8b4b56-427q4 -n openshift-authentication-operator
...
I0820 08:39:42.266185 1 status_controller.go:172] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2020-08-20T04:36:41Z","message":"WellKnownReadyControllerDegraded: failed to GET well-known https://10.0.0.8:6443/.well-known/oauth-authorization-server: dial tcp 10.0.0.8:6443: i/o timeout","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2020-08-19T11:39:04Z","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2020-08-20T08:39:42Z","message":"WellKnownAvailable: The well-known endpoint is not yet available: failed to GET well-known https://10.0.0.8:6443/.well-known/oauth-authorization-server: dial tcp 10.0.0.8:6443: i/o timeout","reason":"WellKnown_NotReady","status":"False","type":"Available"},{"lastTransitionTime":"2020-08-19T02:22:28Z","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
I0820 08:39:42.297570 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"authentication-operator", UID:"fe6b486e-e00f-4d97-8f93-9165f5079010", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/authentication changed: Degraded message changed from "" to "WellKnownReadyControllerDegraded: failed to GET well-known https://10.0.0.8:6443/.well-known/oauth-authorization-server: dial tcp 10.0.0.8:6443: i/o timeout"
2020-08-20T08:39:44.758737686Z E0820 08:39:44.758657 1 base_controller.go:220] "Deployment" controller failed to sync "key", err: Operation cannot be fulfilled on authentications.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again
2020-08-20T08:39:44.759550279Z I0820 08:39:44.759495 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"authentication-operator", UID:"fe6b486e-e00f-4d97-8f93-9165f5079010", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'AuthenticationUpdateStatusFailed' Failed to update authentication operator status: Operation cannot be fulfilled on authentications.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again
```
Whenever the AuthenticationUpdateStatusFailed message above is seen, co/authentication is observed with its AGE starting again from 0s.
...

Actual results:
co/authentication oscillates between True and False.

Expected results:
All COs should be True False False.

Additional info: