Bug 1870565 - [4.6] co authentication status oscillates with error "WellKnownReadyControllerDegraded failed to GET well-known https://a.b.c.d:6443/.well-known/oauth-authorization-server: dial tcp a.b.c.d:6443: i/o timeout"
Summary: [4.6] co authentication status oscillates with error "WellKnownReadyControlle...
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.6
Hardware: All
OS: All
high
high
Target Milestone: ---
: 4.6.0
Assignee: Stefan Schimanski
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-08-20 11:35 UTC by pmali
Modified: 2020-09-01 05:27 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Github openshift cluster-authentication-operator pull 328 None closed Bug 1870565: deployment: avoid degraded due to conflict error 2020-09-21 11:24:47 UTC

Description pmali 2020-08-20 11:35:34 UTC
Description of problem:

In post-upgrade environment co authentication Available condition oscillates between True and False.

Version-Release number of selected component (if applicable):
4.5.6 nightly to 4.6.0-0.nightly-2020-08-18-165040 nightly

How reproducible:
This is the first time we hit this.

Steps to Reproduce:

1.Launched a 4.5.6 nightly env. The env matrix is IPI_Azure FIPS, Etcd Encryption on, Then upgrade it 4.6.0-0.nightly-2020-08-18-165040 nightly successfully. All pods and nodes and operators are well. Except co/authentication:

$ while true; do echo "`date '+%Y-%m-%dT%H:%M:%S%:z'` $(oc get co authentication --no-headers)"; sleep 2; done
...
2020-08-20T13:48:32+05:30 authentication   4.6.0-0.nightly-2020-08-18-165040   True   False   False   2m20s
2020-08-20T13:48:35+05:30 authentication   4.6.0-0.nightly-2020-08-18-165040   True   False   False   2m23s
2020-08-20T13:48:38+05:30 authentication   4.6.0-0.nightly-2020-08-18-165040   True   False   False   2m26s
2020-08-20T13:48:41+05:30 authentication   4.6.0-0.nightly-2020-08-18-165040   False   False   False   0s
2020-08-20T13:48:44+05:30 authentication   4.6.0-0.nightly-2020-08-18-165040   False   False   False   3s
2020-08-20T13:48:47+05:30 authentication   4.6.0-0.nightly-2020-08-18-165040   False   False   False   6s


From CAO pod logs, found intermittent timeout for connections to master-0 (10.0.0.6) and master-2 (10.0.0.8). Did not find such timeout to master-1 (10.0.0.7). The CAO pod is on master-1.
$ oc logs authentication-operator-749c8b4b56-427q4 -n openshift-authentication-operator
...
I0820 08:39:42.266185       1 status_controller.go:172] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2020-08-20T04:36:41Z","message":"WellKnownReadyControllerDegraded: failed to GET well-known https://10.0.0.8:6443/.well-known/oauth-authorization-server: dial tcp 10.0.0.8:6443: i/o timeout","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2020-08-19T11:39:04Z","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2020-08-20T08:39:42Z","message":"WellKnownAvailable: The well-known endpoint is not yet available: failed to GET well-known https://10.0.0.8:6443/.well-known/oauth-authorization-server: dial tcp 10.0.0.8:6443: i/o timeout","reason":"WellKnown_NotReady","status":"False","type":"Available"},{"lastTransitionTime":"2020-08-19T02:22:28Z","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
I0820 08:39:42.297570       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"authentication-operator", UID:"fe6b486e-e00f-4d97-8f93-9165f5079010", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/authentication changed: Degraded message changed from "" to "WellKnownReadyControllerDegraded: failed to GET well-known https://10.0.0.8:6443/.well-known/oauth-authorization-server: dial tcp 10.0.0.8:6443: i/o timeout"
2020-08-20T08:39:44.758737686Z E0820 08:39:44.758657       1 base_controller.go:220] "Deployment" controller failed to sync "key", err: Operation cannot be fulfilled on authentications.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again
2020-08-20T08:39:44.759550279Z I0820 08:39:44.759495       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"authentication-operator", UID:"fe6b486e-e00f-4d97-8f93-9165f5079010", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'AuthenticationUpdateStatusFailed' Failed to update authentication operator status: Operation cannot be fulfilled on authentications.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again <-- whenever this message is seen, co/authentication is observed with AGE starting again from 0s
...

Actual results:
1. CO authentication oscillate between True and False. 

Expected results:
2 All COs should be True False False

Additional info:

Comment 4 Standa Laznicka 2020-08-26 12:41:33 UTC
It seems that the KAS is having some kind of trouble with connectivity, the conncheck container shows
```
2020-08-20T11:27:02.626596944Z W0820 11:27:02.626549       1 connection_checker.go:123] Unable to update status of kube-apiserver-zhsunupgrade819-ctcx8-master-0-to-load-balancer-api-external: rpc error: code = ResourceExhausted desc = trying to send message larger than max (2986286 vs. 2097152)
```
quite regularly, as well as
```
2020-08-20T11:27:03.279927499Z I0820 11:27:03.279898       1 connection_checker.go:216] Failure | TCPConnectError | 9.996883073s | Failed to establish a TCP connection to api-int.zhsunupgrade819.qe.azure.devcluster.openshift.com:6443: dial tcp 10.0.0.4:6443: i/o timeout
```

which appears to match the following error from the KAS container:
```
2020-08-20T11:27:02.230108683Z E0820 11:27:02.230029      17 status.go:71] apiserver received an error that is not an metav1.Status: &status.statusError{state:impl.MessageState{NoUnkeyedLiterals:pragma.NoUnkeyedLiterals{}, DoNotCompare:pragma.DoNotCompare{}, DoNotCopy:pragma.DoNotCopy{}, atomicMessageInfo:(*impl.MessageInfo)(nil)}, sizeCache:0, unknownFields:[]uint8(nil), Code:8, Message:"trying to send message larger than max (2980240 vs. 2097152)", Details:[]*anypb.Any(nil)}
2020-08-20T11:27:02.230308195Z I0820 11:27:02.230264      17 trace.go:205] Trace[1183161086]: "Update" url:/apis/controlplane.operator.openshift.io/v1alpha1/namespaces/openshift-kube-apiserver/podnetworkconnectivitychecks/kube-apiserver-zhsunupgrade819-ctcx8-master-0-to-load-balancer-api-internal/status,user-agent:cluster-kube-apiserver-operator/v0.0.0 (linux/amd64) kubernetes/$Format,client:::1 (20-Aug-2020 11:27:00.240) (total time: 1990ms):
2020-08-20T11:27:02.230308195Z Trace[1183161086]: ---"Conversion done" 165ms (11:27:00.412)
2020-08-20T11:27:02.230308195Z Trace[1183161086]: [1.990242374s] [1.990242374s] END
```

It's possible that the CAO is hitting the KAS when it's incapable of a response, yet I am not sure what to make out of the above.

Comment 8 Ke Wang 2020-09-01 05:27:02 UTC
IPI installed one 4.5.7 cluster on azure with fips on and encryption etcd, upgraded to 4.6 nightly build successfully.

$ oc get infrastructures.config.openshift.io cluster -o yaml
...
  platformSpec:
    type: Azure
...

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-08-31-194600   True        False         10s     Cluster version is 4.6.0-0.nightly-2020-08-31-194600

$ oc get clusterversion -o json|jq ".items[0].status.history"
[
  {
    "completionTime": "2020-09-01T05:13:18Z",
    "image": "registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-08-31-194600",
    "startedTime": "2020-09-01T04:11:57Z",
    "state": "Completed",
    "verified": false,
    "version": "4.6.0-0.nightly-2020-08-31-194600"
  },
  {
    "completionTime": "2020-09-01T03:46:27Z",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:776b7e8158edf64c82f18f5ec4d6ef378ac3de81ba0dc2700b885ceb62e71279",
    "startedTime": "2020-09-01T03:15:16Z",
    "state": "Completed",
    "verified": false,
    "version": "4.5.7"
  }
]

$ oc get nodes
NAME                                           STATUS   ROLES    AGE    VERSION
kewang0151-h4szj-master-0                      Ready    master   122m   v1.19.0-rc.2+f71a7ab-dirty
kewang0151-h4szj-master-1                      Ready    master   122m   v1.19.0-rc.2+f71a7ab-dirty
kewang0151-h4szj-master-2                      Ready    master   122m   v1.19.0-rc.2+f71a7ab-dirty
kewang0151-h4szj-worker-northcentralus-9fkpz   Ready    worker   106m   v1.19.0-rc.2+f71a7ab-dirty
kewang0151-h4szj-worker-northcentralus-t6kpr   Ready    worker   106m   v1.19.0-rc.2+f71a7ab-dirty
kewang0151-h4szj-worker-northcentralus-thnhm   Ready    worker   106m   v1.19.0-rc.2+f71a7ab-dirty

$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-08-31-194600   True        False         False      3m5s
cloud-credential                           4.6.0-0.nightly-2020-08-31-194600   True        False         False      126m
cluster-autoscaler                         4.6.0-0.nightly-2020-08-31-194600   True        False         False      114m
config-operator                            4.6.0-0.nightly-2020-08-31-194600   True        False         False      115m
console                                    4.6.0-0.nightly-2020-08-31-194600   True        False         False      13m
csi-snapshot-controller                    4.6.0-0.nightly-2020-08-31-194600   True        False         False      17m
dns                                        4.6.0-0.nightly-2020-08-31-194600   True        False         False      30m
etcd                                       4.6.0-0.nightly-2020-08-31-194600   True        False         False      119m
image-registry                             4.6.0-0.nightly-2020-08-31-194600   True        False         False      104m
ingress                                    4.6.0-0.nightly-2020-08-31-194600   True        False         False      47m
insights                                   4.6.0-0.nightly-2020-08-31-194600   True        False         False      115m
kube-apiserver                             4.6.0-0.nightly-2020-08-31-194600   True        False         False      119m
kube-controller-manager                    4.6.0-0.nightly-2020-08-31-194600   True        False         False      119m
kube-scheduler                             4.6.0-0.nightly-2020-08-31-194600   True        False         False      118m
kube-storage-version-migrator              4.6.0-0.nightly-2020-08-31-194600   True        False         False      17m
machine-api                                4.6.0-0.nightly-2020-08-31-194600   True        False         False      112m
machine-approver                           4.6.0-0.nightly-2020-08-31-194600   True        False         False      119m
machine-config                             4.6.0-0.nightly-2020-08-31-194600   True        False         False      10m
marketplace                                4.6.0-0.nightly-2020-08-31-194600   True        False         False      17m
monitoring                                 4.6.0-0.nightly-2020-08-31-194600   True        False         False      16m
network                                    4.6.0-0.nightly-2020-08-31-194600   True        False         False      121m
node-tuning                                4.6.0-0.nightly-2020-08-31-194600   True        False         False      46m
openshift-apiserver                        4.6.0-0.nightly-2020-08-31-194600   True        False         False      21m
openshift-controller-manager               4.6.0-0.nightly-2020-08-31-194600   True        False         False      45m
openshift-samples                          4.6.0-0.nightly-2020-08-31-194600   True        False         False      46m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-08-31-194600   True        False         False      120m
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-08-31-194600   True        False         False      120m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-08-31-194600   True        False         False      13m
service-ca                                 4.6.0-0.nightly-2020-08-31-194600   True        False         False      121m
storage                                    4.6.0-0.nightly-2020-08-31-194600   True        False         False      47m

All be fine, move the bug Verified.


Note You need to log in before you can comment on or make changes to this bug.