Bug 1870565

Summary: [4.6] co authentication status oscillates with error "WellKnownReadyControllerDegraded failed to GET well-known https://a.b.c.d:6443/.well-known/oauth-authorization-server: dial tcp a.b.c.d:6443: i/o timeout"
Product: OpenShift Container Platform
Component: kube-apiserver
Version: 4.6
Target Release: 4.6.0
Hardware: All
OS: All
Severity: high
Priority: high
Status: CLOSED ERRATA
Reporter: pmali
Assignee: Stefan Schimanski <sttts>
QA Contact: Ke Wang <kewang>
CC: aos-bugs, mfojtik, sttts, wsun, xxia
Type: Bug
Last Closed: 2020-10-27 16:29:49 UTC

Description pmali 2020-08-20 11:35:34 UTC
Description of problem:

In a post-upgrade environment, the co/authentication Available condition oscillates between True and False.

Version-Release number of selected component (if applicable):
4.5.6 nightly to 4.6.0-0.nightly-2020-08-18-165040 nightly

How reproducible:
This is the first time we hit this.

Steps to Reproduce:

1. Launch a 4.5.6 nightly env. The env matrix is IPI on Azure with FIPS and etcd encryption on. Then upgrade it to 4.6.0-0.nightly-2020-08-18-165040 successfully. All pods, nodes, and operators are fine, except co/authentication:

$ while true; do echo "`date '+%Y-%m-%dT%H:%M:%S%:z'` $(oc get co authentication --no-headers)"; sleep 2; done
...
2020-08-20T13:48:32+05:30 authentication   4.6.0-0.nightly-2020-08-18-165040   True   False   False   2m20s
2020-08-20T13:48:35+05:30 authentication   4.6.0-0.nightly-2020-08-18-165040   True   False   False   2m23s
2020-08-20T13:48:38+05:30 authentication   4.6.0-0.nightly-2020-08-18-165040   True   False   False   2m26s
2020-08-20T13:48:41+05:30 authentication   4.6.0-0.nightly-2020-08-18-165040   False   False   False   0s
2020-08-20T13:48:44+05:30 authentication   4.6.0-0.nightly-2020-08-18-165040   False   False   False   3s
2020-08-20T13:48:47+05:30 authentication   4.6.0-0.nightly-2020-08-18-165040   False   False   False   6s
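
For context, the Degraded/Available messages quoted in the summary come from the operator's check of each kube-apiserver's OAuth metadata endpoint. A minimal sketch of that kind of check, assuming a plain HTTP GET with a client-side timeout (the master IPs are taken from the logs below; the timeout value and the skipped cert verification are illustrative, not the operator's actual configuration):

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{
		// A dial that exceeds this budget surfaces as "dial tcp ...: i/o timeout".
		Timeout: 10 * time.Second,
		Transport: &http.Transport{
			// The real operator validates the serving cert; skipped here for brevity.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	for _, ip := range []string{"10.0.0.6", "10.0.0.7", "10.0.0.8"} {
		url := fmt.Sprintf("https://%s:6443/.well-known/oauth-authorization-server", ip)
		resp, err := client.Get(url)
		if err != nil {
			fmt.Printf("failed to GET well-known %s: %v\n", url, err)
			continue
		}
		resp.Body.Close()
		fmt.Printf("GET %s: %s\n", url, resp.Status)
	}
}
```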


From the CAO pod logs, intermittent timeouts were found for connections to master-0 (10.0.0.6) and master-2 (10.0.0.8), but not to master-1 (10.0.0.7). The CAO pod runs on master-1.
$ oc logs authentication-operator-749c8b4b56-427q4 -n openshift-authentication-operator
...
I0820 08:39:42.266185       1 status_controller.go:172] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2020-08-20T04:36:41Z","message":"WellKnownReadyControllerDegraded: failed to GET well-known https://10.0.0.8:6443/.well-known/oauth-authorization-server: dial tcp 10.0.0.8:6443: i/o timeout","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2020-08-19T11:39:04Z","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2020-08-20T08:39:42Z","message":"WellKnownAvailable: The well-known endpoint is not yet available: failed to GET well-known https://10.0.0.8:6443/.well-known/oauth-authorization-server: dial tcp 10.0.0.8:6443: i/o timeout","reason":"WellKnown_NotReady","status":"False","type":"Available"},{"lastTransitionTime":"2020-08-19T02:22:28Z","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
I0820 08:39:42.297570       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"authentication-operator", UID:"fe6b486e-e00f-4d97-8f93-9165f5079010", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/authentication changed: Degraded message changed from "" to "WellKnownReadyControllerDegraded: failed to GET well-known https://10.0.0.8:6443/.well-known/oauth-authorization-server: dial tcp 10.0.0.8:6443: i/o timeout"
2020-08-20T08:39:44.758737686Z E0820 08:39:44.758657       1 base_controller.go:220] "Deployment" controller failed to sync "key", err: Operation cannot be fulfilled on authentications.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again
2020-08-20T08:39:44.759550279Z I0820 08:39:44.759495       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"authentication-operator", UID:"fe6b486e-e00f-4d97-8f93-9165f5079010", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'AuthenticationUpdateStatusFailed' Failed to update authentication operator status: Operation cannot be fulfilled on authentications.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again <-- whenever this message is seen, co/authentication is observed with AGE starting again from 0s
...
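
The "Operation cannot be fulfilled ... the object has been modified" error above is the API server's standard optimistic-concurrency conflict: the controller tried to write status using a stale resourceVersion. Controllers normally re-read and retry such writes; a minimal sketch using client-go's retry helper (the Deployment and the mutation shown are placeholders for illustration, not the operator's actual code):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/retry"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// RetryOnConflict re-runs the closure whenever the write fails with a
	// 409 conflict ("the object has been modified").
	err = retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// GET fresh before mutating, so resourceVersion is current.
		dep, err := client.AppsV1().Deployments("openshift-authentication-operator").
			Get(context.TODO(), "authentication-operator", metav1.GetOptions{})
		if err != nil {
			return err
		}
		if dep.Annotations == nil {
			dep.Annotations = map[string]string{}
		}
		dep.Annotations["example.com/touched"] = "true" // placeholder mutation
		_, err = client.AppsV1().Deployments(dep.Namespace).
			Update(context.TODO(), dep, metav1.UpdateOptions{})
		return err
	})
	if err != nil {
		fmt.Println("update failed after retries:", err)
	}
}
```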

Actual results:
1. co/authentication oscillates between True and False.

Expected results:
All COs should be True False False (Available/Progressing/Degraded).

Additional info:

Comment 4 Standa Laznicka 2020-08-26 12:41:33 UTC
It seems that the KAS is having some kind of connectivity trouble; the conncheck container shows
```
2020-08-20T11:27:02.626596944Z W0820 11:27:02.626549       1 connection_checker.go:123] Unable to update status of kube-apiserver-zhsunupgrade819-ctcx8-master-0-to-load-balancer-api-external: rpc error: code = ResourceExhausted desc = trying to send message larger than max (2986286 vs. 2097152)
```
quite regularly, as well as
```
2020-08-20T11:27:03.279927499Z I0820 11:27:03.279898       1 connection_checker.go:216] Failure | TCPConnectError | 9.996883073s | Failed to establish a TCP connection to api-int.zhsunupgrade819.qe.azure.devcluster.openshift.com:6443: dial tcp 10.0.0.4:6443: i/o timeout
```

which appears to match the following error from the KAS container:
```
2020-08-20T11:27:02.230108683Z E0820 11:27:02.230029      17 status.go:71] apiserver received an error that is not an metav1.Status: &status.statusError{state:impl.MessageState{NoUnkeyedLiterals:pragma.NoUnkeyedLiterals{}, DoNotCompare:pragma.DoNotCompare{}, DoNotCopy:pragma.DoNotCopy{}, atomicMessageInfo:(*impl.MessageInfo)(nil)}, sizeCache:0, unknownFields:[]uint8(nil), Code:8, Message:"trying to send message larger than max (2980240 vs. 2097152)", Details:[]*anypb.Any(nil)}
2020-08-20T11:27:02.230308195Z I0820 11:27:02.230264      17 trace.go:205] Trace[1183161086]: "Update" url:/apis/controlplane.operator.openshift.io/v1alpha1/namespaces/openshift-kube-apiserver/podnetworkconnectivitychecks/kube-apiserver-zhsunupgrade819-ctcx8-master-0-to-load-balancer-api-internal/status,user-agent:cluster-kube-apiserver-operator/v0.0.0 (linux/amd64) kubernetes/$Format,client:::1 (20-Aug-2020 11:27:00.240) (total time: 1990ms):
2020-08-20T11:27:02.230308195Z Trace[1183161086]: ---"Conversion done" 165ms (11:27:00.412)
2020-08-20T11:27:02.230308195Z Trace[1183161086]: [1.990242374s] [1.990242374s] END
```

It's possible that the CAO is hitting the KAS when it's incapable of responding, but I am not sure what to make of the above.
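
For reference, the "trying to send message larger than max (... vs. 2097152)" text is grpc-go's ResourceExhausted error, raised when a message exceeds a configured size cap (2097152 bytes = 2 MiB here; where exactly that cap is configured in the product is not shown in this bug). A minimal sketch of how such caps are set on a grpc-go client (the target and the insecure credentials are placeholders for illustration):

```go
package main

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// 2 MiB, matching the "vs. 2097152" half of the error in the logs.
	const maxMsgSize = 2 * 1024 * 1024

	// A client dialed with these options rejects oversized messages with
	// codes.ResourceExhausted: "trying to send message larger than max".
	conn, err := grpc.Dial("example.invalid:6443", // placeholder target
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultCallOptions(
			grpc.MaxCallSendMsgSize(maxMsgSize),
			grpc.MaxCallRecvMsgSize(maxMsgSize),
		),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
}
```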

Comment 8 Ke Wang 2020-09-01 05:27:02 UTC
IPI-installed a 4.5.7 cluster on Azure with FIPS on and etcd encryption, then upgraded to a 4.6 nightly build successfully.

$ oc get infrastructures.config.openshift.io cluster -o yaml
...
  platformSpec:
    type: Azure
...

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-08-31-194600   True        False         10s     Cluster version is 4.6.0-0.nightly-2020-08-31-194600

$ oc get clusterversion -o json|jq ".items[0].status.history"
[
  {
    "completionTime": "2020-09-01T05:13:18Z",
    "image": "registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-08-31-194600",
    "startedTime": "2020-09-01T04:11:57Z",
    "state": "Completed",
    "verified": false,
    "version": "4.6.0-0.nightly-2020-08-31-194600"
  },
  {
    "completionTime": "2020-09-01T03:46:27Z",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:776b7e8158edf64c82f18f5ec4d6ef378ac3de81ba0dc2700b885ceb62e71279",
    "startedTime": "2020-09-01T03:15:16Z",
    "state": "Completed",
    "verified": false,
    "version": "4.5.7"
  }
]

$ oc get nodes
NAME                                           STATUS   ROLES    AGE    VERSION
kewang0151-h4szj-master-0                      Ready    master   122m   v1.19.0-rc.2+f71a7ab-dirty
kewang0151-h4szj-master-1                      Ready    master   122m   v1.19.0-rc.2+f71a7ab-dirty
kewang0151-h4szj-master-2                      Ready    master   122m   v1.19.0-rc.2+f71a7ab-dirty
kewang0151-h4szj-worker-northcentralus-9fkpz   Ready    worker   106m   v1.19.0-rc.2+f71a7ab-dirty
kewang0151-h4szj-worker-northcentralus-t6kpr   Ready    worker   106m   v1.19.0-rc.2+f71a7ab-dirty
kewang0151-h4szj-worker-northcentralus-thnhm   Ready    worker   106m   v1.19.0-rc.2+f71a7ab-dirty

$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-08-31-194600   True        False         False      3m5s
cloud-credential                           4.6.0-0.nightly-2020-08-31-194600   True        False         False      126m
cluster-autoscaler                         4.6.0-0.nightly-2020-08-31-194600   True        False         False      114m
config-operator                            4.6.0-0.nightly-2020-08-31-194600   True        False         False      115m
console                                    4.6.0-0.nightly-2020-08-31-194600   True        False         False      13m
csi-snapshot-controller                    4.6.0-0.nightly-2020-08-31-194600   True        False         False      17m
dns                                        4.6.0-0.nightly-2020-08-31-194600   True        False         False      30m
etcd                                       4.6.0-0.nightly-2020-08-31-194600   True        False         False      119m
image-registry                             4.6.0-0.nightly-2020-08-31-194600   True        False         False      104m
ingress                                    4.6.0-0.nightly-2020-08-31-194600   True        False         False      47m
insights                                   4.6.0-0.nightly-2020-08-31-194600   True        False         False      115m
kube-apiserver                             4.6.0-0.nightly-2020-08-31-194600   True        False         False      119m
kube-controller-manager                    4.6.0-0.nightly-2020-08-31-194600   True        False         False      119m
kube-scheduler                             4.6.0-0.nightly-2020-08-31-194600   True        False         False      118m
kube-storage-version-migrator              4.6.0-0.nightly-2020-08-31-194600   True        False         False      17m
machine-api                                4.6.0-0.nightly-2020-08-31-194600   True        False         False      112m
machine-approver                           4.6.0-0.nightly-2020-08-31-194600   True        False         False      119m
machine-config                             4.6.0-0.nightly-2020-08-31-194600   True        False         False      10m
marketplace                                4.6.0-0.nightly-2020-08-31-194600   True        False         False      17m
monitoring                                 4.6.0-0.nightly-2020-08-31-194600   True        False         False      16m
network                                    4.6.0-0.nightly-2020-08-31-194600   True        False         False      121m
node-tuning                                4.6.0-0.nightly-2020-08-31-194600   True        False         False      46m
openshift-apiserver                        4.6.0-0.nightly-2020-08-31-194600   True        False         False      21m
openshift-controller-manager               4.6.0-0.nightly-2020-08-31-194600   True        False         False      45m
openshift-samples                          4.6.0-0.nightly-2020-08-31-194600   True        False         False      46m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-08-31-194600   True        False         False      120m
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-08-31-194600   True        False         False      120m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-08-31-194600   True        False         False      13m
service-ca                                 4.6.0-0.nightly-2020-08-31-194600   True        False         False      121m
storage                                    4.6.0-0.nightly-2020-08-31-194600   True        False         False      47m

All looks fine; moving the bug to Verified.

Comment 10 errata-xmlrpc 2020-10-27 16:29:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196