Description of problem:

Two weeks after installation, some alerts were fired, such as KubeAPILatencyHigh, etcdMemberCommunicationSlow, etc. The alerts started without any manual operation and kept firing; a few hours later we could no longer log in to the cluster and all oc commands failed.

Version-Release number of selected component (if applicable):
4.4.3

How reproducible:
Unconfirmed. From our testing it seems to require a long-running cluster (about 2 weeks), after which some alerts fire. We saw the following warnings in the cluster:
* KubeAPILatencyHigh
* etcdMemberCommunicationSlow
* time drift messages in the etcd logs

Steps to Reproduce:
1. Install an OCP cluster
2. Wait for alerts to fire, e.g. etcdMemberCommunicationSlow, KubeAPILatencyHigh, etc.
3. Most actions can still be performed, but intermittent issues appear, e.g. oc commands time out or fail; retrying works around them
4. In our testing, if one of the masters is shut down, the cluster behaves very strangely, e.g. only about 20% of `oc whoami` invocations succeed and we cannot log in to the cluster:
```
{"error":"server_error","error_description":"The authorization server encountered an unexpected condition that prevented it from fulfilling the request.","state":"888bfbf6"}
```

Actual results:

Investigating the logs, we found:

2020-05-15 14:48:47.640740 W | rafthttp: the clock difference against peer d7523d07ecd5d55e is too high [10.545065453s > 1s] (prober "ROUND_TRIPPER_RAFT_MESSAGE")

After the customer applied this KB article [2], the time drift events seemed to disappear, but we don't think this is the root cause: when we manually set a large clock drift on one of the nodes, the control plane status kept flipping between Ready and Degraded in the web console, yet the cluster itself seemed fine and we could still deploy pods to it.
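For reference, one quick way to cross-check this kind of clock drift on the control plane is to compare chrony's tracking output on each master. The loop below is only a hedged sketch: it assumes chrony is the time-sync daemon on the RHCOS masters (the default) and that `oc debug node/...` access is available in the cluster.

```
# Hedged sketch: compare clock sync status on each master node.
# Assumes chrony is the time source on RHCOS and that `oc debug node/...`
# is permitted in this cluster.
for node in $(oc get nodes -l node-role.kubernetes.io/master -o name); do
  echo "=== ${node} ==="
  # "System time" and "Last offset" show how far this node has drifted from
  # its NTP sources; anything near the 1s rafthttp threshold is a problem.
  oc debug "${node}" -- chroot /host chronyc tracking 2>/dev/null
done
```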
Continuing to investigate, we found:

2020-05-15T13:27:49.8684449Z 2020-05-15 13:27:49.868296 W | rafthttp: health check for peer b366cbcd316c915f could not connect: dial tcp 10.5.3.5:2380: i/o timeout (prober "ROUND_TRIPPER_RAFT_MESSAGE")

We also see a number of leader-changed and request-timed-out events in the etcd logs:

$ cat crictl_logs_-t_32507a496f39226c9b50ba5db64e0207e14a47d38b4f6d4cf4b8b6257d250ace | sort | uniq | grep -c 'etcdserver: leader changed'
22
$ cat crictl_logs_-t_32507a496f39226c9b50ba5db64e0207e14a47d38b4f6d4cf4b8b6257d250ace | sort | uniq | grep -c 'etcdserver: request timed out'
105

So we decided to test disk performance for etcd using this approach [3] (a hedged sketch of such a test is included at the end of this comment). The master nodes use Premium SSD disks; below are the fio latency percentiles for etcd disk performance (time unit: usec):

master-2:
| 99.00th=[ 7898], 99.50th=[ 9241], 99.90th=[13042], 99.95th=[18744], 99.99th=[23987]
  7898-23987 usec = 7.898-23.987 ms

master-1:
| 99.00th=[ 16057], 99.50th=[ 18220], 99.90th=[ 24511], 99.95th=[ 30278], 99.99th=[291505]
  16057-291505 usec = 16.057-291.505 ms  <===

master-0:
| 99.00th=[ 4621], 99.50th=[ 5735], 99.90th=[ 8717], 99.95th=[10421], 99.99th=[12125]
  4621-12125 usec = 4.621-12.125 ms

We tried to recover the cluster following this doc [1], but in step 4b we found that some static pod manifest files are missing on master-0 and master-1:
```
[root@bacchus-master-0 manifests]# ls
kube-apiserver-pod.yaml
[root@bacchus-master-1 manifests]# ls
<empty>
[root@bacchus-master-2 manifests]# ls
etcd-pod.yaml  kube-apiserver-pod.yaml  kube-controller-manager-pod.yaml  kube-scheduler-pod.yaml
```

Expected results:
A healthy OCP/etcd cluster

Additional info:
In the same region (Southeast Asia) we have seen rafthttp issues on OpenShift v3.11 (they occur seldom), but even when the APIHighLatency alert fired there we never saw the cluster crash or files go missing.

[1] https://docs.openshift.com/container-platform/4.4/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html#dr-restoring-cluster-state
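As a footnote to the disk-performance testing above, the test we ran looks roughly like the sketch below. This is an assumption-laden sketch, not the exact commands from the KB [3]: the test directory path is illustrative, and the parameters follow the commonly used fdatasync-based fio benchmark for etcd.

```
# Hedged sketch of an fdatasync-based fio test for etcd disk latency,
# run from a root shell on a master node. Not necessarily the exact
# parameters of the KB approach [3]; the path is illustrative.
# etcd is sensitive to fsync latency on its WAL, so the interesting number
# is the tail (99th percentile) of the sync/write latencies.
TEST_DIR=/var/lib/etcd/fio-test   # assumption: same filesystem as etcd's data dir
mkdir -p "${TEST_DIR}"
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory="${TEST_DIR}" --size=22m --bs=2300 --name=etcd-perf
rm -rf "${TEST_DIR}"              # clean up the test files afterwards
```

A commonly cited guideline is that etcd's 99th-percentile fsync latency should stay well under roughly 10 ms; the ~291 ms tail on master-1 above is far outside that range, which is consistent with the leader changes and request timeouts we observed.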
Moving to 4.6 to continue support and triage.
Hi Sam, I checked the latest 4.6 payload, 4.6.0-0.nightly-2020-06-16-214732; the fix has not been merged yet, so I will try again tomorrow if a new build is ready.
Verified with 4.6.0-0.nightly-2020-06-23-002207
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days