Bug 1842431 - Unresponsive OpenShift 4.4 cluster on Azure (UPI) [NEEDINFO]
Summary: Unresponsive OpenShift 4.4 cluster on Azure (UPI)
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.6.0
Assignee: Suresh Kolichala
QA Contact: ge liu
Depends On:
Blocks: 1847769
Reported: 2020-06-01 09:04 UTC by Seasonny Huang
Modified: 2020-12-21 20:51 UTC (History)
8 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: Enhanced the cluster-backup and cluster-restore scripts to provide better feedback to the user. Reason: The original problem reported was due to running the recovery on all 3 nodes instead of a single node. Result: The user has better information to understand why the restore failed.
Clone Of:
: 1848516
Last Closed: 2020-10-27 16:02:58 UTC
Target Upstream Version:
skolicha: needinfo-
geliu: needinfo? (sbatsche)

Attachments (Terms of Use)

System ID Private Priority Status Summary Last Updated
Github openshift cluster-etcd-operator pull 371 0 None closed Bug 1842431: cluster-backup.sh: error if script dependencies are not met 2020-12-21 20:47:06 UTC
Github openshift openshift-docs pull 22807 0 None closed Bug 1842431: backup_and_restore: clarify backup process 2020-12-21 20:47:06 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:03:28 UTC

Description Seasonny Huang 2020-06-01 09:04:11 UTC
Description of problem:

Two weeks after installation, some alerts were fired, such as KubeAPILatencyHigh and etcdMemberCommunicationSlow.

After the event happened, without any manual operation, the alerts kept firing for a few hours; eventually we could not log in to the cluster anymore and all oc commands failed.

Version-Release number of selected component (if applicable):

How reproducible:
From our testing it seems to require a long-running cluster (about 2 weeks), after which some alerts are fired. We saw the following warnings in the cluster:
* KubeAPILatencyHigh
* etcdMemberCommunicationSlow
* time drift log in etcd

Steps to Reproduce:
1. Install OCP cluster
2. Wait for alerts to fire, e.g. etcdMemberCommunicationSlow, KubeAPILatencyHigh, etc.
3. Most actions can still be performed, but some intermittent issues occur, e.g. oc commands time out or fail, though a retry fixes them
4. In our testing, if one of the masters is shut down, the cluster behaves quite strangely, e.g. only about 20% of `oc whoami` invocations succeed and we can't log in to the cluster:
{"error":"server_error","error_description":"The authorization server encountered an unexpected condition that prevented it from fulfilling the request.","state":"888bfbf6"}

Actual results:

Investigating the logs, we found:
2020-05-15 14:48:47.640740 W | rafthttp: the clock difference against peer d7523d07ecd5d55e is too high [10.545065453s > 1s] (prober "ROUND_TRIPPER_RAFT_MESSAGE")
After the customer applied this KB article[2], the time drift events seemed to disappear, but we don't think this is the root cause: when we manually set a large time drift on one of the nodes, the control plane status kept flipping between Ready and Degraded on the web console, yet the cluster seemed fine and we could still deploy a pod to it.
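The skew etcd reports in warnings like the one above can be pulled out mechanically when triaging a large log dump. A minimal sketch, using the exact warning line from this report as sample input (the sed expression assumes that log format):

```shell
# Extract the reported skew from an etcd rafthttp clock-difference warning.
# The sample line is the one captured in this bug report.
line='2020-05-15 14:48:47.640740 W | rafthttp: the clock difference against peer d7523d07ecd5d55e is too high [10.545065453s > 1s] (prober "ROUND_TRIPPER_RAFT_MESSAGE")'
# Capture the number between "too high [" and "s".
skew=$(echo "$line" | sed -n 's/.*too high \[\([0-9.]*\)s.*/\1/p')
echo "peer clock skew: ${skew}s"
```

Anything consistently above 1s here points at NTP/chrony, not etcd itself.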
Continuing the investigation, we found:
2020-05-15T13:27:49.8684449Z 2020-05-15 13:27:49.868296 W | rafthttp: health check for peer b366cbcd316c915f could not connect: dial tcp i/o timeout (prober "ROUND_TRIPPER_RAFT_MESSAGE")
So we decided to test disk performance for etcd, using this approach[3].
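For reference, the fio invocation commonly used to approximate etcd's WAL pattern (a small sequential write followed by an fdatasync) looks roughly like the sketch below; the exact parameters are assumptions on my part, not necessarily what [3] prescribes:

```shell
# Sketch of an etcd-style disk latency test: sequential writes with an
# fdatasync after every write, mimicking etcd WAL behavior.
# Parameters (size, block size, paths) are illustrative assumptions.
FIO_CMD="fio --name=etcd-wal-test --directory=/tmp/etcd-disk-test --rw=write --ioengine=sync --fdatasync=1 --size=22m --bs=2300"
echo "$FIO_CMD"
mkdir -p /tmp/etcd-disk-test
# Only run if fio is actually installed on this node.
{ command -v fio >/dev/null 2>&1 && $FIO_CMD; } || true
```

The interesting output is the fdatasync latency percentiles, which is what the numbers below report.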

We also see some "leader changed" events:
$ cat crictl_logs_-t_32507a496f39226c9b50ba5db64e0207e14a47d38b4f6d4cf4b8b6257d250ace | sort | uniq | grep -c 'etcdserver: leader changed'
$ cat crictl_logs_-t_32507a496f39226c9b50ba5db64e0207e14a47d38b4f6d4cf4b8b6257d250ace | sort | uniq | grep -c 'etcdserver: request timed out'
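The counting pipeline above can be sketched against a small stand-in log (the sample lines below are invented placeholders for the crictl dump, which is why `sort | uniq` matters: identical duplicate lines from repeated log captures collapse before counting):

```shell
# Stand-in for the crictl log dump referenced above (timestamps invented).
cat > /tmp/sample.log <<'EOF'
2020-05-15T13:01:02Z W | etcdserver: leader changed
2020-05-15T13:05:10Z W | etcdserver: leader changed
2020-05-15T13:06:00Z W | etcdserver: request timed out
EOF
# Deduplicate, then count each event class.
sort /tmp/sample.log | uniq | grep -c 'etcdserver: leader changed'
sort /tmp/sample.log | uniq | grep -c 'etcdserver: request timed out'
```

Frequent leader changes alongside request timeouts usually indicate slow disk or network rather than an etcd bug.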

We use Premium SSD on the master nodes; the following are the fio latency percentiles for etcd performance:

time unit: usec
     | 99.00th=[ 7898], 99.50th=[ 9241], 99.90th=[13042], 99.95th=[18744],
     | 99.99th=[23987]
7898-23987 microseconds
7.898-23.987 milliseconds

     | 99.00th=[ 16057], 99.50th=[ 18220], 99.90th=[ 24511], 99.95th=[ 30278],
     | 99.99th=[291505]
16057-291505 microseconds
16.057-291.505 milliseconds  <===

     | 99.00th=[ 4621], 99.50th=[ 5735], 99.90th=[ 8717], 99.95th=[10421],
     | 99.99th=[12125]
4621-12125 microseconds
4.621-12.125 milliseconds
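As a sanity check on the worst number above: etcd guidance commonly cited is that WAL fsync latency should stay in the low single-digit milliseconds (the 10 ms threshold below is that rule of thumb, my assumption rather than something from [3]):

```shell
# Convert the worst 99.99th-percentile fsync latency observed above
# (291505 usec on the slow master) to ms and compare against a ~10 ms
# rule of thumb for etcd WAL fsync latency.
p_usec=291505
awk -v us="$p_usec" 'BEGIN {
  ms = us / 1000
  printf "99.99th = %.3f ms -> %s\n", ms, (ms > 10 ? "too slow for etcd WAL" : "ok")
}'
```

By that measure the middle master's disk is roughly 30x over budget at the tail, which is consistent with the leader changes and request timeouts seen in the logs.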

We tried to recover the cluster following this doc[1], but in step 4b we found that some files were missing on master-0 and master-1:

[root@bacchus-master-0 manifests]# ls

[root@bacchus-master-1 manifests]# ls

[root@bacchus-master-2 manifests]# ls
etcd-pod.yaml  kube-apiserver-pod.yaml  kube-controller-manager-pod.yaml  kube-scheduler-pod.yaml
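A quick way to spot this condition per node is to check for the four static pod manifests that step 4b of [1] expects. A minimal sketch (the helper name and hard-coded file list are my own, based on the healthy master-2 listing above):

```shell
# Report which expected static pod manifests are absent from a directory.
check_manifests() {
  dir="$1"
  for f in etcd-pod.yaml kube-apiserver-pod.yaml \
           kube-controller-manager-pod.yaml kube-scheduler-pod.yaml; do
    [ -f "$dir/$f" ] || echo "missing: $f"
  done
}
# On a master node this would be the static pod manifest directory.
check_manifests /etc/kubernetes/manifests
```

On master-0 and master-1 above this would print all four files as missing, while master-2 would print nothing.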

Expected results:
A healthy OCP/etcd cluster

Additional info:
In the same region (Southeast Asia) we occasionally saw rafthttp issues on OpenShift v3.11, but even when the system fired APIHighLatency alerts we never saw the cluster crash or files go missing.

[1] https://docs.openshift.com/container-platform/4.4/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html#dr-restoring-cluster-state

Comment 5 Sam Batschelet 2020-06-03 12:32:20 UTC
Moving to 4.6 to continue support and triage.

Comment 11 ge liu 2020-06-17 10:13:29 UTC
Hi Sam, I checked the latest 4.6 payload (4.6.0-0.nightly-2020-06-16-214732); the fix has not been merged yet, so I will try it tomorrow if a new build is ready.

Comment 18 ge liu 2020-06-24 03:09:34 UTC
Verified with 4.6.0-0.nightly-2020-06-23-002207

Comment 26 errata-xmlrpc 2020-10-27 16:02:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

