Bug 1842431 - Unresponsive OpenShift 4.4 cluster on Azure (UPI) [NEEDINFO]
Summary: Unresponsive OpenShift 4.4 cluster on Azure (UPI)
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.6.0
Assignee: Suresh Kolichala
QA Contact: ge liu
Depends On:
Blocks: 1847769
Reported: 2020-06-01 09:04 UTC by Seasonny Huang
Modified: 2020-12-21 20:51 UTC (History)
8 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: Enhanced the cluster-backup and cluster-restore scripts to provide better feedback to the user. Reason: The original problem reported was due to running the recovery on all 3 nodes instead of a single node. Result: The user has better information to understand why the restore failed.
Clone Of:
: 1848516
Last Closed: 2020-10-27 16:02:58 UTC
Target Upstream Version:
skolicha: needinfo-
geliu: needinfo? (sbatsche)

Attachments (Terms of Use)

System ID Private Priority Status Summary Last Updated
Github openshift cluster-etcd-operator pull 371 0 None closed Bug 1842431: cluster-backup.sh: error if script dependencies are not met 2020-12-21 20:47:06 UTC
Github openshift openshift-docs pull 22807 0 None closed Bug 1842431: backup_and_restore: clarify backup process 2020-12-21 20:47:06 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:03:28 UTC

Description Seasonny Huang 2020-06-01 09:04:11 UTC
Description of problem:

Two weeks after installation, some alerts were fired, such as KubeAPILatencyHigh and etcdMemberCommunicationSlow.

After the event happened, without any manual operation, the alerts kept firing for a few hours; eventually we could not log in to the cluster anymore and all oc commands failed.

Version-Release number of selected component (if applicable):

How reproducible:
From our testing it seems to require a long-running cluster (about 2 weeks), after which some alerts are fired. We saw the following warnings in the cluster:
* KubeAPILatencyHigh
* etcdMemberCommunicationSlow
* time drift log in etcd

Steps to Reproduce:
1. Install OCP cluster
2. Wait for alerts to fire, e.g. etcdMemberCommunicationSlow, KubeAPILatencyHigh, etc.
3. Most actions can still be performed, but some intermittent issues occur, e.g. oc commands time out or fail, though a retry fixes them
4. In our testing, if one of the masters is shut down, the cluster behaves quite strangely, e.g. only about 20% of `oc whoami` invocations succeed and we can't log in to the cluster:
{"error":"server_error","error_description":"The authorization server encountered an unexpected condition that prevented it from fulfilling the request.","state":"888bfbf6"}

Actual results:

Investigating the logs, we found:
2020-05-15 14:48:47.640740 W | rafthttp: the clock difference against peer d7523d07ecd5d55e is too high [10.545065453s > 1s] (prober "ROUND_TRIPPER_RAFT_MESSAGE")
After the customer applied this KB article[2], the time drift events seemed to disappear, but we don't think this is the root cause: when we manually set a large time drift on one of the nodes, the control plane status kept flipping between Ready and Degraded on the web console, yet the cluster seemed fine and we could still deploy a pod to it.
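The skew etcd reports in warnings like the one above can be pulled out mechanically when triaging a large log dump. A minimal sketch, using the exact warning line from this report as sample input (the sed expression assumes that log format):

```shell
# Extract the reported skew from an etcd rafthttp clock-difference warning.
# The sample line is the one captured in this bug report.
line='2020-05-15 14:48:47.640740 W | rafthttp: the clock difference against peer d7523d07ecd5d55e is too high [10.545065453s > 1s] (prober "ROUND_TRIPPER_RAFT_MESSAGE")'
# Capture the number between "too high [" and "s".
skew=$(echo "$line" | sed -n 's/.*too high \[\([0-9.]*\)s.*/\1/p')
echo "peer clock skew: ${skew}s"
```

Anything consistently above 1s here points at NTP/chrony, not etcd itself.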
Continuing the investigation, we found:
2020-05-15T13:27:49.8684449Z 2020-05-15 13:27:49.868296 W | rafthttp: health check for peer b366cbcd316c915f could not connect: dial tcp i/o timeout (prober "ROUND_TRIPPER_RAFT_MESSAGE")
So we decided to test disk performance for etcd, using this approach[3].
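For reference, the fio invocation commonly used to approximate etcd's WAL pattern (a small sequential write followed by an fdatasync) looks roughly like the sketch below; the exact parameters are assumptions on my part, not necessarily what [3] prescribes:

```shell
# Sketch of an etcd-style disk latency test: sequential writes with an
# fdatasync after every write, mimicking etcd WAL behavior.
# Parameters (size, block size, paths) are illustrative assumptions.
FIO_CMD="fio --name=etcd-wal-test --directory=/tmp/etcd-disk-test --rw=write --ioengine=sync --fdatasync=1 --size=22m --bs=2300"
echo "$FIO_CMD"
mkdir -p /tmp/etcd-disk-test
# Only run if fio is actually installed on this node.
{ command -v fio >/dev/null 2>&1 && $FIO_CMD; } || true
```

The interesting output is the fdatasync latency percentiles, which is what the numbers below report.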

We also see some "leader changed" events:
$ cat crictl_logs_-t_32507a496f39226c9b50ba5db64e0207e14a47d38b4f6d4cf4b8b6257d250ace | sort | uniq | grep -c 'etcdserver: leader changed'
$ cat crictl_logs_-t_32507a496f39226c9b50ba5db64e0207e14a47d38b4f6d4cf4b8b6257d250ace | sort | uniq | grep -c 'etcdserver: request timed out'
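The counting pipeline above can be sketched against a small stand-in log (the sample lines below are invented placeholders for the crictl dump, which is why `sort | uniq` matters: identical duplicate lines from repeated log captures collapse before counting):

```shell
# Stand-in for the crictl log dump referenced above (timestamps invented).
cat > /tmp/sample.log <<'EOF'
2020-05-15T13:01:02Z W | etcdserver: leader changed
2020-05-15T13:05:10Z W | etcdserver: leader changed
2020-05-15T13:06:00Z W | etcdserver: request timed out
EOF
# Deduplicate, then count each event class.
sort /tmp/sample.log | uniq | grep -c 'etcdserver: leader changed'
sort /tmp/sample.log | uniq | grep -c 'etcdserver: request timed out'
```

Frequent leader changes alongside request timeouts usually indicate slow disk or network rather than an etcd bug.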

We use Premium SSD on the master nodes; the following are the fio latency percentiles for etcd performance:

time unit: usec
     | 99.00th=[ 7898], 99.50th=[ 9241], 99.90th=[13042], 99.95th=[18744],
     | 99.99th=[23987]
7898-23987 microseconds
7.898-23.987 milliseconds

     | 99.00th=[ 16057], 99.50th=[ 18220], 99.90th=[ 24511], 99.95th=[ 30278],
     | 99.99th=[291505]
16057-291505 microseconds
16.057-291.505 milliseconds  <===

     | 99.00th=[ 4621], 99.50th=[ 5735], 99.90th=[ 8717], 99.95th=[10421],
     | 99.99th=[12125]
4621-12125 microseconds
4.621-12.125 milliseconds
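As a sanity check on the worst number above: etcd guidance commonly cited is that WAL fsync latency should stay in the low single-digit milliseconds (the 10 ms threshold below is that rule of thumb, my assumption rather than something from [3]):

```shell
# Convert the worst 99.99th-percentile fsync latency observed above
# (291505 usec on the slow master) to ms and compare against a ~10 ms
# rule of thumb for etcd WAL fsync latency.
p_usec=291505
awk -v us="$p_usec" 'BEGIN {
  ms = us / 1000
  printf "99.99th = %.3f ms -> %s\n", ms, (ms > 10 ? "too slow for etcd WAL" : "ok")
}'
```

By that measure the middle master's disk is roughly 30x over budget at the tail, which is consistent with the leader changes and request timeouts seen in the logs.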

We tried to recover the cluster following this doc[1], but in step 4b we found that some files were missing on master-0 and master-1:

[root@bacchus-master-0 manifests]# ls

[root@bacchus-master-1 manifests]# ls

[root@bacchus-master-2 manifests]# ls
etcd-pod.yaml  kube-apiserver-pod.yaml  kube-controller-manager-pod.yaml  kube-scheduler-pod.yaml
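A quick way to spot this condition per node is to check for the four static pod manifests that step 4b of [1] expects. A minimal sketch (the helper name and hard-coded file list are my own, based on the healthy master-2 listing above):

```shell
# Report which expected static pod manifests are absent from a directory.
check_manifests() {
  dir="$1"
  for f in etcd-pod.yaml kube-apiserver-pod.yaml \
           kube-controller-manager-pod.yaml kube-scheduler-pod.yaml; do
    [ -f "$dir/$f" ] || echo "missing: $f"
  done
}
# On a master node this would be the static pod manifest directory.
check_manifests /etc/kubernetes/manifests
```

On master-0 and master-1 above this would print all four files as missing, while master-2 would print nothing.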

Expected results:
A healthy OCP/etcd cluster

Additional info:
In the same region (Southeast Asia) we occasionally saw rafthttp issues on OpenShift v3.11, but even when the system fired APIHighLatency alerts we never saw the cluster crash or files go missing.

[1] https://docs.openshift.com/container-platform/4.4/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html#dr-restoring-cluster-state

Comment 5 Sam Batschelet 2020-06-03 12:32:20 UTC
Moving to 4.6 to continue support and triage.

Comment 11 ge liu 2020-06-17 10:13:29 UTC
Hi Sam, I checked the latest 4.6 payload (4.6.0-0.nightly-2020-06-16-214732); the fix has not been merged yet, so I will try it tomorrow if a new build is ready.

Comment 18 ge liu 2020-06-24 03:09:34 UTC
Verified with 4.6.0-0.nightly-2020-06-23-002207

Comment 26 errata-xmlrpc 2020-10-27 16:02:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

