1842431 – Unresponsive OpenShift 4.4 cluster on Azure (UPI)

Bug 1842431 - Unresponsive OpenShift 4.4 cluster on Azure (UPI)

Summary: Unresponsive OpenShift 4.4 cluster on Azure (UPI)

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Suresh Kolichala
QA Contact:	ge liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1847769
TreeView+	depends on / blocked

Reported:	2020-06-01 09:04 UTC by Seasonny Huang
Modified:	2023-12-15 18:02 UTC (History)
CC List:	8 users (show)
Fixed In Version:
Doc Type:	Enhancement
Doc Text:	Feature: The original problem reported was due to running the recovery on all 3 nodes, instead of a single node. Reason: Enhanced the cluster-backup and cluster-restore scripts to provide better feedback to the user. Result: The user has better information to understand why the restore failed.
Clone Of:
Clones:	1848516 (view as bug list)
Environment:
Last Closed:	2020-10-27 16:02:58 UTC
Target Upstream Version:
Embargoed:
Flags:	skolicha: needinfo-

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-etcd-operator pull 371	None	closed	Bug 1842431: cluster-backup.sh: error if script dependencies are not met	2020-12-21 20:47:06 UTC
Github	openshift openshift-docs pull 22807	None	closed	Bug 1842431: backup_and_restore: clarify backup process	2020-12-21 20:47:06 UTC
Red Hat Product Errata	RHBA-2020:4196	None	None	None	2020-10-27 16:03:28 UTC

Description Seasonny Huang 2020-06-01 09:04:11 UTC

Description of problem:

2 weeks after installation some alerts was fired, such as KubeAPILatencyHigh, etcdMemberCommunicationSlow...etc

After the event happened without any manual operation, few hours past, the alerts keep firing, finally, we can't login the cluster anymore, all oc commands failed


Version-Release number of selected component (if applicable):
4.4.3

How reproducible:
Unconfirmed, 
From out testing seems like need a long-running cluster(2 weeks), it will be fired some alerts, we saw the following warning in the cluster
* KubeAPILatencyHigh
* etcdMemberCommunicationSlow
* time drift log in etcd

Steps to Reproduce:
1. Install OCP cluster
2. Waiting for the alert fired, eg: etcdMemberCommunicationSlow, KubeAPILatencyHigh...etc
3. Still can perform most of the actions but some intermittent issue happened, eg: oc command time out/failed, but retry can fix
4. In our testing, if shut down one of the masters, the cluster will quite weird, eg: only about 20% `oc whoami` executed successfully, can't log in the cluster
```
{"error":"server_error","error_description":"The authorization server encountered an unexpected condition that prevented it from fulfilling the request.","state":"888bfbf6"}
```

Actual results:

Investigating the log, we found
2020-05-15 14:48:47.640740 W | rafthttp: the clock difference against peer d7523d07ecd5d55e is too high [10.545065453s > 1s] (prober "ROUND_TRIPPER_RAFT_MESSAGE")
After the cu apply this kb[2] time drift event seems to disappear, but we don't think this the root cause since we had manually set a high drift time to one of the nodes, we see the control plane status keep changing between Ready and Degraded on the web console, but the cluster seems fine, we can deploy a pod to the cluster
 
Continue Investigating, we found...
2020-05-15T13:27:49.8684449Z 2020-05-15 13:27:49.868296 W | rafthttp: health check for peer b366cbcd316c915f could not connect: dial tcp 10.5.3.5:2380: i/o timeout (prober "ROUND_TRIPPER_RAFT_MESSAGE")
So that we decide to test disk performance for etcd, we using this approach[3]

Also see some leader changed events
$ cat crictl_logs_-t_32507a496f39226c9b50ba5db64e0207e14a47d38b4f6d4cf4b8b6257d250ace | sort | uniq | grep -c 'etcdserver: leader changed'
22
$ cat crictl_logs_-t_32507a496f39226c9b50ba5db64e0207e14a47d38b4f6d4cf4b8b6257d250ace | sort | uniq | grep -c 'etcdserver: request timed out'
105

Using Premium SSD on the master nodes, the following below is fio testing for etcd performance

--
time unit: usec
master-2:
     | 99.00th=[ 7898], 99.50th=[ 9241], 99.90th=[13042], 99.95th=[18744],
     | 99.99th=[23987]
7898-23987 microseconds
7.898-23.987 milliseconds

master-1:
     | 99.00th=[ 16057], 99.50th=[ 18220], 99.90th=[ 24511], 99.95th=[ 30278],
     | 99.99th=[291505]
16057-291505 microseconds
16.057-291.505 milliseconds  <===

master-0:
     | 99.00th=[ 4621], 99.50th=[ 5735], 99.90th=[ 8717], 99.95th=[10421],
     | 99.99th=[12125]
4621-12125 microseconds
4.621-12.125 milliseconds
--

Try to recover the cluster from this doc[1], but in the step 4b found some files missing on master0, master1

```
[root@bacchus-master-0 manifests]# ls
kube-apiserver-pod.yaml

[root@bacchus-master-1 manifests]# ls
<empty>

[root@bacchus-master-2 manifests]# ls
etcd-pod.yaml  kube-apiserver-pod.yaml  kube-controller-manager-pod.yaml  kube-scheduler-pod.yaml
```

Expected results:
A healthful OCP/etcd cluster

Additional info:
In the same region (Southeast Asia) had rafthttp issue on Openshift v3.11(seldom occur), but even system occur APIHighLatency alert never meet cluster crash, files lost

[1] https://docs.openshift.com/container-platform/4.4/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html#dr-restoring-cluster-state

Comment 5 Sam Batschelet 2020-06-03 12:32:20 UTC

Moving to 4.6 to continue support and triage.

Comment 11 ge liu 2020-06-17 10:13:29 UTC

Hi Sam, I checked the latest 4.6 payload: 4.6.0-0.nightly-2020-06-16-214732, the fix have not be merged, so I will try it tomorrow if new build ready.

Comment 18 ge liu 2020-06-24 03:09:34 UTC

Verified with 4.6.0-0.nightly-2020-06-23-002207

Comment 26 errata-xmlrpc 2020-10-27 16:02:58 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Comment 28 Red Hat Bugzilla 2023-09-15 00:32:19 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

Note You need to log in before you can comment on or make changes to this bug.