Bug 1842431

Summary: Unresponsive OpenShift 4.4 cluster on Azure (UPI)
Product: OpenShift Container Platform
Component: Etcd
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Seasonny Huang <weihuang>
Assignee: Suresh Kolichala <skolicha>
QA Contact: ge liu <geliu>
Docs Contact:
CC: marobrie, mfojtik, moddi, sbatsche, skolicha, vjaypurk, weihuang, wili
Target Milestone: ---
Target Release: 4.6.0
Flags: skolicha: needinfo-
       geliu: needinfo? (sbatsche)
Whiteboard:
Fixed In Version:
Doc Type: Enhancement
Doc Text: Feature: The original problem reported was due to running the recovery on all 3 nodes instead of a single node. Reason: Enhanced the cluster-backup and cluster-restore scripts to provide better feedback to the user. Result: The user has better information to understand why the restore failed.
Story Points: ---
Clone Of:
Clones: 1848516 (view as bug list)
Environment:
Last Closed: 2020-10-27 16:02:58 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Bug Blocks: 1847769

Description Seasonny Huang 2020-06-01 09:04:11 UTC
Description of problem:

Two weeks after installation, several alerts fired, such as KubeAPILatencyHigh and etcdMemberCommunicationSlow.

The alerts kept firing for a few hours without any manual operation; eventually we could no longer log in to the cluster, and all oc commands failed.


Version-Release number of selected component (if applicable):
4.4.3

How reproducible:
Unconfirmed.
From our testing it seems to require a long-running cluster (about 2 weeks), after which several alerts fire. We saw the following warnings in the cluster:
* KubeAPILatencyHigh
* etcdMemberCommunicationSlow
* time drift log in etcd

Steps to Reproduce:
1. Install OCP cluster
2. Wait for alerts to fire, e.g. etcdMemberCommunicationSlow, KubeAPILatencyHigh, etc.
3. Most actions can still be performed, but intermittent issues occur, e.g. oc commands time out or fail; retrying fixes them.
4. In our testing, if one of the masters is shut down, the cluster behaves erratically, e.g. only about 20% of `oc whoami` invocations succeed, and we can't log in to the cluster:
```
{"error":"server_error","error_description":"The authorization server encountered an unexpected condition that prevented it from fulfilling the request.","state":"888bfbf6"}
```
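The roughly-20%-success observation in step 4 can be reproduced with a small loop; this is an illustrative sketch, not part of the original report, and the 20-attempt count is an arbitrary choice:

```shell
#!/bin/sh
# Hedged sketch: estimate the `oc whoami` success rate against a degraded
# cluster. Failures (timeouts, auth errors, or oc missing) are counted
# rather than aborting the loop.
attempts=20
ok=0
for i in $(seq 1 "$attempts"); do
  if oc whoami >/dev/null 2>&1; then
    ok=$((ok+1))
  fi
done
echo "oc whoami succeeded: $ok/$attempts"
```

On a healthy cluster this should report 20/20; during the incident described above, a figure around 4/20 would match the observed behavior.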

Actual results:

Investigating the log, we found
2020-05-15 14:48:47.640740 W | rafthttp: the clock difference against peer d7523d07ecd5d55e is too high [10.545065453s > 1s] (prober "ROUND_TRIPPER_RAFT_MESSAGE")
After the customer applied this KB article[2], the time drift events seemed to disappear, but we don't think this is the root cause: when we manually set a large clock drift on one of the nodes, the control plane status kept flipping between Ready and Degraded on the web console, yet the cluster seemed fine and we could still deploy a pod to it.
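To see an offset like the 10.5s in the rafthttp warning above directly on a node, the local clock can be checked against its time source. A minimal sketch, assuming chrony is the time daemon (the default on RHCOS masters):

```shell
#!/bin/sh
# Hedged sketch: report the local clock offset from the NTP source.
# An offset anywhere near etcd's 1s warning threshold is a problem.
if command -v chronyc >/dev/null 2>&1; then
  chronyc tracking | grep -E 'System time|Last offset|Leap status'
else
  echo "chronyc not available on this host"
fi
echo "clock check done"
```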
 
Continuing the investigation, we found:
2020-05-15T13:27:49.8684449Z 2020-05-15 13:27:49.868296 W | rafthttp: health check for peer b366cbcd316c915f could not connect: dial tcp 10.5.3.5:2380: i/o timeout (prober "ROUND_TRIPPER_RAFT_MESSAGE")
So we decided to test disk performance for etcd, using this approach[3].
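The approach is presumably the fio run suggested in the upstream etcd documentation, which mimics etcd's WAL write pattern (small sequential writes with an fdatasync after each). A hedged sketch, with the target directory parameterized; on a real master it should point at the etcd data disk:

```shell
#!/bin/sh
# Hedged sketch of an etcd-style disk benchmark (parameters from the
# upstream etcd docs: 2300-byte sequential writes, fdatasync each write).
# BENCH_DIR is an illustrative default, not from the original report.
DIR=${BENCH_DIR:-/tmp/etcd-bench}
mkdir -p "$DIR"
if command -v fio >/dev/null 2>&1; then
  fio --rw=write --ioengine=sync --fdatasync=1 \
      --directory="$DIR" --size=22m --bs=2300 --name=etcd-bench
else
  echo "fio not installed; skipping benchmark"
fi
echo "benchmark run finished"
```

The fdatasync percentile lines in fio's output (like those quoted below) are the numbers that matter for etcd.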

Also see some leader changed events
$ cat crictl_logs_-t_32507a496f39226c9b50ba5db64e0207e14a47d38b4f6d4cf4b8b6257d250ace | sort | uniq | grep -c 'etcdserver: leader changed'
22
$ cat crictl_logs_-t_32507a496f39226c9b50ba5db64e0207e14a47d38b4f6d4cf4b8b6257d250ace | sort | uniq | grep -c 'etcdserver: request timed out'
105

The master nodes use Premium SSD; below are the fio results for etcd disk performance.

--
time unit: usec
master-2:
     | 99.00th=[ 7898], 99.50th=[ 9241], 99.90th=[13042], 99.95th=[18744],
     | 99.99th=[23987]
7898-23987 microseconds
7.898-23.987 milliseconds

master-1:
     | 99.00th=[ 16057], 99.50th=[ 18220], 99.90th=[ 24511], 99.95th=[ 30278],
     | 99.99th=[291505]
16057-291505 microseconds
16.057-291.505 milliseconds  <===

master-0:
     | 99.00th=[ 4621], 99.50th=[ 5735], 99.90th=[ 8717], 99.95th=[10421],
     | 99.99th=[12125]
4621-12125 microseconds
4.621-12.125 milliseconds
--
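As a sanity check on the unit conversion above, the percentile values can be converted from usec to ms with a small awk snippet; the ~10ms p99 fsync budget used for flagging is the figure commonly cited in the etcd hardware guidance, not something stated in this report. The sample inputs are master-1's p99 and p99.99 values:

```shell
#!/bin/sh
# Convert fio fdatasync percentiles (usec) to ms and flag values above
# a ~10ms p99 budget. 16057 and 291505 are master-1's figures above.
for usec in 16057 291505; do
  awk -v u="$usec" 'BEGIN {
    ms = u / 1000
    printf "%d usec = %.3f ms%s\n", u, ms, (ms > 10 ? "  <== above 10ms" : "")
  }'
done
```

Both of master-1's values blow the budget, which is consistent with the leader changes and request timeouts seen in its logs.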

We tried to recover the cluster following this doc[1], but in step 4b we found some files missing on master-0 and master-1:

```
[root@bacchus-master-0 manifests]# ls
kube-apiserver-pod.yaml

[root@bacchus-master-1 manifests]# ls
<empty>

[root@bacchus-master-2 manifests]# ls
etcd-pod.yaml  kube-apiserver-pod.yaml  kube-controller-manager-pod.yaml  kube-scheduler-pod.yaml
```
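The per-master inspection above can be scripted; this is an illustrative sketch (not from the restore procedure itself) that checks for the four static-pod manifests present on the healthy master-2, with the directory parameterized:

```shell
#!/bin/sh
# Hedged sketch: verify the static-pod manifests a healthy master should
# carry (the set seen on master-2 above). MANIFEST_DIR defaults to the
# standard static-pod directory.
MANIFEST_DIR=${MANIFEST_DIR:-/etc/kubernetes/manifests}
for f in etcd-pod.yaml kube-apiserver-pod.yaml \
         kube-controller-manager-pod.yaml kube-scheduler-pod.yaml; do
  if [ -e "$MANIFEST_DIR/$f" ]; then
    echo "present: $f"
  else
    echo "MISSING: $f"
  fi
done
```

Run against the listings above it would report three missing files on master-0 and all four missing on master-1.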

Expected results:
A healthy OCP/etcd cluster.

Additional info:
We have occasionally seen rafthttp issues on OpenShift v3.11 in the same region (Southeast Asia), but even when the system fired APIHighLatency alerts there, we never saw the cluster crash or files go missing.

[1] https://docs.openshift.com/container-platform/4.4/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html#dr-restoring-cluster-state

Comment 5 Sam Batschelet 2020-06-03 12:32:20 UTC
Moving to 4.6 to continue support and triage.

Comment 11 ge liu 2020-06-17 10:13:29 UTC
Hi Sam, I checked the latest 4.6 payload, 4.6.0-0.nightly-2020-06-16-214732; the fix has not been merged yet, so I will try it tomorrow if a new build is ready.

Comment 18 ge liu 2020-06-24 03:09:34 UTC
Verified with 4.6.0-0.nightly-2020-06-23-002207

Comment 26 errata-xmlrpc 2020-10-27 16:02:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196