Bug 1994483

Summary:	OCP 4.8 etcd unhealthy
Product:	OpenShift Container Platform
Component:	Etcd
Version:	4.8
Target Release:	4.8.z
Hardware:	x86_64
OS:	Unspecified
Status:	CLOSED ERRATA
Severity:	medium
Priority:	unspecified
Reporter:	OpenShift BugZilla Robot <openshift-bugzilla-robot>
Assignee:	Sam Batschelet <sbatsche>
QA Contact:	ge liu <geliu>
CC:	wking
Doc Type:	If docs needed, set a value
Last Closed:	2021-09-21 08:01:31 UTC
Bug Depends On:	1993757

Description OpenShift BugZilla Robot 2021-08-17 11:18:42 UTC

+++ This bug was initially created as a clone of Bug #1993757 +++

Description of problem:

Hello, I have found that etcd on OCP 4.8 is reported as unhealthy even though all cluster operators and nodes are ready.

This report also shows the etcd status on OCP 4.7.21, which does not have the issue seen on 4.8. Is this a bug in OCP 4.8?

Version-Release number of selected component (if applicable):

OCP 4.8.2

How reproducible:

Open a shell in an etcd pod and check the cluster endpoint health status.

Steps to Reproduce:

# oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.2     True        False         False      14m
baremetal                                  4.8.2     True        False         False      37m
cloud-credential                           4.8.2     True        False         False      44m
cluster-autoscaler                         4.8.2     True        False         False      37m
config-operator                            4.8.2     True        False         False      38m
console                                    4.8.2     True        False         False      15m
csi-snapshot-controller                    4.8.2     True        False         False      38m
dns                                        4.8.2     True        False         False      37m
etcd                                       4.8.2     True        False         False      36m
image-registry                             4.8.2     True        False         False      31m
ingress                                    4.8.2     True        False         False      31m
insights                                   4.8.2     True        False         False      33m
kube-apiserver                             4.8.2     True        False         False      34m
kube-controller-manager                    4.8.2     True        False         False      36m
kube-scheduler                             4.8.2     True        False         False      35m
kube-storage-version-migrator              4.8.2     True        False         False      38m
machine-api                                4.8.2     True        False         False      35m
machine-approver                           4.8.2     True        False         False      38m
machine-config                             4.8.2     True        False         False      37m
marketplace                                4.8.2     True        False         False      37m
monitoring                                 4.8.2     True        False         False      29m
network                                    4.8.2     True        False         False      38m
node-tuning                                4.8.2     True        False         False      37m
openshift-apiserver                        4.8.2     True        False         False      32m
openshift-controller-manager               4.8.2     True        False         False      37m
openshift-samples                          4.8.2     True        False         False      28m
operator-lifecycle-manager                 4.8.2     True        False         False      37m
operator-lifecycle-manager-catalog         4.8.2     True        False         False      38m
operator-lifecycle-manager-packageserver   4.8.2     True        False         False      32m
service-ca                                 4.8.2     True        False         False      39m
storage                                    4.8.2     True        False         False      30m

# oc get nodes
NAME                      STATUS   ROLES           AGE   VERSION
cluster3-wpg5w-master-0   Ready    master,worker   43m   v1.21.1+051ac4f
cluster3-wpg5w-master-1   Ready    master,worker   43m   v1.21.1+051ac4f
cluster3-wpg5w-master-2   Ready    master,worker   42m   v1.21.1+051ac4f

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.2     True        False         15m     Cluster version is 4.8.2

# oc rsh -n openshift-etcd etcd-cluster3-wpg5w-master-0
Defaulting container name to etcdctl.
Use 'oc describe pod/etcd-cluster3-wpg5w-master-0 -n openshift-etcd' to see all of the containers in this pod.

sh-4.4# etcd --version
etcd Version: 3.4.14
Git SHA: 302184b
Go Version: go1.12.12
Go OS/Arch: linux/amd64

sh-4.4# etcdctl member list -w table
+------------------+---------+-------------------------+-----------------------------+------------------------------------------------------+------------+
|        ID        | STATUS  |          NAME           |         PEER ADDRS          |                     CLIENT ADDRS                     | IS LEARNER |
+------------------+---------+-------------------------+-----------------------------+------------------------------------------------------+------------+
| 6a853d515add7524 | started | cluster3-wpg5w-master-1 | https://192.168.30.101:2380 | https://192.168.30.101:2379,unixs://192.168.30.101:0 |      false |
| 7499dbce65c3d0e5 | started | cluster3-wpg5w-master-2 | https://192.168.30.102:2380 | https://192.168.30.102:2379,unixs://192.168.30.102:0 |      false |
| eed9a82b756a5949 | started | cluster3-wpg5w-master-0 | https://192.168.30.103:2380 | https://192.168.30.103:2379,unixs://192.168.30.103:0 |      false |
+------------------+---------+-------------------------+-----------------------------+------------------------------------------------------+------------+

sh-4.4# etcdctl endpoint health --cluster
{"level":"warn","ts":"2021-08-16T05:29:33.645Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-0e6d9929-8264-4698-b424-f669cc0427ac/192.168.30.103:0","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix 192.168.30.103:0: connect: no such file or directory\""}
{"level":"warn","ts":"2021-08-16T05:29:33.645Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-38d42427-5288-4098-a4ce-9708a0fec0c1/192.168.30.101:0","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix 192.168.30.101:0: connect: no such file or directory\""}
{"level":"warn","ts":"2021-08-16T05:29:33.646Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-6c1259e8-4939-40fe-b0e0-040b65ef2dd8/192.168.30.102:0","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix 192.168.30.102:0: connect: no such file or directory\""}
https://192.168.30.103:2379 is healthy: successfully committed proposal: took = 32.628699ms
https://192.168.30.101:2379 is healthy: successfully committed proposal: took = 37.677222ms
https://192.168.30.102:2379 is healthy: successfully committed proposal: took = 41.539623ms
unixs://192.168.30.103:0 is unhealthy: failed to commit proposal: context deadline exceeded
unixs://192.168.30.101:0 is unhealthy: failed to commit proposal: context deadline exceeded
unixs://192.168.30.102:0 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster


Actual results:

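`etcdctl endpoint health --cluster` reports the cluster as unhealthy: the https://<ip>:2379 endpoints are healthy, but every unixs://<ip>:0 endpoint fails with "context deadline exceeded".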

Expected results:

Below is the etcd status on OCP 4.7.21, which does not show the unhealthy endpoints seen on 4.8.

# oc rsh -n openshift-etcd etcd-master-01
Defaulting container name to etcdctl.
Use 'oc describe pod/etcd-master-01 -n openshift-etcd' to see all of the containers in this pod.

sh-4.4# etcdctl member list -w table
+------------------+---------+-----------+----------------------------+----------------------------+------------+
|        ID        | STATUS  |   NAME    |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------+----------------------------+----------------------------+------------+
|  ff46b5088927f7e | started | master-01 | https://192.168.30.47:2380 | https://192.168.30.47:2379 |      false |
| 2d1b0e1ae4152bff | started | master-03 | https://192.168.30.49:2380 | https://192.168.30.49:2379 |      false |
| 914d42e671b50c2c | started | master-02 | https://192.168.30.48:2380 | https://192.168.30.48:2379 |      false |
+------------------+---------+-----------+----------------------------+----------------------------+------------+

sh-4.4# etcdctl endpoint health --cluster
https://192.168.30.48:2379 is healthy: successfully committed proposal: took = 20.090987ms
https://192.168.30.47:2379 is healthy: successfully committed proposal: took = 21.686083ms
https://192.168.30.49:2379 is healthy: successfully committed proposal: took = 21.788874ms

sh-4.4# etcd --version
etcd Version: 3.4.9
Git SHA: 9d1c40d
Go Version: go1.12.12
Go OS/Arch: linux/amd64

Additional info:

--- Additional comment from sbatsche on 2021-08-16 14:11:28 UTC ---

This issue is cosmetic. The workaround for now is to drop the --cluster flag from the etcdctl command:

```
 etcdctl endpoint health
```
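
If a cluster-wide check is still wanted before the fix is installed, one option is to query only the https client URLs and skip the unixs://<ip>:0 entries. The following is a minimal sketch, not part of the fix. `https_client_endpoints` is a hypothetical helper name, and the sketch assumes that awk is available and that `etcdctl member list -w table` prints the column layout shown in the description.

```shell
# Hypothetical helper: read `etcdctl member list -w table` output on stdin and
# print the https:// client URLs as one comma-separated list.
https_client_endpoints() {
  awk -F'|' '
    NF >= 7 {                              # table rows only; border lines have no "|"
      n = split($6, urls, ",")             # $6 is the CLIENT ADDRS column
      for (i = 1; i <= n; i++) {
        gsub(/[ \t]/, "", urls[i])         # strip column padding
        if (urls[i] ~ /^https:\/\//)       # drops unixs:// entries and the header cell
          out = out (out ? "," : "") urls[i]
      }
    }
    END { print out }
  '
}

# Usage (assumption: run where etcdctl is already configured, e.g. the etcdctl container):
#   ETCDCTL_ENDPOINTS="$(etcdctl member list -w table | https_client_endpoints)" \
#     etcdctl endpoint health
```

The usage line sets ETCDCTL_ENDPOINTS for the second command instead of passing --endpoints. This is a precaution in case the variable is already set in the pod's environment, where combining it with the flag may be rejected as a conflict.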

Comment 5 errata-xmlrpc 2021-09-21 08:01:31 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.8.12 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3511