Bug 1994483 - OCP 4.8 etcd unhealthy
Summary: OCP 4.8 etcd unhealthy
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.8
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.8.z
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On: 1993757
Blocks:
Reported: 2021-08-17 11:18 UTC by OpenShift BugZilla Robot
Modified: 2021-09-21 08:01 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-21 08:01:31 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-etcd-operator pull 642 | 0 | None | None | None | 2021-08-27 01:02:23 UTC
Red Hat Product Errata | RHBA-2021:3511 | 0 | None | None | None | 2021-09-21 08:01:45 UTC

Description OpenShift BugZilla Robot 2021-08-17 11:18:42 UTC
+++ This bug was initially created as a clone of Bug #1993757 +++

Description of problem:

Hello, I have found that etcd on OCP 4.8 is reported as unhealthy even though all cluster operators and nodes are ready.

This bug report also shows the etcd status on OCP 4.7.21, which does not have the unhealthy result seen on 4.8. Is this a bug in OCP 4.8?

Version-Release number of selected component (if applicable):

OCP 4.8.2

How reproducible:

Open a shell in an etcd pod and check the cluster endpoint health status.

Steps to Reproduce:

# oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.2     True        False         False      14m
baremetal                                  4.8.2     True        False         False      37m
cloud-credential                           4.8.2     True        False         False      44m
cluster-autoscaler                         4.8.2     True        False         False      37m
config-operator                            4.8.2     True        False         False      38m
console                                    4.8.2     True        False         False      15m
csi-snapshot-controller                    4.8.2     True        False         False      38m
dns                                        4.8.2     True        False         False      37m
etcd                                       4.8.2     True        False         False      36m
image-registry                             4.8.2     True        False         False      31m
ingress                                    4.8.2     True        False         False      31m
insights                                   4.8.2     True        False         False      33m
kube-apiserver                             4.8.2     True        False         False      34m
kube-controller-manager                    4.8.2     True        False         False      36m
kube-scheduler                             4.8.2     True        False         False      35m
kube-storage-version-migrator              4.8.2     True        False         False      38m
machine-api                                4.8.2     True        False         False      35m
machine-approver                           4.8.2     True        False         False      38m
machine-config                             4.8.2     True        False         False      37m
marketplace                                4.8.2     True        False         False      37m
monitoring                                 4.8.2     True        False         False      29m
network                                    4.8.2     True        False         False      38m
node-tuning                                4.8.2     True        False         False      37m
openshift-apiserver                        4.8.2     True        False         False      32m
openshift-controller-manager               4.8.2     True        False         False      37m
openshift-samples                          4.8.2     True        False         False      28m
operator-lifecycle-manager                 4.8.2     True        False         False      37m
operator-lifecycle-manager-catalog         4.8.2     True        False         False      38m
operator-lifecycle-manager-packageserver   4.8.2     True        False         False      32m
service-ca                                 4.8.2     True        False         False      39m
storage                                    4.8.2     True        False         False      30m

# oc get nodes
NAME                      STATUS   ROLES           AGE   VERSION
cluster3-wpg5w-master-0   Ready    master,worker   43m   v1.21.1+051ac4f
cluster3-wpg5w-master-1   Ready    master,worker   43m   v1.21.1+051ac4f
cluster3-wpg5w-master-2   Ready    master,worker   42m   v1.21.1+051ac4f

[root@support cluster3]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.2     True        False         15m     Cluster version is 4.8.2

# oc rsh -n openshift-etcd etcd-cluster3-wpg5w-master-0
Defaulting container name to etcdctl.
Use 'oc describe pod/etcd-cluster3-wpg5w-master-0 -n openshift-etcd' to see all of the containers in this pod.

sh-4.4# etcd --version
etcd Version: 3.4.14
Git SHA: 302184b
Go Version: go1.12.12
Go OS/Arch: linux/amd64

sh-4.4# etcdctl member list -w table
+------------------+---------+-------------------------+-----------------------------+------------------------------------------------------+------------+
|        ID        | STATUS  |          NAME           |         PEER ADDRS          |                     CLIENT ADDRS                     | IS LEARNER |
+------------------+---------+-------------------------+-----------------------------+------------------------------------------------------+------------+
| 6a853d515add7524 | started | cluster3-wpg5w-master-1 | https://192.168.30.101:2380 | https://192.168.30.101:2379,unixs://192.168.30.101:0 |      false |
| 7499dbce65c3d0e5 | started | cluster3-wpg5w-master-2 | https://192.168.30.102:2380 | https://192.168.30.102:2379,unixs://192.168.30.102:0 |      false |
| eed9a82b756a5949 | started | cluster3-wpg5w-master-0 | https://192.168.30.103:2380 | https://192.168.30.103:2379,unixs://192.168.30.103:0 |      false |
+------------------+---------+-------------------------+-----------------------------+------------------------------------------------------+------------+

sh-4.4# etcdctl endpoint health --cluster
{"level":"warn","ts":"2021-08-16T05:29:33.645Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-0e6d9929-8264-4698-b424-f669cc0427ac/192.168.30.103:0","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix 192.168.30.103:0: connect: no such file or directory\""}
{"level":"warn","ts":"2021-08-16T05:29:33.645Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-38d42427-5288-4098-a4ce-9708a0fec0c1/192.168.30.101:0","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix 192.168.30.101:0: connect: no such file or directory\""}
{"level":"warn","ts":"2021-08-16T05:29:33.646Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-6c1259e8-4939-40fe-b0e0-040b65ef2dd8/192.168.30.102:0","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix 192.168.30.102:0: connect: no such file or directory\""}
https://192.168.30.103:2379 is healthy: successfully committed proposal: took = 32.628699ms
https://192.168.30.101:2379 is healthy: successfully committed proposal: took = 37.677222ms
https://192.168.30.102:2379 is healthy: successfully committed proposal: took = 41.539623ms
unixs://192.168.30.103:0 is unhealthy: failed to commit proposal: context deadline exceeded
unixs://192.168.30.101:0 is unhealthy: failed to commit proposal: context deadline exceeded
unixs://192.168.30.102:0 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster
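
The output above mixes healthy https:// endpoints with unhealthy unixs:// endpoints. As a rough sketch that is not part of the original report, the per-endpoint result lines could be grouped by URL scheme to make that pattern explicit. The function name `summarize_health` is hypothetical, and the sample lines are copied from the output above.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def summarize_health(lines):
    """Group 'etcdctl endpoint health' result lines by URL scheme.

    Returns {scheme: {"healthy": [...], "unhealthy": [...]}}.
    Lines that do not look like '<url> is healthy|unhealthy: ...'
    (JSON warnings, the final 'Error:' line, blanks) are skipped.
    """
    summary = defaultdict(lambda: {"healthy": [], "unhealthy": []})
    for line in lines:
        endpoint, sep, rest = line.strip().partition(" is ")
        if not sep or " " in endpoint or "://" not in endpoint:
            continue
        state = rest.split(":", 1)[0]
        if state in ("healthy", "unhealthy"):
            summary[urlsplit(endpoint).scheme][state].append(endpoint)
    return dict(summary)


# Sample lines copied from the 4.8.2 output in this report.
SAMPLE = [
    "https://192.168.30.103:2379 is healthy: successfully committed proposal: took = 32.628699ms",
    "https://192.168.30.101:2379 is healthy: successfully committed proposal: took = 37.677222ms",
    "https://192.168.30.102:2379 is healthy: successfully committed proposal: took = 41.539623ms",
    "unixs://192.168.30.103:0 is unhealthy: failed to commit proposal: context deadline exceeded",
    "unixs://192.168.30.101:0 is unhealthy: failed to commit proposal: context deadline exceeded",
    "unixs://192.168.30.102:0 is unhealthy: failed to commit proposal: context deadline exceeded",
    "Error: unhealthy cluster",
]

for scheme, states in summarize_health(SAMPLE).items():
    print(scheme, "healthy:", len(states["healthy"]), "unhealthy:", len(states["unhealthy"]))
```

With the sample above, every https:// endpoint lands in the healthy group and every unixs:// endpoint in the unhealthy group, which matches the observation that only the unixs:// client URLs fail.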


Actual results:

etcdctl reports the cluster as unhealthy. All https:// endpoints are healthy; only the unixs:// endpoints fail.

Expected results:

Below is the etcd status on OCP 4.7.21, which does not show the unhealthy result seen on 4.8.

# oc rsh -n openshift-etcd etcd-master-01
Defaulting container name to etcdctl.
Use 'oc describe pod/etcd-master-01 -n openshift-etcd' to see all of the containers in this pod.

sh-4.4# etcdctl member list -w table
+------------------+---------+-----------+----------------------------+----------------------------+------------+
|        ID        | STATUS  |   NAME    |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------+----------------------------+----------------------------+------------+
|  ff46b5088927f7e | started | master-01 | https://192.168.30.47:2380 | https://192.168.30.47:2379 |      false |
| 2d1b0e1ae4152bff | started | master-03 | https://192.168.30.49:2380 | https://192.168.30.49:2379 |      false |
| 914d42e671b50c2c | started | master-02 | https://192.168.30.48:2380 | https://192.168.30.48:2379 |      false |
+------------------+---------+-----------+----------------------------+----------------------------+------------+

sh-4.4# etcdctl endpoint health --cluster
https://192.168.30.48:2379 is healthy: successfully committed proposal: took = 20.090987ms
https://192.168.30.47:2379 is healthy: successfully committed proposal: took = 21.686083ms
https://192.168.30.49:2379 is healthy: successfully committed proposal: took = 21.788874ms

sh-4.4# etcd --version
etcd Version: 3.4.9
Git SHA: 9d1c40d
Go Version: go1.12.12
Go OS/Arch: linux/amd64

Additional info:

--- Additional comment from sbatsche on 2021-08-16 14:11:28 UTC ---

This issue is cosmetic. The workaround for now is to drop the --cluster flag from the etcdctl command:

```
 etcdctl endpoint health
```
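
If a check across all members is still wanted, one possible alternative, which is not suggested anywhere in this report, is to pass only the https:// client URLs to etcdctl's global `--endpoints` flag. The sketch below builds such a list from the CLIENT ADDRS column of `etcdctl member list`. The helper name `https_endpoints` is hypothetical, and whether explicit endpoints work in a given pod (certificates, preset environment variables) is an untested assumption.

```python
def https_endpoints(client_addrs):
    """Return the https:// client URLs from member-list CLIENT ADDRS values.

    Each input string is a comma-separated URL list as printed in the
    CLIENT ADDRS column; other schemes (such as unixs://) are dropped,
    and duplicates are removed while keeping the original order.
    """
    urls = []
    for cell in client_addrs:
        for url in cell.split(","):
            url = url.strip()
            if url.startswith("https://") and url not in urls:
                urls.append(url)
    return urls


# CLIENT ADDRS values copied from the 4.8.2 member list in this report.
cells = [
    "https://192.168.30.101:2379,unixs://192.168.30.101:0",
    "https://192.168.30.102:2379,unixs://192.168.30.102:0",
    "https://192.168.30.103:2379,unixs://192.168.30.103:0",
]

# Print a command line that checks only the https:// endpoints.
print("etcdctl endpoint health --endpoints=" + ",".join(https_endpoints(cells)))
```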

Comment 5 errata-xmlrpc 2021-09-21 08:01:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.8.12 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3511

