Bug 1993757 - OCP 4.8 etcd unhealthy
Summary: OCP 4.8 etcd unhealthy
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.8
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.9.0
Assignee: Sam Batschelet
QA Contact: Sandeep
URL:
Whiteboard:
Depends On:
Blocks: 1994483
 
Reported: 2021-08-16 05:42 UTC by kevin
Modified: 2021-10-18 17:46 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:46:20 UTC
Target Upstream Version:
Embargoed:




Links
System                  ID                                        Private  Priority  Status  Summary  Last Updated
Github                  openshift cluster-etcd-operator pull 640  0        None      None    None     2021-08-16 14:08:19 UTC
Red Hat Product Errata  RHSA-2021:3759                            0        None      None    None     2021-10-18 17:46:34 UTC

Description kevin 2021-08-16 05:42:22 UTC
Description of problem:

Hello, I found that etcd on OCP 4.8 reports as unhealthy even though all cluster operators and nodes are ready.

For comparison, this report also shows the etcd status on OCP 4.7.21, which does not have the issue. Is this a bug in OCP 4.8?

Version-Release number of selected component (if applicable):

OCP 4.8.2

How reproducible:

Open a shell in an etcd pod and run etcdctl endpoint health --cluster.

Steps to Reproduce:

# oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.2     True        False         False      14m
baremetal                                  4.8.2     True        False         False      37m
cloud-credential                           4.8.2     True        False         False      44m
cluster-autoscaler                         4.8.2     True        False         False      37m
config-operator                            4.8.2     True        False         False      38m
console                                    4.8.2     True        False         False      15m
csi-snapshot-controller                    4.8.2     True        False         False      38m
dns                                        4.8.2     True        False         False      37m
etcd                                       4.8.2     True        False         False      36m
image-registry                             4.8.2     True        False         False      31m
ingress                                    4.8.2     True        False         False      31m
insights                                   4.8.2     True        False         False      33m
kube-apiserver                             4.8.2     True        False         False      34m
kube-controller-manager                    4.8.2     True        False         False      36m
kube-scheduler                             4.8.2     True        False         False      35m
kube-storage-version-migrator              4.8.2     True        False         False      38m
machine-api                                4.8.2     True        False         False      35m
machine-approver                           4.8.2     True        False         False      38m
machine-config                             4.8.2     True        False         False      37m
marketplace                                4.8.2     True        False         False      37m
monitoring                                 4.8.2     True        False         False      29m
network                                    4.8.2     True        False         False      38m
node-tuning                                4.8.2     True        False         False      37m
openshift-apiserver                        4.8.2     True        False         False      32m
openshift-controller-manager               4.8.2     True        False         False      37m
openshift-samples                          4.8.2     True        False         False      28m
operator-lifecycle-manager                 4.8.2     True        False         False      37m
operator-lifecycle-manager-catalog         4.8.2     True        False         False      38m
operator-lifecycle-manager-packageserver   4.8.2     True        False         False      32m
service-ca                                 4.8.2     True        False         False      39m
storage                                    4.8.2     True        False         False      30m

# oc get nodes
NAME                      STATUS   ROLES           AGE   VERSION
cluster3-wpg5w-master-0   Ready    master,worker   43m   v1.21.1+051ac4f
cluster3-wpg5w-master-1   Ready    master,worker   43m   v1.21.1+051ac4f
cluster3-wpg5w-master-2   Ready    master,worker   42m   v1.21.1+051ac4f

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.2     True        False         15m     Cluster version is 4.8.2

# oc rsh -n openshift-etcd etcd-cluster3-wpg5w-master-0
Defaulting container name to etcdctl.
Use 'oc describe pod/etcd-cluster3-wpg5w-master-0 -n openshift-etcd' to see all of the containers in this pod.

sh-4.4# etcd --version
etcd Version: 3.4.14
Git SHA: 302184b
Go Version: go1.12.12
Go OS/Arch: linux/amd64

sh-4.4# etcdctl member list -w table
+------------------+---------+-------------------------+-----------------------------+------------------------------------------------------+------------+
|        ID        | STATUS  |          NAME           |         PEER ADDRS          |                     CLIENT ADDRS                     | IS LEARNER |
+------------------+---------+-------------------------+-----------------------------+------------------------------------------------------+------------+
| 6a853d515add7524 | started | cluster3-wpg5w-master-1 | https://192.168.30.101:2380 | https://192.168.30.101:2379,unixs://192.168.30.101:0 |      false |
| 7499dbce65c3d0e5 | started | cluster3-wpg5w-master-2 | https://192.168.30.102:2380 | https://192.168.30.102:2379,unixs://192.168.30.102:0 |      false |
| eed9a82b756a5949 | started | cluster3-wpg5w-master-0 | https://192.168.30.103:2380 | https://192.168.30.103:2379,unixs://192.168.30.103:0 |      false |
+------------------+---------+-------------------------+-----------------------------+------------------------------------------------------+------------+

sh-4.4# etcdctl endpoint health --cluster
{"level":"warn","ts":"2021-08-16T05:29:33.645Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-0e6d9929-8264-4698-b424-f669cc0427ac/192.168.30.103:0","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix 192.168.30.103:0: connect: no such file or directory\""}
{"level":"warn","ts":"2021-08-16T05:29:33.645Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-38d42427-5288-4098-a4ce-9708a0fec0c1/192.168.30.101:0","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix 192.168.30.101:0: connect: no such file or directory\""}
{"level":"warn","ts":"2021-08-16T05:29:33.646Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-6c1259e8-4939-40fe-b0e0-040b65ef2dd8/192.168.30.102:0","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix 192.168.30.102:0: connect: no such file or directory\""}
https://192.168.30.103:2379 is healthy: successfully committed proposal: took = 32.628699ms
https://192.168.30.101:2379 is healthy: successfully committed proposal: took = 37.677222ms
https://192.168.30.102:2379 is healthy: successfully committed proposal: took = 41.539623ms
unixs://192.168.30.103:0 is unhealthy: failed to commit proposal: context deadline exceeded
unixs://192.168.30.101:0 is unhealthy: failed to commit proposal: context deadline exceeded
unixs://192.168.30.102:0 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster
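
The output above can be tallied per URL scheme to make the pattern explicit: every https:// endpoint is healthy, and only the unixs:// entries fail. The following is a minimal sketch, assuming a POSIX shell with awk and sort available; the function name summarize_health_by_scheme is made up for illustration and is not part of etcdctl.

```shell
#!/bin/sh
# Summarize `etcdctl endpoint health` output by URL scheme.
# Reads health lines on stdin and prints "<scheme> <healthy> <unhealthy>"
# for each scheme seen. Lines that are not health results (the JSON
# warnings, "Error: unhealthy cluster") are skipped.
# NOTE: summarize_health_by_scheme is a hypothetical helper name.
summarize_health_by_scheme() {
  awk '
    /^[a-z]+:\/\/[^ ]+ is (healthy|unhealthy):/ {
      scheme = $1
      sub(/:\/\/.*/, "", scheme)          # keep only the part before "://"
      if ($3 == "healthy:") healthy[scheme]++
      else unhealthy[scheme]++
      seen[scheme] = 1
    }
    END {
      for (s in seen) printf "%s %d %d\n", s, healthy[s], unhealthy[s]
    }
  ' | sort
}
```

Piping the six result lines above through summarize_health_by_scheme gives "https 3 0" and "unixs 0 3", which suggests the etcd members themselves are fine and only the unixs://<ip>:0 client addresses cannot be dialed.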


Actual results:

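etcdctl endpoint health --cluster reports the cluster as unhealthy: the three unixs://<ip>:0 endpoints fail, even though all three https endpoints are healthy.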

Expected results:

Below is the etcd status on OCP 4.7.21. The member list there has no unixs:// client addresses, and all endpoints report healthy.

# oc rsh -n openshift-etcd etcd-master-01
Defaulting container name to etcdctl.
Use 'oc describe pod/etcd-master-01 -n openshift-etcd' to see all of the containers in this pod.

sh-4.4# etcdctl member list -w table
+------------------+---------+-----------+----------------------------+----------------------------+------------+
|        ID        | STATUS  |   NAME    |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------+----------------------------+----------------------------+------------+
|  ff46b5088927f7e | started | master-01 | https://192.168.30.47:2380 | https://192.168.30.47:2379 |      false |
| 2d1b0e1ae4152bff | started | master-03 | https://192.168.30.49:2380 | https://192.168.30.49:2379 |      false |
| 914d42e671b50c2c | started | master-02 | https://192.168.30.48:2380 | https://192.168.30.48:2379 |      false |
+------------------+---------+-----------+----------------------------+----------------------------+------------+

sh-4.4# etcdctl endpoint health --cluster
https://192.168.30.48:2379 is healthy: successfully committed proposal: took = 20.090987ms
https://192.168.30.47:2379 is healthy: successfully committed proposal: took = 21.686083ms
https://192.168.30.49:2379 is healthy: successfully committed proposal: took = 21.788874ms

sh-4.4# etcd --version
etcd Version: 3.4.9
Git SHA: 9d1c40d
Go Version: go1.12.12
Go OS/Arch: linux/amd64

Additional info:

Comment 1 Sam Batschelet 2021-08-16 14:11:28 UTC
This issue is cosmetic. The workaround for now is to drop the --cluster flag from the etcdctl command:

```
 etcdctl endpoint health
```
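
With --cluster, etcdctl takes its endpoints from the member list, which on 4.8 also contains the unixs://<ip>:0 client addresses. If a check that still follows the member list is wanted, one possible approach is to build the endpoint list from the https client URLs only. This is a rough sketch, not the fix from the linked pull request. It assumes grep, sort, and paste are available in the container, that client URLs use port 2379 as in the transcripts in this report, and the helper name https_client_endpoints is hypothetical.

```shell
#!/bin/sh
# Print the https:// client URLs found in `etcdctl member list` output
# (read from stdin) as one comma-separated list, skipping unixs:// URLs
# and the peer URLs on port 2380.
# Assumption: client URLs listen on port 2379, as in this report.
# NOTE: https_client_endpoints is a hypothetical helper name.
https_client_endpoints() {
  grep -oE 'https://[^ ,|]+:2379' | sort -u | paste -sd, -
}

# Possible usage inside the etcdctl container (ETCDCTL_ENDPOINTS is the
# environment form of the --endpoints flag):
#   ETCDCTL_ENDPOINTS="$(etcdctl member list -w table | https_client_endpoints)" \
#     etcdctl endpoint health
```

The plain etcdctl endpoint health workaround above remains the simpler option; this variant only matters when the endpoint list has to be derived from the current membership.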

Comment 3 Sandeep 2021-08-24 15:05:13 UTC
Below are the steps and observations on a 4.8 cluster:

$ oc get nodes
NAME                                                        STATUS   ROLES    AGE     VERSION
skundu-ver-1-g8d9k-master-0.c.openshift-qe.internal         Ready    master   3h23m   v1.21.1+9807387
skundu-ver-1-g8d9k-master-1.c.openshift-qe.internal         Ready    master   3h23m   v1.21.1+9807387
skundu-ver-1-g8d9k-master-2.c.openshift-qe.internal         Ready    master   3h23m   v1.21.1+9807387
skundu-ver-1-g8d9k-worker-a-pc869.c.openshift-qe.internal   Ready    worker   3h16m   v1.21.1+9807387
skundu-ver-1-g8d9k-worker-b-z4r5c.c.openshift-qe.internal   Ready    worker   3h16m   v1.21.1+9807387
skundu-ver-1-g8d9k-worker-c-2xnvk.c.openshift-qe.internal   Ready    worker   3h17m   v1.21.1+9807387

____________________________________________________________________________________________________________________________________________________________________________________

$ oc rsh -n openshift-etcd etcd-skundu-ver-1-g8d9k-master-0.c.openshift-qe.internal
Defaulting container name to etcdctl.
Use 'oc describe pod/etcd-skundu-ver-1-g8d9k-master-0.c.openshift-qe.internal -n openshift-etcd' to see all of the containers in this pod.
sh-4.4# 
sh-4.4# etcd --version
etcd Version: 3.4.14
Git SHA: 95a9769
Go Version: go1.12.12
Go OS/Arch: linux/amd64

sh-4.4# etcdctl member list -w table
+------------------+---------+-----------------------------------------------------+-----------------------+------------------------------------------+------------+
|        ID        | STATUS  |                        NAME                         |      PEER ADDRS       |               CLIENT ADDRS               | IS LEARNER |
+------------------+---------+-----------------------------------------------------+-----------------------+------------------------------------------+------------+
| 44f821c73f39e4fc | started | skundu-ver-1-g8d9k-master-2.c.openshift-qe.internal | https://10.0.0.2:2380 | https://10.0.0.2:2379,unixs://10.0.0.2:0 |      false |
| 6b00e473bd74e3cb | started | skundu-ver-1-g8d9k-master-1.c.openshift-qe.internal | https://10.0.0.5:2380 | https://10.0.0.5:2379,unixs://10.0.0.5:0 |      false |
| a1a7b97340cb643c | started | skundu-ver-1-g8d9k-master-0.c.openshift-qe.internal | https://10.0.0.4:2380 | https://10.0.0.4:2379,unixs://10.0.0.4:0 |      false |
+------------------+---------+-----------------------------------------------------+-----------------------+------------------------------------------+------------+

______________________________________________________________________________________________________________________________________________________________________________________


sh-4.4# etcdctl endpoint health --cluster
{"level":"warn","ts":"2021-08-24T14:54:12.808Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-8d09a2a4-911c-45f1-9053-6f6115b2551f/10.0.0.5:0","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix 10.0.0.5:0: connect: no such file or directory\""}
{"level":"warn","ts":"2021-08-24T14:54:12.808Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-8c368e3c-5ebb-4e70-9b9d-f8ba25bbab1c/10.0.0.4:0","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix 10.0.0.4:0: connect: no such file or directory\""}
{"level":"warn","ts":"2021-08-24T14:54:12.808Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-aeabbe4e-9930-4a77-ba02-795b82c8541d/10.0.0.2:0","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix 10.0.0.2:0: connect: no such file or directory\""}
https://10.0.0.4:2379 is healthy: successfully committed proposal: took = 16.253078ms
https://10.0.0.5:2379 is healthy: successfully committed proposal: took = 20.409062ms
https://10.0.0.2:2379 is healthy: successfully committed proposal: took = 21.275478ms
unixs://10.0.0.5:0 is unhealthy: failed to commit proposal: context deadline exceeded
unixs://10.0.0.4:0 is unhealthy: failed to commit proposal: context deadline exceeded
unixs://10.0.0.2:0 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster
sh-4.4# 

______________________________________________________________________________________________________________________________________________________________________________________

The issue reported in this bug still exists on 4.8.
______________________________________________________________________________________________________________________________________________________________________________________

sh-4.4# etcdctl endpoint health          
https://10.0.0.5:2379 is healthy: successfully committed proposal: took = 29.146886ms
https://10.0.0.4:2379 is healthy: successfully committed proposal: took = 29.136991ms
https://10.0.0.2:2379 is healthy: successfully committed proposal: took = 41.413365ms
______________________________________________________________________________________________________________________________________________________________________________________

The workaround mentioned above (omitting the --cluster flag) works correctly.

Comment 4 ge liu 2021-08-26 03:28:17 UTC
@skundu, I will change the status for you based on comment 3, since you do not have permission to change the bug status.

Comment 7 errata-xmlrpc 2021-10-18 17:46:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

