Description of problem:
Hello, I have found that etcd on OCP 4.8 is reported as unhealthy even though all cluster operators and nodes are ready. This report also shows the etcd status on OCP 4.7.21, which does not have the unhealthy result seen on 4.8. Is this a bug in OCP 4.8?

Version-Release number of selected component (if applicable):
OCP 4.8.2

How reproducible:
Enter an etcd pod and show the cluster endpoint health status.

Steps to Reproduce:
# oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.2     True        False         False      14m
baremetal                                  4.8.2     True        False         False      37m
cloud-credential                           4.8.2     True        False         False      44m
cluster-autoscaler                         4.8.2     True        False         False      37m
config-operator                            4.8.2     True        False         False      38m
console                                    4.8.2     True        False         False      15m
csi-snapshot-controller                    4.8.2     True        False         False      38m
dns                                        4.8.2     True        False         False      37m
etcd                                       4.8.2     True        False         False      36m
image-registry                             4.8.2     True        False         False      31m
ingress                                    4.8.2     True        False         False      31m
insights                                   4.8.2     True        False         False      33m
kube-apiserver                             4.8.2     True        False         False      34m
kube-controller-manager                    4.8.2     True        False         False      36m
kube-scheduler                             4.8.2     True        False         False      35m
kube-storage-version-migrator              4.8.2     True        False         False      38m
machine-api                                4.8.2     True        False         False      35m
machine-approver                           4.8.2     True        False         False      38m
machine-config                             4.8.2     True        False         False      37m
marketplace                                4.8.2     True        False         False      37m
monitoring                                 4.8.2     True        False         False      29m
network                                    4.8.2     True        False         False      38m
node-tuning                                4.8.2     True        False         False      37m
openshift-apiserver                        4.8.2     True        False         False      32m
openshift-controller-manager               4.8.2     True        False         False      37m
openshift-samples                          4.8.2     True        False         False      28m
operator-lifecycle-manager                 4.8.2     True        False         False      37m
operator-lifecycle-manager-catalog         4.8.2     True        False         False      38m
operator-lifecycle-manager-packageserver   4.8.2     True        False         False      32m
service-ca                                 4.8.2     True        False         False      39m
storage                                    4.8.2     True        False         False      30m

# oc get nodes
NAME                      STATUS   ROLES           AGE   VERSION
cluster3-wpg5w-master-0   Ready    master,worker   43m   v1.21.1+051ac4f
cluster3-wpg5w-master-1   Ready    master,worker   43m   v1.21.1+051ac4f
cluster3-wpg5w-master-2   Ready    master,worker   42m   v1.21.1+051ac4f

[root@support cluster3]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.2     True        False         15m     Cluster version is 4.8.2

# oc rsh -n openshift-etcd etcd-cluster3-wpg5w-master-0
Defaulting container name to etcdctl.
Use 'oc describe pod/etcd-cluster3-wpg5w-master-0 -n openshift-etcd' to see all of the containers in this pod.
sh-4.4# etcd --version
etcd Version: 3.4.14
Git SHA: 302184b
Go Version: go1.12.12
Go OS/Arch: linux/amd64
sh-4.4# etcdctl member list -w table
+------------------+---------+-------------------------+-----------------------------+------------------------------------------------------+------------+
|        ID        | STATUS  |          NAME           |         PEER ADDRS          |                     CLIENT ADDRS                     | IS LEARNER |
+------------------+---------+-------------------------+-----------------------------+------------------------------------------------------+------------+
| 6a853d515add7524 | started | cluster3-wpg5w-master-1 | https://192.168.30.101:2380 | https://192.168.30.101:2379,unixs://192.168.30.101:0 |      false |
| 7499dbce65c3d0e5 | started | cluster3-wpg5w-master-2 | https://192.168.30.102:2380 | https://192.168.30.102:2379,unixs://192.168.30.102:0 |      false |
| eed9a82b756a5949 | started | cluster3-wpg5w-master-0 | https://192.168.30.103:2380 | https://192.168.30.103:2379,unixs://192.168.30.103:0 |      false |
+------------------+---------+-------------------------+-----------------------------+------------------------------------------------------+------------+
sh-4.4# etcdctl endpoint health --cluster
{"level":"warn","ts":"2021-08-16T05:29:33.645Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-0e6d9929-8264-4698-b424-f669cc0427ac/192.168.30.103:0","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix 192.168.30.103:0: connect: no such file or directory\""}
{"level":"warn","ts":"2021-08-16T05:29:33.645Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-38d42427-5288-4098-a4ce-9708a0fec0c1/192.168.30.101:0","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix 192.168.30.101:0: connect: no such file or directory\""}
{"level":"warn","ts":"2021-08-16T05:29:33.646Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-6c1259e8-4939-40fe-b0e0-040b65ef2dd8/192.168.30.102:0","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix 192.168.30.102:0: connect: no such file or directory\""}
https://192.168.30.103:2379 is healthy: successfully committed proposal: took = 32.628699ms
https://192.168.30.101:2379 is healthy: successfully committed proposal: took = 37.677222ms
https://192.168.30.102:2379 is healthy: successfully committed proposal: took = 41.539623ms
unixs://192.168.30.103:0 is unhealthy: failed to commit proposal: context deadline exceeded
unixs://192.168.30.101:0 is unhealthy: failed to commit proposal: context deadline exceeded
unixs://192.168.30.102:0 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster

Actual results:
The etcd status is reported as unhealthy.

Expected results:
Below is the etcd status on OCP 4.7.21. etcd on 4.7 does not show the unhealthy result seen on 4.8.

# oc rsh -n openshift-etcd etcd-master-01
Defaulting container name to etcdctl.
Use 'oc describe pod/etcd-master-01 -n openshift-etcd' to see all of the containers in this pod.
sh-4.4# etcdctl member list -w table
+------------------+---------+-----------+----------------------------+----------------------------+------------+
|        ID        | STATUS  |   NAME    |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |
+------------------+---------+-----------+----------------------------+----------------------------+------------+
|  ff46b5088927f7e | started | master-01 | https://192.168.30.47:2380 | https://192.168.30.47:2379 |      false |
| 2d1b0e1ae4152bff | started | master-03 | https://192.168.30.49:2380 | https://192.168.30.49:2379 |      false |
| 914d42e671b50c2c | started | master-02 | https://192.168.30.48:2380 | https://192.168.30.48:2379 |      false |
+------------------+---------+-----------+----------------------------+----------------------------+------------+
sh-4.4# etcdctl endpoint health --cluster
https://192.168.30.48:2379 is healthy: successfully committed proposal: took = 20.090987ms
https://192.168.30.47:2379 is healthy: successfully committed proposal: took = 21.686083ms
https://192.168.30.49:2379 is healthy: successfully committed proposal: took = 21.788874ms
sh-4.4# etcd --version
etcd Version: 3.4.9
Git SHA: 9d1c40d
Go Version: go1.12.12
Go OS/Arch: linux/amd64

Additional info:
This issue is cosmetic. The workaround for now is to drop the --cluster flag from the etcdctl command:
```
etcdctl endpoint health
```
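For anyone who still wants a per-member check driven by the member list, one possible approach is to keep only the https:// client URLs and pass them explicitly. The sketch below is an illustration only, not part of the fix: the function name `https_client_endpoints` is hypothetical, and it assumes the `etcdctl member list -w table` layout shown in this report, where CLIENT ADDRS is the fifth column.

```shell
#!/bin/sh
# Hypothetical helper (not part of etcdctl or OpenShift).
# Reads `etcdctl member list -w table` output on stdin and prints the
# https:// client URLs as one comma-separated list, skipping unixs:// entries.
https_client_endpoints() {
  awk -F'|' '
    # Data rows split into 8 fields on "|"; field 6 is the CLIENT ADDRS cell.
    # Border rows contain no "|" and the header cell has no https:// URL.
    NF >= 7 && $6 ~ /https:\/\// {
      n = split($6, urls, ",")
      for (i = 1; i <= n; i++) {
        gsub(/[ \t]/, "", urls[i])          # trim cell padding
        if (urls[i] ~ /^https:\/\//) {
          out = (out == "" ? "" : out ",") urls[i]
        }
      }
    }
    END { print out }
  '
}
```

Inside the etcdctl container this could be used as `etcdctl endpoint health --endpoints="$(etcdctl member list -w table | https_client_endpoints)"`. This report does not show whether the container sets `ETCDCTL_ENDPOINTS`; if it does, etcdctl may reject a conflicting `--endpoints` flag, so that is worth checking before relying on this.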
Please find below the steps and observations on a 4.8 cluster:

$ oc get nodes
NAME                                                        STATUS   ROLES    AGE     VERSION
skundu-ver-1-g8d9k-master-0.c.openshift-qe.internal         Ready    master   3h23m   v1.21.1+9807387
skundu-ver-1-g8d9k-master-1.c.openshift-qe.internal         Ready    master   3h23m   v1.21.1+9807387
skundu-ver-1-g8d9k-master-2.c.openshift-qe.internal         Ready    master   3h23m   v1.21.1+9807387
skundu-ver-1-g8d9k-worker-a-pc869.c.openshift-qe.internal   Ready    worker   3h16m   v1.21.1+9807387
skundu-ver-1-g8d9k-worker-b-z4r5c.c.openshift-qe.internal   Ready    worker   3h16m   v1.21.1+9807387
skundu-ver-1-g8d9k-worker-c-2xnvk.c.openshift-qe.internal   Ready    worker   3h17m   v1.21.1+9807387
________________________________________________________________________________
$ oc rsh -n openshift-etcd etcd-skundu-ver-1-g8d9k-master-0.c.openshift-qe.internal
Defaulting container name to etcdctl.
Use 'oc describe pod/etcd-skundu-ver-1-g8d9k-master-0.c.openshift-qe.internal -n openshift-etcd' to see all of the containers in this pod.
sh-4.4#
sh-4.4# etcd --version
etcd Version: 3.4.14
Git SHA: 95a9769
Go Version: go1.12.12
Go OS/Arch: linux/amd64
sh-4.4# etcdctl member list -w table
+------------------+---------+-----------------------------------------------------+-----------------------+------------------------------------------+------------+
|        ID        | STATUS  |                        NAME                         |      PEER ADDRS       |               CLIENT ADDRS               | IS LEARNER |
+------------------+---------+-----------------------------------------------------+-----------------------+------------------------------------------+------------+
| 44f821c73f39e4fc | started | skundu-ver-1-g8d9k-master-2.c.openshift-qe.internal | https://10.0.0.2:2380 | https://10.0.0.2:2379,unixs://10.0.0.2:0 |      false |
| 6b00e473bd74e3cb | started | skundu-ver-1-g8d9k-master-1.c.openshift-qe.internal | https://10.0.0.5:2380 | https://10.0.0.5:2379,unixs://10.0.0.5:0 |      false |
| a1a7b97340cb643c | started | skundu-ver-1-g8d9k-master-0.c.openshift-qe.internal | https://10.0.0.4:2380 | https://10.0.0.4:2379,unixs://10.0.0.4:0 |      false |
+------------------+---------+-----------------------------------------------------+-----------------------+------------------------------------------+------------+
________________________________________________________________________________
sh-4.4# etcdctl endpoint health --cluster
{"level":"warn","ts":"2021-08-24T14:54:12.808Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-8d09a2a4-911c-45f1-9053-6f6115b2551f/10.0.0.5:0","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix 10.0.0.5:0: connect: no such file or directory\""}
{"level":"warn","ts":"2021-08-24T14:54:12.808Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-8c368e3c-5ebb-4e70-9b9d-f8ba25bbab1c/10.0.0.4:0","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix 10.0.0.4:0: connect: no such file or directory\""}
{"level":"warn","ts":"2021-08-24T14:54:12.808Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-aeabbe4e-9930-4a77-ba02-795b82c8541d/10.0.0.2:0","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix 10.0.0.2:0: connect: no such file or directory\""}
https://10.0.0.4:2379 is healthy: successfully committed proposal: took = 16.253078ms
https://10.0.0.5:2379 is healthy: successfully committed proposal: took = 20.409062ms
https://10.0.0.2:2379 is healthy: successfully committed proposal: took = 21.275478ms
unixs://10.0.0.5:0 is unhealthy: failed to commit proposal: context deadline exceeded
unixs://10.0.0.4:0 is unhealthy: failed to commit proposal: context deadline exceeded
unixs://10.0.0.2:0 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster
sh-4.4#
________________________________________________________________________________
The issue as reported in the bug continues to exist on 4.8.
________________________________________________________________________________
sh-4.4# etcdctl endpoint health
https://10.0.0.5:2379 is healthy: successfully committed proposal: took = 29.146886ms
https://10.0.0.4:2379 is healthy: successfully committed proposal: took = 29.136991ms
https://10.0.0.2:2379 is healthy: successfully committed proposal: took = 41.413365ms
________________________________________________________________________________
The workaround mentioned above (running without the --cluster flag) works correctly.
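To tell the cosmetic case apart from a real failure when `--cluster` is used, the health lines can be checked for any unhealthy endpoint that is not a unixs:// address. This is a rough sketch under the assumption that the output keeps the "<endpoint> is unhealthy: ..." wording shown in this report; the function name `only_unixs_unhealthy` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of etcdctl or OpenShift).
# Reads `etcdctl endpoint health --cluster` output on stdin.
# Exit status 0: every "is unhealthy" line refers to a unixs:// endpoint
#                (the cosmetic case described in this bug), or none exist.
# Exit status 1: at least one non-unixs endpoint is reported unhealthy.
only_unixs_unhealthy() {
  awk '
    / is unhealthy: / && $1 !~ /^unixs:\/\// { bad = 1 }
    END { exit bad + 0 }
  '
}
```

A possible invocation is `etcdctl endpoint health --cluster 2>/dev/null | only_unixs_unhealthy && echo "only unixs endpoints failed"`. Note that this ignores etcdctl's own exit status, which is non-zero in the cosmetic case as well.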
@skundu, I will change the bug status for you based on comment 3, since you do not have the access rights to change it.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759