Description of problem:

After installing a fresh OCP 4.4.3 cluster, etcd reports one extra endpoint that is not actually a member.

# oc get pods -o wide
NAME                                                           READY   STATUS      RESTARTS   AGE   IP             NODE                                         NOMINATED NODE   READINESS GATES
etcd-ip-10-0-134-104.us-east-2.compute.internal                3/3     Running     3          23m   10.0.134.104   ip-10-0-134-104.us-east-2.compute.internal   <none>           <none>
etcd-ip-10-0-136-69.us-east-2.compute.internal                 3/3     Running     0          20m   10.0.136.69    ip-10-0-136-69.us-east-2.compute.internal    <none>           <none>
etcd-ip-10-0-154-215.us-east-2.compute.internal                3/3     Running     0          21m   10.0.154.215   ip-10-0-154-215.us-east-2.compute.internal   <none>           <none>
installer-2-ip-10-0-134-104.us-east-2.compute.internal         0/1     Completed   0          23m   10.130.0.6     ip-10-0-134-104.us-east-2.compute.internal   <none>           <none>
installer-2-ip-10-0-136-69.us-east-2.compute.internal          0/1     Completed   0          20m   10.129.0.27    ip-10-0-136-69.us-east-2.compute.internal    <none>           <none>
installer-2-ip-10-0-154-215.us-east-2.compute.internal         0/1     Completed   0          21m   10.128.0.6     ip-10-0-154-215.us-east-2.compute.internal   <none>           <none>
revision-pruner-2-ip-10-0-134-104.us-east-2.compute.internal   0/1     Completed   0          19m   10.130.0.15    ip-10-0-134-104.us-east-2.compute.internal   <none>           <none>
revision-pruner-2-ip-10-0-136-69.us-east-2.compute.internal    0/1     Completed   0          19m   10.129.0.32    ip-10-0-136-69.us-east-2.compute.internal    <none>           <none>
revision-pruner-2-ip-10-0-154-215.us-east-2.compute.internal   0/1     Completed   0          19m   10.128.0.11    ip-10-0-154-215.us-east-2.compute.internal   <none>           <none>

# oc rsh etcd-ip-10-0-134-104.us-east-2.compute.internal
Defaulting container name to etcdctl.
Use 'oc describe pod/etcd-ip-10-0-134-104.us-east-2.compute.internal -n openshift-etcd' to see all of the containers in this pod.
sh-4.2# etcdctl member list -w table
+------------------+---------+--------------------------------------------+---------------------------+---------------------------+
|        ID        | STATUS  |                    NAME                    |         PEER ADDRS        |        CLIENT ADDRS       |
+------------------+---------+--------------------------------------------+---------------------------+---------------------------+
| 6c79af56ee86e867 | started | ip-10-0-136-69.us-east-2.compute.internal  | https://10.0.136.69:2380  | https://10.0.136.69:2379  |
| b75d787d69c362a0 | started | ip-10-0-134-104.us-east-2.compute.internal | https://10.0.134.104:2380 | https://10.0.134.104:2379 |
| c7b7878df925a3ac | started | ip-10-0-154-215.us-east-2.compute.internal | https://10.0.154.215:2380 | https://10.0.154.215:2379 |
+------------------+---------+--------------------------------------------+---------------------------+---------------------------+

sh-4.2# etcdctl endpoint health
{"level":"warn","ts":"2020-05-13T13:04:47.313Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-7daea3d5-be74-4875-b074-2bfbc87a52ee/10.0.9.185:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
https://10.0.134.104:2379 is healthy: successfully committed proposal: took = 17.095487ms
https://10.0.136.69:2379 is healthy: successfully committed proposal: took = 19.99018ms
https://10.0.154.215:2379 is healthy: successfully committed proposal: took = 20.144878ms
https://10.0.9.185:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster

Here, the 10.0.9.185:2379 endpoint is extra: this is a three-master-node environment.

# oc describe pod etcd-ip-10-0-134-104.us-east-2.compute.internal
~~~
Environment:
  ALL_ETCD_ENDPOINTS: https://10.0.134.104:2379,https://10.0.154.215:2379,https://10.0.136.69:2379,https://10.0.9.185:2379
  [..]
  ETCDCTL_ENDPOINTS: https://10.0.134.104:2379,https://10.0.154.215:2379,https://10.0.136.69:2379,https://10.0.9.185:2379
  [..]
  ETCD_IMAGE: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a8f9978516adb30da807b5b30551348223827419ad0666905a6f8792bf51462c
  ETCD_INITIAL_CLUSTER_STATE: existing
  ETCD_QUOTA_BACKEND_BYTES: 7516192768
  NODE_ip_10_0_134_104_us_east_2_compute_internal_ETCD_NAME: ip-10-0-134-104.us-east-2.compute.internal
  NODE_ip_10_0_134_104_us_east_2_compute_internal_ETCD_URL_HOST: 10.0.134.104
  NODE_ip_10_0_134_104_us_east_2_compute_internal_IP: 10.0.134.104
  NODE_ip_10_0_136_69_us_east_2_compute_internal_ETCD_NAME: ip-10-0-136-69.us-east-2.compute.internal
  NODE_ip_10_0_136_69_us_east_2_compute_internal_ETCD_URL_HOST: 10.0.136.69
  NODE_ip_10_0_136_69_us_east_2_compute_internal_IP: 10.0.136.69
  NODE_ip_10_0_154_215_us_east_2_compute_internal_ETCD_NAME: ip-10-0-154-215.us-east-2.compute.internal
  NODE_ip_10_0_154_215_us_east_2_compute_internal_ETCD_URL_HOST: 10.0.154.215
  NODE_ip_10_0_154_215_us_east_2_compute_internal_IP: 10.0.154.215
  NODE_IP: (v1:status.podIP)
~~~

# oc get nodes -o wide
NAME                                         STATUS   ROLES    AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
ip-10-0-134-104.us-east-2.compute.internal   Ready    master   31m   v1.17.1   10.0.134.104   <none>        Red Hat Enterprise Linux CoreOS 44.81.202004260825-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.17.4-8.dev.rhaos4.4.git5f5c5e4.el8
ip-10-0-134-206.us-east-2.compute.internal   Ready    worker   23m   v1.17.1   10.0.134.206   <none>        Red Hat Enterprise Linux CoreOS 44.81.202004260825-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.17.4-8.dev.rhaos4.4.git5f5c5e4.el8
ip-10-0-136-69.us-east-2.compute.internal    Ready    master   32m   v1.17.1   10.0.136.69    <none>        Red Hat Enterprise Linux CoreOS 44.81.202004260825-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.17.4-8.dev.rhaos4.4.git5f5c5e4.el8
ip-10-0-142-242.us-east-2.compute.internal   Ready    worker   22m   v1.17.1   10.0.142.242   <none>        Red Hat Enterprise Linux CoreOS 44.81.202004260825-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.17.4-8.dev.rhaos4.4.git5f5c5e4.el8
ip-10-0-154-215.us-east-2.compute.internal   Ready    master   32m   v1.17.1   10.0.154.215   <none>        Red Hat Enterprise Linux CoreOS 44.81.202004260825-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.17.4-8.dev.rhaos4.4.git5f5c5e4.el8
ip-10-0-154-99.us-east-2.compute.internal    Ready    worker   22m   v1.17.1   10.0.154.99    <none>        Red Hat Enterprise Linux CoreOS 44.81.202004260825-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.17.4-8.dev.rhaos4.4.git5f5c5e4.el8

Version-Release number of selected component (if applicable):

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.3     True        False         11m     Cluster version is 4.4.3

How reproducible:
Always

Steps to Reproduce:
1. Create a new OCP 4.4.3 cluster
2. Check the describe output of an etcd pod

Actual results:
The etcd pod environment lists an additional endpoint that is not a master node, although `etcdctl member list` shows only three members.

Expected results:
There should be three etcd endpoints in a three-master-node environment.

Additional info:
*** Bug 1836345 has been marked as a duplicate of this bug. ***
*** Bug 1832923 has been marked as a duplicate of this bug. ***
There is no functional impact to the cluster, but because we populate the ENV for the etcdctl container, the results can be unexpected, as observed. Rolling a new revision after bootstrap is removed does not make sense, so the general question is: should bootstrap ever be added to this list?

Possible solutions:
- drop the bootstrap endpoint from the list
- document the `--cluster` flag
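For illustration of the first option: dropping the bootstrap endpoint amounts to filtering one entry out of a comma-separated list before it reaches ETCDCTL_ENDPOINTS. A minimal sketch, not the operator's actual code — the endpoint values are taken from this report and the BOOTSTRAP_IP variable is a hypothetical name:

```shell
# Hypothetical sketch: remove the stale bootstrap endpoint from the
# comma-separated endpoint list before it is exported to the container.
ALL_ETCD_ENDPOINTS="https://10.0.134.104:2379,https://10.0.154.215:2379,https://10.0.136.69:2379,https://10.0.9.185:2379"
BOOTSTRAP_IP="10.0.9.185"   # assumed known; in this report it is the defunct bootstrap node

# Split on commas, drop any entry containing the bootstrap IP, rejoin.
ETCDCTL_ENDPOINTS=$(echo "$ALL_ETCD_ENDPOINTS" | tr ',' '\n' | grep -v "$BOOTSTRAP_IP" | paste -sd, -)

echo "$ETCDCTL_ENDPOINTS"
# -> https://10.0.134.104:2379,https://10.0.154.215:2379,https://10.0.136.69:2379
```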
Looked into this a little more. We consume the ETCDCTL_* environment variables in the etcd container[1] with `etcdctl member list`. If we instead passed $ALL_ETCD_ENDPOINTS via a flag, i.e.

  etcdctl member list --endpoints $ALL_ETCD_ENDPOINTS

this would allow us to keep the ETCDCTL_ENDPOINTS variable clean of bootstrap. The client balancer would handle any failover gracefully.

[1] https://github.com/openshift/cluster-etcd-operator/blob/master/bindata/etcd/pod.yaml#L124
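The proposal relies on an explicit `--endpoints` flag taking precedence over the ETCDCTL_ENDPOINTS environment variable. The stand-in function below (`fake_etcdctl` is hypothetical, not the real binary) only sketches that precedence with made-up endpoint values:

```shell
# Illustrative only: env var still contains the stale bootstrap endpoint,
# while the flag value stays clean.
export ETCDCTL_ENDPOINTS="https://10.0.134.104:2379,https://10.0.9.185:2379"
ALL_ETCD_ENDPOINTS="https://10.0.134.104:2379,https://10.0.136.69:2379"

# Stand-in for etcdctl's endpoint resolution: flag wins over environment.
fake_etcdctl() {
  if [ "$1" = "--endpoints" ]; then
    echo "using: $2"
  else
    echo "using: $ETCDCTL_ENDPOINTS"
  fi
}

fake_etcdctl                                    # falls back to the env (includes bootstrap)
fake_etcdctl --endpoints "$ALL_ETCD_ENDPOINTS"  # explicit flag, bootstrap-free
```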
*** Bug 1815049 has been marked as a duplicate of this bug. ***
We want to fix this, but because we're closing out the 4.5 release and there's no functional impact from the leftover environment, we're deferring the fix to 4.6.
(In reply to Dan Mace from comment #11)
> We want to fix this, but because we're closing out the 4.5 release and
> there's no functional impact from the leftover environment, we're deferring
> the fix to 4.6.

I disagree; what we're seeing in the console is caused by this, imo (please correct me if not):

[root@tatooine ~]# oc get pods -o wide
NAME                                                 READY   STATUS      RESTARTS   AGE     IP             NODE                               NOMINATED NODE   READINESS GATES
etcd-master01.ocp4.lab.msp.redhat.com                3/3     Running     0          3d23h   10.15.108.83   master01.ocp4.lab.msp.redhat.com   <none>           <none>
etcd-master02.ocp4.lab.msp.redhat.com                3/3     Running     0          3d23h   10.15.108.84   master02.ocp4.lab.msp.redhat.com   <none>           <none>
etcd-master03.ocp4.lab.msp.redhat.com                3/3     Running     3          3d23h   10.15.108.85   master03.ocp4.lab.msp.redhat.com   <none>           <none>
installer-2-master01.ocp4.lab.msp.redhat.com         0/1     Completed   0          3d23h   10.129.0.14    master01.ocp4.lab.msp.redhat.com   <none>           <none>
installer-2-master02.ocp4.lab.msp.redhat.com         0/1     Completed   0          3d23h   10.128.0.29    master02.ocp4.lab.msp.redhat.com   <none>           <none>
installer-2-master03.ocp4.lab.msp.redhat.com         0/1     Completed   0          3d23h   10.130.0.3     master03.ocp4.lab.msp.redhat.com   <none>           <none>
revision-pruner-2-master01.ocp4.lab.msp.redhat.com   0/1     Completed   0          3d23h   10.129.0.34    master01.ocp4.lab.msp.redhat.com   <none>           <none>
revision-pruner-2-master02.ocp4.lab.msp.redhat.com   0/1     Completed   0          3d23h   10.128.0.41    master02.ocp4.lab.msp.redhat.com   <none>           <none>
revision-pruner-2-master03.ocp4.lab.msp.redhat.com   0/1     Completed   0          3d23h   10.130.0.27    master03.ocp4.lab.msp.redhat.com   <none>           <none>

[root@tatooine ~]# oc rsh etcd-master01.ocp4.lab.msp.redhat.com
Defaulting container name to etcdctl.
Use 'oc describe pod/etcd-master01.ocp4.lab.msp.redhat.com -n openshift-etcd' to see all of the containers in this pod.
sh-4.2# etcdctl endpoint health
{"level":"warn","ts":"2020-06-01T17:15:04.758Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-2c677a27-a498-48d3-a321-01b5938ba5d3/10.15.108.175:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
https://10.15.108.83:2379 is healthy: successfully committed proposal: took = 25.468676ms
https://10.15.108.85:2379 is healthy: successfully committed proposal: took = 26.224467ms
https://10.15.108.84:2379 is healthy: successfully committed proposal: took = 27.442727ms
https://10.15.108.175:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster

*************** THIS ***************
</snip>

The console is spitting out messages that the `etcd operator shows all 3 master nodes as unhealthy members` [1], even though there's only one endpoint that is unhealthy, which was the bootstrap node. We need to have this cleaned up in the console.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1838630
(In reply to Sam Yangsao from comment #12)
> (In reply to Dan Mace from comment #11)
> > We want to fix this, but because we're closing out the 4.5 release and
> > there's no functional impact from the leftover environment, we're deferring
> > the fix to 4.6.
>
> I disagree, what we're seeing in the console is from this imo (please
> correct me if not)
>
> [pod listing snipped -- see comment #12]
>
> sh-4.2# etcdctl endpoint health
> [...]
> https://10.15.108.175:2379 is unhealthy: failed to commit proposal: context
> deadline exceeded
> Error: unhealthy cluster *************** THIS ***************
>
> </snip>

Try `etcdctl endpoint health --cluster` instead, to ignore the environment variables. Apply the `--cluster` option to other command executions as well. The current docs should probably mention this, given the stale info in the environment variable.

I'm not sure I see what you mean regarding the console. You said:

> The console is spitting out messages that the `etcd operator shows all 3
> master nodes as unhealthy members` [1]

Are you saying the console is reporting 3 unhealthy master nodes? If all three master nodes were unhealthy, I'd expect the entire cluster to be dead. Can you clarify how that relates to the bootstrap member in the environment of the etcdctl utility container?

> even though there's only one that
> is unhealthy which was the bootstrap node.

I don't think this is exactly true. There is no bootstrap member anymore at this point, so it has no status according to the system. I do agree that without the `--cluster` arg, etcdctl will by default try to talk to the defunct bootstrap node, but that doesn't have anything to do with how OpenShift views the etcd cluster.
Can you help me understand what functional impact you're describing outside the etcdctl utility container?

All that said, https://github.com/openshift/cluster-etcd-operator/pull/370 is likely the fix for removing the bootstrap variable from the environment.
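As a side note, the point above — that only the defunct bootstrap endpoint fails while the three members stay healthy — can be checked mechanically from captured `etcdctl endpoint health` output. A minimal shell sketch using the sample lines from comment #0 of this report as canned data:

```shell
# Sketch: count healthy vs unhealthy endpoints in captured health output.
# The sample text is copied from this report; a real run would capture
# `etcdctl endpoint health` output instead.
health_output='https://10.0.134.104:2379 is healthy: successfully committed proposal: took = 17.095487ms
https://10.0.136.69:2379 is healthy: successfully committed proposal: took = 19.99018ms
https://10.0.154.215:2379 is healthy: successfully committed proposal: took = 20.144878ms
https://10.0.9.185:2379 is unhealthy: failed to commit proposal: context deadline exceeded'

healthy=$(printf '%s\n' "$health_output" | grep -c ' is healthy:')
unhealthy=$(printf '%s\n' "$health_output" | grep -c ' is unhealthy:')

echo "healthy=$healthy unhealthy=$unhealthy"
# -> healthy=3 unhealthy=1
```

Three healthy members and a single failing stale endpoint is exactly the pattern this bug produces, as opposed to a genuinely unhealthy cluster.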
*** Bug 1846250 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196