# Bug 1835238 — Extra endpoint in etcd [4.4.3]

| Field | Value |
|---|---|
| Product | OpenShift Container Platform |
| Component | Etcd |
| Version | 4.4 |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | medium |
| Reporter | Aditya Deshpande <adeshpan> |
| Assignee | Suresh Kolichala <skolicha> |
| QA Contact | ge liu <geliu> |
| CC | aprajapa, dahernan, dmace, enorling, geliu, jlee, jmalde, pamoedom, rugouvei, sbatsche, skolicha, susuresh, syangsao |
| Target Milestone | --- |
| Target Release | 4.6.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Type | Bug |
| Doc Type | Bug Fix |
| Clones | 1859684 (view as bug list) |
| Bug Blocks | 1859684 |
| Last Closed | 2020-10-27 15:59:18 UTC |

Doc Text:

> Cause: The bootstrap endpoint in ETCDCTL_ENDPOINTS is not removed after the bootstrap node is removed.
> Consequence: etcdctl commands show unexpected errors.
> Fix: Do not include the bootstrap endpoint in ETCDCTL_ENDPOINTS at all.
> Result: etcdctl commands no longer report errors against the bootstrap endpoint.
## Description

**Aditya Deshpande, 2020-05-13 13:09:39 UTC**

*** Bug 1836345 has been marked as a duplicate of this bug. ***

*** Bug 1832923 has been marked as a duplicate of this bug. ***

---

There is no functional impact to the cluster, but because we do populate the environment for the etcdctl container, the results can be unexpected, as observed. Rolling a new revision after bootstrap is dropped does not make sense, so the general question is: should bootstrap ever be added to this list?

Possible solutions:

- Drop the bootstrap endpoint from the list.
- Document the `--cluster` flag.

Looked into this a little more. We consume the ETCDCTL_* environment variables in the etcd container [1] with `etcdctl member list`. If we instead passed `$ALL_ETCD_ENDPOINTS` via a flag:

```
etcdctl member list --endpoints $ALL_ETCD_ENDPOINTS
```

this would allow us to keep the `ETCDCTL_ENDPOINTS` variable clean of bootstrap. The client balancer would handle any failover gracefully.

[1] https://github.com/openshift/cluster-etcd-operator/blob/master/bindata/etcd/pod.yaml#L124

*** Bug 1815049 has been marked as a duplicate of this bug. ***

---

**Dan Mace (comment #11):**

We want to fix this, but because we're closing out the 4.5 release and there's no functional impact from the leftover environment, we're deferring the fix to 4.6.

---

**Sam Yangsao (comment #12):**

(In reply to Dan Mace from comment #11)
> We want to fix this, but because we're closing out the 4.5 release and
> there's no functional impact from the leftover environment, we're deferring
> the fix to 4.6.
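The first proposed solution above, dropping the bootstrap endpoint from the list, amounts to filtering one entry out of a comma-separated endpoint string. A minimal sketch of that idea in shell, using the IPs from the report below to stand in for real values (on a cluster, the list comes from the etcd pod's `ETCDCTL_ENDPOINTS` environment variable):

```shell
# Hypothetical endpoint list; 10.15.108.175 stands in for the defunct
# bootstrap node, the others for the three masters.
ENDPOINTS="https://10.15.108.175:2379,https://10.15.108.83:2379,https://10.15.108.84:2379,https://10.15.108.85:2379"
BOOTSTRAP_IP="10.15.108.175"

# Split on commas, drop any entry containing the bootstrap IP, re-join.
CLEANED=$(printf '%s\n' "$ENDPOINTS" | tr ',' '\n' | grep -v "$BOOTSTRAP_IP" | paste -sd, -)

echo "$CLEANED"
```

The eventual fix took the stronger form of never adding the bootstrap endpoint to the variable in the first place, rather than stripping it afterwards.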
I disagree; what we're seeing in the console is from this, IMO (please correct me if not):

```
[root@tatooine ~]# oc get pods -o wide
NAME                                                 READY   STATUS      RESTARTS   AGE     IP             NODE                               NOMINATED NODE   READINESS GATES
etcd-master01.ocp4.lab.msp.redhat.com                3/3     Running     0          3d23h   10.15.108.83   master01.ocp4.lab.msp.redhat.com   <none>           <none>
etcd-master02.ocp4.lab.msp.redhat.com                3/3     Running     0          3d23h   10.15.108.84   master02.ocp4.lab.msp.redhat.com   <none>           <none>
etcd-master03.ocp4.lab.msp.redhat.com                3/3     Running     3          3d23h   10.15.108.85   master03.ocp4.lab.msp.redhat.com   <none>           <none>
installer-2-master01.ocp4.lab.msp.redhat.com         0/1     Completed   0          3d23h   10.129.0.14    master01.ocp4.lab.msp.redhat.com   <none>           <none>
installer-2-master02.ocp4.lab.msp.redhat.com         0/1     Completed   0          3d23h   10.128.0.29    master02.ocp4.lab.msp.redhat.com   <none>           <none>
installer-2-master03.ocp4.lab.msp.redhat.com         0/1     Completed   0          3d23h   10.130.0.3     master03.ocp4.lab.msp.redhat.com   <none>           <none>
revision-pruner-2-master01.ocp4.lab.msp.redhat.com   0/1     Completed   0          3d23h   10.129.0.34    master01.ocp4.lab.msp.redhat.com   <none>           <none>
revision-pruner-2-master02.ocp4.lab.msp.redhat.com   0/1     Completed   0          3d23h   10.128.0.41    master02.ocp4.lab.msp.redhat.com   <none>           <none>
revision-pruner-2-master03.ocp4.lab.msp.redhat.com   0/1     Completed   0          3d23h   10.130.0.27    master03.ocp4.lab.msp.redhat.com   <none>           <none>

[root@tatooine ~]# oc rsh etcd-master01.ocp4.lab.msp.redhat.com
Defaulting container name to etcdctl.
Use 'oc describe pod/etcd-master01.ocp4.lab.msp.redhat.com -n openshift-etcd' to see all of the containers in this pod.

sh-4.2# etcdctl endpoint health
{"level":"warn","ts":"2020-06-01T17:15:04.758Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-2c677a27-a498-48d3-a321-01b5938ba5d3/10.15.108.175:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
https://10.15.108.83:2379 is healthy: successfully committed proposal: took = 25.468676ms
https://10.15.108.85:2379 is healthy: successfully committed proposal: took = 26.224467ms
https://10.15.108.84:2379 is healthy: successfully committed proposal: took = 27.442727ms
https://10.15.108.175:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster    <-- *************** THIS ***************
```

The console is spitting out messages that the "etcd operator shows all 3 master nodes as unhealthy members" [1], even though there's only one that is unhealthy, which was the bootstrap node. We need to have this cleaned up in the console.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1838630

---

**Dan Mace:**

(In reply to Sam Yangsao from comment #12)
> (In reply to Dan Mace from comment #11)
> > We want to fix this, but because we're closing out the 4.5 release and
> > there's no functional impact from the leftover environment, we're deferring
> > the fix to 4.6.
> I disagree, what we're seeing in the console is from this imo (please
> correct me if not)
>
> [quoted `oc get pods` and `etcdctl endpoint health` output snipped]

Try `etcdctl endpoint health --cluster` instead to ignore the environment variables, and apply the `--cluster` option to other command executions as well. The current docs should probably mention this, given the stale info in the environment variable.

I'm not sure I see what you mean regarding the console. You said:

> The console is spitting out messages that the `etcd operator shows all 3
> master nodes as unhealthy members` [1]

Are you saying the console is reporting 3 unhealthy master nodes? If all three master nodes were unhealthy, I'd expect the entire cluster to be dead. Can you clarify how that relates to the bootstrap member in the environment of the etcdctl utility container?

> even though there's only one that is unhealthy which was the bootstrap node.

I don't think this is exactly true. There is no bootstrap member anymore at this point, so it has no status according to the system. I do agree that without the `--cluster` arg, etcdctl will by default try to talk to the defunct bootstrap node, but that doesn't have anything to do with how OpenShift views the etcd cluster. Can you help me understand what functional impact you're describing outside the etcdctl utility container?

All that said, https://github.com/openshift/cluster-etcd-operator/pull/370 is likely the fix for removing the bootstrap variable from the environment.

---

*** Bug 1846250 has been marked as a duplicate of this bug. ***
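The `--cluster` workaround described above can be wrapped so that health checks always consult the live member list rather than the possibly stale `ETCDCTL_ENDPOINTS` variable. A sketch under stated assumptions: the wrapper function name is made up for illustration, while `etcdctl endpoint health --cluster` is the invocation suggested in this bug; on a real cluster it would run inside the etcdctl container (`oc rsh etcd-<master-node>`).

```shell
# Hypothetical wrapper: always pass --cluster so etcdctl discovers
# endpoints from the current member list, ignoring any leftover
# bootstrap entry in ETCDCTL_ENDPOINTS.
etcd_health() {
    etcdctl endpoint health --cluster "$@"
}
```

Any extra arguments (for example `-w table`) are forwarded to `etcdctl` unchanged.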
---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196