Bug 1814576 - Bootstrap stuck on waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd Operator
Version: 4.4
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Suresh Kolichala
QA Contact: ge liu
URL:
Whiteboard:
Duplicates: 1832120 1833050
Depends On:
Blocks: 1837152
Reported: 2020-03-18 10:04 UTC by Steven Ellis
Modified: 2021-01-05 05:30 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-13 17:22:24 UTC
Target Upstream Version:


Attachments
Output from oc adm must-gather (18.52 MB, application/gzip)
2020-03-18 10:04 UTC, Steven Ellis
Output from oc adm must-gather on a clean ocp 4.4rc6 install (19.41 MB, application/x-bzip)
2020-04-03 00:50 UTC, Steven Ellis
etcd-1-pod (20.13 KB, text/plain)
2020-04-03 02:41 UTC, Sam Batschelet


Links
System ID Private Priority Status Summary Last Updated
Github openshift etcd pull 48 0 None closed Bug 1814576: make evaluation of targetMember strict 2021-02-08 16:57:59 UTC
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:22:45 UTC

Internal Links: 1833160

Description Steven Ellis 2020-03-18 10:04:04 UTC
Created attachment 1671020 [details]
Output from oc adm must-gather

Description of problem:

OpenShift installed with openshift-install-linux-4.4.0-rc.1 on bare metal. Bootstrap never completes, and the error shown under "Actual results" appears in the bootstrap logs.

openshift-install-linux-4.4.0
openshift-client-linux-4.4.0-rc.1.tar.gz-rc.1 


How reproducible:

Steps to Reproduce:
1. openshift-install --dir=baremetal wait-for bootstrap-complete \
      --log-level=info
2. ssh core@bootstrap
3. journalctl -b -f -u bootkube.service


Actual results:

Mar 18 09:56:18 bootstrap.test.bionode.io bootkube.sh[2015]: I0318 09:56:18.486144       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.
Mar 18 09:56:21 bootstrap.test.bionode.io bootkube.sh[2015]: I0318 09:56:21.672658       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.
Mar 18 09:56:32 bootstrap.test.bionode.io bootkube.sh[2015]: I0318 09:56:32.143051       1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.
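The log lines above show bootkube polling a condition on the etcd CR. As a rough sketch of that kind of check, the following Go snippet reads a JSON dump of the CR and reports whether a named condition is "True". It assumes the CR exposes its conditions under status.conditions with "type" and "status" fields, as OpenShift operator resources commonly do. The type names, the helper conditionIsTrue, and the sample JSON are hypothetical and are not taken from waitforceo.go.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// condition mirrors the assumed shape of one entry in status.conditions.
type condition struct {
	Type   string `json:"type"`
	Status string `json:"status"`
}

// etcdResource holds only the fields this sketch needs from the CR.
type etcdResource struct {
	Status struct {
		Conditions []condition `json:"conditions"`
	} `json:"status"`
}

// conditionIsTrue reports whether the named condition exists and has
// status "True". A missing condition is treated as not true.
func conditionIsTrue(raw []byte, condType string) (bool, error) {
	var res etcdResource
	if err := json.Unmarshal(raw, &res); err != nil {
		return false, err
	}
	for _, c := range res.Status.Conditions {
		if c.Type == condType {
			return c.Status == "True", nil
		}
	}
	return false, nil
}

func main() {
	// Hypothetical sample, standing in for a JSON dump of the etcd CR.
	sample := []byte(`{"status":{"conditions":[{"type":"EtcdRunningInCluster","status":"False"}]}}`)
	ok, err := conditionIsTrue(sample, "EtcdRunningInCluster")
	fmt.Println(ok, err)
}
```

While the condition stays anything other than "True", a check like this keeps returning false, which matches the repeated "waiting on condition" messages.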


Expected results:

Bootstrap should complete and not time out.

Additional info:

Attached logs from oc adm must-gather

Cluster ID - f9cd09b8-8454-4bac-8ed1-70c712bd2a66

Comment 1 Steven Ellis 2020-03-18 10:07:13 UTC
Deployment environment is 3 node converged master/worker

The install config is

apiVersion: v1
baseDomain: bionode.io
compute:
- hyperthreading: Enabled   
  name: worker
  replicas: 0 
controlPlane:
  hyperthreading: Enabled   
  name: master 
  replicas: 3 
metadata:
  name: test 
networking:
  machineNetwork:
  - cidr: 10.1.10.0/24
  clusterNetwork:
  - cidr: 10.128.0.0/16 
    hostPrefix: 24 
  networkType: OpenShiftSDN
  serviceNetwork: 
  - 172.30.0.0/1


The machineNetwork entry is recent and the error occurs even if it isn't specified.

As I've got limited bare metal resources I leave the masters schedulable as workers.

Comment 2 Steven Ellis 2020-03-18 10:08:37 UTC
For the initial UEFI PXE bootstrap I'm specifying rhcos-4.4.0-rc.1-x86_64-metal.x86_64.raw.gz for the bootstrap and master/worker nodes.

Comment 3 Steven Ellis 2020-03-18 14:57:17 UTC
oc logs pod/etcd-etcd-1.test.bionode.io -c etcd -n openshift-etcd
Error: open /etc/kubernetes/static-pod-resources/etcd-certs/secrets/etcd-all-peer/etcd-peer-etcd-1.test.bionode.io.crt: no such file or directory
memberDir /var/lib/etcd/member is present on etcd-1.test.bionode.io
#### attempt 0
      member={name="", peerURLs=[https://10.1.10.3:2380}, clientURLs=[]
      member={name="etcd-0.test.bionode.io", peerURLs=[https://10.1.10.2:2380}, clientURLs=[https://10.1.10.2:2379]
      member={name="etcd-bootstrap", peerURLs=[https://10.1.10.31:2380}, clientURLs=[https://10.1.10.31:2379]
      target={name="etcd-bootstrap", peerURLs=[https://10.1.10.31:2380}, clientURLs=[https://10.1.10.31:2379], err=<nil>
mv: cannot stat '/etc/kubernetes/manifests/etcd-member.yaml': No such file or directory
Waiting for ports 2379, 2380 and 9978 to be released.
ETCD_PORT_9979_TCP=tcp://172.30.9.230:9979
ETCDCTL_CERT=/etc/kubernetes/static-pod-resources/etcd-certs/secrets/etcd-all-peer/etcd-peer-etcd-1.test.bionode.io.crt
ETCD_IMAGE=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3826f672aacf734fafbdc81165dd9c81a3c9c1e89ab6f75477a83569b0538a0e
ETCD_PORT_9979_TCP_PORT=9979
ETCD_PORT_2379_TCP_PORT=2379
ETCD_PORT_2379_TCP_PROTO=tcp
ETCD_INITIAL_CLUSTER_STATE=existing
ETCD_PORT_9979_TCP_ADDR=172.30.9.230
ALL_ETCD_ENDPOINTS=https://10.1.10.2:2379,https://10.1.10.3:2379,https://10.1.10.4:2379,https://10.1.10.31:2379
ETCDCTL_ENDPOINTS=https://10.1.10.2:2379,https://10.1.10.3:2379,https://10.1.10.4:2379,https://10.1.10.31:2379
ETCD_INITIAL_CLUSTER=etcd-0.test.bionode.io=https://10.1.10.2:2380,etcd-bootstrap=https://10.1.10.31:2380
ETCD_ELECTION_TIMEOUT=1000
ETCD_SERVICE_PORT_ETCD=2379
ETCD_SERVICE_PORT_ETCD_METRICS=9979
ETCDCTL_CACERT=/etc/kubernetes/static-pod-resources/etcd-certs/configmaps/etcd-serving-ca/ca-bundle.crt
ETCD_NAME=etcd-1.test.bionode.io
ETCD_QUOTA_BACKEND_BYTES=7516192768
ETCD_SERVICE_PORT=2379
ETCD_PORT_2379_TCP_ADDR=172.30.9.230
ETCDCTL_API=3
ETCD_DATA_DIR=/var/lib/etcd
ETCD_PORT_9979_TCP_PROTO=tcp
ETCD_PORT_2379_TCP=tcp://172.30.9.230:2379
ETCD_PORT=tcp://172.30.9.230:2379
ETCD_HEARTBEAT_INTERVAL=100
ETCDCTL_KEY=/etc/kubernetes/static-pod-resources/etcd-certs/secrets/etcd-all-peer/etcd-peer-etcd-1.test.bionode.io.key
ETCD_SERVICE_HOST=172.30.9.230
+ exec etcd --initial-advertise-peer-urls=https://10.1.10.3:2380 --cert-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-serving/etcd-serving-etcd-1.test.bionode.io.crt --key-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-serving/etcd-serving-etcd-1.test.bionode.io.key --trusted-ca-file=/etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt --client-cert-auth=true --peer-cert-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-etcd-1.test.bionode.io.crt --peer-key-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-etcd-1.test.bionode.io.key --peer-trusted-ca-file=/etc/kubernetes/static-pod-certs/configmaps/etcd-peer-client-ca/ca-bundle.crt --peer-client-cert-auth=true --advertise-client-urls=https://10.1.10.3:2379 --listen-client-urls=https://0.0.0.0:2379 --listen-peer-urls=https://0.0.0.0:2380 --listen-metrics-urls=https://0.0.0.0:9978
2020-03-18 14:13:19.019386 I | pkg/flags: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd
2020-03-18 14:13:19.019432 I | pkg/flags: recognized and used environment variable ETCD_ELECTION_TIMEOUT=1000
2020-03-18 14:13:19.019443 I | pkg/flags: recognized and used environment variable ETCD_HEARTBEAT_INTERVAL=100
2020-03-18 14:13:19.019448 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER=etcd-0.test.bionode.io=https://10.1.10.2:2380,etcd-bootstrap=https://10.1.10.31:2380
2020-03-18 14:13:19.019451 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=existing
2020-03-18 14:13:19.019462 I | pkg/flags: recognized and used environment variable ETCD_NAME=etcd-1.test.bionode.io
2020-03-18 14:13:19.019473 I | pkg/flags: recognized and used environment variable ETCD_QUOTA_BACKEND_BYTES=7516192768
2020-03-18 14:13:19.019484 W | pkg/flags: unrecognized environment variable ETCD_PORT_9979_TCP=tcp://172.30.9.230:9979
2020-03-18 14:13:19.019490 W | pkg/flags: unrecognized environment variable ETCD_IMAGE=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3826f672aacf734fafbdc81165dd9c81a3c9c1e89ab6f75477a83569b0538a0e
2020-03-18 14:13:19.019493 W | pkg/flags: unrecognized environment variable ETCD_PORT_9979_TCP_PORT=9979
2020-03-18 14:13:19.019496 W | pkg/flags: unrecognized environment variable ETCD_PORT_2379_TCP_PORT=2379
2020-03-18 14:13:19.019498 W | pkg/flags: unrecognized environment variable ETCD_PORT_2379_TCP_PROTO=tcp
2020-03-18 14:13:19.019502 W | pkg/flags: unrecognized environment variable ETCD_PORT_9979_TCP_ADDR=172.30.9.230
2020-03-18 14:13:19.019505 W | pkg/flags: unrecognized environment variable ETCD_SERVICE_PORT_ETCD=2379
2020-03-18 14:13:19.019508 W | pkg/flags: unrecognized environment variable ETCD_SERVICE_PORT_ETCD_METRICS=9979
2020-03-18 14:13:19.019513 W | pkg/flags: unrecognized environment variable ETCD_SERVICE_PORT=2379
2020-03-18 14:13:19.019516 W | pkg/flags: unrecognized environment variable ETCD_PORT_2379_TCP_ADDR=172.30.9.230
2020-03-18 14:13:19.019519 W | pkg/flags: unrecognized environment variable ETCD_PORT_9979_TCP_PROTO=tcp
2020-03-18 14:13:19.019521 W | pkg/flags: unrecognized environment variable ETCD_PORT_2379_TCP=tcp://172.30.9.230:2379
2020-03-18 14:13:19.019524 W | pkg/flags: unrecognized environment variable ETCD_PORT=tcp://172.30.9.230:2379
2020-03-18 14:13:19.019527 W | pkg/flags: unrecognized environment variable ETCD_SERVICE_HOST=172.30.9.230
2020-03-18 14:13:19.019544 I | etcdmain: etcd Version: 3.3.18
2020-03-18 14:13:19.019548 I | etcdmain: Git SHA: 68628e0
2020-03-18 14:13:19.019550 I | etcdmain: Go Version: go1.13.4
2020-03-18 14:13:19.019553 I | etcdmain: Go OS/Arch: linux/amd64
2020-03-18 14:13:19.019556 I | etcdmain: setting maximum number of CPUs to 8, total number of available CPUs is 8
2020-03-18 14:13:19.019584 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2020-03-18 14:13:19.019598 I | embed: peerTLS: cert = /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-etcd-1.test.bionode.io.crt, key = /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-etcd-1.test.bionode.io.key, ca = , trusted-ca = /etc/kubernetes/static-pod-certs/configmaps/etcd-peer-client-ca/ca-bundle.crt, client-cert-auth = true, crl-file = 
2020-03-18 14:13:19.020014 I | embed: listening for peers on https://0.0.0.0:2380
2020-03-18 14:13:19.020063 I | embed: listening for client requests on 0.0.0.0:2379
2020-03-18 14:13:19.023321 I | embed: rejected connection from "10.1.10.31:34132" (error "set tcp 10.1.10.3:2380: use of closed network connection", ServerName "")
2020-03-18 14:13:19.023338 C | etcdmain: couldn't find local name "etcd-1.test.bionode.io" in the initial cluster configuration
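The final log line is consistent with the environment dump above: ETCD_NAME is etcd-1.test.bionode.io, but ETCD_INITIAL_CLUSTER lists only etcd-0 and etcd-bootstrap. A minimal Go sketch of that consistency check follows. The helpers parseInitialCluster and hasLocalName are hypothetical names for illustration, not etcd's own code.

```go
package main

import (
	"fmt"
	"strings"
)

// parseInitialCluster splits a value of the form "name1=url1,name2=url2"
// into a map from member name to its peer URLs.
func parseInitialCluster(value string) map[string][]string {
	members := make(map[string][]string)
	for _, entry := range strings.Split(value, ",") {
		name, url, ok := strings.Cut(strings.TrimSpace(entry), "=")
		if !ok || name == "" {
			continue // skip malformed or unnamed entries
		}
		members[name] = append(members[name], url)
	}
	return members
}

// hasLocalName reports whether localName appears in the initial cluster.
func hasLocalName(initialCluster, localName string) bool {
	_, ok := parseInitialCluster(initialCluster)[localName]
	return ok
}

func main() {
	// Value copied from the environment dump in this comment.
	initialCluster := "etcd-0.test.bionode.io=https://10.1.10.2:2380,etcd-bootstrap=https://10.1.10.31:2380"
	for _, name := range []string{"etcd-0.test.bionode.io", "etcd-1.test.bionode.io"} {
		fmt.Printf("%s listed: %v\n", name, hasLocalName(initialCluster, name))
	}
}
```

For the values in this log, the check fails for etcd-1.test.bionode.io, which is the name etcd was started with.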

Comment 4 Steven Ellis 2020-03-24 01:16:28 UTC
Reproduced issue with latest RC2 build

Same errors on etcd-1

DNS SRV records appear to be ok

dig srv _etcd-server-ssl._tcp.test.bionode.io @10.1.10.10

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.el8 <<>> srv _etcd-server-ssl._tcp.test.bionode.io @10.1.10.10
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 36481
;; flags: qr aa rd ra ad; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;_etcd-server-ssl._tcp.test.bionode.io. IN SRV

;; ANSWER SECTION:
_etcd-server-ssl._tcp.test.bionode.io. 0 IN SRV	0 10 2380 etcd-1.test.bionode.io.
_etcd-server-ssl._tcp.test.bionode.io. 0 IN SRV	0 10 2380 etcd-0.test.bionode.io.
_etcd-server-ssl._tcp.test.bionode.io. 0 IN SRV	0 10 2380 etcd-2.test.bionode.io.

;; Query time: 1 msec
;; SERVER: 10.1.10.10#53(10.1.10.10)
;; WHEN: Tue Mar 24 01:06:18 UTC 2020
;; MSG SIZE  rcvd: 192

Comment 5 Steven Ellis 2020-04-03 00:50:12 UTC
Created attachment 1675879 [details]
Output from oc adm must-gather on a clean ocp 4.4rc6 install

Latest must-gather output from a clean OCP 4.4rc6 install

ETCD issues.

Comment 6 Sam Batschelet 2020-04-03 02:39:05 UTC
Thanks for the logs, looks like we do have an issue.

> 2020-04-03T00:32:18.551450638Z 2020-04-03 00:32:18.551374 C | etcdmain: couldn't find local name "etcd-1.test.bionode.io" in the initial cluster configuration

```
2020-04-03T00:32:16.863837672Z 27f66a9ce3a0a40f, started, etcd-0.test.bionode.io, https://10.1.10.2:2380, https://10.1.10.2:2379
2020-04-03T00:32:16.863837672Z f3414fc56bfb0248, unstarted, , https://10.1.10.3:2380, 
2020-04-03T00:32:16.863837672Z f374a1615f0d67db, started, etcd-bootstrap, https://10.1.10.31:2380, https://10.1.10.31:2379
2020-04-03T00:32:16.867024818Z memberDir /var/lib/etcd/member is present on etcd-1.test.bionode.io
2020-04-03T00:32:16.874900709Z #### attempt 0
2020-04-03T00:32:16.875950072Z       member={name="etcd-0.test.bionode.io", peerURLs=[https://10.1.10.2:2380}, clientURLs=[https://10.1.10.2:2379]
2020-04-03T00:32:16.875950072Z       member={name="", peerURLs=[https://10.1.10.3:2380}, clientURLs=[]
2020-04-03T00:32:16.875950072Z       member={name="etcd-bootstrap", peerURLs=[https://10.1.10.31:2380}, clientURLs=[https://10.1.10.31:2379]
2020-04-03T00:32:16.875950072Z       target={name="etcd-bootstrap", peerURLs=[https://10.1.10.31:2380}, clientURLs=[https://10.1.10.31:2379], err=<nil>
2020-04-03T00:32:16.877939518Z mv: cannot stat '/etc/kubernetes/manifests/etcd-member.yaml': No such file or directory
2020-04-03T00:32:16.878124481Z Waiting for ports 2379, 2380 and 9978 to be released.
2020-04-03T00:32:18.541623527Z ETCD_PORT_9979_TCP=tcp://172.30.66.5:9979
2020-04-03T00:32:18.541623527Z ETCDCTL_CERT=/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-etcd-1.test.bionode.io.crt
2020-04-03T00:32:18.541623527Z ETCD_IMAGE=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ff2c9cbdbc5b34e917f45846e1346c45e259e4d890fa8f5eb4b588d96b3b5d8a
2020-04-03T00:32:18.541623527Z ETCD_PORT_9979_TCP_PORT=9979
2020-04-03T00:32:18.541623527Z ETCD_PORT_2379_TCP_PORT=2379
2020-04-03T00:32:18.541623527Z ETCD_PORT_2379_TCP_PROTO=tcp
2020-04-03T00:32:18.541623527Z ETCD_INITIAL_CLUSTER_STATE=existing
2020-04-03T00:32:18.541623527Z ETCD_PORT_9979_TCP_ADDR=172.30.66.5
2020-04-03T00:32:18.541623527Z ALL_ETCD_ENDPOINTS=https://10.1.10.2:2379,https://10.1.10.3:2379,https://10.1.10.4:2379,https://10.1.10.31:2379
2020-04-03T00:32:18.541623527Z ETCDCTL_ENDPOINTS=https://10.1.10.2:2379,https://10.1.10.3:2379,https://10.1.10.4:2379,https://10.1.10.31:2379
2020-04-03T00:32:18.541623527Z ETCD_INITIAL_CLUSTER=etcd-0.test.bionode.io=https://10.1.10.2:2380,etcd-bootstrap=https://10.1.10.31:2380
2020-04-03T00:32:18.541623527Z ETCD_ELECTION_TIMEOUT=1000
2020-04-03T00:32:18.541623527Z ETCD_SERVICE_PORT_ETCD=2379
2020-04-03T00:32:18.541623527Z ETCD_SERVICE_PORT_ETCD_METRICS=9979
2020-04-03T00:32:18.541623527Z ETCDCTL_CACERT=/etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt
2020-04-03T00:32:18.541623527Z ETCD_NAME=etcd-1.test.bionode.io
2020-04-03T00:32:18.541623527Z ETCD_QUOTA_BACKEND_BYTES=7516192768
2020-04-03T00:32:18.541623527Z ETCD_SERVICE_PORT=2379
2020-04-03T00:32:18.541623527Z ETCD_PORT_2379_TCP_ADDR=172.30.66.5
2020-04-03T00:32:18.541623527Z ETCDCTL_API=3
2020-04-03T00:32:18.541623527Z ETCD_DATA_DIR=/var/lib/etcd
2020-04-03T00:32:18.541623527Z ETCD_PORT_2379_TCP=tcp://172.30.66.5:2379
2020-04-03T00:32:18.541623527Z ETCD_PORT_9979_TCP_PROTO=tcp
2020-04-03T00:32:18.541623527Z ETCD_PORT=tcp://172.30.66.5:2379
2020-04-03T00:32:18.541623527Z ETCDCTL_KEY=/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-etcd-1.test.bionode.io.key
2020-04-03T00:32:18.541623527Z ETCD_HEARTBEAT_INTERVAL=100
2020-04-03T00:32:18.541623527Z ETCD_SERVICE_HOST=172.30.66.5
```

Comment 7 Sam Batschelet 2020-04-03 02:41:55 UTC
Created attachment 1675892 [details]
etcd-1-pod

Comment 9 Steven Ellis 2020-04-03 02:49:36 UTC
Here is my current cluster ID

oc get clusterversion -o jsonpath='{.items[].spec.clusterID}{"\n"}'
5a6288ed-388e-40c0-9bc2-d9ed47d17f96

If I roll back to OCP 4.3.5 on the same hardware the deployment completes

Comment 10 Steven Ellis 2020-04-03 02:50:18 UTC
Looking at ETCD on node1

oc logs pod/etcd-etcd-1.test.bionode.io -c etcd -n openshift-etcd

2020-04-03 00:47:43.533770 I | etcdmain: etcd Version: 3.3.18
2020-04-03 00:47:43.533774 I | etcdmain: Git SHA: 00e3e1c
2020-04-03 00:47:43.533776 I | etcdmain: Go Version: go1.13.4
2020-04-03 00:47:43.533779 I | etcdmain: Go OS/Arch: linux/amd64
2020-04-03 00:47:43.533782 I | etcdmain: setting maximum number of CPUs to 8, total number of available CPUs is 8
2020-04-03 00:47:43.533810 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2020-04-03 00:47:43.533834 I | embed: peerTLS: cert = /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-etcd-1.test.bionode.io.crt, key = /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-etcd-1.test.bionode.io.key, ca = , trusted-ca = /etc/kubernetes/static-pod-certs/configmaps/etcd-peer-client-ca/ca-bundle.crt, client-cert-auth = true, crl-file = 
2020-04-03 00:47:43.534223 I | embed: listening for peers on https://0.0.0.0:2380
2020-04-03 00:47:43.534272 I | embed: listening for client requests on 0.0.0.0:2379
2020-04-03 00:47:43.538514 I | embed: rejected connection from "10.1.10.31:35278" (error "set tcp 10.1.10.3:2380: use of closed network connection", ServerName "")
2020-04-03 00:47:43.538531 C | etcdmain: couldn't find local name "etcd-1.test.bionode.io" in the initial cluster configuration


and on node0

2020-04-03 00:50:36.035179 W | rafthttp: health check for peer f3414fc56bfb0248 could not connect: dial tcp 10.1.10.3:2380: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2020-04-03 00:50:36.035212 W | rafthttp: health check for peer f3414fc56bfb0248 could not connect: dial tcp 10.1.10.3:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
2020-04-03 00:50:41.035324 W | rafthttp: health check for peer f3414fc56bfb0248 could not connect: dial tcp 10.1.10.3:2380: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2020-04-03 00:50:41.035359 W | rafthttp: health check for peer f3414fc56bfb0248 could not connect: dial tcp 10.1.10.3:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
2020-04-03 00:50:46.035471 W | rafthttp: health check for peer f3414fc56bfb0248 could not connect: dial tcp 10.1.10.3:2380: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2020-04-03 00:50:46.035509 W | rafthttp: health check for peer f3414fc56bfb0248 could not connect: dial tcp 10.1.10.3:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
2020-04-03 00:50:51.035582 W | rafthttp: health check for peer f3414fc56bfb0248 could not connect: dial tcp 10.1.10.3:2380: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2020-04-03 00:50:51.035620 W | rafthttp: health check for peer f3414fc56bfb0248 could not connect: dial tcp 10.1.10.3:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
2020-04-03 00:50:56.035696 W | rafthttp: health check for peer f3414fc56bfb0248 could not connect: dial tcp 10.1.10.3:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
2020-04-03 00:50:56.035718 W | rafthttp: health check for peer f3414fc56bfb0248 could not connect: dial tcp 10.1.10.3:2380: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")


Some odd naming issues with the etcd hosts

oc get pods -n openshift-etdc
No resources found in openshift-etdc namespace.
[root@lb ocp44]# oc get pods -n openshift-etcd
NAME                                 READY   STATUS             RESTARTS   AGE
etcd-etcd-0.test.bionode.io          3/3     Running            3          79m
etcd-etcd-1.test.bionode.io          2/3     CrashLoopBackOff   19         76m
installer-2-etcd-0.test.bionode.io   0/1     Completed          0          79m
installer-2-etcd-1.test.bionode.io   0/1     Completed          0          77m


We don't appear to have an ETCD pod on node-2

oc describe node etcd-2.test.bionode.io | grep openshift-etc

Comment 11 Suresh Kolichala 2020-04-03 12:27:28 UTC
This output from etcd-1 looks strange:

```
# oc logs pod/etcd-etcd-1.test.bionode.io -c etcd -n openshift-etcd
memberDir /var/lib/etcd/member is present on etcd-1.test.bionode.io
#### attempt 0
      member={name="", peerURLs=[https://10.1.10.3:2380}, clientURLs=[]
      member={name="etcd-0.test.bionode.io", peerURLs=[https://10.1.10.2:2380}, clientURLs=[https://10.1.10.2:2379]
      member={name="etcd-bootstrap", peerURLs=[https://10.1.10.31:2380}, clientURLs=[https://10.1.10.31:2379]
      target={name="etcd-bootstrap", peerURLs=[https://10.1.10.31:2380}, clientURLs=[https://10.1.10.31:2379], err=<nil>
```

If this is output from etcd-1, then the target should be etcd-0.test.bionode.io, but the last line above strangely shows the target name as etcd-bootstrap. Will dig into it.

Comment 12 Suresh Kolichala 2020-04-03 12:28:28 UTC
Sorry, if this is the output from etcd-1, then the target should be `etcd-1.test.bionode.io`, but the last line shows the target name as `etcd-bootstrap`, which is unexpected.

Comment 13 Steven Ellis 2020-04-03 21:49:44 UTC
bootstrap is 10.1.10.31

The three master/worker bare metal nodes are
etcd-0 - 10.1.10.2
etcd-1 - 10.1.10.3
etcd-2 - 10.1.10.4

Comment 16 W. Trevor King 2020-05-07 04:39:24 UTC
*** Bug 1832120 has been marked as a duplicate of this bug. ***

Comment 17 Steven Ellis 2020-05-07 23:31:26 UTC
I've had a different issue with UPI and OCP 4.3.15 documented under
 - https://bugzilla.redhat.com/show_bug.cgi?id=1833160

Comment 18 Steven Ellis 2020-05-11 05:36:01 UTC
*** Bug 1832120 has been marked as a duplicate of this bug. ***

Comment 34 Suresh Kolichala 2020-05-19 15:44:21 UTC
Steven, this PR is merged into 4.5. Can you retest? -- Thanks, Suresh.

Comment 38 Suresh Kolichala 2020-05-20 13:00:55 UTC
*** Bug 1833050 has been marked as a duplicate of this bug. ***

Comment 43 Pedro Amoedo 2020-06-05 08:28:41 UTC
Hi all, can someone please confirm if this IP pattern-matching bug can also occur during an upgrade from 4.3 to 4.4 due to the presence of the new ETCD Operator or is only related with initial ETCD cluster creation?

Thanks and regards.

Comment 44 ge liu 2020-06-05 10:30:16 UTC
@Pedro Amoedo, regarding comment 40, do you have an env whose IP addresses can be configured as expected? If yes, QE can run the test on your env. Thanks.

Comment 45 Pedro Amoedo 2020-06-05 11:25:14 UTC
(In reply to ge liu from comment #44)
> @Pedro Amoedo, regarding comment 40, do you have an env whose IP addresses
> can be configured as expected? If yes, QE can run the test on your env. Thanks.

I'm sorry, but I don't have my own lab in which to reproduce this; the comment relates to a customer environment.

If needed I can ask them to test a patch or extract some logs, just tell me.

Regards.

Comment 46 Sam Batschelet 2020-06-05 12:16:50 UTC
I feel this bug does not need a reproduction env to verify; the logic is very straightforward. Can I simply write a test to prove the solution, Ge?

Comment 47 kit 2020-06-05 14:06:47 UTC
I seem to be hitting this same bug when provisioning a new 4.4.6 setup. If I can help by providing logs etc., let me know.

Comment 50 Pedro Amoedo 2020-06-15 08:32:37 UTC
Hi all, can someone please confirm if this bug is only present for new installations and not affecting upgrades from 4.3 to 4.4? thanks.

Comment 51 Steven Ellis 2020-06-16 07:08:28 UTC
I'm no longer hitting this issue with 4.3/4.4

It appears that one way this issue occurs is if you have unreliable DNS during initial bootstrap. I've switched from a RouterOS DNS to dnsmasq.

Secondly I've made sure my bootstrap IP doesn't conflict with the master/worker IPs

For some reason a master of 10.1.1.3 would conflict with a bootstrap of 10.1.1.31

Comment 52 W. Trevor King 2020-06-17 02:22:50 UTC
> For some reason a master of 10.1.1.3 would conflict with a bootstrap of 10.1.1.31

That is the bug we're fixing here [1]. But the 4.4 backport is still in flight [2] (bug 1837152), and needs this bug to be VERIFIED in 4.5 before it can land.

This bug does not address unreliable DNS.  Does something there need to get spun out into a new bug?

[1]: https://github.com/openshift/etcd/pull/48/files
[2]: https://github.com/openshift/etcd/pull/49
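Until a fixed build is in use, an address plan can be screened for the pattern described above, where one node's IP is a textual substring of another's. The Go sketch below is one way to do that. The helper substringCollisions is a hypothetical name, and the addresses are the bootstrap and master IPs listed in comment 13.

```go
package main

import (
	"fmt"
	"strings"
)

// substringCollisions returns every ordered pair (a, b) of distinct
// addresses where a is a textual substring of b.
func substringCollisions(addrs []string) [][2]string {
	var pairs [][2]string
	for i, a := range addrs {
		for j, b := range addrs {
			if i != j && a != b && strings.Contains(b, a) {
				pairs = append(pairs, [2]string{a, b})
			}
		}
	}
	return pairs
}

func main() {
	// Masters and bootstrap node from comment 13.
	addrs := []string{"10.1.10.2", "10.1.10.3", "10.1.10.4", "10.1.10.31"}
	for _, p := range substringCollisions(addrs) {
		fmt.Printf("%s is a substring of %s\n", p[0], p[1])
	}
}
```

With these addresses the only pair reported is 10.1.10.3 and 10.1.10.31, the etcd-1 master and the bootstrap node.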

Comment 53 Pedro Amoedo 2020-06-19 08:11:47 UTC
(In reply to W. Trevor King from comment #52)
> > For some reason a master of 10.1.1.3 would conflict with a bootstrap of 10.1.1.31
> 
> That is the bug we're fixing here [1]. But the 4.4 backport is still in
> flight [2] (bug 1837152), and needs this bug to be VERIFIED in 4.5 before it
> can land.
> 
> This bug does not address unreliable DNS.  Does something there need to get
> spun out into a new bug?
> 
> [1]: https://github.com/openshift/etcd/pull/48/files
> [2]: https://github.com/openshift/etcd/pull/49

Hi Trevor, thanks for the PR and the 4.4 backport BZ links, appreciated.

Apart from that, would it be possible to confirm whether this bug only affects initial etcd cluster installation and not 4.3 to 4.4 upgrades? I suppose it shouldn't, because the bootstrap node is no longer present, but I don't know which procedure the new operator follows during an upgrade. Thanks.

Best Regards.

Comment 54 Sam Batschelet 2020-06-22 12:40:45 UTC
> Hi all, can someone please confirm if this bug is only present for new installations and not affecting upgrades from 4.3 to 4.4? thanks.

I believe this could be a problem with upgrades as well. Essentially, any 2 master nodes whose IP addresses could invalidate a contains call could cause etcd to become confused. This is a bug [1], [2] and the fix needs to be merged. Ge, please verify ASAP.

[1]https://play.golang.org/p/ZTbgAlVBz0X

[2]https://github.com/openshift/etcd/blob/openshift-4.4/openshift-tools/pkg/discover-etcd-initial-cluster/initial-cluster.go#L244
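The difference between a contains-style check and a strict one can be sketched in Go as follows. This is an illustration under assumptions, not the code from initial-cluster.go or from the fix: looseMatch and strictMatch are hypothetical helpers, and the member list is taken from the etcd-1 log output earlier in this bug.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// looseMatch stands in for a substring-based comparison.
func looseMatch(peerURL, ip string) bool {
	return strings.Contains(peerURL, ip)
}

// strictMatch parses the peer URL and compares its host part exactly.
func strictMatch(peerURL, ip string) bool {
	u, err := url.Parse(peerURL)
	if err != nil {
		return false
	}
	return u.Hostname() == ip
}

// matching returns the peer URLs accepted by the given predicate.
func matching(peerURLs []string, ip string, match func(string, string) bool) []string {
	var out []string
	for _, p := range peerURLs {
		if match(p, ip) {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	// Peer URLs of the members listed in the etcd-1 log.
	peers := []string{
		"https://10.1.10.3:2380",  // unstarted member (empty name)
		"https://10.1.10.2:2380",  // etcd-0.test.bionode.io
		"https://10.1.10.31:2380", // etcd-bootstrap
	}
	ip := "10.1.10.3" // address of etcd-1
	fmt.Println("loose: ", matching(peers, ip, looseMatch))
	fmt.Println("strict:", matching(peers, ip, strictMatch))
}
```

The loose check accepts both the 10.1.10.3 and the 10.1.10.31 peer URLs, while the strict check accepts only the first. That ambiguity is in line with the log showing etcd-bootstrap as the target on etcd-1.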

Comment 55 Pedro Amoedo 2020-06-22 14:11:09 UTC
(In reply to Sam Batschelet from comment #54)
> > Hi all, can someone please confirm if this bug is only present for new installations and not affecting upgrades from 4.3 to 4.4? thanks.
> 
> I believe this could be a problem with upgrades as well, essentially any 2
> master nodes which have IP addresses that could invalidate a contains call
> could cause etcd to become confused. This is a bug[1],[2] and it needs to be
> merged Ge please verify ASAP.
> 
> https://play.golang.org/p/ZTbgAlVBz0X
> 
> [2]https://github.com/openshift/etcd/blob/openshift-4.4/openshift-tools/pkg/
> discover-etcd-initial-cluster/initial-cluster.go#L244

Thanks for the confirmation Sam, appreciated!

Comment 60 errata-xmlrpc 2020-07-13 17:22:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

Comment 61 Steven Ellis 2021-01-05 05:30:13 UTC
The issue appears to be resolved. The main problem was a bad DNS resolver having problems during build time.

