Created attachment 1671020 [details]
Output from oc adm must-gather

Description of problem:
OpenShift openshift-install-linux-4.4.0-rc.1 on bare metal. Bootstrap never completes and the following error appears in the bootstrap logs.

Version:
openshift-install-linux-4.4.0-rc.1
openshift-client-linux-4.4.0-rc.1.tar.gz

How reproducible:

Steps to Reproduce:
1. openshift-install --dir=baremetal wait-for bootstrap-complete \
   --log-level=info
2. ssh core@bootstrap
3. journalctl -b -f -u bootkube.service

Actual results:
Mar 18 09:56:18 bootstrap.test.bionode.io bootkube.sh[2015]: I0318 09:56:18.486144 1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.
Mar 18 09:56:21 bootstrap.test.bionode.io bootkube.sh[2015]: I0318 09:56:21.672658 1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.
Mar 18 09:56:32 bootstrap.test.bionode.io bootkube.sh[2015]: I0318 09:56:32.143051 1 waitforceo.go:67] waiting on condition EtcdRunningInCluster in etcd CR /cluster to be True.

Expected results:
Bootstrap should complete and not time out.

Additional info:
Attached logs from oc adm must-gather
Cluster ID - f9cd09b8-8454-4bac-8ed1-70c712bd2a66
Deployment environment is a 3-node converged master/worker cluster. The install config is:

apiVersion: v1
baseDomain: bionode.io
compute:
- hyperthreading: Enabled
  name: worker
  replicas: 0
controlPlane:
  hyperthreading: Enabled
  name: master
  replicas: 3
metadata:
  name: test
networking:
  machineNetwork:
  - cidr: 10.1.10.0/24
  clusterNetwork:
  - cidr: 10.128.0.0/16
    hostPrefix: 24
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16

The machineNetwork entry is a recent addition, and the error occurs even if it isn't specified. As I've got limited bare metal resources I leave the masters schedulable as workers.
For the initial UEFI PXE bootstrap I'm specifying rhcos-4.4.0-rc.1-x86_64-metal.x86_64.raw.gz for the bootstrap and master/worker nodes.
oc logs pod/etcd-etcd-1.test.bionode.io -c etcd -n openshift-etcd

Error: open /etc/kubernetes/static-pod-resources/etcd-certs/secrets/etcd-all-peer/etcd-peer-etcd-1.test.bionode.io.crt: no such file or directory
memberDir /var/lib/etcd/member is present on etcd-1.test.bionode.io
#### attempt 0
member={name="", peerURLs=[https://10.1.10.3:2380}, clientURLs=[]
member={name="etcd-0.test.bionode.io", peerURLs=[https://10.1.10.2:2380}, clientURLs=[https://10.1.10.2:2379]
member={name="etcd-bootstrap", peerURLs=[https://10.1.10.31:2380}, clientURLs=[https://10.1.10.31:2379]
target={name="etcd-bootstrap", peerURLs=[https://10.1.10.31:2380}, clientURLs=[https://10.1.10.31:2379], err=<nil>
mv: cannot stat '/etc/kubernetes/manifests/etcd-member.yaml': No such file or directory
Waiting for ports 2379, 2380 and 9978 to be released.
ETCD_PORT_9979_TCP=tcp://172.30.9.230:9979
ETCDCTL_CERT=/etc/kubernetes/static-pod-resources/etcd-certs/secrets/etcd-all-peer/etcd-peer-etcd-1.test.bionode.io.crt
ETCD_IMAGE=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3826f672aacf734fafbdc81165dd9c81a3c9c1e89ab6f75477a83569b0538a0e
ETCD_PORT_9979_TCP_PORT=9979
ETCD_PORT_2379_TCP_PORT=2379
ETCD_PORT_2379_TCP_PROTO=tcp
ETCD_INITIAL_CLUSTER_STATE=existing
ETCD_PORT_9979_TCP_ADDR=172.30.9.230
ALL_ETCD_ENDPOINTS=https://10.1.10.2:2379,https://10.1.10.3:2379,https://10.1.10.4:2379,https://10.1.10.31:2379
ETCDCTL_ENDPOINTS=https://10.1.10.2:2379,https://10.1.10.3:2379,https://10.1.10.4:2379,https://10.1.10.31:2379
ETCD_INITIAL_CLUSTER=etcd-0.test.bionode.io=https://10.1.10.2:2380,etcd-bootstrap=https://10.1.10.31:2380
ETCD_ELECTION_TIMEOUT=1000
ETCD_SERVICE_PORT_ETCD=2379
ETCD_SERVICE_PORT_ETCD_METRICS=9979
ETCDCTL_CACERT=/etc/kubernetes/static-pod-resources/etcd-certs/configmaps/etcd-serving-ca/ca-bundle.crt
ETCD_NAME=etcd-1.test.bionode.io
ETCD_QUOTA_BACKEND_BYTES=7516192768
ETCD_SERVICE_PORT=2379
ETCD_PORT_2379_TCP_ADDR=172.30.9.230
ETCDCTL_API=3
ETCD_DATA_DIR=/var/lib/etcd
ETCD_PORT_9979_TCP_PROTO=tcp
ETCD_PORT_2379_TCP=tcp://172.30.9.230:2379
ETCD_PORT=tcp://172.30.9.230:2379
ETCD_HEARTBEAT_INTERVAL=100
ETCDCTL_KEY=/etc/kubernetes/static-pod-resources/etcd-certs/secrets/etcd-all-peer/etcd-peer-etcd-1.test.bionode.io.key
ETCD_SERVICE_HOST=172.30.9.230
+ exec etcd --initial-advertise-peer-urls=https://10.1.10.3:2380 --cert-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-serving/etcd-serving-etcd-1.test.bionode.io.crt --key-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-serving/etcd-serving-etcd-1.test.bionode.io.key --trusted-ca-file=/etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt --client-cert-auth=true --peer-cert-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-etcd-1.test.bionode.io.crt --peer-key-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-etcd-1.test.bionode.io.key --peer-trusted-ca-file=/etc/kubernetes/static-pod-certs/configmaps/etcd-peer-client-ca/ca-bundle.crt --peer-client-cert-auth=true --advertise-client-urls=https://10.1.10.3:2379 --listen-client-urls=https://0.0.0.0:2379 --listen-peer-urls=https://0.0.0.0:2380 --listen-metrics-urls=https://0.0.0.0:9978
2020-03-18 14:13:19.019386 I | pkg/flags: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd
2020-03-18 14:13:19.019432 I | pkg/flags: recognized and used environment variable ETCD_ELECTION_TIMEOUT=1000
2020-03-18 14:13:19.019443 I | pkg/flags: recognized and used environment variable ETCD_HEARTBEAT_INTERVAL=100
2020-03-18 14:13:19.019448 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER=etcd-0.test.bionode.io=https://10.1.10.2:2380,etcd-bootstrap=https://10.1.10.31:2380
2020-03-18 14:13:19.019451 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=existing
2020-03-18 14:13:19.019462 I | pkg/flags: recognized and used environment variable ETCD_NAME=etcd-1.test.bionode.io
2020-03-18 14:13:19.019473 I | pkg/flags: recognized and used environment variable ETCD_QUOTA_BACKEND_BYTES=7516192768
2020-03-18 14:13:19.019484 W | pkg/flags: unrecognized environment variable ETCD_PORT_9979_TCP=tcp://172.30.9.230:9979
2020-03-18 14:13:19.019490 W | pkg/flags: unrecognized environment variable ETCD_IMAGE=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3826f672aacf734fafbdc81165dd9c81a3c9c1e89ab6f75477a83569b0538a0e
2020-03-18 14:13:19.019493 W | pkg/flags: unrecognized environment variable ETCD_PORT_9979_TCP_PORT=9979
2020-03-18 14:13:19.019496 W | pkg/flags: unrecognized environment variable ETCD_PORT_2379_TCP_PORT=2379
2020-03-18 14:13:19.019498 W | pkg/flags: unrecognized environment variable ETCD_PORT_2379_TCP_PROTO=tcp
2020-03-18 14:13:19.019502 W | pkg/flags: unrecognized environment variable ETCD_PORT_9979_TCP_ADDR=172.30.9.230
2020-03-18 14:13:19.019505 W | pkg/flags: unrecognized environment variable ETCD_SERVICE_PORT_ETCD=2379
2020-03-18 14:13:19.019508 W | pkg/flags: unrecognized environment variable ETCD_SERVICE_PORT_ETCD_METRICS=9979
2020-03-18 14:13:19.019513 W | pkg/flags: unrecognized environment variable ETCD_SERVICE_PORT=2379
2020-03-18 14:13:19.019516 W | pkg/flags: unrecognized environment variable ETCD_PORT_2379_TCP_ADDR=172.30.9.230
2020-03-18 14:13:19.019519 W | pkg/flags: unrecognized environment variable ETCD_PORT_9979_TCP_PROTO=tcp
2020-03-18 14:13:19.019521 W | pkg/flags: unrecognized environment variable ETCD_PORT_2379_TCP=tcp://172.30.9.230:2379
2020-03-18 14:13:19.019524 W | pkg/flags: unrecognized environment variable ETCD_PORT=tcp://172.30.9.230:2379
2020-03-18 14:13:19.019527 W | pkg/flags: unrecognized environment variable ETCD_SERVICE_HOST=172.30.9.230
2020-03-18 14:13:19.019544 I | etcdmain: etcd Version: 3.3.18
2020-03-18 14:13:19.019548 I | etcdmain: Git SHA: 68628e0
2020-03-18 14:13:19.019550 I | etcdmain: Go Version: go1.13.4
2020-03-18 14:13:19.019553 I | etcdmain: Go OS/Arch: linux/amd64
2020-03-18 14:13:19.019556 I | etcdmain: setting maximum number of CPUs to 8, total number of available CPUs is 8
2020-03-18 14:13:19.019584 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2020-03-18 14:13:19.019598 I | embed: peerTLS: cert = /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-etcd-1.test.bionode.io.crt, key = /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-etcd-1.test.bionode.io.key, ca = , trusted-ca = /etc/kubernetes/static-pod-certs/configmaps/etcd-peer-client-ca/ca-bundle.crt, client-cert-auth = true, crl-file =
2020-03-18 14:13:19.020014 I | embed: listening for peers on https://0.0.0.0:2380
2020-03-18 14:13:19.020063 I | embed: listening for client requests on 0.0.0.0:2379
2020-03-18 14:13:19.023321 I | embed: rejected connection from "10.1.10.31:34132" (error "set tcp 10.1.10.3:2380: use of closed network connection", ServerName "")
2020-03-18 14:13:19.023338 C | etcdmain: couldn't find local name "etcd-1.test.bionode.io" in the initial cluster configuration
Reproduced the issue with the latest RC2 build. Same errors on etcd-1. The DNS SRV records appear to be OK:

dig srv _etcd-server-ssl._tcp.test.bionode.io @10.1.10.10

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.el8 <<>> srv _etcd-server-ssl._tcp.test.bionode.io @10.1.10.10
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 36481
;; flags: qr aa rd ra ad; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096

;; QUESTION SECTION:
;_etcd-server-ssl._tcp.test.bionode.io. IN SRV

;; ANSWER SECTION:
_etcd-server-ssl._tcp.test.bionode.io. 0 IN SRV 0 10 2380 etcd-1.test.bionode.io.
_etcd-server-ssl._tcp.test.bionode.io. 0 IN SRV 0 10 2380 etcd-0.test.bionode.io.
_etcd-server-ssl._tcp.test.bionode.io. 0 IN SRV 0 10 2380 etcd-2.test.bionode.io.

;; Query time: 1 msec
;; SERVER: 10.1.10.10#53(10.1.10.10)
;; WHEN: Tue Mar 24 01:06:18 UTC 2020
;; MSG SIZE rcvd: 192
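The dig check above can also be done programmatically. The sketch below (an illustration, not part of the installer) builds the SRV name etcd discovery queries and resolves it with the system resolver; each master should appear exactly once on port 2380. The domain name is the base domain from this report.

```go
package main

import (
	"fmt"
	"net"
)

// srvName builds the DNS SRV query name, e.g.
// _etcd-server-ssl._tcp.test.bionode.io
func srvName(service, proto, domain string) string {
	return fmt.Sprintf("_%s._%s.%s", service, proto, domain)
}

func main() {
	domain := "test.bionode.io" // cluster base domain from this report

	fmt.Println("querying", srvName("etcd-server-ssl", "tcp", domain))

	// LookupSRV queries _etcd-server-ssl._tcp.<domain> via the system
	// resolver (equivalent to the dig command above).
	_, addrs, err := net.LookupSRV("etcd-server-ssl", "tcp", domain)
	if err != nil {
		fmt.Println("SRV lookup failed:", err)
		return
	}
	for _, a := range addrs {
		fmt.Printf("%s port=%d priority=%d weight=%d\n", a.Target, a.Port, a.Priority, a.Weight)
	}
}
```

Against the DNS server above, this should list etcd-0 through etcd-2 on port 2380; a missing or duplicated record here would point at the resolver rather than etcd itself.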
Created attachment 1675879 [details]
Output from oc adm must-gather on a clean OCP 4.4 rc6 install

Latest must-gather output from a clean OCP 4.4 rc6 install showing the etcd issues.
Thanks for the logs, looks like we do have an issue.

> 2020-04-03T00:32:18.551450638Z 2020-04-03 00:32:18.551374 C | etcdmain: couldn't find local name "etcd-1.test.bionode.io" in the initial cluster configuration

```
2020-04-03T00:32:16.863837672Z 27f66a9ce3a0a40f, started, etcd-0.test.bionode.io, https://10.1.10.2:2380, https://10.1.10.2:2379
2020-04-03T00:32:16.863837672Z f3414fc56bfb0248, unstarted, , https://10.1.10.3:2380,
2020-04-03T00:32:16.863837672Z f374a1615f0d67db, started, etcd-bootstrap, https://10.1.10.31:2380, https://10.1.10.31:2379
2020-04-03T00:32:16.867024818Z memberDir /var/lib/etcd/member is present on etcd-1.test.bionode.io
2020-04-03T00:32:16.874900709Z #### attempt 0
2020-04-03T00:32:16.875950072Z member={name="etcd-0.test.bionode.io", peerURLs=[https://10.1.10.2:2380}, clientURLs=[https://10.1.10.2:2379]
2020-04-03T00:32:16.875950072Z member={name="", peerURLs=[https://10.1.10.3:2380}, clientURLs=[]
2020-04-03T00:32:16.875950072Z member={name="etcd-bootstrap", peerURLs=[https://10.1.10.31:2380}, clientURLs=[https://10.1.10.31:2379]
2020-04-03T00:32:16.875950072Z target={name="etcd-bootstrap", peerURLs=[https://10.1.10.31:2380}, clientURLs=[https://10.1.10.31:2379], err=<nil>
2020-04-03T00:32:16.877939518Z mv: cannot stat '/etc/kubernetes/manifests/etcd-member.yaml': No such file or directory
2020-04-03T00:32:16.878124481Z Waiting for ports 2379, 2380 and 9978 to be released.
2020-04-03T00:32:18.541623527Z ETCD_PORT_9979_TCP=tcp://172.30.66.5:9979
2020-04-03T00:32:18.541623527Z ETCDCTL_CERT=/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-etcd-1.test.bionode.io.crt
2020-04-03T00:32:18.541623527Z ETCD_IMAGE=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ff2c9cbdbc5b34e917f45846e1346c45e259e4d890fa8f5eb4b588d96b3b5d8a
2020-04-03T00:32:18.541623527Z ETCD_PORT_9979_TCP_PORT=9979
2020-04-03T00:32:18.541623527Z ETCD_PORT_2379_TCP_PORT=2379
2020-04-03T00:32:18.541623527Z ETCD_PORT_2379_TCP_PROTO=tcp
2020-04-03T00:32:18.541623527Z ETCD_INITIAL_CLUSTER_STATE=existing
2020-04-03T00:32:18.541623527Z ETCD_PORT_9979_TCP_ADDR=172.30.66.5
2020-04-03T00:32:18.541623527Z ALL_ETCD_ENDPOINTS=https://10.1.10.2:2379,https://10.1.10.3:2379,https://10.1.10.4:2379,https://10.1.10.31:2379
2020-04-03T00:32:18.541623527Z ETCDCTL_ENDPOINTS=https://10.1.10.2:2379,https://10.1.10.3:2379,https://10.1.10.4:2379,https://10.1.10.31:2379
2020-04-03T00:32:18.541623527Z ETCD_INITIAL_CLUSTER=etcd-0.test.bionode.io=https://10.1.10.2:2380,etcd-bootstrap=https://10.1.10.31:2380
2020-04-03T00:32:18.541623527Z ETCD_ELECTION_TIMEOUT=1000
2020-04-03T00:32:18.541623527Z ETCD_SERVICE_PORT_ETCD=2379
2020-04-03T00:32:18.541623527Z ETCD_SERVICE_PORT_ETCD_METRICS=9979
2020-04-03T00:32:18.541623527Z ETCDCTL_CACERT=/etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt
2020-04-03T00:32:18.541623527Z ETCD_NAME=etcd-1.test.bionode.io
2020-04-03T00:32:18.541623527Z ETCD_QUOTA_BACKEND_BYTES=7516192768
2020-04-03T00:32:18.541623527Z ETCD_SERVICE_PORT=2379
2020-04-03T00:32:18.541623527Z ETCD_PORT_2379_TCP_ADDR=172.30.66.5
2020-04-03T00:32:18.541623527Z ETCDCTL_API=3
2020-04-03T00:32:18.541623527Z ETCD_DATA_DIR=/var/lib/etcd
2020-04-03T00:32:18.541623527Z ETCD_PORT_2379_TCP=tcp://172.30.66.5:2379
2020-04-03T00:32:18.541623527Z ETCD_PORT_9979_TCP_PROTO=tcp
2020-04-03T00:32:18.541623527Z ETCD_PORT=tcp://172.30.66.5:2379
2020-04-03T00:32:18.541623527Z ETCDCTL_KEY=/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-etcd-1.test.bionode.io.key
2020-04-03T00:32:18.541623527Z ETCD_HEARTBEAT_INTERVAL=100
2020-04-03T00:32:18.541623527Z ETCD_SERVICE_HOST=172.30.66.5
```
Created attachment 1675892 [details]
etcd-1-pod
Here is my current cluster ID:

oc get clusterversion -o jsonpath='{.items[].spec.clusterID}{"\n"}'
5a6288ed-388e-40c0-9bc2-d9ed47d17f96

If I roll back to OCP 4.3.5 on the same hardware the deployment completes.
Looking at etcd on node 1:

oc logs pod/etcd-etcd-1.test.bionode.io -c etcd -n openshift-etcd

2020-04-03 00:47:43.533770 I | etcdmain: etcd Version: 3.3.18
2020-04-03 00:47:43.533774 I | etcdmain: Git SHA: 00e3e1c
2020-04-03 00:47:43.533776 I | etcdmain: Go Version: go1.13.4
2020-04-03 00:47:43.533779 I | etcdmain: Go OS/Arch: linux/amd64
2020-04-03 00:47:43.533782 I | etcdmain: setting maximum number of CPUs to 8, total number of available CPUs is 8
2020-04-03 00:47:43.533810 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2020-04-03 00:47:43.533834 I | embed: peerTLS: cert = /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-etcd-1.test.bionode.io.crt, key = /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-etcd-1.test.bionode.io.key, ca = , trusted-ca = /etc/kubernetes/static-pod-certs/configmaps/etcd-peer-client-ca/ca-bundle.crt, client-cert-auth = true, crl-file =
2020-04-03 00:47:43.534223 I | embed: listening for peers on https://0.0.0.0:2380
2020-04-03 00:47:43.534272 I | embed: listening for client requests on 0.0.0.0:2379
2020-04-03 00:47:43.538514 I | embed: rejected connection from "10.1.10.31:35278" (error "set tcp 10.1.10.3:2380: use of closed network connection", ServerName "")
2020-04-03 00:47:43.538531 C | etcdmain: couldn't find local name "etcd-1.test.bionode.io" in the initial cluster configuration

and on node 0:

2020-04-03 00:50:36.035179 W | rafthttp: health check for peer f3414fc56bfb0248 could not connect: dial tcp 10.1.10.3:2380: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2020-04-03 00:50:36.035212 W | rafthttp: health check for peer f3414fc56bfb0248 could not connect: dial tcp 10.1.10.3:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
2020-04-03 00:50:41.035324 W | rafthttp: health check for peer f3414fc56bfb0248 could not connect: dial tcp 10.1.10.3:2380: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2020-04-03 00:50:41.035359 W | rafthttp: health check for peer f3414fc56bfb0248 could not connect: dial tcp 10.1.10.3:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
2020-04-03 00:50:46.035471 W | rafthttp: health check for peer f3414fc56bfb0248 could not connect: dial tcp 10.1.10.3:2380: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2020-04-03 00:50:46.035509 W | rafthttp: health check for peer f3414fc56bfb0248 could not connect: dial tcp 10.1.10.3:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
2020-04-03 00:50:51.035582 W | rafthttp: health check for peer f3414fc56bfb0248 could not connect: dial tcp 10.1.10.3:2380: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2020-04-03 00:50:51.035620 W | rafthttp: health check for peer f3414fc56bfb0248 could not connect: dial tcp 10.1.10.3:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
2020-04-03 00:50:56.035696 W | rafthttp: health check for peer f3414fc56bfb0248 could not connect: dial tcp 10.1.10.3:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
2020-04-03 00:50:56.035718 W | rafthttp: health check for peer f3414fc56bfb0248 could not connect: dial tcp 10.1.10.3:2380: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")

Some odd naming issues with the etcd hosts:

oc get pods -n openshift-etdc
No resources found in openshift-etdc namespace.

[root@lb ocp44]# oc get pods -n openshift-etcd
NAME                                 READY   STATUS             RESTARTS   AGE
etcd-etcd-0.test.bionode.io          3/3     Running            3          79m
etcd-etcd-1.test.bionode.io          2/3     CrashLoopBackOff   19         76m
installer-2-etcd-0.test.bionode.io   0/1     Completed          0          79m
installer-2-etcd-1.test.bionode.io   0/1     Completed          0          77m

We don't appear to have an etcd pod on node 2:

oc describe node etcd-2.test.bionode.io | grep openshift-etc
This output from etcd-1 looks strange:

```
# oc logs pod/etcd-etcd-1.test.bionode.io -c etcd -n openshift-etcd
memberDir /var/lib/etcd/member is present on etcd-1.test.bionode.io
#### attempt 0
member={name="", peerURLs=[https://10.1.10.3:2380}, clientURLs=[]
member={name="etcd-0.test.bionode.io", peerURLs=[https://10.1.10.2:2380}, clientURLs=[https://10.1.10.2:2379]
member={name="etcd-bootstrap", peerURLs=[https://10.1.10.31:2380}, clientURLs=[https://10.1.10.31:2379]
target={name="etcd-bootstrap", peerURLs=[https://10.1.10.31:2380}, clientURLs=[https://10.1.10.31:2379], err=<nil>
```

If this is output from etcd-1, then the target should be etcd-0.test.bionode.io, but the last line above strangely shows the target name as etcd-bootstrap. Will dig into it.
Sorry, if this is the output from etcd-1, then the target should be `etcd-1.test.bionode.io`, but the last line shows the target name as `etcd-bootstrap`, which is unexpected.
bootstrap is 10.1.10.31

The three master/worker bare metal nodes are:
etcd-0 - 10.1.10.2
etcd-1 - 10.1.10.3
etcd-2 - 10.1.10.4
*** Bug 1832120 has been marked as a duplicate of this bug. ***
I've had a different issue with UPI and OCP 4.3.15 documented under - https://bugzilla.redhat.com/show_bug.cgi?id=1833160
Steven, this PR is merged into 4.5. Can you retest? -- Thanks, Suresh.
*** Bug 1833050 has been marked as a duplicate of this bug. ***
Hi all, can someone please confirm whether this IP pattern-matching bug can also occur during an upgrade from 4.3 to 4.4, due to the presence of the new etcd operator, or whether it is only related to initial etcd cluster creation? Thanks and regards.
@Pedro Amoedo, regarding comment 40, do you have an environment where the IP addresses may be configured as described? If yes, QE may run the test on your env. Thanks.
(In reply to ge liu from comment #44)
> @Pedro Amoedo, regarding comment 40, do you have an environment where the
> IP addresses may be configured as described? If yes, QE may run the test on
> your env. Thanks.

I'm sorry, but I don't have my own lab where I could reproduce this; the comment relates to a customer environment. If needed I can ask them to test a patch or extract some logs, just tell me. Regards.
I feel this bug does not need a reproduction env to verify; the logic is very straightforward. Can I simply write a test to prove the solution, Ge?
I am hitting what seems to be this same bug when provisioning a new 4.4.6 setup. If I can help by providing logs etc., let me know.
Hi all, can someone please confirm whether this bug is only present for new installations and does not affect upgrades from 4.3 to 4.4? Thanks.
I'm no longer hitting this issue with 4.3/4.4.

It appears that one way this issue occurs is if you have unreliable DNS during initial bootstrap. I've switched from RouterOS DNS to dnsmasq.

Secondly, I've made sure my bootstrap IP doesn't conflict with the master/worker IPs. For some reason a master of 10.1.1.3 would conflict with a bootstrap of 10.1.1.31.
> For some reason a master of 10.1.1.3 would conflict with a bootstrap of 10.1.1.31 That is the bug we're fixing here [1]. But the 4.4 backport is still in flight [2] (bug 1837152), and needs this bug to be VERIFIED in 4.5 before it can land. This bug does not address unreliable DNS. Does something there need to get spun out into a new bug? [1]: https://github.com/openshift/etcd/pull/48/files [2]: https://github.com/openshift/etcd/pull/49
(In reply to W. Trevor King from comment #52)
> > For some reason a master of 10.1.1.3 would conflict with a bootstrap of 10.1.1.31
>
> That is the bug we're fixing here [1]. But the 4.4 backport is still in
> flight [2] (bug 1837152), and needs this bug to be VERIFIED in 4.5 before it
> can land.
>
> This bug does not address unreliable DNS. Does something there need to get
> spun out into a new bug?
>
> [1]: https://github.com/openshift/etcd/pull/48/files
> [2]: https://github.com/openshift/etcd/pull/49

Hi Trevor, thanks for the PR and the 4.4 backport BZ links, appreciated. Apart from that, would it be possible to confirm whether this bug affects only the initial etcd cluster installation and not 4.3 to 4.4 upgrades? I suppose it shouldn't, because the bootstrap node is no longer present, but I don't know which procedure the new operator follows during an upgrade. Thanks.

Best Regards.
> Hi all, can someone please confirm if this bug is only present for new installations and not affecting upgrades from 4.3 to 4.4? thanks.

I believe this could be a problem with upgrades as well: essentially, any two master nodes with IP addresses that could defeat a contains check (one address being a prefix of the other) could cause etcd to become confused. This is a bug [1][2] and the fix needs to be merged. Ge, please verify ASAP.

[1] https://play.golang.org/p/ZTbgAlVBz0X
[2] https://github.com/openshift/etcd/blob/openshift-4.4/openshift-tools/pkg/discover-etcd-initial-cluster/initial-cluster.go#L244
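To illustrate the failure mode described in this thread, here is a minimal Go sketch (a hypothetical reconstruction for illustration, not the actual OpenShift code) of how a substring check on peer URLs can misattribute the bootstrap member to a master whose IP is a prefix of the bootstrap's IP, and how comparing the parsed host for equality avoids it:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// naiveMatch mimics the kind of substring check implicated in this bug:
// deciding whether a member's peer URL belongs to the local node by
// checking containment of the node's IP string.
func naiveMatch(peerURL, localIP string) bool {
	return strings.Contains(peerURL, localIP)
}

// strictMatch parses the URL and compares the host for equality,
// which cannot confuse 10.1.1.3 with 10.1.1.31.
func strictMatch(peerURL, localIP string) bool {
	u, err := url.Parse(peerURL)
	if err != nil {
		return false
	}
	return u.Hostname() == localIP
}

func main() {
	localIP := "10.1.1.3"                  // master etcd-1
	bootstrap := "https://10.1.1.31:2380"  // bootstrap peer URL shares the 10.1.1.3 prefix

	fmt.Println(naiveMatch(bootstrap, localIP))  // true: false positive against the bootstrap member
	fmt.Println(strictMatch(bootstrap, localIP)) // false: correct
}
```

This is consistent with the symptom in the logs above, where etcd-1 (10.1.10.3) selects target={name="etcd-bootstrap"} (10.1.10.31) instead of its own member entry.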
(In reply to Sam Batschelet from comment #54)
> I believe this could be a problem with upgrades as well: essentially, any
> two master nodes with IP addresses that could defeat a contains check
> could cause etcd to become confused.

Thanks for the confirmation Sam, appreciated!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409
The issue appears to be resolved. The main problem was a bad DNS resolver failing intermittently during build time.