Created attachment 1685559 [details]
Log bundle from bootstrap

Description of problem:
Bootstrap of 3 bare metal converged master/worker nodes via UPI fails with an etcd Degraded error.

Version-Release number of the following components:
openshift-installer 4.4.3
oc 4.4.3

How reproducible:
Consistent

Steps to Reproduce:
1. Environment has correct DNS and SRV records and has been previously used to deploy OCP 4.3.x UPI
2. openshift-install create ignition-configs --dir=baremetal
3. openshift-install --dir=baremetal wait-for bootstrap-complete --log-level=info

Actual results:
INFO Waiting up to 20m0s for the Kubernetes API at https://api.test.bionode.io:6443...
INFO API v1.17.1 up
INFO Waiting up to 40m0s for bootstrapping to complete...
ERROR Cluster operator etcd Degraded is True with StaticPods_Error: StaticPodsDegraded: nodes/nuc3.redpill.nz pods/etcd-nuc3.redpill.nz container="etcd" is not ready
StaticPodsDegraded: nodes/nuc3.redpill.nz pods/etcd-nuc3.redpill.nz container="etcd" is waiting: "CrashLoopBackOff" - "back-off 5m0s restarting failed container=etcd pod=etcd-nuc3.redpill.nz_openshift-etcd(b41045a04c0dabe833895029ccac2a37)"
StaticPodsDegraded: pods "etcd-nuc2.redpill.nz" not found
INFO Cluster operator etcd Progressing is True with EtcdMembers_MembersNotStarted::NodeInstaller: NodeInstallerProgressing: 2 nodes are at revision 0; 1 nodes are at revision 2
EtcdMembersProgressing: members have not started yet
INFO Cluster operator insights Disabled is False with :
INFO Use the following commands to gather logs from the cluster
INFO openshift-install gather bootstrap --help
FATAL failed to wait for bootstrapping to complete: timed out waiting for the condition

Expected results:
Bootstrap completes with all three etcd members running.

Additional info:
openshift-install gather bootstrap --bootstrap 10.1.10.31 --master 10.1.10.2
INFO Pulling debug logs from the bootstrap machine
INFO Bootstrap gather logs captured here "log-bundle-20200506183622.tar.gz"
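For reference, this is roughly how I've been unpacking and inspecting the gathered bundle (a sketch only; apart from resources/pods.json, which is referenced later in this bug, the directory names are my assumption based on the standard gather layout):

$ tar -xzf log-bundle-20200506183622.tar.gz
$ less resources/pods.json      (static pod definitions and last container states)
$ ls bootstrap/journals/        (bootstrap service logs, if present in the bundle)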
oc get clusteroperators
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.3     True        False         False      30m
cloud-credential                           4.4.3     True        False         False      52m
cluster-autoscaler                         4.4.3     True        False         False      38m
console                                    4.4.3     True        False         False      33m
csi-snapshot-controller                    4.4.3     True        False         False      39m
dns                                        4.4.3     True        False         False      43m
etcd                                       4.4.3     True        True          True       31m
image-registry                             4.4.3     True        False         False      39m
ingress                                    4.4.3     True        False         False      38m
insights                                   4.4.3     True        False         False      39m
kube-apiserver                             4.4.3     True        False         False      42m
kube-controller-manager                    4.4.3     True        False         False      42m
kube-scheduler                             4.4.3     True        False         False      41m
kube-storage-version-migrator              4.4.3     True        False         False      44m
machine-api                                4.4.3     True        False         False      44m
machine-config                             4.4.3     True        False         False      42m
marketplace                                4.4.3     True        False         False      39m
monitoring                                 4.4.3     True        False         False      32m
network                                    4.4.3     True        False         False      43m
node-tuning                                4.4.3     True        False         False      46m
openshift-apiserver                        4.4.3     True        False         False      38m
openshift-controller-manager               4.4.3     True        False         False      39m
openshift-samples                          4.4.3     True        False         False      38m
operator-lifecycle-manager                 4.4.3     True        False         False      44m
operator-lifecycle-manager-catalog         4.4.3     True        False         False      44m
operator-lifecycle-manager-packageserver   4.4.3     True        False         False      38m
service-ca                                 4.4.3     True        False         False      45m
service-catalog-apiserver                  4.4.3     True        False         False      46m
service-catalog-controller-manager         4.4.3     True        False         False      46m
storage                                    4.4.3     True        False         False      39m
Looking at my environment, I currently only have etcd running on one master:

[root@nuc4 core]# crictl ps | grep etcd
23e64b86e6d26   add8db87608dca5020e25b71cd0bdd6a5f9b017353b4d0af91238eada0343b69   47 minutes ago   Running   etcd           2   5e547808bbb31
437b2c821c84e   add8db87608dca5020e25b71cd0bdd6a5f9b017353b4d0af91238eada0343b69   48 minutes ago   Running   etcd-metrics   0   5e547808bbb31
76e3c178a3aba   add8db87608dca5020e25b71cd0bdd6a5f9b017353b4d0af91238eada0343b69   48 minutes ago   Running   etcdctl        0   5e547808bbb31

[root@nuc3 core]# crictl ps | grep etcd
56c5b79da135b   add8db87608dca5020e25b71cd0bdd6a5f9b017353b4d0af91238eada0343b69   46 minutes ago   Running   etcd-metrics   0   7dfc95f6e2ad4
4db2c8d7fe981   add8db87608dca5020e25b71cd0bdd6a5f9b017353b4d0af91238eada0343b69   46 minutes ago   Running   etcdctl        0   7dfc95f6e2ad4

[root@nuc2 core]# crictl ps | grep etcd
[root@nuc2 core]#

[root@bootstrap core]# crictl ps | grep etcd
fb3b8da2e8dc3   add8db87608dca5020e25b71cd0bdd6a5f9b017353b4d0af91238eada0343b69   58 minutes ago   Running   etcd-metrics   0   c74d25add80a1
3f154521529f2   quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a8f9978516adb30da807b5b30551348223827419ad0666905a6f8792bf51462c   58 minutes ago   Running   etcd-member    0   c74d25add80a1
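If it helps, the exited etcd container's output can presumably also be pulled directly on nuc3 with crictl, since crash-looped containers drop out of the plain crictl ps listing (a sketch; the container ID is a placeholder to fill in from the -a listing):

[root@nuc3 core]# crictl ps -a --name etcd --state exited
[root@nuc3 core]# crictl logs --tail 50 <exited-etcd-container-id>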
oc project openshift-etcd
oc get pods
NAME                          READY   STATUS             RESTARTS   AGE
etcd-nuc3.redpill.nz          2/3     CrashLoopBackOff   14         49m
etcd-nuc4.redpill.nz          3/3     Running            2          51m
installer-2-nuc3.redpill.nz   0/1     Completed          0          49m
installer-2-nuc4.redpill.nz   0/1     Completed          0          51m
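The previous crash log for the failing container should also be retrievable via the API, something like the following (the container name "etcd" is taken from the Degraded message above):

$ oc logs -n openshift-etcd etcd-nuc3.redpill.nz -c etcd --previous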
oc get csr
NAME        AGE   REQUESTOR                                                                   CONDITION
csr-5cgc5   65m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-5vhbm   68m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-b9bgd   65m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-gsxvz   67m   system:node:nuc4.redpill.nz                                                 Approved,Issued
csr-rv5vk   65m   system:node:nuc2.redpill.nz                                                 Approved,Issued
csr-w6mkx   65m   system:node:nuc3.redpill.nz                                                 Approved,Issued

oc get nodes
NAME              STATUS   ROLES           AGE   VERSION
nuc2.redpill.nz   Ready    master,worker   65m   v1.17.1
nuc3.redpill.nz   Ready    master,worker   65m   v1.17.1
nuc4.redpill.nz   Ready    master,worker   67m   v1.17.1

oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.3     True        False         49m     Cluster version is 4.4.3
oc get pods --all-namespaces | grep etcd
openshift-etcd-operator             etcd-operator-59cf47554b-nr9qb       1/1   Running            1    72m
openshift-etcd                      etcd-nuc3.redpill.nz                 2/3   CrashLoopBackOff   17   10m
openshift-etcd                      etcd-nuc4.redpill.nz                 3/3   Running            2    65m
openshift-etcd                      installer-2-nuc3.redpill.nz          0/1   Completed          0    64m
openshift-etcd                      installer-2-nuc4.redpill.nz          0/1   Completed          0    65m
openshift-machine-config-operator   etcd-quorum-guard-58d794d79f-4k2ss   0/1   Running            0    60m
openshift-machine-config-operator   etcd-quorum-guard-58d794d79f-cb9fw   0/1   Running            0    60m
openshift-machine-config-operator   etcd-quorum-guard-58d794d79f-qfmnx   1/1   Running            0    60m
`oc adm must-gather` would be useful to debug this, since we have the apiserver up. If that does not work, can we get some details on the failed pod and the operator logs:

$ oc describe pods -n openshift-etcd etcd-nuc3.redpill.nz
$ oc get pods -n openshift-etcd etcd-nuc3.redpill.nz -o json
$ oc logs -n openshift-etcd-operator etcd-operator-59cf47554b-nr9qb
Events would be useful as well here to triage, assuming must-gather fails:

$ oc get events -A -o json &> events.json
From ./resources/pods.json in the attached log-bundle, etcd-nuc3.redpill.nz's etcd container died with:

2020-05-06 06:32:02.474958 W | pkg/flags: unrecognized environment variable ETCD_SERVICE_PORT_ETCD=2379
2020-05-06 06:32:02.474961 W | pkg/flags: unrecognized environment variable ETCD_SERVICE_PORT_ETCD_METRICS=9979
2020-05-06 06:32:02.474966 W | pkg/flags: unrecognized environment variable ETCD_PORT_2379_TCP_ADDR=172.30.205.33
2020-05-06 06:32:02.474968 W | pkg/flags: unrecognized environment variable ETCD_SERVICE_PORT=2379
2020-05-06 06:32:02.474972 W | pkg/flags: unrecognized environment variable ETCD_PORT_9979_TCP_PROTO=tcp
2020-05-06 06:32:02.474974 W | pkg/flags: unrecognized environment variable ETCD_PORT_2379_TCP=tcp://172.30.205.33:2379
2020-05-06 06:32:02.474977 W | pkg/flags: unrecognized environment variable ETCD_PORT=tcp://172.30.205.33:2379
2020-05-06 06:32:02.474980 W | pkg/flags: unrecognized environment variable ETCD_SERVICE_HOST=172.30.205.33
2020-05-06 06:32:02.474995 I | etcdmain: etcd Version: 3.3.18
2020-05-06 06:32:02.475001 I | etcdmain: Git SHA: c0157a9
2020-05-06 06:32:02.475011 I | etcdmain: Go Version: go1.13.4
2020-05-06 06:32:02.475014 I | etcdmain: Go OS/Arch: linux/amd64
2020-05-06 06:32:02.475017 I | etcdmain: setting maximum number of CPUs to 8, total number of available CPUs is 8
2020-05-06 06:32:02.475078 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2020-05-06 06:32:02.475091 I | embed: peerTLS: cert = /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-nuc3.redpill.nz.crt, key = /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-nuc3.redpill.nz.key, ca = , trusted-ca = /etc/kubernetes/static-pod-certs/configmaps/etcd-peer-client-ca/ca-bundle.crt, client-cert-auth = true, crl-file =
2020-05-06 06:32:02.475526 I | embed: listening for peers on https://0.0.0.0:2380
2020-05-06 06:32:02.475597 I | embed: listening for client requests on 0.0.0.0:2379
2020-05-06 06:32:02.478570 C | etcdmain: couldn't find local name "nuc3.redpill.nz" in the initial cluster configuration
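Given that failure, it may be worth cross-checking what was rendered as the initial cluster on nuc3 against what etcd itself has as members. A rough sketch only: the static pod manifest path is my assumption for the 4.4 layout, and the etcdctl sidecar is the one shown in the crictl output earlier in this bug:

[root@nuc3 core]# grep -i "initial-cluster" /etc/kubernetes/manifests/etcd-pod.yaml

$ oc rsh -n openshift-etcd -c etcdctl etcd-nuc4.redpill.nz etcdctl member list -w table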
Searching for "in the initial cluster configuration" turns up bug 1814576, which looks very similar. Closing this one as a dup, but feel free to reopen if I'm misunderstanding.

*** This bug has been marked as a duplicate of bug 1814576 ***
I think these are different, as I've just had the same issue deploying 4.3.18. I can deploy 4.3.9 without any issues, but it looks like with 4.3.18 the install isn't using any SRV records. I'm running my DNS server in debug mode so I can see requests, and no SRV records are being requested. I'll upload the log-bundle from the failed install.

I can't run must-gather on the 4.3.18 install as I can't interact with the master etcd instance.
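For reference, this is how I'm checking from a client whether the SRV records resolve at all (the record name follows the documented _etcd-server-ssl._tcp.<cluster>.<domain> convention; the domain here is taken from the 4.4 API URL in the original report and the DNS server address is a placeholder):

$ dig +short SRV _etcd-server-ssl._tcp.test.bionode.io
$ dig +short SRV _etcd-server-ssl._tcp.test.bionode.io @<dns-server-ip>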
Created attachment 1686132 [details] installer log bundle from ocp 4.3.18
I've had a different issue with UPI and OCP 4.3.15, documented under https://bugzilla.redhat.com/show_bug.cgi?id=1833160
I've now managed to get OCP 4.3.19 to install on bare metal with all 3 nodes, and I suspect my OCP 4.3 issues are different from the 4.4 ones. Moving back to 4.4 testing.
Created attachment 1687128 [details]
New bootstrap log bundle from today's testing

Bootstrap failed again with 4.4.3. The cluster came up and the failure is consistent, but bootstrap did not complete. Since bootstrap hasn't finished, I don't have a consistent etcd:

oc get pods -n openshift-etcd
NAME                          READY   STATUS             RESTARTS   AGE
etcd-nuc2.redpill.nz          3/3     Running            0          65m
etcd-nuc3.redpill.nz          2/3     CrashLoopBackOff   17         64m
etcd-nuc4.redpill.nz          3/3     Running            4          68m
installer-2-nuc2.redpill.nz   0/1     Completed          0          65m
installer-2-nuc3.redpill.nz   0/1     Completed          0          64m
installer-2-nuc4.redpill.nz   0/1     Completed          0          68m
Looks like this is a duplicate based on the latest build.

*** This bug has been marked as a duplicate of bug 1814576 ***