Bug 1832120
| Summary: | OCP 4.4 UPI bare metal installation bootstrap etcd Degraded | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Steven Ellis <sellis> |
| Component: | Etcd Operator | Assignee: | Sam Batschelet <sbatsche> |
| Status: | CLOSED DUPLICATE | QA Contact: | ge liu <geliu> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.4 | CC: | mrhodes, wking |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-05-11 05:36:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |

Description
Steven Ellis
2020-05-06 06:43:28 UTC
oc get clusteroperators
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.3     True        False         False      30m
cloud-credential                           4.4.3     True        False         False      52m
cluster-autoscaler                         4.4.3     True        False         False      38m
console                                    4.4.3     True        False         False      33m
csi-snapshot-controller                    4.4.3     True        False         False      39m
dns                                        4.4.3     True        False         False      43m
etcd                                       4.4.3     True        True          True       31m
image-registry                             4.4.3     True        False         False      39m
ingress                                    4.4.3     True        False         False      38m
insights                                   4.4.3     True        False         False      39m
kube-apiserver                             4.4.3     True        False         False      42m
kube-controller-manager                    4.4.3     True        False         False      42m
kube-scheduler                             4.4.3     True        False         False      41m
kube-storage-version-migrator              4.4.3     True        False         False      44m
machine-api                                4.4.3     True        False         False      44m
machine-config                             4.4.3     True        False         False      42m
marketplace                                4.4.3     True        False         False      39m
monitoring                                 4.4.3     True        False         False      32m
network                                    4.4.3     True        False         False      43m
node-tuning                                4.4.3     True        False         False      46m
openshift-apiserver                        4.4.3     True        False         False      38m
openshift-controller-manager               4.4.3     True        False         False      39m
openshift-samples                          4.4.3     True        False         False      38m
operator-lifecycle-manager                 4.4.3     True        False         False      44m
operator-lifecycle-manager-catalog         4.4.3     True        False         False      44m
operator-lifecycle-manager-packageserver   4.4.3     True        False         False      38m
service-ca                                 4.4.3     True        False         False      45m
service-catalog-apiserver                  4.4.3     True        False         False      46m
service-catalog-controller-manager         4.4.3     True        False         False      46m
storage                                    4.4.3     True        False         False      39m

Looking at my environment I currently only have etcd running on one master

[root@nuc4 core]# crictl ps | grep etcd
23e64b86e6d26  add8db87608dca5020e25b71cd0bdd6a5f9b017353b4d0af91238eada0343b69  47 minutes ago  Running  etcd          2  5e547808bbb31
437b2c821c84e  add8db87608dca5020e25b71cd0bdd6a5f9b017353b4d0af91238eada0343b69  48 minutes ago  Running  etcd-metrics  0  5e547808bbb31
76e3c178a3aba  add8db87608dca5020e25b71cd0bdd6a5f9b017353b4d0af91238eada0343b69  48 minutes ago  Running  etcdctl       0  5e547808bbb31

[root@nuc3 core]# crictl ps | grep etcd
56c5b79da135b  add8db87608dca5020e25b71cd0bdd6a5f9b017353b4d0af91238eada0343b69  46 minutes ago  Running  etcd-metrics  0  7dfc95f6e2ad4
4db2c8d7fe981  add8db87608dca5020e25b71cd0bdd6a5f9b017353b4d0af91238eada0343b69  46 minutes ago  Running  etcdctl       0  7dfc95f6e2ad4

crictl ps | grep etcd
[root@nuc2 core]#

[root@bootstrap core]# crictl ps | grep etcd
fb3b8da2e8dc3  add8db87608dca5020e25b71cd0bdd6a5f9b017353b4d0af91238eada0343b69  58 minutes ago  Running  etcd-metrics  0  c74d25add80a1
3f154521529f2  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a8f9978516adb30da807b5b30551348223827419ad0666905a6f8792bf51462c  58 minutes ago  Running  etcd-member  0  c74d25add80a1

oc project openshift-etcd
oc get pods
NAME                          READY   STATUS             RESTARTS   AGE
etcd-nuc3.redpill.nz          2/3     CrashLoopBackOff   14         49m
etcd-nuc4.redpill.nz          3/3     Running            2          51m
installer-2-nuc3.redpill.nz   0/1     Completed          0          49m
installer-2-nuc4.redpill.nz   0/1     Completed          0          51m

oc get csr
NAME        AGE   REQUESTOR                                                                   CONDITION
csr-5cgc5   65m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-5vhbm   68m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-b9bgd   65m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-gsxvz   67m   system:node:nuc4.redpill.nz                                                 Approved,Issued
csr-rv5vk   65m   system:node:nuc2.redpill.nz                                                 Approved,Issued
csr-w6mkx   65m   system:node:nuc3.redpill.nz                                                 Approved,Issued

oc get nodes
NAME              STATUS   ROLES           AGE   VERSION
nuc2.redpill.nz   Ready    master,worker   65m   v1.17.1
nuc3.redpill.nz   Ready    master,worker   65m   v1.17.1
nuc4.redpill.nz   Ready    master,worker   67m   v1.17.1

oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.3     True        False         49m     Cluster version is 4.4.3

oc get pods --all-namespaces | grep etcd
openshift-etcd-operator             etcd-operator-59cf47554b-nr9qb       1/1   Running            1    72m
openshift-etcd                      etcd-nuc3.redpill.nz                 2/3   CrashLoopBackOff   17   10m
openshift-etcd                      etcd-nuc4.redpill.nz                 3/3   Running            2    65m
openshift-etcd                      installer-2-nuc3.redpill.nz          0/1   Completed          0    64m
openshift-etcd                      installer-2-nuc4.redpill.nz          0/1   Completed          0    65m
openshift-machine-config-operator   etcd-quorum-guard-58d794d79f-4k2ss   0/1   Running            0    60m
openshift-machine-config-operator   etcd-quorum-guard-58d794d79f-cb9fw   0/1   Running            0    60m
openshift-machine-config-operator   etcd-quorum-guard-58d794d79f-qfmnx   1/1   Running            0    60m

`oc adm must-gather` would be useful to debug this, since we have the apiserver up. If that does not work, can we get some details on the failed pod and the operator logs?

###
$ oc describe pods -n openshift-etcd etcd-nuc3.redpill.nz
$ oc get pods -n openshift-etcd etcd-nuc3.redpill.nz -o json
$ oc logs -n openshift-etcd-operator etcd-operator-59cf47554b-nr9qb

Events would also be useful here for triage, assuming must-gather fails.

###
$ oc get events -A -o json &> events.json

From ./resources/pods.json in the attached log-bundle, etcd-nuc3.redpill.nz's etcd container died with:

2020-05-06 06:32:02.474958 W | pkg/flags: unrecognized environment variable ETCD_SERVICE_PORT_ETCD=2379
2020-05-06 06:32:02.474961 W | pkg/flags: unrecognized environment variable ETCD_SERVICE_PORT_ETCD_METRICS=9979
2020-05-06 06:32:02.474966 W | pkg/flags: unrecognized environment variable ETCD_PORT_2379_TCP_ADDR=172.30.205.33
2020-05-06 06:32:02.474968 W | pkg/flags: unrecognized environment variable ETCD_SERVICE_PORT=2379
2020-05-06 06:32:02.474972 W | pkg/flags: unrecognized environment variable ETCD_PORT_9979_TCP_PROTO=tcp
2020-05-06 06:32:02.474974 W | pkg/flags: unrecognized environment variable ETCD_PORT_2379_TCP=tcp://172.30.205.33:2379
2020-05-06 06:32:02.474977 W | pkg/flags: unrecognized environment variable ETCD_PORT=tcp://172.30.205.33:2379
2020-05-06 06:32:02.474980 W | pkg/flags: unrecognized environment variable ETCD_SERVICE_HOST=172.30.205.33
2020-05-06 06:32:02.474995 I | etcdmain: etcd Version: 3.3.18
2020-05-06 06:32:02.475001 I | etcdmain: Git SHA: c0157a9
2020-05-06 06:32:02.475011 I | etcdmain: Go Version: go1.13.4
2020-05-06 06:32:02.475014 I | etcdmain: Go OS/Arch: linux/amd64
2020-05-06 06:32:02.475017 I | etcdmain: setting maximum number of CPUs to 8, total number of available CPUs is 8
2020-05-06 06:32:02.475078 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2020-05-06 06:32:02.475091 I | embed: peerTLS: cert = /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-nuc3.redpill.nz.crt, key = /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-nuc3.redpill.nz.key, ca = , trusted-ca = /etc/kubernetes/static-pod-certs/configmaps/etcd-peer-client-ca/ca-bundle.crt, client-cert-auth = true, crl-file =
2020-05-06 06:32:02.475526 I | embed: listening for peers on https://0.0.0.0:2380
2020-05-06 06:32:02.475597 I | embed: listening for client requests on 0.0.0.0:2379
2020-05-06 06:32:02.478570 C | etcdmain: couldn't find local name "nuc3.redpill.nz" in the initial cluster configuration

Searching for "in the initial cluster configuration" turns up bug 1814576, which looks very similar. Closing this one as a dup, but feel free to reopen if I'm misunderstanding.

*** This bug has been marked as a duplicate of bug 1814576 ***

I think these are different, as I've just had the same issue deploying 4.3.18.
I can deploy 4.3.9 without any issues, but it looks like with 4.3.18 the install isn't using any SRV records.
I'm running my DNS server in debug mode so I can see requests, and no SRV records are being requested.
I'll upload the log-bundle from the failed install.

Can't run must-gather on the 4.3.18 install as I can't interact with the master etcd instance.

Created attachment 1686132 [details]
installer log bundle from ocp 4.3.18
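
The SRV observation above can also be checked by hand against the DNS server. A minimal sketch, assuming the `_etcd-server-ssl._tcp.<cluster_name>.<base_domain>` record name listed in the OCP 4.3 user-provisioned DNS requirements; `mycluster` and `example.com` are placeholders (the real cluster name is not given in this bug), and the helper names `etcd_srv_name` and `check_etcd_srv` are hypothetical:

```shell
#!/bin/sh
# Build the etcd SRV record name (assumed format; arguments are placeholders).
etcd_srv_name() {
    printf '_etcd-server-ssl._tcp.%s.%s\n' "$1" "$2"
}

# Ask the resolver for that record; needs dig and a reachable DNS server.
check_etcd_srv() {
    dig +short SRV "$(etcd_srv_name "$1" "$2")"
}

# Show the name that would be queried.
etcd_srv_name mycluster example.com
```

Running `check_etcd_srv mycluster example.com` against the install's DNS server should print one line per etcd member if the records exist, and nothing if they do not.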
I've had a different issue with UPI and OCP 4.3.15, documented under https://bugzilla.redhat.com/show_bug.cgi?id=1833160
I've now managed to get OCP 4.3.19 to install on bare metal with all 3 nodes, and I suspect my OCP 4.3 issues are different from 4.4.
Moving back to 4.4 testing.

Created attachment 1687128 [details]
New bootstrap log bundle from today's testing
Bootstrap failed again with 4.4.3.
The cluster came up and is consistent, but bootstrap failed.
As bootstrap hasn't finished, I don't have a consistent etcd:
oc get pods -n openshift-etcd
NAME READY STATUS RESTARTS AGE
etcd-nuc2.redpill.nz 3/3 Running 0 65m
etcd-nuc3.redpill.nz 2/3 CrashLoopBackOff 17 64m
etcd-nuc4.redpill.nz 3/3 Running 4 68m
installer-2-nuc2.redpill.nz 0/1 Completed 0 65m
installer-2-nuc3.redpill.nz 0/1 Completed 0 64m
installer-2-nuc4.redpill.nz 0/1 Completed 0 68m
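
A listing like the one above can be narrowed to the pods that need attention. A minimal sketch, assuming the column layout shown (NAME, READY, STATUS, RESTARTS, AGE); the function name `unready_pods` is hypothetical:

```shell
#!/bin/sh
# Print name and status of pods whose READY count is short,
# skipping Completed installer pods. Reads `oc get pods` output on stdin.
unready_pods() {
    awk 'NR > 1 && $3 != "Completed" {
        split($2, ready, "/")
        if (ready[1] != ready[2]) print $1, $3
    }'
}
```

Usage would be `oc get pods -n openshift-etcd | unready_pods`.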
Looks like this is a duplicate based on the latest build.

*** This bug has been marked as a duplicate of bug 1814576 ***
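
For reference, the fatal `couldn't find local name ... in the initial cluster configuration` message quoted in this bug is what etcd reports when the member's own name is missing from its initial cluster list. A minimal sketch of that check, assuming the usual comma-separated `name=peerURL` format of etcd's `--initial-cluster` setting; the function name and the peer URLs used with it are hypothetical:

```shell
#!/bin/sh
# Succeed if member name $1 appears in the name=peerURL list $2.
name_in_initial_cluster() {
    case ",$2" in
        *",$1="*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `name_in_initial_cluster nuc3.redpill.nz "$ETCD_INITIAL_CLUSTER"` would fail on a member whose list names only the other nodes.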