1832120 – OCP 4.4 UPI bare metal installation bootstrap etcd Degraded

Status:	CLOSED DUPLICATE of bug 1814576
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd Operator
Version:	4.4
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Sam Batschelet
QA Contact:	ge liu

Reported:	2020-05-06 06:43 UTC by Steven Ellis
Modified:	2020-05-11 05:36 UTC
CC List:	2 users
Last Closed:	2020-05-11 05:36:01 UTC

Attachments
Log bundle from bootstrap (6.61 MB, application/gzip) 2020-05-06 06:43 UTC, Steven Ellis
installer log bundle from ocp 4.3.18 (13.99 MB, application/gzip) 2020-05-07 11:14 UTC, Steven Ellis
New bootstrap log bundle from today's testing (6.99 MB, application/gzip) 2020-05-11 03:25 UTC, Steven Ellis

Description Steven Ellis 2020-05-06 06:43:28 UTC

Created attachment 1685559 [details]
Log bundle from bootstrap

Description of problem:

Bootstrap of 3 bare metal converged master/worker nodes via UPI fails with an etcd Degraded error

Version-Release number of the following components:

openshift-installer 4.4.3
oc 4.4.3

How reproducible:

Consistent

Steps to Reproduce:
1. Environment has correct DNS and SRV records and has been previously used to deploy OCP 4.3.x UPI
2. openshift-install create ignition-configs --dir=baremetal
3. openshift-install --dir=baremetal wait-for bootstrap-complete --log-level=info
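
Step 1 relies on the etcd SRV records being correct. As a minimal sketch, the helper below builds the SRV name that the OCP 4.3 UPI documentation asks for (`_etcd-server-ssl._tcp.<cluster_name>.<base_domain>`) so it can be looked up by hand. The function name `etcd_srv_name` is hypothetical, and the cluster domain `test.bionode.io` is an assumption inferred from the API URL in the installer output, not something confirmed in this report.

```shell
# etcd_srv_name CLUSTER_DOMAIN
# Print the etcd SRV record name for <cluster_name>.<base_domain>.
# Hypothetical helper; it only builds the name and does no DNS lookup.
etcd_srv_name() {
  printf '_etcd-server-ssl._tcp.%s\n' "$1"
}

# Example lookup (needs dig and access to the cluster's DNS server).
# "test.bionode.io" is an assumed cluster domain:
#   dig +short SRV "$(etcd_srv_name test.bionode.io)"
```

Keeping the name construction separate from the lookup makes it easy to compare the expected name against what a DNS server in debug mode actually logs.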


Actual results:

INFO Waiting up to 20m0s for the Kubernetes API at https://api.test.bionode.io:6443... 
INFO API v1.17.1 up                               
INFO Waiting up to 40m0s for bootstrapping to complete... 
ERROR Cluster operator etcd Degraded is True with StaticPods_Error: StaticPodsDegraded: nodes/nuc3.redpill.nz pods/etcd-nuc3.redpill.nz container="etcd" is not ready
StaticPodsDegraded: nodes/nuc3.redpill.nz pods/etcd-nuc3.redpill.nz container="etcd" is waiting: "CrashLoopBackOff" - "back-off 5m0s restarting failed container=etcd pod=etcd-nuc3.redpill.nz_openshift-etcd(b41045a04c0dabe833895029ccac2a37)"
StaticPodsDegraded: pods "etcd-nuc2.redpill.nz" not found 
INFO Cluster operator etcd Progressing is True with EtcdMembers_MembersNotStarted::NodeInstaller: NodeInstallerProgressing: 2 nodes are at revision 0; 1 nodes are at revision 2
EtcdMembersProgressing:  members have not started yet 
INFO Cluster operator insights Disabled is False with :  
INFO Use the following commands to gather logs from the cluster 
INFO openshift-install gather bootstrap --help    
FATAL failed to wait for bootstrapping to complete: timed out waiting for the condition 
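
The installer output above mixes the degraded operator and its reason into one long line. A small sketch like the following could pull out just the operator name and reason; the function name `degraded_reasons` and the file name `install-output.txt` are hypothetical, and the pattern only assumes the "Cluster operator X Degraded is True with REASON:" wording seen in this report.

```shell
# degraded_reasons
# Read openshift-install output on stdin and print "<operator> <reason>"
# for every line that reports a Degraded=True condition.
degraded_reasons() {
  sed -n 's/.*Cluster operator \([^ ]*\) Degraded is True with \([^:]*\):.*/\1 \2/p'
}

# Usage, assuming the console output was saved to a file:
#   degraded_reasons < install-output.txt
```

Against the output in this report it would print `etcd StaticPods_Error`.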


Expected results:

Additional info:

openshift-install gather bootstrap --bootstrap 10.1.10.31  --master 10.1.10.2
INFO Pulling debug logs from the bootstrap machine 
INFO Bootstrap gather logs captured here "log-bundle-20200506183622.tar.gz"

Comment 1 Steven Ellis 2020-05-06 06:44:15 UTC

oc get clusteroperators
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.3     True        False         False      30m
cloud-credential                           4.4.3     True        False         False      52m
cluster-autoscaler                         4.4.3     True        False         False      38m
console                                    4.4.3     True        False         False      33m
csi-snapshot-controller                    4.4.3     True        False         False      39m
dns                                        4.4.3     True        False         False      43m
etcd                                       4.4.3     True        True          True       31m
image-registry                             4.4.3     True        False         False      39m
ingress                                    4.4.3     True        False         False      38m
insights                                   4.4.3     True        False         False      39m
kube-apiserver                             4.4.3     True        False         False      42m
kube-controller-manager                    4.4.3     True        False         False      42m
kube-scheduler                             4.4.3     True        False         False      41m
kube-storage-version-migrator              4.4.3     True        False         False      44m
machine-api                                4.4.3     True        False         False      44m
machine-config                             4.4.3     True        False         False      42m
marketplace                                4.4.3     True        False         False      39m
monitoring                                 4.4.3     True        False         False      32m
network                                    4.4.3     True        False         False      43m
node-tuning                                4.4.3     True        False         False      46m
openshift-apiserver                        4.4.3     True        False         False      38m
openshift-controller-manager               4.4.3     True        False         False      39m
openshift-samples                          4.4.3     True        False         False      38m
operator-lifecycle-manager                 4.4.3     True        False         False      44m
operator-lifecycle-manager-catalog         4.4.3     True        False         False      44m
operator-lifecycle-manager-packageserver   4.4.3     True        False         False      38m
service-ca                                 4.4.3     True        False         False      45m
service-catalog-apiserver                  4.4.3     True        False         False      46m
service-catalog-controller-manager         4.4.3     True        False         False      46m
storage                                    4.4.3     True        False         False      39m
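
With this many operators, the one degraded row is easy to miss. As a sketch, the filter below prints only the operators whose DEGRADED column is True. The function name `degraded_operators` is hypothetical, and it assumes the six-column layout shown above with every column populated (an operator with an empty VERSION would shift the columns).

```shell
# degraded_operators
# Read `oc get clusteroperators` output on stdin and print the NAME of
# every operator whose DEGRADED column (5th field) is "True".
# NR > 1 skips the header row.
degraded_operators() {
  awk 'NR > 1 && $5 == "True" { print $1 }'
}

# Usage:
#   oc get clusteroperators | degraded_operators
```

For the listing in this comment the only name printed would be `etcd`.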

Comment 2 Steven Ellis 2020-05-06 06:49:29 UTC

Looking at my environment I currently only have etcd running on one master 

[root@nuc4 core]# crictl ps | grep etcd
23e64b86e6d26       add8db87608dca5020e25b71cd0bdd6a5f9b017353b4d0af91238eada0343b69                                                         47 minutes ago      Running             etcd                                          2                   5e547808bbb31
437b2c821c84e       add8db87608dca5020e25b71cd0bdd6a5f9b017353b4d0af91238eada0343b69                                                         48 minutes ago      Running             etcd-metrics                                  0                   5e547808bbb31
76e3c178a3aba       add8db87608dca5020e25b71cd0bdd6a5f9b017353b4d0af91238eada0343b69                                                         48 minutes ago      Running             etcdctl                                       0                   5e547808bbb31


[root@nuc3 core]# crictl ps | grep etcd
56c5b79da135b       add8db87608dca5020e25b71cd0bdd6a5f9b017353b4d0af91238eada0343b69                                                         46 minutes ago      Running             etcd-metrics                                  0                   7dfc95f6e2ad4
4db2c8d7fe981       add8db87608dca5020e25b71cd0bdd6a5f9b017353b4d0af91238eada0343b69                                                         46 minutes ago      Running             etcdctl                                       0                   7dfc95f6e2ad4

[root@nuc2 core]# crictl ps | grep etcd

[root@bootstrap core]# crictl ps | grep etcd
fb3b8da2e8dc3       add8db87608dca5020e25b71cd0bdd6a5f9b017353b4d0af91238eada0343b69                                                         58 minutes ago      Running             etcd-metrics        0                   c74d25add80a1
3f154521529f2       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a8f9978516adb30da807b5b30551348223827419ad0666905a6f8792bf51462c   58 minutes ago      Running             etcd-member         0                   c74d25add80a1
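
Note that `grep etcd` also matches the `etcd-metrics` and `etcdctl` containers, so a node can show output without the `etcd` container itself running. A sketch of a stricter check follows; the function name `has_running_container` is hypothetical, and it assumes the `crictl ps` column layout shown in this comment, where NAME is the third field from the end and STATE the fourth.

```shell
# has_running_container NAME
# Read `crictl ps` output on stdin. Succeed (exit 0) only if some row has
# a NAME column equal to NAME and a STATE column equal to "Running".
# Fields are counted from the end because CREATED ("47 minutes ago")
# spans a variable number of words.
has_running_container() {
  awk -v want="$1" '
    NF >= 4 && $(NF-2) == want && $(NF-3) == "Running" { found = 1 }
    END { exit !found }
  '
}

# Usage on a node:
#   crictl ps | has_running_container etcd && echo "etcd is running"
```

On the nuc4 output above the check would succeed, while on nuc3 it would fail because only `etcd-metrics` and `etcdctl` are listed.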

Comment 3 Steven Ellis 2020-05-06 06:51:01 UTC

oc project openshift-etcd

oc get pods
NAME                          READY   STATUS             RESTARTS   AGE
etcd-nuc3.redpill.nz          2/3     CrashLoopBackOff   14         49m
etcd-nuc4.redpill.nz          3/3     Running            2          51m
installer-2-nuc3.redpill.nz   0/1     Completed          0          49m
installer-2-nuc4.redpill.nz   0/1     Completed          0          51m
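
To single out the problem pod from a listing like this, a sketch such as the one below compares the two halves of the READY column and ignores pods that have run to completion. The function name `unready_pods` is hypothetical, and it assumes the five-column output of `oc get pods` within one namespace (no NAMESPACE column).

```shell
# unready_pods
# Read `oc get pods` output on stdin and print NAME, READY and STATUS for
# pods whose ready count differs from the container count. Completed pods
# (such as the installer pods) are skipped.
unready_pods() {
  awk 'NR > 1 && $3 != "Completed" {
    split($2, ready, "/")
    if (ready[1] != ready[2]) print $1, $2, $3
  }'
}

# Usage:
#   oc get pods -n openshift-etcd | unready_pods
```

For the listing above it would print only the `etcd-nuc3.redpill.nz` row.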

Comment 4 Steven Ellis 2020-05-06 07:03:05 UTC

oc get csr
NAME        AGE   REQUESTOR                                                                   CONDITION
csr-5cgc5   65m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-5vhbm   68m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-b9bgd   65m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-gsxvz   67m   system:node:nuc4.redpill.nz                                                 Approved,Issued
csr-rv5vk   65m   system:node:nuc2.redpill.nz                                                 Approved,Issued
csr-w6mkx   65m   system:node:nuc3.redpill.nz                                                 Approved,Issued
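
All CSRs here are already approved, so they are unlikely to be the cause. For completeness, a sketch of how pending requests could be spotted on a UPI install is shown below; the function name `pending_csrs` is hypothetical, and it assumes CONDITION is the last column, as in the listing above.

```shell
# pending_csrs
# Read `oc get csr` output on stdin and print the NAME of every request
# whose CONDITION (last column) is exactly "Pending".
pending_csrs() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Usage:
#   oc get csr | pending_csrs
# Each name printed could then be approved with:
#   oc adm certificate approve <name>
```

Run against the listing in this comment it would print nothing.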


oc get nodes
NAME              STATUS   ROLES           AGE   VERSION
nuc2.redpill.nz   Ready    master,worker   65m   v1.17.1
nuc3.redpill.nz   Ready    master,worker   65m   v1.17.1
nuc4.redpill.nz   Ready    master,worker   67m   v1.17.1

oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.3     True        False         49m     Cluster version is 4.4.3

Comment 5 Steven Ellis 2020-05-06 07:04:09 UTC

oc get pods --all-namespaces  | grep etcd
openshift-etcd-operator                                 etcd-operator-59cf47554b-nr9qb                                    1/1     Running            1          72m
openshift-etcd                                          etcd-nuc3.redpill.nz                                              2/3     CrashLoopBackOff   17         10m
openshift-etcd                                          etcd-nuc4.redpill.nz                                              3/3     Running            2          65m
openshift-etcd                                          installer-2-nuc3.redpill.nz                                       0/1     Completed          0          64m
openshift-etcd                                          installer-2-nuc4.redpill.nz                                       0/1     Completed          0          65m
openshift-machine-config-operator                       etcd-quorum-guard-58d794d79f-4k2ss                                0/1     Running            0          60m
openshift-machine-config-operator                       etcd-quorum-guard-58d794d79f-cb9fw                                0/1     Running            0          60m
openshift-machine-config-operator                       etcd-quorum-guard-58d794d79f-qfmnx                                1/1     Running            0          60m

Comment 6 Sam Batschelet 2020-05-06 22:28:25 UTC

`oc adm must-gather` would be useful to debug this since we have apiserver up.

If that does not work can we get some details on the failed pod and operator logs.

###
$ oc describe pods -n openshift-etcd etcd-nuc3.redpill.nz

$ oc get pods -n openshift-etcd etcd-nuc3.redpill.nz -o json

$ oc logs -n openshift-etcd-operator etcd-operator-59cf47554b-nr9qb
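
The three commands above could be wrapped so that their output lands in one directory ready to attach. This is only a sketch: the function name `collect_etcd_debug`, the output file names, and the `OC` override variable are all hypothetical. Setting `OC=echo` gives a dry run that records the commands instead of contacting the cluster.

```shell
# collect_etcd_debug POD OPERATOR_POD OUTDIR
# Run the three requested commands and save each result under OUTDIR.
# OC (hypothetical override) selects the client binary; it defaults to oc.
# The variables below are global because POSIX sh has no `local`.
collect_etcd_debug() {
  debug_pod=$1
  debug_operator_pod=$2
  debug_outdir=$3
  oc_cmd=${OC:-oc}
  mkdir -p "$debug_outdir" || return 1
  $oc_cmd describe pods -n openshift-etcd "$debug_pod" \
    > "$debug_outdir/describe.txt" 2>&1
  $oc_cmd get pods -n openshift-etcd "$debug_pod" -o json \
    > "$debug_outdir/pod.json" 2>&1
  $oc_cmd logs -n openshift-etcd-operator "$debug_operator_pod" \
    > "$debug_outdir/operator.log" 2>&1
}

# Usage:
#   collect_etcd_debug etcd-nuc3.redpill.nz etcd-operator-59cf47554b-nr9qb ./etcd-debug
```

The function returns the exit status of the last command, so a failed `oc logs` call is visible to the caller.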

Comment 7 Sam Batschelet 2020-05-06 22:38:57 UTC

events would be useful as well here to triage assuming must-gather fails.

###
$ oc get events -A -o json &> events.json

Comment 8 W. Trevor King 2020-05-07 04:34:11 UTC

From ./resources/pods.json in the attached log-bundle, etcd-nuc3.redpill.nz's etcd container died with:

2020-05-06 06:32:02.474958 W | pkg/flags: unrecognized environment variable ETCD_SERVICE_PORT_ETCD=2379
2020-05-06 06:32:02.474961 W | pkg/flags: unrecognized environment variable ETCD_SERVICE_PORT_ETCD_METRICS=9979
2020-05-06 06:32:02.474966 W | pkg/flags: unrecognized environment variable ETCD_PORT_2379_TCP_ADDR=172.30.205.33
2020-05-06 06:32:02.474968 W | pkg/flags: unrecognized environment variable ETCD_SERVICE_PORT=2379
2020-05-06 06:32:02.474972 W | pkg/flags: unrecognized environment variable ETCD_PORT_9979_TCP_PROTO=tcp
2020-05-06 06:32:02.474974 W | pkg/flags: unrecognized environment variable ETCD_PORT_2379_TCP=tcp://172.30.205.33:2379
2020-05-06 06:32:02.474977 W | pkg/flags: unrecognized environment variable ETCD_PORT=tcp://172.30.205.33:2379
2020-05-06 06:32:02.474980 W | pkg/flags: unrecognized environment variable ETCD_SERVICE_HOST=172.30.205.33
2020-05-06 06:32:02.474995 I | etcdmain: etcd Version: 3.3.18
2020-05-06 06:32:02.475001 I | etcdmain: Git SHA: c0157a9
2020-05-06 06:32:02.475011 I | etcdmain: Go Version: go1.13.4
2020-05-06 06:32:02.475014 I | etcdmain: Go OS/Arch: linux/amd64
2020-05-06 06:32:02.475017 I | etcdmain: setting maximum number of CPUs to 8, total number of available CPUs is 8
2020-05-06 06:32:02.475078 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2020-05-06 06:32:02.475091 I | embed: peerTLS: cert = /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-nuc3.redpill.nz.crt, key = /etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-nuc3.redpill.nz.key, ca = , trusted-ca = /etc/kubernetes/static-pod-certs/configmaps/etcd-peer-client-ca/ca-bundle.crt, client-cert-auth = true, crl-file = 
2020-05-06 06:32:02.475526 I | embed: listening for peers on https://0.0.0.0:2380
2020-05-06 06:32:02.475597 I | embed: listening for client requests on 0.0.0.0:2379
2020-05-06 06:32:02.478570 C | etcdmain: couldn't find local name "nuc3.redpill.nz" in the initial cluster configuration
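
The fatal line is the one logged at critical level (`C |`), and it says the member's own name is missing from the initial cluster list. The sketch below shows one way to isolate such lines and to test a name against an etcd initial-cluster string of the form `name1=url1,name2=url2`. Both function names are hypothetical, and how to obtain the initial-cluster value the failing container was started with (for example from its `--initial-cluster` flag or `ETCD_INITIAL_CLUSTER` variable) is not shown here.

```shell
# etcd_critical_lines
# Read an etcd 3.3-style log on stdin and print only the lines logged at
# critical level, which carry the marker " C | ".
etcd_critical_lines() {
  grep -F ' C | '
}

# name_in_initial_cluster NAME CLUSTER
# Succeed if NAME is exactly one of the member names in CLUSTER, a
# comma-separated list of name=peer-url pairs.
name_in_initial_cluster() {
  printf '%s\n' "$2" | tr ',' '\n' | cut -d= -f1 | grep -Fxq -- "$1"
}

# Usage. The cluster string here is a made-up example, not taken from
# the log bundle:
#   name_in_initial_cluster nuc3.redpill.nz \
#     'member-a=https://192.0.2.1:2380,member-b=https://192.0.2.2:2380' \
#     || echo "local name missing from initial cluster"
```

Matching the whole name with `grep -Fx` avoids false positives when one member name is a prefix of another.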

Comment 9 W. Trevor King 2020-05-07 04:39:24 UTC

Searching for "in the initial cluster configuration" turns up bug 1814576, which looks very similar.  Closing this one as a dup, but feel free to reopen if I'm misunderstanding.

*** This bug has been marked as a duplicate of bug 1814576 ***

Comment 10 Steven Ellis 2020-05-07 11:13:30 UTC

I think these are different as I've just had the same issue deploying 4.3.18.

I can deploy 4.3.9 without any issues, but with 4.3.18 the install doesn't appear to use any SRV records. I'm running my DNS server in debug mode so I can see requests, and no SRV records are being requested. I'll upload the log bundle from the failed install.

I can't run must-gather on the 4.3.18 install as I can't interact with the master etcd instance.

Comment 11 Steven Ellis 2020-05-07 11:14:43 UTC

Created attachment 1686132 [details]
installer log bundle from ocp 4.3.18

Comment 12 Steven Ellis 2020-05-07 23:30:08 UTC

I've had a different issue with UPI and OCP 4.3.15 documented under
 - https://bugzilla.redhat.com/show_bug.cgi?id=1833160

Comment 14 Steven Ellis 2020-05-11 03:22:16 UTC

I've now managed to get OCP 4.3.19 to install on bare metal with all 3 nodes, and I suspect my OCP 4.3 issues are different from the 4.4 one.

Moving back to 4.4 testing.

Comment 15 Steven Ellis 2020-05-11 03:25:36 UTC

Created attachment 1687128 [details]
New bootstrap log bundle from today's testing

Bootstrap failed again with 4.4.3

The cluster came up and is consistent, but bootstrap failed.

As bootstrap hasn't finished, I don't have a consistent etcd.

oc get pods -n openshift-etcd
NAME                          READY   STATUS             RESTARTS   AGE
etcd-nuc2.redpill.nz          3/3     Running            0          65m
etcd-nuc3.redpill.nz          2/3     CrashLoopBackOff   17         64m
etcd-nuc4.redpill.nz          3/3     Running            4          68m
installer-2-nuc2.redpill.nz   0/1     Completed          0          65m
installer-2-nuc3.redpill.nz   0/1     Completed          0          64m
installer-2-nuc4.redpill.nz   0/1     Completed          0          68m

Comment 17 Steven Ellis 2020-05-11 05:36:01 UTC

Looks like this is a duplicate based on the latest build

*** This bug has been marked as a duplicate of bug 1814576 ***
