Description of problem:

When attempting to install OCP 4.3.18 on bare metal s390x, some of the nodes come up with the hostname 'localhost'. I know there are quite a few BZs for that already, but the issue here is that the first 'localhost' node I see ends up being one of the master nodes. Somehow the install finishes and the cluster appears fine (aside from the hostname). At that point I rebuilt the master node in question in an attempt to get it functional again, but I believe there is an issue with etcd that keeps this from being a usable workaround.

Version-Release number of selected component (if applicable):
4.3.18 (had the same issue in 4.3.21 after an upgrade as well)

How reproducible:

Steps to Reproduce:
1. Install OCP 4.3.18 on bare metal s390x nodes (2 workers and 3 masters to begin)
2. Inevitably 1 master comes up as localhost
3. oc delete node localhost
4. Rebuild the master
5. `hostname` > /etc/hostname
6. Reboot the master
7. CSRs then get regenerated
8. Chaos ensues
(A sketch of the hostname fix attempted in steps 4-7 is under Additional info below.)

Cluster operator status:

[root@bastion ~]# oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.18    True        False         True       12h
cloud-credential                           4.3.18    True        False         False      12h
cluster-autoscaler                         4.3.18    True        False         False      12h
console                                    4.3.18    False       True          False      10h
dns                                        4.3.18    True        False         False      12h
image-registry                             4.3.18    True        False         False      11h
ingress                                    4.3.18    True        False         False      92m
insights                                   4.3.18    True        False         False      12h
kube-apiserver                             4.3.18    True        False         False      12h
kube-controller-manager                    4.3.18    True        False         False      12h
kube-scheduler                             4.3.18    True        False         False      12h
machine-api                                4.3.18    True        False         False      12h
machine-config                             4.3.18    True        False         False      12h
marketplace                                4.3.18    True        False         False      10h
monitoring                                 4.3.18    False       True          True       71m
network                                    4.3.18    True        True          True       12h
node-tuning                                4.3.18    True        False         False      10h
openshift-apiserver                        4.3.18    True        False         False      11h
openshift-controller-manager               4.3.18    True        False         False      12h
openshift-samples                          4.3.18    True        False         False      12h
operator-lifecycle-manager                 4.3.18    True        False         False      12h
operator-lifecycle-manager-catalog         4.3.18    True        False         False      12h
operator-lifecycle-manager-packageserver   4.3.18    True        False         False      11h
service-ca                                 4.3.18    True        False         False      12h
service-catalog-apiserver                  4.3.18    True        False         False      12h
service-catalog-controller-manager         4.3.18    True        False         False      12h
storage                                    4.3.18    True        False         False      12h

Status of etcd pods:

openshift-etcd   etcd-member-mstr-11.qe1-s390x.prod.psi.rdu2.redhat.com   2/2   Running    0   12h
openshift-etcd   etcd-member-mstr-12.qe1-s390x.prod.psi.rdu2.redhat.com   0/2   Init:1/2   4   11h
openshift-etcd   etcd-member-mstr-13.qe1-s390x.prod.psi.rdu2.redhat.com   2/2   Running

On doing a describe of the etcd pod on mstr-12:

  Message:  ing before flag.Parse: E0603 05:42:26.134794 9 agent.go:116] error sending CSR to signer: certificatesigningrequests.certificates.k8s.io "system:etcd-server:etcd-1.qe1-s390x.psi.redhat.com" already exists
ERROR: logging before flag.Parse: E0603 05:42:36.134673 9 agent.go:116] error sending CSR to signer: certificatesigningrequests.certificates.k8s.io "system:etcd-server:etcd-1.qe1-s390x.psi.redhat.com" already exists
ERROR: logging before flag.Parse: E0603 05:42:46.137318 9 agent.go:150] status on CSR not set. Retrying.
ERROR: logging before flag.Parse: E0603 05:42:49.140241 9 agent.go:150] status on CSR not set. Retrying.
ERROR: logging before flag.Parse: E0603 05:42:52.140320 9 agent.go:150] status on CSR not set. Retrying.
ERROR: logging before flag.Parse: E0603 05:42:55.140113 9 agent.go:150] status on CSR not set. Retrying.
ERROR: logging before flag.Parse: E0603 05:42:56.142123 9 agent.go:150] status on CSR not set. Retrying.
Error: error requesting certificate: error obtaining signed certificate from signer: timed out waiting for the condition

Additional info:
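For clarity, a minimal sketch of the hostname fix from steps 4-7 above. The FQDN is only an example taken from the etcd pod list (mstr-12); the CSR approval step is standard practice and only needed if the regenerated CSRs sit in Pending:

# on the rebuilt master, persist the expected hostname instead of 'localhost', then reboot
[root@localhost ~]# echo "mstr-12.qe1-s390x.prod.psi.rdu2.redhat.com" > /etc/hostname
[root@localhost ~]# reboot

# back on the bastion, check for and approve the CSRs the node regenerates when it rejoins
[root@bastion ~]# oc get csr
[root@bastion ~]# oc adm certificate approve <csr-name>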
So help me understand what happens here. Here is your failure [1]. ETCD_DNS_NAME is populated by doing an SRV query against the cluster domain [2]. The certs on disk are assumed to have that naming, for example system:etcd-peer:${ETCD_DNS_NAME}.key. Why does that check fail?

> Message: ing before flag.Parse: E0603 05:42:26.134794 9 agent.go:116] error sending CSR to signer: certificatesigningrequests.certificates.k8s.io "system:etcd-server:etcd-1.qe1-s390x.psi.redhat.com" already exists

This is expected, although not optimal. etcd certs in 4.3 are minted during bootstrap. New nodes, or changes to a node such as its IP, can invalidate the assumptions baked into the TLS SANs. A CSR request is only made when no certs exist on disk, which is what happens when the node is new. In 4.3 a new node requires the disaster recovery process to replace a failed master node.

[1] https://github.com/openshift/machine-config-operator/blob/release-4.3/templates/master/00-master/_base/files/etc-kubernetes-manifests-etcd-member.yaml#L39
[2] https://github.com/openshift/machine-config-operator/blob/release-4.3/cmd/setup-etcd-environment/run.go#L63
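To illustrate the SRV lookup in [2], roughly what can be checked by hand. The record name _etcd-server-ssl._tcp.<cluster domain> comes from the standard 4.x DNS requirements, the cluster domain qe1-s390x.psi.redhat.com is inferred from the CSR name above, and /etc/ssl/etcd is assumed to be where the bootstrap-minted certs land on the master; treat all three as illustrative rather than exact:

# SRV records that setup-etcd-environment resolves to derive ETCD_DNS_NAME
[root@bastion ~]# dig +short srv _etcd-server-ssl._tcp.qe1-s390x.psi.redhat.com

# compare against the cert names actually present on the rebuilt master
[root@mstr-12 ~]# ls /etc/ssl/etcd/ | grep etcd-1

If the names returned by the SRV query do not match the certs on disk (or the certs are simply absent on a freshly rebuilt node), the member falls back to requesting a CSR, which is the loop seen in the pod's describe output.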
It turns out there was an issue with the network; once that was fixed, everything came up fine.