Bug 1843565 - ETCD fails to approve CSRs after hostname on a master is changed
Summary: ETCD fails to approve CSRs after hostname on a master is changed
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.3.z
Hardware: s390x
OS: Unspecified
Target Milestone: ---
Target Release: ---
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-06-03 14:45 UTC by Peter Kirkpatrick
Modified: 2023-10-06 20:23 UTC
CC: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-06-03 20:04:49 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1809345 0 high CLOSED OpenShift Cluster fails to initialize on 4.3.z install due to a node with a hostname of localhost 2023-10-06 19:18:57 UTC
Red Hat Bugzilla 1823883 0 high CLOSED [release-4.3] OpenShift Cluster fails to initialize on 4.3.z install due to a node with a hostname of localhost 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1837124 0 high CLOSED OpenShift Cluster fails to initialize on 4.3.z install due to a node with a hostname of localhost 2021-02-22 00:41:40 UTC

Description Peter Kirkpatrick 2020-06-03 14:45:56 UTC
Description of problem:
When attempting to install OCP 4.3.18 on bare metal s390x, some of the nodes come up with hostname 'localhost'. I know there are quite a few BZs for that already.

The issue here is that the first 'localhost' problem I see ends up being on one of the master nodes. Somehow the install finishes, and the cluster appears fine (aside from the hostname).

At that point, I rebuilt the master node in question in an attempt to get it functional again, but I believe there is an issue with etcd that keeps this from being a usable workaround.

Version-Release number of selected component (if applicable):
4.3.18 - (had the same issue in 4.3.21 after an upgrade as well)

How reproducible:


Steps to Reproduce:
1. install OCP 4.3.18 on bare metal s390x nodes (2 workers and 3 masters to begin)
2. inevitably 1 master comes up as localhost
3. oc delete node localhost
4. rebuild master
5. `hostname` > /etc/hostname
6. reboot master
7. CSRs then get regenerated
8. chaos ensues
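
For reference, steps 3-7 correspond roughly to the following commands. This is a sketch of the reporter's procedure, not a verified recovery path; the CSR name is a placeholder:

# step 3: remove the node that registered as 'localhost' (from the bastion)
oc delete node localhost

# steps 4-6: on the rebuilt master, persist the (now correct) transient
# hostname so it survives reboots, then reboot
hostname > /etc/hostname
reboot

# step 7: once the kubelet rejoins, pending CSRs reappear; list and approve them
oc get csr
oc adm certificate approve <csr-name>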

cluster operator status:

[root@bastion ~]# oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.18    True        False         True       12h
cloud-credential                           4.3.18    True        False         False      12h
cluster-autoscaler                         4.3.18    True        False         False      12h
console                                    4.3.18    False       True          False      10h
dns                                        4.3.18    True        False         False      12h
image-registry                             4.3.18    True        False         False      11h
ingress                                    4.3.18    True        False         False      92m
insights                                   4.3.18    True        False         False      12h
kube-apiserver                             4.3.18    True        False         False      12h
kube-controller-manager                    4.3.18    True        False         False      12h
kube-scheduler                             4.3.18    True        False         False      12h
machine-api                                4.3.18    True        False         False      12h
machine-config                             4.3.18    True        False         False      12h
marketplace                                4.3.18    True        False         False      10h
monitoring                                 4.3.18    False       True          True       71m
network                                    4.3.18    True        True          True       12h
node-tuning                                4.3.18    True        False         False      10h
openshift-apiserver                        4.3.18    True        False         False      11h
openshift-controller-manager               4.3.18    True        False         False      12h
openshift-samples                          4.3.18    True        False         False      12h
operator-lifecycle-manager                 4.3.18    True        False         False      12h
operator-lifecycle-manager-catalog         4.3.18    True        False         False      12h
operator-lifecycle-manager-packageserver   4.3.18    True        False         False      11h
service-ca                                 4.3.18    True        False         False      12h
service-catalog-apiserver                  4.3.18    True        False         False      12h
service-catalog-controller-manager         4.3.18    True        False         False      12h
storage                                    4.3.18    True        False         False      12h


Status of etcd pods: 

openshift-etcd                                          etcd-member-mstr-11.qe1-s390x.prod.psi.rdu2.redhat.com                2/2     Running             0          12h
openshift-etcd                                          etcd-member-mstr-12.qe1-s390x.prod.psi.rdu2.redhat.com                0/2     Init:1/2            4          11h
openshift-etcd                                          etcd-member-mstr-13.qe1-s390x.prod.psi.rdu2.redhat.com                2/2     Running           

On doing a describe of the etcd pod on mstr-12:

      Message:   ing before flag.Parse: E0603 05:42:26.134794       9 agent.go:116] error sending CSR to signer: certificatesigningrequests.certificates.k8s.io "system:etcd-server:etcd-1.qe1-s390x.psi.redhat.com" already exists
ERROR: logging before flag.Parse: E0603 05:42:36.134673       9 agent.go:116] error sending CSR to signer: certificatesigningrequests.certificates.k8s.io "system:etcd-server:etcd-1.qe1-s390x.psi.redhat.com" already exists
ERROR: logging before flag.Parse: E0603 05:42:46.137318       9 agent.go:150] status on CSR not set. Retrying.
ERROR: logging before flag.Parse: E0603 05:42:49.140241       9 agent.go:150] status on CSR not set. Retrying.
ERROR: logging before flag.Parse: E0603 05:42:52.140320       9 agent.go:150] status on CSR not set. Retrying.
ERROR: logging before flag.Parse: E0603 05:42:55.140113       9 agent.go:150] status on CSR not set. Retrying.
ERROR: logging before flag.Parse: E0603 05:42:56.142123       9 agent.go:150] status on CSR not set. Retrying.
Error: error requesting certificate: error obtaining signed certificate from signer: timed out waiting for the condition
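
The "already exists" errors indicate the cert agent is submitting a CSR whose name is already taken, presumably by a stale object left over from the earlier 'localhost' registration. Assuming these show up as ordinary certificatesigningrequests API objects (an assumption; the name below is copied from the log above), one way to inspect and clear the collision would be:

# list etcd-related CSRs and inspect the stale one
oc get csr | grep etcd
oc describe csr system:etcd-server:etcd-1.qe1-s390x.psi.redhat.com

# deleting the stale object should let the agent resubmit; whether that is a
# supported recovery path in 4.3 is a separate question
oc delete csr system:etcd-server:etcd-1.qe1-s390x.psi.redhat.com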



Additional info:

Comment 1 Sam Batschelet 2020-06-03 17:58:20 UTC
So help me understand what happens here.

Here is your failure[1]. ETCD_DNS_NAME is populated by doing an SRV query against the cluster domain[2]. The certs on disk are assumed to follow that naming, for example system:etcd-peer:${ETCD_DNS_NAME}.key, so why does that check fail?

>       Message:   ing before flag.Parse: E0603 05:42:26.134794       9 agent.go:116] error sending CSR to signer: certificatesigningrequests.certificates.k8s.io "system:etcd-server:etcd-1.qe1-s390x.psi.redhat.com" already exists

This is expected, although not optimal. etcd certs in 4.3 are minted during bootstrap. New nodes, or changes to a node such as a new IP, can invalidate assumptions baked into the TLS SANs. If you make a CSR request it is because no certs exist, which would happen if the node is new. In 4.3, a new node would require the disaster recovery process to replace a failed master node.

[1] https://github.com/openshift/machine-config-operator/blob/release-4.3/templates/master/00-master/_base/files/etc-kubernetes-manifests-etcd-member.yaml#L39
[2] https://github.com/openshift/machine-config-operator/blob/release-4.3/cmd/setup-etcd-environment/run.go#L63
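
As a sanity check of the SRV-based resolution described in comment 1, the discovery records can be queried directly. A sketch, assuming the standard OpenShift etcd discovery record name _etcd-server-ssl._tcp under the cluster domain (substitute the real <cluster-name>.<base-domain>; the etcd-1 target is taken from the log above):

# the SRV targets are what setup-etcd-environment matches against the node's
# addresses to derive ETCD_DNS_NAME; each target should resolve to a master's IP
dig +short SRV _etcd-server-ssl._tcp.<cluster-name>.<base-domain>

# verify the target for this master resolves to the node's actual IP
dig +short etcd-1.<cluster-name>.<base-domain>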

Comment 2 Prashanth Sundararaman 2020-06-03 20:04:49 UTC
Turns out there was an issue with the network; once that was fixed, everything came up fine.

