Bug 1843565

Summary: ETCD fails to approve CSRs after hostname on a master is changed
Product: OpenShift Container Platform Reporter: Peter Kirkpatrick <pkirkpat>
Component: Etcd Assignee: Sam Batschelet <sbatsche>
Status: CLOSED NOTABUG QA Contact: ge liu <geliu>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 4.3.z CC: psundara, zyu
Target Milestone: ---   
Target Release: ---   
Hardware: s390x   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-06-03 20:04:49 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Peter Kirkpatrick 2020-06-03 14:45:56 UTC
Description of problem:
When attempting to install OCP 4.3.18 on bare metal s390x, some of the nodes come up with the hostname 'localhost'. I know there are already quite a few BZs for that, but...

The issue here is that the first 'localhost' issue I see ends up being on one of the master nodes.  Somehow the install finishes, and the cluster appears fine (minus the hostname).

At that point, I rebuilt the master node in question in an attempt to get it functional again, but I believe there is an issue with etcd that keeps this from being a usable workaround.

Version-Release number of selected component (if applicable):
4.3.18 - (had the same issue in 4.3.21 after an upgrade as well)

How reproducible:


Steps to Reproduce:
1. install OCP 4.3.18 on bare metal s390x nodes (2 workers and 3 masters to begin)
2. inevitably 1 master comes up as localhost
3. oc delete node localhost
4. rebuild master
5. write the correct hostname to /etc/hostname
6. reboot master
7. CSRs then get regenerated
8. chaos ensues

cluster operator status:

[root@bastion ~]# oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.18    True        False         True       12h
cloud-credential                           4.3.18    True        False         False      12h
cluster-autoscaler                         4.3.18    True        False         False      12h
console                                    4.3.18    False       True          False      10h
dns                                        4.3.18    True        False         False      12h
image-registry                             4.3.18    True        False         False      11h
ingress                                    4.3.18    True        False         False      92m
insights                                   4.3.18    True        False         False      12h
kube-apiserver                             4.3.18    True        False         False      12h
kube-controller-manager                    4.3.18    True        False         False      12h
kube-scheduler                             4.3.18    True        False         False      12h
machine-api                                4.3.18    True        False         False      12h
machine-config                             4.3.18    True        False         False      12h
marketplace                                4.3.18    True        False         False      10h
monitoring                                 4.3.18    False       True          True       71m
network                                    4.3.18    True        True          True       12h
node-tuning                                4.3.18    True        False         False      10h
openshift-apiserver                        4.3.18    True        False         False      11h
openshift-controller-manager               4.3.18    True        False         False      12h
openshift-samples                          4.3.18    True        False         False      12h
operator-lifecycle-manager                 4.3.18    True        False         False      12h
operator-lifecycle-manager-catalog         4.3.18    True        False         False      12h
operator-lifecycle-manager-packageserver   4.3.18    True        False         False      11h
service-ca                                 4.3.18    True        False         False      12h
service-catalog-apiserver                  4.3.18    True        False         False      12h
service-catalog-controller-manager         4.3.18    True        False         False      12h
storage                                    4.3.18    True        False         False      12h


Status of etcd pods: 

openshift-etcd                                          etcd-member-mstr-11.qe1-s390x.prod.psi.rdu2.redhat.com                2/2     Running             0          12h
openshift-etcd                                          etcd-member-mstr-12.qe1-s390x.prod.psi.rdu2.redhat.com                0/2     Init:1/2            4          11h
openshift-etcd                                          etcd-member-mstr-13.qe1-s390x.prod.psi.rdu2.redhat.com                2/2     Running           

Describing the etcd pod on mstr-12 shows:

      Message:   ing before flag.Parse: E0603 05:42:26.134794       9 agent.go:116] error sending CSR to signer: certificatesigningrequests.certificates.k8s.io "system:etcd-server:etcd-1.qe1-s390x.psi.redhat.com" already exists
ERROR: logging before flag.Parse: E0603 05:42:36.134673       9 agent.go:116] error sending CSR to signer: certificatesigningrequests.certificates.k8s.io "system:etcd-server:etcd-1.qe1-s390x.psi.redhat.com" already exists
ERROR: logging before flag.Parse: E0603 05:42:46.137318       9 agent.go:150] status on CSR not set. Retrying.
ERROR: logging before flag.Parse: E0603 05:42:49.140241       9 agent.go:150] status on CSR not set. Retrying.
ERROR: logging before flag.Parse: E0603 05:42:52.140320       9 agent.go:150] status on CSR not set. Retrying.
ERROR: logging before flag.Parse: E0603 05:42:55.140113       9 agent.go:150] status on CSR not set. Retrying.
ERROR: logging before flag.Parse: E0603 05:42:56.142123       9 agent.go:150] status on CSR not set. Retrying.
Error: error requesting certificate: error obtaining signed certificate from signer: timed out waiting for the condition



Additional info:

Comment 1 Sam Batschelet 2020-06-03 17:58:20 UTC
So help me understand what happens here.

Here is your failure [1]. ETCD_DNS_NAME is populated by doing an SRV query against the cluster domain [2]. The certs on disk are assumed to follow a naming scheme such as system:etcd-peer:${ETCD_DNS_NAME}.key. Why does that check fail?

>       Message:   ing before flag.Parse: E0603 05:42:26.134794       9 agent.go:116] error sending CSR to signer: certificatesigningrequests.certificates.k8s.io "system:etcd-server:etcd-1.qe1-s390x.psi.redhat.com" already exists

This is expected, although not optimal. etcd certs in 4.3 are minted during bootstrap. New nodes, or changes to a node such as its IP, can invalidate assumptions baked into the TLS SANs. If a CSR request is made, it is because no certs exist on disk, which would happen if the node is new. In 4.3, a new node requires the disaster recovery process for replacing a failed master node.
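
To make the check above concrete, here is a rough Python sketch of the two steps as I understand them: derive the etcd DNS name from the SRV targets, then look for certs on disk under that name. This is not the actual code from [1] or [2]. The function names, the injected `resolve` callable, the rule of picking the SRV target that resolves to one of the node's own IPs, and the `.crt` counterpart of the `.key` file are all assumptions made for illustration.

```python
import os
import tempfile


def select_etcd_dns_name(srv_targets, local_ips, resolve):
    """Return the first SRV target that resolves to one of this node's IPs.

    Assumption: the name is chosen by matching SRV targets against local IPs.
    `resolve` is a hypothetical callable mapping a hostname to a list of IPs.
    """
    for target in srv_targets:
        name = target.rstrip(".")  # SRV targets usually carry a trailing dot
        if set(resolve(name)) & set(local_ips):
            return name
    return None


def missing_cert_files(cert_dir, dns_name, roles=("peer", "server")):
    """List the expected cert/key files that are absent from cert_dir.

    File names follow the system:etcd-<role>:<dns name>.<ext> pattern quoted
    in the comment above; the .crt extension is an assumption.
    """
    missing = []
    for role in roles:
        for ext in ("crt", "key"):
            filename = f"system:etcd-{role}:{dns_name}.{ext}"
            if not os.path.exists(os.path.join(cert_dir, filename)):
                missing.append(filename)
    return missing


if __name__ == "__main__":
    # Made-up records using documentation-only names and addresses.
    fake_dns = {
        "etcd-0.example.test": ["192.0.2.10"],
        "etcd-1.example.test": ["192.0.2.11"],
    }
    name = select_etcd_dns_name(
        ["etcd-0.example.test.", "etcd-1.example.test."],
        ["192.0.2.11"],
        lambda host: fake_dns.get(host, []),
    )
    with tempfile.TemporaryDirectory() as cert_dir:
        # An empty directory stands in for a freshly rebuilt master.
        print(name, missing_cert_files(cert_dir, name))
```

Under these assumptions, a non-empty result from `missing_cert_files` corresponds to the case where the pod falls through to requesting a new CSR, and a failed or changed name lookup would produce the same outcome even if certs for the old name are still on disk.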

[1] https://github.com/openshift/machine-config-operator/blob/release-4.3/templates/master/00-master/_base/files/etc-kubernetes-manifests-etcd-member.yaml#L39
[2] https://github.com/openshift/machine-config-operator/blob/release-4.3/cmd/setup-etcd-environment/run.go#L63

Comment 2 Prashanth Sundararaman 2020-06-03 20:04:49 UTC
Turns out there was an issue with the network; once it was fixed, everything came up fine.