Description of problem:

On a healthy IPI cluster on OSP, delete a worker instance from the OSP console. A replacement instance is created with the same name, but it never becomes Ready as a node. The machine-approver pod logs show errors when it tries to sign the certificate for the node:

# oc logs -f machine-approver-5646d57764-kzx6l -n openshift-cluster-machine-approver
[...]
I0828 09:41:07.768120 1 main.go:107] CSR csr-tf478 added
I0828 09:41:07.790716 1 main.go:132] CSR csr-tf478 not authorized: node morenod-ocp-wvn8n-worker-9sfbs already exists
I0828 09:41:07.790775 1 main.go:164] Error syncing csr csr-tf478: node morenod-ocp-wvn8n-worker-9sfbs already exists
I0828 09:41:07.796244 1 main.go:107] CSR csr-tf478 added
I0828 09:41:07.814771 1 main.go:132] CSR csr-tf478 not authorized: node morenod-ocp-wvn8n-worker-9sfbs already exists
I0828 09:41:07.814919 1 main.go:164] Error syncing csr csr-tf478: node morenod-ocp-wvn8n-worker-9sfbs already exists
I0828 09:41:07.825210 1 main.go:107] CSR csr-tf478 added
I0828 09:41:07.841866 1 main.go:132] CSR csr-tf478 not authorized: node morenod-ocp-wvn8n-worker-9sfbs already exists
I0828 09:41:07.842074 1 main.go:164] Error syncing csr csr-tf478: node morenod-ocp-wvn8n-worker-9sfbs already exists
I0828 09:41:07.862340 1 main.go:107] CSR csr-tf478 added
I0828 09:41:07.882301 1 main.go:132] CSR csr-tf478 not authorized: node morenod-ocp-wvn8n-worker-9sfbs already exists
I0828 09:41:07.882462 1 main.go:164] Error syncing csr csr-tf478: node morenod-ocp-wvn8n-worker-9sfbs already exists
I0828 09:41:07.922685 1 main.go:107] CSR csr-tf478 added
I0828 09:41:07.936928 1 main.go:132] CSR csr-tf478 not authorized: node morenod-ocp-wvn8n-worker-9sfbs already exists
I0828 09:41:07.937070 1 main.go:164] Error syncing csr csr-tf478: node morenod-ocp-wvn8n-worker-9sfbs already exists
I0828 09:41:08.017303 1 main.go:107] CSR csr-tf478 added
I0828 09:41:08.031257 1 main.go:132] CSR csr-tf478 not authorized: node morenod-ocp-wvn8n-worker-9sfbs already exists
E0828 09:41:08.031319 1 main.go:174] node morenod-ocp-wvn8n-worker-9sfbs already exists
I0828 09:41:08.031334 1 main.go:175] Dropping CSR "csr-tf478" out of the queue: node morenod-ocp-wvn8n-worker-9sfbs already exists

Also, OCP keeps trying to terminate pods that are not present on the node (they belong to the deleted instance). The ages of the machines and nodes are not updated and still reflect the deleted instance:

# oc get nodes
NAME                             STATUS     ROLES    AGE   VERSION
morenod-ocp-wvn8n-master-0       Ready      master   37m   v1.14.0+09eb70949
morenod-ocp-wvn8n-master-1       Ready      master   37m   v1.14.0+09eb70949
morenod-ocp-wvn8n-master-2       Ready      master   37m   v1.14.0+09eb70949
morenod-ocp-wvn8n-worker-9sfbs   NotReady   worker   30m   v1.14.0+09eb70949
morenod-ocp-wvn8n-worker-qx7xm   Ready      worker   31m   v1.14.0+09eb70949
morenod-ocp-wvn8n-worker-zvz88   Ready      worker   31m   v1.14.0+09eb70949

# oc get machines -A
NAMESPACE               NAME                             STATE    TYPE           REGION      ZONE   AGE
openshift-machine-api   morenod-ocp-wvn8n-master-0       ACTIVE   ci.m1.xlarge   regionOne   nova   37m
openshift-machine-api   morenod-ocp-wvn8n-master-1       ACTIVE   ci.m1.xlarge   regionOne   nova   37m
openshift-machine-api   morenod-ocp-wvn8n-master-2       ACTIVE   ci.m1.xlarge   regionOne   nova   37m
openshift-machine-api   morenod-ocp-wvn8n-worker-9sfbs   ACTIVE   ci.m1.xlarge   regionOne   nova   35m
openshift-machine-api   morenod-ocp-wvn8n-worker-qx7xm   ACTIVE   ci.m1.xlarge   regionOne   nova   35m
openshift-machine-api   morenod-ocp-wvn8n-worker-zvz88   ACTIVE   ci.m1.xlarge   regionOne   nova   35m

The workers all show the same age, but from the OSP console:

morenod-ocp-wvn8n-worker-9sfbs   rhcos-42.80.20190828.0   192.168.0.34   ci.m1.xlarge   -   Active   nova   None   Running   1 minute
morenod-ocp-wvn8n-worker-qx7xm   rhcos-42.80.20190828.0   192.168.0.30   ci.m1.xlarge   -   Active   nova   None   Running   27 minutes
morenod-ocp-wvn8n-worker-zvz88   rhcos-42.80.20190828.0   192.168.0.27   ci.m1.xlarge   -   Active   nova   None   Running   28 minutes

# oc get csr
NAME        AGE     REQUESTOR                                                                   CONDITION
csr-6pn4j   39m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-8plwp   39m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-9f2pw   39m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-dcgdm   32m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-dvt5q   33m     system:node:morenod-ocp-wvn8n-worker-zvz88                                  Approved,Issued
csr-gbwps   39m     system:node:morenod-ocp-wvn8n-master-0                                      Approved,Issued
csr-khv9l   33m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-q9kdb   39m     system:node:morenod-ocp-wvn8n-master-2                                      Approved,Issued
csr-t2rvq   32m     system:node:morenod-ocp-wvn8n-worker-9sfbs                                  Approved,Issued
csr-tf478   6m56s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-wqpfb   32m     system:node:morenod-ocp-wvn8n-worker-qx7xm                                  Approved,Issued
csr-wrpnb   33m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-x2z9v   39m     system:node:morenod-ocp-wvn8n-master-1                                      Approved,Issued

# oc adm certificate approve csr-tf478
certificatesigningrequest.certificates.k8s.io/csr-tf478 approved

After manually approving the pending certificate, the node becomes Ready:

# oc get nodes
NAME                             STATUS   ROLES    AGE   VERSION
morenod-ocp-wvn8n-master-0       Ready    master   40m   v1.14.0+09eb70949
morenod-ocp-wvn8n-master-1       Ready    master   41m   v1.14.0+09eb70949
morenod-ocp-wvn8n-master-2       Ready    master   40m   v1.14.0+09eb70949
morenod-ocp-wvn8n-worker-9sfbs   Ready    worker   34m   v1.14.0+09eb70949
morenod-ocp-wvn8n-worker-qx7xm   Ready    worker   34m   v1.14.0+09eb70949
morenod-ocp-wvn8n-worker-zvz88   Ready    worker   35m   v1.14.0+09eb70949

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-08-28-083236

How reproducible:

Steps to Reproduce:
1. Install a fresh OCP cluster using IPI on OSP.
2. From the OSP console, delete a worker instance.
3. Observe that the instance is created again but never becomes Ready.
Actual results:
The node cannot be used because it never reaches Ready status.

Expected results:
The node becomes Ready.

Additional info:
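As a workaround, the pending bootstrap CSR can be approved manually, as shown in the description. A small sketch of that step; the `pending_csrs` helper name is ours, not part of any OpenShift tooling:

```shell
# pending_csrs: read `oc get csr` output on stdin and print the names of
# CSRs whose CONDITION column is "Pending".
pending_csrs() {
  awk '$NF == "Pending" {print $1}'
}

# Against a live cluster (requires cluster-admin):
#   oc get csr | pending_csrs | xargs -r oc adm certificate approve
```

This only filters on the last column, so the header line and Approved,Issued rows are skipped.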
I don't think this is a guarantee we make at the moment -- that you can delete an instance backing a machine directly at the infrastructure provider, outside Kubernetes, and get a new functioning replacement without manual intervention. The problem here is that the hostname for the new instance is the same as it was previously, and a node by that name already exists. But, even if that weren't the case, the Machine object will already have a nodeRef set, so the cluster-machine-approver will still refuse to approve the CSR. You would have to manually approve the CSR in this case.
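The nodeRef claim above can be checked directly on the Machine object. A crude sketch (a grep-based check; `jq` on `.status.nodeRef` would be more robust, and the helper name is ours):

```shell
# has_noderef: read a Machine object as JSON on stdin and succeed if a
# nodeRef is present in it.
has_noderef() {
  grep -q '"nodeRef"'
}

# Against a live cluster (machine name taken from this report):
#   oc get machine morenod-ocp-wvn8n-worker-9sfbs -n openshift-machine-api -o json \
#     | has_noderef && echo "nodeRef already set"
```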
So far we cannot fix this in 4.2, because the necessary logic will only be implemented by the machine-api-operator in 4.3, when the health-checker component is activated. See: https://coreos.slack.com/archives/CBZHF4DHC/p1567432430029700?thread_ts=1567430585.028900&cid=CBZHF4DHC

The workaround is to remove the failed machine manually; it is explained here: https://github.com/openshift/installer/pull/2305

I've proposed a fix that simply returns an error if we try to recreate an instance with the old name: https://github.com/openshift/cluster-api-provider-openstack/pull/62. It doesn't really fix the problem, since manual intervention is still required, but I hope it solves https://bugzilla.redhat.com/show_bug.cgi?id=1748263, because after a manual master deletion a new one comes up automatically, and etcd works fine after that.

So my suggestion is either to move the target to 4.3 or close this bug as invalid.
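A sketch of the manual workaround mentioned above: deleting the Machine object lets the machine-api controller provision a replacement with a fresh name. The machine name here is the one from this report; substitute your own, and note that the script only prints the command rather than running it:

```shell
# Build the delete command for the failed Machine object.
MACHINE="morenod-ocp-wvn8n-worker-9sfbs"
NAMESPACE="openshift-machine-api"
delete_cmd="oc delete machine ${MACHINE} -n ${NAMESPACE}"
echo "${delete_cmd}"

# Run it for real against a cluster:
#   ${delete_cmd}
```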
Moved to 4.3, because we have to fix https://bugzilla.redhat.com/show_bug.cgi?id=1748263 first, and we can't do that without the cluster-etcd-operator component, which will only appear in 4.3.
Deferred to 4.4, since this BZ depends on https://bugzilla.redhat.com/show_bug.cgi?id=1748263, which was deferred to 4.4.
You can close this BZ. The behaviour has changed in >= 4.3: instances deleted from OSP are only recreated if a MachineHealthCheck object is present on the cluster. In that case the instances are created and join the cluster correctly, so this bugzilla is no longer relevant.
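For reference, a sketch of the kind of MachineHealthCheck object described above, which enables automatic remediation of unhealthy workers in 4.3+. The selector label, timeouts, and maxUnhealthy value are assumptions; match them to your own MachineSet labels and tolerance:

```shell
# Write a sample MachineHealthCheck manifest to mhc.yaml.
cat > mhc.yaml <<'EOF'
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-healthcheck
  namespace: openshift-machine-api
spec:
  maxUnhealthy: 40%
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:
  - type: Ready
    status: "False"
    timeout: 300s
  - type: Ready
    status: Unknown
    timeout: 300s
EOF

# Apply against a live cluster:
#   oc apply -f mhc.yaml
```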