Bug 1746369 - [IPI] [OSP] Deleted openstack instance is recreated but never gets Ready status as OCP node [NEEDINFO]
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.4.0
Assignee: Mike Fedosin
QA Contact: David Sanz
URL:
Whiteboard: osp
Depends On:
Blocks: 1748263
Reported: 2019-08-28 09:52 UTC by David Sanz
Modified: 2020-02-20 15:34 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-02-20 15:34:55 UTC
Target Upstream Version:
dsanzmor: needinfo? (mfedosin)




Links
System ID Priority Status Summary Last Updated
Github openshift cluster-api-provider-openstack pull 62 'None' closed Bug 1746369: return an error if the instance has been destroyed 2020-10-26 16:19:29 UTC

Description David Sanz 2019-08-28 09:52:02 UTC
Description of problem:
On a healthy IPI cluster on OSP, from the OSP console, delete a worker instance.

The instance is created again with the same name, but it never becomes Ready as a node.

Logs from the machine-approver pod show errors when it tries to sign the certificate for the node:

# oc logs -f machine-approver-5646d57764-kzx6l -n openshift-cluster-machine-approver
[...]
I0828 09:41:07.768120       1 main.go:107] CSR csr-tf478 added
I0828 09:41:07.790716       1 main.go:132] CSR csr-tf478 not authorized: node morenod-ocp-wvn8n-worker-9sfbs already exists
I0828 09:41:07.790775       1 main.go:164] Error syncing csr csr-tf478: node morenod-ocp-wvn8n-worker-9sfbs already exists
I0828 09:41:07.796244       1 main.go:107] CSR csr-tf478 added
I0828 09:41:07.814771       1 main.go:132] CSR csr-tf478 not authorized: node morenod-ocp-wvn8n-worker-9sfbs already exists
I0828 09:41:07.814919       1 main.go:164] Error syncing csr csr-tf478: node morenod-ocp-wvn8n-worker-9sfbs already exists
I0828 09:41:07.825210       1 main.go:107] CSR csr-tf478 added
I0828 09:41:07.841866       1 main.go:132] CSR csr-tf478 not authorized: node morenod-ocp-wvn8n-worker-9sfbs already exists
I0828 09:41:07.842074       1 main.go:164] Error syncing csr csr-tf478: node morenod-ocp-wvn8n-worker-9sfbs already exists
I0828 09:41:07.862340       1 main.go:107] CSR csr-tf478 added
I0828 09:41:07.882301       1 main.go:132] CSR csr-tf478 not authorized: node morenod-ocp-wvn8n-worker-9sfbs already exists
I0828 09:41:07.882462       1 main.go:164] Error syncing csr csr-tf478: node morenod-ocp-wvn8n-worker-9sfbs already exists
I0828 09:41:07.922685       1 main.go:107] CSR csr-tf478 added
I0828 09:41:07.936928       1 main.go:132] CSR csr-tf478 not authorized: node morenod-ocp-wvn8n-worker-9sfbs already exists
I0828 09:41:07.937070       1 main.go:164] Error syncing csr csr-tf478: node morenod-ocp-wvn8n-worker-9sfbs already exists
I0828 09:41:08.017303       1 main.go:107] CSR csr-tf478 added
I0828 09:41:08.031257       1 main.go:132] CSR csr-tf478 not authorized: node morenod-ocp-wvn8n-worker-9sfbs already exists
E0828 09:41:08.031319       1 main.go:174] node morenod-ocp-wvn8n-worker-9sfbs already exists
I0828 09:41:08.031334       1 main.go:175] Dropping CSR "csr-tf478" out of the queue: node morenod-ocp-wvn8n-worker-9sfbs already exists


Also, OCP keeps trying to terminate pods that are not present on the node (they belong to the deleted instance).

The ages of the machines and nodes are not updated; they still show the values from the deleted instance:

# oc get nodes
NAME                             STATUS     ROLES    AGE   VERSION
morenod-ocp-wvn8n-master-0       Ready      master   37m   v1.14.0+09eb70949
morenod-ocp-wvn8n-master-1       Ready      master   37m   v1.14.0+09eb70949
morenod-ocp-wvn8n-master-2       Ready      master   37m   v1.14.0+09eb70949
morenod-ocp-wvn8n-worker-9sfbs   NotReady   worker   30m   v1.14.0+09eb70949
morenod-ocp-wvn8n-worker-qx7xm   Ready      worker   31m   v1.14.0+09eb70949
morenod-ocp-wvn8n-worker-zvz88   Ready      worker   31m   v1.14.0+09eb70949


# oc get machines -A
NAMESPACE               NAME                             STATE    TYPE           REGION      ZONE   AGE
openshift-machine-api   morenod-ocp-wvn8n-master-0       ACTIVE   ci.m1.xlarge   regionOne   nova   37m
openshift-machine-api   morenod-ocp-wvn8n-master-1       ACTIVE   ci.m1.xlarge   regionOne   nova   37m
openshift-machine-api   morenod-ocp-wvn8n-master-2       ACTIVE   ci.m1.xlarge   regionOne   nova   37m
openshift-machine-api   morenod-ocp-wvn8n-worker-9sfbs   ACTIVE   ci.m1.xlarge   regionOne   nova   35m
openshift-machine-api   morenod-ocp-wvn8n-worker-qx7xm   ACTIVE   ci.m1.xlarge   regionOne   nova   35m
openshift-machine-api   morenod-ocp-wvn8n-worker-zvz88   ACTIVE   ci.m1.xlarge   regionOne   nova   35m


All workers show roughly the same age in OCP, but the OSP console shows that the recreated instance is much newer:

morenod-ocp-wvn8n-worker-9sfbs   rhcos-42.80.20190828.0   192.168.0.34   ci.m1.xlarge   -   Active   nova   None   Running   1 minute
morenod-ocp-wvn8n-worker-qx7xm   rhcos-42.80.20190828.0   192.168.0.30   ci.m1.xlarge   -   Active   nova   None   Running   27 minutes
morenod-ocp-wvn8n-worker-zvz88   rhcos-42.80.20190828.0   192.168.0.27   ci.m1.xlarge   -   Active   nova   None   Running   28 minutes

# oc get csr
NAME        AGE     REQUESTOR                                                                   CONDITION
csr-6pn4j   39m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-8plwp   39m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-9f2pw   39m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-dcgdm   32m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-dvt5q   33m     system:node:morenod-ocp-wvn8n-worker-zvz88                                  Approved,Issued
csr-gbwps   39m     system:node:morenod-ocp-wvn8n-master-0                                      Approved,Issued
csr-khv9l   33m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-q9kdb   39m     system:node:morenod-ocp-wvn8n-master-2                                      Approved,Issued
csr-t2rvq   32m     system:node:morenod-ocp-wvn8n-worker-9sfbs                                  Approved,Issued
csr-tf478   6m56s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-wqpfb   32m     system:node:morenod-ocp-wvn8n-worker-qx7xm                                  Approved,Issued
csr-wrpnb   33m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-x2z9v   39m     system:node:morenod-ocp-wvn8n-master-1                                      Approved,Issued


# oc adm certificate approve csr-tf478
certificatesigningrequest.certificates.k8s.io/csr-tf478 approved


After manually approving the pending certificate, the node becomes Ready:

# oc get nodes
NAME                             STATUS   ROLES    AGE   VERSION
morenod-ocp-wvn8n-master-0       Ready    master   40m   v1.14.0+09eb70949
morenod-ocp-wvn8n-master-1       Ready    master   41m   v1.14.0+09eb70949
morenod-ocp-wvn8n-master-2       Ready    master   40m   v1.14.0+09eb70949
morenod-ocp-wvn8n-worker-9sfbs   Ready    worker   34m   v1.14.0+09eb70949
morenod-ocp-wvn8n-worker-qx7xm   Ready    worker   34m   v1.14.0+09eb70949
morenod-ocp-wvn8n-worker-zvz88   Ready    worker   35m   v1.14.0+09eb70949
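
The manual approval above can be prepared with a small helper that picks the Pending entries out of `oc get csr` output. The following Go program is only a sketch: the function name `pendingCSRs` is hypothetical, and it assumes the four-column layout (NAME, AGE, REQUESTOR, CONDITION) shown in this report. It prints the approve commands instead of running them, so each CSR can still be reviewed by hand.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// pendingCSRs returns the names of CSRs whose CONDITION column is "Pending".
// Assumption: the input is the plain-text table printed by `oc get csr`,
// with the columns NAME, AGE, REQUESTOR, CONDITION.
func pendingCSRs(ocGetCSROutput string) []string {
	var names []string
	sc := bufio.NewScanner(strings.NewReader(ocGetCSROutput))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Skip blank or malformed lines and the header row.
		if len(fields) < 4 || fields[0] == "NAME" {
			continue
		}
		if fields[len(fields)-1] == "Pending" {
			names = append(names, fields[0])
		}
	}
	return names
}

func main() {
	// Sample rows copied from the `oc get csr` output in this report.
	out := `NAME        AGE     REQUESTOR                                                                   CONDITION
csr-t2rvq   32m     system:node:morenod-ocp-wvn8n-worker-9sfbs                                  Approved,Issued
csr-tf478   6m56s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending`

	// Print the commands rather than executing them.
	for _, name := range pendingCSRs(out) {
		fmt.Printf("oc adm certificate approve %s\n", name)
	}
}
```

Approving a CSR without checking who requested it weakens the protection the approver provides, so the printed commands are meant to be inspected before use.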


Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-08-28-083236

How reproducible:

Steps to Reproduce:
1. Install a fresh OCP cluster using IPI on OSP.
2. Open the OSP console and delete a worker instance.
3. Check that the instance is created again but never becomes Ready.

Actual results:
The node cannot be used because it is not in Ready status.

Expected results:
The node becomes Ready.

Additional info:

Comment 2 Brad Ison 2019-08-29 12:03:55 UTC
I don't think this is a guarantee we make at the moment -- that you can delete an instance backing a machine directly at the infrastructure provider, outside Kubernetes, and get a new functioning replacement without manual intervention.

The problem here is that the hostname for the new instance is the same as it was previously, and a node by that name already exists. But, even if that weren't the case, the Machine object will already have a nodeRef set, so the cluster-machine-approver will still refuse to approve the CSR. You would have to manually approve the CSR in this case.
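
The two refusals described above can be sketched in Go as follows. This is not the cluster-machine-approver source: the `Machine` type, the `authorizeCSR` function, and the matching of machines to nodes by name are simplifying assumptions made only to illustrate the order of the checks.

```go
package main

import (
	"errors"
	"fmt"
)

// Machine is a hypothetical, stripped-down stand-in for a Machine object.
type Machine struct {
	Name    string
	NodeRef string // name of the linked node; empty if no node is linked yet
}

// authorizeCSR sketches the decision for a bootstrap CSR requested on behalf
// of nodeName. Assumption: machines and nodes are matched by name here,
// which is a simplification for illustration.
func authorizeCSR(nodeName string, existingNodes map[string]bool, machines []Machine) error {
	// First refusal: a node with the requested name is already registered.
	if existingNodes[nodeName] {
		return fmt.Errorf("node %s already exists", nodeName)
	}
	for _, m := range machines {
		if m.Name != nodeName {
			continue
		}
		// Second refusal: the Machine is already linked to a node.
		if m.NodeRef != "" {
			return fmt.Errorf("machine %s already has a nodeRef", m.Name)
		}
		return nil
	}
	return errors.New("no machine matches node " + nodeName)
}

func main() {
	name := "morenod-ocp-wvn8n-worker-9sfbs"
	nodes := map[string]bool{name: true}
	machines := []Machine{{Name: name, NodeRef: name}}

	// The recreated instance reuses the old hostname, so the CSR is refused.
	fmt.Println(authorizeCSR(name, nodes, machines))
}
```

Even if the stale node entry were deleted first, the second check would still refuse the request until the nodeRef is cleared, which is why manual approval is needed.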

Comment 3 Mike Fedosin 2019-09-03 12:32:02 UTC
So far we cannot fix it in 4.2, because the necessary logic will be implemented by machine-api-operator in 4.3 only, when the healthchecker component is activated.
See: https://coreos.slack.com/archives/CBZHF4DHC/p1567432430029700?thread_ts=1567430585.028900&cid=CBZHF4DHC

The workaround is to remove the failed machine manually. It is explained here: https://github.com/openshift/installer/pull/2305

I've proposed a fix that just returns an error if we want to recreate an instance with the old name https://github.com/openshift/cluster-api-provider-openstack/pull/62
It doesn't really fix the problem, as manual intervention is required anyway, but I hope it solves https://bugzilla.redhat.com/show_bug.cgi?id=1748263, because after the manual master deletion a new one comes up automatically, and etcd works fine after that.

So, my suggestion is either to move the target to 4.3 or to close this bug as invalid.

Comment 6 Mike Fedosin 2019-09-05 10:11:49 UTC
Moved to 4.3, because we have to fix https://bugzilla.redhat.com/show_bug.cgi?id=1748263 first, and we can't do this without the cluster-etcd-operator component, which will appear only in 4.3.

Comment 8 egarcia 2019-12-12 20:27:01 UTC
Deferred to 4.4, since this BZ depends on https://bugzilla.redhat.com/show_bug.cgi?id=1748263, which was deferred to 4.4.

Comment 9 David Sanz 2020-02-19 10:15:12 UTC
You can close this BZ.

Behaviour has changed in versions >= 4.3: instances deleted from OSP are only recreated if a MachineHealthCheck object is present on the cluster.

Instances are created and they join the cluster correctly, so this bugzilla no longer makes sense.

