Bug 1843597

Summary: [IPI][OSP] Worker deleted on openstack is not recreated
Product: OpenShift Container Platform Reporter: David Sanz <dsanzmor>
Component: Cloud Compute Assignee: egarcia
Cloud Compute sub component: OpenStack Provider QA Contact: David Sanz <dsanzmor>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: adduarte, egarcia, m.andre, mfedosin, pprinett
Version: 4.5 Keywords: UpcomingSprint
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2020-10-27 16:04:47 UTC Type: Bug
Bug Blocks: 1848755    

Description David Sanz 2020-06-03 15:51:02 UTC
Description of problem:

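With a MachineHealthCheck created for the worker MachineSet, deleting a worker instance directly in OpenStack (openstack server delete <<UUID>>) moves the corresponding Machine to the "Failed" phase and its node to NotReady, but no replacement worker is created.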

$ oc get nodes,machines,machineset,MachineHealthCheck -A
NAME                               STATUS     ROLES    AGE   VERSION
node/mrnd-tst-6dlcb-master-0       Ready      master   43m   v1.18.3+a637491
node/mrnd-tst-6dlcb-master-1       Ready      master   43m   v1.18.3+a637491
node/mrnd-tst-6dlcb-master-2       Ready      master   43m   v1.18.3+a637491
node/mrnd-tst-6dlcb-worker-9wnc2   Ready      worker   30m   v1.18.3+a637491
node/mrnd-tst-6dlcb-worker-bgqgq   NotReady   worker   27m   v1.18.3+a637491

NAMESPACE               NAME                                                       PHASE     TYPE           REGION      ZONE   AGE
openshift-machine-api   machine.machine.openshift.io/mrnd-tst-6dlcb-master-0       Running   ci.m1.xlarge   regionOne   nova   43m
openshift-machine-api   machine.machine.openshift.io/mrnd-tst-6dlcb-master-1       Running   ci.m1.xlarge   regionOne   nova   43m
openshift-machine-api   machine.machine.openshift.io/mrnd-tst-6dlcb-master-2       Running   ci.m1.xlarge   regionOne   nova   43m
openshift-machine-api   machine.machine.openshift.io/mrnd-tst-6dlcb-worker-9wnc2   Running   ci.m1.xlarge   regionOne   nova   36m
openshift-machine-api   machine.machine.openshift.io/mrnd-tst-6dlcb-worker-bgqgq   Failed    ci.m1.xlarge   regionOne   nova   36m

NAMESPACE               NAME                                                    DESIRED   CURRENT   READY   AVAILABLE   AGE
openshift-machine-api   machineset.machine.openshift.io/mrnd-tst-6dlcb-worker   2         2         1       1           43m

NAMESPACE               NAME                                                             MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
openshift-machine-api   machinehealthcheck.machine.openshift.io/openstack-health-check   40%            2                  1
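To spot the affected machine without reading the table by eye, the machines listing above can be filtered for anything that is not in the Running phase. This is only a minimal sketch: the function name is hypothetical, the sample rows are copied from this report, and the whitespace-based column splitting assumes no column is empty.

```python
from typing import List, Tuple

# Two rows copied from the `oc get machines` output in this report.
MACHINES_TABLE = """\
NAMESPACE               NAME                                                       PHASE     TYPE           REGION      ZONE   AGE
openshift-machine-api   machine.machine.openshift.io/mrnd-tst-6dlcb-worker-9wnc2   Running   ci.m1.xlarge   regionOne   nova   36m
openshift-machine-api   machine.machine.openshift.io/mrnd-tst-6dlcb-worker-bgqgq   Failed    ci.m1.xlarge   regionOne   nova   36m
"""


def machines_not_running(table: str) -> List[Tuple[str, str]]:
    """Return (machine name, phase) for every row whose PHASE is not Running.

    Hypothetical helper: assumes every column is populated, so splitting on
    whitespace lines up with the header.
    """
    lines = [line for line in table.splitlines() if line.strip()]
    header = lines[0].split()
    name_idx = header.index("NAME")
    phase_idx = header.index("PHASE")

    result = []
    for line in lines[1:]:
        cols = line.split()
        # Drop the "machine.machine.openshift.io/" resource prefix.
        name = cols[name_idx].split("/")[-1]
        phase = cols[phase_idx]
        if phase != "Running":
            result.append((name, phase))
    return result


if __name__ == "__main__":
    for name, phase in machines_not_running(MACHINES_TABLE):
        print(f"{name}: {phase}")
```

For real clusters, structured output (for example `oc get machines -o json`) is likely more robust than parsing the table.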

Related log from machine-api-controllers:

I0603 15:26:45.455332       1 controller.go:165] mrnd-tst-6dlcb-worker-bgqgq: reconciling Machine
I0603 15:26:45.497169       1 utils.go:99] Cloud provider CA cert not provided, using system trust bundle
I0603 15:26:50.751532       1 controller.go:420] mrnd-tst-6dlcb-worker-bgqgq: going into phase "Failed"
I0603 15:26:50.771680       1 controller.go:165] mrnd-tst-6dlcb-worker-bgqgq: reconciling Machine
W0603 15:26:50.771710       1 controller.go:262] mrnd-tst-6dlcb-worker-bgqgq: machine has gone "Failed" phase. It won't reconcile
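The log lines above follow the usual klog layout (severity and date, time, process ID, source file and line, message). As a rough sketch, the phase transition can be pulled out of such a log as follows; the regular expressions and the function name are assumptions made for illustration, not part of machine-api-controllers.

```python
import re
from typing import List, Tuple

# Log lines copied from this report.
LOG = """\
I0603 15:26:45.455332       1 controller.go:165] mrnd-tst-6dlcb-worker-bgqgq: reconciling Machine
I0603 15:26:45.497169       1 utils.go:99] Cloud provider CA cert not provided, using system trust bundle
I0603 15:26:50.751532       1 controller.go:420] mrnd-tst-6dlcb-worker-bgqgq: going into phase "Failed"
I0603 15:26:50.771680       1 controller.go:165] mrnd-tst-6dlcb-worker-bgqgq: reconciling Machine
W0603 15:26:50.771710       1 controller.go:262] mrnd-tst-6dlcb-worker-bgqgq: machine has gone "Failed" phase. It won't reconcile
"""

# Severity letter + MMDD, time, process ID, file:line, then the message.
KLOG_RE = re.compile(
    r'^(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+'
    r'(?P<pid>\d+) (?P<src>[^\s:]+:\d+)\] (?P<msg>.*)$'
)
# Message emitted when a Machine changes phase, as seen in the log above.
PHASE_RE = re.compile(r'^(?P<machine>\S+): going into phase "(?P<phase>[^"]+)"')


def phase_transitions(log: str) -> List[Tuple[str, str, str]]:
    """Return (time, machine, new phase) for each phase-change line."""
    found = []
    for line in log.splitlines():
        entry = KLOG_RE.match(line)
        if entry is None:
            continue  # not a klog-formatted line
        change = PHASE_RE.match(entry.group("msg"))
        if change:
            found.append(
                (entry.group("time"), change.group("machine"), change.group("phase"))
            )
    return found


if __name__ == "__main__":
    for when, machine, phase in phase_transitions(LOG):
        print(f"{when} {machine} -> {phase}")
```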

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-06-03-105031

How reproducible:


Steps to Reproduce:
1. Install an IPI cluster on OSP
2. Create a MachineHealthCheck for the worker MachineSet
3. Delete a worker instance directly in OpenStack
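The report does not include the MachineHealthCheck manifest that was used, so the following is only a sketch of what step 2 might look like. The name `openstack-health-check`, the `openshift-machine-api` namespace, and the 40% `maxUnhealthy` value come from the output in this report; the selector label and the 300s timeouts are assumptions.

```python
import json


def worker_health_check(machineset_name: str) -> dict:
    """Build a MachineHealthCheck manifest for one MachineSet.

    Hypothetical helper: the selector label and the timeouts below are
    assumed values, not taken from this bug report.
    """
    return {
        "apiVersion": "machine.openshift.io/v1beta1",
        "kind": "MachineHealthCheck",
        "metadata": {
            "name": "openstack-health-check",      # from the report
            "namespace": "openshift-machine-api",  # from the report
        },
        "spec": {
            "selector": {
                "matchLabels": {
                    # Assumed: select machines by their owning MachineSet.
                    "machine.openshift.io/cluster-api-machineset": machineset_name,
                }
            },
            "unhealthyConditions": [
                # Assumed timeouts; tune for the environment.
                {"type": "Ready", "status": "False", "timeout": "300s"},
                {"type": "Ready", "status": "Unknown", "timeout": "300s"},
            ],
            "maxUnhealthy": "40%",  # from the report
        },
    }


if __name__ == "__main__":
    # Prints a JSON manifest that `oc apply -f -` can read from stdin.
    print(json.dumps(worker_health_check("mrnd-tst-6dlcb-worker"), indent=2))
```

The printed JSON can be piped to `oc apply -f -`. Step 3 then corresponds to the `openstack server delete` command quoted in the description, run against the UUID of one worker instance.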

Actual results:
Worker is not recreated

Expected results:
Worker is recreated by the machineset

Additional info:

Comment 2 Mike Fedosin 2020-06-22 08:36:34 UTC
The fix has been merged, so I am moving this bug to ON_QA: https://github.com/openshift/cluster-api-provider-openstack/pull/101

Comment 3 David Sanz 2020-06-22 11:32:38 UTC
Verified on 4.6.0-0.nightly-2020-06-20-011219

Comment 6 errata-xmlrpc 2020-10-27 16:04:47 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196