Bug 1839952 - Machines phase should become 'Failed' when its instance is deleted
Summary: Machines phase should become 'Failed' when its instance is deleted
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Alberto
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-05-26 05:51 UTC by sunzhaohua
Modified: 2020-10-27 16:01 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:01:02 UTC
Target Upstream Version:
Embargoed:




Links
System                  ID              Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata  RHBA-2020:4196  0        None      None    None     2020-10-27 16:01:27 UTC

Description sunzhaohua 2020-05-26 05:51:23 UTC
Description of problem:
After terminating a running instance from the AWS, Azure, or GCP web console, the corresponding machine's phase still shows "Running".

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-05-25-052746

How reproducible:
Always

Steps to Reproduce:
1. Terminate a running instance from the AWS, Azure, or GCP web console.
2. Check the phase of the corresponding machine.

Actual results:
The machine phase is still Running.
$ oc get machine -o wide
NAME                                        PHASE     TYPE        REGION      ZONE         AGE   NODE                                         PROVIDERID                              STATE
zhsunaws525-qtlbn-master-0                  Running   m4.xlarge   us-east-2   us-east-2a   23h   ip-10-0-132-252.us-east-2.compute.internal   aws:///us-east-2a/i-0853c407eef01db2d   running
zhsunaws525-qtlbn-master-1                  Running   m4.xlarge   us-east-2   us-east-2b   23h   ip-10-0-172-96.us-east-2.compute.internal    aws:///us-east-2b/i-04f8bd514ff1bfa86   running
zhsunaws525-qtlbn-master-2                  Running   m4.xlarge   us-east-2   us-east-2c   23h   ip-10-0-215-247.us-east-2.compute.internal   aws:///us-east-2c/i-07cfd6d19592182b6   running
zhsunaws525-qtlbn-worker-us-east-2a-wbkws   Running   m4.large    us-east-2   us-east-2a   23h   ip-10-0-152-19.us-east-2.compute.internal    aws:///us-east-2a/i-0b2f1f8b6b1fdc6a6   running
zhsunaws525-qtlbn-worker-us-east-2b-h8pq2   Running   m4.large    us-east-2   us-east-2b   23h   ip-10-0-179-126.us-east-2.compute.internal   aws:///us-east-2b/i-0f1ea8865fd3e68f5   Unknown
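
The mismatch in the table above (PHASE is Running while STATE is Unknown) can be spotted mechanically. Below is a minimal sketch, assuming the text output of `oc get machine -o wide` has been captured into a string. The function names `parse_machine_table` and `stale_running` and the sample rows are hypothetical, not part of any OpenShift tooling. Columns are sliced by header offset, so an empty column does not shift the values.

```python
import re

# Hypothetical sample in the same fixed-width layout as `oc get machine -o wide`.
SAMPLE = """\
NAME        PHASE     STATE
worker-a    Running   running
worker-b    Running   Unknown
"""

def parse_machine_table(text):
    """Parse fixed-width table text into a list of dicts keyed by header name."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0]
    # Record each column name together with the offset where it starts.
    cols = [(m.group(), m.start()) for m in re.finditer(r"\S+", header)]
    rows = []
    for ln in lines[1:]:
        row = {}
        for i, (name, start) in enumerate(cols):
            # A column ends where the next one starts; the last runs to end of line.
            end = cols[i + 1][1] if i + 1 < len(cols) else None
            row[name] = ln[start:end].strip()
        rows.append(row)
    return rows

def stale_running(rows):
    """Return names of machines reported as Running whose cloud STATE is not running."""
    return [r["NAME"] for r in rows
            if r["PHASE"] == "Running" and r.get("STATE", "running") != "running"]

if __name__ == "__main__":
    print(stale_running(parse_machine_table(SAMPLE)))
```

Run against the real output above, such a check would flag only the worker in us-east-2b whose instance was terminated.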


I0526 01:19:28.695088       1 controller.go:169] zhsunaws525-qtlbn-worker-us-east-2b-h8pq2: reconciling Machine
I0526 01:19:28.695101       1 actuator.go:100] zhsunaws525-qtlbn-worker-us-east-2b-h8pq2: actuator checking if machine exists
W0526 01:19:28.756428       1 reconciler.go:364] zhsunaws525-qtlbn-worker-us-east-2b-h8pq2: Failed to find existing instance by id i-0f1ea8865fd3e68f5: instance i-0f1ea8865fd3e68f5 state "terminated" is not in running, pending, stopped, stopping, shutting-down
E0526 01:19:28.810651       1 utils.go:166] Excluding instance matching zhsunaws525-qtlbn-worker-us-east-2b-h8pq2: instance i-0f1ea8865fd3e68f5 state "terminated" is not in running, pending, stopped, stopping, shutting-down
I0526 01:19:28.810674       1 reconciler.go:210] zhsunaws525-qtlbn-worker-us-east-2b-h8pq2: Instance does not exist
I0526 01:19:28.810682       1 controller.go:424] zhsunaws525-qtlbn-worker-us-east-2b-h8pq2: going into phase "Failed"
I0526 01:19:28.842111       1 controller.go:282] controller-runtime/controller "msg"="Successfully Reconciled"  "controller"="machine_controller" "request"={"Namespace":"openshift-machine-api","Name":"zhsunaws525-qtlbn-worker-us-east-2b-h8pq2"}
I0526 01:19:28.842158       1 controller.go:169] zhsunaws525-qtlbn-worker-us-east-2b-h8pq2: reconciling Machine
I0526 01:19:28.842166       1 actuator.go:100] zhsunaws525-qtlbn-worker-us-east-2b-h8pq2: actuator checking if machine exists
W0526 01:19:28.898814       1 reconciler.go:364] zhsunaws525-qtlbn-worker-us-east-2b-h8pq2: Failed to find existing instance by id i-0f1ea8865fd3e68f5: instance i-0f1ea8865fd3e68f5 state "terminated" is not in running, pending, stopped, stopping, shutting-down
E0526 01:19:28.953888       1 utils.go:166] Excluding instance matching zhsunaws525-qtlbn-worker-us-east-2b-h8pq2: instance i-0f1ea8865fd3e68f5 state "terminated" is not in running, pending, stopped, stopping, shutting-down
I0526 01:19:28.953921       1 reconciler.go:210] zhsunaws525-qtlbn-worker-us-east-2b-h8pq2: Instance does not exist
I0526 01:19:28.953932       1 controller.go:424] zhsunaws525-qtlbn-worker-us-east-2b-h8pq2: going into phase "Failed"
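
The log above implies a two-step decision: an instance counts as existing only if its state is one of the listed values, and a machine whose instance no longer exists should go to "Failed". The sketch below illustrates that expected behavior and is not the provider's actual code. The set of states is copied from the log message, while `instance_exists`, `next_phase`, and the `was_provisioned` flag are hypothetical names introduced for illustration.

```python
# States taken from the log message: an instance in any other state
# (for example "terminated") is treated as not existing.
EXISTING_STATES = {"running", "pending", "stopped", "stopping", "shutting-down"}

def instance_exists(instance_state):
    """Return True if the cloud instance state counts as an existing instance."""
    return instance_state in EXISTING_STATES

def next_phase(current_phase, was_provisioned, instance_state):
    """Return the phase a machine is expected to move to.

    was_provisioned is a hypothetical flag meaning the machine already had
    an instance (for example, an instance ID is recorded in its status).
    """
    if was_provisioned and not instance_exists(instance_state):
        return "Failed"
    return current_phase

if __name__ == "__main__":
    print(next_phase("Running", True, "terminated"))
```

Under these assumptions, the terminated worker would move from Running to Failed, while a merely stopped instance would leave the phase unchanged.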


status:
  addresses:
  - address: 10.0.179.126
    type: InternalIP
  - address: ip-10-0-179-126.us-east-2.compute.internal
    type: InternalDNS
  - address: ip-10-0-179-126.us-east-2.compute.internal
    type: Hostname
  errorMessage: Can't find created instance.
  lastUpdated: "2020-05-26T01:14:07Z"
  nodeRef:
    kind: Node
    name: ip-10-0-179-126.us-east-2.compute.internal
    uid: 43cee894-bb51-4dcc-a304-28a948fe6e67
  phase: Running
  providerStatus:
    conditions:
    - lastProbeTime: "2020-05-25T01:38:59Z"
      lastTransitionTime: "2020-05-25T01:38:59Z"
      message: Machine successfully created
      reason: MachineCreationSucceeded
      status: "True"
      type: MachineCreation
    instanceId: i-0f1ea8865fd3e68f5
    instanceState: running

Expected results:
Machine status.phase should become 'Failed'


Additional info:

Comment 1 Alberto 2020-05-26 08:06:13 UTC
This is expected. Once a machine is given a node, it is considered to be in the "Running" phase. The actual cloud state is reflected in the STATE column, here "Unknown". https://github.com/openshift/enhancements/blob/master/enhancements/machine-api/machine-instance-lifecycle.md
We should come up with a more meaningful name to show for the phase, similar to what we do for the console. This is not trivial to do without disrupting existing clients.

Comment 2 Alberto 2020-05-29 11:16:47 UTC
Please ignore my comment in https://bugzilla.redhat.com/show_bug.cgi?id=1839952#c1. I misread the description.

The machine should indeed go to the "Failed" phase if the underlying instance is deleted. This should be fixed by https://github.com/openshift/cluster-api-provider-aws/pull/325

Comment 5 sunzhaohua 2020-06-04 02:35:30 UTC
Verified
Tested on Azure with cluster version 4.5.0-0.nightly-2020-06-03-013823: after deleting an instance from the Azure web console, the corresponding machine goes to the Failed phase.
$ oc get machine
NAME                                     PHASE     TYPE              REGION   ZONE   AGE
zhsun63azure-7h44z-master-0              Running   Standard_D8s_v3   westus          18h
zhsun63azure-7h44z-master-1              Running   Standard_D8s_v3   westus          18h
zhsun63azure-7h44z-master-2              Running   Standard_D8s_v3   westus          18h
zhsun63azure-7h44z-worker-westus-4cmjd   Running   Standard_D2s_v3   westus          17h
zhsun63azure-7h44z-worker-westus-hv647   Running   Standard_D2s_v3   westus          17h
zhsun63azure-7h44z-worker-westus-wtz6j   Failed    Standard_D2s_v3   westus          17h

Comment 7 errata-xmlrpc 2020-10-27 16:01:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

