Bug 1875598 - machine status is Running for a master node which has been terminated from the console
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.7.0
Assignee: Joel Speed
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-09-03 20:38 UTC by Roshni
Modified: 2021-02-24 15:17 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Once a Machine entered a Failed state, the cloud provider state was no longer reconciled.
Consequence: Machine status would report the cloud VM as running even after the VM could have been removed.
Fix: Set the VM state to unknown if the Machine ends up in a failed state.
Result: The status more accurately reflects the observed state of the world.
Clone Of:
Environment:
Last Closed: 2021-02-24 15:17:26 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-api-provider-aws pull 355 | 0 | None | closed | BUG 1875598: Ensure the Virtual Machine provider state is set to Unknown when Failed | 2021-02-12 19:33:11 UTC
Github | openshift cluster-api-provider-azure pull 168 | 0 | None | closed | BUG 1875598: Ensure the Virtual Machine provider state is set to Unknown when Failed | 2021-02-12 19:33:11 UTC
Github | openshift cluster-api-provider-baremetal pull 118 | 0 | None | closed | Bug 1883497: Fix missing logs due to mixed klog versions | 2021-02-12 19:33:10 UTC
Github | openshift cluster-api-provider-gcp pull 122 | 0 | None | closed | BUG 1875598: Ensure the Virtual Machine provider state is set to Unknown when Failed | 2021-02-12 19:33:11 UTC
Github | openshift cluster-api-provider-openstack pull 128 | 0 | None | closed | BUG 1875598: Ensure the Virtual Machine provider state is set to Unknown when Failed | 2021-02-12 19:33:11 UTC
Github | openshift machine-api-operator pull 696 | 0 | None | closed | BUG 1875598: Ensure the Virtual Machine provider state is set to Unknown when Failed | 2021-02-12 19:33:11 UTC
Red Hat Product Errata | RHSA-2020:5633 | 0 | None | None | None | 2021-02-24 15:17:46 UTC

Description Roshni 2020-09-03 20:38:13 UTC
Description of problem:
machine status is Running for a master node which has been terminated from the console

Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-03-084733   True        False         5h38m   Cluster version is 4.6.0-0.nightly-2020-09-03-084733


How reproducible:


Steps to Reproduce:
1. Shut down a cluster.
2. Terminate a master node.
3. Restart the cluster.

Actual results:
$ oc get machines -n openshift-machine-api
NAME                                         PHASE     TYPE            REGION        ZONE            AGE
rpattath-46-artifacts-xhgjs-master-0         Running   n1-standard-4   us-central1   us-central1-a   4h50m
rpattath-46-artifacts-xhgjs-master-1         Running   n1-standard-4   us-central1   us-central1-b   4h50m
rpattath-46-artifacts-xhgjs-master-2         Failed    n1-standard-4   us-central1   us-central1-c   4h50m
rpattath-46-artifacts-xhgjs-worker-a-m4tpk   Running   n1-standard-4   us-central1   us-central1-a   4h40m
rpattath-46-artifacts-xhgjs-worker-b-2qtfh   Running   n1-standard-4   us-central1   us-central1-b   4h40m
rpattath-46-artifacts-xhgjs-worker-c-7wtxq   Running   n1-standard-4   us-central1   us-central1-c   4h40m
[rpattath_redhat_com@rpattath-vpc-bastion1 ~]$ oc get machines -A -ojsonpath='{range .items[*]}{@.status.nodeRef.name}{"\t"}{@.status.providerStatus.instanceState}{"\n"}' | grep -v running
rpattath-46-artifacts-xhgjs-master-0.c.openshift-qe.internal	RUNNING
rpattath-46-artifacts-xhgjs-master-1.c.openshift-qe.internal	RUNNING
rpattath-46-artifacts-xhgjs-master-2.c.openshift-qe.internal	RUNNING
rpattath-46-artifacts-xhgjs-worker-a-m4tpk	RUNNING
rpattath-46-artifacts-xhgjs-worker-b-2qtfh	RUNNING
rpattath-46-artifacts-xhgjs-worker-c-7wtxq	RUNNING

Expected results:
State of the terminated node should not be running.

Additional info:
This was seen on gcp.
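
Note that the `grep -v running` filter in the command above is case-sensitive, so it does not remove the uppercase `RUNNING` values that GCP reports, which is why every machine is listed. A minimal Python sketch of a case-insensitive version of the same check follows. It assumes the input is the JSON produced by `oc get machines -A -o json`; the field paths are taken from the jsonpath expression above, and the function name `machines_not_running` is hypothetical.

```python
import json


def machines_not_running(machine_list_json: str) -> list[tuple[str, str]]:
    """Return (node name, instance state) pairs whose state is not 'running'.

    The comparison ignores case, so 'RUNNING' (GCP) and 'running' (AWS)
    are both treated as running.
    """
    result = []
    for item in json.loads(machine_list_json).get("items", []):
        status = item.get("status") or {}
        # Fall back to placeholders when a field is absent on the Machine.
        node = (status.get("nodeRef") or {}).get("name", "<no nodeRef>")
        state = (status.get("providerStatus") or {}).get("instanceState", "<unset>")
        if state.lower() != "running":
            result.append((node, state))
    return result
```

With the output shown above, this function would return an empty list, which makes the stale state of master-2 easier to spot than an unfiltered listing does.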

Comment 1 Joel Speed 2020-09-07 12:26:27 UTC
Could you provide a must-gather from the cluster? Failing that, please provide the machine-controller logs from the machine-api-controllers pod in the openshift-machine-api namespace, and the YAML representation of the stopped master machine.

I have a feeling that this isn't limited to master machines, nor is it limited to GCP. If I remember correctly we always just set the main machine phase to failed rather than updating the provider status when this happens.

Comment 2 Joel Speed 2020-09-07 13:42:42 UTC
I've confirmed that this is also reproducible by deleting a worker machine. To fix this, we would want to add some way to sync the provider status when the machine doesn't exist. I think this would mean changing the actuator interface to introduce a new method. If that is the case, this would be a major change, and I would recommend we defer fixing it until 4.7. We may be able to just call update in this case and have the providers clear their status, but we will need to check that each provider is able to handle this gracefully.

Comment 3 Alberto 2020-09-07 13:48:37 UTC
Machines can go Failed either on creation because of invalid config, or because the instance was deleted out of band. For both cases I think setting the provider status to Unknown is acceptable. This should be happening already: https://github.com/openshift/machine-api-operator/blob/master/pkg/controller/machine/controller.go#L453 so if it's not, that's a code bug.

Comment 4 Alberto 2020-09-07 13:49:53 UTC
See original PR https://github.com/openshift/machine-api-operator/pull/575

Comment 5 Joel Speed 2020-09-07 13:59:39 UTC
@Alberto this bug is referring to .status.providerStatus.instanceState which is set by each actuator currently to `Running` once the instance comes up.
Because we don't call the actuator at all after we determine the machine has failed, nothing updates the providerStatus and so it ends up out of sync/still saying Running.

I believe the code that you've linked is working as expected in this case based on my testing.
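
To illustrate the behaviour described in this comment, here is a simplified Python model. It is not the machine-api-operator code (which is written in Go); the `Machine` class, the `reconcile` function and the `reset_provider_state` flag are hypothetical names used only to contrast the behaviour before and after the proposed fix.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    """Toy stand-in for a Machine object with only the fields discussed here."""
    phase: str = "Running"
    provider_status: dict = field(
        default_factory=lambda: {"instanceState": "running"}
    )


def reconcile(machine: Machine, instance_exists: bool,
              reset_provider_state: bool) -> Machine:
    """Model one reconcile pass for an already provisioned machine."""
    if machine.phase == "Failed":
        # Once Failed, the actuator is no longer called, so nothing
        # refreshes provider_status from here on.
        return machine
    if not instance_exists:
        machine.phase = "Failed"
        if reset_provider_state:
            # Behaviour after the fix: record that the state is unknown.
            machine.provider_status["instanceState"] = "Unknown"
    return machine
```

With `reset_provider_state=False` the machine ends up with phase `Failed` but `instanceState` still `running`, which matches the mismatch reported in this bug. With `reset_provider_state=True` the provider state becomes `Unknown`.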

Comment 6 Joel Speed 2020-09-11 09:15:25 UTC
Waiting on tests to pass for this; we seem to have some flakiness. It should merge today or early next week, hopefully.

Comment 8 sunzhaohua 2020-09-25 16:13:25 UTC
Failed to verify. Tested on AWS and found different results: zhsun924aws-8gmkk-worker-us-east-2a-ck968 has "instanceState: running" and zhsun924aws-8gmkk-worker-us-east-2c-lgsqf has "instanceState: shutting-down"; neither is Unknown.
clusterversion: 4.6.0-0.nightly-2020-09-24-074159

Steps:
1. Terminate instances from the AWS web console.
2. Check the machine instanceState/vmState.

$ oc get machine -o wide
NAME                                        PHASE     TYPE        REGION      ZONE         AGE     NODE                                         PROVIDERID                              STATE
zhsun924aws-8gmkk-master-0                  Running   m5.xlarge   us-east-2   us-east-2a   29h     ip-10-0-133-192.us-east-2.compute.internal   aws:///us-east-2a/i-0e9e1ccf5522a8e58   running
zhsun924aws-8gmkk-master-1                  Running   m5.xlarge   us-east-2   us-east-2b   29h     ip-10-0-176-241.us-east-2.compute.internal   aws:///us-east-2b/i-006dfa039aaf08211   running
zhsun924aws-8gmkk-master-2                  Running   m5.xlarge   us-east-2   us-east-2c   29h     ip-10-0-213-91.us-east-2.compute.internal    aws:///us-east-2c/i-0971ca92b91c7cfed   running
zhsun924aws-8gmkk-worker-us-east-2a-ck968   Failed    m5.large    us-east-2   us-east-2a   29h     ip-10-0-152-114.us-east-2.compute.internal   aws:///us-east-2a/i-09d7a4c72fb8ffb43   Unknown
zhsun924aws-8gmkk-worker-us-east-2b-sq2jt   Running   m5.large    us-east-2   us-east-2b   29h     ip-10-0-177-121.us-east-2.compute.internal   aws:///us-east-2b/i-01e1ce14de47a8c81   running
zhsun924aws-8gmkk-worker-us-east-2c-lgsqf   Failed    m5.large    us-east-2   us-east-2c   29h     ip-10-0-220-249.us-east-2.compute.internal   aws:///us-east-2c/i-0c527cfa402f6f067   Unknown

$ oc get machine zhsun924aws-8gmkk-worker-us-east-2a-ck968 -o yaml
status:
  addresses:
  - address: 10.0.152.114
    type: InternalIP
  - address: ip-10-0-152-114.us-east-2.compute.internal
    type: InternalDNS
  - address: ip-10-0-152-114.us-east-2.compute.internal
    type: Hostname
  errorMessage: Can't find created instance.
  lastUpdated: "2020-09-25T14:31:34Z"
  nodeRef:
    kind: Node
    name: ip-10-0-152-114.us-east-2.compute.internal
    uid: c591b5fd-3576-4d6c-a490-15aee99b1ca7
  phase: Failed
  providerStatus:
    conditions:
    - lastProbeTime: "2020-09-24T10:37:46Z"
      lastTransitionTime: "2020-09-24T10:37:46Z"
      message: Machine successfully created
      reason: MachineCreationSucceeded
      status: "True"
      type: MachineCreation
    instanceId: i-09d7a4c72fb8ffb43
    instanceState: running

$ oc get machine zhsun924aws-8gmkk-worker-us-east-2c-lgsqf -o yaml
  errorMessage: Can't find created instance.
  lastUpdated: "2020-09-25T15:45:43Z"
  nodeRef:
    kind: Node
    name: ip-10-0-220-249.us-east-2.compute.internal
    uid: f5451452-8bca-4629-8143-7be48ff2a4b7
  phase: Failed
  providerStatus:
    conditions:
    - lastProbeTime: "2020-09-24T10:37:49Z"
      lastTransitionTime: "2020-09-24T10:37:49Z"
      message: Machine successfully created
      reason: MachineCreationSucceeded
      status: "True"
      type: MachineCreation
    instanceId: i-0c527cfa402f6f067
    instanceState: shutting-down

Comment 12 Alberto 2020-10-02 10:11:02 UTC
Moving to baremetal and bumping to 4.7 since it's the only PR that remains unmerged.

Comment 13 Joel Speed 2020-10-02 12:42:33 UTC
Baremetal has now merged as part of https://github.com/openshift/cluster-api-provider-baremetal/pull/118

This is ready for QE. Moving back to cloud as this was primarily our effort.

Comment 16 sunzhaohua 2020-10-09 06:24:06 UTC
I have tested on AWS, GCP, Azure and vSphere; the instanceState is being updated as expected. On OSP, the machine doesn't have a providerStatus field, so I couldn't check there. Moving this to verified. @Joel Speed, I'd like to know why OSP differs from AWS, GCP and Azure, which have a providerStatus field.

verified on aws
clusterversion: 4.6.0-0.nightly-2020-10-08-210814
$ oc get machine -o wide
NAME                                        PHASE     TYPE        REGION      ZONE         AGE   NODE                                         PROVIDERID                              STATE
zhsun109aws-tsn5h-master-0                  Running   m5.xlarge   us-east-2   us-east-2a   71m   ip-10-0-142-174.us-east-2.compute.internal   aws:///us-east-2a/i-060c32bc8833bed44   running
zhsun109aws-tsn5h-master-1                  Running   m5.xlarge   us-east-2   us-east-2b   71m   ip-10-0-190-89.us-east-2.compute.internal    aws:///us-east-2b/i-0de030c3f1c87fb52   running
zhsun109aws-tsn5h-master-2                  Running   m5.xlarge   us-east-2   us-east-2c   71m   ip-10-0-206-22.us-east-2.compute.internal    aws:///us-east-2c/i-07ac5dcd8c4be3520   running
zhsun109aws-tsn5h-worker-us-east-2a-qxnvk   Running   m5.large    us-east-2   us-east-2a   60m   ip-10-0-130-174.us-east-2.compute.internal   aws:///us-east-2a/i-0cef294deb407c9fc   running
zhsun109aws-tsn5h-worker-us-east-2b-sgstt   Running   m5.large    us-east-2   us-east-2b   60m   ip-10-0-183-78.us-east-2.compute.internal    aws:///us-east-2b/i-0820fef5ce6c7fd93   running
zhsun109aws-tsn5h-worker-us-east-2c-frj7w   Failed    m5.large    us-east-2   us-east-2c   60m   ip-10-0-206-48.us-east-2.compute.internal    aws:///us-east-2c/i-05230e25ee2e8e854   Unknown
Status:
  Error Message:  Can't find created instance.
  Last Updated:   2020-10-09T03:01:10Z
  Node Ref:
    Kind:  Node
    Name:  ip-10-0-206-48.us-east-2.compute.internal
    UID:   ba7a87d7-e68a-4e6e-a8a4-80bc21ab1a41
  Phase:   Failed
  Provider Status:
    Conditions:
      Last Probe Time:       2020-10-09T02:03:28Z
      Last Transition Time:  2020-10-09T02:03:28Z
      Message:               Machine successfully created
      Reason:                MachineCreationSucceeded
      Status:                True
      Type:                  MachineCreation
    Instance Id:             i-05230e25ee2e8e854
    Instance State:          Unknown

verified on gcp
$ oc get machine -o wide
NAME                               PHASE     TYPE            REGION        ZONE            AGE   NODE                                                       PROVIDERID                                                          STATE
zhsun109gcp-m2fw5-master-0         Running   n1-standard-4   us-central1   us-central1-a   99m   zhsun109gcp-m2fw5-master-0.c.openshift-qe.internal         gce://openshift-qe/us-central1-a/zhsun109gcp-m2fw5-master-0         RUNNING
zhsun109gcp-m2fw5-master-1         Running   n1-standard-4   us-central1   us-central1-b   99m   zhsun109gcp-m2fw5-master-1.c.openshift-qe.internal         gce://openshift-qe/us-central1-b/zhsun109gcp-m2fw5-master-1         RUNNING
zhsun109gcp-m2fw5-master-2         Running   n1-standard-4   us-central1   us-central1-c   99m   zhsun109gcp-m2fw5-master-2.c.openshift-qe.internal         gce://openshift-qe/us-central1-c/zhsun109gcp-m2fw5-master-2         RUNNING
zhsun109gcp-m2fw5-worker-a-dqj26   Running   n1-standard-4   us-central1   us-central1-a   92m   zhsun109gcp-m2fw5-worker-a-dqj26.c.openshift-qe.internal   gce://openshift-qe/us-central1-a/zhsun109gcp-m2fw5-worker-a-dqj26   RUNNING
zhsun109gcp-m2fw5-worker-b-tlsbk   Running   n1-standard-4   us-central1   us-central1-b   92m   zhsun109gcp-m2fw5-worker-b-tlsbk.c.openshift-qe.internal   gce://openshift-qe/us-central1-b/zhsun109gcp-m2fw5-worker-b-tlsbk   RUNNING
zhsun109gcp-m2fw5-worker-c-dwvsb   Failed    n1-standard-4   us-central1   us-central1-c   92m   zhsun109gcp-m2fw5-worker-c-dwvsb.c.openshift-qe.internal   gce://openshift-qe/us-central1-c/zhsun109gcp-m2fw5-worker-c-dwvsb   Unknown
  Phase:   Failed
  Provider Status:
    Conditions:
      Last Probe Time:       2020-10-09T02:00:40Z
      Last Transition Time:  2020-10-09T02:00:40Z
      Message:               machine successfully created
      Reason:                MachineCreationSucceeded
      Status:                True
      Type:                  MachineCreated
    Instance Id:             zhsun109gcp-m2fw5-worker-c-dwvsb
    Instance State:          Unknown

verified on azure
$ oc get machine -o wide
NAME                                           PHASE     TYPE              REGION           ZONE   AGE   NODE                                           PROVIDERID                                                                                                                                                                                STATE
zhsun109az-hnqvv-master-0                      Running   Standard_D8s_v3   northcentralus          90m   zhsun109az-hnqvv-master-0                      azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsun109az-hnqvv-rg/providers/Microsoft.Compute/virtualMachines/zhsun109az-hnqvv-master-0                      Running
zhsun109az-hnqvv-master-1                      Running   Standard_D8s_v3   northcentralus          90m   zhsun109az-hnqvv-master-1                      azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsun109az-hnqvv-rg/providers/Microsoft.Compute/virtualMachines/zhsun109az-hnqvv-master-1                      Running
zhsun109az-hnqvv-master-2                      Running   Standard_D8s_v3   northcentralus          90m   zhsun109az-hnqvv-master-2                      azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsun109az-hnqvv-rg/providers/Microsoft.Compute/virtualMachines/zhsun109az-hnqvv-master-2                      Running
zhsun109az-hnqvv-worker-northcentralus-5q7v9   Running   Standard_D2s_v3   northcentralus          85m   zhsun109az-hnqvv-worker-northcentralus-5q7v9   azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsun109az-hnqvv-rg/providers/Microsoft.Compute/virtualMachines/zhsun109az-hnqvv-worker-northcentralus-5q7v9   Running
zhsun109az-hnqvv-worker-northcentralus-ghlmb   Running   Standard_D2s_v3   northcentralus          85m   zhsun109az-hnqvv-worker-northcentralus-ghlmb   azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsun109az-hnqvv-rg/providers/Microsoft.Compute/virtualMachines/zhsun109az-hnqvv-worker-northcentralus-ghlmb   Running
zhsun109az-hnqvv-worker-northcentralus-rwrv8   Failed    Standard_D2s_v3   northcentralus          85m   zhsun109az-hnqvv-worker-northcentralus-rwrv8   azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsun109az-hnqvv-rg/providers/Microsoft.Compute/virtualMachines/zhsun109az-hnqvv-worker-northcentralus-rwrv8   Unknown
  Phase:   Failed
  Provider Status:
    Conditions:
      Last Probe Time:       2020-10-09T02:09:15Z
      Last Transition Time:  2020-10-09T02:09:15Z
      Message:               machine successfully created
      Reason:                MachineCreationSucceeded
      Status:                True
      Type:                  MachineCreated
    Metadata:
    Vm Id:     /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsun109az-hnqvv-rg/providers/Microsoft.Compute/virtualMachines/zhsun109az-hnqvv-worker-northcentralus-rwrv8
    Vm State:  Unknown

verified on vsphere
$ oc get machine
NAME                             PHASE    TYPE   REGION   ZONE   AGE
zhsunvs2109-2h9b2-worker-qhps6   Failed                          12m
  Phase:   Failed
  Provider Status:
    Conditions:
      Last Probe Time:       2020-10-09T05:54:22Z
      Last Transition Time:  2020-10-09T05:54:22Z
      Message:               Machine successfully created
      Reason:                MachineCreationSucceeded
      Status:                True
      Type:                  MachineCreation
    Instance Id:             422b855c-8889-bb3c-f83b-2a3ff6029c3f
    Instance State:          Unknown
    Task Ref:                task-57827
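
The per-provider checks above could be automated along the following lines. This is a minimal Python sketch, assuming each machine is a dict parsed from `oc get machine <name> -o json`; the `instanceState` and `vmState` field names come from the outputs in this bug, while the function name `check_failed_machine` and its return strings are hypothetical.

```python
def check_failed_machine(machine: dict) -> str:
    """Classify the provider state of a single Machine object.

    Returns 'skipped' for machines that are not Failed, 'ok' when the
    provider state is 'Unknown', 'stale:<state>' when an old state is still
    reported, and 'no-state-field' when no state field is present (as
    reported for OSP in this comment).
    """
    status = machine.get("status") or {}
    if status.get("phase") != "Failed":
        return "skipped"
    provider_status = status.get("providerStatus") or {}
    # AWS, GCP and vSphere use instanceState; Azure uses vmState.
    state = provider_status.get("instanceState", provider_status.get("vmState"))
    if state is None:
        return "no-state-field"
    return "ok" if state == "Unknown" else f"stale:{state}"
```

Applied to the machines in comment 8, this would report `stale:running` and `stale:shutting-down`; applied to the Failed machines listed above, it would report `ok`.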

Comment 17 Joel Speed 2020-10-09 11:46:29 UTC
> @Joel Speed, I'd like to know why OSP differs from AWS, GCP and Azure, which have a providerStatus field.

Good question, not one I can really answer. I guess the openstack team didn't want to set a field that matches this pattern? I don't think there is one for baremetal either. Might just be they've never felt the need for it

Comment 20 errata-xmlrpc 2021-02-24 15:17:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

