Description of problem:
machine status is Running for a master node which has been terminated from the console

Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-03-084733   True        False         5h38m   Cluster version is 4.6.0-0.nightly-2020-09-03-084733

How reproducible:

Steps to Reproduce:
1. Shutdown a cluster.
2. Terminate a master node.
3. Restart the cluster

Actual results:
$ oc get machines -n openshift-machine-api
NAME                                         PHASE     TYPE            REGION        ZONE            AGE
rpattath-46-artifacts-xhgjs-master-0         Running   n1-standard-4   us-central1   us-central1-a   4h50m
rpattath-46-artifacts-xhgjs-master-1         Running   n1-standard-4   us-central1   us-central1-b   4h50m
rpattath-46-artifacts-xhgjs-master-2         Failed    n1-standard-4   us-central1   us-central1-c   4h50m
rpattath-46-artifacts-xhgjs-worker-a-m4tpk   Running   n1-standard-4   us-central1   us-central1-a   4h40m
rpattath-46-artifacts-xhgjs-worker-b-2qtfh   Running   n1-standard-4   us-central1   us-central1-b   4h40m
rpattath-46-artifacts-xhgjs-worker-c-7wtxq   Running   n1-standard-4   us-central1   us-central1-c   4h40m

[rpattath_redhat_com@rpattath-vpc-bastion1 ~]$ oc get machines -A -ojsonpath='{range .items[*]}{@.status.nodeRef.name}{"\t"}{@.status.providerStatus.instanceState}{"\n"}' | grep -v running
rpattath-46-artifacts-xhgjs-master-0.c.openshift-qe.internal   RUNNING
rpattath-46-artifacts-xhgjs-master-1.c.openshift-qe.internal   RUNNING
rpattath-46-artifacts-xhgjs-master-2.c.openshift-qe.internal   RUNNING
rpattath-46-artifacts-xhgjs-worker-a-m4tpk                     RUNNING
rpattath-46-artifacts-xhgjs-worker-b-2qtfh                     RUNNING
rpattath-46-artifacts-xhgjs-worker-c-7wtxq                     RUNNING

Expected results:
State of the terminated node should not be running.

Additional info:
This was seen on gcp.
Could you provide a must-gather from the cluster, if at all possible? Failing that, please provide the logs from the machine-controller in the machine-api-controllers pod in the openshift-machine-api namespace, and the YAML representation of the stopped master machine.

I have a feeling that this isn't limited to master machines, nor is it limited to GCP. If I remember correctly, we always just set the main machine phase to Failed rather than updating the provider status when this happens.
I've confirmed that this is also reproducible by deleting a worker machine.

To fix this, we would want to add some way to sync the provider status when the machine doesn't exist. I think this would mean changing the actuator interface to introduce a new method. If that is the case, this would be a major change, and I would recommend we defer fixing it until 4.7.

We may be able to just call update in this case and have the providers clear their status, but we will need to check that each provider is able to handle this gracefully.
Machines can go Failed either on creation, because of invalid config, or because the instance was deleted out of band. In both cases I think setting the provider status to unknown is acceptable. This should be happening already:
https://github.com/openshift/machine-api-operator/blob/master/pkg/controller/machine/controller.go#L453
If it's not, that's a code bug.
See original PR https://github.com/openshift/machine-api-operator/pull/575
@Alberto this bug is referring to .status.providerStatus.instanceState which is set by each actuator currently to `Running` once the instance comes up. Because we don't call the actuator at all after we determine the machine has failed, nothing updates the providerStatus and so it ends up out of sync/still saying Running. I believe the code that you've linked is working as expected in this case based on my testing.
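To make the mechanism above concrete, here is a toy model in Python. It is only a sketch, not the machine-api-operator code (which is written in Go): the `Machine` class, the `reconcile` function and the `clear_state_on_failure` flag are hypothetical names introduced for illustration. It assumes, as described above, that the actuator is no longer called once the machine has been marked Failed, so nothing refreshes `instanceState` unless the failure path writes `Unknown` itself.

```python
from dataclasses import dataclass, field
from typing import Optional

UNKNOWN = "Unknown"


@dataclass
class Machine:
    """Hypothetical stand-in for a Machine object; not the real API type."""
    name: str
    phase: str = "Running"
    error_message: Optional[str] = None
    # The actuator set this when the instance came up.
    provider_status: dict = field(
        default_factory=lambda: {"instanceState": "running"}
    )


def reconcile(machine: Machine, instance_exists: bool,
              clear_state_on_failure: bool) -> Machine:
    """Toy reconcile step (illustrative only)."""
    if machine.phase == "Failed":
        # Failed machines are not passed to the actuator again.
        return machine
    if instance_exists:
        # Normal path: the actuator would refresh providerStatus here.
        machine.provider_status["instanceState"] = "running"
        return machine
    # Instance is gone: mark the machine Failed without calling the actuator.
    machine.phase = "Failed"
    machine.error_message = "Can't find created instance."
    if clear_state_on_failure:
        # Sketch of the fix: stop reporting the last known state.
        machine.provider_status["instanceState"] = UNKNOWN
    return machine


before = reconcile(Machine("worker-a"), instance_exists=False,
                   clear_state_on_failure=False)
after = reconcile(Machine("worker-b"), instance_exists=False,
                  clear_state_on_failure=True)
print(before.phase, before.provider_status["instanceState"])  # Failed running
print(after.phase, after.provider_status["instanceState"])    # Failed Unknown
```

With the flag off, the model reproduces the reported symptom (phase Failed, state still running). With it on, the state becomes Unknown.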
Waiting on tests to pass for this; we seem to have some flakiness. It should merge today or early next week, hopefully.
Failed to verify. Tested on AWS and found inconsistent results: zhsun924aws-8gmkk-worker-us-east-2a-ck968 reports "instanceState: running" and zhsun924aws-8gmkk-worker-us-east-2c-lgsqf reports "instanceState: shutting-down"; neither is Unknown.

clusterversion: 4.6.0-0.nightly-2020-09-24-074159

Steps:
1. Terminate instances from the AWS web console
2. Check machine instanceState/vmState

$ oc get machine -o wide
NAME                                        PHASE     TYPE        REGION      ZONE         AGE   NODE                                         PROVIDERID                              STATE
zhsun924aws-8gmkk-master-0                  Running   m5.xlarge   us-east-2   us-east-2a   29h   ip-10-0-133-192.us-east-2.compute.internal   aws:///us-east-2a/i-0e9e1ccf5522a8e58   running
zhsun924aws-8gmkk-master-1                  Running   m5.xlarge   us-east-2   us-east-2b   29h   ip-10-0-176-241.us-east-2.compute.internal   aws:///us-east-2b/i-006dfa039aaf08211   running
zhsun924aws-8gmkk-master-2                  Running   m5.xlarge   us-east-2   us-east-2c   29h   ip-10-0-213-91.us-east-2.compute.internal    aws:///us-east-2c/i-0971ca92b91c7cfed   running
zhsun924aws-8gmkk-worker-us-east-2a-ck968   Failed    m5.large    us-east-2   us-east-2a   29h   ip-10-0-152-114.us-east-2.compute.internal   aws:///us-east-2a/i-09d7a4c72fb8ffb43   Unknown
zhsun924aws-8gmkk-worker-us-east-2b-sq2jt   Running   m5.large    us-east-2   us-east-2b   29h   ip-10-0-177-121.us-east-2.compute.internal   aws:///us-east-2b/i-01e1ce14de47a8c81   running
zhsun924aws-8gmkk-worker-us-east-2c-lgsqf   Failed    m5.large    us-east-2   us-east-2c   29h   ip-10-0-220-249.us-east-2.compute.internal   aws:///us-east-2c/i-0c527cfa402f6f067   Unknown

$ oc get machine zhsun924aws-8gmkk-worker-us-east-2a-ck968 -o yaml
status:
  addresses:
  - address: 10.0.152.114
    type: InternalIP
  - address: ip-10-0-152-114.us-east-2.compute.internal
    type: InternalDNS
  - address: ip-10-0-152-114.us-east-2.compute.internal
    type: Hostname
  errorMessage: Can't find created instance.
  lastUpdated: "2020-09-25T14:31:34Z"
  nodeRef:
    kind: Node
    name: ip-10-0-152-114.us-east-2.compute.internal
    uid: c591b5fd-3576-4d6c-a490-15aee99b1ca7
  phase: Failed
  providerStatus:
    conditions:
    - lastProbeTime: "2020-09-24T10:37:46Z"
      lastTransitionTime: "2020-09-24T10:37:46Z"
      message: Machine successfully created
      reason: MachineCreationSucceeded
      status: "True"
      type: MachineCreation
    instanceId: i-09d7a4c72fb8ffb43
    instanceState: running

$ oc get machine zhsun924aws-8gmkk-worker-us-east-2c-lgsqf -o yaml
  errorMessage: Can't find created instance.
  lastUpdated: "2020-09-25T15:45:43Z"
  nodeRef:
    kind: Node
    name: ip-10-0-220-249.us-east-2.compute.internal
    uid: f5451452-8bca-4629-8143-7be48ff2a4b7
  phase: Failed
  providerStatus:
    conditions:
    - lastProbeTime: "2020-09-24T10:37:49Z"
      lastTransitionTime: "2020-09-24T10:37:49Z"
      message: Machine successfully created
      reason: MachineCreationSucceeded
      status: "True"
      type: MachineCreation
    instanceId: i-0c527cfa402f6f067
    instanceState: shutting-down
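A check like the one above could be scripted. The following Python sketch flags machines whose phase is Failed while the provider status still reports something other than Unknown. It is a hedged illustration only: `SAMPLE` is hand-written data shaped like `oc get machines -o json` and abridged from the output above, the last sample entry (a machine with no providerStatus) is invented to show the skip path, and `provider_state` and `stale_failed_machines` are hypothetical helper names. It reads `instanceState` and falls back to `vmState`, the two field names mentioned in the steps above.

```python
import json

# Hand-written sample shaped like `oc get machines -o json` (abridged).
SAMPLE = json.dumps({"items": [
    {"metadata": {"name": "zhsun924aws-8gmkk-worker-us-east-2a-ck968"},
     "status": {"phase": "Failed",
                "providerStatus": {"instanceState": "running"}}},
    {"metadata": {"name": "zhsun924aws-8gmkk-worker-us-east-2b-sq2jt"},
     "status": {"phase": "Running",
                "providerStatus": {"instanceState": "running"}}},
    {"metadata": {"name": "zhsun924aws-8gmkk-worker-us-east-2c-lgsqf"},
     "status": {"phase": "Failed",
                "providerStatus": {"instanceState": "shutting-down"}}},
    # Invented entry: a Failed machine that has no providerStatus at all.
    {"metadata": {"name": "example-without-provider-status"},
     "status": {"phase": "Failed"}},
]})


def provider_state(machine):
    """Return instanceState, else vmState, else None if neither is set."""
    provider_status = machine.get("status", {}).get("providerStatus") or {}
    return provider_status.get("instanceState", provider_status.get("vmState"))


def stale_failed_machines(machine_list):
    """List (name, state) for Failed machines whose state is not Unknown."""
    stale = []
    for machine in machine_list.get("items", []):
        phase = machine.get("status", {}).get("phase")
        state = provider_state(machine)
        # Machines without any provider state cannot be judged; skip them.
        if phase == "Failed" and state not in (None, "Unknown"):
            stale.append((machine["metadata"]["name"], state))
    return stale


for name, state in stale_failed_machines(json.loads(SAMPLE)):
    print(f"{name}\t{state}")
```

Against a real cluster the same functions could be fed the parsed output of `oc get machines -n openshift-machine-api -o json` instead of `SAMPLE`.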
Moving to baremetal and bumping to 4.7, since it's the only PR that remains unmerged.
Baremetal has now merged as part of https://github.com/openshift/cluster-api-provider-baremetal/pull/118

This is ready for QE. Moving back to cloud as this was primarily our effort.
I have tested on AWS, GCP, Azure and vSphere; the instanceState is being updated as expected. On OSP, however, the machine doesn't have a providerStatus field, so I couldn't check. Moving this to verified.

@Joel Speed, I would like to know why OSP differs from AWS, GCP and Azure, which have a providerStatus field.

verified on aws
clusterversion: 4.6.0-0.nightly-2020-10-08-210814
$ oc get machine -o wide
NAME                                        PHASE     TYPE        REGION      ZONE         AGE   NODE                                         PROVIDERID                              STATE
zhsun109aws-tsn5h-master-0                  Running   m5.xlarge   us-east-2   us-east-2a   71m   ip-10-0-142-174.us-east-2.compute.internal   aws:///us-east-2a/i-060c32bc8833bed44   running
zhsun109aws-tsn5h-master-1                  Running   m5.xlarge   us-east-2   us-east-2b   71m   ip-10-0-190-89.us-east-2.compute.internal    aws:///us-east-2b/i-0de030c3f1c87fb52   running
zhsun109aws-tsn5h-master-2                  Running   m5.xlarge   us-east-2   us-east-2c   71m   ip-10-0-206-22.us-east-2.compute.internal    aws:///us-east-2c/i-07ac5dcd8c4be3520   running
zhsun109aws-tsn5h-worker-us-east-2a-qxnvk   Running   m5.large    us-east-2   us-east-2a   60m   ip-10-0-130-174.us-east-2.compute.internal   aws:///us-east-2a/i-0cef294deb407c9fc   running
zhsun109aws-tsn5h-worker-us-east-2b-sgstt   Running   m5.large    us-east-2   us-east-2b   60m   ip-10-0-183-78.us-east-2.compute.internal    aws:///us-east-2b/i-0820fef5ce6c7fd93   running
zhsun109aws-tsn5h-worker-us-east-2c-frj7w   Failed    m5.large    us-east-2   us-east-2c   60m   ip-10-0-206-48.us-east-2.compute.internal    aws:///us-east-2c/i-05230e25ee2e8e854   Unknown

Status:
  Error Message:  Can't find created instance.
  Last Updated:   2020-10-09T03:01:10Z
  Node Ref:
    Kind:  Node
    Name:  ip-10-0-206-48.us-east-2.compute.internal
    UID:   ba7a87d7-e68a-4e6e-a8a4-80bc21ab1a41
  Phase:   Failed
  Provider Status:
    Conditions:
      Last Probe Time:       2020-10-09T02:03:28Z
      Last Transition Time:  2020-10-09T02:03:28Z
      Message:               Machine successfully created
      Reason:                MachineCreationSucceeded
      Status:                True
      Type:                  MachineCreation
    Instance Id:     i-05230e25ee2e8e854
    Instance State:  Unknown

verified on gcp
$ oc get machine -o wide
NAME                               PHASE     TYPE            REGION        ZONE            AGE   NODE                                                       PROVIDERID                                                         STATE
zhsun109gcp-m2fw5-master-0         Running   n1-standard-4   us-central1   us-central1-a   99m   zhsun109gcp-m2fw5-master-0.c.openshift-qe.internal         gce://openshift-qe/us-central1-a/zhsun109gcp-m2fw5-master-0         RUNNING
zhsun109gcp-m2fw5-master-1         Running   n1-standard-4   us-central1   us-central1-b   99m   zhsun109gcp-m2fw5-master-1.c.openshift-qe.internal         gce://openshift-qe/us-central1-b/zhsun109gcp-m2fw5-master-1         RUNNING
zhsun109gcp-m2fw5-master-2         Running   n1-standard-4   us-central1   us-central1-c   99m   zhsun109gcp-m2fw5-master-2.c.openshift-qe.internal         gce://openshift-qe/us-central1-c/zhsun109gcp-m2fw5-master-2         RUNNING
zhsun109gcp-m2fw5-worker-a-dqj26   Running   n1-standard-4   us-central1   us-central1-a   92m   zhsun109gcp-m2fw5-worker-a-dqj26.c.openshift-qe.internal   gce://openshift-qe/us-central1-a/zhsun109gcp-m2fw5-worker-a-dqj26   RUNNING
zhsun109gcp-m2fw5-worker-b-tlsbk   Running   n1-standard-4   us-central1   us-central1-b   92m   zhsun109gcp-m2fw5-worker-b-tlsbk.c.openshift-qe.internal   gce://openshift-qe/us-central1-b/zhsun109gcp-m2fw5-worker-b-tlsbk   RUNNING
zhsun109gcp-m2fw5-worker-c-dwvsb   Failed    n1-standard-4   us-central1   us-central1-c   92m   zhsun109gcp-m2fw5-worker-c-dwvsb.c.openshift-qe.internal   gce://openshift-qe/us-central1-c/zhsun109gcp-m2fw5-worker-c-dwvsb   Unknown

  Phase:   Failed
  Provider Status:
    Conditions:
      Last Probe Time:       2020-10-09T02:00:40Z
      Last Transition Time:  2020-10-09T02:00:40Z
      Message:               machine successfully created
      Reason:                MachineCreationSucceeded
      Status:                True
      Type:                  MachineCreated
    Instance Id:     zhsun109gcp-m2fw5-worker-c-dwvsb
    Instance State:  Unknown

verified on azure
$ oc get machine -o wide
NAME                                           PHASE     TYPE              REGION           ZONE   AGE   NODE                                           PROVIDERID                                                                                                                                                                                   STATE
zhsun109az-hnqvv-master-0                      Running   Standard_D8s_v3   northcentralus          90m   zhsun109az-hnqvv-master-0                      azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsun109az-hnqvv-rg/providers/Microsoft.Compute/virtualMachines/zhsun109az-hnqvv-master-0                      Running
zhsun109az-hnqvv-master-1                      Running   Standard_D8s_v3   northcentralus          90m   zhsun109az-hnqvv-master-1                      azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsun109az-hnqvv-rg/providers/Microsoft.Compute/virtualMachines/zhsun109az-hnqvv-master-1                      Running
zhsun109az-hnqvv-master-2                      Running   Standard_D8s_v3   northcentralus          90m   zhsun109az-hnqvv-master-2                      azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsun109az-hnqvv-rg/providers/Microsoft.Compute/virtualMachines/zhsun109az-hnqvv-master-2                      Running
zhsun109az-hnqvv-worker-northcentralus-5q7v9   Running   Standard_D2s_v3   northcentralus          85m   zhsun109az-hnqvv-worker-northcentralus-5q7v9   azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsun109az-hnqvv-rg/providers/Microsoft.Compute/virtualMachines/zhsun109az-hnqvv-worker-northcentralus-5q7v9   Running
zhsun109az-hnqvv-worker-northcentralus-ghlmb   Running   Standard_D2s_v3   northcentralus          85m   zhsun109az-hnqvv-worker-northcentralus-ghlmb   azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsun109az-hnqvv-rg/providers/Microsoft.Compute/virtualMachines/zhsun109az-hnqvv-worker-northcentralus-ghlmb   Running
zhsun109az-hnqvv-worker-northcentralus-rwrv8   Failed    Standard_D2s_v3   northcentralus          85m   zhsun109az-hnqvv-worker-northcentralus-rwrv8   azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsun109az-hnqvv-rg/providers/Microsoft.Compute/virtualMachines/zhsun109az-hnqvv-worker-northcentralus-rwrv8   Unknown

  Phase:   Failed
  Provider Status:
    Conditions:
      Last Probe Time:       2020-10-09T02:09:15Z
      Last Transition Time:  2020-10-09T02:09:15Z
      Message:               machine successfully created
      Reason:                MachineCreationSucceeded
      Status:                True
      Type:                  MachineCreated
    Metadata:
    Vm Id:     /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsun109az-hnqvv-rg/providers/Microsoft.Compute/virtualMachines/zhsun109az-hnqvv-worker-northcentralus-rwrv8
    Vm State:  Unknown

verified on vsphere
$ oc get machine
NAME                             PHASE    TYPE   REGION   ZONE   AGE
zhsunvs2109-2h9b2-worker-qhps6   Failed                          12m

  Phase:   Failed
  Provider Status:
    Conditions:
      Last Probe Time:       2020-10-09T05:54:22Z
      Last Transition Time:  2020-10-09T05:54:22Z
      Message:               Machine successfully created
      Reason:                MachineCreationSucceeded
      Status:                True
      Type:                  MachineCreation
    Instance Id:     422b855c-8889-bb3c-f83b-2a3ff6029c3f
    Instance State:  Unknown
    Task Ref:        task-57827
> @Joel Speed want to know why osp is not same with aws, gcp and azure which have providerStatus field.

Good question, and not one I can really answer. I guess the OpenStack team didn't want to set a field that matches this pattern? I don't think there is one for baremetal either. It might just be that they've never felt the need for it.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633