Bug 1875598
| Summary: | machine status is Running for a master node which has been terminated from the console | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Roshni <rpattath> |
| Component: | Cloud Compute | Assignee: | Joel Speed <jspeed> |
| Cloud Compute sub component: | BareMetal Provider | QA Contact: | sunzhaohua <zhsun> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | low | | |
| Priority: | low | CC: | agarcial, asegurap, dhellmann, jspeed, stbenjam |
| Version: | 4.6 | Keywords: | NeedsTestCase, Triaged |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: Once a Machine entered a Failed state, the cloud provider state was no longer reconciled.<br>Consequence: The Machine status would report the cloud VM as running even after the VM had been removed.<br>Fix: Set the VM state to Unknown if the Machine ends up in a Failed state.<br>Result: The status more accurately reflects the observed state of the world. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-02-24 15:17:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
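As context for the Doc Text above: the fix amounts to having the machine controller report the provider VM state as Unknown once a Machine enters the Failed phase, instead of leaving the last observed value (for example Running) in place. A minimal sketch of that behaviour is below, using simplified stand-in types; this is not the actual machine-api-operator code.

```go
// Minimal sketch of the behaviour described in the Doc Text, using simplified
// stand-in types; this is not the actual machine-api-operator implementation.
package main

import "fmt"

const (
	phaseFailed          = "Failed"
	instanceStateUnknown = "Unknown"
)

// machine is a stripped-down stand-in for the Machine API object.
type machine struct {
	phase         string
	instanceState string // mirrors .status.providerStatus.instanceState
}

// reconcileProviderState marks the VM state as Unknown once the Machine has
// failed, because the actuator is no longer called for a failed Machine and
// the cloud state is no longer reconciled.
func reconcileProviderState(m *machine) {
	if m.phase == phaseFailed {
		m.instanceState = instanceStateUnknown
	}
}

func main() {
	m := &machine{phase: phaseFailed, instanceState: "running"}
	reconcileProviderState(m)
	fmt.Println(m.instanceState) // prints "Unknown"
}
```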
Could you provide a must-gather from the cluster at all? Or the logs from the machine-controller in the machine-api-controllers pod from the openshift-machine-api namespace, and the YAML representation of the stopped master Machine, please.

I have a feeling that this isn't limited to master machines, nor is it limited to GCP. If I remember correctly, we always just set the main Machine phase to Failed rather than updating the provider status when this happens.

I've confirmed that this is also reproducible by deleting a worker machine. To fix this, we would want to add some way to sync the provider status when the machine doesn't exist. I think this would mean changing the actuator interface to introduce a new method; if that is the case, this would be a major change and I would recommend we defer fixing it until 4.7. We may be able to just call update in this case and have the providers clear their status, but we will need to check that each provider is able to handle this gracefully.

Machines can go Failed either when failing on creation because of invalid config or because the instance was deleted out of band. For both cases I think setting the provider status to unknown is acceptable. This should be happening already (https://github.com/openshift/machine-api-operator/blob/master/pkg/controller/machine/controller.go#L453), so if it's not, that's a code bug. See the original PR: https://github.com/openshift/machine-api-operator/pull/575

@Alberto, this bug is referring to .status.providerStatus.instanceState, which is currently set by each actuator to `Running` once the instance comes up. Because we don't call the actuator at all after we determine the machine has failed, nothing updates the providerStatus, so it ends up out of sync and still says Running. I believe the code that you've linked is working as expected in this case, based on my testing.

Waiting on tests to pass for this; we seem to have some flakiness. It should merge today or early next week, hopefully.

Failed to verify. Tested on AWS and found different results: zhsun924aws-8gmkk-worker-us-east-2a-ck968 shows "instanceState: running" and zhsun924aws-8gmkk-worker-us-east-2c-lgsqf shows "instanceState: shutting-down"; neither is Unknown.
clusterversion: 4.6.0-0.nightly-2020-09-24-074159
Steps:
1. Terminate instances from the AWS web console
2. Check the machine instanceState/vmState
$ oc get machine -o wide
NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE
zhsun924aws-8gmkk-master-0 Running m5.xlarge us-east-2 us-east-2a 29h ip-10-0-133-192.us-east-2.compute.internal aws:///us-east-2a/i-0e9e1ccf5522a8e58 running
zhsun924aws-8gmkk-master-1 Running m5.xlarge us-east-2 us-east-2b 29h ip-10-0-176-241.us-east-2.compute.internal aws:///us-east-2b/i-006dfa039aaf08211 running
zhsun924aws-8gmkk-master-2 Running m5.xlarge us-east-2 us-east-2c 29h ip-10-0-213-91.us-east-2.compute.internal aws:///us-east-2c/i-0971ca92b91c7cfed running
zhsun924aws-8gmkk-worker-us-east-2a-ck968 Failed m5.large us-east-2 us-east-2a 29h ip-10-0-152-114.us-east-2.compute.internal aws:///us-east-2a/i-09d7a4c72fb8ffb43 Unknown
zhsun924aws-8gmkk-worker-us-east-2b-sq2jt Running m5.large us-east-2 us-east-2b 29h ip-10-0-177-121.us-east-2.compute.internal aws:///us-east-2b/i-01e1ce14de47a8c81 running
zhsun924aws-8gmkk-worker-us-east-2c-lgsqf Failed m5.large us-east-2 us-east-2c 29h ip-10-0-220-249.us-east-2.compute.internal aws:///us-east-2c/i-0c527cfa402f6f067 Unknown
$ oc get machine zhsun924aws-8gmkk-worker-us-east-2a-ck968 -o yaml
status:
addresses:
- address: 10.0.152.114
type: InternalIP
- address: ip-10-0-152-114.us-east-2.compute.internal
type: InternalDNS
- address: ip-10-0-152-114.us-east-2.compute.internal
type: Hostname
errorMessage: Can't find created instance.
lastUpdated: "2020-09-25T14:31:34Z"
nodeRef:
kind: Node
name: ip-10-0-152-114.us-east-2.compute.internal
uid: c591b5fd-3576-4d6c-a490-15aee99b1ca7
phase: Failed
providerStatus:
conditions:
- lastProbeTime: "2020-09-24T10:37:46Z"
lastTransitionTime: "2020-09-24T10:37:46Z"
message: Machine successfully created
reason: MachineCreationSucceeded
status: "True"
type: MachineCreation
instanceId: i-09d7a4c72fb8ffb43
instanceState: running
$ oc get machine zhsun924aws-8gmkk-worker-us-east-2c-lgsqf -o yaml
errorMessage: Can't find created instance.
lastUpdated: "2020-09-25T15:45:43Z"
nodeRef:
kind: Node
name: ip-10-0-220-249.us-east-2.compute.internal
uid: f5451452-8bca-4629-8143-7be48ff2a4b7
phase: Failed
providerStatus:
conditions:
- lastProbeTime: "2020-09-24T10:37:49Z"
lastTransitionTime: "2020-09-24T10:37:49Z"
message: Machine successfully created
reason: MachineCreationSucceeded
status: "True"
type: MachineCreation
instanceId: i-0c527cfa402f6f067
instanceState: shutting-down
Moving to baremetal and bumping to 4.7 since it's the only PR that remains without merging.

Baremetal has now merged as part of https://github.com/openshift/cluster-api-provider-baremetal/pull/118. This is ready for QE.

Moving back to cloud as this was primarily our effort.

I have tested on AWS, GCP, Azure and vSphere; the instanceState is being updated as expected. But on OSP the machine doesn't have a providerStatus field, so I couldn't check it there. Moving this to verified. @Joel Speed, I'd like to know why OSP is not the same as AWS, GCP and Azure, which do have the providerStatus field.
verified on aws
clusterversion: 4.6.0-0.nightly-2020-10-08-210814
$ oc get machine -o wide
NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE
zhsun109aws-tsn5h-master-0 Running m5.xlarge us-east-2 us-east-2a 71m ip-10-0-142-174.us-east-2.compute.internal aws:///us-east-2a/i-060c32bc8833bed44 running
zhsun109aws-tsn5h-master-1 Running m5.xlarge us-east-2 us-east-2b 71m ip-10-0-190-89.us-east-2.compute.internal aws:///us-east-2b/i-0de030c3f1c87fb52 running
zhsun109aws-tsn5h-master-2 Running m5.xlarge us-east-2 us-east-2c 71m ip-10-0-206-22.us-east-2.compute.internal aws:///us-east-2c/i-07ac5dcd8c4be3520 running
zhsun109aws-tsn5h-worker-us-east-2a-qxnvk Running m5.large us-east-2 us-east-2a 60m ip-10-0-130-174.us-east-2.compute.internal aws:///us-east-2a/i-0cef294deb407c9fc running
zhsun109aws-tsn5h-worker-us-east-2b-sgstt Running m5.large us-east-2 us-east-2b 60m ip-10-0-183-78.us-east-2.compute.internal aws:///us-east-2b/i-0820fef5ce6c7fd93 running
zhsun109aws-tsn5h-worker-us-east-2c-frj7w Failed m5.large us-east-2 us-east-2c 60m ip-10-0-206-48.us-east-2.compute.internal aws:///us-east-2c/i-05230e25ee2e8e854 Unknown
Status:
Error Message: Can't find created instance.
Last Updated: 2020-10-09T03:01:10Z
Node Ref:
Kind: Node
Name: ip-10-0-206-48.us-east-2.compute.internal
UID: ba7a87d7-e68a-4e6e-a8a4-80bc21ab1a41
Phase: Failed
Provider Status:
Conditions:
Last Probe Time: 2020-10-09T02:03:28Z
Last Transition Time: 2020-10-09T02:03:28Z
Message: Machine successfully created
Reason: MachineCreationSucceeded
Status: True
Type: MachineCreation
Instance Id: i-05230e25ee2e8e854
Instance State: Unknown
verified on gcp
$ oc get machine -o wide
NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE
zhsun109gcp-m2fw5-master-0 Running n1-standard-4 us-central1 us-central1-a 99m zhsun109gcp-m2fw5-master-0.c.openshift-qe.internal gce://openshift-qe/us-central1-a/zhsun109gcp-m2fw5-master-0 RUNNING
zhsun109gcp-m2fw5-master-1 Running n1-standard-4 us-central1 us-central1-b 99m zhsun109gcp-m2fw5-master-1.c.openshift-qe.internal gce://openshift-qe/us-central1-b/zhsun109gcp-m2fw5-master-1 RUNNING
zhsun109gcp-m2fw5-master-2 Running n1-standard-4 us-central1 us-central1-c 99m zhsun109gcp-m2fw5-master-2.c.openshift-qe.internal gce://openshift-qe/us-central1-c/zhsun109gcp-m2fw5-master-2 RUNNING
zhsun109gcp-m2fw5-worker-a-dqj26 Running n1-standard-4 us-central1 us-central1-a 92m zhsun109gcp-m2fw5-worker-a-dqj26.c.openshift-qe.internal gce://openshift-qe/us-central1-a/zhsun109gcp-m2fw5-worker-a-dqj26 RUNNING
zhsun109gcp-m2fw5-worker-b-tlsbk Running n1-standard-4 us-central1 us-central1-b 92m zhsun109gcp-m2fw5-worker-b-tlsbk.c.openshift-qe.internal gce://openshift-qe/us-central1-b/zhsun109gcp-m2fw5-worker-b-tlsbk RUNNING
zhsun109gcp-m2fw5-worker-c-dwvsb Failed n1-standard-4 us-central1 us-central1-c 92m zhsun109gcp-m2fw5-worker-c-dwvsb.c.openshift-qe.internal gce://openshift-qe/us-central1-c/zhsun109gcp-m2fw5-worker-c-dwvsb Unknown
Phase: Failed
Provider Status:
Conditions:
Last Probe Time: 2020-10-09T02:00:40Z
Last Transition Time: 2020-10-09T02:00:40Z
Message: machine successfully created
Reason: MachineCreationSucceeded
Status: True
Type: MachineCreated
Instance Id: zhsun109gcp-m2fw5-worker-c-dwvsb
Instance State: Unknown
verified on azure
$ oc get machine -o wide
NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE
zhsun109az-hnqvv-master-0 Running Standard_D8s_v3 northcentralus 90m zhsun109az-hnqvv-master-0 azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsun109az-hnqvv-rg/providers/Microsoft.Compute/virtualMachines/zhsun109az-hnqvv-master-0 Running
zhsun109az-hnqvv-master-1 Running Standard_D8s_v3 northcentralus 90m zhsun109az-hnqvv-master-1 azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsun109az-hnqvv-rg/providers/Microsoft.Compute/virtualMachines/zhsun109az-hnqvv-master-1 Running
zhsun109az-hnqvv-master-2 Running Standard_D8s_v3 northcentralus 90m zhsun109az-hnqvv-master-2 azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsun109az-hnqvv-rg/providers/Microsoft.Compute/virtualMachines/zhsun109az-hnqvv-master-2 Running
zhsun109az-hnqvv-worker-northcentralus-5q7v9 Running Standard_D2s_v3 northcentralus 85m zhsun109az-hnqvv-worker-northcentralus-5q7v9 azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsun109az-hnqvv-rg/providers/Microsoft.Compute/virtualMachines/zhsun109az-hnqvv-worker-northcentralus-5q7v9 Running
zhsun109az-hnqvv-worker-northcentralus-ghlmb Running Standard_D2s_v3 northcentralus 85m zhsun109az-hnqvv-worker-northcentralus-ghlmb azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsun109az-hnqvv-rg/providers/Microsoft.Compute/virtualMachines/zhsun109az-hnqvv-worker-northcentralus-ghlmb Running
zhsun109az-hnqvv-worker-northcentralus-rwrv8 Failed Standard_D2s_v3 northcentralus 85m zhsun109az-hnqvv-worker-northcentralus-rwrv8 azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsun109az-hnqvv-rg/providers/Microsoft.Compute/virtualMachines/zhsun109az-hnqvv-worker-northcentralus-rwrv8 Unknown
Phase: Failed
Provider Status:
Conditions:
Last Probe Time: 2020-10-09T02:09:15Z
Last Transition Time: 2020-10-09T02:09:15Z
Message: machine successfully created
Reason: MachineCreationSucceeded
Status: True
Type: MachineCreated
Metadata:
Vm Id: /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/zhsun109az-hnqvv-rg/providers/Microsoft.Compute/virtualMachines/zhsun109az-hnqvv-worker-northcentralus-rwrv8
Vm State: Unknown
verified on vsphere
$ oc get machine
NAME PHASE TYPE REGION ZONE AGE
zhsunvs2109-2h9b2-worker-qhps6 Failed 12m
Phase: Failed
Provider Status:
Conditions:
Last Probe Time: 2020-10-09T05:54:22Z
Last Transition Time: 2020-10-09T05:54:22Z
Message: Machine successfully created
Reason: MachineCreationSucceeded
Status: True
Type: MachineCreation
Instance Id: 422b855c-8889-bb3c-f83b-2a3ff6029c3f
Instance State: Unknown
Task Ref: task-57827
> @Joel Speed, I'd like to know why OSP is not the same as AWS, GCP and Azure, which do have the providerStatus field.

Good question, but not one I can really answer. I guess the OpenStack team didn't want to set a field that matches this pattern? I don't think there is one for baremetal either. It might just be that they've never felt the need for it.
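For readers comparing the verification output above: each actuator defines its own providerStatus schema, which is why the VM state field is named instanceState on AWS/GCP, vmState on Azure, and is absent on OpenStack and baremetal. The sketch below illustrates that shape difference with simplified, hypothetical types whose field names mirror the YAML shown earlier; it does not reproduce the actual provider API definitions.

```go
// Simplified, hypothetical stand-ins for provider-specific status types;
// the JSON field names mirror the YAML in the verification output above.
package main

import (
	"encoding/json"
	"fmt"
)

// AWS/GCP-style provider status: reports instanceId/instanceState.
type instanceProviderStatus struct {
	InstanceID    string `json:"instanceId,omitempty"`
	InstanceState string `json:"instanceState,omitempty"`
}

// Azure-style provider status: reports vmId/vmState instead.
type vmProviderStatus struct {
	VMID    string `json:"vmId,omitempty"`
	VMState string `json:"vmState,omitempty"`
}

func main() {
	aws := instanceProviderStatus{InstanceID: "i-05230e25ee2e8e854", InstanceState: "Unknown"}
	azure := vmProviderStatus{VMState: "Unknown"}

	a, _ := json.Marshal(aws)
	b, _ := json.Marshal(azure)
	fmt.Println(string(a)) // {"instanceId":"i-05230e25ee2e8e854","instanceState":"Unknown"}
	fmt.Println(string(b)) // {"vmState":"Unknown"}
}
```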
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633 |
Description of problem:
machine status is Running for a master node which has been terminated from the console

Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-03-084733   True        False         5h38m   Cluster version is 4.6.0-0.nightly-2020-09-03-084733

How reproducible:

Steps to Reproduce:
1. Shutdown a cluster.
2. Terminate a master node.
3. Restart the cluster

Actual results:
$ oc get machines -n openshift-machine-api
NAME                                         PHASE     TYPE            REGION        ZONE            AGE
rpattath-46-artifacts-xhgjs-master-0         Running   n1-standard-4   us-central1   us-central1-a   4h50m
rpattath-46-artifacts-xhgjs-master-1         Running   n1-standard-4   us-central1   us-central1-b   4h50m
rpattath-46-artifacts-xhgjs-master-2         Failed    n1-standard-4   us-central1   us-central1-c   4h50m
rpattath-46-artifacts-xhgjs-worker-a-m4tpk   Running   n1-standard-4   us-central1   us-central1-a   4h40m
rpattath-46-artifacts-xhgjs-worker-b-2qtfh   Running   n1-standard-4   us-central1   us-central1-b   4h40m
rpattath-46-artifacts-xhgjs-worker-c-7wtxq   Running   n1-standard-4   us-central1   us-central1-c   4h40m

[rpattath_redhat_com@rpattath-vpc-bastion1 ~]$ oc get machines -A -ojsonpath='{range .items[*]}{@.status.nodeRef.name}{"\t"}{@.status.providerStatus.instanceState}{"\n"}' | grep -v running
rpattath-46-artifacts-xhgjs-master-0.c.openshift-qe.internal   RUNNING
rpattath-46-artifacts-xhgjs-master-1.c.openshift-qe.internal   RUNNING
rpattath-46-artifacts-xhgjs-master-2.c.openshift-qe.internal   RUNNING
rpattath-46-artifacts-xhgjs-worker-a-m4tpk   RUNNING
rpattath-46-artifacts-xhgjs-worker-b-2qtfh   RUNNING
rpattath-46-artifacts-xhgjs-worker-c-7wtxq   RUNNING

Expected results:
State of the terminated node should not be running.

Additional info:
This was seen on gcp.