Bug 1840552
Summary: | [aws]Machine status should be "Failed" with an invalid configuration | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | sunzhaohua <zhsun> | |
Component: | Cloud Compute | Assignee: | Danil Grigorev <dgrigore> | |
Cloud Compute sub component: | Other Providers | QA Contact: | sunzhaohua <zhsun> | |
Status: | CLOSED ERRATA | Docs Contact: | ||
Severity: | medium | |||
Priority: | medium | Keywords: | Regression | |
Version: | 4.5 | |||
Target Milestone: | --- | |||
Target Release: | 4.5.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | Bug Fix | ||
Doc Text: |
Cause: When updating the phase to failed, an update to the Machine annotations overwrote changes to the status
Consequence: Status updates were never persisted on a failed phase
Fix: Update the annotations before starting to update the phase
Result: Annotation and phase changes are persisted
|
Story Points: | --- | |
Clone Of: | ||||
: | 1840821 1840822 | Environment: | |
Last Closed: | 2020-07-13 17:41:57 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1840821, 1840822 |
Description
sunzhaohua
2020-05-27 08:06:41 UTC
@sunzhaohua Did this eventually go Failed, or was it stuck in Provisioning forever?

It appears that the code is following the expected paths, but the status update seems to have failed. You can see in the logs that we reached controller.go:424, where it is going into phase Failed, so the error is being returned and recognised correctly, yet for some reason setPhase is failing to actually update the API. I would have expected this to be a transient issue.

https://github.com/openshift/cluster-api-provider-aws/blob/5e266b553d8e7c5809d94653e2531167c865a762/vendor/github.com/openshift/machine-api-operator/pkg/controller/machine/controller.go#L422-L447

Could you also check the annotations on the Machine when this happened? There should also be "machine.openshift.io/instance-state": "unknown" in the annotations at this point.

@joel speed It was stuck in Provisioning. I tried several times and every attempt was stuck in Provisioning; the longest was 89m. The "machine.openshift.io/instance-state": "unknown" annotation was present on the machine.
My cluster is here, in case you need to debug (the machine is still in Provisioning): https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/95078/artifact/workdir/install-dir/auth/kubeconfig/*view*/

Verified
clusterversion: 4.5.0-0.nightly-2020-05-30-025738

Create a machine with an invalid AMI:

$ oc get machine
NAME                                         PHASE     TYPE        REGION      ZONE         AGE
zhsun601aws-5w9lj-master-0                   Running   m4.xlarge   us-east-2   us-east-2a   129m
zhsun601aws-5w9lj-master-1                   Running   m4.xlarge   us-east-2   us-east-2b   129m
zhsun601aws-5w9lj-master-2                   Running   m4.xlarge   us-east-2   us-east-2c   129m
zhsun601aws-5w9lj-worker-us-east-2a-xfchg    Running   m4.large    us-east-2   us-east-2a   119m
zhsun601aws-5w9lj-worker-us-east-2b-b8x9c    Running   m4.large    us-east-2   us-east-2b   119m
zhsun601aws-5w9lj-worker-us-east-2c-ph88r    Failed                                         98s
zhsun601aws-5w9lj-worker-us-east-2c-vv8wv    Running   m4.large    us-east-2   us-east-2c   93m

I0601 03:31:14.031906 1 controller.go:169] zhsun601aws-5w9lj-worker-us-east-2c-ph88r: reconciling Machine
W0601 03:31:14.031921 1 controller.go:266] zhsun601aws-5w9lj-worker-us-east-2c-ph88r: machine has gone "Failed" phase. It won't reconcile

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
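
For readers following the Cause and Fix recorded in the Doc Text above, here is a minimal Go sketch of how an annotation update can discard an unsaved phase change, and why doing the annotation update first avoids that. It is a toy model and not the machine-api-operator code: Machine, fakeAPI, UpdateAnnotations, PatchStatus, failBuggy and failFixed are all hypothetical names, and the assumption that the annotation write refreshes the local object from the stored copy is only one plausible way such an overwrite can occur.

```go
package main

import "fmt"

// instanceStateKey is the annotation mentioned in the comments above.
const instanceStateKey = "machine.openshift.io/instance-state"

// Machine is a hypothetical, heavily simplified stand-in for the real resource.
type Machine struct {
	Annotations map[string]string
	Phase       string
}

// fakeAPI is a toy in-memory API server holding a single Machine.
type fakeAPI struct {
	stored Machine
}

// copyMachine returns a deep copy so local and stored objects never share a map.
func copyMachine(m Machine) Machine {
	out := Machine{Phase: m.Phase, Annotations: map[string]string{}}
	for k, v := range m.Annotations {
		out.Annotations[k] = v
	}
	return out
}

// UpdateAnnotations persists only the annotations, then refreshes the caller's
// object from the stored copy (assumption). The refresh discards any local
// phase change that has not been persisted yet.
func (a *fakeAPI) UpdateAnnotations(m *Machine) {
	a.stored.Annotations = copyMachine(*m).Annotations
	*m = copyMachine(a.stored)
}

// PatchStatus persists whatever phase the caller's object currently holds.
func (a *fakeAPI) PatchStatus(m *Machine) {
	a.stored.Phase = m.Phase
}

// newProvisioning returns a fresh toy server and a local copy of its Machine.
func newProvisioning() (*fakeAPI, *Machine) {
	api := &fakeAPI{stored: Machine{Annotations: map[string]string{}, Phase: "Provisioning"}}
	local := copyMachine(api.stored)
	return api, &local
}

// failBuggy changes the phase locally first; the annotation update then
// resets the local phase, so the status patch stores "Provisioning" again.
func failBuggy(api *fakeAPI, m *Machine) {
	m.Phase = "Failed"
	m.Annotations[instanceStateKey] = "unknown"
	api.UpdateAnnotations(m)
	api.PatchStatus(m)
}

// failFixed updates the annotations before touching the phase, so the phase
// change is still present when the status patch is sent.
func failFixed(api *fakeAPI, m *Machine) {
	m.Annotations[instanceStateKey] = "unknown"
	api.UpdateAnnotations(m)
	m.Phase = "Failed"
	api.PatchStatus(m)
}

func main() {
	api, m := newProvisioning()
	failBuggy(api, m)
	fmt.Printf("buggy order: phase=%s annotation=%s\n", api.stored.Phase, api.stored.Annotations[instanceStateKey])

	api, m = newProvisioning()
	failFixed(api, m)
	fmt.Printf("fixed order: phase=%s annotation=%s\n", api.stored.Phase, api.stored.Annotations[instanceStateKey])
}
```

In this model the buggy ordering stores the "unknown" annotation while the phase stays Provisioning, which matches the symptom reported above; the fixed ordering stores both the annotation and the Failed phase.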