Bug 1761882
| Summary: | AWS MachineWithNoRunningPhase firing, phase Failed, instanceState pending |
|---|---|
| Product: | OpenShift Container Platform |
| Reporter: | Gabe Montero <gmontero> |
| Component: | Cloud Compute |
| Assignee: | Alberto <agarcial> |
| Cloud Compute sub component: | Other Providers |
| QA Contact: | Jianwei Hou <jhou> |
| Status: | CLOSED ERRATA |
| Docs Contact: | |
| Severity: | unspecified |
| Priority: | unspecified |
| CC: | alegrand, anpicker, aos-bugs, erooth, jhou, jokerman, kakkoyun, lcosic, mbukatov, mloibl, pkrupa, rteague, skordas, spasquie, surbania, vlaad, wking, xtian |
| Version: | 4.3.0 |
| Target Milestone: | --- |
| Target Release: | 4.4.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Whiteboard: | |
| Fixed In Version: | |
| Doc Type: | No Doc Update |
| Doc Text: | |
| Story Points: | --- |
| Clone Of: | |
| | 1775874 (view as bug list) |
| Environment: | |
| Last Closed: | 2020-05-15 14:41:28 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |
| Bug Depends On: | |
| Bug Blocks: | 1775874 |
Description
Gabe Montero
2019-10-15 13:50:10 UTC
Possibly related to bug 1759659, which mentions MachineWithNoRunningPhase on Azure, although that's ON_QA and this is about AWS. There's another AWS instance in [1] showing:

ALERTS{alertname="MachineWithNoRunningPhase",alertstate="firing",api_version="machine.openshift.io/v1beta1",endpoint="https",exported_namespace="openshift-machine-api",instance="10.130.0.19:8443",job="machine-api-operator",name="ci-op-gjbm73c3-0fbd5-jmbtt-worker-us-east-1b-gj49c",namespace="openshift-machine-api",node="ip-10-0-157-49.ec2.internal",phase="Failed",pod="machine-api-operator-6d46668589-t4kkw",service="machine-api-operator",severity="critical",spec_provider_id="aws:///us-east-1b/i-08abc1b8e67a01724"}

starting at 1:15:55Z.

And lots of other jobs are failing the Prometheus alert limit [2], which might be this alert, or might be unrelated alerts. It's unfortunate that the violating alerts don't seem to make it into the build log; maybe the test logic can be improved to make that happen?

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/138
[2]: https://ci-search-ci-search-next.svc.ci.openshift.org/chart?search=failed:.*Prometheus+when+installed+on+the+cluster+should+report+less+than+two+alerts+in+firing+or+pending+state&maxAge=336h&context=2&type=all

And in [1]:
ALERTS{alertname="MachineWithNoRunningPhase",alertstate="firing",api_version="machine.openshift.io/v1beta1",endpoint="https",exported_namespace="openshift-machine-api",instance="10.130.0.16:8443",job="machine-api-operator",name="ci-op-b79ds2yk-0fbd5-57dlv-worker-us-east-1b-5mc9q",namespace="openshift-machine-api",node="ip-10-0-145-56.ec2.internal",phase="Failed",pod="machine-api-operator-5655ff9484-2zcfk",service="machine-api-operator",severity="critical",spec_provider_id="aws:///us-east-1b/i-0ccde84aec2292d56"}
Drilling down to the failed machine:
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.3/936/artifacts/e2e-aws/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-10-18-090821-sha256-dae1257b516a5c177237cfef5a6a3e241962b0d20cf54bcb2b66dc1671c5035e/namespaces/openshift-machine-api/machine.openshift.io/machines/ci-op-b79ds2yk-0fbd5-57dlv-worker-us-east-1b-5mc9q.yaml | yaml2json | jq '.status | {errorMessage, phase, providerStatus}'
{
"errorMessage": "Can't find created instance.",
"phase": "Failed",
"providerStatus": {
"apiVersion": "awsproviderconfig.openshift.io/v1beta1",
"instanceId": "i-0ccde84aec2292d56",
"instanceState": "pending",
"kind": "AWSMachineProviderStatus",
"conditions": [
{
"lastProbeTime": "2019-10-18T09:29:07Z",
"status": "True",
"message": "machine successfully created",
"reason": "MachineCreationSucceeded",
"lastTransitionTime": "2019-10-18T09:29:07Z",
"type": "MachineCreation"
}
]
}
}
But the ClusterOperator is happy anyway:
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.3/936/artifacts/e2e-aws/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-10-18-090821-sha256-dae1257b516a5c177237cfef5a6a3e241962b0d20cf54bcb2b66dc1671c5035e/cluster-scoped-resources/config.openshift.io/clusteroperators/machine-api.yaml | yaml2json | jq -r '.status.conditions[] | .type + " " + .status + ": " + .message'
Progressing False:
Available True: Cluster Machine API Operator is available at operator: 4.3.0-0.ci-2019-10-18-090821
Degraded False:
Upgradeable True:
And so is the MachineSet:
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.3/936/artifacts/e2e-aws/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-10-18-090821-sha256-dae1257b516a5c177237cfef5a6a3e241962b0d20cf54bcb2b66dc1671c5035e/namespaces/openshift-machine-api/machine.openshift.io/machinesets/ci-op-b79ds2yk-0fbd5-57dlv-worker-us-east-1b.yaml | yaml2json | jq .status
{
"replicas": "1",
"readyReplicas": "1",
"availableReplicas": "1",
"observedGeneration": "1",
"fullyLabeledReplicas": "1"
}
It would be nice if we bubbled "waiting a really long time for AWS to get out of pending" up to something more discoverable, like the ClusterOperator and/or at least the MachineSet. (Maybe it does show up on the MachineSet sometimes, and must-gather just raced: it captured the broken Machine, the Machine then resolved, the MachineSet went happy, and must-gather captured the happy MachineSet.)
[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.3/936
I've seen this as well: the actuator Exists() wrongly reports false, but the machine already got a providerID/address, so it goes Failed.

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.3/936/artifacts/e2e-aws/pods/openshift-machine-api_machine-api-controllers-69f56b54d-hqfgq_machine-controller.log

I adjusted the summary to focus on the AWS pending-instance aspect of this. Between that and the Azure work in bug 1759659, we hopefully take a large bite out of the MachineWithNoRunningPhase alert. We can revisit and see if there are other MachineWithNoRunningPhase triggers, or other alerts causing the "less than two alerts" symptoms, once we get these resolved.

> ...the actuator Exists() wrongly reports false...

The documented Exists API is "...For AWS we query for instances in running state..." [1], and this pending instance doesn't qualify. Exists(...) -> (bool, error) is a pretty coarse check. Is there a generic machine-state model we can lean on to expose more detail? I guess [2]? There's also a pending comment down in updateStatus [3], but I'm not clear on what that's about yet.

[1]: https://github.com/openshift/cluster-api-provider-aws/blob/151ec6cd21b9739206fd16f2ddf9f036318d7917/pkg/actuators/machine/actuator.go#L487
[2]: https://github.com/openshift/enhancements/pull/11
[3]: https://github.com/openshift/cluster-api-provider-aws/blob/151ec6cd21b9739206fd16f2ddf9f036318d7917/pkg/actuators/machine/actuator.go#L644-L651

> The documented Exists API is "...For AWS we query for instances in running state..." [1], and this pending instance doesn't qualify.
That comment is just a bit confusing I think -- "running state" here actually means anything that's not terminated.
I think this is what's happening here:
- We call the actuator's Create() method.
- An instance is created and we update the provider ID field in spec, and update the Machine's status.
- This causes us to immediately reconcile the Machine again.
- In Exists(), we're looking for the instance by filtering based on tags, not by instance ID.
- If the instance has been created, but the tags haven't been set, Exists() returns false and we're in an error state.
I think that has always been the case, but it previously resulted in "leaked" AWS instances. Now it results in a Machine in a permanent Failed state due to the recent changes introducing a more consistent set of states.
The linked PR will first try to search by instance ID in Exists() and fall back to the old behavior when that's not possible.
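For illustration, here is a minimal, self-contained sketch of that "instance ID first, tag filters as fallback" lookup, assuming the aws-sdk-go v1 EC2 client. The function name, parameters, and tag keys below are illustrative, not the provider's actual code:

```go
// A sketch of "search by instance ID in Exists(), fall back to tag filters",
// assuming the aws-sdk-go v1 EC2 API. Not the provider's real implementation.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// instanceExists reports whether a non-terminated instance backs the machine.
// If an instance ID is already known from Create(), query by ID directly; tag
// filters on a freshly created instance can lag behind due to EC2's eventual
// consistency, which is exactly the race described above.
func instanceExists(client *ec2.EC2, instanceID, clusterID, machineName string) (bool, error) {
	input := &ec2.DescribeInstancesInput{}
	if instanceID != "" {
		// Preferred path: we already know which instance we created.
		// (A real implementation would also tolerate an InvalidInstanceID.NotFound
		// error here, since lookups by ID can themselves be eventually consistent.)
		input.InstanceIds = []*string{aws.String(instanceID)}
	} else {
		// Fallback: the old behavior of filtering by the tags stamped at creation.
		// The tag keys here are illustrative.
		input.Filters = []*ec2.Filter{
			{Name: aws.String("tag:Name"), Values: []*string{aws.String(machineName)}},
			{Name: aws.String("tag:kubernetes.io/cluster/" + clusterID), Values: []*string{aws.String("owned")}},
		}
	}

	out, err := client.DescribeInstances(input)
	if err != nil {
		return false, fmt.Errorf("describing instances: %w", err)
	}
	for _, reservation := range out.Reservations {
		for _, instance := range reservation.Instances {
			if instance.State == nil {
				continue
			}
			// "Running state" in the comment quoted earlier really means
			// "anything that is not terminated", so pending counts as existing.
			if aws.StringValue(instance.State.Name) != ec2.InstanceStateNameTerminated {
				return true, nil
			}
		}
	}
	return false, nil
}

func main() {
	sess := session.Must(session.NewSession())
	exists, err := instanceExists(ec2.New(sess), "i-0ccde84aec2292d56", "example-cluster", "example-machine")
	fmt.Println(exists, err)
}
```

Querying by the ID recorded at creation sidesteps the eventually consistent tag filters, so a freshly created, still-pending instance is no longer reported as missing.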
Is this test suitable to verify the issue?

openshift-tests run-test "[Feature:Prometheus][Conformance] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog [Suite:openshift/conformance/parallel/minimal]"

*** Bug 1772163 has been marked as a duplicate of this bug. ***

Moving this back to ASSIGNED. As discussed in bug 1772163, we're still seeing this behavior in CI.

Still seeing "Can't find created instance." in CI jobs:
status:
addresses:
- address: 10.0.135.20
type: InternalIP
- address: ''
type: ExternalDNS
- address: ip-10-0-135-20.ec2.internal
type: InternalDNS
errorMessage: Can't find created instance.
lastUpdated: '2019-11-22T17:39:34Z'
phase: Failed
providerStatus:
apiVersion: awsproviderconfig.openshift.io/v1beta1
conditions:
- lastProbeTime: '2019-11-22T17:39:34Z'
lastTransitionTime: '2019-11-22T17:39:34Z'
message: machine successfully created
reason: MachineCreationSucceeded
status: 'True'
type: MachineCreation
instanceId: i-08b09536e36303450
instanceState: pending
kind: AWSMachineProviderStatus
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2712/pull-ci-openshift-installer-master-e2e-aws-scaleup-rhel7/3059
> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2712/pull-ci-openshift-installer-master-e2e-aws-scaleup-rhel7/3059

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2712/pull-ci-openshift-installer-master-e2e-aws-scaleup-rhel7/3059/artifacts/release-latest/release-payload-latest/image-references | jq -r '.spec.tags[] | select(.name == "aws-machine-controllers").annotations["io.openshift.build.commit.id"]'
ce4e6b67555844fcfbf096de679552cfce8cbec7

which is the current master tip [1], and includes the eventual-consistency requeue [2]. The troubled machine is ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz [3]. Controller logs:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2712/pull-ci-openshift-installer-master-e2e-aws-scaleup-rhel7/3059/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-nr00dj4c-stable-sha256-b6dffb34bfe5c198d7e9ad032322277c2a129e45a4e7f2084a83ab560530f683/namespaces/openshift-machine-api/pods/machine-api-controllers-9bd7457fc-tfdmg/machine-controller/machine-controller/logs/current.log | grep ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz
2019-11-22T17:39:31.26930496Z I1122 17:39:31.269237 1 controller.go:164] Reconciling Machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz"
2019-11-22T17:39:31.269341783Z I1122 17:39:31.269277 1 controller.go:376] Machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
2019-11-22T17:39:31.27794315Z I1122 17:39:31.277897 1 controller.go:164] Reconciling Machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz"
2019-11-22T17:39:31.27794315Z I1122 17:39:31.277926 1 controller.go:376] Machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
2019-11-22T17:39:31.277977399Z I1122 17:39:31.277942 1 actuator.go:493] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: Checking if machine exists
2019-11-22T17:39:32.431695199Z I1122 17:39:32.431647 1 actuator.go:501] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: Instance does not exist
2019-11-22T17:39:32.431695199Z I1122 17:39:32.431675 1 controller.go:428] Machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz" going into phase "Provisioning"
2019-11-22T17:39:32.439999853Z I1122 17:39:32.439939 1 controller.go:319] Reconciling machine object ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz triggers idempotent create.
2019-11-22T17:39:32.439999853Z I1122 17:39:32.439963 1 actuator.go:109] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: creating machine
2019-11-22T17:39:32.4402235Z E1122 17:39:32.440190 1 utils.go:257] NodeRef not found in machine ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz
2019-11-22T17:39:32.47213483Z I1122 17:39:32.472090 1 instances.go:47] No stopped instances found for machine ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz
2019-11-22T17:39:34.642709291Z I1122 17:39:34.642651 1 actuator.go:194] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: ProviderID updated at machine spec: aws:///us-east-1a/i-08b09536e36303450
2019-11-22T17:39:34.651818357Z I1122 17:39:34.651773 1 actuator.go:606] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: Updating status
2019-11-22T17:39:34.651818357Z I1122 17:39:34.651808 1 actuator.go:650] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: finished calculating AWS status
2019-11-22T17:39:34.651899512Z I1122 17:39:34.651872 1 actuator.go:226] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: machine status has changed, updating
2019-11-22T17:39:34.660245157Z I1122 17:39:34.660202 1 actuator.go:668] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: Instance state still pending, returning an error to requeue
2019-11-22T17:39:34.660327796Z W1122 17:39:34.660285 1 controller.go:321] Failed to create machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz": requeue in: 20s
2019-11-22T17:39:34.660426954Z I1122 17:39:34.660380 1 controller.go:164] Reconciling Machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz"
2019-11-22T17:39:34.660426954Z I1122 17:39:34.660402 1 controller.go:376] Machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
2019-11-22T17:39:34.660426954Z I1122 17:39:34.660416 1 actuator.go:493] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: Checking if machine exists
2019-11-22T17:39:34.732221813Z I1122 17:39:34.732177 1 actuator.go:501] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: Instance does not exist
2019-11-22T17:39:34.732221813Z I1122 17:39:34.732202 1 controller.go:428] Machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz" going into phase "Failed"
...

[1]: https://github.com/openshift/cluster-api-provider-aws/commits
[2]: https://github.com/openshift/cluster-api-provider-aws/pull/271
[3]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2712/pull-ci-openshift-installer-master-e2e-aws-scaleup-rhel7/3059/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-nr00dj4c-stable-sha256-b6dffb34bfe5c198d7e9ad032322277c2a129e45a4e7f2084a83ab560530f683/namespaces/openshift-machine-api/machine.openshift.io/machines/ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz.yaml
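The "Instance state still pending, returning an error to requeue" and "requeue in: 20s" lines above are the eventual-consistency requeue from [2]. A rough, self-contained sketch of that pattern follows; the error type, function names, and interval are illustrative, not the actual machine-api code:

```go
// An illustration (not the actual machine-api code) of the "returning an error
// to requeue" behavior seen in the controller log above.
package main

import (
	"fmt"
	"time"
)

// requeueAfterError tells the reconcile loop to retry later instead of treating
// a still-pending instance as a hard failure. The real controllers use their own
// requeue error type; this one only mimics the "requeue in: 20s" log message.
type requeueAfterError struct {
	after time.Duration
}

func (e *requeueAfterError) Error() string {
	return fmt.Sprintf("requeue in: %s", e.after)
}

// finishCreate mirrors the tail of the Create flow in the log: once the status
// is persisted, a "pending" instance state yields a retriable error, not success.
func finishCreate(instanceState string) error {
	if instanceState == "pending" {
		return &requeueAfterError{after: 20 * time.Second}
	}
	return nil
}

func main() {
	if err := finishCreate("pending"); err != nil {
		// Matches the shape of the W1122 "Failed to create machine ...: requeue in: 20s" line.
		fmt.Printf("Failed to create machine %q: %v\n", "example-machine", err)
	}
}
```

The requeue turns a pending instance into a retriable error rather than a success, but as the log shows, the immediate follow-up Exists() check still failed to find the pending instance, which is why the next step was to guard Exists and Describe with the known instance ID.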
Filed https://github.com/openshift/cluster-api-provider-aws/pull/273 to add guards to Exists and Describe.

I've shifted this bug to 4.4.0 and cloned bug 1775874 back to 4.3.0.

Tested in 4.4.0-0.nightly-2019-12-19-223334. Tested using the autoscaler to scale up the cluster; I have not been able to reproduce this or find a similar log in the machine-controller. Moving to verified.