Description of problem:

Job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.3/53

Had one e2e failure (7m29s):

[Feature:Prometheus][Conformance] Prometheus when installed on the cluster should report less than two alerts in firing or pending state [Suite:openshift/conformance/parallel/minimal]

fail [github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:166]: Expected
    <map[string]error | len:1>: {
        "sum(ALERTS{alertstate=\"firing\"})": {
            s: "query sum(ALERTS{alertstate=\"firing\"}) for tests []prometheus.metricTest{prometheus.metricTest{labels:map[string]string(nil), greaterThanEqual:false, value:2, success:false}} had results {\"status\":\"success\",\"data\":{\"resultType\":\"vector\",\"result\":[{\"metric\":{},\"value\":[1571118402.97,\"3\"]}]}}",
        },
    }
to be empty

Is it a flake elsewhere or in monitoring?

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Possibly related to bug 1759659, which mentions MachineWithNoRunningPhase on Azure, although that's ON_QA and this is about AWS. There's another AWS instance in [1] showing:

ALERTS{alertname="MachineWithNoRunningPhase",alertstate="firing",api_version="machine.openshift.io/v1beta1",endpoint="https",exported_namespace="openshift-machine-api",instance="10.130.0.19:8443",job="machine-api-operator",name="ci-op-gjbm73c3-0fbd5-jmbtt-worker-us-east-1b-gj49c",namespace="openshift-machine-api",node="ip-10-0-157-49.ec2.internal",phase="Failed",pod="machine-api-operator-6d46668589-t4kkw",service="machine-api-operator",severity="critical",spec_provider_id="aws:///us-east-1b/i-08abc1b8e67a01724"}

starting at 1:15:55Z. And lots of other jobs are failing the Prometheus alert limit [2], which might be this alert, or might be unrelated alerts. It's unfortunate that the violating alerts don't seem to make it into the build log; maybe the test logic can be improved to make that happen?

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/138
[2]: https://ci-search-ci-search-next.svc.ci.openshift.org/chart?search=failed:.*Prometheus+when+installed+on+the+cluster+should+report+less+than+two+alerts+in+firing+or+pending+state&maxAge=336h&context=2&type=all
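On that last point, here's a minimal sketch of what "log the violating alerts" could look like. This is illustrative Python, not the actual Go test in openshift/origin, and the sample response is made up; the idea is that querying the unaggregated ALERTS{alertstate="firing"} series (instead of only sum(...)) gives the test the label sets it could print into the build log:

```python
import json

# Sample Prometheus instant-query response for ALERTS{alertstate="firing"}
# (invented data in the API's documented response shape, not from the failed job).
response = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "alertname": "MachineWithNoRunningPhase",
          "alertstate": "firing",
          "severity": "critical",
          "namespace": "openshift-machine-api"
        },
        "value": [1571118402.97, "1"]
      }
    ]
  }
}
""")

def describe_firing_alerts(resp):
    """Render each firing series as a PromQL-style selector for the build log."""
    lines = []
    for series in resp["data"]["result"]:
        labels = series["metric"]
        rendered = ",".join('%s="%s"' % (k, v) for k, v in sorted(labels.items()))
        lines.append("ALERTS{%s} value=%s" % (rendered, series["value"][1]))
    return lines

for line in describe_firing_alerts(response):
    print(line)
```

With output like that in the build log, triaging "less than two alerts" failures wouldn't require spelunking in must-gather artifacts.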
And in [1]:

ALERTS{alertname="MachineWithNoRunningPhase",alertstate="firing",api_version="machine.openshift.io/v1beta1",endpoint="https",exported_namespace="openshift-machine-api",instance="10.130.0.16:8443",job="machine-api-operator",name="ci-op-b79ds2yk-0fbd5-57dlv-worker-us-east-1b-5mc9q",namespace="openshift-machine-api",node="ip-10-0-145-56.ec2.internal",phase="Failed",pod="machine-api-operator-5655ff9484-2zcfk",service="machine-api-operator",severity="critical",spec_provider_id="aws:///us-east-1b/i-0ccde84aec2292d56"}

Drilling down to the failed machine:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.3/936/artifacts/e2e-aws/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-10-18-090821-sha256-dae1257b516a5c177237cfef5a6a3e241962b0d20cf54bcb2b66dc1671c5035e/namespaces/openshift-machine-api/machine.openshift.io/machines/ci-op-b79ds2yk-0fbd5-57dlv-worker-us-east-1b-5mc9q.yaml | yaml2json | jq '.status | {errorMessage, phase, providerStatus}'
{
  "errorMessage": "Can't find created instance.",
  "phase": "Failed",
  "providerStatus": {
    "apiVersion": "awsproviderconfig.openshift.io/v1beta1",
    "instanceId": "i-0ccde84aec2292d56",
    "instanceState": "pending",
    "kind": "AWSMachineProviderStatus",
    "conditions": [
      {
        "lastProbeTime": "2019-10-18T09:29:07Z",
        "status": "True",
        "message": "machine successfully created",
        "reason": "MachineCreationSucceeded",
        "lastTransitionTime": "2019-10-18T09:29:07Z",
        "type": "MachineCreation"
      }
    ]
  }
}

But the ClusterOperator is happy anyway:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.3/936/artifacts/e2e-aws/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-10-18-090821-sha256-dae1257b516a5c177237cfef5a6a3e241962b0d20cf54bcb2b66dc1671c5035e/cluster-scoped-resources/config.openshift.io/clusteroperators/machine-api.yaml | yaml2json | jq -r '.status.conditions[] | .type + " " + .status + ": " + .message'
Progressing False: 
Available True: Cluster Machine API Operator is available at operator: 4.3.0-0.ci-2019-10-18-090821
Degraded False: 
Upgradeable True: 

And so is the MachineSet:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.3/936/artifacts/e2e-aws/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-10-18-090821-sha256-dae1257b516a5c177237cfef5a6a3e241962b0d20cf54bcb2b66dc1671c5035e/namespaces/openshift-machine-api/machine.openshift.io/machinesets/ci-op-b79ds2yk-0fbd5-57dlv-worker-us-east-1b.yaml | yaml2json | jq .status
{
  "replicas": "1",
  "readyReplicas": "1",
  "availableReplicas": "1",
  "observedGeneration": "1",
  "fullyLabeledReplicas": "1"
}

It would be nice if we bubbled "waiting a really long time for AWS to get out of pending" up to something more discoverable, like the ClusterOperator and/or at least the MachineSet. (Maybe it does show up on the MachineSet sometimes, and must-gather just raced: it caught the broken Machine, then the Machine resolved, the MachineSet went happy, and must-gather collected the happy MachineSet, or something like that.)

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.3/936
I've seen this as well: the actuator's Exists() wrongly reports false, but the machine has already been given a providerID/address, so it goes Failed.

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.3/936/artifacts/e2e-aws/pods/openshift-machine-api_machine-api-controllers-69f56b54d-hqfgq_machine-controller.log
I adjusted the summary to focus on the AWS pending-instance aspect of this. Between that and the Azure stuff in bug 1759659, we hopefully take a large bite out of the MachineWithNoRunningPhase alert. We can revisit and see if there are other MachineWithNoRunningPhase triggers or other alerts causing the "less than two alerts" symptoms once we get these resolved.
> ...the actuator Exists() wrongly reports false... The documented Exists API is "...For AWS we query for instances in running state..." [1], and this pending instance doesn't qualify. Exists(...) -> (bool, error) is a pretty coarse check. Is there a generic machine-state model we can lean on to expose more detail? I guess [2]? There's also a pending comment down in updateStatus [3], but I'm not clear on what that's about yet. [1]: https://github.com/openshift/cluster-api-provider-aws/blob/151ec6cd21b9739206fd16f2ddf9f036318d7917/pkg/actuators/machine/actuator.go#L487 [2]: https://github.com/openshift/enhancements/pull/11 [3]: https://github.com/openshift/cluster-api-provider-aws/blob/151ec6cd21b9739206fd16f2ddf9f036318d7917/pkg/actuators/machine/actuator.go#L644-L651
> The documented Exists API is "...For AWS we query for instances in running state..." [1], and this pending instance doesn't qualify.

That comment is just a bit confusing, I think -- "running state" here actually means anything that's not terminated. I think this is what's happening here:

- We call the actuator's Create() method.
- An instance is created; we update the provider ID field in the spec, and update the Machine's status.
- This causes us to immediately reconcile the Machine again.
- In Exists(), we look for the instance by filtering based on tags, not by instance ID.
- If the instance has been created, but the tags haven't been set yet, Exists() returns false and we're in an error state.

I think that has always been the case, but it previously resulted in "leaked" AWS instances. Now it results in a Machine in a permanent Failed state, due to the recent changes introducing a more consistent set of states. The linked PR will try to first search by instance ID in Exists() and fall back to the old behavior if that's not possible.
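The race above can be modeled in miniature. This is an illustrative Python sketch, not the provider's actual Go code; the FakeEC2 class and function names are invented for the example. It shows why a tags-only Exists() misses a freshly created instance when the tag index is eventually consistent, while an ID-first lookup with a tag fallback does not:

```python
class FakeEC2:
    """Toy EC2 model where the tag index lags behind instance creation."""
    def __init__(self):
        self.instances = {}   # instance ID -> state
        self.tag_index = {}   # cluster tag -> set of instance IDs

    def run_instance(self, instance_id, cluster_tag, tags_visible=False):
        self.instances[instance_id] = "pending"
        if tags_visible:
            self.tag_index.setdefault(cluster_tag, set()).add(instance_id)

    def find_by_tags(self, cluster_tag):
        return sorted(self.tag_index.get(cluster_tag, set()))

    def find_by_id(self, instance_id):
        return instance_id in self.instances

def exists_tags_only(ec2, machine):
    # Old behavior: filter on tags only; misses instances whose tags
    # haven't propagated through the eventually consistent index yet.
    return bool(ec2.find_by_tags(machine["cluster_tag"]))

def exists_id_first(ec2, machine):
    # Proposed behavior: search by the recorded instance ID first,
    # falling back to the tag filter when no ID has been recorded.
    instance_id = machine.get("instance_id")
    if instance_id and ec2.find_by_id(instance_id):
        return True
    return bool(ec2.find_by_tags(machine["cluster_tag"]))

ec2 = FakeEC2()
machine = {"cluster_tag": "ci-cluster", "instance_id": "i-0ccde84aec2292d56"}
# Create() just ran: the instance exists but its tags aren't indexed yet.
ec2.run_instance("i-0ccde84aec2292d56", "ci-cluster", tags_visible=False)

print(exists_tags_only(ec2, machine))  # False -> Machine wrongly goes Failed
print(exists_id_first(ec2, machine))   # True  -> reconcile proceeds normally
```

The key design point is that the instance ID recorded in the Machine's spec is strongly consistent from the controller's point of view, so preferring it over the tag filter removes the window where the instance exists but is invisible.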
Is this test suitable to verify the issue?

openshift-tests run-test "[Feature:Prometheus][Conformance] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog [Suite:openshift/conformance/parallel/minimal]"
*** Bug 1772163 has been marked as a duplicate of this bug. ***
Moving this back to ASSIGNED. As discussed in bug 1772163, we're still seeing this behavior in CI.
Still seeing "Can't find created instance." in CI jobs:

status:
  addresses:
  - address: 10.0.135.20
    type: InternalIP
  - address: ''
    type: ExternalDNS
  - address: ip-10-0-135-20.ec2.internal
    type: InternalDNS
  errorMessage: Can't find created instance.
  lastUpdated: '2019-11-22T17:39:34Z'
  phase: Failed
  providerStatus:
    apiVersion: awsproviderconfig.openshift.io/v1beta1
    conditions:
    - lastProbeTime: '2019-11-22T17:39:34Z'
      lastTransitionTime: '2019-11-22T17:39:34Z'
      message: machine successfully created
      reason: MachineCreationSucceeded
      status: 'True'
      type: MachineCreation
    instanceId: i-08b09536e36303450
    instanceState: pending
    kind: AWSMachineProviderStatus

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2712/pull-ci-openshift-installer-master-e2e-aws-scaleup-rhel7/3059
> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2712/pull-ci-openshift-installer-master-e2e-aws-scaleup-rhel7/3059

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2712/pull-ci-openshift-installer-master-e2e-aws-scaleup-rhel7/3059/artifacts/release-latest/release-payload-latest/image-references | jq -r '.spec.tags[] | select(.name == "aws-machine-controllers").annotations["io.openshift.build.commit.id"]'
ce4e6b67555844fcfbf096de679552cfce8cbec7

which is the current master tip [1], and includes the eventual-consistency requeue [2]. The troubled machine is ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz [3]. Controller logs:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2712/pull-ci-openshift-installer-master-e2e-aws-scaleup-rhel7/3059/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-nr00dj4c-stable-sha256-b6dffb34bfe5c198d7e9ad032322277c2a129e45a4e7f2084a83ab560530f683/namespaces/openshift-machine-api/pods/machine-api-controllers-9bd7457fc-tfdmg/machine-controller/machine-controller/logs/current.log | grep ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz
2019-11-22T17:39:31.26930496Z I1122 17:39:31.269237 1 controller.go:164] Reconciling Machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz"
2019-11-22T17:39:31.269341783Z I1122 17:39:31.269277 1 controller.go:376] Machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
2019-11-22T17:39:31.27794315Z I1122 17:39:31.277897 1 controller.go:164] Reconciling Machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz"
2019-11-22T17:39:31.27794315Z I1122 17:39:31.277926 1 controller.go:376] Machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
2019-11-22T17:39:31.277977399Z I1122 17:39:31.277942 1 actuator.go:493] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: Checking if machine exists
2019-11-22T17:39:32.431695199Z I1122 17:39:32.431647 1 actuator.go:501] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: Instance does not exist
2019-11-22T17:39:32.431695199Z I1122 17:39:32.431675 1 controller.go:428] Machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz" going into phase "Provisioning"
2019-11-22T17:39:32.439999853Z I1122 17:39:32.439939 1 controller.go:319] Reconciling machine object ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz triggers idempotent create.
2019-11-22T17:39:32.439999853Z I1122 17:39:32.439963 1 actuator.go:109] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: creating machine
2019-11-22T17:39:32.4402235Z E1122 17:39:32.440190 1 utils.go:257] NodeRef not found in machine ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz
2019-11-22T17:39:32.47213483Z I1122 17:39:32.472090 1 instances.go:47] No stopped instances found for machine ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz
2019-11-22T17:39:34.642709291Z I1122 17:39:34.642651 1 actuator.go:194] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: ProviderID updated at machine spec: aws:///us-east-1a/i-08b09536e36303450
2019-11-22T17:39:34.651818357Z I1122 17:39:34.651773 1 actuator.go:606] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: Updating status
2019-11-22T17:39:34.651818357Z I1122 17:39:34.651808 1 actuator.go:650] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: finished calculating AWS status
2019-11-22T17:39:34.651899512Z I1122 17:39:34.651872 1 actuator.go:226] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: machine status has changed, updating
2019-11-22T17:39:34.660245157Z I1122 17:39:34.660202 1 actuator.go:668] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: Instance state still pending, returning an error to requeue
2019-11-22T17:39:34.660327796Z W1122 17:39:34.660285 1 controller.go:321] Failed to create machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz": requeue in: 20s
2019-11-22T17:39:34.660426954Z I1122 17:39:34.660380 1 controller.go:164] Reconciling Machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz"
2019-11-22T17:39:34.660426954Z I1122 17:39:34.660402 1 controller.go:376] Machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
2019-11-22T17:39:34.660426954Z I1122 17:39:34.660416 1 actuator.go:493] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: Checking if machine exists
2019-11-22T17:39:34.732221813Z I1122 17:39:34.732177 1 actuator.go:501] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: Instance does not exist
2019-11-22T17:39:34.732221813Z I1122 17:39:34.732202 1 controller.go:428] Machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz" going into phase "Failed"
...

[1]: https://github.com/openshift/cluster-api-provider-aws/commits
[2]: https://github.com/openshift/cluster-api-provider-aws/pull/271
[3]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2712/pull-ci-openshift-installer-master-e2e-aws-scaleup-rhel7/3059/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-nr00dj4c-stable-sha256-b6dffb34bfe5c198d7e9ad032322277c2a129e45a4e7f2084a83ab560530f683/namespaces/openshift-machine-api/machine.openshift.io/machines/ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz.yaml
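The log sequence is consistent with the phase decision sketched below. This is an illustrative Python reduction of the behavior visible in the log, not the controller's actual Go code: the create path requeues while the instance is pending, but a reconcile that runs immediately after the provider ID was recorded still sees Exists() == false, and a machine with a provider ID but no findable instance is declared Failed rather than re-entering Provisioning:

```python
def next_phase(exists, provider_id_set):
    """Reduced phase transition as seen in the log: a machine with a recorded
    provider ID whose instance can't be found is treated as gone."""
    if exists:
        return "Provisioned"
    if provider_id_set:
        return "Failed"       # -> "Can't find created instance."
    return "Provisioning"

# First reconcile: no instance found, no provider ID yet -> start provisioning.
print(next_phase(exists=False, provider_id_set=False))  # Provisioning
# Reconcile fired right after Create() recorded the provider ID, while the
# instance lookup still can't see the pending instance -> Failed.
print(next_phase(exists=False, provider_id_set=True))   # Failed
```

If this reduction is right, the requeue in PR 271 only delays the create-side retry; it doesn't stop the concurrent reconcile from reaching the Failed branch, which would explain why the symptom survived that fix.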
Filed https://github.com/openshift/cluster-api-provider-aws/pull/273 to add guards to Exists and Describe.
I've shifted this bug to 4.4.0 and cloned bug 1775874 back to 4.3.0.
Tested in 4.4.0-0.nightly-2019-12-19-223334, using the autoscaler to scale up the cluster. I was not able to reproduce the problem, nor did I find a similar log in the machine-controller. Moving to VERIFIED.