Bug 1761882 - AWS MachineWithNoRunningPhase firing, phase Failed, instanceState pending
Summary: AWS MachineWithNoRunningPhase firing, phase Failed, instanceState pending
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.4.0
Assignee: Alberto
QA Contact: Jianwei Hou
URL:
Whiteboard:
Duplicates: 1772163
Depends On:
Blocks: 1775874
 
Reported: 2019-10-15 13:50 UTC by Gabe Montero
Modified: 2020-05-15 14:41 UTC
CC: 18 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1775874
Environment:
Last Closed: 2020-05-15 14:41:28 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-api-provider-aws pull 266 0 'None' closed Bug 1761882: Search for existing instances by instance ID 2020-12-24 20:51:52 UTC
Github openshift cluster-api-provider-aws pull 271 0 'None' closed Bug 1761882: pkg/actuators/machine/actuator: Eventual-consistency wait for machines 2020-12-24 20:51:52 UTC
Github openshift cluster-api-provider-aws pull 273 0 'None' closed Bug 1761882: pkg/actuators/machine: Eventual-consistency guards for Exists/Describe 2020-12-24 20:51:51 UTC

Description Gabe Montero 2019-10-15 13:50:10 UTC
Description of problem:

Job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.3/53

Had one e2e failure

[Feature:Prometheus][Conformance] Prometheus when installed on the cluster should report less than two alerts in firing or pending state [Suite:openshift/conformance/parallel/minimal] 	7m29s
fail [github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:166]: Expected
    <map[string]error | len:1>: {
        "sum(ALERTS{alertstate=\"firing\"})": {
            s: "query sum(ALERTS{alertstate=\"firing\"}) for tests []prometheus.metricTest{prometheus.metricTest{labels:map[string]string(nil), greaterThanEqual:false, value:2, success:false}} had results {\"status\":\"success\",\"data\":{\"resultType\":\"vector\",\"result\":[{\"metric\":{},\"value\":[1571118402.97,\"3\"]}]}}",
        },
    }
to be empty


Is it a flake elsewhere or in monitoring?


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 3 W. Trevor King 2019-10-17 08:17:35 UTC
Possibly related to bug 1759659, which mentions MachineWithNoRunningPhase on Azure, although that one is ON_QA and this is about AWS.  There's another AWS instance in [1] showing:

  ALERTS{alertname="MachineWithNoRunningPhase",alertstate="firing",api_version="machine.openshift.io/v1beta1",endpoint="https",exported_namespace="openshift-machine-api",instance="10.130.0.19:8443",job="machine-api-operator",name="ci-op-gjbm73c3-0fbd5-jmbtt-worker-us-east-1b-gj49c",namespace="openshift-machine-api",node="ip-10-0-157-49.ec2.internal",phase="Failed",pod="machine-api-operator-6d46668589-t4kkw",service="machine-api-operator",severity="critical",spec_provider_id="aws:///us-east-1b/i-08abc1b8e67a01724"}

starting at 1:15:55Z.  Lots of other jobs are also failing the Prometheus alert limit [2], which might be this alert or might be unrelated alerts.  It's unfortunate that the violating alerts don't seem to make it into the build log; maybe the test logic can be improved to make that happen? (A rough sketch follows below the links.)

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/138
[2]: https://ci-search-ci-search-next.svc.ci.openshift.org/chart?search=failed:.*Prometheus+when+installed+on+the+cluster+should+report+less+than+two+alerts+in+firing+or+pending+state&maxAge=336h&context=2&type=all
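
One way to get the violating alerts into the build log would be for the test to query the in-cluster Prometheus for the firing alert label sets and print them whenever the check fails. Here is a minimal Go sketch using only the standard library; promURL, bearerToken, and dumpFiringAlerts are placeholder names for this example, not the origin test's actual code:

    package e2edebug

    import (
        "encoding/json"
        "fmt"
        "io/ioutil"
        "net/http"
        "net/url"
    )

    // dumpFiringAlerts queries the Prometheus HTTP API for firing alerts and
    // prints each result's label set so it lands in the build log.
    func dumpFiringAlerts(promURL, bearerToken string) error {
        q := url.Values{"query": {`ALERTS{alertstate="firing"}`}}
        req, err := http.NewRequest("GET", promURL+"/api/v1/query?"+q.Encode(), nil)
        if err != nil {
            return err
        }
        req.Header.Set("Authorization", "Bearer "+bearerToken)

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        body, err := ioutil.ReadAll(resp.Body)
        if err != nil {
            return err
        }

        // Only the label sets are interesting here; decode just those.
        var parsed struct {
            Data struct {
                Result []struct {
                    Metric map[string]string `json:"metric"`
                } `json:"result"`
            } `json:"data"`
        }
        if err := json.Unmarshal(body, &parsed); err != nil {
            return err
        }
        for _, r := range parsed.Data.Result {
            fmt.Printf("firing alert: %v\n", r.Metric)
        }
        return nil
    }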

Comment 4 W. Trevor King 2019-10-18 11:04:04 UTC
And in [1]:

ALERTS{alertname="MachineWithNoRunningPhase",alertstate="firing",api_version="machine.openshift.io/v1beta1",endpoint="https",exported_namespace="openshift-machine-api",instance="10.130.0.16:8443",job="machine-api-operator",name="ci-op-b79ds2yk-0fbd5-57dlv-worker-us-east-1b-5mc9q",namespace="openshift-machine-api",node="ip-10-0-145-56.ec2.internal",phase="Failed",pod="machine-api-operator-5655ff9484-2zcfk",service="machine-api-operator",severity="critical",spec_provider_id="aws:///us-east-1b/i-0ccde84aec2292d56"}

Drilling down to the failed machine:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.3/936/artifacts/e2e-aws/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-10-18-090821-sha256-dae1257b516a5c177237cfef5a6a3e241962b0d20cf54bcb2b66dc1671c5035e/namespaces/openshift-machine-api/machine.openshift.io/machines/ci-op-b79ds2yk-0fbd5-57dlv-worker-us-east-1b-5mc9q.yaml | yaml2json | jq '.status | {errorMessage, phase, providerStatus}'
{
  "errorMessage": "Can't find created instance.",
  "phase": "Failed",
  "providerStatus": {
    "apiVersion": "awsproviderconfig.openshift.io/v1beta1",
    "instanceId": "i-0ccde84aec2292d56",
    "instanceState": "pending",
    "kind": "AWSMachineProviderStatus",
    "conditions": [
      {
        "lastProbeTime": "2019-10-18T09:29:07Z",
        "status": "True",
        "message": "machine successfully created",
        "reason": "MachineCreationSucceeded",
        "lastTransitionTime": "2019-10-18T09:29:07Z",
        "type": "MachineCreation"
      }
    ]
  }
}

But the ClusterOperator is happy anyway:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.3/936/artifacts/e2e-aws/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-10-18-090821-sha256-dae1257b516a5c177237cfef5a6a3e241962b0d20cf54bcb2b66dc1671c5035e/cluster-scoped-resources/config.openshift.io/clusteroperators/machine-api.yaml | yaml2json | jq -r '.status.conditions[] | .type + " " + .status + ": " + .message'
Progressing False: 
Available True: Cluster Machine API Operator is available at operator: 4.3.0-0.ci-2019-10-18-090821
Degraded False: 
Upgradeable True: 

And so is the MachineSet:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.3/936/artifacts/e2e-aws/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-10-18-090821-sha256-dae1257b516a5c177237cfef5a6a3e241962b0d20cf54bcb2b66dc1671c5035e/namespaces/openshift-machine-api/machine.openshift.io/machinesets/ci-op-b79ds2yk-0fbd5-57dlv-worker-us-east-1b.yaml | yaml2json | jq .status
{
  "replicas": "1",
  "readyReplicas": "1",
  "availableReplicas": "1",
  "observedGeneration": "1",
  "fullyLabeledReplicas": "1"
}

Would be nice if we bubbled "waiting a really long time for AWS to get out of pending" up to something more discoverable, like the ClusterOperator and/or at least the MachineSet. (Maybe it does get shown on the MachineSet sometimes, and must-gather just raced: it grabbed the broken Machine, the Machine then resolved, the MachineSet went happy, and must-gather grabbed the happy MachineSet, or something like that?)

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.3/936

Comment 5 Alberto 2019-10-18 11:28:52 UTC
I've seen this as well: the actuator's Exists() wrongly reports false, but the machine has already got a providerID/address, so it goes Failed.

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.3/936/artifacts/e2e-aws/pods/openshift-machine-api_machine-api-controllers-69f56b54d-hqfgq_machine-controller.log

Comment 6 W. Trevor King 2019-10-18 11:43:06 UTC
I adjusted the summary to focus on the AWS pending-instance aspect of this.  Between that and the Azure stuff in bug 1759659, we hopefully take a large bite out of the MachineWithNoRunningPhase alert.  We can revisit and see if there are other MachineWithNoRunningPhase triggers or other alerts causing the "less than two alerts" symptoms once we get these resolved.

Comment 7 W. Trevor King 2019-10-18 11:51:44 UTC
> ...the actuator Exists() wrongly reports false...

The documented Exists API is "...For AWS we query for instances in running state..." [1], and this pending instance doesn't qualify.  Exists(...) -> (bool, error) is a pretty coarse check.  Is there a generic machine-state model we can lean on to expose more detail?  I guess [2]?  There's also a pending comment down in updateStatus [3], but I'm not clear on what that's about yet.

[1]: https://github.com/openshift/cluster-api-provider-aws/blob/151ec6cd21b9739206fd16f2ddf9f036318d7917/pkg/actuators/machine/actuator.go#L487
[2]: https://github.com/openshift/enhancements/pull/11
[3]: https://github.com/openshift/cluster-api-provider-aws/blob/151ec6cd21b9739206fd16f2ddf9f036318d7917/pkg/actuators/machine/actuator.go#L644-L651
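
To make the "coarse check" point concrete, here is a purely hypothetical sketch (the names are invented for illustration and are not an existing API) of the distinctions a bare bool cannot carry:

    package machine

    // InstanceState illustrates the detail that Exists() (bool, error) throws away.
    type InstanceState string

    const (
        StateUnknown    InstanceState = "Unknown"    // lookup failed or instance not yet visible (eventual consistency)
        StatePending    InstanceState = "Pending"    // created, still booting
        StateRunning    InstanceState = "Running"
        StateTerminated InstanceState = "Terminated"
    )

    // With only a bool, Unknown, Pending, and Terminated all collapse to
    // "false", which is what lets a still-booting instance be treated as
    // missing and the Machine be pushed into the Failed phase.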

Comment 8 Brad Ison 2019-10-23 14:50:57 UTC
> The documented Exists API is "...For AWS we query for instances in running state..." [1], and this pending instance doesn't qualify.

That comment is just a bit confusing, I think -- "running state" here actually means anything that is not terminated.

I think this is what's happening here:

- We call the actuator's Create() method.
- An instance is created and we update the provider ID field in spec, and update the Machine's status.
- This causes us to immediately reconcile the Machine again.
- In Exists(), we're looking for the instance by filtering based on tags, not by instance ID.
- If the instance has been created, but the tags haven't been set, Exists() returns false and we're in an error state.

I think that has always been the case, but it previously resulted in "leaked" AWS instances. Now it results in a Machine in a permanent Failed state due to the recent changes introducing a more consistent set of states.

The linked PR changes Exists() to search for existing instances by instance ID first, falling back to the old tag-based behavior when that's not possible.
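
For illustration, a minimal Go sketch of that lookup order against the EC2 API; the helper name findInstance and the single Name-tag fallback filter are assumptions for the example, not the PR's actual code:

    package machine

    import (
        "fmt"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/service/ec2"
        "github.com/aws/aws-sdk-go/service/ec2/ec2iface"
    )

    // findInstance prefers the instance ID already recorded on the machine
    // and only falls back to the tag filter when no ID is known yet.
    func findInstance(client ec2iface.EC2API, instanceID, machineName string) (*ec2.Instance, error) {
        if instanceID != "" {
            out, err := client.DescribeInstances(&ec2.DescribeInstancesInput{
                InstanceIds: []*string{aws.String(instanceID)},
            })
            if err == nil {
                for _, r := range out.Reservations {
                    for _, i := range r.Instances {
                        return i, nil
                    }
                }
            }
            // An InvalidInstanceID.NotFound error here usually means AWS is
            // still catching up (eventual consistency); fall through rather
            // than declaring the instance gone.
        }

        // Fallback: the old behavior, filtering on the machine's Name tag.
        out, err := client.DescribeInstances(&ec2.DescribeInstancesInput{
            Filters: []*ec2.Filter{
                {Name: aws.String("tag:Name"), Values: []*string{aws.String(machineName)}},
            },
        })
        if err != nil {
            return nil, err
        }
        for _, r := range out.Reservations {
            for _, i := range r.Instances {
                return i, nil
            }
        }
        return nil, fmt.Errorf("no instance found for machine %s", machineName)
    }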

Comment 10 Simon 2019-11-19 13:39:35 UTC
Is this test suitable for verifying the issue?

openshift-tests run-test "[Feature:Prometheus][Conformance] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog [Suite:openshift/conformance/parallel/minimal]"

Comment 11 W. Trevor King 2019-11-19 22:04:31 UTC
*** Bug 1772163 has been marked as a duplicate of this bug. ***

Comment 12 W. Trevor King 2019-11-19 22:05:32 UTC
Moving this back to ASSIGNED.  As discussed in bug 1772163, we're still seeing this behavior in CI.

Comment 14 Russell Teague 2019-11-22 19:38:57 UTC
Still seeing "Can't find created instance." in CI jobs.

    status:
      addresses:
      - address: 10.0.135.20
        type: InternalIP
      - address: ''
        type: ExternalDNS
      - address: ip-10-0-135-20.ec2.internal
        type: InternalDNS
      errorMessage: Can't find created instance.
      lastUpdated: '2019-11-22T17:39:34Z'
      phase: Failed
      providerStatus:
        apiVersion: awsproviderconfig.openshift.io/v1beta1
        conditions:
        - lastProbeTime: '2019-11-22T17:39:34Z'
          lastTransitionTime: '2019-11-22T17:39:34Z'
          message: machine successfully created
          reason: MachineCreationSucceeded
          status: 'True'
          type: MachineCreation
        instanceId: i-08b09536e36303450
        instanceState: pending
        kind: AWSMachineProviderStatus


https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2712/pull-ci-openshift-installer-master-e2e-aws-scaleup-rhel7/3059

Comment 15 W. Trevor King 2019-11-22 23:55:59 UTC
> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2712/pull-ci-openshift-installer-master-e2e-aws-scaleup-rhel7/3059

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2712/pull-ci-openshift-installer-master-e2e-aws-scaleup-rhel7/3059/artifacts/release-latest/release-payload-latest/image-references | jq -r '.spec.tags[] | select(.name == "aws-machine-controllers").annotations["io.openshift.build.commit.id"]'
ce4e6b67555844fcfbf096de679552cfce8cbec7

which is the current master tip [1], and includes the eventual-consistency requeue [2].  Troubled machine is ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz [3].  Controller logs:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2712/pull-ci-openshift-installer-master-e2e-aws-scaleup-rhel7/3059/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-nr00dj4c-stable-sha256-b6dffb34bfe5c198d7e9ad032322277c2a129e45a4e7f2084a83ab560530f683/namespaces/openshift-machine-api/pods/machine-api-controllers-9bd7457fc-tfdmg/machine-controller/machine-controller/logs/current.log | grep ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz
2019-11-22T17:39:31.26930496Z I1122 17:39:31.269237       1 controller.go:164] Reconciling Machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz"
2019-11-22T17:39:31.269341783Z I1122 17:39:31.269277       1 controller.go:376] Machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
2019-11-22T17:39:31.27794315Z I1122 17:39:31.277897       1 controller.go:164] Reconciling Machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz"
2019-11-22T17:39:31.27794315Z I1122 17:39:31.277926       1 controller.go:376] Machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
2019-11-22T17:39:31.277977399Z I1122 17:39:31.277942       1 actuator.go:493] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: Checking if machine exists
2019-11-22T17:39:32.431695199Z I1122 17:39:32.431647       1 actuator.go:501] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: Instance does not exist
2019-11-22T17:39:32.431695199Z I1122 17:39:32.431675       1 controller.go:428] Machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz" going into phase "Provisioning"
2019-11-22T17:39:32.439999853Z I1122 17:39:32.439939       1 controller.go:319] Reconciling machine object ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz triggers idempotent create.
2019-11-22T17:39:32.439999853Z I1122 17:39:32.439963       1 actuator.go:109] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: creating machine
2019-11-22T17:39:32.4402235Z E1122 17:39:32.440190       1 utils.go:257] NodeRef not found in machine ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz
2019-11-22T17:39:32.47213483Z I1122 17:39:32.472090       1 instances.go:47] No stopped instances found for machine ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz
2019-11-22T17:39:34.642709291Z I1122 17:39:34.642651       1 actuator.go:194] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: ProviderID updated at machine spec: aws:///us-east-1a/i-08b09536e36303450
2019-11-22T17:39:34.651818357Z I1122 17:39:34.651773       1 actuator.go:606] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: Updating status
2019-11-22T17:39:34.651818357Z I1122 17:39:34.651808       1 actuator.go:650] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: finished calculating AWS status
2019-11-22T17:39:34.651899512Z I1122 17:39:34.651872       1 actuator.go:226] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: machine status has changed, updating
2019-11-22T17:39:34.660245157Z I1122 17:39:34.660202       1 actuator.go:668] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: Instance state still pending, returning an error to requeue
2019-11-22T17:39:34.660327796Z W1122 17:39:34.660285       1 controller.go:321] Failed to create machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz": requeue in: 20s
2019-11-22T17:39:34.660426954Z I1122 17:39:34.660380       1 controller.go:164] Reconciling Machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz"
2019-11-22T17:39:34.660426954Z I1122 17:39:34.660402       1 controller.go:376] Machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
2019-11-22T17:39:34.660426954Z I1122 17:39:34.660416       1 actuator.go:493] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: Checking if machine exists
2019-11-22T17:39:34.732221813Z I1122 17:39:34.732177       1 actuator.go:501] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: Instance does not exist
2019-11-22T17:39:34.732221813Z I1122 17:39:34.732202       1 controller.go:428] Machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz" going into phase "Failed"
...



[1]: https://github.com/openshift/cluster-api-provider-aws/commits
[2]: https://github.com/openshift/cluster-api-provider-aws/pull/271
[3]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2712/pull-ci-openshift-installer-master-e2e-aws-scaleup-rhel7/3059/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-nr00dj4c-stable-sha256-b6dffb34bfe5c198d7e9ad032322277c2a129e45a4e7f2084a83ab560530f683/namespaces/openshift-machine-api/machine.openshift.io/machines/ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz.yaml

Comment 16 W. Trevor King 2019-11-23 00:03:01 UTC
Filed https://github.com/openshift/cluster-api-provider-aws/pull/273 to add guards to Exists and Describe.
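
The idea behind those guards, as a hedged sketch (existsGuard is a made-up name, and the real code works on the Machine object rather than plain strings): when no instances are found but the machine already carries a provider ID, return an error so the reconcile is requeued instead of reporting "does not exist", which would drive the machine to Failed.

    package machine

    import "fmt"

    // existsGuard: if the machine already has a provider ID but the describe
    // call found no instances, report an error so the controller requeues,
    // instead of returning "does not exist", which would push the machine
    // into the Failed phase.
    func existsGuard(machineName, providerID string, instancesFound int) (bool, error) {
        if instancesFound > 0 {
            return true, nil
        }
        if providerID != "" {
            return false, fmt.Errorf(
                "machine %s has provider ID %s but no matching instance was found; assuming eventual consistency and requeueing",
                machineName, providerID)
        }
        return false, nil
    }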

Comment 17 W. Trevor King 2019-11-23 04:50:42 UTC
I've shifted this bug to 4.4.0 and cloned bug 1775874 back to 4.3.0.

Comment 19 Jianwei Hou 2019-12-20 03:40:16 UTC
Tested in 4.4.0-0.nightly-2019-12-19-223334.

Tested by using the autoscaler to scale up the cluster; was not able to reproduce the issue or find a similar log in the machine-controller. Moving to verified.

