Bug 1781339 - AWS: terminated Machine adopted by replacement Machine with the same name
Summary: AWS: terminated Machine adopted by replacement Machine with the same name
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
unspecified
urgent
Target Milestone: ---
: 4.4.0
Assignee: Alberto
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On:
Blocks: 1782176
TreeView+ depends on / blocked
 
Reported: 2019-12-09 19:58 UTC by W. Trevor King
Modified: 2020-05-15 15:52 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1782176 (view as bug list)
Environment:
Last Closed: 2020-05-15 15:52:54 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-api-provider-aws pull 280 0 None closed Bug 1781339: Ensure Spec.ProviderID is not empty string 2020-05-15 14:35:17 UTC

Description W. Trevor King 2019-12-09 19:58:54 UTC
From a recent disruption job [1], where two control-plane Machines are deleted and then pushed back into the cluster.  Here's the old Machine going out:

$ curl -s  https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.4/12/artifacts/e2e-aws-disruptive/must-gather/registry-svc-ci-openshift-org-ocp-4-4-2019-12-08-180953-sha256-c31e3068a603b8d8add473dbaaa5b933323a23dc862cb266855248e7cba5ac99/namespaces/openshift-machine-api/pods/machine-api-controllers-dfdc5f957-6fj59/machine-controller/machine-controller/logs/previous.log | grep -B1 i-0e21ada20e9b19da2
2019-12-08T22:19:53.070511591Z I1208 22:19:53.070473       1 actuator.go:336] ci-op-vkpy76m9-16905-pdt5z-master-1: deleting machine
2019-12-08T22:19:53.185377501Z I1208 22:19:53.185272       1 utils.go:218] Cleaning up extraneous instance for machine: i-0e21ada20e9b19da2, state: running, launchTime: 2019-12-08 21:43:24 +0000 UTC
2019-12-08T22:19:53.185486624Z I1208 22:19:53.185462       1 utils.go:222] Terminating i-0e21ada20e9b19da2 instance

And then (in a new controller pod), the new Machine coming in and picking up that same (now terminated) i-0e21ada20e9b19da2:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.4/12/artifacts/e2e-aws-disruptive/must-gather/registry-svc-ci-openshift-org-ocp-4-4-2019-12-08-180953-sha256-c31e3068a603b8d8add473dbaaa5b933323a23dc862cb266855248e7cba5ac99/namespaces/openshift-machine-api/pods/machine-api-controllers-dfdc5f957-6fj59/machine-controller/machine-controller/logs/current.log | grep ci-op-vkpy76m9-16905-pdt5z-master-1 | head -n12
2019-12-08T22:24:58.11203914Z I1208 22:24:58.110637       1 controller.go:161] Reconciling Machine "ci-op-vkpy76m9-16905-pdt5z-master-1"
2019-12-08T22:24:58.11203914Z I1208 22:24:58.110660       1 controller.go:201] Reconciling machine "ci-op-vkpy76m9-16905-pdt5z-master-1" triggers delete
2019-12-08T22:24:58.124204426Z I1208 22:24:58.124188       1 actuator.go:336] ci-op-vkpy76m9-16905-pdt5z-master-1: deleting machine
2019-12-08T22:24:58.20959814Z E1208 22:24:58.209529       1 utils.go:187] Excluding instance matching ci-op-vkpy76m9-16905-pdt5z-master-1: instance i-0e21ada20e9b19da2 state "terminated" is not in running
2019-12-08T22:24:58.20959814Z W1208 22:24:58.209569       1 actuator.go:377] ci-op-vkpy76m9-16905-pdt5z-master-1: no instances found to delete for machine
2019-12-08T22:24:58.209648523Z I1208 22:24:58.209630       1 controller.go:228] Deleting node "ip-10-0-154-177.ec2.internal" for machine "ci-op-vkpy76m9-16905-pdt5z-master-1"
2019-12-08T22:24:58.231206051Z I1208 22:24:58.231148       1 controller.go:242] Machine "ci-op-vkpy76m9-16905-pdt5z-master-1" deletion successful
2019-12-08T22:25:00.610252762Z I1208 22:25:00.610169       1 controller.go:161] Reconciling Machine "ci-op-vkpy76m9-16905-pdt5z-master-1"
2019-12-08T22:25:00.610252762Z I1208 22:25:00.610215       1 actuator.go:497] ci-op-vkpy76m9-16905-pdt5z-master-1: Checking if machine exists
2019-12-08T22:25:00.667060176Z E1208 22:25:00.666191       1 utils.go:187] Excluding instance matching ci-op-vkpy76m9-16905-pdt5z-master-1: instance i-0e21ada20e9b19da2 state "terminated" is not in running, pending, stopped, stopping, shutting-down
2019-12-08T22:25:00.667060176Z I1208 22:25:00.666222       1 actuator.go:506] ci-op-vkpy76m9-16905-pdt5z-master-1: Possible eventual-consistency discrepancy; returning an error to requeue
2019-12-08T22:25:00.667060176Z E1208 22:25:00.666241       1 controller.go:253] Failed to check if machine "ci-op-vkpy76m9-16905-pdt5z-master-1" exists: requeue in: 20s

which continues for at least 30m:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.4/12/artifacts/e2e-aws-disruptive/must-gather/registry-svc-ci-openshift-org-ocp-4-4-2019-12-08-180953-sha256-c31e3068a603b8d8add473dbaaa5b933323a23dc862cb266855248e7cba5ac99/namespaces/openshift-machine-api/pods/machine-api-controllers-dfdc5f957-6fj59/machine-controller/machine-controller/logs/current.log | grep ci-op-vkpy76m9-16905-pdt5z-master-1 | tail -n4
2019-12-08T22:54:58.361762418Z I1208 22:54:58.360176       1 actuator.go:497] ci-op-vkpy76m9-16905-pdt5z-master-1: Checking if machine exists
2019-12-08T22:54:58.442454699Z E1208 22:54:58.442398       1 utils.go:187] Excluding instance matching ci-op-vkpy76m9-16905-pdt5z-master-1: instance i-0e21ada20e9b19da2 state "terminated" is not in running, pending, stopped, stopping, shutting-down
2019-12-08T22:54:58.442454699Z I1208 22:54:58.442429       1 actuator.go:506] ci-op-vkpy76m9-16905-pdt5z-master-1: Possible eventual-consistency discrepancy; returning an error to requeue
2019-12-08T22:54:58.442503051Z E1208 22:54:58.442449       1 controller.go:253] Failed to check if machine "ci-op-vkpy76m9-16905-pdt5z-master-1" exists: requeue in: 20s

The confused logic around eventual-consistency landed for bug 1772163.  I think we may need to go back and add limits to those eventual-consistency guards as discussed in [2].

From Clayton:
> mark urgent since disruption tests have caught at least 5 serious bugs that would impact customers at the worst possible time - disruptive suites get priority because of that

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.4/12
[2]: https://github.com/openshift/cluster-api-provider-aws/pull/271#discussion_r348300076

Comment 1 Michael Gugino 2019-12-09 22:13:48 UTC
IMO, the fix is to revert everything that has to do with determining whether a machine is 'failed' and all the related actions.  We shouldn't remove the details of the instance from the machine object, it should just be an 'instance state' field that describes the current state.

Comment 2 W. Trevor King 2019-12-09 22:19:52 UTC
Are you asking for a revert of [1]?

[1]: https://github.com/openshift/origin/commit/bd064a37e3ec5bf6db4b6a890096398a27be4459

Comment 3 Brad Ison 2019-12-10 11:57:45 UTC
> IMO, the fix is to revert everything that has to do with determining whether a machine is 'failed' and all the related actions.

That seems pretty drastic. If you're going to suggest something like that, can you *please* be more specific?


It seems to me that the existing PR should fix this:
https://github.com/openshift/cluster-api-provider-aws/pull/280

Comment 5 Jianwei Hou 2019-12-20 03:52:26 UTC
Verified this has been fixed in 4.4.0-0.ci-2019-12-13-145806. 

Job https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.4/15
Version: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.4/15/artifacts/e2e-aws-disruptive/clusterversion.json
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.4/15/artifacts/release-images-latest/release-images-latest

commit 07a8fd7b9920bcc9c6e6e0ef37529f0b3969836f
Merge: c22ade5b 3d1e2c69
Author: OpenShift Merge Robot <openshift-merge-robot.github.com>
Date:   Wed Dec 11 13:20:47 2019 +0100

    Merge pull request #280 from mgugino-upstream-stage/check-empty-quote-providerspec

    Bug 1781339: Ensure Spec.ProviderID is not empty string


Note You need to log in before you can comment on or make changes to this bug.