Bug 1782176

Summary: AWS: terminated Machine adopted by replacement Machine with the same name
Product: OpenShift Container Platform Reporter: Alberto <agarcial>
Component: Cloud ComputeAssignee: Brad Ison <brad.ison>
Status: CLOSED ERRATA QA Contact: Jianwei Hou <jhou>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 4.3.0CC: brad.ison, jhou, mgugino, wking
Target Milestone: ---   
Target Release: 4.3.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1781339 Environment:
Last Closed: 2020-01-23 11:18:55 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1781339    
Bug Blocks:    

Description Alberto 2019-12-11 10:48:09 UTC
+++ This bug was initially created as a clone of Bug #1781339 +++

From a recent disruption job [1], where two control-plane Machines are deleted and then pushed back into the cluster.  Here's the old Machine going out:

$ curl -s  https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.4/12/artifacts/e2e-aws-disruptive/must-gather/registry-svc-ci-openshift-org-ocp-4-4-2019-12-08-180953-sha256-c31e3068a603b8d8add473dbaaa5b933323a23dc862cb266855248e7cba5ac99/namespaces/openshift-machine-api/pods/machine-api-controllers-dfdc5f957-6fj59/machine-controller/machine-controller/logs/previous.log | grep -B1 i-0e21ada20e9b19da2
2019-12-08T22:19:53.070511591Z I1208 22:19:53.070473       1 actuator.go:336] ci-op-vkpy76m9-16905-pdt5z-master-1: deleting machine
2019-12-08T22:19:53.185377501Z I1208 22:19:53.185272       1 utils.go:218] Cleaning up extraneous instance for machine: i-0e21ada20e9b19da2, state: running, launchTime: 2019-12-08 21:43:24 +0000 UTC
2019-12-08T22:19:53.185486624Z I1208 22:19:53.185462       1 utils.go:222] Terminating i-0e21ada20e9b19da2 instance

And then (in a new controller pod), the new Machine coming in and picking up that same (now terminated) i-0e21ada20e9b19da2:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.4/12/artifacts/e2e-aws-disruptive/must-gather/registry-svc-ci-openshift-org-ocp-4-4-2019-12-08-180953-sha256-c31e3068a603b8d8add473dbaaa5b933323a23dc862cb266855248e7cba5ac99/namespaces/openshift-machine-api/pods/machine-api-controllers-dfdc5f957-6fj59/machine-controller/machine-controller/logs/current.log | grep ci-op-vkpy76m9-16905-pdt5z-master-1 | head -n12
2019-12-08T22:24:58.11203914Z I1208 22:24:58.110637       1 controller.go:161] Reconciling Machine "ci-op-vkpy76m9-16905-pdt5z-master-1"
2019-12-08T22:24:58.11203914Z I1208 22:24:58.110660       1 controller.go:201] Reconciling machine "ci-op-vkpy76m9-16905-pdt5z-master-1" triggers delete
2019-12-08T22:24:58.124204426Z I1208 22:24:58.124188       1 actuator.go:336] ci-op-vkpy76m9-16905-pdt5z-master-1: deleting machine
2019-12-08T22:24:58.20959814Z E1208 22:24:58.209529       1 utils.go:187] Excluding instance matching ci-op-vkpy76m9-16905-pdt5z-master-1: instance i-0e21ada20e9b19da2 state "terminated" is not in running
2019-12-08T22:24:58.20959814Z W1208 22:24:58.209569       1 actuator.go:377] ci-op-vkpy76m9-16905-pdt5z-master-1: no instances found to delete for machine
2019-12-08T22:24:58.209648523Z I1208 22:24:58.209630       1 controller.go:228] Deleting node "ip-10-0-154-177.ec2.internal" for machine "ci-op-vkpy76m9-16905-pdt5z-master-1"
2019-12-08T22:24:58.231206051Z I1208 22:24:58.231148       1 controller.go:242] Machine "ci-op-vkpy76m9-16905-pdt5z-master-1" deletion successful
2019-12-08T22:25:00.610252762Z I1208 22:25:00.610169       1 controller.go:161] Reconciling Machine "ci-op-vkpy76m9-16905-pdt5z-master-1"
2019-12-08T22:25:00.610252762Z I1208 22:25:00.610215       1 actuator.go:497] ci-op-vkpy76m9-16905-pdt5z-master-1: Checking if machine exists
2019-12-08T22:25:00.667060176Z E1208 22:25:00.666191       1 utils.go:187] Excluding instance matching ci-op-vkpy76m9-16905-pdt5z-master-1: instance i-0e21ada20e9b19da2 state "terminated" is not in running, pending, stopped, stopping, shutting-down
2019-12-08T22:25:00.667060176Z I1208 22:25:00.666222       1 actuator.go:506] ci-op-vkpy76m9-16905-pdt5z-master-1: Possible eventual-consistency discrepancy; returning an error to requeue
2019-12-08T22:25:00.667060176Z E1208 22:25:00.666241       1 controller.go:253] Failed to check if machine "ci-op-vkpy76m9-16905-pdt5z-master-1" exists: requeue in: 20s

which continues for at least 30m:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.4/12/artifacts/e2e-aws-disruptive/must-gather/registry-svc-ci-openshift-org-ocp-4-4-2019-12-08-180953-sha256-c31e3068a603b8d8add473dbaaa5b933323a23dc862cb266855248e7cba5ac99/namespaces/openshift-machine-api/pods/machine-api-controllers-dfdc5f957-6fj59/machine-controller/machine-controller/logs/current.log | grep ci-op-vkpy76m9-16905-pdt5z-master-1 | tail -n4
2019-12-08T22:54:58.361762418Z I1208 22:54:58.360176       1 actuator.go:497] ci-op-vkpy76m9-16905-pdt5z-master-1: Checking if machine exists
2019-12-08T22:54:58.442454699Z E1208 22:54:58.442398       1 utils.go:187] Excluding instance matching ci-op-vkpy76m9-16905-pdt5z-master-1: instance i-0e21ada20e9b19da2 state "terminated" is not in running, pending, stopped, stopping, shutting-down
2019-12-08T22:54:58.442454699Z I1208 22:54:58.442429       1 actuator.go:506] ci-op-vkpy76m9-16905-pdt5z-master-1: Possible eventual-consistency discrepancy; returning an error to requeue
2019-12-08T22:54:58.442503051Z E1208 22:54:58.442449       1 controller.go:253] Failed to check if machine "ci-op-vkpy76m9-16905-pdt5z-master-1" exists: requeue in: 20s

The confused logic around eventual-consistency landed for bug 1772163.  I think we may need to go back and add limits to those eventual-consistency guards as discussed in [2].

From Clayton:
> mark urgent since disruption tests have caught at least 5 serious bugs that would impact customers at the worst possible time - disruptive suites get priority because of that

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.4/12
[2]: https://github.com/openshift/cluster-api-provider-aws/pull/271#discussion_r348300076

--- Additional comment from Michael Gugino on 2019-12-09 22:13:48 UTC ---

IMO, the fix is to revert everything that has to do with determining whether a machine is 'failed' and all the related actions.  We shouldn't remove the details of the instance from the machine object, it should just be an 'instance state' field that describes the current state.

--- Additional comment from W. Trevor King on 2019-12-09 22:19:52 UTC ---

Are you asking for a revert of [1]?

[1]: https://github.com/openshift/origin/commit/bd064a37e3ec5bf6db4b6a890096398a27be4459

--- Additional comment from Brad Ison on 2019-12-10 11:57:45 UTC ---

> IMO, the fix is to revert everything that has to do with determining whether a machine is 'failed' and all the related actions.

That seems pretty drastic. If you're going to suggest something like that, can you *please* be more specific?


It seems to me that the existing PR should fix this:
https://github.com/openshift/cluster-api-provider-aws/pull/280

Comment 2 Jianwei Hou 2019-12-13 09:14:04 UTC
According to a most recent disruptive test for 4.3, it's still failing [1].

Two machines that were deleted and recreated by the disruptive test were
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.3/74/artifacts/e2e-aws-disruptive/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-12-05-183852-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/machine.openshift.io/machines/ci-op-vxh8i8vn-2770b-v8htb-master-0.yaml
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.3/74/artifacts/e2e-aws-disruptive/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-12-05-183852-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/machine.openshift.io/machines/ci-op-vxh8i8vn-2770b-v8htb-master-1.yaml


Machine controller seems to produce same message[2] as the bug reporting.

```
2019-12-12T18:09:12.213366746Z I1212 18:09:12.212375       1 actuator.go:499] ci-op-vxh8i8vn-2770b-v8htb-master-0: Checking if machine exists
2019-12-12T18:09:12.355415018Z E1212 18:09:12.355370       1 utils.go:186] Excluding instance matching ci-op-vxh8i8vn-2770b-v8htb-master-0: instance i-03bc8d3394d9ac2a1 state "terminated" is not in running, pending, stopped, stopping, shutting-down
2019-12-12T18:09:12.35548575Z I1212 18:09:12.355466       1 actuator.go:508] ci-op-vxh8i8vn-2770b-v8htb-master-0: Possible eventual-consistency discrepancy; returning an error to requeue
2019-12-12T18:09:12.355588804Z E1212 18:09:12.355522       1 controller.go:279] Failed to check if machine "ci-op-vxh8i8vn-2770b-v8htb-master-0" exists: requeue in: 20s

...


2019-12-12T18:38:32.316281711Z I1212 18:38:32.316267       1 actuator.go:499] ci-op-vxh8i8vn-2770b-v8htb-master-0: Checking if machine exists
2019-12-12T18:38:32.39923898Z E1212 18:38:32.399198       1 utils.go:186] Excluding instance matching ci-op-vxh8i8vn-2770b-v8htb-master-0: instance i-03bc8d3394d9ac2a1 state "terminated" is not in running, pending, stopped, stopping, shutting-down
2019-12-12T18:38:32.39923898Z I1212 18:38:32.399221       1 actuator.go:508] ci-op-vxh8i8vn-2770b-v8htb-master-0: Possible eventual-consistency discrepancy; returning an error to requeue
2019-12-12T18:38:32.399279482Z E1212 18:38:32.399234       1 controller.go:279] Failed to check if machine "ci-op-vxh8i8vn-2770b-v8htb-master-0" exists: requeue in: 20s
```


[1] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.3/73
[2] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.3/74/artifacts/e2e-aws-disruptive/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-12-05-183852-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/pods/machine-api-controllers-6c667d98d5-v4kwr/machine-controller/machine-controller/logs/current.log

Comment 3 Brad Ison 2019-12-13 13:31:17 UTC
I don't think either of the linked runs actually used an image that contained the fix:

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.3/74/artifacts/release-images-latest/release-images-latest

That shows:

  "name": "aws-machine-controllers",
  "annotations": {
    "io.openshift.build.commit.id": "a4be8350439b1da03902a65f1399b1269141c137",
    "io.openshift.build.commit.ref": "release-4.3",
    "io.openshift.build.source-location": "https://github.com/openshift/cluster-api-provider-aws"
  }

That commit (a4be835) is from November 26th, and doesn't contain the fix.

Can we try again with a newer release image?

Comment 5 Jianwei Hou 2019-12-16 04:24:23 UTC
Thank you Brad, sorry for the false alarm.

I inspected a new disruptive test that had picked up the fix[1]. The above issue has been fixed, the eventual-consistency error had not occurred[2].

"name": "aws-machine-controllers",
        "annotations": {
          "io.openshift.build.commit.id": "21713459ae063978bac32b8ea35dc6d68998ed89",
          "io.openshift.build.commit.ref": "release-4.3",
          "io.openshift.build.source-location": "https://github.com/openshift/cluster-api-provider-aws"
        },

commit 21713459ae063978bac32b8ea35dc6d68998ed89 (HEAD -> release-4.3, origin/release-4.3)
Merge: 1555366a ba72c04d
Author: OpenShift Merge Robot <openshift-merge-robot.github.com>
Date:   Wed Dec 11 14:51:41 2019 +0100

    Merge pull request #281 from openshift-cherrypick-robot/cherry-pick-280-to-release-4.3

    [release-4.3] Bug 1782176: Ensure Spec.ProviderID is not empty string

[1] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.3/77/artifacts/release-images-latest/release-images-latest
[2] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.3/77/artifacts/e2e-aws-disruptive/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-12-14-042216-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/pods/machine-api-controllers-75d55f5c97-ccn9k/machine-controller/machine-controller/logs/current.log

Comment 7 errata-xmlrpc 2020-01-23 11:18:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062