Bug 1880110 - vsphere-ipi double instance create
Summary: vsphere-ipi double instance create
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Michael Gugino
QA Contact: Milind Yadav
URL:
Whiteboard: TechnicalReleaseBlocker
Depends On:
Blocks:
 
Reported: 2020-09-17 17:22 UTC by David Eads
Modified: 2020-10-27 16:42 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:42:20 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-api-operator pull 703 0 None closed Bug 1880110: vSphere: Fix double create 2020-11-26 05:18:01 UTC
Github openshift machine-api-operator pull 704 0 None closed Bug 1880110: Vsphere: fix clone task log 2020-11-26 05:18:01 UTC
Github openshift machine-api-operator pull 710 0 None closed Bug 1880110: vSphere: create requeue delay 2020-11-26 05:18:00 UTC
Github openshift machine-api-operator pull 711 0 None closed Bug 1880110: vSphere: preserve taskID in local cache 2020-11-26 05:18:25 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:42:33 UTC

Description David Eads 2020-09-17 17:22:49 UTC
[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously

On vSphere IPI (https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere-serial), this test fails 40% of the time.

Comment 1 Michael McCune 2020-09-17 19:06:34 UTC
I am not certain this is a problem with the cloud compute components. It looks like 4 of the last 5 failures in the test grid have OVS-related errors in the output, for example:

```
Sep 16 05:38:16.778 E ns/openshift-sdn pod/sdn-r267m node/ci-op-jcyf03h8-129d4-bqhm4-worker-96v8n container/sdn container exited with code 255 (Error): ontroller.go:139] [SDN setup] full SDN setup required (plugin is not setup)\nI0916 05:37:32.064020   33105 ovs.go:180] Error executing ovs-vsctl: 2020-09-16T05:37:32Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\nI0916 05:38:02.602970   33105 ovs.go:180] Error executing ovs-vsctl: 2020-09-16T05:38:02Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\nI0916 05:38:03.111744   33105 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)\nI0916 05:38:03.615868   33105 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)\nI0916 05:38:04.245276   33105 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)\nI0916 05:38:05.030331   33105 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)\nI0916 05:38:06.013862   33105 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)\nI0916 05:38:07.238545   33105 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)\nI0916 05:38:08.769391   33105 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)\nI0916 05:38:10.684831   33105 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)\nI0916 05:38:13.073405   33105 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)\nI0916 05:38:16.057708   33105 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)\nF0916 05:38:16.057735   33105 cmd.go:111] Failed to start sdn: node SDN setup failed: timed out waiting for the condition\n 
```

and

```
Sep 17 04:22:31.106 E ns/openshift-sdn pod/sdn-8crfp node/ci-op-95qx4b6v-129d4-7x8cf-worker-86wfz container/sdn container exited with code 255 (Error): m /config/kube-proxy-config.yaml\nI0917 04:22:17.643049    8270 feature_gate.go:243] feature gates: &{map[]}\nI0917 04:22:17.643087    8270 cmd.go:216] Watching config file /config/kube-proxy-config.yaml for changes\nI0917 04:22:17.643107    8270 cmd.go:216] Watching config file /config/..2020_09_17_04_22_06.377703687/kube-proxy-config.yaml for changes\nI0917 04:22:17.722895    8270 node.go:150] Initializing SDN node "ci-op-95qx4b6v-129d4-7x8cf-worker-86wfz" (172.31.254.159) of type "redhat/openshift-ovs-networkpolicy"\nI0917 04:22:17.726947    8270 cmd.go:159] Starting node networking (v0.0.0-alpha.0-203-g2e41615c)\nI0917 04:22:17.726961    8270 node.go:338] Starting openshift-sdn network plugin\nI0917 04:22:17.910969    8270 sdn_controller.go:139] [SDN setup] full SDN setup required (Link not found)\nI0917 04:22:17.937256    8270 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket\nI0917 04:22:18.441281    8270 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket\nI0917 04:22:19.070235    8270 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket\nI0917 04:22:19.857468    8270 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket\nI0917 04:22:20.838681    8270 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket\nI0917 04:22:22.063129    8270 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket\nI0917 04:22:23.592948    8270 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket\nI0917 04:22:25.504695    8270 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket\nI0917 04:22:27.894560    8270 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket\nI0917 04:22:30.879774    8270 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket\nF0917 04:22:30.879862    8270 cmd.go:111] Failed to start sdn: node SDN setup failed: timed out waiting for the condition\n 
```


This failure mode seems to be common and appears to compound other issues related to network timeouts.

Comment 2 Dan Williams 2020-09-17 19:15:32 UTC
This is likely another symptom of what https://github.com/openshift/cluster-network-operator/pull/785 will fix, because the sdn_ovs containers say they are running OVS in the container, yet in 4.6 OVS will already be running on the node. Then sadness.

Comment 3 Michael McCune 2020-09-17 19:30:34 UTC
should we mark this bug as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1874696 ?

Comment 4 Dan Williams 2020-09-17 19:43:02 UTC

*** This bug has been marked as a duplicate of bug 1874696 ***

Comment 5 Michael Gugino 2020-09-17 19:47:53 UTC
Test I'm looking at: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere-serial/1306496787606933504

This looks to be a timeout issue. There's an 18-minute delta between the machine creation time and the node joining the cluster.

Further investigation reveals a pending client CSR around 18:35:

(base64 decode CSR request data into 'csr' file)

$ openssl req -in csr -noout -text
Certificate Request:
    Data:
        Version: 1 (0x0)
        Subject: O = system:nodes, CN = system:node:ci-op-glhngidp-129d4-mk2ql-worker-6bm5h

That matches the slow node/machine.
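
For reference, a minimal sketch of the decode step described above, assuming the request data was copied from the CSR object's spec.request field (base64-encoded PEM). This is just an equivalent of the openssl command, not part of any OpenShift tooling:

```go
// Decode a CSR's spec.request (base64-encoded PEM) and print its subject,
// the same information the openssl command above extracts.
package main

import (
	"crypto/x509"
	"encoding/base64"
	"encoding/pem"
	"fmt"
	"log"
	"os"
)

func main() {
	// os.Args[1] is the base64 string copied from the CSR's spec.request field.
	pemBytes, err := base64.StdEncoding.DecodeString(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	block, _ := pem.Decode(pemBytes) // "CERTIFICATE REQUEST" block
	if block == nil {
		log.Fatal("no PEM block found")
	}
	csr, err := x509.ParseCertificateRequest(block.Bytes)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(csr.Subject) // prints the subject, including CN=system:node:<machine-name>
}
```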

Seeing a lot of these log lines in the machine-controller https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere-serial/1306496787606933504/artifacts/e2e-vsphere-serial/gather-extra/pods/openshift-machine-api_machine-api-controllers-fc8674cc5-df574_machine-controller.log :

E0917 08:35:35.563128       1 actuator.go:51] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h error: ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: reconciler failed to Update machine: The name 'ci-op-glhngidp-129d4-mk2ql-worker-6bm5h' already exists.

The impression I'm getting is we entered exponential backoff at some point.

I0917 08:39:54.862025       1 controller.go:277] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: reconciling machine triggers idempotent update
I0917 08:39:54.862030       1 actuator.go:97] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: actuator updating machine
I0917 08:39:54.874098       1 reconciler.go:699] task: task-57832, state: error, description-id: VirtualMachine.clone

First couple reconciles:

I0917 08:32:42.687251       1 controller.go:169] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: reconciling Machine
I0917 08:32:42.709776       1 controller.go:169] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: reconciling Machine
I0917 08:32:42.709801       1 actuator.go:83] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: actuator checking if machine exists
I0917 08:32:42.756668       1 session.go:113] Find template by instance uuid: f6bc439f-bd36-4c37-99c6-eb63acf8e7a4
I0917 08:32:42.889050       1 reconciler.go:176] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: does not exist
I0917 08:32:42.889092       1 controller.go:424] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: going into phase "Provisioning"
I0917 08:32:42.901589       1 controller.go:313] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: reconciling machine triggers idempotent create
I0917 08:32:42.901608       1 actuator.go:60] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: actuator creating machine
I0917 08:32:42.917876       1 session.go:113] Find template by instance uuid: f6bc439f-bd36-4c37-99c6-eb63acf8e7a4
I0917 08:32:43.010130       1 reconciler.go:95] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: cloning
I0917 08:32:43.010190       1 session.go:110] Invalid UUID for VM "ci-op-glhngidp-129d4-mk2ql-rhcos": , trying to find by name
I0917 08:32:43.110729       1 reconciler.go:470] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: no snapshot name provided, getting snapshot using template
I0917 08:32:43.137476       1 reconciler.go:558] Getting network devices
I0917 08:32:43.137530       1 reconciler.go:647] Adding device: ci-segment
I0917 08:32:43.146815       1 reconciler.go:679] Adding device: eth card type: vmxnet3, network spec: &{NetworkName:ci-segment}, device info: &{VirtualDeviceBackingInfo:{DynamicData:{}} OpaqueNetworkId:b20ece59-b48b-468d-b498-011a9e7bb9ec OpaqueNetworkType:nsx.LogicalSwitch}
I0917 08:32:43.152016       1 reconciler.go:609] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: running task: 
I0917 08:32:43.152041       1 reconciler.go:715] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: Updating provider status
I0917 08:32:43.152068       1 machine_scope.go:102] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: patching machine
I0917 08:32:43.184430       1 controller.go:325] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: created instance, requeuing
I0917 08:32:43.184520       1 controller.go:169] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: reconciling Machine
I0917 08:32:43.184530       1 actuator.go:83] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: actuator checking if machine exists
I0917 08:32:43.196056       1 session.go:113] Find template by instance uuid: f6bc439f-bd36-4c37-99c6-eb63acf8e7a4
I0917 08:32:43.285958       1 reconciler.go:176] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: does not exist
I0917 08:32:43.285979       1 controller.go:313] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: reconciling machine triggers idempotent create
I0917 08:32:43.285984       1 actuator.go:60] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: actuator creating machine
I0917 08:32:43.300864       1 session.go:113] Find template by instance uuid: f6bc439f-bd36-4c37-99c6-eb63acf8e7a4
I0917 08:32:43.395127       1 reconciler.go:95] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: cloning
I0917 08:32:43.395168       1 session.go:110] Invalid UUID for VM "ci-op-glhngidp-129d4-mk2ql-rhcos": , trying to find by name
I0917 08:32:43.479575       1 reconciler.go:470] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: no snapshot name provided, getting snapshot using template
I0917 08:32:43.502936       1 reconciler.go:558] Getting network devices
I0917 08:32:43.502983       1 reconciler.go:647] Adding device: ci-segment
I0917 08:32:43.509957       1 reconciler.go:679] Adding device: eth card type: vmxnet3, network spec: &{NetworkName:ci-segment}, device info: &{VirtualDeviceBackingInfo:{DynamicData:{}} OpaqueNetworkId:b20ece59-b48b-468d-b498-011a9e7bb9ec OpaqueNetworkType:nsx.LogicalSwitch}
I0917 08:32:43.514331       1 reconciler.go:609] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: running task: 
I0917 08:32:43.514439       1 reconciler.go:715] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: Updating provider status
I0917 08:32:43.514499       1 machine_scope.go:102] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: patching machine
I0917 08:32:43.537976       1 controller.go:325] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: created instance, requeuing
I0917 08:32:43.538540       1 controller.go:169] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: reconciling Machine
I0917 08:32:43.538592       1 actuator.go:83] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: actuator checking if machine exists
I0917 08:32:43.548909       1 session.go:113] Find template by instance uuid: f6bc439f-bd36-4c37-99c6-eb63acf8e7a4
I0917 08:32:43.657769       1 reconciler.go:179] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: already exists
I0917 08:32:43.657790       1 controller.go:277] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: reconciling machine triggers idempotent update
I0917 08:32:43.657794       1 actuator.go:97] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: actuator updating machine
I0917 08:32:43.668452       1 reconciler.go:699] task: task-57831, state: running, description-id: VirtualMachine.clone
I0917 08:32:43.668481       1 machine_scope.go:102] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: patching machine
E0917 08:32:43.695122       1 actuator.go:51] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h error: ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: reconciler failed to Update machine: task task-57831 has not finished
E0917 08:32:43.695243       1 controller.go:279] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: error updating machine: ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: reconciler failed to Update machine: task task-57831 has not finished


So, we can see:
I0917 08:32:43.668452       1 reconciler.go:699] task: task-57831, state: running, description-id: VirtualMachine.clone
I0917 08:39:54.874098       1 reconciler.go:699] task: task-57832, state: error, description-id: VirtualMachine.clone

Looks like there were two calls to clone for the same machine, which is probably very wrong.
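
One of the linked PRs ("vSphere: preserve taskID in local cache") suggests the shape of the eventual guard. The following is a minimal sketch of that idea, under the assumption that a process-local record of the submitted clone task is enough to stop a reconcile working from a stale Machine object from cloning again; it is illustrative only, not the actual machine-api-operator code:

```go
// Remember the clone task submitted for each machine in a process-local cache,
// so a second reconcile that still sees "does not exist" (stale status) reuses
// the in-flight task instead of submitting another clone.
package main

import (
	"fmt"
	"sync"
)

var (
	mu           sync.Mutex
	inFlightTask = map[string]string{} // machine name -> vSphere task ID
)

// createIfNeeded submits a clone only if no task is already recorded for the machine.
func createIfNeeded(machineName string, submitClone func() string) (taskID string, created bool) {
	mu.Lock()
	defer mu.Unlock()
	if id, ok := inFlightTask[machineName]; ok {
		return id, false // a clone is already in flight; requeue and wait instead
	}
	taskID = submitClone()
	inFlightTask[machineName] = taskID
	return taskID, true
}

func main() {
	submit := func() string { return "task-57831" } // stand-in for the vSphere clone call
	fmt.Println(createIfNeeded("worker-6bm5h", submit)) // task-57831 true
	fmt.Println(createIfNeeded("worker-6bm5h", submit)) // task-57831 false: cached task reused
}
```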

Comment 6 Michael Gugino 2020-09-17 19:48:32 UTC
This is not a duplicate. Those other issues may be valid, but the VMware provider looks to be double-creating instances.

Comment 7 Michael McCune 2020-09-17 19:55:47 UTC
Thanks Mike, good find in the logs. Really odd about the double create.

Comment 8 Michael Gugino 2020-09-17 20:12:39 UTC
This line doesn't make any sense and results in an always-empty task ID in the 'running task:' log message: https://github.com/openshift/machine-api-operator/blob/master/pkg/controller/vsphere/reconciler.go#L609
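
A hypothetical illustration of that pattern and its fix (the names and structure here are made up for clarity, not the actual reconciler code): the message reads the task ID from provider status, which is only written after the clone result comes back, so the field is always empty.

```go
// Illustration of logging a task ID from a status field that has not been
// populated yet, versus logging the ID of the task that was just created.
package main

import "fmt"

type providerStatus struct {
	TaskRef string // only populated when provider status is updated later
}

func main() {
	status := providerStatus{}
	taskID := "task-57831" // returned by the vSphere clone call

	// Buggy pattern: status.TaskRef is still empty here, so this prints "running task: ".
	fmt.Printf("running task: %v\n", status.TaskRef)

	// Fixed pattern: log the ID of the task that was just submitted.
	fmt.Printf("running task: %v\n", taskID)
	status.TaskRef = taskID
}
```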

Comment 10 Michael Gugino 2020-09-19 16:54:31 UTC
This one might be hard for QA to reproduce and verify, as it's due to a race condition on the vSphere side of things.

These steps might reproduce on an affected version:

After a cluster is provisioned and all default workers have joined the cluster:
1) Scale up a single machineset +1 replicas.
2) observe CSRs: oc get csr
3) if there are any pending CSRs older than 5 minutes (it may take a few minutes to see any CSRs), you have reproduced the issue.

This only happens some of the time. You can attempt to scale up one replica at a time; it's important to wait until the new node has completely joined the cluster, and then wait another minute or so, before attempting to scale up another one. This helps ensure that the reconcile loop on the machine-api goes as fast as possible and does not reconcile a different machine before attempting to reconcile the new replica a second time.

The test 

[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously

is still failing for some other reasons, so it is not directly an indicator of this problem. For any runs where that test failed, you can investigate CSRs in the artifacts: artifacts/e2e-vsphere-serial/gather-extra/csr.json. Look for CSRs that have an empty status field.

Here's an affected version before the fix: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere-serial/1307001101131714560/artifacts/e2e-vsphere-serial/gather-extra/csr.json

If you grep for '"status": {}' (without the single quotes), you will see a 'pending' CSR. This is the indicator to check in future runs if this fix did not work or we regress.
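
The same check can be scripted; below is a minimal sketch that assumes the gathered csr.json is a standard Kubernetes CertificateSigningRequestList:

```go
// Print any CSR in a gathered csr.json whose status is empty (never approved/signed),
// equivalent to grepping the file for '"status": {}'.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
)

type csrList struct {
	Items []struct {
		Metadata struct {
			Name string `json:"name"`
		} `json:"metadata"`
		Status map[string]interface{} `json:"status"`
	} `json:"items"`
}

func main() {
	data, err := os.ReadFile("csr.json")
	if err != nil {
		log.Fatal(err)
	}
	var list csrList
	if err := json.Unmarshal(data, &list); err != nil {
		log.Fatal(err)
	}
	for _, item := range list.Items {
		if len(item.Status) == 0 { // empty status object: the CSR is still pending
			fmt.Println("pending CSR:", item.Metadata.Name)
		}
	}
}
```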

Comment 11 Michael Gugino 2020-09-21 13:26:09 UTC
This is still happening:

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere-serial/1307958161742761984

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere-serial/1307958161742761984/artifacts/e2e-vsphere-serial/gather-extra/pods/openshift-machine-api_machine-api-controllers-67b44fccf-v47xq_machine-controller.log

I0921 09:43:09.361082       1 controller.go:169] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: reconciling Machine
I0921 09:43:09.380884       1 controller.go:169] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: reconciling Machine
I0921 09:43:09.380903       1 actuator.go:83] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: actuator checking if machine exists
I0921 09:43:09.408822       1 session.go:113] Find template by instance uuid: 7e6da042-6bd4-4dce-9a3f-f7107cbbc34f
I0921 09:43:09.519379       1 reconciler.go:176] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: does not exist
I0921 09:43:09.519395       1 controller.go:424] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: going into phase "Provisioning"
I0921 09:43:09.528335       1 controller.go:313] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: reconciling machine triggers idempotent create
I0921 09:43:09.528351       1 actuator.go:60] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: actuator creating machine
I0921 09:43:09.536220       1 reconciler.go:70] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: cloning
I0921 09:43:09.536248       1 session.go:110] Invalid UUID for VM "ci-op-qhs7wn7r-129d4-kwz65-rhcos": , trying to find by name
I0921 09:43:09.673165       1 reconciler.go:476] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: no snapshot name provided, getting snapshot using template
I0921 09:43:09.702159       1 reconciler.go:564] Getting network devices
I0921 09:43:09.702192       1 reconciler.go:653] Adding device: ci-segment
I0921 09:43:09.710601       1 reconciler.go:685] Adding device: eth card type: vmxnet3, network spec: &{NetworkName:ci-segment}, device info: &{VirtualDeviceBackingInfo:{DynamicData:{}} OpaqueNetworkId:b20ece59-b48b-468d-b498-011a9e7bb9ec OpaqueNetworkType:nsx.LogicalSwitch}
I0921 09:43:09.716798       1 reconciler.go:615] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: running task: task-78152
I0921 09:43:09.716817       1 reconciler.go:721] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: Updating provider status
I0921 09:43:09.716833       1 machine_scope.go:102] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: patching machine
I0921 09:43:09.752011       1 controller.go:325] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: created instance, requeuing
I0921 09:43:09.752058       1 controller.go:169] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: reconciling Machine
I0921 09:43:09.752064       1 actuator.go:83] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: actuator checking if machine exists
I0921 09:43:09.760184       1 session.go:113] Find template by instance uuid: 7e6da042-6bd4-4dce-9a3f-f7107cbbc34f
I0921 09:43:09.896941       1 reconciler.go:176] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: does not exist
I0921 09:43:09.897023       1 controller.go:313] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: reconciling machine triggers idempotent create
I0921 09:43:09.897041       1 actuator.go:60] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: actuator creating machine
I0921 09:43:09.906548       1 reconciler.go:70] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: cloning
I0921 09:43:09.906584       1 session.go:110] Invalid UUID for VM "ci-op-qhs7wn7r-129d4-kwz65-rhcos": , trying to find by name
I0921 09:43:10.046831       1 reconciler.go:476] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: no snapshot name provided, getting snapshot using template
I0921 09:43:10.082641       1 reconciler.go:564] Getting network devices
I0921 09:43:10.082681       1 reconciler.go:653] Adding device: ci-segment
I0921 09:43:10.093127       1 reconciler.go:685] Adding device: eth card type: vmxnet3, network spec: &{NetworkName:ci-segment}, device info: &{VirtualDeviceBackingInfo:{DynamicData:{}} OpaqueNetworkId:b20ece59-b48b-468d-b498-011a9e7bb9ec OpaqueNetworkType:nsx.LogicalSwitch}
I0921 09:43:10.098210       1 reconciler.go:615] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: running task: task-78153
I0921 09:43:10.098224       1 reconciler.go:721] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: Updating provider status
I0921 09:43:10.098242       1 machine_scope.go:102] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: patching machine
I0921 09:43:10.150925       1 controller.go:325] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: created instance, requeuing

Comment 12 Michael Gugino 2020-09-21 14:38:29 UTC
I'm seeing two create events (well, a single "Created Machine" event with a last timestamp that differs from the created timestamp, unlike other machines):

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere-serial/1307958161742761984/artifacts/e2e-vsphere-serial/gather-extra/events.json

https://github.com/openshift/machine-api-operator/blob/f304d771cd3bc40305430b6a3f1ff8ca9a56e91d/pkg/controller/vsphere/actuator.go#L79

That tells me the code didn't die before that point, and that line is immediately followed by setting the status:

https://github.com/openshift/machine-api-operator/blob/f304d771cd3bc40305430b6a3f1ff8ca9a56e91d/pkg/controller/vsphere/machine_scope.go#L101

So, one of a few things is happening: 1) we're failing to generate a proper patch for the status, 2) the API server is failing to apply the patch properly, or 3) the next reconcile is getting a stale copy of the machine.


I'm leaning towards possibility 3. This only ever seems to happen on the 4th machine created during the scaleup test and not on the initial worker machines. It stands to reason the API server might be a little busier at this time than during the initial machine creations. The primary difference, as best I can see, is that when the initial workers are created, they are reconciled at the same time, so no machine is reconciled twice in a row, whereas this final machine is reconciled within milliseconds of patching its status (I outlined this in more detail in the QA steps above).

Will need to investigate further.
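
One of the linked PRs ("vSphere: create requeue delay") points at a mitigation for possibility 3. The sketch below illustrates the idea under that assumption (the duration and types are made up; this is not the actual controller code): after a successful create, requeue the machine after a short delay so the next reconcile reads the patched status rather than a stale cached copy.

```go
// After submitting a create, requeue with a delay instead of immediately, giving
// the status patch (with the clone task ID) time to land in the informer cache.
package main

import (
	"fmt"
	"time"
)

type result struct {
	Requeue      bool
	RequeueAfter time.Duration
}

func afterCreate() result {
	// The exact delay is illustrative; the point is to avoid re-reading the
	// Machine object milliseconds after patching it.
	return result{Requeue: true, RequeueAfter: 20 * time.Second}
}

func main() {
	fmt.Printf("%+v\n", afterCreate()) // requeue the machine after a short delay
}
```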

Comment 14 Milind Yadav 2020-09-28 05:11:34 UTC
Moved to VERIFIED based on the artifacts of build 4.6.0-0.nightly-2020-09-26-171448 - https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere-serial/1309905762046185472/artifacts/e2e-vsphere-serial/gather-extra/csr.json
 
Did not see any '"status": {}' entries (without the single quotes).

Comment 15 Michael Gugino 2020-09-28 15:46:19 UTC
We hit this again in CI with the latest fix.  Need to do more to protect against this.

Latest failed CI run: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere-serial/1309839466591948800

Log lines match up with the latest code, so the intended fixes did not work.

Comment 16 David Eads 2020-09-30 14:53:40 UTC
This appears to prevent cluster scaling of workers on vsphere-ipi.

Comment 22 errata-xmlrpc 2020-10-27 16:42:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

