[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously on vsphere IPI

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere-serial

This test fails 40% of the time.
I am not certain this is a problem with the cloud compute components. It looks like 4 of the last 5 failures in the test grid have OVS-related errors in the output, for example:

```
Sep 16 05:38:16.778 E ns/openshift-sdn pod/sdn-r267m node/ci-op-jcyf03h8-129d4-bqhm4-worker-96v8n container/sdn container exited with code 255 (Error): ontroller.go:139] [SDN setup] full SDN setup required (plugin is not setup)
I0916 05:37:32.064020 33105 ovs.go:180] Error executing ovs-vsctl: 2020-09-16T05:37:32Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
I0916 05:38:02.602970 33105 ovs.go:180] Error executing ovs-vsctl: 2020-09-16T05:38:02Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
I0916 05:38:03.111744 33105 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0916 05:38:03.615868 33105 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0916 05:38:04.245276 33105 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0916 05:38:05.030331 33105 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0916 05:38:06.013862 33105 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0916 05:38:07.238545 33105 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0916 05:38:08.769391 33105 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0916 05:38:10.684831 33105 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0916 05:38:13.073405 33105 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
I0916 05:38:16.057708 33105 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket (Connection refused)
F0916 05:38:16.057735 33105 cmd.go:111] Failed to start sdn: node SDN setup failed: timed out waiting for the condition
```

and

```
Sep 17 04:22:31.106 E ns/openshift-sdn pod/sdn-8crfp node/ci-op-95qx4b6v-129d4-7x8cf-worker-86wfz container/sdn container exited with code 255 (Error): m /config/kube-proxy-config.yaml
I0917 04:22:17.643049 8270 feature_gate.go:243] feature gates: &{map[]}
I0917 04:22:17.643087 8270 cmd.go:216] Watching config file /config/kube-proxy-config.yaml for changes
I0917 04:22:17.643107 8270 cmd.go:216] Watching config file /config/..2020_09_17_04_22_06.377703687/kube-proxy-config.yaml for changes
I0917 04:22:17.722895 8270 node.go:150] Initializing SDN node "ci-op-95qx4b6v-129d4-7x8cf-worker-86wfz" (172.31.254.159) of type "redhat/openshift-ovs-networkpolicy"
I0917 04:22:17.726947 8270 cmd.go:159] Starting node networking (v0.0.0-alpha.0-203-g2e41615c)
I0917 04:22:17.726961 8270 node.go:338] Starting openshift-sdn network plugin
I0917 04:22:17.910969 8270 sdn_controller.go:139] [SDN setup] full SDN setup required (Link not found)
I0917 04:22:17.937256 8270 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0917 04:22:18.441281 8270 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0917 04:22:19.070235 8270 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0917 04:22:19.857468 8270 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0917 04:22:20.838681 8270 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0917 04:22:22.063129 8270 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0917 04:22:23.592948 8270 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0917 04:22:25.504695 8270 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0917 04:22:27.894560 8270 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
I0917 04:22:30.879774 8270 ovs.go:180] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
F0917 04:22:30.879862 8270 cmd.go:111] Failed to start sdn: node SDN setup failed: timed out waiting for the condition
```

This failure mode seems to be common, and it looks like it compounds other issues related to network timeouts.
This is likely another symptom of what https://github.com/openshift/cluster-network-operator/pull/785 will fix: the sdn_ovs containers report that they are running OVS inside the container, yet in 4.6 OVS is already running on the node itself. Then sadness.
Should we mark this bug as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1874696 ?
*** This bug has been marked as a duplicate of bug 1874696 ***
Test I'm looking at: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere-serial/1306496787606933504

This looks to be a timeout issue. There's an 18-minute delta between the machine creation time and the node joining the cluster. Further investigation reveals a pending client CSR around 18:35 (after base64-decoding the CSR request data into a 'csr' file):

```
$ openssl req -in csr -noout -text
Certificate Request:
    Data:
        Version: 1 (0x0)
        Subject: O = system:nodes, CN = system:node:ci-op-glhngidp-129d4-mk2ql-worker-6bm5h
```

That matches the slow node/machine. Seeing a lot of these log lines in the machine-controller (https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere-serial/1306496787606933504/artifacts/e2e-vsphere-serial/gather-extra/pods/openshift-machine-api_machine-api-controllers-fc8674cc5-df574_machine-controller.log):

```
E0917 08:35:35.563128 1 actuator.go:51] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h error: ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: reconciler failed to Update machine: The name 'ci-op-glhngidp-129d4-mk2ql-worker-6bm5h' already exists.
```

The impression I'm getting is we entered exponential backoff at some point:

```
I0917 08:39:54.862025 1 controller.go:277] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: reconciling machine triggers idempotent update
I0917 08:39:54.862030 1 actuator.go:97] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: actuator updating machine
I0917 08:39:54.874098 1 reconciler.go:699] task: task-57832, state: error, description-id: VirtualMachine.clone
```

First couple of reconciles:

```
I0917 08:32:42.687251 1 controller.go:169] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: reconciling Machine
I0917 08:32:42.709776 1 controller.go:169] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: reconciling Machine
I0917 08:32:42.709801 1 actuator.go:83] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: actuator checking if machine exists
I0917 08:32:42.756668 1 session.go:113] Find template by instance uuid: f6bc439f-bd36-4c37-99c6-eb63acf8e7a4
I0917 08:32:42.889050 1 reconciler.go:176] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: does not exist
I0917 08:32:42.889092 1 controller.go:424] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: going into phase "Provisioning"
I0917 08:32:42.901589 1 controller.go:313] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: reconciling machine triggers idempotent create
I0917 08:32:42.901608 1 actuator.go:60] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: actuator creating machine
I0917 08:32:42.917876 1 session.go:113] Find template by instance uuid: f6bc439f-bd36-4c37-99c6-eb63acf8e7a4
I0917 08:32:43.010130 1 reconciler.go:95] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: cloning
I0917 08:32:43.010190 1 session.go:110] Invalid UUID for VM "ci-op-glhngidp-129d4-mk2ql-rhcos": , trying to find by name
I0917 08:32:43.110729 1 reconciler.go:470] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: no snapshot name provided, getting snapshot using template
I0917 08:32:43.137476 1 reconciler.go:558] Getting network devices
I0917 08:32:43.137530 1 reconciler.go:647] Adding device: ci-segment
I0917 08:32:43.146815 1 reconciler.go:679] Adding device: eth card type: vmxnet3, network spec: &{NetworkName:ci-segment}, device info: &{VirtualDeviceBackingInfo:{DynamicData:{}} OpaqueNetworkId:b20ece59-b48b-468d-b498-011a9e7bb9ec OpaqueNetworkType:nsx.LogicalSwitch}
I0917 08:32:43.152016 1 reconciler.go:609] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: running task:
I0917 08:32:43.152041 1 reconciler.go:715] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: Updating provider status
I0917 08:32:43.152068 1 machine_scope.go:102] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: patching machine
I0917 08:32:43.184430 1 controller.go:325] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: created instance, requeuing
I0917 08:32:43.184520 1 controller.go:169] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: reconciling Machine
I0917 08:32:43.184530 1 actuator.go:83] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: actuator checking if machine exists
I0917 08:32:43.196056 1 session.go:113] Find template by instance uuid: f6bc439f-bd36-4c37-99c6-eb63acf8e7a4
I0917 08:32:43.285958 1 reconciler.go:176] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: does not exist
I0917 08:32:43.285979 1 controller.go:313] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: reconciling machine triggers idempotent create
I0917 08:32:43.285984 1 actuator.go:60] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: actuator creating machine
I0917 08:32:43.300864 1 session.go:113] Find template by instance uuid: f6bc439f-bd36-4c37-99c6-eb63acf8e7a4
I0917 08:32:43.395127 1 reconciler.go:95] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: cloning
I0917 08:32:43.395168 1 session.go:110] Invalid UUID for VM "ci-op-glhngidp-129d4-mk2ql-rhcos": , trying to find by name
I0917 08:32:43.479575 1 reconciler.go:470] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: no snapshot name provided, getting snapshot using template
I0917 08:32:43.502936 1 reconciler.go:558] Getting network devices
I0917 08:32:43.502983 1 reconciler.go:647] Adding device: ci-segment
I0917 08:32:43.509957 1 reconciler.go:679] Adding device: eth card type: vmxnet3, network spec: &{NetworkName:ci-segment}, device info: &{VirtualDeviceBackingInfo:{DynamicData:{}} OpaqueNetworkId:b20ece59-b48b-468d-b498-011a9e7bb9ec OpaqueNetworkType:nsx.LogicalSwitch}
I0917 08:32:43.514331 1 reconciler.go:609] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: running task:
I0917 08:32:43.514439 1 reconciler.go:715] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: Updating provider status
I0917 08:32:43.514499 1 machine_scope.go:102] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: patching machine
I0917 08:32:43.537976 1 controller.go:325] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: created instance, requeuing
I0917 08:32:43.538540 1 controller.go:169] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: reconciling Machine
I0917 08:32:43.538592 1 actuator.go:83] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: actuator checking if machine exists
I0917 08:32:43.548909 1 session.go:113] Find template by instance uuid: f6bc439f-bd36-4c37-99c6-eb63acf8e7a4
I0917 08:32:43.657769 1 reconciler.go:179] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: already exists
I0917 08:32:43.657790 1 controller.go:277] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: reconciling machine triggers idempotent update
I0917 08:32:43.657794 1 actuator.go:97] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: actuator updating machine
I0917 08:32:43.668452 1 reconciler.go:699] task: task-57831, state: running, description-id: VirtualMachine.clone
I0917 08:32:43.668481 1 machine_scope.go:102] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: patching machine
E0917 08:32:43.695122 1 actuator.go:51] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h error: ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: reconciler failed to Update machine: task task-57831 has not finished
E0917 08:32:43.695243 1 controller.go:279] ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: error updating machine: ci-op-glhngidp-129d4-mk2ql-worker-6bm5h: reconciler failed to Update machine: task task-57831 has not finished
```

So, we can see:

```
I0917 08:32:43.668452 1 reconciler.go:699] task: task-57831, state: running, description-id: VirtualMachine.clone
I0917 08:39:54.874098 1 reconciler.go:699] task: task-57832, state: error, description-id: VirtualMachine.clone
```

It looks like there were two calls to clone for the same machine, which is probably very wrong.
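For reference, a minimal sketch of pulling a suspect CSR's request data and decoding it the way described above. The CSR name is a placeholder, and the `oc`/`openssl` invocations are standard usage, so treat the exact flow as illustrative rather than what was run here:

```
# List CSRs; a client CSR stuck in Pending for the slow node is the smoking gun
oc get csr

# Dump the PEM-encoded request of a suspect CSR to a file (csr-xxxxx is a placeholder name)
oc get csr csr-xxxxx -o jsonpath='{.spec.request}' | base64 -d > csr

# Inspect the subject to match it against the machine/node name
openssl req -in csr -noout -text
```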
This is not a duplicate. Those other issues may be valid, but the VMware provider looks to be double-creating instances.
Thanks Mike, good find in the logs. Really odd about the double create.
This line doesn't make any sense and results in an always-empty task number in the 'running task:' log message: https://github.com/openshift/machine-api-operator/blob/master/pkg/controller/vsphere/reconciler.go#L609
This one might be hard to reproduce for QA verification, as it's due to a race condition on the vSphere side of things. These steps might reproduce it on an affected version, after a cluster is provisioned and all default workers have joined the cluster:

1) Scale up a single machineset by +1 replicas.
2) Observe CSRs: oc get csr
3) If there are any pending CSRs older than 5 minutes (it may take a few minutes to see any CSRs), you have reproduced the issue.

This only happens some of the time. You can attempt to scale up 1 replica at a time; it's important to wait until the new node has completely joined the cluster, and then wait another minute or so, before attempting to scale up another one. This helps ensure that the machine-api reconcile loop goes as fast as possible and does not reconcile a different machine before attempting to reconcile the new replica a second time.

The test [sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously is still failing for other reasons as well, so it is not a direct indicator of this problem. For any runs where that test failed, you can investigate CSRs in the artifacts: artifacts/e2e-vsphere-serial/gather-extra/csr.json. Look for CSRs that have an empty status field. Here's an affected run before the fix: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere-serial/1307001101131714560/artifacts/e2e-vsphere-serial/gather-extra/csr.json

If you grep for '"status": {}' (without the single quotes) you will see a 'pending' CSR. That is the indicator, for any future runs, that this fix did not work or that we have regressed. A concrete sketch of the check follows below.
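As a sketch of that artifact check, assuming the gathered csr.json is a standard Kubernetes List of CertificateSigningRequests (the field names below are the usual schema, not confirmed against every run):

```
# Simple check: any match means at least one CSR was never approved/signed
grep '"status": {}' csr.json

# Equivalent jq form that also prints which CSRs are affected
jq -r '.items[] | select(.status == {}) | .metadata.name' csr.json
```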
This is still happening:

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere-serial/1307958161742761984

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere-serial/1307958161742761984/artifacts/e2e-vsphere-serial/gather-extra/pods/openshift-machine-api_machine-api-controllers-67b44fccf-v47xq_machine-controller.log

```
I0921 09:43:09.361082 1 controller.go:169] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: reconciling Machine
I0921 09:43:09.380884 1 controller.go:169] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: reconciling Machine
I0921 09:43:09.380903 1 actuator.go:83] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: actuator checking if machine exists
I0921 09:43:09.408822 1 session.go:113] Find template by instance uuid: 7e6da042-6bd4-4dce-9a3f-f7107cbbc34f
I0921 09:43:09.519379 1 reconciler.go:176] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: does not exist
I0921 09:43:09.519395 1 controller.go:424] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: going into phase "Provisioning"
I0921 09:43:09.528335 1 controller.go:313] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: reconciling machine triggers idempotent create
I0921 09:43:09.528351 1 actuator.go:60] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: actuator creating machine
I0921 09:43:09.536220 1 reconciler.go:70] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: cloning
I0921 09:43:09.536248 1 session.go:110] Invalid UUID for VM "ci-op-qhs7wn7r-129d4-kwz65-rhcos": , trying to find by name
I0921 09:43:09.673165 1 reconciler.go:476] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: no snapshot name provided, getting snapshot using template
I0921 09:43:09.702159 1 reconciler.go:564] Getting network devices
I0921 09:43:09.702192 1 reconciler.go:653] Adding device: ci-segment
I0921 09:43:09.710601 1 reconciler.go:685] Adding device: eth card type: vmxnet3, network spec: &{NetworkName:ci-segment}, device info: &{VirtualDeviceBackingInfo:{DynamicData:{}} OpaqueNetworkId:b20ece59-b48b-468d-b498-011a9e7bb9ec OpaqueNetworkType:nsx.LogicalSwitch}
I0921 09:43:09.716798 1 reconciler.go:615] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: running task: task-78152
I0921 09:43:09.716817 1 reconciler.go:721] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: Updating provider status
I0921 09:43:09.716833 1 machine_scope.go:102] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: patching machine
I0921 09:43:09.752011 1 controller.go:325] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: created instance, requeuing
I0921 09:43:09.752058 1 controller.go:169] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: reconciling Machine
I0921 09:43:09.752064 1 actuator.go:83] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: actuator checking if machine exists
I0921 09:43:09.760184 1 session.go:113] Find template by instance uuid: 7e6da042-6bd4-4dce-9a3f-f7107cbbc34f
I0921 09:43:09.896941 1 reconciler.go:176] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: does not exist
I0921 09:43:09.897023 1 controller.go:313] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: reconciling machine triggers idempotent create
I0921 09:43:09.897041 1 actuator.go:60] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: actuator creating machine
I0921 09:43:09.906548 1 reconciler.go:70] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: cloning
I0921 09:43:09.906584 1 session.go:110] Invalid UUID for VM "ci-op-qhs7wn7r-129d4-kwz65-rhcos": , trying to find by name
I0921 09:43:10.046831 1 reconciler.go:476] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: no snapshot name provided, getting snapshot using template
I0921 09:43:10.082641 1 reconciler.go:564] Getting network devices
I0921 09:43:10.082681 1 reconciler.go:653] Adding device: ci-segment
I0921 09:43:10.093127 1 reconciler.go:685] Adding device: eth card type: vmxnet3, network spec: &{NetworkName:ci-segment}, device info: &{VirtualDeviceBackingInfo:{DynamicData:{}} OpaqueNetworkId:b20ece59-b48b-468d-b498-011a9e7bb9ec OpaqueNetworkType:nsx.LogicalSwitch}
I0921 09:43:10.098210 1 reconciler.go:615] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: running task: task-78153
I0921 09:43:10.098224 1 reconciler.go:721] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: Updating provider status
I0921 09:43:10.098242 1 machine_scope.go:102] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: patching machine
I0921 09:43:10.150925 1 controller.go:325] ci-op-qhs7wn7r-129d4-kwz65-worker-9587t: created instance, requeuing
```
I'm seeing two create events (well, a single "Created Machine" event with a last timestamp different from the created timestamp, unlike other machines): https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere-serial/1307958161742761984/artifacts/e2e-vsphere-serial/gather-extra/events.json

https://github.com/openshift/machine-api-operator/blob/f304d771cd3bc40305430b6a3f1ff8ca9a56e91d/pkg/controller/vsphere/actuator.go#L79

That tells me the code didn't die up until that point, which is immediately followed by setting the status: https://github.com/openshift/machine-api-operator/blob/f304d771cd3bc40305430b6a3f1ff8ca9a56e91d/pkg/controller/vsphere/machine_scope.go#L101

So, one of a few things is happening:
1) We're failing to generate a proper patch for the status,
2) the API server is failing to apply the patch properly, or
3) the next reconcile is getting a stale copy of the machine.

I'm leaning towards possibility 3. This only ever seems to happen on the 4th machine created during the scaleup test, never on the initial worker machines. It stands to reason the API server might be a little busier at this point than during the initial machine creations. The primary difference, as best I can see, is that the initial workers are all reconciled at the same time, so no machine is reconciled twice in a row, whereas this final machine is reconciled within milliseconds of patching its status (outlined in more detail in the QA steps above). Will need to investigate further. A quick way to spot the duplicated event is sketched below.
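A rough sketch of that events.json check, assuming the gathered file is a standard List of core/v1 Events (the "Created Machine" message text comes from the event described above; the other field names are the usual Event schema):

```
# Find create events that fired more than once: a "Created Machine" event whose
# lastTimestamp differs from its firstTimestamp was emitted repeatedly.
jq -r '.items[]
       | select((.message // "") | test("Created Machine"))
       | select(.firstTimestamp != .lastTimestamp)
       | "\(.involvedObject.name): count=\(.count) first=\(.firstTimestamp) last=\(.lastTimestamp)"' \
  events.json
```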
Moved to VERIFIED based on the artifacts of build 4.6.0-0.nightly-2020-09-26-171448: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere-serial/1309905762046185472/artifacts/e2e-vsphere-serial/gather-extra/csr.json

Did not see any '"status": {}' (without the single quotes).
We hit this again in CI even with the latest fix. We need to do more to protect against this. Latest failed CI run: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere-serial/1309839466591948800

The log lines match up with the latest code, so the intended fixes did not work.
This appears to prevent cluster scaling of workers on vsphere-ipi.
We can see the latest patch set mitigating the race condition. This vSphere run demonstrates the exact condition being caught and handled:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/711/pull-ci-openshift-machine-api-operator-master-e2e-vsphere-serial/1310983960695672832

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/711/pull-ci-openshift-machine-api-operator-master-e2e-vsphere-serial/1310983960695672832/artifacts/e2e-vsphere-serial/gather-extra/pods/openshift-machine-api_machine-api-controllers-795884cdfc-7zdwh_machine-controller.log

```
E0929 17:48:18.393190 1 actuator.go:83] ci-op-l2mlynd3-0c8b9-mtkfg-worker-b5jbc: machine object missing expected provider task ID, requeue
```

We can see the worker machine ci-op-l2mlynd3-0c8b9-mtkfg-worker-b5jbc has the taskRef field populated here (a quick jq check for this is sketched below): https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/711/pull-ci-openshift-machine-api-operator-master-e2e-vsphere-serial/1310983960695672832/artifacts/e2e-vsphere-serial/gather-extra/machines.json
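A quick way to check that from the gathered machines.json (a sketch; `.status.providerStatus.taskRef` matches where the vSphere provider appears to store the task reference, but treat the exact field path as an assumption):

```
# Print each machine alongside the vSphere task it is tracking; an empty
# taskRef on a machine still provisioning is the condition the fix guards against.
jq -r '.items[]
       | "\(.metadata.name): taskRef=\(.status.providerStatus.taskRef // "<none>")"' \
  machines.json
```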
Validated on the latest build 4.6.0-0.nightly-2020-10-06-212602 and a few others chosen randomly:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere-serial/1313591939102150656/artifacts/e2e-vsphere-serial/gather-extra/machines.json

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere-serial/1313638023199461376/artifacts/e2e-vsphere-serial/gather-extra/machines.json

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere-serial/1313446667906715648/artifacts/e2e-vsphere-serial/gather-extra/machines.json

Moving to VERIFIED.

Additional info: validated as per the steps and info in the comments above; please feel free to add any more instructions in case I am missing anything.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196