Description of problem:

Failed job: https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-serial-4.2/32

Failure output:
```
STEP: waiting for cluster to get back to original size. Final size should be 3 worker nodes
STEP: got 6 nodes, expecting 3
STEP: got 6 nodes, expecting 3
STEP: got 6 nodes, expecting 3
...
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
```

Version-Release number of selected component (if applicable):
redhat-canary-openshift-ocp-installer-e2e-azure-serial-4.2

How reproducible:
Sometimes
```
I0823 05:57:02.704074 1 controller.go:205] Reconciling machine "ci-op-4inqf4cw-3a8ca-666zl-worker-centralus1-cz5td" triggers delete
I0823 06:00:14.863085 1 controller.go:239] Machine "ci-op-4inqf4cw-3a8ca-666zl-worker-centralus1-cz5td" deletion successful
```

The deletion on Azure simply takes longer than on AWS (3m12s in this run), so even 2 minutes is not sufficient.
The test appears flaky (passed 6 times, failed 4 times); some of the failures are due to the timeout.

https://testgrid.k8s.io/redhat-openshift-release-informing#redhat-canary-openshift-ocp-installer-e2e-azure-serial-4.2&sort-by-flakiness=

```
STEP: waiting for cluster to get back to original size. Final size should be 3 worker nodes
STEP: got 6 nodes, expecting 3
STEP: got 6 nodes, expecting 3
STEP: got 6 nodes, expecting 3
...
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
Aug 28 05:55:02.613: INFO: Running AfterSuite actions on all nodes
Aug 28 05:55:02.613: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/machines/scale.go:221]: Timed out after 240.000s.
```
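For context, the wait that times out at scale.go:221 is a polling loop of this general shape. The sketch below is an illustration only: the helper names, the 30s poll interval, and the use of a plain wait.PollImmediate loop are assumptions, not the actual test code; only the 240s budget and the "got N nodes, expecting 3" progress line come from the failure output above.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForWorkerCount polls countWorkers until it reports the expected number
// of worker nodes or the timeout elapses.
func waitForWorkerCount(countWorkers func() (int, error), expected int, interval, timeout time.Duration) error {
	return wait.PollImmediate(interval, timeout, func() (bool, error) {
		got, err := countWorkers()
		if err != nil {
			return false, err
		}
		fmt.Printf("STEP: got %d nodes, expecting %d\n", got, expected)
		return got == expected, nil
	})
}

func main() {
	// Hypothetical stand-in for listing nodes with a worker-role selector.
	countWorkers := func() (int, error) { return 3, nil }

	// 240s mirrors the "Timed out after 240.000s" in the failure; the 30s
	// interval is an assumption.
	if err := waitForWorkerCount(countWorkers, 3, 30*time.Second, 240*time.Second); err != nil {
		fmt.Println("scale-down did not complete in time:", err)
	}
}
```

With a fixed 240s budget, any single Azure machine deletion that runs longer than four minutes fails the test regardless of whether the scale-down eventually completes.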
From https://console.cloud.google.com/storage/browser/_details/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-serial-4.2/61/artifacts/e2e-azure-serial/container-logs/test.log:
```
Aug 28 05:55:02.613: INFO: Running AfterSuite actions on all nodes
```

From https://console.cloud.google.com/storage/browser/_details/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-serial-4.2/61/artifacts/e2e-azure-serial/pods/openshift-machine-api_machine-api-controllers-564f659496-grhfp_machine-controller.log:
```
I0828 05:51:37.299415 1 controller.go:205] Reconciling machine "ci-op-1zzgtx6m-3a8ca-trcpx-worker-centralus3-fvmdx" triggers delete
I0828 05:52:39.091238 1 controller.go:302] drain successful for machine "ci-op-1zzgtx6m-3a8ca-trcpx-worker-centralus3-fvmdx"
I0828 05:55:41.244092 1 virtualmachines.go:242] successfully deleted vm ci-op-1zzgtx6m-3a8ca-trcpx-worker-centralus3-fvmdx
I0828 05:56:41.832275 1 disks.go:65] successfully deleted disk ci-op-1zzgtx6m-3a8ca-trcpx-worker-centralus3-fvmdx_OSDisk
I0828 05:56:52.253019 1 networkinterfaces.go:197] successfully deleted nic ci-op-1zzgtx6m-3a8ca-trcpx-worker-centralus3-fvmdx-nic
I0828 05:56:52.328967 1 controller.go:239] Machine "ci-op-1zzgtx6m-3a8ca-trcpx-worker-centralus3-fvmdx" deletion successful
```

The last node was deleted almost 2 minutes after the test timed out. Deleting the VM resource in Azure alone took a bit over 3 minutes, and in total it took 5m15s to delete the last machine. So even a 6-minute timeout may not be enough if two machines are requested to be deleted.
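Given that a single machine deletion took ~5m15s here, a fixed wait is fragile. Below is a minimal sketch of the arithmetic discussed above, scaling the budget with the number of machines to delete; the helper name, the rounded per-machine figure, and the slack value are assumptions for illustration, not the change that was actually made.

```go
package main

import (
	"fmt"
	"time"
)

// scaleDownTimeout returns a wait budget for deleting n machines serially,
// given a per-machine estimate and some fixed slack.
func scaleDownTimeout(n int, perMachine, slack time.Duration) time.Duration {
	if n < 1 {
		n = 1
	}
	return time.Duration(n)*perMachine + slack
}

func main() {
	perMachine := 6 * time.Minute // rounded up from the 5m15s observed above
	fmt.Println(scaleDownTimeout(1, perMachine, time.Minute)) // 7m0s
	fmt.Println(scaleDownTimeout(2, perMachine, time.Minute)) // 13m0s
	fmt.Println(scaleDownTimeout(3, perMachine, time.Minute)) // 19m0s
}
```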
The last five runs have all passed. The test is stable and reliable; marking as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922