Description of problem:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1280

```
[Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial] 14m43s

fail [github.com/openshift/origin/test/extended/machines/scale.go:223]: Timed out after 420.000s.
Expected
    <bool>: false
to be true
```

Version-Release number of selected component (if applicable):
4.4

How reproducible:
Intermittent CI flake (see the job links in the comments below).
This should be looked into, but it is not currently a top flake based on https://testgrid.k8s.io/redhat-openshift-ocp-release-4.4-informing#release-openshift-ocp-installer-e2e-azure-serial-4.4&sort-by-flakiness=
Also happens on 4.5:

$ curl -s 'https://search.svc.ci.openshift.org/search?name=release-openshift-ocp-.*4.%5b45%5d&search=failed:.*Managed+cluster+should+grow+and+decrease+when+scaling+different+machineSets+simultaneously&maxAge=24h&context=0&type=build-log' | jq -r '. | keys[]'

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1336
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1339
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1349
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1350
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.5/279
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.5/280
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.5/282
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.5/301
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.5/303
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-serial-4.5/298
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.4/1296
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.5/281
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.5/295
Some of these seem to be timeouts while waiting for the machines to be deleted, because the Azure cloud takes quite a while:

https://github.com/openshift/origin/blob/1bd3012a7af33329ac920a589a7880db25fa2eda/test/extended/machines/scale.go#L211

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1339

```
STEP: waiting for cluster to get back to original size. Final size should be 3 worker nodes
STEP: got 6 nodes, expecting 3
STEP: got 6 nodes, expecting 3
STEP: got 6 nodes, expecting 3
STEP: got 6 nodes, expecting 3
STEP: got 6 nodes, expecting 3
STEP: got 6 nodes, expecting 3
STEP: got 6 nodes, expecting 3
STEP: got 6 nodes, expecting 3
```
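For context, the check that times out at scale.go:223 is essentially a poll on the worker-node count. Below is a minimal sketch of that kind of loop, assuming a recent client-go; the function name, the 15s interval, and the label selector are illustrative, and only the 420s timeout comes from the failure output:

```go
package machines

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForNodeCount polls until the cluster reports the expected number of
// worker nodes, mirroring the "got N nodes, expecting M" output above.
func waitForNodeCount(client kubernetes.Interface, expected int, timeout time.Duration) error {
	return wait.PollImmediate(15*time.Second, timeout, func() (bool, error) {
		nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{
			LabelSelector: "node-role.kubernetes.io/worker", // assumption: count workers only
		})
		if err != nil {
			return false, nil // tolerate transient API errors and keep polling
		}
		fmt.Printf("STEP: got %d nodes, expecting %d\n", len(nodes.Items), expected)
		return len(nodes.Items) == expected, nil
	})
}
```

In the failing runs, this would be invoked as roughly waitForNodeCount(client, 3, 420*time.Second), which gives up before a slow Azure scale-down finishes.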
For https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1336, it timed out scaling down from 6 to 3:

```
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
Mar 19 09:16:22.250: INFO: Running AfterSuite actions on all nodes
Mar 19 09:16:22.250: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/machines/scale.go:223]: Timed out after 420.000s.
Expected
    <bool>: false
to be true
```

although it eventually got back to three nodes:

https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1336/artifacts/e2e-azure-serial/nodes/
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1336/artifacts/e2e-azure-serial/machines.json

For https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1350, same as above:

```
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
Mar 20 03:30:12.537: INFO: Running AfterSuite actions on all nodes
Mar 20 03:30:12.537: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/machines/scale.go:223]: Timed out after 420.000s.
Expected
    <bool>: false
to be true
```

https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1350/artifacts/e2e-azure-serial/nodes/

This scale-down started at:

```
I0320 03:23:12.468046 1 controller.go:301] Too many replicas for machine.openshift.io/v1beta1, Kind=MachineSet openshift-machine-api/ci-op-rry2m48p-0ba00-q75sz-worker-centralus3, need 1, deleting 1
```

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1350/artifacts/e2e-azure-serial/pods/openshift-machine-api_machine-api-controllers-65854b4c4-8vzqq_controller-manager.log

The machine deletion succeeded slightly over 7 minutes later, after waiting for the underlying infra to go away:

```
I0320 03:31:02.741284 1 actuator.go:196] Checking if machine ci-op-rry2m48p-0ba00-q75sz-worker-centralus3-mm2t2 exists
I0320 03:31:03.013800 1 controller.go:241] ci-op-rry2m48p-0ba00-q75sz-worker-centralus3-mm2t2: deleting node "ci-op-rry2m48p-0ba00-q75sz-worker-centralus3-mm2t2" for machine
I0320 03:31:03.013840 1 controller.go:388] Node "ci-op-rry2m48p-0ba00-q75sz-worker-centralus3-mm2t2" not found
I0320 03:31:03.038198 1 controller.go:255] ci-op-rry2m48p-0ba00-q75sz-worker-centralus3-mm2t2: machine deletion successful
```

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1350/artifacts/e2e-azure-serial/pods/openshift-machine-api_machine-api-controllers-65854b4c4-8vzqq_machine-controller.log

This seems to be the case for most of the links above. I'll bump the timeout.
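For reference, the "Too many replicas" line above is emitted by the MachineSet reconciler's scale-down branch. A rough sketch of its shape, loosely following the upstream controller; the package layout, the import path, and the machinesToDelete helper are assumptions, not the exact code:

```go
package machineset

import (
	"context"

	machinev1 "github.com/openshift/machine-api-operator/pkg/apis/machine/v1beta1" // import path assumed for the 4.x tree
	"k8s.io/klog"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// scaleDown sketches the branch that logs "Too many replicas ... need N,
// deleting M": when a MachineSet owns more machines than Spec.Replicas,
// the surplus machine objects are deleted, and the machine controller then
// tears down the backing cloud resources asynchronously.
func scaleDown(ctx context.Context, c client.Client, ms *machinev1.MachineSet, machines []*machinev1.Machine) error {
	diff := len(machines) - int(*ms.Spec.Replicas)
	if diff <= 0 {
		return nil
	}
	klog.Infof("Too many replicas for %v %s/%s, need %d, deleting %d",
		ms.GroupVersionKind(), ms.Namespace, ms.Name, *ms.Spec.Replicas, diff)
	for _, m := range machinesToDelete(machines, diff) {
		if err := c.Delete(ctx, m); err != nil {
			return err
		}
	}
	return nil
}

// machinesToDelete stands in for the real delete-policy logic (Random,
// Newest, Oldest); here it just picks the first surplus machines.
func machinesToDelete(machines []*machinev1.Machine, diff int) []*machinev1.Machine {
	return machines[:diff]
}
```

The key point for this bug is that deleting the Machine object is fast; it is the asynchronous cloud teardown afterwards that takes the 7+ minutes seen in the logs.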
This was probably caused by the recent addition of the Exists() call before deleting the node object. Most likely the VMs have been slow to terminate on Azure for some time.
> This was probably caused by the recent addition of the Exists() call before deleting the node object. Most likely the VMs have been slow to terminate on Azure for some time.

I thought so as well, but the machine is being requeued even before that check. The Delete() call is legitimately failing, and therefore requeueing, since the underlying resource takes quite a while to go away:

```
E0320 03:30:32.078676 1 actuator.go:79] Machine error: failed to delete machine "ci-op-rry2m48p-0ba00-q75sz-worker-centralus3-mm2t2": failed to delete OS disk: result error: compute.DisksDeleteFuture: asynchronous operation has not completed
```
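To make the requeue behavior concrete, here is a minimal sketch of the reconcile-delete shape, assuming the upstream machine controller's structure; everything except the Exists()/Delete() surface and the quoted error message is illustrative:

```go
package machine

import (
	"context"
	"time"

	machinev1 "github.com/openshift/machine-api-operator/pkg/apis/machine/v1beta1" // import path assumed
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// Actuator mirrors the provider calls visible in the logs above.
type Actuator interface {
	Exists(ctx context.Context, m *machinev1.Machine) (bool, error)
	Delete(ctx context.Context, m *machinev1.Machine) error
}

type ReconcileMachine struct {
	actuator Actuator
}

// reconcileDelete sketches the deletion path: Delete() fails while the Azure
// disk-delete future is pending, so the machine is requeued; only once the
// infra is gone does the controller remove the node object and finish.
func (r *ReconcileMachine) reconcileDelete(ctx context.Context, m *machinev1.Machine) (reconcile.Result, error) {
	exists, err := r.actuator.Exists(ctx, m)
	if err != nil {
		return reconcile.Result{}, err
	}
	if exists {
		if err := r.actuator.Delete(ctx, m); err != nil {
			// e.g. "compute.DisksDeleteFuture: asynchronous operation has
			// not completed" -> returning an error requeues with backoff.
			return reconcile.Result{}, err
		}
		// Re-check on a later pass before touching the node object
		// (interval is illustrative).
		return reconcile.Result{RequeueAfter: 20 * time.Second}, nil
	}
	// Infra is gone: clean up the node object, matching the
	// "deleting node ... for machine" / "machine deletion successful" logs.
	if err := r.deleteNode(ctx, m); err != nil {
		return reconcile.Result{}, err
	}
	return reconcile.Result{}, nil
}

// deleteNode stands in for controller.go's node-object cleanup.
func (r *ReconcileMachine) deleteNode(ctx context.Context, m *machinev1.Machine) error {
	return nil // the real code deletes the corev1.Node backing the machine
}
```

So the controller is behaving correctly; the test's fixed 420s wait is simply shorter than Azure's worst-case disk deletion time.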
In addition to the CI testing, in QE's downstream automated testing we are experiencing the same test flakiness, mostly caused by timeouts during scale-down. It is very reproducible on Azure and OSP. When inspecting these test clusters, I found that the machine and node do eventually get deleted, but they occasionally take longer than the wait time. This test still appears very flaky in https://testgrid.k8s.io/redhat-openshift-ocp-release-4.5-informing#release-openshift-ocp-installer-e2e-azure-serial-4.5&sort-by-flakiness. Setting back to ASSIGNED to see whether more fixes or a longer wait time are needed.
Hey Jianwei, there's nothing else we can do for these particular failures, as this is simply the time it takes for the infrastructure to be deleted from Azure, so increasing the timeout seems legitimate here.
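For the record, the change amounts to raising the wait used by the scale test. A placeholder sketch only: the 420s figure comes from the failures above, while the bumped value here is illustrative, since the actual new value lives in the origin change and is not recorded in this bug:

```go
package machines

import "time"

const (
	oldNodeScaleWait = 420 * time.Second // wait seen timing out in the logs above
	newNodeScaleWait = 900 * time.Second // placeholder for the increased wait
)
```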
Looking at the data from 04-03 to today, the pass ratio has greatly improved. Moving to VERIFIED since the timeout was increased.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409