Description of problem:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1280

```
[Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial] 14m43s

fail [github.com/openshift/origin/test/extended/machines/scale.go:223]: Timed out after 420.000s.
Expected
    <bool>: false
to be true
```

Version-Release number of selected component (if applicable):
4.4

How reproducible:
Intermittent CI flake (see the job links in the comments below).
This should be looked into, but it is not currently a top flake based on https://testgrid.k8s.io/redhat-openshift-ocp-release-4.4-informing#release-openshift-ocp-installer-e2e-azure-serial-4.4&sort-by-flakiness=
Also happens on 4.5:

$ curl -s 'https://search.svc.ci.openshift.org/search?name=release-openshift-ocp-.*4.%5b45%5d&search=failed:.*Managed+cluster+should+grow+and+decrease+when+scaling+different+machineSets+simultaneously&maxAge=24h&context=0&type=build-log' | jq -r '. | keys[]'

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1336
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1339
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1349
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1350
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.5/279
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.5/280
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.5/282
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.5/301
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.5/303
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-serial-4.5/298
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.4/1296
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.5/281
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.5/295
Some of these seem to be timeouts while waiting for the machines to be deleted, because the Azure cloud takes quite a while:

https://github.com/openshift/origin/blob/1bd3012a7af33329ac920a589a7880db25fa2eda/test/extended/machines/scale.go#L211

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1339

```
STEP: waiting for cluster to get back to original size. Final size should be 3 worker nodes
STEP: got 6 nodes, expecting 3
STEP: got 6 nodes, expecting 3
STEP: got 6 nodes, expecting 3
STEP: got 6 nodes, expecting 3
STEP: got 6 nodes, expecting 3
STEP: got 6 nodes, expecting 3
STEP: got 6 nodes, expecting 3
STEP: got 6 nodes, expecting 3
```
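For context, the check that times out at scale.go:223 is essentially a poll on the worker-node count. Below is a minimal sketch of that kind of loop, assuming a recent client-go; the function name, the 15s interval, and the label selector are illustrative, and only the 420s timeout comes from the failure output:

```go
package machines

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForNodeCount polls until the cluster reports the expected number of
// worker nodes, mirroring the "got N nodes, expecting M" output above.
func waitForNodeCount(client kubernetes.Interface, expected int, timeout time.Duration) error {
	return wait.PollImmediate(15*time.Second, timeout, func() (bool, error) {
		nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{
			LabelSelector: "node-role.kubernetes.io/worker", // assumption: count workers only
		})
		if err != nil {
			return false, nil // tolerate transient API errors and keep polling
		}
		fmt.Printf("STEP: got %d nodes, expecting %d\n", len(nodes.Items), expected)
		return len(nodes.Items) == expected, nil
	})
}
```

In the failing runs, this would be invoked as roughly waitForNodeCount(client, 3, 420*time.Second), which gives up before a slow Azure scale-down finishes.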
For https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1336, it timed out scaling down from 6 to 3:

```
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
Mar 19 09:16:22.250: INFO: Running AfterSuite actions on all nodes
Mar 19 09:16:22.250: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/machines/scale.go:223]: Timed out after 420.000s.
Expected
    <bool>: false
to be true
```

although it eventually got back to three nodes:

https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1336/artifacts/e2e-azure-serial/nodes/
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1336/artifacts/e2e-azure-serial/machines.json

For https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1350, same as above:

```
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
Mar 20 03:30:12.537: INFO: Running AfterSuite actions on all nodes
Mar 20 03:30:12.537: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/machines/scale.go:223]: Timed out after 420.000s.
Expected
    <bool>: false
to be true
```

https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1350/artifacts/e2e-azure-serial/nodes/

This scale-down started at:

```
I0320 03:23:12.468046 1 controller.go:301] Too many replicas for machine.openshift.io/v1beta1, Kind=MachineSet openshift-machine-api/ci-op-rry2m48p-0ba00-q75sz-worker-centralus3, need 1, deleting 1
```

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1350/artifacts/e2e-azure-serial/pods/openshift-machine-api_machine-api-controllers-65854b4c4-8vzqq_controller-manager.log

The machine deletion succeeded slightly over 7 minutes later, after waiting for the underlying infra to go away:

```
I0320 03:31:02.741284 1 actuator.go:196] Checking if machine ci-op-rry2m48p-0ba00-q75sz-worker-centralus3-mm2t2 exists
I0320 03:31:03.013800 1 controller.go:241] ci-op-rry2m48p-0ba00-q75sz-worker-centralus3-mm2t2: deleting node "ci-op-rry2m48p-0ba00-q75sz-worker-centralus3-mm2t2" for machine
I0320 03:31:03.013840 1 controller.go:388] Node "ci-op-rry2m48p-0ba00-q75sz-worker-centralus3-mm2t2" not found
I0320 03:31:03.038198 1 controller.go:255] ci-op-rry2m48p-0ba00-q75sz-worker-centralus3-mm2t2: machine deletion successful
```

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1350/artifacts/e2e-azure-serial/pods/openshift-machine-api_machine-api-controllers-65854b4c4-8vzqq_machine-controller.log

This seems to be the case for most of the links above. I'll bump the timeout.
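For reference, the "Too many replicas" line above is emitted by the MachineSet reconciler's scale-down branch. A rough sketch of its shape, loosely following the upstream controller; the package layout, the import path, and the machinesToDelete helper are assumptions, not the exact code:

```go
package machineset

import (
	"context"

	machinev1 "github.com/openshift/machine-api-operator/pkg/apis/machine/v1beta1" // import path assumed for the 4.x tree
	"k8s.io/klog"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// scaleDown sketches the branch that logs "Too many replicas ... need N,
// deleting M": when a MachineSet owns more machines than Spec.Replicas,
// the surplus machine objects are deleted, and the machine controller then
// tears down the backing cloud resources asynchronously.
func scaleDown(ctx context.Context, c client.Client, ms *machinev1.MachineSet, machines []*machinev1.Machine) error {
	diff := len(machines) - int(*ms.Spec.Replicas)
	if diff <= 0 {
		return nil
	}
	klog.Infof("Too many replicas for %v %s/%s, need %d, deleting %d",
		ms.GroupVersionKind(), ms.Namespace, ms.Name, *ms.Spec.Replicas, diff)
	for _, m := range machinesToDelete(machines, diff) {
		if err := c.Delete(ctx, m); err != nil {
			return err
		}
	}
	return nil
}

// machinesToDelete stands in for the real delete-policy logic (Random,
// Newest, Oldest); here it just picks the first surplus machines.
func machinesToDelete(machines []*machinev1.Machine, diff int) []*machinev1.Machine {
	return machines[:diff]
}
```

The key point for this bug is that deleting the Machine object is fast; it is the asynchronous cloud teardown afterwards that takes the 7+ minutes seen in the logs.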
This was probably caused by the recent addition of the Exists() call before deleting the node object. Most likely the VMs have been slow to terminate on Azure for some time.
> This was probably caused by the recent addition of the Exists() call before deleting the node object. Most likely the VMs have been slow to terminate on Azure for some time.

I thought so as well, but the machine is being requeued even before that check. The Delete() call is legitimately failing, and therefore requeueing, since the underlying resource takes quite a while to go away:

```
E0320 03:30:32.078676 1 actuator.go:79] Machine error: failed to delete machine "ci-op-rry2m48p-0ba00-q75sz-worker-centralus3-mm2t2": failed to delete OS disk: result error: compute.DisksDeleteFuture: asynchronous operation has not completed
```
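To make the requeue behavior concrete, here is a minimal sketch of the reconcile-delete shape, assuming the upstream machine controller's structure; everything except the Exists()/Delete() surface and the quoted error message is illustrative:

```go
package machine

import (
	"context"
	"time"

	machinev1 "github.com/openshift/machine-api-operator/pkg/apis/machine/v1beta1" // import path assumed
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// Actuator mirrors the provider calls visible in the logs above.
type Actuator interface {
	Exists(ctx context.Context, m *machinev1.Machine) (bool, error)
	Delete(ctx context.Context, m *machinev1.Machine) error
}

type ReconcileMachine struct {
	actuator Actuator
}

// reconcileDelete sketches the deletion path: Delete() fails while the Azure
// disk-delete future is pending, so the machine is requeued; only once the
// infra is gone does the controller remove the node object and finish.
func (r *ReconcileMachine) reconcileDelete(ctx context.Context, m *machinev1.Machine) (reconcile.Result, error) {
	exists, err := r.actuator.Exists(ctx, m)
	if err != nil {
		return reconcile.Result{}, err
	}
	if exists {
		if err := r.actuator.Delete(ctx, m); err != nil {
			// e.g. "compute.DisksDeleteFuture: asynchronous operation has
			// not completed" -> returning an error requeues with backoff.
			return reconcile.Result{}, err
		}
		// Re-check on a later pass before touching the node object
		// (interval is illustrative).
		return reconcile.Result{RequeueAfter: 20 * time.Second}, nil
	}
	// Infra is gone: clean up the node object, matching the
	// "deleting node ... for machine" / "machine deletion successful" logs.
	if err := r.deleteNode(ctx, m); err != nil {
		return reconcile.Result{}, err
	}
	return reconcile.Result{}, nil
}

// deleteNode stands in for controller.go's node-object cleanup.
func (r *ReconcileMachine) deleteNode(ctx context.Context, m *machinev1.Machine) error {
	return nil // the real code deletes the corev1.Node backing the machine
}
```

So the controller is behaving correctly; the test's fixed 420s wait is simply shorter than Azure's worst-case disk deletion time.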
In addition to the CI testing, in QE's downstream automated testing we are experiencing the same test flakiness, mostly caused by timeouts during scale-down. It is very reproducible on Azure and OSP. When inspecting these test clusters, I found that the machine and node do eventually get deleted, but they occasionally take longer than the wait time. This test still appears very flaky in https://testgrid.k8s.io/redhat-openshift-ocp-release-4.5-informing#release-openshift-ocp-installer-e2e-azure-serial-4.5&sort-by-flakiness. Setting back to ASSIGNED to see whether more fixes or a longer wait time are needed.
Hey Jianwei, there's nothing else we can do for these particular failures, as this is simply the time it takes for the infrastructure to be deleted from Azure, so increasing the timeout seems legitimate here.
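For the record, the change amounts to raising the wait used by the scale test. A placeholder sketch only: the 420s figure comes from the failures above, while the bumped value here is illustrative, since the actual new value lives in the origin change and is not recorded in this bug:

```go
package machines

import "time"

const (
	oldNodeScaleWait = 420 * time.Second // wait seen timing out in the logs above
	newNodeScaleWait = 900 * time.Second // placeholder for the increased wait
)
```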
Looking at the data from 04-03 to today, the pass ratio has greatly improved. Moving to VERIFIED since the timeout was increased.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409