Description of problem:

Failed job: https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-serial-4.2/32

Failure output:
```
STEP: waiting for cluster to get back to original size. Final size should be 3 worker nodes
STEP: got 6 nodes, expecting 3
STEP: got 6 nodes, expecting 3
STEP: got 6 nodes, expecting 3
...
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
```

Version-Release number of selected component (if applicable):
redhat-canary-openshift-ocp-installer-e2e-azure-serial-4.2

How reproducible:
Sometimes
```
I0823 05:57:02.704074 1 controller.go:205] Reconciling machine "ci-op-4inqf4cw-3a8ca-666zl-worker-centralus1-cz5td" triggers delete
I0823 06:00:14.863085 1 controller.go:239] Machine "ci-op-4inqf4cw-3a8ca-666zl-worker-centralus1-cz5td" deletion successful
```

The deletion on Azure simply takes longer than on AWS (3m12s in this run), so even 2 minutes is not sufficient.
The test appears flaky (passed 6 times, failed 4 times); some of the failures are due to the timeout.

https://testgrid.k8s.io/redhat-openshift-release-informing#redhat-canary-openshift-ocp-installer-e2e-azure-serial-4.2&sort-by-flakiness=

```
STEP: waiting for cluster to get back to original size. Final size should be 3 worker nodes
STEP: got 6 nodes, expecting 3
STEP: got 6 nodes, expecting 3
STEP: got 6 nodes, expecting 3
...
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
STEP: got 4 nodes, expecting 3
Aug 28 05:55:02.613: INFO: Running AfterSuite actions on all nodes
Aug 28 05:55:02.613: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/machines/scale.go:221]: Timed out after 240.000s.
```
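For context, the wait that times out at scale.go:221 is a polling loop of this general shape. The sketch below is an illustration only: the helper names, the 30s poll interval, and the use of a plain wait.PollImmediate loop are assumptions, not the actual test code; only the 240s budget and the "got N nodes, expecting 3" progress line come from the failure output above.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForWorkerCount polls countWorkers until it reports the expected number
// of worker nodes or the timeout elapses.
func waitForWorkerCount(countWorkers func() (int, error), expected int, interval, timeout time.Duration) error {
	return wait.PollImmediate(interval, timeout, func() (bool, error) {
		got, err := countWorkers()
		if err != nil {
			return false, err
		}
		fmt.Printf("STEP: got %d nodes, expecting %d\n", got, expected)
		return got == expected, nil
	})
}

func main() {
	// Hypothetical stand-in for listing nodes with a worker-role selector.
	countWorkers := func() (int, error) { return 3, nil }

	// 240s mirrors the "Timed out after 240.000s" in the failure; the 30s
	// interval is an assumption.
	if err := waitForWorkerCount(countWorkers, 3, 30*time.Second, 240*time.Second); err != nil {
		fmt.Println("scale-down did not complete in time:", err)
	}
}
```

With a fixed 240s budget, any single Azure machine deletion that runs longer than four minutes fails the test regardless of whether the scale-down eventually completes.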
From https://console.cloud.google.com/storage/browser/_details/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-serial-4.2/61/artifacts/e2e-azure-serial/container-logs/test.log:
```
Aug 28 05:55:02.613: INFO: Running AfterSuite actions on all nodes
```

From https://console.cloud.google.com/storage/browser/_details/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-serial-4.2/61/artifacts/e2e-azure-serial/pods/openshift-machine-api_machine-api-controllers-564f659496-grhfp_machine-controller.log:
```
I0828 05:51:37.299415 1 controller.go:205] Reconciling machine "ci-op-1zzgtx6m-3a8ca-trcpx-worker-centralus3-fvmdx" triggers delete
I0828 05:52:39.091238 1 controller.go:302] drain successful for machine "ci-op-1zzgtx6m-3a8ca-trcpx-worker-centralus3-fvmdx"
I0828 05:55:41.244092 1 virtualmachines.go:242] successfully deleted vm ci-op-1zzgtx6m-3a8ca-trcpx-worker-centralus3-fvmdx
I0828 05:56:41.832275 1 disks.go:65] successfully deleted disk ci-op-1zzgtx6m-3a8ca-trcpx-worker-centralus3-fvmdx_OSDisk
I0828 05:56:52.253019 1 networkinterfaces.go:197] successfully deleted nic ci-op-1zzgtx6m-3a8ca-trcpx-worker-centralus3-fvmdx-nic
I0828 05:56:52.328967 1 controller.go:239] Machine "ci-op-1zzgtx6m-3a8ca-trcpx-worker-centralus3-fvmdx" deletion successful
```

The last node was deleted almost 2 minutes after the test timed out. Deleting the VM resource in Azure alone took a bit over 3 minutes, and in total it took 5m15s to delete the last machine. So even a 6-minute timeout may not be enough if two machines are requested to be deleted.
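Given that a single machine deletion took ~5m15s here, a fixed wait is fragile. Below is a minimal sketch of the arithmetic discussed above, scaling the budget with the number of machines to delete; the helper name, the rounded per-machine figure, and the slack value are assumptions for illustration, not the change that was actually made.

```go
package main

import (
	"fmt"
	"time"
)

// scaleDownTimeout returns a wait budget for deleting n machines serially,
// given a per-machine estimate and some fixed slack.
func scaleDownTimeout(n int, perMachine, slack time.Duration) time.Duration {
	if n < 1 {
		n = 1
	}
	return time.Duration(n)*perMachine + slack
}

func main() {
	perMachine := 6 * time.Minute // rounded up from the 5m15s observed above
	fmt.Println(scaleDownTimeout(1, perMachine, time.Minute)) // 7m0s
	fmt.Println(scaleDownTimeout(2, perMachine, time.Minute)) // 13m0s
	fmt.Println(scaleDownTimeout(3, perMachine, time.Minute)) // 19m0s
}
```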
The last five runs have all passed. The test is stable and reliable; marking as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922