Bug 1737134 - [Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial] failing on e2e-aws 20%
Summary: [Feature:Machines][Serial] Managed cluster should grow and decrease when scal...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: urgent
Target Milestone: ---
Target Release: 4.2.0
Assignee: Alberto
QA Contact: Jianwei Hou
URL:
Whiteboard: buildcop
Depends On:
Blocks:
 
Reported: 2019-08-02 18:44 UTC by Peter Hunt
Modified: 2019-10-16 06:34 UTC
8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:34:33 UTC
Target Upstream Version:


Links
System ID Private Priority Status Summary Last Updated
Github openshift origin pull 23612 0 'None' closed bug 1737134: [Feature:Machines][Serial] Managed cluster should: increase timeout waiting for nodes to disappear 2020-05-26 05:43:24 UTC
Github openshift origin pull 23625 0 'None' closed Bug 1737134: [Feature:Machines][Serial] Managed cluster should: get machineset role from template labels 2020-05-26 05:43:24 UTC
Red Hat Product Errata RHBA-2019:2922 0 None None None 2019-10-16 06:34:42 UTC

Description Peter Hunt 2019-08-02 18:44:34 UTC
Description of problem:
[Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]
failing on e2e-aws intermittently

Version-Release number of selected component (if applicable):
4.2.0

How reproducible:
in about half of occurrences as of 8/2
https://testgrid.k8s.io/redhat-openshift-release-blocking#redhat-release-openshift-origin-installer-e2e-aws-serial-4.2&sort-by-flakiness=10&exclude-non-failed-tests=

Additional info:
https://bugzilla.redhat.com/show_bug.cgi?id=1725259 seems related but is about Azure and not AWS

Comment 1 Dan Mace 2019-08-09 18:28:23 UTC
The summary of this bug refers to an ingress test, but the description refers to [Feature:Machines] (which still appears on the test grid as flaky), and then the additional info cites an ingress bug that matches the bug summary.

Is this bug about:

  [Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]
failing on e2e-aws intermittently

or:

  Managed cluster should should expose cluster services outside the cluster test failing on AWS 

If it's about the Machines test, the component needs to be reassigned to whatever team owns machine sets and the summary fixed. If the bug is about the ingress test, I recommend we close the bug because I see no evidence of flakiness on AWS (please correct me if I'm wrong).

Comment 2 Clayton Coleman 2019-08-13 22:08:55 UTC
Resolving the confusion.

Comment 3 Jan Chaloupka 2019-08-15 10:42:53 UTC
Based on the logs in https://prow.k8s.io/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.2/3322.

1) Both machineSets were scaled down:

STEP: scaling "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1a" from 3 to 2 replicas
Aug 15 03:42:08.361: INFO: >>> kubeConfig: /tmp/admin.kubeconfig
STEP: scaling "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1b" from 2 to 1 replicas
Aug 15 03:42:08.485: INFO: >>> kubeConfig: /tmp/admin.kubeconfig

2) Machine controller deleted both machines and nodes successfully:

I0815 03:42:08.487601       1 controller.go:205] Reconciling machine "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1a-5djht" triggers delete
W0815 03:42:28.610350       1 controller.go:298] drain failed for machine "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1a-5djht": Drain did not complete within 20s
I0815 03:42:51.784629       1 controller.go:302] drain successful for machine "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1a-5djht"
I0815 03:42:52.010220       1 controller.go:225] Deleting node "ip-10-0-131-48.ec2.internal" for machine "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1a-5djht"
I0815 03:42:53.247097       1 controller.go:239] Machine "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1a-5djht" deletion successful

I0815 03:42:28.610464       1 controller.go:205] Reconciling machine "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1b-4khwk" triggers delete
W0815 03:42:48.708860       1 controller.go:298] drain failed for machine "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1b-4khwk": Drain did not complete within 20s
I0815 03:43:08.751529       1 controller.go:302] drain successful for machine "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1b-4khwk"
I0815 03:43:09.116137       1 controller.go:225] Deleting node "ip-10-0-149-176.ec2.internal" for machine "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1b-4khwk"
I0815 03:43:10.222152       1 controller.go:239] Machine "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1b-4khwk" deletion successful

3) The [Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial] test waits at most 60 seconds for the nodes to be deleted

Timestamp recap:

1) Requests to delete the machines were triggered at 03:42:08.485
2) Both nodes were deleted by 03:43:09.116137

As you can see, the time between the delete requests and the nodes being deleted is a bit over 60 seconds, exceeding the test's 60-second timeout by about 1 second.

Conclusion:

I did not check the logs of other flaking instances, but bumping the timeout to 2 minutes should do fine.

Comment 4 Jan Chaloupka 2019-08-15 10:45:00 UTC
It would also help to be a bit more verbose in the logs, e.g.:

diff --git a/test/extended/machines/scale.go b/test/extended/machines/scale.go
index ab25fef..7d3edfa 100644
--- a/test/extended/machines/scale.go
+++ b/test/extended/machines/scale.go
@@ -75,6 +75,7 @@ func getNodesFromMachineSet(c *kubernetes.Clientset, dc dynamic.Interface, machi
                return nil, fmt.Errorf("failed to list worker nodes: %v", err)
        }
 
+       e2e.Logf("Machines found %v, nodes found: %v", machines, allWorkerNodes.Items)
        machineToNodes, match := mapMachineNameToNodeName(machines, allWorkerNodes.Items)
        if !match {
                return nil, fmt.Errorf("not all machines have a node reference: %v", machineToNodes)
@@ -147,14 +148,15 @@ var _ = g.Describe("[Feature:Machines][Serial] Managed cluster should", func() {
                                return false
                        }
                        e2e.Logf("node count : %v, expectedCount %v", len(nodes), expectedScaleOut)
+                       notReady := false
                        for i := range nodes {
                                e2e.Logf("node: %v", nodes[i].Name)
                                if !isNodeReady(*nodes[i]) {
                                        e2e.Logf("Node %q is not ready", nodes[i].Name)
-                                       return false
+                                       notReady = true
                                }
                        }
-                       return len(nodes) == expectedScaleOut
+                       return !notReady || len(nodes) == expectedScaleOut
                }
 
                cfg, err := e2e.LoadConfig()
@@ -178,6 +180,7 @@ var _ = g.Describe("[Feature:Machines][Serial] Managed cluster should", func() {
                o.Expect(err).NotTo(o.HaveOccurred())
 
                initialNumberOfWorkers := len(nodeList.Items)
+               g.By(fmt.Sprintf("initial cluster workers size is %v", initialNumberOfWorkers))
 
                initialReplicasMachineSets := map[string]int{}
 
@@ -189,6 +192,8 @@ var _ = g.Describe("[Feature:Machines][Serial] Managed cluster should", func() {
                        err = scaleMachineSet(machineName(machineSet), expectedScaleOut)
                        o.Expect(err).NotTo(o.HaveOccurred())
                }
+
+               g.By("checking scaled up worker nodes are ready")
                for _, machineSet := range machineSets {
                        expectedScaleOut := initialReplicasMachineSets[machineName(machineSet)] + 1
                        o.Eventually(func() bool {
@@ -209,6 +214,7 @@ var _ = g.Describe("[Feature:Machines][Serial] Managed cluster should", func() {
                                LabelSelector: nodeLabelSelectorWorker,
                        })
                        o.Expect(err).NotTo(o.HaveOccurred())
+                       g.By(fmt.Sprintf("got %v nodes, expecting %v", len(nodeList.Items), initialNumberOfWorkers))
                        return len(nodeList.Items) == initialNumberOfWorkers
                }, 1*time.Minute, 5*time.Second).Should(o.BeTrue())
        })

Comment 6 Jianwei Hou 2019-08-19 06:12:54 UTC
This test is skipped by https://github.com/openshift/origin/blob/master/test/extended/machines/scale.go#L173 because "machine.openshift.io/cluster-api-machine-role" is removed from a machine in its "spec.selector.matchLabels".

Log:

openshift-tests run all --dry-run|grep -E "\[Feature:Machines\]\[Serial\] Managed cluster should grow and decrease when scaling different machineSets simultaneously \[Suite:openshift/conformance/serial\]" | openshift-tests run -f -
started: (0/1/1) "[Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]"

Aug 19 13:52:19.329: INFO: >>> kubeConfig: /Users/houjianwei/kubeconfig
Aug 19 13:52:19.330: INFO: Waiting up to 30m0s for all (but 100) nodes to be schedulable
Aug 19 13:52:21.559: INFO: Waiting up to 10m0s for all pods (need at least 0) in namespace 'kube-system' to be running and ready
Aug 19 13:52:21.868: INFO: 0 / 0 pods in namespace 'kube-system' are running and ready (0 seconds elapsed)
Aug 19 13:52:21.868: INFO: expected 0 pod replicas in namespace 'kube-system', 0 are Running and Ready.
Aug 19 13:52:21.868: INFO: Waiting up to 5m0s for all daemonsets in namespace 'kube-system' to start
Aug 19 13:52:21.970: INFO: e2e test version: v1.14.0+5f82df0389
Aug 19 13:52:22.066: INFO: kube-apiserver version: v1.14.0+efa06d3
[BeforeEach] [Top Level]
  /Users/houjianwei/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/util/test.go:73
[It] grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]
  /Users/houjianwei/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/machines/scale.go:142
Aug 19 13:52:22.071: INFO: >>> kubeConfig: /Users/houjianwei/kubeconfig
Aug 19 13:52:22.073: INFO: >>> kubeConfig: /Users/houjianwei/kubeconfig
STEP: fetching worker machineSets
Aug 19 13:52:22.174: INFO: Labels map[string]interface {}{"machine.openshift.io/cluster-api-cluster":"jhou1-hxfnd", "machine.openshift.io/cluster-api-machineset":"jhou1-hxfnd-worker-ap-northeast-1a"}
Aug 19 13:52:22.174: INFO: Labels map[string]interface {}{"machine.openshift.io/cluster-api-cluster":"jhou1-hxfnd", "machine.openshift.io/cluster-api-machineset":"jhou1-hxfnd-worker-ap-northeast-1c"}
Aug 19 13:52:22.174: INFO: Labels map[string]interface {}{"machine.openshift.io/cluster-api-cluster":"jhou1-hxfnd", "machine.openshift.io/cluster-api-machineset":"jhou1-hxfnd-worker-ap-northeast-1d"}
Aug 19 13:52:22.174: INFO: Expects at least one worker machineset. Found none!!!
Aug 19 13:52:22.176: INFO: Running AfterSuite actions on all nodes
Aug 19 13:52:22.176: INFO: Running AfterSuite actions on node 1
skip [github.com/openshift/origin/test/extended/machines/scale.go:173]: Expects at least one worker machineset. Found none!!!

skipped: (3.8s) 2019-08-19T05:52:22 "[Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]"

0 pass, 1 skip (3.8s)

Comment 7 Jianwei Hou 2019-08-19 06:14:57 UTC
> is removed from a machine in its "spec.selector.matchLabels".

Correction, by machine, I meant to say machineSet.
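PR 23625 ("get machineset role from template labels") addresses this by looking the role label up where it still exists. A hedged sketch of that lookup, assuming the label was dropped from spec.selector.matchLabels but remains under spec.template.metadata.labels (`machineSetRole` is a hypothetical simplification of the unstructured lookups the real test performs):

```go
package main

import "fmt"

const machineRoleLabel = "machine.openshift.io/cluster-api-machine-role"

// machineSetRole returns the role of a machineSet, preferring
// spec.selector.matchLabels but falling back to spec.template.metadata.labels,
// which is where the label lives after it was removed from matchLabels.
func machineSetRole(matchLabels, templateLabels map[string]string) (string, bool) {
	if role, ok := matchLabels[machineRoleLabel]; ok {
		return role, true
	}
	role, ok := templateLabels[machineRoleLabel]
	return role, ok
}

func main() {
	// Labels as seen in the log above: no role in matchLabels.
	matchLabels := map[string]string{
		"machine.openshift.io/cluster-api-machineset": "jhou1-hxfnd-worker-ap-northeast-1a",
	}
	templateLabels := map[string]string{machineRoleLabel: "worker"}
	role, ok := machineSetRole(matchLabels, templateLabels)
	fmt.Println(role, ok) // worker true
}
```

With the fallback in place the test finds the worker machineSets again instead of skipping with "Expects at least one worker machineset. Found none!!!".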

Comment 8 Jianwei Hou 2019-08-28 06:43:21 UTC
Verified in 4.2.0-0.nightly-2019-08-27-105356, the test is fixed.

Comment 10 errata-xmlrpc 2019-10-16 06:34:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

