Bug 1737134 - [Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial] failing on e2e-aws 20%
Summary: [Feature:Machines][Serial] Managed cluster should grow and decrease when scal...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: urgent
Target Milestone: ---
Target Release: 4.2.0
Assignee: Alberto
QA Contact: Jianwei Hou
URL:
Whiteboard: buildcop
Depends On:
Blocks:
 
Reported: 2019-08-02 18:44 UTC by Peter Hunt
Modified: 2019-10-16 06:34 UTC
8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:34:33 UTC
Target Upstream Version:


Links
System ID Private Priority Status Summary Last Updated
Github openshift origin pull 23612 0 'None' closed bug 1737134: [Feature:Machines][Serial] Managed cluster should: increase timeout waiting for nodes to disappear 2020-05-26 05:43:24 UTC
Github openshift origin pull 23625 0 'None' closed Bug 1737134: [Feature:Machines][Serial] Managed cluster should: get machineset role from template labels 2020-05-26 05:43:24 UTC
Red Hat Product Errata RHBA-2019:2922 0 None None None 2019-10-16 06:34:42 UTC

Description Peter Hunt 2019-08-02 18:44:34 UTC
Description of problem:
[Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]
failing on e2e-aws intermittently

Version-Release number of selected component (if applicable):
4.2.0

How reproducible:
in about half of occurrences as of 8/2
https://testgrid.k8s.io/redhat-openshift-release-blocking#redhat-release-openshift-origin-installer-e2e-aws-serial-4.2&sort-by-flakiness=10&exclude-non-failed-tests=

Additional info:
https://bugzilla.redhat.com/show_bug.cgi?id=1725259 seems related but is about Azure and not AWS

Comment 1 Dan Mace 2019-08-09 18:28:23 UTC
The summary of this bug refers to an ingress test, but the description refers to [Feature:Machines] (which still appears on the test grid as flaky), and then the additional info cites an ingress bug that matches the bug summary.

Is this bug about:

  [Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]
failing on e2e-aws intermittently

or:

  Managed cluster should should expose cluster services outside the cluster test failing on AWS 

If it's about the Machines test, the component needs to be reassigned to whatever team owns machine sets and the summary fixed. If the bug is about the ingress test, I recommend we close the bug because I see no evidence of flakiness on AWS (please correct me if I'm wrong).

Comment 2 Clayton Coleman 2019-08-13 22:08:55 UTC
Resolving the confusion.

Comment 3 Jan Chaloupka 2019-08-15 10:42:53 UTC
Based on the logs in https://prow.k8s.io/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.2/3322.

1) Both machineSets were scaled down:

STEP: scaling "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1a" from 3 to 2 replicas
Aug 15 03:42:08.361: INFO: >>> kubeConfig: /tmp/admin.kubeconfig
STEP: scaling "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1b" from 2 to 1 replicas
Aug 15 03:42:08.485: INFO: >>> kubeConfig: /tmp/admin.kubeconfig

2) Machine controller deleted both machines and nodes successfully:

I0815 03:42:08.487601       1 controller.go:205] Reconciling machine "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1a-5djht" triggers delete
W0815 03:42:28.610350       1 controller.go:298] drain failed for machine "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1a-5djht": Drain did not complete within 20s
I0815 03:42:51.784629       1 controller.go:302] drain successful for machine "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1a-5djht"
I0815 03:42:52.010220       1 controller.go:225] Deleting node "ip-10-0-131-48.ec2.internal" for machine "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1a-5djht"
I0815 03:42:53.247097       1 controller.go:239] Machine "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1a-5djht" deletion successful

I0815 03:42:28.610464       1 controller.go:205] Reconciling machine "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1b-4khwk" triggers delete
W0815 03:42:48.708860       1 controller.go:298] drain failed for machine "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1b-4khwk": Drain did not complete within 20s
I0815 03:43:08.751529       1 controller.go:302] drain successful for machine "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1b-4khwk"
I0815 03:43:09.116137       1 controller.go:225] Deleting node "ip-10-0-149-176.ec2.internal" for machine "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1b-4khwk"
I0815 03:43:10.222152       1 controller.go:239] Machine "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1b-4khwk" deletion successful

3) The [Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial] test waits at most 60 seconds for the nodes to be deleted

Timestamp recap:

1) Requests to delete the machines were triggered at 03:42:08.485
2) Both nodes were deleted by 03:43:09.116137

As you can see, the time between the delete requests and the nodes being deleted is a bit over 60 seconds, exceeding the test's 60-second timeout by about 1 second.

Conclusion:

I did not check the logs of other flaking instances, but bumping the timeout to 2 minutes should do fine.

Comment 4 Jan Chaloupka 2019-08-15 10:45:00 UTC
It would also help to be a bit more verbose in the logs, e.g.:

diff --git a/test/extended/machines/scale.go b/test/extended/machines/scale.go
index ab25fef..7d3edfa 100644
--- a/test/extended/machines/scale.go
+++ b/test/extended/machines/scale.go
@@ -75,6 +75,7 @@ func getNodesFromMachineSet(c *kubernetes.Clientset, dc dynamic.Interface, machi
                return nil, fmt.Errorf("failed to list worker nodes: %v", err)
        }
 
+       e2e.Logf("Machines found %v, nodes found: %v", machines, allWorkerNodes.Items)
        machineToNodes, match := mapMachineNameToNodeName(machines, allWorkerNodes.Items)
        if !match {
                return nil, fmt.Errorf("not all machines have a node reference: %v", machineToNodes)
@@ -147,14 +148,15 @@ var _ = g.Describe("[Feature:Machines][Serial] Managed cluster should", func() {
                                return false
                        }
                        e2e.Logf("node count : %v, expectedCount %v", len(nodes), expectedScaleOut)
+                       notReady := false
                        for i := range nodes {
                                e2e.Logf("node: %v", nodes[i].Name)
                                if !isNodeReady(*nodes[i]) {
                                        e2e.Logf("Node %q is not ready", nodes[i].Name)
-                                       return false
+                                       notReady = true
                                }
                        }
-                       return len(nodes) == expectedScaleOut
+                       return !notReady || len(nodes) == expectedScaleOut
                }
 
                cfg, err := e2e.LoadConfig()
@@ -178,6 +180,7 @@ var _ = g.Describe("[Feature:Machines][Serial] Managed cluster should", func() {
                o.Expect(err).NotTo(o.HaveOccurred())
 
                initialNumberOfWorkers := len(nodeList.Items)
+               g.By(fmt.Sprintf("initial cluster workers size is %v", initialNumberOfWorkers))
 
                initialReplicasMachineSets := map[string]int{}
 
@@ -189,6 +192,8 @@ var _ = g.Describe("[Feature:Machines][Serial] Managed cluster should", func() {
                        err = scaleMachineSet(machineName(machineSet), expectedScaleOut)
                        o.Expect(err).NotTo(o.HaveOccurred())
                }
+
+               g.By("checking scaled up worker nodes are ready")
                for _, machineSet := range machineSets {
                        expectedScaleOut := initialReplicasMachineSets[machineName(machineSet)] + 1
                        o.Eventually(func() bool {
@@ -209,6 +214,7 @@ var _ = g.Describe("[Feature:Machines][Serial] Managed cluster should", func() {
                                LabelSelector: nodeLabelSelectorWorker,
                        })
                        o.Expect(err).NotTo(o.HaveOccurred())
+                       g.By(fmt.Sprintf("got %v nodes, expecting %v", len(nodeList.Items), initialNumberOfWorkers))
                        return len(nodeList.Items) == initialNumberOfWorkers
                }, 1*time.Minute, 5*time.Second).Should(o.BeTrue())
        })

Comment 6 Jianwei Hou 2019-08-19 06:12:54 UTC
This test is skipped by https://github.com/openshift/origin/blob/master/test/extended/machines/scale.go#L173 because "machine.openshift.io/cluster-api-machine-role" is removed from a machine in its "spec.selector.matchLabels".

Log:

openshift-tests run all --dry-run|grep -E "\[Feature:Machines\]\[Serial\] Managed cluster should grow and decrease when scaling different machineSets simultaneously \[Suite:openshift/conformance/serial\]" | openshift-tests run -f -
started: (0/1/1) "[Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]"

Aug 19 13:52:19.329: INFO: >>> kubeConfig: /Users/houjianwei/kubeconfig
Aug 19 13:52:19.330: INFO: Waiting up to 30m0s for all (but 100) nodes to be schedulable
Aug 19 13:52:21.559: INFO: Waiting up to 10m0s for all pods (need at least 0) in namespace 'kube-system' to be running and ready
Aug 19 13:52:21.868: INFO: 0 / 0 pods in namespace 'kube-system' are running and ready (0 seconds elapsed)
Aug 19 13:52:21.868: INFO: expected 0 pod replicas in namespace 'kube-system', 0 are Running and Ready.
Aug 19 13:52:21.868: INFO: Waiting up to 5m0s for all daemonsets in namespace 'kube-system' to start
Aug 19 13:52:21.970: INFO: e2e test version: v1.14.0+5f82df0389
Aug 19 13:52:22.066: INFO: kube-apiserver version: v1.14.0+efa06d3
[BeforeEach] [Top Level]
  /Users/houjianwei/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/util/test.go:73
[It] grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]
  /Users/houjianwei/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/machines/scale.go:142
Aug 19 13:52:22.071: INFO: >>> kubeConfig: /Users/houjianwei/kubeconfig
Aug 19 13:52:22.073: INFO: >>> kubeConfig: /Users/houjianwei/kubeconfig
STEP: fetching worker machineSets
Aug 19 13:52:22.174: INFO: Labels map[string]interface {}{"machine.openshift.io/cluster-api-cluster":"jhou1-hxfnd", "machine.openshift.io/cluster-api-machineset":"jhou1-hxfnd-worker-ap-northeast-1a"}
Aug 19 13:52:22.174: INFO: Labels map[string]interface {}{"machine.openshift.io/cluster-api-cluster":"jhou1-hxfnd", "machine.openshift.io/cluster-api-machineset":"jhou1-hxfnd-worker-ap-northeast-1c"}
Aug 19 13:52:22.174: INFO: Labels map[string]interface {}{"machine.openshift.io/cluster-api-cluster":"jhou1-hxfnd", "machine.openshift.io/cluster-api-machineset":"jhou1-hxfnd-worker-ap-northeast-1d"}
Aug 19 13:52:22.174: INFO: Expects at least one worker machineset. Found none!!!
Aug 19 13:52:22.176: INFO: Running AfterSuite actions on all nodes
Aug 19 13:52:22.176: INFO: Running AfterSuite actions on node 1
skip [github.com/openshift/origin/test/extended/machines/scale.go:173]: Expects at least one worker machineset. Found none!!!

skipped: (3.8s) 2019-08-19T05:52:22 "[Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]"

0 pass, 1 skip (3.8s)

Comment 7 Jianwei Hou 2019-08-19 06:14:57 UTC
> is removed from a machine in its "spec.selector.matchLabels".

Correction, by machine, I meant to say machineSet.
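PR 23625 ("get machineset role from template labels") addresses this by looking the role label up where it still exists. A hedged sketch of that lookup, assuming the label was dropped from spec.selector.matchLabels but remains under spec.template.metadata.labels (`machineSetRole` is a hypothetical simplification of the unstructured lookups the real test performs):

```go
package main

import "fmt"

const machineRoleLabel = "machine.openshift.io/cluster-api-machine-role"

// machineSetRole returns the role of a machineSet, preferring
// spec.selector.matchLabels but falling back to spec.template.metadata.labels,
// which is where the label lives after it was removed from matchLabels.
func machineSetRole(matchLabels, templateLabels map[string]string) (string, bool) {
	if role, ok := matchLabels[machineRoleLabel]; ok {
		return role, true
	}
	role, ok := templateLabels[machineRoleLabel]
	return role, ok
}

func main() {
	// Labels as seen in the log above: no role in matchLabels.
	matchLabels := map[string]string{
		"machine.openshift.io/cluster-api-machineset": "jhou1-hxfnd-worker-ap-northeast-1a",
	}
	templateLabels := map[string]string{machineRoleLabel: "worker"}
	role, ok := machineSetRole(matchLabels, templateLabels)
	fmt.Println(role, ok) // worker true
}
```

With the fallback in place the test finds the worker machineSets again instead of skipping with "Expects at least one worker machineset. Found none!!!".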

Comment 8 Jianwei Hou 2019-08-28 06:43:21 UTC
Verified in 4.2.0-0.nightly-2019-08-27-105356, the test is fixed.

Comment 10 errata-xmlrpc 2019-10-16 06:34:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

