Description of problem:

[Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial] is failing intermittently on e2e-aws.

Version-Release number of selected component (if applicable):

4.2.0

How reproducible:

In about half of the occurrences as of 8/2:
https://testgrid.k8s.io/redhat-openshift-release-blocking#redhat-release-openshift-origin-installer-e2e-aws-serial-4.2&sort-by-flakiness=10&exclude-non-failed-tests=

Additional info:

https://bugzilla.redhat.com/show_bug.cgi?id=1725259 seems related, but it is about Azure, not AWS.
The summary of this bug refers to an ingress test, but the description refers to [Feature:Machines] (which still appears on the test grid as flaky), and the additional info cites an ingress bug that matches the bug summary. Is this bug about:

[Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial] failing intermittently on e2e-aws

or:

Managed cluster should expose cluster services outside the cluster failing on AWS

If it's about the Machines test, the component needs to be reassigned to whatever team owns machineSets and the summary needs to be fixed. If it's about the ingress test, I recommend we close this bug, because I see no evidence of flakiness on AWS (please correct me if I'm wrong).
Resolving the confusion.
Based on the logs in https://prow.k8s.io/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.2/3322:

1) Both machineSets were scaled down:

STEP: scaling "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1a" from 3 to 2 replicas
Aug 15 03:42:08.361: INFO: >>> kubeConfig: /tmp/admin.kubeconfig
STEP: scaling "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1b" from 2 to 1 replicas
Aug 15 03:42:08.485: INFO: >>> kubeConfig: /tmp/admin.kubeconfig

2) The machine controller deleted both machines and their nodes successfully:

I0815 03:42:08.487601 1 controller.go:205] Reconciling machine "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1a-5djht" triggers delete
W0815 03:42:28.610350 1 controller.go:298] drain failed for machine "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1a-5djht": Drain did not complete within 20s
I0815 03:42:51.784629 1 controller.go:302] drain successful for machine "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1a-5djht"
I0815 03:42:52.010220 1 controller.go:225] Deleting node "ip-10-0-131-48.ec2.internal" for machine "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1a-5djht"
I0815 03:42:53.247097 1 controller.go:239] Machine "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1a-5djht" deletion successful

I0815 03:42:28.610464 1 controller.go:205] Reconciling machine "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1b-4khwk" triggers delete
W0815 03:42:48.708860 1 controller.go:298] drain failed for machine "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1b-4khwk": Drain did not complete within 20s
I0815 03:43:08.751529 1 controller.go:302] drain successful for machine "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1b-4khwk"
I0815 03:43:09.116137 1 controller.go:225] Deleting node "ip-10-0-149-176.ec2.internal" for machine "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1b-4khwk"
I0815 03:43:10.222152 1 controller.go:239] Machine "ci-op-4mlicwcq-ce7d8-mqsjq-worker-us-east-1b-4khwk" deletion successful

3) The [Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial] test waits at most 60 seconds for the nodes to be deleted.

Timestamp recap:
1) The requests to delete the machines were triggered at 03:42:08.485.
2) The second node was only deleted at 03:43:09.116137.

So the time between the deletion requests and both nodes being gone is slightly over 60 seconds (by ~1 second), which is just enough to run past the test's timeout.

Conclusion: I did not check the logs of the other flaking instances, but bumping the timeout to 2 minutes should do fine.
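For reference, a minimal sketch of what the timeout bump would look like, reusing the Eventually block quoted in the patch in the next comment (c, nodeLabelSelectorWorker, metav1, and initialNumberOfWorkers come from the surrounding test; this is a sketch of the proposed change, not the merged fix):

// Sketch: widen the node-deletion wait from 1 minute to 2 minutes while
// keeping the 5-second poll interval; the polling body is unchanged.
o.Eventually(func() bool {
	nodeList, err := c.CoreV1().Nodes().List(metav1.ListOptions{
		LabelSelector: nodeLabelSelectorWorker,
	})
	o.Expect(err).NotTo(o.HaveOccurred())
	return len(nodeList.Items) == initialNumberOfWorkers
}, 2*time.Minute, 5*time.Second).Should(o.BeTrue())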
It would also help to be a bit more verbose in the logs, e.g.:

diff --git a/test/extended/machines/scale.go b/test/extended/machines/scale.go
index ab25fef..7d3edfa 100644
--- a/test/extended/machines/scale.go
+++ b/test/extended/machines/scale.go
@@ -75,6 +75,7 @@ func getNodesFromMachineSet(c *kubernetes.Clientset, dc dynamic.Interface, machi
 		return nil, fmt.Errorf("failed to list worker nodes: %v", err)
 	}
 
+	e2e.Logf("Machines found %v, nodes found: %v", machines, allWorkerNodes.Items)
 	machineToNodes, match := mapMachineNameToNodeName(machines, allWorkerNodes.Items)
 	if !match {
 		return nil, fmt.Errorf("not all machines have a node reference: %v", machineToNodes)
@@ -147,14 +148,15 @@ var _ = g.Describe("[Feature:Machines][Serial] Managed cluster should", func() {
 			return false
 		}
 		e2e.Logf("node count : %v, expectedCount %v", len(nodes), expectedScaleOut)
+		notReady := false
 		for i := range nodes {
 			e2e.Logf("node: %v", nodes[i].Name)
 			if !isNodeReady(*nodes[i]) {
 				e2e.Logf("Node %q is not ready", nodes[i].Name)
-				return false
+				notReady = true
 			}
 		}
-		return len(nodes) == expectedScaleOut
+		return !notReady && len(nodes) == expectedScaleOut
 	}
 
 	cfg, err := e2e.LoadConfig()
@@ -178,6 +180,7 @@ var _ = g.Describe("[Feature:Machines][Serial] Managed cluster should", func() {
 		o.Expect(err).NotTo(o.HaveOccurred())
 
 		initialNumberOfWorkers := len(nodeList.Items)
+		g.By(fmt.Sprintf("initial cluster workers size is %v", initialNumberOfWorkers))
 
 		initialReplicasMachineSets := map[string]int{}
 
@@ -189,6 +192,8 @@ var _ = g.Describe("[Feature:Machines][Serial] Managed cluster should", func() {
 			err = scaleMachineSet(machineName(machineSet), expectedScaleOut)
 			o.Expect(err).NotTo(o.HaveOccurred())
 		}
+
+		g.By("checking scaled up worker nodes are ready")
 		for _, machineSet := range machineSets {
 			expectedScaleOut := initialReplicasMachineSets[machineName(machineSet)] + 1
 			o.Eventually(func() bool {
@@ -209,6 +214,7 @@ var _ = g.Describe("[Feature:Machines][Serial] Managed cluster should", func() {
 				LabelSelector: nodeLabelSelectorWorker,
 			})
 			o.Expect(err).NotTo(o.HaveOccurred())
+			g.By(fmt.Sprintf("got %v nodes, expecting %v", len(nodeList.Items), initialNumberOfWorkers))
 			return len(nodeList.Items) == initialNumberOfWorkers
 		}, 1*time.Minute, 5*time.Second).Should(o.BeTrue())
 	})
This test is skipped by https://github.com/openshift/origin/blob/master/test/extended/machines/scale.go#L173 because "machine.openshift.io/cluster-api-machine-role" is removed from a machine in its "spec.selector.matchLabels".

Log:

openshift-tests run all --dry-run | grep -E "\[Feature:Machines\]\[Serial\] Managed cluster should grow and decrease when scaling different machineSets simultaneously \[Suite:openshift/conformance/serial\]" | openshift-tests run -f -

started: (0/1/1) "[Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]"
Aug 19 13:52:19.329: INFO: >>> kubeConfig: /Users/houjianwei/kubeconfig
Aug 19 13:52:19.330: INFO: Waiting up to 30m0s for all (but 100) nodes to be schedulable
Aug 19 13:52:21.559: INFO: Waiting up to 10m0s for all pods (need at least 0) in namespace 'kube-system' to be running and ready
Aug 19 13:52:21.868: INFO: 0 / 0 pods in namespace 'kube-system' are running and ready (0 seconds elapsed)
Aug 19 13:52:21.868: INFO: expected 0 pod replicas in namespace 'kube-system', 0 are Running and Ready.
Aug 19 13:52:21.868: INFO: Waiting up to 5m0s for all daemonsets in namespace 'kube-system' to start
Aug 19 13:52:21.970: INFO: e2e test version: v1.14.0+5f82df0389
Aug 19 13:52:22.066: INFO: kube-apiserver version: v1.14.0+efa06d3
[BeforeEach] [Top Level]
  /Users/houjianwei/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/util/test.go:73
[It] grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]
  /Users/houjianwei/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/machines/scale.go:142
Aug 19 13:52:22.071: INFO: >>> kubeConfig: /Users/houjianwei/kubeconfig
Aug 19 13:52:22.073: INFO: >>> kubeConfig: /Users/houjianwei/kubeconfig
STEP: fetching worker machineSets
Aug 19 13:52:22.174: INFO: Labels map[string]interface {}{"machine.openshift.io/cluster-api-cluster":"jhou1-hxfnd", "machine.openshift.io/cluster-api-machineset":"jhou1-hxfnd-worker-ap-northeast-1a"}
Aug 19 13:52:22.174: INFO: Labels map[string]interface {}{"machine.openshift.io/cluster-api-cluster":"jhou1-hxfnd", "machine.openshift.io/cluster-api-machineset":"jhou1-hxfnd-worker-ap-northeast-1c"}
Aug 19 13:52:22.174: INFO: Labels map[string]interface {}{"machine.openshift.io/cluster-api-cluster":"jhou1-hxfnd", "machine.openshift.io/cluster-api-machineset":"jhou1-hxfnd-worker-ap-northeast-1d"}
Aug 19 13:52:22.174: INFO: Expects at least one worker machineset. Found none!!!
Aug 19 13:52:22.176: INFO: Running AfterSuite actions on all nodes
Aug 19 13:52:22.176: INFO: Running AfterSuite actions on node 1
skip [github.com/openshift/origin/test/extended/machines/scale.go:173]: Expects at least one worker machineset. Found none!!!

skipped: (3.8s) 2019-08-19T05:52:22 "[Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]"

0 pass, 1 skip (3.8s)
> is removed from a machine in its "spec.selector.matchLabels".

Correction: by "machine", I meant to say machineSet.
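To make the skip mechanism concrete, here is a minimal sketch (not the actual origin code) of the filtering the test performs on the machineSet selectors shown in the log above: only machineSets whose spec.selector.matchLabels carry the worker role label are kept, so with that label removed the list comes back empty and the test skips. The helper name and the "worker" value are illustrative assumptions.

package machines

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// filterWorkerMachineSets (hypothetical name) keeps only machineSets whose
// spec.selector.matchLabels contain the
// "machine.openshift.io/cluster-api-machine-role" key. In the log above that
// key is absent from every machineSet's labels, so nothing survives the
// filter and the test skips with "Expects at least one worker machineset.
// Found none!!!".
func filterWorkerMachineSets(machineSets []unstructured.Unstructured) []unstructured.Unstructured {
	var workers []unstructured.Unstructured
	for _, ms := range machineSets {
		labels, found, err := unstructured.NestedStringMap(ms.Object, "spec", "selector", "matchLabels")
		if err != nil || !found {
			continue
		}
		if labels["machine.openshift.io/cluster-api-machine-role"] == "worker" {
			workers = append(workers, ms)
		}
	}
	return workers
}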
Verified in 4.2.0-0.nightly-2019-08-27-105356; the test is fixed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922